aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorChristian Brauner <christian.brauner@ubuntu.com>2018-04-29 06:44:12 -0400
committerDavid S. Miller <davem@davemloft.net>2018-05-01 10:22:41 -0400
commita3498436b3a0f8ec289e6847e1de40b4123e1639 (patch)
tree31e520ec03624c43a95048cace9e014c701fbdcb
parent26045a7b14bc7a5455e411d820110f66557d6589 (diff)
netns: restrict uevents
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") enabled sending hotplug events into all network namespaces back in 2010. Over time the set of uevents that get sent into all network namespaces has shrunk. We have now reached the point where hotplug events for all devices that carry a namespace tag are filtered according to that namespace. Specifically, they are filtered whenever the namespace tag of the kobject does not match the namespace tag of the netlink socket. Currently, only network devices carry namespace tags (i.e. network namespace tags). Hence, uevents for network devices only show up in the network namespace such devices are created in or moved to. However, any uevent for a kobject that does not have a namespace tag associated with it will not be filtered and we will broadcast it into all network namespaces. This behavior stopped making sense when user namespaces were introduced. This patch simplifies and fixes a couple of things: - Split codepath for sending uevents by kobject namespace tags: 1. Untagged kobjects - uevent_net_broadcast_untagged(): Untagged kobjects will be broadcast into all uevent sockets recorded in uevent_sock_list, i.e. into all network namespaces owned by the initial user namespace. 2. Tagged kobjects - uevent_net_broadcast_tagged(): Tagged kobjects will only be broadcast into the network namespace they were tagged with. Handling of tagged kobjects in 2. does not cause any semantic changes. This is just splitting out the filtering logic that was handled by kobj_bcast_filter() before. Handling of untagged kobjects in 1. will cause a semantic change. The reasons why this is needed and ok have been discussed in [1]. Here is a short summary: - Userspace ignores uevents from network namespaces that are not owned by the initial user namespace: Uevents are filtered by userspace in a user namespace because the received uid != 0. 
Instead the uid associated with the event will be 65534 == "nobody" because the global root uid is not mapped. This means we can safely and without introducing regressions modify the kernel to not send uevents into all network namespaces whose owning user namespace is not the initial user namespace because we know that userspace will ignore the message because of the uid anyway. I have a) verified that this is true for every udev implementation out there b) that this behavior has been present in all udev implementations from the very beginning. - Thundering herd: Broadcasting uevents into all network namespaces introduces significant overhead. All processes that listen to uevents running in non-initial user namespaces will end up responding to uevents that will be meaningless to them. Mainly, because non-initial user namespaces cannot easily manage devices unless they have a privileged host-process helping them out. This means that there will be a thundering herd of activity when there shouldn't be any. - Removing needless overhead/Increasing performance: Currently, the uevent socket for each network namespace is added to the global variable uevent_sock_list. The list itself needs to be protected by a mutex. So every time a uevent is generated the mutex is taken on the list. The mutex is held *from the creation of the uevent (memory allocation, string creation etc.) until all uevent sockets have been handled*. This is aggravated by the fact that for each uevent socket that has listeners the mc_list must be walked as well which means we're talking O(n^2) here. Given that a standard Linux workload usually has quite a lot of network namespaces and - in the face of containers - a lot of user namespaces this quickly becomes a performance problem (see "Thundering herd" above). By just recording uevent sockets of network namespaces that are owned by the initial user namespace we significantly increase performance in this codepath. 
- Injecting uevents: There's a valid argument that containers might be interested in receiving device events especially if they are delegated to them by a privileged userspace process. One prime example is SR-IOV enabled devices that are explicitly designed to be handed off to other users such as VMs or containers. This use-case can now be correctly handled since commit 692ec06d7c92 ("netns: send uevent messages"). This commit introduced the ability to send uevents from userspace. As such we can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of the network namespace of the netlink socket) userspace process make a decision about which uevents should be sent. This removes the need to blindly broadcast uevents into all user namespaces and provides a performant and safe solution to this problem. - Filtering logic: This patch filters by *owning user namespace of the network namespace a given task resides in* and not by user namespace of the task per se. This means if the user namespace of a given task is unshared but the network namespace is kept and is owned by the initial user namespace a listener that is opening the uevent socket in that network namespace can still listen to uevents. - Fix permission for tagged kobjects: Network devices that are created or moved into a network namespace that is owned by a non-initial user namespace currently are sent with INVALID_{G,U}ID in their credentials. This means that all current udev implementations in userspace will ignore the uevent they receive for them. This has led to weird bugs whereby new devices showing up in such network namespaces were not recognized and did not get IPs assigned etc. This patch adjusts the permission to the appropriate {g,u}id in the respective user namespace. This way udevd is able to correctly handle such devices. 
- Simplify filtering logic: do_one_broadcast() already ensures that only listeners in mc_list receive uevents that have the same network namespace as the uevent socket itself. So the filtering logic in kobj_bcast_filter is not needed (see [3]). This patch therefore removes kobj_bcast_filter() and replaces netlink_broadcast_filtered() with the simpler netlink_broadcast() everywhere. [1]: https://lkml.org/lkml/2018/4/4/739 [2]: https://lkml.org/lkml/2018/4/26/767 [3]: https://lkml.org/lkml/2018/4/26/738 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-rw-r--r--lib/kobject_uevent.c137
1 files changed, 95 insertions, 42 deletions
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 649bf60a9440..63d0816ab23b 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -232,30 +232,6 @@ out:
232 return r; 232 return r;
233} 233}
234 234
235#ifdef CONFIG_NET
236static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
237{
238 struct kobject *kobj = data, *ksobj;
239 const struct kobj_ns_type_operations *ops;
240
241 ops = kobj_ns_ops(kobj);
242 if (!ops && kobj->kset) {
243 ksobj = &kobj->kset->kobj;
244 if (ksobj->parent != NULL)
245 ops = kobj_ns_ops(ksobj->parent);
246 }
247
248 if (ops && ops->netlink_ns && kobj->ktype->namespace) {
249 const void *sock_ns, *ns;
250 ns = kobj->ktype->namespace(kobj);
251 sock_ns = ops->netlink_ns(dsk);
252 return sock_ns != ns;
253 }
254
255 return 0;
256}
257#endif
258
259#ifdef CONFIG_UEVENT_HELPER 235#ifdef CONFIG_UEVENT_HELPER
260static int kobj_usermode_filter(struct kobject *kobj) 236static int kobj_usermode_filter(struct kobject *kobj)
261{ 237{
@@ -327,17 +303,14 @@ static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
327 303
328 return skb; 304 return skb;
329} 305}
330#endif
331 306
332static int kobject_uevent_net_broadcast(struct kobject *kobj, 307static int uevent_net_broadcast_untagged(struct kobj_uevent_env *env,
333 struct kobj_uevent_env *env, 308 const char *action_string,
334 const char *action_string, 309 const char *devpath)
335 const char *devpath)
336{ 310{
337 int retval = 0;
338#if defined(CONFIG_NET)
339 struct sk_buff *skb = NULL; 311 struct sk_buff *skb = NULL;
340 struct uevent_sock *ue_sk; 312 struct uevent_sock *ue_sk;
313 int retval = 0;
341 314
342 /* send netlink message */ 315 /* send netlink message */
343 list_for_each_entry(ue_sk, &uevent_sock_list, list) { 316 list_for_each_entry(ue_sk, &uevent_sock_list, list) {
@@ -353,19 +326,93 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
353 continue; 326 continue;
354 } 327 }
355 328
356 retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb), 329 retval = netlink_broadcast(uevent_sock, skb_get(skb), 0, 1,
357 0, 1, GFP_KERNEL, 330 GFP_KERNEL);
358 kobj_bcast_filter,
359 kobj);
360 /* ENOBUFS should be handled in userspace */ 331 /* ENOBUFS should be handled in userspace */
361 if (retval == -ENOBUFS || retval == -ESRCH) 332 if (retval == -ENOBUFS || retval == -ESRCH)
362 retval = 0; 333 retval = 0;
363 } 334 }
364 consume_skb(skb); 335 consume_skb(skb);
365#endif 336
366 return retval; 337 return retval;
367} 338}
368 339
340static int uevent_net_broadcast_tagged(struct sock *usk,
341 struct kobj_uevent_env *env,
342 const char *action_string,
343 const char *devpath)
344{
345 struct user_namespace *owning_user_ns = sock_net(usk)->user_ns;
346 struct sk_buff *skb = NULL;
347 int ret = 0;
348
349 skb = alloc_uevent_skb(env, action_string, devpath);
350 if (!skb)
351 return -ENOMEM;
352
353 /* fix credentials */
354 if (owning_user_ns != &init_user_ns) {
355 struct netlink_skb_parms *parms = &NETLINK_CB(skb);
356 kuid_t root_uid;
357 kgid_t root_gid;
358
359 /* fix uid */
360 root_uid = make_kuid(owning_user_ns, 0);
361 if (uid_valid(root_uid))
362 parms->creds.uid = root_uid;
363
364 /* fix gid */
365 root_gid = make_kgid(owning_user_ns, 0);
366 if (gid_valid(root_gid))
367 parms->creds.gid = root_gid;
368 }
369
370 ret = netlink_broadcast(usk, skb, 0, 1, GFP_KERNEL);
371 /* ENOBUFS should be handled in userspace */
372 if (ret == -ENOBUFS || ret == -ESRCH)
373 ret = 0;
374
375 return ret;
376}
377#endif
378
379static int kobject_uevent_net_broadcast(struct kobject *kobj,
380 struct kobj_uevent_env *env,
381 const char *action_string,
382 const char *devpath)
383{
384 int ret = 0;
385
386#ifdef CONFIG_NET
387 const struct kobj_ns_type_operations *ops;
388 const struct net *net = NULL;
389
390 ops = kobj_ns_ops(kobj);
391 if (!ops && kobj->kset) {
392 struct kobject *ksobj = &kobj->kset->kobj;
393 if (ksobj->parent != NULL)
394 ops = kobj_ns_ops(ksobj->parent);
395 }
396
397 /* kobjects currently only carry network namespace tags and they
398 * are the only tag relevant here since we want to decide which
399 * network namespaces to broadcast the uevent into.
400 */
401 if (ops && ops->netlink_ns && kobj->ktype->namespace)
402 if (ops->type == KOBJ_NS_TYPE_NET)
403 net = kobj->ktype->namespace(kobj);
404
405 if (!net)
406 ret = uevent_net_broadcast_untagged(env, action_string,
407 devpath);
408 else
409 ret = uevent_net_broadcast_tagged(net->uevent_sock->sk, env,
410 action_string, devpath);
411#endif
412
413 return ret;
414}
415
369static void zap_modalias_env(struct kobj_uevent_env *env) 416static void zap_modalias_env(struct kobj_uevent_env *env)
370{ 417{
371 static const char modalias_prefix[] = "MODALIAS="; 418 static const char modalias_prefix[] = "MODALIAS=";
@@ -724,9 +771,13 @@ static int uevent_net_init(struct net *net)
724 771
725 net->uevent_sock = ue_sk; 772 net->uevent_sock = ue_sk;
726 773
727 mutex_lock(&uevent_sock_mutex); 774 /* Restrict uevents to initial user namespace. */
728 list_add_tail(&ue_sk->list, &uevent_sock_list); 775 if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
729 mutex_unlock(&uevent_sock_mutex); 776 mutex_lock(&uevent_sock_mutex);
777 list_add_tail(&ue_sk->list, &uevent_sock_list);
778 mutex_unlock(&uevent_sock_mutex);
779 }
780
730 return 0; 781 return 0;
731} 782}
732 783
@@ -734,9 +785,11 @@ static void uevent_net_exit(struct net *net)
734{ 785{
735 struct uevent_sock *ue_sk = net->uevent_sock; 786 struct uevent_sock *ue_sk = net->uevent_sock;
736 787
737 mutex_lock(&uevent_sock_mutex); 788 if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
738 list_del(&ue_sk->list); 789 mutex_lock(&uevent_sock_mutex);
739 mutex_unlock(&uevent_sock_mutex); 790 list_del(&ue_sk->list);
791 mutex_unlock(&uevent_sock_mutex);
792 }
740 793
741 netlink_kernel_release(ue_sk->sk); 794 netlink_kernel_release(ue_sk->sk);
742 kfree(ue_sk); 795 kfree(ue_sk);