author	Martin KaFai Lau <kafai@fb.com>	2018-08-08 04:01:24 -0400
committer	Daniel Borkmann <daniel@iogearbox.net>	2018-08-10 19:58:46 -0400
commit	5dc4c4b7d4e8115e7cde96a030f98cb3ab2e458c (patch)
tree	3ae127970e7e14a70948c989f6a702695767a6a6
parent	736b46027eb4a4c602d3b8b93d2f48c9facbd915 (diff)
bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the userspace to be capable of directly setting up a bpf map which can then be consumed by the bpf prog to make a decision. In this case, to decide which SO_REUSEPORT sk should serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control and visibility on where a SO_REUSEPORT sk should be located in a bpf map. A later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that the bpf prog can directly select a sk from the bpf map. That will raise the programmability of the bpf prog attached to a reuseport group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g. through the "data" pointer introduced in the later patch) to learn the application level's connection information and then decide which sk to pick from a bpf map. The userspace can tightly couple the sk's location in a bpf map with the application logic in generating the UDP payload's connection information. This connection info contract/API stays within the userspace.

Also, when used with map-in-map, the userspace can switch the old-server-process's inner map to a new-server-process's inner map in one call, "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)". The bpf prog will then direct incoming requests to the new process instead of the old process. The old process can finish draining the pending requests (e.g. by "accept()") before closing the old fds. [Note that deleting an fd from a bpf map does not necessarily mean the fd is closed.]

During map_update_elem(), only a SO_REUSEPORT sk (i.e. one which has already been added to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that has been "bind()"ed for UDP or "bind()+listen()"ed for TCP. These conditions are ensured in "reuseport_array_update_check()".
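As a concrete illustration of the userspace side described above, the sketch below creates a REUSEPORT_SOCKARRAY, binds a SO_REUSEPORT UDP socket, and stores that socket's fd at index 0 through the raw bpf(2) syscall. This is a minimal, hypothetical example and not part of the patch: the port, index, and map size are arbitrary, error handling is omitted, the uapi headers must already contain BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, and creating this map type requires CAP_SYS_ADMIN.

/* Hypothetical userspace sketch (not from this patch): place a bound
 * SO_REUSEPORT UDP socket into a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
 */
#include <arpa/inet.h>
#include <linux/bpf.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_port = htons(8000) /* arbitrary */ };
	int one = 1, map_fd, sk_fd;
	union bpf_attr attr;
	__u32 index = 0;
	__u64 value;

	/* The sk must already be a "bind()"ed SO_REUSEPORT sk, otherwise
	 * reuseport_array_update_check() rejects the update.
	 */
	sk_fd = socket(AF_INET, SOCK_DGRAM, 0);
	setsockopt(sk_fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(sk_fd, (struct sockaddr *)&addr, sizeof(addr));

	/* value_size=8 so that the syscall-side lookup (sock cookie) also works. */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
	attr.key_size = sizeof(__u32);
	attr.value_size = sizeof(__u64);
	attr.max_entries = 16;
	map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));

	/* Store the socket at index 0; the kernel resolves the fd to the sk. */
	value = sk_fd;
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&index;
	attr.value = (__u64)(unsigned long)&value;
	attr.flags = BPF_ANY;
	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) ? 1 : 0;
}

A BPF_PROG_TYPE_SK_REUSEPORT program from the later patch would then be the consumer that picks an index out of this array.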
A SO_REUSEPORT sk can only be added once to a map (i.e. the same sk cannot be added twice even to the same map). SO_REUSEPORT already allows another sk to be created for the same IP:PORT. There is no need to re-create a similar usage on the BPF side.

When a SO_REUSEPORT sk is deleted from the "reuse->socks[]" (e.g. on "close()"), it will notify the bpf map to remove it from the map also. This is done through "bpf_sk_reuseport_detach()", which will only be called if >=1 of the "reuse->socks[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in sync with the "reuse->socks[]". Hence, the same "reuseport_lock" used by "reuse->socks[]" has to be used here also. Care has been taken to ensure the lock is only acquired when the adding sk passes some strict tests, and freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall side. It will return a sock_gen_cookie(). The sock_gen_cookie() is on-demand (i.e. a sk's cookie is not generated until the very first map_lookup_elem()).

The lookup cookie is 64 bits, but it goes against the logical userspace expectation of a 32-bit sizeof(fd) (as other fd based bpf maps also do). It may catch users by surprise if we enforce value_size=8 while userspace still passes a 32-bit fd during update. Supporting different value_size between lookup and update seems unintuitive as well. We also need to consider what happens if other existing fd based maps want to return a 64-bit value from the syscall's lookup in the future.

Hence, reuseport_array supports both value_size 4 and 8, assuming users will usually use value_size=4. The syscall's lookup will return ENOSPC on value_size=4. It will only return the 64-bit value from sock_gen_cookie() when the user consciously chooses value_size=8 (as a signal that lookup is desired), which then requires a 64-bit value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
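To make the value_size semantics above concrete, here is a hypothetical userspace helper (again not part of the patch; the function name and parameters are made up) that reads the sock cookie back through the syscall-side lookup. It only works on a map created with value_size=8; with value_size=4 the same BPF_MAP_LOOKUP_ELEM call fails with -ENOSPC.

/* Hypothetical helper (not from this patch): read back the sock cookie of
 * the sk stored at "index" in a REUSEPORT_SOCKARRAY created with
 * value_size = sizeof(__u64).
 */
#include <errno.h>
#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

int print_sock_cookie(int map_fd, __u32 index)
{
	union bpf_attr attr;
	__u64 cookie = 0;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&index;
	attr.value = (__u64)(unsigned long)&cookie;

	if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)))
		return -errno;	/* ENOSPC if the map uses value_size=4 */

	/* The cookie is generated on demand, i.e. on the first lookup of this sk. */
	printf("sock cookie at index %u: %llu\n", index, (unsigned long long)cookie);
	return 0;
}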
-rw-r--r--	include/linux/bpf.h	28
-rw-r--r--	include/linux/bpf_types.h	3
-rw-r--r--	include/uapi/linux/bpf.h	1
-rw-r--r--	kernel/bpf/Makefile	3
-rw-r--r--	kernel/bpf/arraymap.c	2
-rw-r--r--	kernel/bpf/reuseport_array.c	363
-rw-r--r--	kernel/bpf/syscall.c	6
-rw-r--r--	net/core/sock_reuseport.c	8
8 files changed, 413 insertions, 1 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cd8790d2c6ed..db11662faea6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -524,6 +524,7 @@ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
 }
 
 struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type);
+int array_map_alloc_check(union bpf_attr *attr);
 
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
@@ -769,6 +770,33 @@ static inline void __xsk_map_flush(struct bpf_map *map)
 }
 #endif
 
+#if defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL)
+void bpf_sk_reuseport_detach(struct sock *sk);
+int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 map_flags);
+#else
+static inline void bpf_sk_reuseport_detach(struct sock *sk)
+{
+}
+
+#ifdef CONFIG_BPF_SYSCALL
+static inline int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map,
+						     void *key, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
+						     void *key, void *value,
+						     u64 map_flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_BPF_SYSCALL */
+#endif /* defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL) */
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index add08be53b6f..14fd6c02d258 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -60,4 +60,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)
 BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 #endif
+#ifdef CONFIG_INET
+BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
+#endif
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dd5758dc35d3..40f584bc7da0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -126,6 +126,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_XSKMAP,
 	BPF_MAP_TYPE_SOCKHASH,
 	BPF_MAP_TYPE_CGROUP_STORAGE,
+	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e8906cbad81f..0488b8258321 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -23,3 +23,6 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
+ifeq ($(CONFIG_INET),y)
+obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
+endif
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2aa55d030c77..f6ca3e712831 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -54,7 +54,7 @@ static int bpf_array_alloc_percpu(struct bpf_array *array)
 }
 
 /* Called from syscall */
-static int array_map_alloc_check(union bpf_attr *attr)
+int array_map_alloc_check(union bpf_attr *attr)
 {
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int numa_node = bpf_map_attr_numa_node(attr);
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
new file mode 100644
index 000000000000..18e225de80ff
--- /dev/null
+++ b/kernel/bpf/reuseport_array.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Facebook
+ */
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/sock_diag.h>
+#include <net/sock_reuseport.h>
+
+struct reuseport_array {
+	struct bpf_map map;
+	struct sock __rcu *ptrs[];
+};
+
+static struct reuseport_array *reuseport_array(struct bpf_map *map)
+{
+	return (struct reuseport_array *)map;
+}
+
+/* The caller must hold the reuseport_lock */
+void bpf_sk_reuseport_detach(struct sock *sk)
+{
+	struct sock __rcu **socks;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	socks = sk->sk_user_data;
+	if (socks) {
+		WRITE_ONCE(sk->sk_user_data, NULL);
+		/*
+		 * Do not move this NULL assignment outside of
+		 * sk->sk_callback_lock because there is
+		 * a race with reuseport_array_free()
+		 * which does not hold the reuseport_lock.
+		 */
+		RCU_INIT_POINTER(*socks, NULL);
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static int reuseport_array_alloc_check(union bpf_attr *attr)
+{
+	if (attr->value_size != sizeof(u32) &&
+	    attr->value_size != sizeof(u64))
+		return -EINVAL;
+
+	return array_map_alloc_check(attr);
+}
+
+static void *reuseport_array_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct reuseport_array *array = reuseport_array(map);
+	u32 index = *(u32 *)key;
+
+	if (unlikely(index >= array->map.max_entries))
+		return NULL;
+
+	return rcu_dereference(array->ptrs[index]);
+}
+
+/* Called from syscall only */
+static int reuseport_array_delete_elem(struct bpf_map *map, void *key)
+{
+	struct reuseport_array *array = reuseport_array(map);
+	u32 index = *(u32 *)key;
+	struct sock *sk;
+	int err;
+
+	if (index >= map->max_entries)
+		return -E2BIG;
+
+	if (!rcu_access_pointer(array->ptrs[index]))
+		return -ENOENT;
+
+	spin_lock_bh(&reuseport_lock);
+
+	sk = rcu_dereference_protected(array->ptrs[index],
+				       lockdep_is_held(&reuseport_lock));
+	if (sk) {
+		write_lock_bh(&sk->sk_callback_lock);
+		WRITE_ONCE(sk->sk_user_data, NULL);
+		RCU_INIT_POINTER(array->ptrs[index], NULL);
+		write_unlock_bh(&sk->sk_callback_lock);
+		err = 0;
+	} else {
+		err = -ENOENT;
+	}
+
+	spin_unlock_bh(&reuseport_lock);
+
+	return err;
+}
+
+static void reuseport_array_free(struct bpf_map *map)
+{
+	struct reuseport_array *array = reuseport_array(map);
+	struct sock *sk;
+	u32 i;
+
+	synchronize_rcu();
+
+	/*
+	 * ops->map_*_elem() will not be able to access this
+	 * array now. Hence, this function only races with
+	 * bpf_sk_reuseport_detach() which was triggerred by
+	 * close() or disconnect().
+	 *
+	 * This function and bpf_sk_reuseport_detach() are
+	 * both removing sk from "array". Who removes it
+	 * first does not matter.
+	 *
+	 * The only concern here is bpf_sk_reuseport_detach()
+	 * may access "array" which is being freed here.
+	 * bpf_sk_reuseport_detach() access this "array"
+	 * through sk->sk_user_data _and_ with sk->sk_callback_lock
+	 * held which is enough because this "array" is not freed
+	 * until all sk->sk_user_data has stopped referencing this "array".
+	 *
+	 * Hence, due to the above, taking "reuseport_lock" is not
+	 * needed here.
+	 */
+
+	/*
+	 * Since reuseport_lock is not taken, sk is accessed under
+	 * rcu_read_lock()
+	 */
+	rcu_read_lock();
+	for (i = 0; i < map->max_entries; i++) {
+		sk = rcu_dereference(array->ptrs[i]);
+		if (sk) {
+			write_lock_bh(&sk->sk_callback_lock);
+			/*
+			 * No need for WRITE_ONCE(). At this point,
+			 * no one is reading it without taking the
+			 * sk->sk_callback_lock.
+			 */
+			sk->sk_user_data = NULL;
+			write_unlock_bh(&sk->sk_callback_lock);
+			RCU_INIT_POINTER(array->ptrs[i], NULL);
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Once reaching here, all sk->sk_user_data is not
+	 * referenceing this "array". "array" can be freed now.
+	 */
+	bpf_map_area_free(array);
+}
+
+static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
+{
+	int err, numa_node = bpf_map_attr_numa_node(attr);
+	struct reuseport_array *array;
+	u64 cost, array_size;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	array_size = sizeof(*array);
+	array_size += (u64)attr->max_entries * sizeof(struct sock *);
+
+	/* make sure there is no u32 overflow later in round_up() */
+	cost = array_size;
+	if (cost >= U32_MAX - PAGE_SIZE)
+		return ERR_PTR(-ENOMEM);
+	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	err = bpf_map_precharge_memlock(cost);
+	if (err)
+		return ERR_PTR(err);
+
+	/* allocate all map elements and zero-initialize them */
+	array = bpf_map_area_alloc(array_size, numa_node);
+	if (!array)
+		return ERR_PTR(-ENOMEM);
+
+	/* copy mandatory map attributes */
+	bpf_map_init_from_attr(&array->map, attr);
+	array->map.pages = cost;
+
+	return &array->map;
+}
+
+int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct sock *sk;
+	int err;
+
+	if (map->value_size != sizeof(u64))
+		return -ENOSPC;
+
+	rcu_read_lock();
+	sk = reuseport_array_lookup_elem(map, key);
+	if (sk) {
+		*(u64 *)value = sock_gen_cookie(sk);
+		err = 0;
+	} else {
+		err = -ENOENT;
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+
+static int
+reuseport_array_update_check(const struct reuseport_array *array,
+			     const struct sock *nsk,
+			     const struct sock *osk,
+			     const struct sock_reuseport *nsk_reuse,
+			     u32 map_flags)
+{
+	if (osk && map_flags == BPF_NOEXIST)
+		return -EEXIST;
+
+	if (!osk && map_flags == BPF_EXIST)
+		return -ENOENT;
+
+	if (nsk->sk_protocol != IPPROTO_UDP && nsk->sk_protocol != IPPROTO_TCP)
+		return -ENOTSUPP;
+
+	if (nsk->sk_family != AF_INET && nsk->sk_family != AF_INET6)
+		return -ENOTSUPP;
+
+	if (nsk->sk_type != SOCK_STREAM && nsk->sk_type != SOCK_DGRAM)
+		return -ENOTSUPP;
+
+	/*
+	 * sk must be hashed (i.e. listening in the TCP case or binded
+	 * in the UDP case) and
+	 * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL).
+	 *
+	 * Also, sk will be used in bpf helper that is protected by
+	 * rcu_read_lock().
+	 */
+	if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse)
+		return -EINVAL;
+
+	/* READ_ONCE because the sk->sk_callback_lock may not be held here */
+	if (READ_ONCE(nsk->sk_user_data))
+		return -EBUSY;
+
+	return 0;
+}
+
+/*
+ * Called from syscall only.
+ * The "nsk" in the fd refcnt.
+ * The "osk" and "reuse" are protected by reuseport_lock.
+ */
+int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 map_flags)
+{
+	struct reuseport_array *array = reuseport_array(map);
+	struct sock *free_osk = NULL, *osk, *nsk;
+	struct sock_reuseport *reuse;
+	u32 index = *(u32 *)key;
+	struct socket *socket;
+	int err, fd;
+
+	if (map_flags > BPF_EXIST)
+		return -EINVAL;
+
+	if (index >= map->max_entries)
+		return -E2BIG;
+
+	if (map->value_size == sizeof(u64)) {
+		u64 fd64 = *(u64 *)value;
+
+		if (fd64 > S32_MAX)
+			return -EINVAL;
+		fd = fd64;
+	} else {
+		fd = *(int *)value;
+	}
+
+	socket = sockfd_lookup(fd, &err);
+	if (!socket)
+		return err;
+
+	nsk = socket->sk;
+	if (!nsk) {
+		err = -EINVAL;
+		goto put_file;
+	}
+
+	/* Quick checks before taking reuseport_lock */
+	err = reuseport_array_update_check(array, nsk,
+					   rcu_access_pointer(array->ptrs[index]),
+					   rcu_access_pointer(nsk->sk_reuseport_cb),
+					   map_flags);
+	if (err)
+		goto put_file;
+
+	spin_lock_bh(&reuseport_lock);
+	/*
+	 * Some of the checks only need reuseport_lock
+	 * but it is done under sk_callback_lock also
+	 * for simplicity reason.
+	 */
+	write_lock_bh(&nsk->sk_callback_lock);
+
+	osk = rcu_dereference_protected(array->ptrs[index],
+					lockdep_is_held(&reuseport_lock));
+	reuse = rcu_dereference_protected(nsk->sk_reuseport_cb,
+					  lockdep_is_held(&reuseport_lock));
+	err = reuseport_array_update_check(array, nsk, osk, reuse, map_flags);
+	if (err)
+		goto put_file_unlock;
+
+	/* Ensure reuse->reuseport_id is set */
+	err = reuseport_get_id(reuse);
+	if (err < 0)
+		goto put_file_unlock;
+
+	WRITE_ONCE(nsk->sk_user_data, &array->ptrs[index]);
+	rcu_assign_pointer(array->ptrs[index], nsk);
+	free_osk = osk;
+	err = 0;
+
+put_file_unlock:
+	write_unlock_bh(&nsk->sk_callback_lock);
+
+	if (free_osk) {
+		write_lock_bh(&free_osk->sk_callback_lock);
+		WRITE_ONCE(free_osk->sk_user_data, NULL);
+		write_unlock_bh(&free_osk->sk_callback_lock);
+	}
+
+	spin_unlock_bh(&reuseport_lock);
+put_file:
+	fput(socket->file);
+	return err;
+}
+
+/* Called from syscall */
+static int reuseport_array_get_next_key(struct bpf_map *map, void *key,
+					void *next_key)
+{
+	struct reuseport_array *array = reuseport_array(map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = (u32 *)next_key;
+
+	if (index >= array->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == array->map.max_entries - 1)
+		return -ENOENT;
+
+	*next = index + 1;
+	return 0;
+}
+
+const struct bpf_map_ops reuseport_array_ops = {
+	.map_alloc_check = reuseport_array_alloc_check,
+	.map_alloc = reuseport_array_alloc,
+	.map_free = reuseport_array_free,
+	.map_lookup_elem = reuseport_array_lookup_elem,
+	.map_get_next_key = reuseport_array_get_next_key,
+	.map_delete_elem = reuseport_array_delete_elem,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5af4e9e2722d..57f4d076141b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -684,6 +684,8 @@ static int map_lookup_elem(union bpf_attr *attr)
 		err = bpf_fd_array_map_lookup_elem(map, key, value);
 	} else if (IS_FD_HASH(map)) {
 		err = bpf_fd_htab_map_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
 	} else {
 		rcu_read_lock();
 		ptr = map->ops->map_lookup_elem(map, key);
@@ -790,6 +792,10 @@ static int map_update_elem(union bpf_attr *attr)
 		err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
 						  attr->flags);
 		rcu_read_unlock();
+	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+		/* rcu_read_lock() is not needed */
+		err = bpf_fd_reuseport_array_update_elem(map, key, value,
+							 attr->flags);
 	} else {
 		rcu_read_lock();
 		err = map->ops->map_update_elem(map, key, value, attr->flags);
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index cf2e4d305af9..8235f2439816 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -186,6 +186,14 @@ void reuseport_detach_sock(struct sock *sk)
 	spin_lock_bh(&reuseport_lock);
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
+
+	/* At least one of the sk in this reuseport group is added to
+	 * a bpf map. Notify the bpf side. The bpf map logic will
+	 * remove the sk if it is indeed added to a bpf map.
+	 */
+	if (reuse->reuseport_id)
+		bpf_sk_reuseport_detach(sk);
+
 	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 
 	for (i = 0; i < reuse->num_socks; i++) {