path: root/net/ipv4
Commit message  Author  Age
* ipv4: Remove rt->rt_dst reference from ip_forward_options().  David S. Miller  2011-05-13
| At this point iph->daddr equals what rt->rt_dst would hold.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Remove route key identity dependencies in ip_rt_get_source().  David S. Miller  2011-05-13
| Pass in the sk_buff so that we can fetch the necessary keys from the packet header when working with input routes.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Always call ip_options_build() after rest of IP header is filled in.  David S. Miller  2011-05-13
| This will allow ip_options_build() to reliably look at the values of iph->{daddr,saddr}.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Kill spurious write to iph->daddr in ip_forward_options().  David S. Miller  2011-05-13
| This code block executes when opt->srr_is_hit is set. It will be set only by ip_options_rcv_srr().
|
| ip_options_rcv_srr() walks until it hits a matching nexthop in the SRR option addresses, and when it matches one 1) looks up the route for that nexthop and 2) on route lookup success it writes that nexthop value into iph->daddr.
|
| ip_forward_options() runs later, and again walks the SRR option addresses looking for the option matching the destination of the route stored in skb_rtable(). This route will be the same exact one looked up for the nexthop by ip_options_rcv_srr().
|
| Therefore "rt->rt_dst == iph->daddr" must be true.
|
| All it really needs to do is record the route's source address in the matching SRR option address. It need not write iph->daddr again, since that has already been done by ip_options_rcv_srr() as detailed above.
|
| Signed-off-by: David S. Miller <davem@davemloft.net>
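The walk described in the message above can be pictured with a small userspace model. This is only a sketch, not the kernel code: the function name srr_record_source() and the flat byte-array representation are assumptions made for illustration. The option layout used (type, length, 1-based pointer, then 4-byte addresses) is the standard IPv4 source route option format.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace model of the SRR walk; not the kernel function.
 * opt[0] = option type, opt[1] = option length in bytes,
 * opt[2] = pointer (1-based offset of the next address to process),
 * opt[3..] = list of 4-byte addresses.
 *
 * Looks for the address equal to daddr and, if found, overwrites it with
 * the route's source address and advances the pointer. The packet's
 * destination address is deliberately not written here.
 * Returns 1 on a match, 0 otherwise.
 */
static int srr_record_source(uint8_t *opt, const uint8_t daddr[4],
                             const uint8_t src[4])
{
    int len = opt[1];
    int ptr;

    if (opt[2] < 4)                 /* pointer must be past the 3-byte header */
        return 0;

    for (ptr = opt[2]; ptr + 3 <= len; ptr += 4) {
        if (memcmp(daddr, &opt[ptr - 1], 4) == 0) {
            memcpy(&opt[ptr - 1], src, 4);  /* record route source */
            opt[2] = (uint8_t)(ptr + 4);    /* advance the pointer */
            return 1;
        }
    }
    return 0;
}
```

In this model the only state changed on a match is the option itself, which mirrors the point of the commit: the destination address was already written earlier, so the forwarding step has nothing left to do to it.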
* net: ipv4: add IPPROTO_ICMP socket kind  Vasiliy Kulikov  2011-05-13
| This patch adds IPPROTO_ICMP socket kind. It makes it possible to send ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages without any special privileges. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In order not to increase the kernel's attack surface, the new functionality is disabled by default, but is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range (see below).
|
| Similar functionality is implemented in Mac OS X: http://www.manpagez.com/man/4/icmp/
|
| A new ping socket is created with socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP).
|
| Message identifiers (octets 4-5 of ICMP header) are interpreted as local ports. Addresses are stored in struct sockaddr_in. No port numbers are reserved for privileged processes, port 0 is reserved for API ("let the kernel pick a free number"). There is no notion of remote ports, remote port numbers provided by the user (e.g. in connect()) are ignored.
|
| Data sent and received include ICMP headers. This is deliberate to:
| 1) Avoid the need to transport headers values like sequence numbers by other means.
| 2) Make it easier to port existing programs using raw sockets.
|
| ICMP headers given to send() are checked and sanitized. The type must be ICMP_ECHO and the code must be zero (future extensions might relax this, see below). The id is set to the number (local port) of the socket, the checksum is always recomputed.
|
| ICMP reply packets received from the network are demultiplexed according to their id's, and are returned by recv() without any modifications. IP header information and ICMP errors of those packets may be obtained via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR).
|
| ICMP source quenches and redirects are reported as fake errors via the error queue (IP_RECVERR); the next hop address for redirects is saved to ee_info (in network order).
|
| socket(2) is restricted to the group range specified in "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning that nobody (not even root) may create ping sockets. Setting it to "100 100" would grant permissions to the single group (to either make /sbin/ping g+s and owned by this group or to grant permissions to the "netadmins" group), "0 4294967295" would enable it for the world, "100 4294967295" would enable it for the users, but not daemons.
|
| The existing code might be (in the unlikely case anyone needs it) extended rather easily to handle other similar pairs of ICMP messages (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply etc.).
|
| Userspace ping util & patch for it: http://openwall.info/wiki/people/segoon/ping
|
| For Openwall GNU/*/Linux it was the last step on the road to the setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels) is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs: http://mirrors.kernel.org/openwall/Owl/current/iso/
|
| Initially this functionality was written by Pavel Kankovsky for Linux 2.4.32, but unfortunately it was never made public.
|
| All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) are tested with the patch.
|
| PATCH v3:
| - switched to flowi4.
| - minor changes to be consistent with raw sockets code.
|
| PATCH v2:
| - changed ping_debug() to pr_debug().
| - removed CONFIG_IP_PING.
| - removed ping_seq_fops.owner field (unused for procfs).
| - switched to proc_net_fops_create().
| - switched to %pK in seq_printf().
|
| PATCH v1:
| - fixed checksumming bug.
| - CAP_NET_RAW may not create icmp sockets anymore.
|
| RFC v2:
| - minor cleanups.
| - introduced sysctl'able group range to restrict socket(2).
|
| Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
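From userspace, the API described in the message above might be used roughly as follows. This is a minimal sketch under assumptions: the helper names open_ping_socket(), fill_echo_request() and send_echo() are invented for illustration, and struct icmphdr is taken from glibc's <netinet/ip_icmp.h>. Whether socket() succeeds depends on the ping_group_range setting of the running system.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip_icmp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns a ping socket fd, or -1 if the kernel
 * refuses (for example when the caller's groups are outside
 * ping_group_range, or the kernel lacks this socket kind). */
static int open_ping_socket(void)
{
    return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}

/* Hypothetical helper: prepare an echo request header. The id and
 * checksum are left zero on purpose: per the commit message the kernel
 * sets the id to the socket's local port and recomputes the checksum. */
static void fill_echo_request(struct icmphdr *hdr, unsigned short seq)
{
    memset(hdr, 0, sizeof(*hdr));
    hdr->type = ICMP_ECHO;              /* only ICMP_ECHO is accepted */
    hdr->code = 0;                      /* code must be zero */
    hdr->un.echo.sequence = htons(seq);
}

/* Hypothetical helper: send one header-only echo request to dst. */
static ssize_t send_echo(int fd, const struct sockaddr_in *dst,
                         unsigned short seq)
{
    struct icmphdr hdr;

    fill_echo_request(&hdr, seq);
    return sendto(fd, &hdr, sizeof(hdr), 0,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```

A caller would check the return value of open_ping_socket() and fall back to a raw socket (or report the error) when it is negative; replies read with recv() start with the ICMP header, as the message states.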
* ipv4: Fix 'iph' use before set.  David S. Miller  2011-05-12
| I swear none of my compilers warned about this, yet it is so obvious.
|
| > net/ipv4/ip_forward.c: In function 'ip_forward':
| > net/ipv4/ip_forward.c:87: warning: 'iph' may be used uninitialized in this function
|
| Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Elide use of rt->rt_dst in ip_forward()  David S. Miller  2011-05-12
| No matter what kind of header mangling occurs due to IP options processing, rt->rt_dst will always equal iph->daddr in the packet. So we can safely use iph->daddr instead of rt->rt_dst here.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Simplify iph->daddr overwrite in ip_options_rcv_srr().  David S. Miller  2011-05-12
| We already copy the 4-byte nexthop from the options block into local variable "nexthop" for the route lookup. Re-use that variable instead of memcpy()'ing again when assigning to iph->daddr after the route lookup succeeds.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Kill spurious opt->srr check in ip_options_rcv_srr().  David S. Miller  2011-05-12
| All call sites conditionalize the call to ip_options_rcv_srr() with a check of opt->srr, so no need to check it again there.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6  David S. Miller  2011-05-11
|\
| | Conflicts: drivers/net/benet/be_main.c
| * xfrm: Assign the inner mode output function to the dst entry  Steffen Klassert  2011-05-10
| | As it is, we assign the outer modes output function to the dst entry when we create the xfrm bundle. This leads to two problems on interfamily scenarios. We might insert ipv4 packets into ip6_fragment when called from xfrm6_output. The system crashes if we try to fragment an ipv4 packet with ip6_fragment. This issue was introduced with git commit ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed). The second issue is, that we might insert ipv4 packets in netfilter6 and vice versa on interfamily scenarios.
| |
| | With this patch we assign the inner mode output function to the dst entry when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner mode is used and the right fragmentation and netfilter functions are called. We switch then to outer mode with the output_finish functions.
| |
| | Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp_cubic: limit delayed_ack ratio to prevent divide error  stephen hemminger  2011-05-08
| | TCP Cubic keeps a metric that estimates the amount of delayed acknowledgements to use in adjusting the window. If an abnormally large number of packets are acknowledged at once, then the update could wrap and reach zero. This kind of ACK could only happen when there was a large window and huge number of ACK's were lost.
| |
| | This patch limits the value of delayed ack ratio. The choice of 32 is just a conservative value since normally it should be range of 1 to 4 packets.
| |
| | Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
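The wrap-to-zero hazard and the cap can be shown with a small userspace model. This is a sketch under assumptions, not the kernel source: the function name update_delayed_ack(), the 16-bit storage width, and the fixed-point scaling with 4 fractional bits (ACK_RATIO_SHIFT) are chosen for illustration. Only the limit of 32 comes from the message above.

```c
#include <stdint.h>

#define ACK_RATIO_SHIFT 4                         /* assumed fixed-point scale */
#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)  /* cap of 32 from the commit */

/*
 * Hypothetical model: exponentially averaged count of packets covered per
 * ACK, kept in a 16-bit field. The sum is computed in 32 bits and capped
 * before it is narrowed, so a huge burst of acknowledged packets cannot
 * truncate to zero and later be used as a divisor.
 */
static uint16_t update_delayed_ack(uint16_t delayed_ack, uint32_t acked)
{
    uint32_t ratio = delayed_ack;

    ratio -= delayed_ack >> ACK_RATIO_SHIFT;  /* decay the old estimate */
    ratio += acked;                           /* add the new sample */

    return (uint16_t)(ratio < ACK_RATIO_LIMIT ? ratio : ACK_RATIO_LIMIT);
}
```

In this model, an old estimate of 32 plus a sample of 65506 sums to 65536, which would become 0 if stored in 16 bits without the cap; with the cap the stored value is 512 (32 in fixed point).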
* | net: fix two lockdep splats  Eric Dumazet  2011-05-10
| | Commit e67f88dd12f6 (net: dont hold rtnl mutex during netlink dump callbacks) switched rtnl protection to RCU, but we forgot to adjust two rcu_dereference() lockdep annotations:
| |
| | inet_get_link_af_size() or inet_fill_link_af() might be called with rcu_read_lock or rtnl held, so use rcu_dereference_rtnl() instead of rtnl_dereference().
| |
| | Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
| | Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: xfrm: Eliminate ->rt_src reference in policy code.  David S. Miller  2011-05-10
| | Rearrange xfrm4_dst_lookup() so that it works by calling a helper function __xfrm_dst_lookup() that takes an explicit flow key storage area as an argument.
| |
| | Use this new helper in xfrm4_get_saddr() so we can fetch the selected source address from the flow instead of from rt->rt_src.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: udp: Eliminate remaining uses of rt->rt_src  David S. Miller  2011-05-10
| | We already track and pass around the correct flow key, so simply use it in udp_send_skb().
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: icmp: Eliminate remaining uses of rt->rt_src  David S. Miller  2011-05-10
| | On input packets, rt->rt_src always equals ip_hdr(skb)->saddr.
| |
| | Anything that mangles or otherwise changes the IP header must relookup the route found at skb_rtable(). Therefore this invariant must always hold true.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pass explicit daddr arg to ip_send_reply().  David S. Miller  2011-05-10
| | This eliminates an access to rt->rt_src.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pass flow key down into ip_append_*().  David S. Miller  2011-05-09
| | This way rt->rt_dst accesses are unnecessary.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pass flow keys down into datagram packet building engine.  David S. Miller  2011-05-09
| | This way ip_output.c no longer needs rt->rt_{src,dst}.
| |
| | We already have these keys sitting, ready and waiting, on the stack or in a socket structure.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: Use flow key information instead of rt->rt_{src,dst}  David S. Miller  2011-05-09
| | We have two cases.
| |
| | Either the socket is in TCP_ESTABLISHED state and connect() filled in the inet socket cork flow, or we looked up the route here and used an on-stack flow.
| |
| | Track which one it was, and use it to obtain src/dst addrs.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
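The two cases named in the message above amount to choosing which flow key to read the addresses from. A minimal userspace sketch of that choice follows; the types flow4_model and udp_sock_model and the function udp_pick_flow() are hypothetical stand-ins, not the kernel's structures.

```c
#include <stdint.h>

/* Hypothetical stand-in for an IPv4 flow key. */
struct flow4_model {
    uint32_t saddr;
    uint32_t daddr;
};

/* Hypothetical stand-in for the socket state relevant here. */
struct udp_sock_model {
    int connected;                /* models the TCP_ESTABLISHED state */
    struct flow4_model cork_fl;   /* filled in by connect() */
};

/*
 * Track which flow was used: the socket's cork flow for a connected
 * socket, otherwise the on-stack flow used for this send's route lookup.
 * Source and destination addresses are then read from the result rather
 * than from the cached route.
 */
static const struct flow4_model *
udp_pick_flow(const struct udp_sock_model *sk,
              const struct flow4_model *stack_fl)
{
    return sk->connected ? &sk->cork_fl : stack_fl;
}
```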
* | tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer()  David S. Miller  2011-05-08
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit().  David S. Miller  2011-05-08
| | Now we can pick it out of the provided flow key.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Pass flowi to ->queue_xmit().  David S. Miller  2011-05-08
| | This allows us to acquire the exact route keying information from the protocol, however that might be managed.
| |
| | It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.  David S. Miller  2011-05-08
| | Operation order is now transposed, we first create the child socket then we try to hook up the route.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Create inet_csk_route_child_sock().  David S. Miller  2011-05-08
| | This is just like inet_csk_route_req() except that it operates after we've created the new child socket.
| |
| | In this way we can use the new socket's cork flow for proper route key storage.
| |
| | This will be used by DCCP and TCP child socket creation handling.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Use cork flow in ip_queue_xmit()  David S. Miller  2011-05-08
| | All invokers of ip_queue_xmit() must make certain that the socket is locked. All of SCTP, TCP, DCCP, and L2TP now make sure this is the case.
| |
| | Therefore we can use the cork flow during output route lookup in ip_queue_xmit() when the socket route check fails.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Use cork flow in inet_sk_{reselect_saddr,rebuild_header}()  David S. Miller  2011-05-08
| | These two functions must be invoked only when the socket is locked (because socket identity modifications are made non-atomically).
| |
| | Therefore we can use the cork flow for output route lookups.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Lock socket and use cork flow in ip4_datagram_connect().  David S. Miller  2011-05-08
| | This is to make sure that an l2tp socket's inet cork flow is fully filled in, when it's encapsulated in UDP.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Use cork flow in tcp_v4_connect()  David S. Miller  2011-05-08
| | Since this is invoked from inet_stream_connect() the socket is locked and therefore this usage is safe.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Initialize cork->opt using NULL not 0.  David S. Miller  2011-05-06
| | Noticed by Joe Perches.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Initialize on-stack cork more efficiently.  David S. Miller  2011-05-06
| | ip_setup_cork() explicitly initializes every member of inet_cork except flags, addr, and opt. So we can simply set those three members to zero instead of using a memset() via an empty struct assignment.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| | Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* | inet: Decrease overhead of on-stack inet_cork.  David S. Miller  2011-05-06
| | When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary.
| |
| | This is because inet_cork contains a "struct flowi" which isn't used in these code paths.
| |
| | Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| | Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
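The shape of that split can be sketched with stand-in types. Everything below is hypothetical: the struct names, members and sizes are invented for illustration and do not match the kernel's inet_cork, inet_cork_full or struct flowi layouts. The point is only that the on-stack user pays for the small part while the socket embeds the small part plus the flow key.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's "struct flowi"; the real layout differs. */
struct flow_model {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto, tos;
    uint32_t oif, mark;
};

/* Small part: what a lockless datagram fast path keeps on the stack. */
struct cork_model {
    unsigned int flags;
    uint32_t     addr;
    void        *opt;
    unsigned int fragsize;
};

/* Full part: what the socket stores, i.e. the small part plus the key. */
struct cork_full_model {
    struct cork_model base;
    struct flow_model fl;
};

/* Bytes no longer placed on the stack once the fast path uses the
 * small structure instead of the full one. */
static size_t stack_bytes_saved(void)
{
    return sizeof(struct cork_full_model) - sizeof(struct cork_model);
}
```

Because the small part is the first member of the full one, code that only needs the small part can be handed a pointer to either, which is one plausible reason to arrange a split this way.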
* | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6  David S. Miller  2011-05-05
|\|
| | Conflicts: drivers/net/tg3.c
| * net: ip_expire() must revalidate route  Eric Dumazet  2011-05-04
| | Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path) added a bug in IP defragmentation handling, in case timeout is fired.
| |
| | When a frame is defragmented, we use last skb dst field when building final skb. Its dst is valid, since we are in rcu read section.
| |
| | But if a timeout occurs, we take first queued fragment to build one ICMP TIME EXCEEDED message. Problem is all queued skb have weak dst pointers, since we escaped RCU critical section after their queueing. icmp_send() might dereference a now freed (and possibly reused) part of memory.
| |
| | Calling skb_dst_drop() and ip_route_input_noref() to revalidate route is the only possible choice.
| |
| | Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
| | Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * sysctl: net: call unregister_net_sysctl_table where needed  Lucian Adrian Grijincu  2011-05-02
| | ctl_table_headers registered with register_net_sysctl_table should have been unregistered with the equivalent unregister_net_sysctl_table.
| | Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: don't spam dmesg with "Using LC-trie" messages  Alexey Dobriyan  2011-05-02
| | fib_trie_table() is called during netns creation and Chromium uses clone(CLONE_NEWNET) to sandbox renderer process.
| |
| | Don't print anything.
| |
| | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: call dev_alloc_name from register_netdevice  Jiri Pirko  2011-05-05
| | Force dev_alloc_name() to be called from register_netdevice() by dev_get_valid_name(). That allows removing multiple explicit dev_alloc_name() calls.
| |
| | The possibility to call dev_alloc_name in advance remains.
| |
| | This also fixes veth creation regression caused by 84c49d8c3e4abefb0a41a77b25aa37ebe8d6b743
| |
| | Signed-off-by: Jiri Pirko <jpirko@redhat.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Kill rt->rt_{src, dst} usage in IP GRE tunnels.  David S. Miller  2011-05-04
| | First, make callers pass on-stack flowi4 to ip_route_output_gre() so they can get at the fully resolved flow key.
| |
| | Next, use that in ipgre_tunnel_xmit() to avoid the need to use rt->rt_{dst,src}.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pass explicit saddr/daddr args to ipmr_get_route().  David S. Miller  2011-05-04
| | This eliminates the need to use rt->rt_{src,dst}.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in.  David S. Miller  2011-05-04
| | Instead of rt->rt_{dst,src}.
| |
| | The only tricky part is source route option handling.
| |
| | If the source route option is enabled we can't just use plain 'daddr', we have to use opt->opt.faddr.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
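The "tricky part" above is a single selection: with a source route option present, the header's destination is the first hop recorded in the option, not the final destination passed in. A minimal userspace sketch of that rule, using a hypothetical ip_opts_model type and header_daddr() helper rather than the kernel's structures:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parsed IP options of a socket. */
struct ip_opts_model {
    int      srr;    /* nonzero when a source route option is present */
    uint32_t faddr;  /* first-hop address taken from that option */
};

/*
 * Pick the destination address to write into the IP header: the
 * caller-supplied daddr normally, or the option's first hop when
 * source routing is enabled.
 */
static uint32_t header_daddr(uint32_t daddr, const struct ip_opts_model *opt)
{
    return (opt != NULL && opt->srr) ? opt->faddr : daddr;
}
```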
* | ipv4: Use flowi4->{daddr,saddr} in ipip_tunnel_xmit().  David S. Miller  2011-05-04
| | Instead of rt->rt_{dst,src}.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Use flowi4's {saddr,daddr} in igmpv3_newpack() and igmp_send_report()  David S. Miller  2011-05-03
| | Instead of rt->rt_{src,dst}.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Make caller provide on-stack flow key to ip_route_output_ports().  David S. Miller  2011-05-03
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Rename struct rtable's rt_tos to rt_key_tos.  David S. Miller  2011-05-03
| | To more accurately reflect that it is purely a routing cache lookup key and is used in no other context.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Rework ipmr_rt_fib_lookup() flow key initialization.  David S. Miller  2011-05-03
| | Use information from the skb as much as possible, currently this means daddr, saddr, and TOS.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Make sure flowi4->{saddr,daddr} are always set.  David S. Miller  2011-05-02
| | Slow path output route resolution always makes sure that ->{saddr,daddr} are set, and also if we trigger into IPSEC resolution we initialize them as well, because xfrm_lookup() expects them to be fully resolved.
| |
| | But if we hit the fast path and flowi4->flowi4_proto is zero, we won't do this initialization.
| |
| | Therefore, move the IPSEC path initialization to the route cache lookup fast path to make sure these are always set.
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4, ipv6, bonding: Restore control over number of peer notifications  Ben Hutchings  2011-04-29
| | For backward compatibility, we should retain the module parameters and sysfs attributes to control the number of peer notifications (gratuitous ARPs and unsolicited NAs) sent after bonding failover. Also, it is possible for failover to take place even though the new active slave does not have link up, and in that case the peer notification should be deferred until it does.
| |
| | Change ipv4 and ipv6 so they do not automatically send peer notifications on bonding failover.
| |
| | Change the bonding driver to send separate NETDEV_NOTIFY_PEERS notifications when the link is up, as many times as requested. Since it does not directly control which protocols send notifications, make num_grat_arp and num_unsol_na aliases for a single parameter. Bump the bonding version number and update its documentation.
| |
| | Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
| | Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
| | Acked-by: Brian Haley <brian.haley@hp.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Get route daddr from flow key in tcp_v4_connect().  David S. Miller  2011-04-29
| | Now that output route lookups update the flow with destination address selection, we can fetch it from fl4->daddr instead of rt->rt_dst.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Get route daddr from flow key in inet_csk_route_req().  David S. Miller  2011-04-29
| | Now that output route lookups update the flow with destination address selection, we can fetch it from fl4->daddr instead of rt->rt_dst.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Get route daddr from flow key in ip4_datagram_connect().  David S. Miller  2011-04-29
| | Now that output route lookups update the flow with destination address selection, we can fetch it from fl4->daddr instead of rt->rt_dst.
| | Signed-off-by: David S. Miller <davem@davemloft.net>