From: Tobias Klausmann
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] Race condition in Netfilter triggered by glibc 2.9
Date: Thu, 29 Jan 2009 08:47:51
In Reply to: Re: [gentoo-dev] Race condition in Netfilter triggered by glibc 2.9 by Mike Frysinger

On Wed, 28 Jan 2009, Mike Frysinger wrote:
> > On the wire between the client and the firewall, this happens:
> >
> > a packet 1 is sent
> > b packet 2 is sent
> > c answer 1 is received
> > d answer 2 is received
> >
> > Sometimes d doesn't happen because b is lost in the firewall
> > along the way (where the race condition happens).
>
> does this affect actual userspace behavior ? in other words,
> does this lead to lost lookups and errors from the resolver ?

The most visible effect (and the way we first found out about it) is a
5s hang on ssh connects. Thing is: how long that timeout is depends on
the program (whatever it uses in select()). A plain recvfrom() simply
hangs.

I wrote a simple C program to do what glibc does (simplified for
brevity):

    sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
    connect(sockfd, tgt->ai_addr, tgt->ai_addrlen);
    sendto(sockfd, payload1, sizeof(payload1), 0,
           tgt->ai_addr, tgt->ai_addrlen);
    sendto(sockfd, payload2, sizeof(payload2), 0,
           tgt->ai_addr, tgt->ai_addrlen);
    recvfrom(sockfd, buf, sizeof(buf), 0, &addr, &fromlen);
    recvfrom(sockfd, buf, sizeof(buf), 0, &addr, &fromlen);

payload1 and payload2 are an A and an AAAA request for the same name,
respectively. That second recvfrom() hangs indefinitely in the error
case.

Here's the full program for those interested:

It'd be easy to put in a call to select() and make the program time out
as glibc does instead of simply hanging. Note that for an actual test in
your environment, you'll probably have to change the payloads and line
44.

Here's the tcpdump of the error case:

    09:42:53.614905 IP > 64583+[|domain]
    09:42:53.614920 IP > 61812+[|domain]
    09:42:53.615623 IP > 64583[|domain]

Or, if you prefer tshark:

    0.000000 -> DNS Standard query A
    0.000015 -> DNS Standard query AAAA
    0.000667 -> DNS Standard query response A

As you can see, the two queries are sent very close together. glibc is
usually in the 20-50 microsecond range on this machine; my little
program can get as low as 5 microseconds. The "correct" timing of course
depends on a myriad of variables, including the load on the packet
filter, the kernel version there, etc.

A "quickfix" would indeed be to use two different ports for the
queries, but the bug in Netfilter would still be there.

Regards,
Tobias
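
P.S. For the select() variant mentioned above, a minimal sketch might
look like the following. It is not taken from the full program or from
glibc: recv_with_timeout() and demo_lost_answer() are hypothetical
names, and the "firewall" is only imitated on loopback by a second
socket that answers the first query and ignores the second. The point is
just that the second read returns after the timeout instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: wait up to timeout_ms for a datagram on fd, then
 * read it. Returns the datagram length, or -1 with errno set (ETIMEDOUT
 * if nothing arrived in time). */
ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    assert(timeout_ms >= 0);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;              /* select() failed, errno already set */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* no answer within the timeout */
        return -1;
    }
    return recv(fd, buf, len, 0);
}

/* Loopback stand-in for the error case: two queries go out, the peer
 * answers only the first. Returns 1 if answer 1 arrived and the wait
 * for answer 2 timed out, 0 otherwise. */
int demo_lost_answer(int timeout_ms)
{
    struct sockaddr_in srv_addr, cli_addr;
    socklen_t alen = sizeof(srv_addr), clen = sizeof(cli_addr);
    char buf[64];
    int ok = 0;
    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    int cli = socket(AF_INET, SOCK_DGRAM, 0);

    if (srv < 0 || cli < 0)
        goto out;
    memset(&srv_addr, 0, sizeof(srv_addr));
    srv_addr.sin_family = AF_INET;
    srv_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    srv_addr.sin_port = 0;      /* let the kernel pick a free port */
    if (bind(srv, (struct sockaddr *)&srv_addr, sizeof(srv_addr)) < 0)
        goto out;
    if (getsockname(srv, (struct sockaddr *)&srv_addr, &alen) < 0)
        goto out;
    if (connect(cli, (struct sockaddr *)&srv_addr, sizeof(srv_addr)) < 0)
        goto out;

    /* Two back-to-back "queries" from the same socket. */
    if (send(cli, "query1", 6, 0) != 6 || send(cli, "query2", 6, 0) != 6)
        goto out;

    /* The peer answers the first query only; the second is ignored. */
    if (recvfrom(srv, buf, sizeof(buf), 0,
                 (struct sockaddr *)&cli_addr, &clen) < 0)
        goto out;
    if (sendto(srv, "answer1", 7, 0,
               (struct sockaddr *)&cli_addr, clen) != 7)
        goto out;

    if (recv_with_timeout(cli, buf, sizeof(buf), timeout_ms) != 7)
        goto out;
    if (recv_with_timeout(cli, buf, sizeof(buf), timeout_ms) == -1 &&
        errno == ETIMEDOUT)
        ok = 1;                 /* second answer never came; we timed out */
out:
    if (srv >= 0)
        close(srv);
    if (cli >= 0)
        close(cli);
    return ok;
}
```

In the test program above, each recvfrom() would be replaced by a call
like recv_with_timeout(sockfd, buf, sizeof(buf), 5000); the 5000 ms is
only an assumption chosen to match the 5s hang seen with ssh.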


