Gentoo Archives: gentoo-user

From: Marc Joliet <marcec@×××.de>
To: gentoo-user@l.g.o
Subject: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
Date: Thu, 05 Mar 2015 09:46:45
Message-Id: 20150305104625.2d88242a@marcec.fritz.box
Hi all,

at work I'm (well, *we* are) facing an interesting problem. Since we are sort
of stabbing in the dark here, I thought I'd ask here. Also, since this is from
work, I will not be able to divulge very many details (not to mention that as a
student worker I simply don't *know* many details). However, I do have
permission from my boss to ask about this in an anonymised fashion.

The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
through after a certain amount of time. What happens is:

1.) The NIC is brought up (some built-in Intel model).

2.) A DHCP client configures it.

3.) The network connection is lost at some point (the amount of time this takes
    varies, but it can be as little as 20 minutes).

4.) Eventually the lease runs out and the DHCP client tries to renew it, but
    gets no response. Sometimes, after many hours (at least 6), it will get a
    DHCPACK, but that's it. One of our sysadmins says that not only does
    the DHCP server never see the packets, but the managed switch that the PC
    is directly attached to *also* never does (again, except for when the
    occasional DHCPACK comes).

5.) Restart the network device. A reboot is not required, but it is necessary
    to terminate the DHCP client. After that everything works again.

6.) GOTO 3.

(Note that I have observed that steps 3 and 4 do not necessarily occur in
order.)

This has been rather baffling, since this problem is limited to 3 computers.

One of them (the longest running) runs Gentoo, courtesy of me. This is the
first one we saw the problem with. Since we couldn't figure it out (switching
from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
showed this behaviour. Both PCs run some special software (some of it mine).
Thus, at some point this software was "blamed".

So we started experimenting: we configured the Fedora PC to *not* start the
special software, and have not seen any problems all week. Yesterday afternoon
I then started *one* of the programs, and had not seen any problems yet by the
time I went home.

So that would speak *for* that theory, right? Well, for comparison, my boss
recently started running a separate PC, also with a bog-standard Fedora 20.
Guess what: it *also* shows the *exact* same behaviour as the other two PCs
("journalctl -u NetworkManager" shows pages upon pages of unanswered
DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
is on a different switch and in a different VLAN.

The choice of Fedora comes from the fact that we use a Fedora based distro
internally, so it is "known". PCs running it have *not* shown the behaviour
above (AFAIK not even *once*). Thus, one of the few things I can think of is
finding out what is different about them relative to the standard Fedora.

Right now my main ideas on what the culprit could be are:

- The computers' kernel/network device is improperly configured. That is,
  maybe special configuration is needed for the computers to work properly as
  clients in the network. I'm thinking of support for some (from my
  perspective) obscure protocol(s).

- It's a network problem. The three computers are in two different VLANs,
  while the workplace computers running the internal Fedora based distro are in
  a third (the main network that all the normal Windows and Linux workstations
  are connected to). However, they are on the same switch as the two computers
  running my software. One argument against this is that the Windows PC that
  runs on the same VLAN does *not* have any problems like this.

One of the other ideas I had was faulty power management, and I did read of
problems of the sort regarding the exact same network card that is in the old
Gentoo machine on an HP support forum (from around 2008). However, the local
sysadmin said that they have had nothing but good experience with those network
cards. Also: *three* computers with NIC power management problems? That sounds
a bit far-fetched to me. Nevertheless, I am not fully discounting the
possibility.

You can imagine how confusing and frustrating this is.

So, has anybody here ever experienced something like this? Any ideas on what
could be the cause?

Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
