Hi all,

at work I'm (well, *we* are) facing an interesting problem. Since we are sort
of stabbing in the dark here, I thought I'd ask here. Also, since this is from
work, I will not be able to divulge very many details (not to mention that as a
student worker I simply don't *know* many details). However, I do have
permission from my boss to ask about this in an anonymised fashion.

The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
through after a certain amount of time. What happens is:

1.) The NIC is brought up (some built-in Intel model).

2.) A DHCP client configures it.

3.) The network connection is lost at some point (the amount of time this takes
    varies, but it can be as little as 20 minutes).

4.) Eventually the lease runs out and the DHCP client tries to renew it, but
    gets no response. Sometimes, after many hours (at least 6), it will get a
    DHCPACK, but that's it. One of our sysadmins says that not only does
    the DHCP server never see the packets, but the managed switch that the PC
    is directly attached to *also* never does (again, except for when the
    occasional DHCPACK comes).

5.) Restart the network device. A reboot is not required, but it is necessary
    to terminate the DHCP client. After that everything works again.

6.) GOTO 3.

(Note that I have observed that steps 3 and 4 do not necessarily occur in
order.)

This has been rather baffling, since this problem is limited to 3 computers.

One of them (the longest running) runs Gentoo, courtesy of me. This is the
first one we saw the problem with. Since we couldn't figure it out (switching
from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
showed this behaviour. Both PCs run some special software (some of it mine).
Thus, at some point this software was "blamed".

So we started experimenting: we configured the Fedora PC to *not* start the
special software, and have not seen any problems all week. Yesterday afternoon
I then started *one* of the programs, and had not seen any problems yet by the
time I went home.

So that would speak *for* that theory, right? Well, for comparison, my boss
recently started running a separate PC, also with a bog-standard Fedora 20.
Guess what: it *also* shows the *exact* same behaviour as the other two PCs
("journalctl -u NetworkManager" shows pages upon pages of unanswered
DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
is on a different switch and in a different VLAN.

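To put numbers on "pages upon pages of unanswered DHCPREQUESTs", something like
the following could tally the log. This is only a rough sketch: the sample
lines are made up for illustration (the address is from a documentation range),
and the function names are hypothetical. It assumes nothing more than that the
DHCP message names appear verbatim in each log line.

```python
import re
from collections import Counter

# DHCP message type names as defined in RFC 2131.
DHCP_MESSAGE = re.compile(
    r"\bDHCP(DISCOVER|OFFER|REQUEST|DECLINE|ACK|NAK|RELEASE|INFORM)\b"
)


def tally_dhcp_messages(log_lines):
    """Count how often each DHCP message type is mentioned in the log."""
    counts = Counter()
    for line in log_lines:
        for match in DHCP_MESSAGE.finditer(line):
            counts[match.group(0)] += 1
    return counts


def longest_unanswered_run(log_lines):
    """Longest streak of DHCPREQUEST lines with no DHCPACK in between."""
    longest = current = 0
    for line in log_lines:
        if "DHCPACK" in line:
            current = 0
        elif "DHCPREQUEST" in line:
            current += 1
            longest = max(longest, current)
    return longest


# Made-up sample lines, NOT taken from the affected machines.
SAMPLE = [
    "dhclient[1234]: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "dhclient[1234]: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "dhclient[1234]: DHCPACK from 192.0.2.1",
    "dhclient[1234]: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "dhclient[1234]: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
    "dhclient[1234]: DHCPREQUEST on eth0 to 255.255.255.255 port 67",
]

if __name__ == "__main__":
    print(dict(tally_dhcp_messages(SAMPLE)))
    print("longest unanswered run:", longest_unanswered_run(SAMPLE))
```

Fed the lines from "journalctl -u NetworkManager" instead of SAMPLE, this would
at least make the three machines comparable.
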
The choice of Fedora comes from the fact that we use a Fedora-based distro
internally, so it is "known". PCs running it have *not* shown the behaviour
above (AFAIK not even *once*). Thus, one of the few things I can think of is
finding out what is different about them relative to the standard Fedora.

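One way to start looking for such differences would be to dump settings on a
working and a failing machine (for instance the "key = value" output of
"sysctl -a") and compare them. The sketch below only assumes that line format.
The function names are hypothetical and the sample values are invented, so it
says nothing about what the real difference is.

```python
def parse_settings(text):
    """Parse 'key = value' lines into a dict, skipping lines without '='."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(reference, suspect):
    """Return {key: (reference_value, suspect_value)} for differing keys.

    A key missing on one side shows up with None for that side.
    """
    differences = {}
    for key in sorted(set(reference) | set(suspect)):
        ref_value = reference.get(key)
        sus_value = suspect.get(key)
        if ref_value != sus_value:
            differences[key] = (ref_value, sus_value)
    return differences


# Invented example dumps, for illustration only.
INTERNAL_DISTRO = """\
net.ipv4.conf.all.rp_filter = 1
net.ipv4.ip_forward = 0
"""
STOCK_FEDORA = """\
net.ipv4.conf.all.rp_filter = 0
net.ipv4.ip_forward = 0
net.ipv4.tcp_sack = 1
"""

if __name__ == "__main__":
    changes = diff_settings(
        parse_settings(INTERNAL_DISTRO), parse_settings(STOCK_FEDORA)
    )
    for key, (ref_value, sus_value) in changes.items():
        print(f"{key}: internal={ref_value!r} stock={sus_value!r}")
```

The same comparison would work for any other "key = value" style dump taken on
both kinds of machine.
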
Right now my main ideas on what the culprit could be are:

- The computers' kernel/network device is improperly configured. That is,
  maybe special configuration is needed for the computers to work properly as
  clients in the network. I'm thinking of support for some (from my
  perspective) obscure protocol(s).

- It's a network problem. The three computers are in two different VLANs,
  while the workplace computers running the internal Fedora-based distro are in
  a third (the main network that all the normal Windows and Linux workstations
  are connected to). However, they are on the same switch as the two computers
  running my software. One argument against this is that the Windows PC that
  runs on the same VLAN does *not* have any problems like this.

One of the other ideas I had was faulty power management, and I did read of
problems of the sort regarding the exact same network card that is in the old
Gentoo machine on an HP support forum (from around 2008). However, the local
sysadmin said that they have had nothing but good experience with those network
cards. Also: *three* computers with NIC power management problems? That sounds
a bit far-fetched to me. Nevertheless, I am not fully discounting the
possibility.

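To at least check the power management idea on each machine, the runtime power
management setting of every network device can be read from sysfs. The sketch
below assumes the usual Linux layout, where
/sys/class/net/<iface>/device/power/control contains "auto" (the kernel may
suspend the device at runtime) or "on" (the device is kept powered). The
function names and the sysfs_root parameter are hypothetical; the latter is
only there so the code can be tried against a fake directory tree.

```python
from pathlib import Path


def read_power_control(iface, sysfs_root="/sys"):
    """Return the runtime PM setting of iface, or None if it has none.

    Virtual interfaces such as "lo" have no backing device directory,
    so None is expected for them.
    """
    path = (
        Path(sysfs_root) / "class" / "net" / iface
        / "device" / "power" / "control"
    )
    try:
        return path.read_text().strip()
    except OSError:
        return None


def runtime_pm_report(sysfs_root="/sys"):
    """Map every interface name under class/net to its runtime PM setting."""
    net_dir = Path(sysfs_root) / "class" / "net"
    report = {}
    if net_dir.is_dir():
        for entry in sorted(net_dir.iterdir()):
            report[entry.name] = read_power_control(entry.name, sysfs_root)
    return report


if __name__ == "__main__":
    for iface, setting in runtime_pm_report().items():
        print(f"{iface}: {setting}")
```

An "auto" on the affected machines and an "on" on the working ones would be a
hint worth following up, although it would not prove anything by itself.
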
You can imagine how confusing and frustrating this is.

So, has anybody here ever experienced something like this? Any ideas on what
could be the cause?

Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup