On 05/03/15 09:46, Marc Joliet wrote:
> Hi all,
>
> at work I'm (well, *we* are) facing an interesting problem. Since we are sort
> of stabbing in the dark here, I thought I'd ask here. Also, since this is from
> work, I will not be able to divulge very many details (not to mention that as a
> student worker I simply don't *know* many details). However, I do have
> permission from my boss to ask about this in an anonymised fashion.
>
> The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
> through after a certain amount of time. What happens is:
>
> 1.) The NIC is brought up (some built-in Intel model).
>
> 2.) A DHCP client configures it.
>
> 3.) The network connection is lost at some point (the amount of time this takes
>     varies, but it can be as little as 20 minutes).
>
> 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
>     gets no response. Sometimes, after many hours (at least 6), it will get a
>     DHCPACK, but that's it. One of our sysadmins says that not only does
>     the DHCP server never see the packets, but the managed switch that the PC
>     is directly attached to *also* never does (again, except for when the
>     occasional DHCPACK comes).
>
> 5.) Restart the network device. A reboot is not required, but it is necessary
>     to terminate the DHCP client. After that everything works again.
>
> 6.) GOTO 3.
>
> (Note that I have observed that steps 3 and 4 do not necessarily occur in
> order.)
>
> This has been rather baffling, since this problem is limited to 3 computers.
>
> One of them (the longest running) runs Gentoo, courtesy of me. This is the
> first one we saw the problem with. Since we couldn't figure it out (switching
> from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
> etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
> was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
> showed this behaviour. Both PCs run some special software (some of it mine).
> Thus, at some point this software was "blamed".
>
> So we started experimenting: we configured the Fedora PC to *not* start the
> special software, and have not seen any problems all week. Yesterday afternoon
> I then started *one* of the programs, and had not seen any problems yet by the
> time I went home.
>
> So that would speak *for* that theory, right? Well, for comparison, my boss
> recently started running a separate PC, also with a bog-standard Fedora 20.
> Guess what: it *also* shows the *exact* same behaviour as the other two PCs
> ("journalctl -u NetworkManager" shows pages upon pages of unanswered
> DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
> is on a different switch and in a different VLAN.
>
> The choice of Fedora comes from the fact that we use a Fedora based distro
> internally, so it is "known". PCs running it have *not* shown the behaviour
> above (AFAIK not even *once*). Thus, one of the few things I can think of is
> finding out what is different about them relative to the standard Fedora.
>
> Right now my main ideas on what the culprit could be are:
>
> - The computers' kernel/network device is improperly configured. That is,
>   maybe special configuration is needed for the computers to work properly as
>   clients in the network. I'm thinking of support for some (from my
>   perspective) obscure protocol(s).
>
> - It's a network problem. The three computers are in two different VLANs,
>   while the workplace computers running the internal Fedora based distro are in
>   a third (the main network that all the normal Windows and Linux workstations
>   are connected to). However, they are on the same switch as the two computers
>   running my software. One argument against this is that the Windows PC that
>   runs on the same VLAN does *not* have any problems like this.
>
> One of the other ideas I had was faulty power management, and I did read of
> problems of the sort regarding the exact same network card that is in the old
> Gentoo machine on an HP support forum (from around 2008). However, the local
> sysadmin said that they have had nothing but good experience with those network
> cards. Also: *three* computers with NIC power management problems? That sounds
> a bit far-fetched to me. Nevertheless, I am not fully discounting the
> possibility.
>
> You can imagine how confusing and frustrating this is.
>
> So, has anybody here ever experienced something like this? Any ideas on what
> could be the cause?
>
> Greetings

Howdy

I've seen this before, but not with the NIC down event.
The problem was old managed Alcatel switches combined with questionable
wiring.
In my case it was reversed: the Gentoo box was providing the DHCP, but
then suddenly nothing got DHCP responses.
Power cycling the switch was a temporary fix.
Updating the switch firmware helped a lot - went from a daily occurrence
to a weekly occurrence.
I'd have a word with the network team and have them verify through port
mirroring that:
1. the DHCP server is sending packets out and they are being received on
the switchport it is connected to
2. the packet is also being sent out on the correct port
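
If the network team can hand over text dumps from the two mirrored ports, the comparison can be scripted. A minimal sketch in Python, assuming the dumps are tcpdump-style text lines that carry the DHCP transaction ID as "xid 0x..."; the function names and the sample lines are made up for illustration and are not from any real capture:

```python
import re

# Assumption: each relevant line mentions the transaction ID as "xid 0x<hex>".
XID_RE = re.compile(r"xid (0x[0-9a-fA-F]+)")


def xids(capture_lines):
    """Collect the DHCP transaction IDs mentioned in the given text lines."""
    found = set()
    for line in capture_lines:
        match = XID_RE.search(line)
        if match:
            found.add(match.group(1).lower())
    return found


def lost_between(upstream_lines, downstream_lines):
    """Return xids seen on the upstream mirror but never on the downstream one."""
    return sorted(xids(upstream_lines) - xids(downstream_lines))


# Invented sample lines, one list per mirrored port.
server_port = [
    "IP 10.0.0.1.67 > 10.0.0.23.68: BOOTP/DHCP, Reply, length 300, xid 0x1a2b3c4d",
    "IP 10.0.0.1.67 > 10.0.0.23.68: BOOTP/DHCP, Reply, length 300, xid 0x5e6f7a8b",
]
client_port = [
    "IP 10.0.0.1.67 > 10.0.0.23.68: BOOTP/DHCP, Reply, length 300, xid 0x1a2b3c4d",
]

print(lost_between(server_port, client_port))  # ['0x5e6f7a8b']
```

Anything listed there left the server's port but never showed up on the client's port, which would point at the switch rather than at either box.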
What they will probably discover is an issue with the MAC tables /
switching, and they will have to bounce the ports / the switch.
Forcing the up/down on the DHCP server also seemed to help on occasion.
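
To give the network team something concrete to correlate with the switch logs, it may help to measure how bad the unanswered stretches get. A rough sketch in Python, assuming only that the client log contains the literal message names DHCPREQUEST and DHCPACK (as in the journalctl output quoted above); the function name and the sample lines are invented:

```python
import re

MSG_RE = re.compile(r"\b(DHCPREQUEST|DHCPACK)\b")


def longest_request_run(log_lines):
    """Return the longest streak of DHCPREQUESTs with no DHCPACK in between.

    A result of 1 means every request was followed by an ACK.
    """
    longest = current = 0
    for line in log_lines:
        match = MSG_RE.search(line)
        if match is None:
            continue  # unrelated log line
        if match.group(1) == "DHCPREQUEST":
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # an ACK ends the streak
    return longest


# Invented sample log, not real output.
sample_log = [
    "DHCPREQUEST on em1 to 10.0.0.1 port 67",
    "DHCPACK from 10.0.0.1",
    "DHCPREQUEST on em1 to 10.0.0.1 port 67",
    "DHCPREQUEST on em1 to 255.255.255.255 port 67",
    "DHCPREQUEST on em1 to 255.255.255.255 port 67",
    "DHCPACK from 10.0.0.1",
    "DHCPREQUEST on em1 to 10.0.0.1 port 67",
]

print(longest_request_run(sample_log))  # 3
```

Feeding it the saved output of "journalctl -u NetworkManager" line by line should work the same way, as long as those message names appear in it.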
Good luck - if you find the resolution is something else, please do let
me know, as I'd love to find out what the issue might have been if not
the switch!