On 06/03/2015 20:45, Marc Joliet wrote:
> First of all, thanks to everybody who responded so far.
>
> I wanted to preface my reply to Alan by mentioning that the local sysadmin made
> changes to the DHCP server that appear to have worked around whatever the issue
> is.
>
> I don't fully understand the error analysis (something to do with the DHCP
> client reaching a particular state and sending DHCP packets that something
> in-between it and the DHCP server doesn't like and that might result in vendor
> dependent behaviour), but what the DHCP server now does is tell the client to
> use the broadcast address as the DHCP server address (which is weird, because
> the DHCP clients always switch to the broadcast address after a timeout, but of
> course I'm no DHCP expert). The affected PCs have been working normally all
> day today.
|
In light of what you say below:

I'd be interested to hear what your sysadmin has to say; DHCP is one of
those things that JustWork(tm) - it uses regular UDP (ports 67 and 68) and
there's nothing funny about it at all. The only thing normally between your
NIC and the DHCP server is a switch, so that's what I'd be looking at.
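
To make the "nothing funny" point concrete, here is a minimal sketch of what a DHCP client puts on the wire: a fixed BOOTP-style header plus a few options, sent as a UDP broadcast from port 68 to port 67. The function names (`build_dhcp_discover`, `send_discover`) are made up for illustration, and the packet carries only the bare minimum of options; it is not what any particular client actually sends, and it does not reproduce the sysadmin's workaround.

```python
import socket
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the options field
BROADCAST_FLAG = 0x8000                  # asks the server to broadcast its reply


def build_dhcp_discover(mac: bytes, xid: int, broadcast: bool = True) -> bytes:
    """Build a minimal DHCPDISCOVER payload (hypothetical helper)."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte Ethernet MAC address")
    flags = BROADCAST_FLAG if broadcast else 0
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: MAC address length
        0,            # hops
        xid,          # transaction id chosen by the client
        0,            # secs
        flags,        # broadcast bit
        b"\x00" * 4,  # ciaddr: client has no address yet
        b"\x00" * 4,  # yiaddr
        b"\x00" * 4,  # siaddr
        b"\x00" * 4,  # giaddr
        mac,          # chaddr: struct pads this to 16 bytes
        b"",          # sname: unused, padded to 64 bytes
        b"",          # file: unused, padded to 128 bytes
    )
    # Option 53 (message type), length 1, value 1 = DHCPDISCOVER; 255 = end.
    options = DHCP_MAGIC_COOKIE + bytes([53, 1, 1]) + bytes([255])
    return header + options


def send_discover(payload: bytes) -> None:
    """Broadcast the payload from port 68 to port 67 (usually needs root)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.bind(("", 68))
        sock.sendto(payload, ("255.255.255.255", 67))
```

Anything between the NIC and the server that filters or mishandles broadcast frames will swallow exactly this kind of packet, which is why the switch is the first suspect.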
|
> |
> So the current resolution is "it works", but we still don't understand (or at
> least me and my boss don't) what the underlying issue is. Hence I'm still
> curious what people who know these technologies better than me think.
>
> Also, I suppose it was confusing to say that the switch never saw the packets.
> The way this was determined was by post-mortem log inspection; AFAIK we didn't
> do any live inspection on the switch. Based on the workaround, the conclusion
> we came to is that the switch must have dropped the packets (for whatever
> reason) without logging that it did.
>
> On Fri, 6 Mar 2015 08:01:44 +0200
> Alan McKinnon <alan.mckinnon@×××××.com> wrote:
>
> [...] |
>> I've seen similar things many times myself (but never on Intel network
>> kit so far)
>>
>> A lot of reading and Googling usually leads to the solution:
>>
>> - firmware upgrade for the hardware
>
> OK, I can look into that.
>
>> - use the correct driver (this is often non-obvious)
>> - try the in-kernel driver vs any out-of-tree vendor driver
>
> All PCs run with the e1000e in-kernel module. I think the Fedora systems run
> 3.18.7, so it's about as current as it can be, too. Could it really be that the
> kernel selects the wrong driver?
>
>> - apply driver parameters designed to work around buggy hardware (this
>> often involves much reading)
>
> I will also consider that. I see that the kernel sources contain
> documentation for the e1000e driver that I can look at.
|
I wasn't aware you had e1000e hardware - those are about as reliable as
they come. I've used many of them and never had the slightest trouble at
all. By all means study up on firmware and driver options - if you don't
know much about that area it's very illuminating to find out more. But
based on experience I'd say the chances of finding an oddity with e1000e
are slim, and I'd be looking at a misconfigured switch.
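
One cheap way to back that hunch up before digging into driver options is to check whether the NIC itself reports drops or errors. The sketch below reads the per-interface counters the kernel exposes under /sys/class/net/<iface>/statistics/. The helper name `read_nic_counters` and the interface name in the usage note are placeholders, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path

# Counter files the kernel provides for every network interface.
COUNTERS = (
    "rx_packets", "rx_dropped", "rx_errors",
    "tx_packets", "tx_dropped", "tx_errors",
)


def read_nic_counters(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Return the selected counters for iface; unreadable ones map to None."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters = {}
    for name in COUNTERS:
        try:
            counters[name] = int((stats_dir / name).read_text().strip())
        except (OSError, ValueError):
            # Missing interface, missing file, or unexpected content.
            counters[name] = None
    return counters
```

Calling `read_nic_counters("eth0")` before and after a failed DHCP renewal and comparing the two results shows whether the drop and error counters moved. If they stay flat while the DHCP packets go missing, that points away from the NIC and towards something further along the path.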
|
There are some strange switches out there that let you make crazy
configurations, like e.g. blanket dropping all broadcast traffic on one or
more ports. That's where I'd be looking first.
|
-- |
Alan McKinnon |
alan.mckinnon@×××××.com |