On Wed, 2013-03-20 at 16:14 +0000, Grant Edwards wrote:
> On 2013-03-20, Carlos Hendson <skyclan@×××.net> wrote:
>
> > That's by no means conclusive, however, I've also run a complete pass of
> > memcheck for over an hour without any issues reported.
>
> FWIW. I've had flakey memory that ran memcheck fine for several hours
> and multiple passes -- but if I let it run long enough, it would fail.
> I wouldn't be confident unless memtest ran for at least 12 hours (24
> would be even better).
>
> I'd also keep an eye on CPU core temperature.
>
> A failing hard-drive can also cause some pretty strange behavior. If
> your drives are smart (AFAICT, all recent ones are), ask them how
> they're feeling with 'smartctl' or something like that.
>
|
I'll run a 24-hour memtest this weekend.

I started a long test on the hard drive:

smartctl -t long /dev/sda
|
smartctl -a /dev/sda also appears to indicate various errors (the output
is attached). I'm trying to track down some documentation as to what
they're actually reporting.

Comparing the smartctl output from before the test with the output
during it, the raw count for attribute ID #195 has increased:
|
Before test:
195 Hardware_ECC_Recovered 0x001a 001 001 000 Old_age Always - 241822

During test:
195 Hardware_ECC_Recovered 0x001a 001 001 000 Old_age Always - 243582
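
For reference, the size of that increase can be worked out with a small shell sketch. It assumes the raw value is the last whitespace-separated field of each attribute row, as in the two rows quoted above; the `raw_value` helper and the variable names are only illustrative and are not part of smartctl.

```shell
# Hypothetical helper: print the last field (the raw value) of an attribute row.
raw_value() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# The two rows for attribute 195, copied from the smartctl output above.
before='195 Hardware_ECC_Recovered 0x001a 001 001 000 Old_age Always - 241822'
during='195 Hardware_ECC_Recovered 0x001a 001 001 000 Old_age Always - 243582'

# Difference between the two raw counters.
delta=$(( $(raw_value "$during") - $(raw_value "$before") ))
echo "Hardware_ECC_Recovered raw value rose by $delta"
```

That gives an increase of 1760 between the two readings.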
|
Could this be the cause of the stalls during compiles? If it is the
cause, is it possible for the kernel to detect such failures and report
them?
|
Thanks to you and everyone else for your ideas and suggestions.

Regards,
Carlos