Re: [gentoo-user] System reboot - gentoo-user - Gentoo Mailing List Archives

From:	Dale <rdalek1967@×××××.com>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] System reboot
Date:	Sun, 16 Dec 2018 15:18:11
Message-Id:	`5fb03d6f-7714-6fbf-79b3-cb1cfa113379@gmail.com`
In Reply to:	Re: [gentoo-user] System reboot by Rich Freeman

Rich Freeman wrote:
> On Sat, Dec 15, 2018 at 10:54 PM Dale <rdalek1967@×××××.com> wrote:
>> I checked the messages log.  Before with the memory-hogging Dolphin it
>> had logged the problem.  Today, it shows this:
>>
>> Dec 15 20:40:01 fireball CROND[30668]: (root) CMD (/usr/lib64/sa/sa1 1 1)
>> Dec 15 20:50:01 fireball CROND[1532]: (root) CMD (/usr/lib64/sa/sa1 1 1)
>> Dec 15 21:00:01 fireball CROND[5513]: (root) CMD (/usr/lib64/sa/sa1 1 1)
>> Dec 15 21:01:01 fireball CROND[5718]: (root) CMD (run-parts /etc/cron.hourly)
>> Dec 15 21:08:34 fireball syslog-ng[4370]: syslog-ng starting up; version='3.17.2'
>> Dec 15 21:08:34 fireball /usr/sbin/gpm[4400]: *** info [daemon/startup.c(136)]:
>> Dec 15 21:08:34 fireball /usr/sbin/gpm[4400]: Started gpm successfully. Entered daemon mode.
>>
>> As you can see, it went from running a normal cron job to me booting
>> back up.  I don't see any error at all.  Not even one electron.
> This is pretty typical if you aren't taking special steps to log this
> sort of thing.  There are a few ways the kernel can crash:
>
> 1.  OOPS/BUG - these are semi-recoverable errors.  I believe they can
> get logged unless they occur in a manner that disrupts your userspace
> logger, vfs, filesystem, or disk.  If the error happens in one of
> those subsystems then your filesystems will stop syncing and it won't
> be logged normally.
>
> 2.  PANIC - these are unrecoverable and are NOT logged.  When the
> kernel PANICs it halts all disk IO and just about everything else.
> This is to prevent damage to anything already written on disk.  You
> don't want a corrupt OS trying to write to your disk - that makes a
> bad situation MUCH worse.  It would be like sending a drunk surgeon
> into the operating room to fix up a trauma patient.
>
> 3.  Hardware reset.  This isn't a kernel issue but I'll toss it in.
> If your CPU gets a reset signal it forgets it was ever running Linux
> and starts executing the firmware as if it had been freshly powered
> on.  There is no opportunity to capture anything.  The only way to log
> something like this is hardware-level monitoring.
>
> Issues #1-2 CAN be logged, but not conventionally.  There are a few
> routes for this:
>
> 1.  Remote console logging.  Serial and network are the two main
> options for this.  If you have a hardware serial port you can capture
> its output and any kernel errors will be output to it (just the
> text/backtrace/etc).  A network console is very easy to set up if you
> have a remote host that can run netcat on the same LAN:
> https://www.kernel.org/doc/Documentation/networking/netconsole.txt
>
> 2.  Recovery kernel.  Gentoo doesn't have tooling for this but you can
> follow https://wiki.gentoo.org/wiki/Kernel_Crash_Dumps .  Disclaimer -
> I haven't done this in ages so it could be dated in parts.  If the
> kernel panics then it will run the recovery kernel, which boots in a
> clean state and dumps the old kernel's RAM to disk for subsequent
> analysis.
>
> #1 gets the job done most of the time, but #2 is more thorough.  If
> you have a host that is tending to reset you should consider network
> logging as a starting point - it is easy to set up.
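
For the network console route, the receiving end does not have to be
netcat.  Below is a minimal Python sketch of a receiver.  It assumes the
crashing machine's netconsole has already been pointed at this host on
UDP port 6666 (I believe that is the default target port in the
netconsole document linked above, but check your own settings).  The
names format_record and listen, and the log file name, are made up for
illustration.

```python
import socket
from datetime import datetime


def format_record(payload: bytes, sender_ip: str, now: datetime) -> str:
    """Turn one netconsole datagram into a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}"


def listen(bind_addr: str = "0.0.0.0", port: int = 6666,
           log_path: str = "netconsole.log") -> None:
    """Append every datagram received on the UDP port to log_path.

    The port and file name are assumptions; match them to whatever the
    sending machine's netconsole is configured to use.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        with open(log_path, "a", encoding="utf-8") as log:
            while True:
                payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
                log.write(format_record(payload, sender_ip, datetime.now()) + "\n")
                # Flush after each message so the file is current even if
                # this host is interrupted as well.
                log.flush()
```

Calling listen() blocks and keeps appending until it is interrupted.
Timestamps come from the receiving host's clock, so they still get
written when the sender has stopped touching its own disk.
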

> I'm not sure why your UPS display is coming on.  It might be some kind
> of spurious data on the USB port if it is connected.  It might be a
> result of something the PC is doing.  It is also possible it is due to
> a brownout or other power issues going into your house, but if your
> UPS is in good shape and not overloaded then it should be shielding
> your PC from the effects of these.  A PC power supply issue sounds
> plausible.  I've had my CP UPS flicker its display and a light might
> flicker a bit at the same time, but the PC was unaffected.  I'll also
> note that these kinds of transient issues are often mitigated by
> having a good quality PC power supply that is not overloaded, and that
> this probably also helps with any latency in the UPS switching in.  If
> your PC power supply is strained to the point of breaking then any
> transients in the input supply are going to get through to the output
> rails.  This is one of those areas where spending an extra $30 on your
> build can make a significant difference.

I've seen kernel panics in the past.  Keep in mind, different panics can
behave differently but in the past, I got a console-type screen with
some weird error messages.  Those are what I usually see.  This tho, it
was as if the power button was pushed and held down.  The system
didn't reboot, it powered off.  I was asleep but it did beep, which is
what woke me up.  Generally in the past when I've seen something like
this, it either goes to the console and sits there until I hit reset or
just reboots.  This is the first time I've seen my system power off like
this.  This is what has me curious.

My BIOS is set to remain off after a power failure, although a power
glitch or even a short-term outage shouldn't reach it because of the
UPS.  However, if power fails and it does shut off, it is set to remain
off.  This is what makes me think power supply.  It's not that old but
that doesn't rule it out either.  I've read about bad out-of-the-box
units before.  Thing is, that is what it sort of acts like.

My power supply is a 650 watt unit.  It can power over twice what I
pull.  When I built this rig, it could power up to three times what I
pull.  Keep in mind, I'm measuring not only the puter but also the
monitor, speakers, modem and router as well.  That wattage is from the
UPS itself.  I try to allow for a lot of headroom power-wise to
compensate for that turn-on surge when several drives and fans are
spinning up.  I've got five hard drives, three 230mm fans and a 140mm
fan just for the case.  Then comes the CPU etc.  I haven't calculated
the surge or anything but I figure it is a good bit more than what it
pulls when already running.

It may be that this has to happen a few times to see if anything can be
narrowed down.  Maybe it will do it while I'm sitting at it next time
and I can see from start to finish what it is doing.  May help, may
not.  One reason for the thread: tips on what to look for.  A good tip
could come in handy.  ;-)  Plus, I thought there may be another log I
wasn't aware of to look at.
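
On what to look for in the existing log: one thing a script can do is
find every place where syslog-ng starts up and report the last entry
logged before it, which brackets the time of each unclean shutdown.  A
rough Python sketch follows.  It assumes the traditional syslog
timestamp format shown in the excerpt above and English month names,
and the year has to be supplied by hand since the log does not record
it.  find_boot_gaps and parse_stamp are hypothetical helper names.

```python
from datetime import datetime

BOOT_MARKER = "syslog-ng starting up"


def parse_stamp(line: str, year: int) -> datetime:
    """Parse the leading 'Dec 15 20:40:01' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_boot_gaps(lines, year):
    """Yield (last_entry_before_boot, boot_time, gap) for each boot marker."""
    previous = None
    for line in lines:
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # wrapped or blank lines carry no timestamp
        if BOOT_MARKER in line and previous is not None:
            yield previous, stamp, stamp - previous
        previous = stamp
```

Fed the excerpt above, it would report a last entry at 21:01:01 and a
restart at 21:08:34.  A silent gap like that only shows that nothing
was logged; it cannot tell a panic from a reset or a power loss.
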

Thanks.  Gives me things to think on.

Dale

:-)  :-)

