Re: [gentoo-user] Re: "No CUDA device found" with nvidia-drivers newer than nvidia-drivers-396.24-r1( - gentoo-user

From:	"Corentin “Nado” Pazdera" <nado@××××××××××.be>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] Re: "No CUDA device found" with nvidia-drivers newer than nvidia-drivers-396.24-r1(
Date:	Wed, 15 Aug 2018 16:05:56
Message-Id:	`c20401b74fd4024d10620af3b2568117@troglodyte.be`
In Reply to:	Re: [gentoo-user] Re: "No CUDA device found" with nvidia-drivers newer than nvidia-drivers-396.24-r1( by tuxic@posteo.de

August 15, 2018 5:45 PM, tuxic@××××××.de wrote:

> I put nvidia-uvm explicitly into /etc/conf.d/modules - which was not
> necessary ever before... and it shows the same problems: No cuda
> devices.
>
> I think I will dream this night of no cuda devices... ;(
>
> On 08/15 05:11, tuxic@××××××.de wrote:
>
>> On 08/15 02:32, Corentin “Nado” Pazdera wrote:
>> August 15, 2018 2:59 PM, tuxic@××××××.de wrote:
>
> Yes I did reboot the system. In my initial mail I mentioned a tool
> called CUDA-Z and Blender, which both report a missing CUDA device.
>> Ok, so you do not have a specific error which might have been thrown by the module?
>> Other ideas, check dev-util/nvidia-cuda-toolkit version and double check nvidia/nvidia_uvm with
>> modinfo to ensure they are installed and loaded correctly with the right version?
>> Could you also run /opt/cuda/extras/demo_suite/deviceQuery (from nvidia-cuda-toolkit) and show its
>> output?
>>
>> My installation works, so at least we know their version is not completely broken...
>> Driver version: 396.51
>> Cuda version: 9.2.88
>>
>> --
>> Corentin “Nado” Pazdera
>>
>> I compiled the new version of the driver again and rebooted the
>> system.
>>
>> # dmesg | grep -i nvidia:
>>
>> [ 11.375631] nvidia_drm: module license 'MIT' taints kernel.
>> [ 12.313260] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
>> [ 12.313586] nvidia 0000:07:00.0: vgaarb: changed VGA decodes:
>> olddecodes=io+mem,decodes=none:owns=io+mem
>> [ 12.313691] nvidia 0000:02:00.0: enabling device (0000 -> 0003)
>> [ 12.313737] nvidia 0000:02:00.0: vgaarb: changed VGA decodes:
>> olddecodes=io+mem,decodes=none:owns=none
>> [ 12.313826] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 396.51 Tue Jul 31 10:43:06 PDT 2018
>> (using threaded interrupts)
>> [ 12.491106] input: HDA NVidia HDMI as
>> /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input9
>> [ 12.492291] input: HDA NVidia HDMI as
>> /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input10
>> [ 12.493772] input: HDA NVidia HDMI as
>> /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input11
>> [ 12.494605] input: HDA NVidia HDMI as
>> /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input12
>> [ 13.963644] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
>> [ 34.236553] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
>> [ 34.516495] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 396.51
>> Tue Jul 31 14:52:09 PDT 2018
>>
>> # modprobe -a nvidia-uvm
>>
>> # dmesg | grep uvm
>>
>> [ 209.441956] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
>>
>> # /opt/cuda/extras/demo_suite/deviceQuery
>> /opt/cuda/extras/demo_suite/deviceQuery Starting...
>>
>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>
>> cudaGetDeviceCount returned 30
>> -> unknown error
>> Result = FAIL
>> [1] 5086 exit 1 /opt/cuda/extras/demo_suite/deviceQuery
>>
>> CUDA-Z shows also "no CUDA device"
>>
>> # modinfo nvidia-uvm
>> filename: /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> supported: external
>> license: MIT
>> depends: nvidia
>> name: nvidia_uvm
>> vermagic: 4.18.0-RT SMP preempt mod_unload
>> parm: uvm_perf_prefetch_enable:uint
>> parm: uvm_perf_prefetch_threshold:uint
>> parm: uvm_perf_prefetch_min_faults:uint
>> parm: uvm_perf_thrashing_enable:uint
>> parm: uvm_perf_thrashing_threshold:uint
>> parm: uvm_perf_thrashing_pin_threshold:uint
>> parm: uvm_perf_thrashing_lapse_usec:uint
>> parm: uvm_perf_thrashing_nap_usec:uint
>> parm: uvm_perf_thrashing_epoch_msec:uint
>> parm: uvm_perf_thrashing_max_resets:uint
>> parm: uvm_perf_thrashing_pin_msec:uint
>> parm: uvm_perf_map_remote_on_native_atomics_fault:uint
>> parm: uvm_hmm:Enable (1) or disable (0) HMM mode. Default: 0. Ignored if CONFIG_HMM is not set, or
>> if NEXT settings conflict with HMM. (int)
>> parm: uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
>> parm: uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes
>> allocated and freed, 2 = per-allocation origin tracking. (int)
>> parm: uvm_force_prefetch_fault_support:uint
>> parm: uvm_debug_enable_push_desc:Enable push description tracking (int)
>> parm: uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid,
>> sys. (charp)
>> parm: uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger
>> migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
>> parm: uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger
>> migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
>> parm: uvm_perf_access_counter_batch_count:uint
>> parm: uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each
>> counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
>> parm: uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a
>> notification.Valid values: [1, 65535] (uint)
>> parm: uvm_perf_reenable_prefetch_faults_lapse_msec:uint
>> parm: uvm_perf_fault_batch_count:uint
>> parm: uvm_perf_fault_replay_policy:uint
>> parm: uvm_perf_fault_replay_update_put_ratio:uint
>> parm: uvm_perf_fault_max_batches_per_service:uint
>> parm: uvm_perf_fault_max_throttle_per_service:uint
>> parm: uvm_perf_fault_coalesce:uint
>> parm: uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0.
>> (int)
>> parm: uvm_perf_map_remote_on_eviction:int
>> parm: uvm_channel_num_gpfifo_entries:uint
>> parm: uvm_channel_gpfifo_loc:charp
>> parm: uvm_channel_gpput_loc:charp
>> parm: uvm_channel_pushbuffer_loc:charp
>> parm: uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
>> parm: uvm8_ats_mode:Override the default ATS (Address Translation Services) UVM mode by disabling
>> (0) or enabling (1) (int)
>> parm: uvm_driver_mode:Set the uvm kernel driver mode. Choices include: 8 (charp)
>> parm: uvm_debug_prints:Enable uvm debug prints. (int)
>> parm: uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
>>
>> # ls -l /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> -rw-r--r-- 1 root root 1405808 Aug 15 16:49 /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> (just installed minutes before)
>>
>> # uname -a
>> Linux solfire 4.18.0-RT #1 SMP PREEMPT Mon Aug 13 05:15:26 CEST 2018 x86_64 AMD Phenom(tm) II X6
>> 1090T Processor AuthenticAMD GNU/Linux
>> (the kernel version matches)
>>
>> # eix nvidia-cuda-toolkit
>>
>> [I] dev-util/nvidia-cuda-toolkit
>> Available versions: [M](~)6.5.14(0/6.5.14) [M](~)6.5.19-r1(0/6.5.19) [M](~)7.5.18-r2(0/7.5.18)
>> [M](~)8.0.44(0/8.0.44) [M](~)8.0.61(0/8.0.61) (~)9.0.176(0/9.0.176) (~)9.1.85(0/9.1.85)
>> (~)9.2.88(0/9.2.88) {debugger doc eclipse profiler}
>> Installed versions: 9.2.88(0/9.2.88)(06:31:32 PM 08/14/2018)(-debugger -doc -eclipse -profiler)
>> Homepage: https://developer.nvidia.com/cuda-zone
>> Description: NVIDIA CUDA Toolkit (compiler and friends)
>>
>> It becomes even more weird...

It is weird indeed... I'm running kernel 4.15.16 and I needed to disable MSI in
/etc/modprobe.d/nvidia.conf with " NVreg_EnableMSI=0" appended to the line "options nvidia ...".
That's the main difference I see between our setups on the software side.
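
A quick way to see whether that option is already present on a given system is sketched below. It only reads the config file and changes nothing; the helper name msi_disabled_in is made up for this example, and the other options on the "options nvidia" line are left as they are.

```shell
# Hypothetical helper: succeed if the "options nvidia" line of the given
# modprobe config file already contains NVreg_EnableMSI=0.
msi_disabled_in() {
    grep -E '^options[[:space:]]+nvidia[[:space:]]' "$1" 2>/dev/null \
        | grep -q 'NVreg_EnableMSI=0'
}

# Path taken from the mail above; adjust if your config lives elsewhere.
conf=/etc/modprobe.d/nvidia.conf
if msi_disabled_in "$conf"; then
    echo "MSI is disabled in $conf"
else
    echo "NVreg_EnableMSI=0 not found in $conf"
fi
```

After editing the file, the nvidia modules have to be reloaded (or the machine rebooted) before the option takes effect.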

According to Google, this kind of error is usually due to a failed module reload (without a
reboot) or to a version mismatch, but I can't find any mismatch in the info you gave us.
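
One way to double check the mismatch case is to compare the version of the nvidia module that is currently loaded with the version of the module file installed for the running kernel. This is only a sketch: it assumes the driver exposes /sys/module/nvidia/version and that modinfo is in the PATH, and compare_versions is a helper name invented here.

```shell
# Hypothetical helper: compare two driver version strings.
# Prints "match", "mismatch", or "unknown" if either one is empty.
compare_versions() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Version of the module loaded right now (empty if it is not loaded).
loaded=$(cat /sys/module/nvidia/version 2>/dev/null || true)
# Version of the module file installed for the running kernel.
ondisk=$(modinfo -F version nvidia 2>/dev/null || true)

echo "loaded=$loaded ondisk=$ondisk: $(compare_versions "$loaded" "$ondisk")"
```

A "mismatch" here would point at a stale module still loaded from before the upgrade; "match" only rules out that one cause.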

Good luck

--
Corentin “Nado” Pazdera

Gentoo Archives: gentoo-user