Gentoo Archives: gentoo-user

From: "Corentin “Nado” Pazdera" <nado@××××××××××.be>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Re: "No CUDA device found" with nvidia-drivers newer than nvidia-drivers-396.24-r1(
Date: Wed, 15 Aug 2018 16:05:56
Message-Id: c20401b74fd4024d10620af3b2568117@troglodyte.be
In Reply to: Re: [gentoo-user] Re: "No CUDA device found" with nvidia-drivers newer than nvidia-drivers-396.24-r1( by tuxic@posteo.de
1 August 15, 2018 5:45 PM, tuxic@××××××.de wrote:
2
3 > I put nvidia-uvm explicitly into /etc/conf.d/modules - which was not
4 > necessary ever before....and it shows the same problems: No cuda
5 > devices.
6 >
7 > I think I will dream this night of no cuda devices... ;(
8 >
9 > On 08/15 05:11, tuxic@××××××.de wrote:
10 >
11 >> On 08/15 02:32, Corentin “Nado” Pazdera wrote:
12 >> August 15, 2018 2:59 PM, tuxic@××××××.de wrote:
13 >
14 > Yes, I did reboot the system. In my initial mail I mentioned a tool
15 > called CUDA-Z and Blender, which both report a missing CUDA device.
16 >> Ok, so you do not have a specific error which might have been thrown by the module?
17 >> Other ideas, check dev-util/nvidia-cuda-toolkit version and double check nvidia/nvidia_uvm with
18 >> modinfo to ensure they are installed and loaded correctly with the right version?
19 >> Could you also run /opt/cuda/extras/demo_suite/deviceQuery (from nvidia-cuda-toolkit) and show its
20 >> output?
21 >>
22 >> My installation works, so at least we know their version is not completely broken...
23 >> Driver version: 396.51
24 >> Cuda version: 9.2.88
25 >>
26 >> --
27 >> Corentin “Nado” Pazdera
28 >>
29 >> I compiled the new version of the driver again and rebooted the
30 >> system.
31 >>
32 >> # dmesg | grep -i nvidia:
33 >>
34 >> [ 11.375631] nvidia_drm: module license 'MIT' taints kernel.
35 >> [ 12.313260] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
36 >> [ 12.313586] nvidia 0000:07:00.0: vgaarb: changed VGA decodes:
37 >> olddecodes=io+mem,decodes=none:owns=io+mem
38 >> [ 12.313691] nvidia 0000:02:00.0: enabling device (0000 -> 0003)
39 >> [ 12.313737] nvidia 0000:02:00.0: vgaarb: changed VGA decodes:
40 >> olddecodes=io+mem,decodes=none:owns=none
41 >> [ 12.313826] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 396.51 Tue Jul 31 10:43:06 PDT 2018
42 >> (using threaded interrupts)
43 >> [ 12.491106] input: HDA NVidia HDMI as
44 >> /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input9
45 >> [ 12.492291] input: HDA NVidia HDMI as
46 >> /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input10
47 >> [ 12.493772] input: HDA NVidia HDMI as
48 >> /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input11
49 >> [ 12.494605] input: HDA NVidia HDMI as
50 >> /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input12
51 >> [ 13.963644] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
52 >> [ 34.236553] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
53 >> [ 34.516495] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 396.51
54 >> Tue Jul 31 14:52:09 PDT 2018
55 >>
56 >> # modprobe -a nvidia-uvm
57 >>
58 >> # dmesg | grep uvm
59 >>
60 >> [ 209.441956] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
61 >>
62 >> # /opt/cuda/extras/demo_suite/deviceQuery
63 >> /opt/cuda/extras/demo_suite/deviceQuery Starting...
64 >>
65 >> CUDA Device Query (Runtime API) version (CUDART static linking)
66 >>
67 >> cudaGetDeviceCount returned 30
68 >> -> unknown error
69 >> Result = FAIL
70 >> [1] 5086 exit 1 /opt/cuda/extras/demo_suite/deviceQuery
71 >>
72 >> CUDA-Z shows also "no CUDA device"
73 >>
74 >> # modinfo nvidia-uvm
75 >> filename: /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
76 >> supported: external
77 >> license: MIT
78 >> depends: nvidia
79 >> name: nvidia_uvm
80 >> vermagic: 4.18.0-RT SMP preempt mod_unload
81 >> parm: uvm_perf_prefetch_enable:uint
82 >> parm: uvm_perf_prefetch_threshold:uint
83 >> parm: uvm_perf_prefetch_min_faults:uint
84 >> parm: uvm_perf_thrashing_enable:uint
85 >> parm: uvm_perf_thrashing_threshold:uint
86 >> parm: uvm_perf_thrashing_pin_threshold:uint
87 >> parm: uvm_perf_thrashing_lapse_usec:uint
88 >> parm: uvm_perf_thrashing_nap_usec:uint
89 >> parm: uvm_perf_thrashing_epoch_msec:uint
90 >> parm: uvm_perf_thrashing_max_resets:uint
91 >> parm: uvm_perf_thrashing_pin_msec:uint
92 >> parm: uvm_perf_map_remote_on_native_atomics_fault:uint
93 >> parm: uvm_hmm:Enable (1) or disable (0) HMM mode. Default: 0. Ignored if CONFIG_HMM is not set, or
94 >> if NEXT settings conflict with HMM. (int)
95 >> parm: uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
96 >> parm: uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes
97 >> allocated and freed, 2 = per-allocation origin tracking. (int)
98 >> parm: uvm_force_prefetch_fault_support:uint
99 >> parm: uvm_debug_enable_push_desc:Enable push description tracking (int)
100 >> parm: uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid,
101 >> sys. (charp)
102 >> parm: uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger
103 >> migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
104 >> parm: uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger
105 >> migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
106 >> parm: uvm_perf_access_counter_batch_count:uint
107 >> parm: uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each
108 >> counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
109 >> parm: uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a
110 >> notification.Valid values: [1, 65535] (uint)
111 >> parm: uvm_perf_reenable_prefetch_faults_lapse_msec:uint
112 >> parm: uvm_perf_fault_batch_count:uint
113 >> parm: uvm_perf_fault_replay_policy:uint
114 >> parm: uvm_perf_fault_replay_update_put_ratio:uint
115 >> parm: uvm_perf_fault_max_batches_per_service:uint
116 >> parm: uvm_perf_fault_max_throttle_per_service:uint
117 >> parm: uvm_perf_fault_coalesce:uint
118 >> parm: uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0.
119 >> (int)
120 >> parm: uvm_perf_map_remote_on_eviction:int
121 >> parm: uvm_channel_num_gpfifo_entries:uint
122 >> parm: uvm_channel_gpfifo_loc:charp
123 >> parm: uvm_channel_gpput_loc:charp
124 >> parm: uvm_channel_pushbuffer_loc:charp
125 >> parm: uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
126 >> parm: uvm8_ats_mode:Override the default ATS (Address Translation Services) UVM mode by disabling
127 >> (0) or enabling (1) (int)
128 >> parm: uvm_driver_mode:Set the uvm kernel driver mode. Choices include: 8 (charp)
129 >> parm: uvm_debug_prints:Enable uvm debug prints. (int)
130 >> parm: uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
131 >>
132 >> # ls -l /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
133 >> -rw-r--r-- 1 root root 1405808 Aug 15 16:49 /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
134 >> (just installed minutes before)
135 >>
136 >> # uname -a
137 >> Linux solfire 4.18.0-RT #1 SMP PREEMPT Mon Aug 13 05:15:26 CEST 2018 x86_64 AMD Phenom(tm) II X6
138 >> 1090T Processor AuthenticAMD GNU/Linux
139 >> (the kernel version matches)
140 >>
141 >> # eix nvidia-cuda-toolkit
142 >>
143 >> [I] dev-util/nvidia-cuda-toolkit
144 >> Available versions: [M](~)6.5.14(0/6.5.14) [M](~)6.5.19-r1(0/6.5.19) [M](~)7.5.18-r2(0/7.5.18)
145 >> [M](~)8.0.44(0/8.0.44) [M](~)8.0.61(0/8.0.61) (~)9.0.176(0/9.0.176) (~)9.1.85(0/9.1.85)
146 >> (~)9.2.88(0/9.2.88) {debugger doc eclipse profiler}
147 >> Installed versions: 9.2.88(0/9.2.88)(06:31:32 PM 08/14/2018)(-debugger -doc -eclipse -profiler)
148 >> Homepage: https://developer.nvidia.com/cuda-zone
149 >> Description: NVIDIA CUDA Toolkit (compiler and friends)
150 >>
151 >> It becomes even more weird...
152
153 It is weird indeed... I'm running kernel 4.15.16, and I needed to disable MSI in
154 /etc/modprobe.d/nvidia.conf by appending "NVreg_EnableMSI=0" to the line "options nvidia ...".
155 Those are the main differences I see between your setup and mine on the software side.
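
For reference, a minimal sketch of what the relevant line in /etc/modprobe.d/nvidia.conf might look like. This is an assumption about the layout, not a copy of my file: any options already present on your "options nvidia" line should stay there, with NVreg_EnableMSI=0 added at the end. The module has to be reloaded (or the system rebooted) before the change takes effect.

```conf
# /etc/modprobe.d/nvidia.conf (illustrative fragment)
# Keep whatever options the existing line already carries and
# append NVreg_EnableMSI=0 to disable Message Signaled Interrupts.
options nvidia NVreg_EnableMSI=0
```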
156
157 According to Google, this kind of error is usually caused by a failed module reload (without a
158 reboot) or by a version mismatch, but I can't find any mismatch in the info you gave us.
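
One way to rule out the reload case is to compare the version of the nvidia module on disk with the version of the module that is currently loaded. The sketch below assumes that `modinfo -F version nvidia` reports the on-disk module and that the loaded module exposes its version in /sys/module/nvidia/version. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: detect a stale nvidia module that is still loaded after an upgrade.
# compare_versions and check_nvidia_versions are hypothetical helper names.

# Print "match" if both version strings are identical, "mismatch" otherwise.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Read the on-disk and loaded module versions and compare them.
# Assumes the nvidia module is installed and currently loaded.
check_nvidia_versions() {
    on_disk=$(modinfo -F version nvidia 2>/dev/null)
    loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
    echo "on disk: ${on_disk:-unknown}, loaded: ${loaded:-unknown}"
    compare_versions "$on_disk" "$loaded"
}
```

Running `check_nvidia_versions` as root after sourcing the file prints both versions followed by the verdict. A "mismatch" would suggest the old module is still loaded, whereas "match" only rules out this one cause.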
159
160 Good luck
161
162 --
163 Corentin “Nado” Pazdera