August 15, 2018 5:45 PM, tuxic@××××××.de wrote:

> I put nvidia-uvm explicitly into /etc/conf.d/modules - which was never
> necessary before... and it shows the same problem: no CUDA
> devices.
>
> I think I will dream this night of no CUDA devices... ;(
>
> On 08/15 05:11, tuxic@××××××.de wrote:
>
>> On 08/15 02:32, Corentin “Nado” Pazdera wrote:
>> August 15, 2018 2:59 PM, tuxic@××××××.de wrote:
>
> Yes, I did reboot the system. In my initial mail I mentioned a tool
> called CUDA-Z and Blender, which both report a missing CUDA device.
>> Ok, so you do not have a specific error which might have been thrown by the module?
>> Other ideas: check the dev-util/nvidia-cuda-toolkit version and double check nvidia/nvidia_uvm with
>> modinfo to ensure they are installed and loaded correctly with the right version?
>> Could you also run /opt/cuda/extras/demo_suite/deviceQuery (from nvidia-cuda-toolkit) and show its
>> output?
>>
>> My installation works, so at least we know their version is not completely broken...
>> Driver version: 396.51
>> Cuda version: 9.2.88
>>
>> --
>> Corentin “Nado” Pazdera
>>
>> I compiled the new version of the driver again and rebooted the
>> system.
>>
>> # dmesg | grep -i nvidia:
>>
>> [ 11.375631] nvidia_drm: module license 'MIT' taints kernel.
>> [ 12.313260] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
>> [ 12.313586] nvidia 0000:07:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
>> [ 12.313691] nvidia 0000:02:00.0: enabling device (0000 -> 0003)
>> [ 12.313737] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
>> [ 12.313826] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 396.51 Tue Jul 31 10:43:06 PDT 2018 (using threaded interrupts)
>> [ 12.491106] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input9
>> [ 12.492291] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:0b.0/0000:02:00.1/sound/card2/input10
>> [ 12.493772] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input11
>> [ 12.494605] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:02.0/0000:07:00.1/sound/card1/input12
>> [ 13.963644] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
>> [ 34.236553] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
>> [ 34.516495] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 396.51 Tue Jul 31 14:52:09 PDT 2018
>>
>> # modprobe -a nvidia-uvm
>>
>> # dmesg | grep uvm
>>
>> [ 209.441956] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
>>
>> # /opt/cuda/extras/demo_suite/deviceQuery
>> /opt/cuda/extras/demo_suite/deviceQuery Starting...
>>
>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>
>> cudaGetDeviceCount returned 30
>> -> unknown error
>> Result = FAIL
>> [1] 5086 exit 1 /opt/cuda/extras/demo_suite/deviceQuery
>>
>> CUDA-Z also shows "no CUDA device"
>>
>> # modinfo nvidia-uvm
>> filename: /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> supported: external
>> license: MIT
>> depends: nvidia
>> name: nvidia_uvm
>> vermagic: 4.18.0-RT SMP preempt mod_unload
>> parm: uvm_perf_prefetch_enable:uint
>> parm: uvm_perf_prefetch_threshold:uint
>> parm: uvm_perf_prefetch_min_faults:uint
>> parm: uvm_perf_thrashing_enable:uint
>> parm: uvm_perf_thrashing_threshold:uint
>> parm: uvm_perf_thrashing_pin_threshold:uint
>> parm: uvm_perf_thrashing_lapse_usec:uint
>> parm: uvm_perf_thrashing_nap_usec:uint
>> parm: uvm_perf_thrashing_epoch_msec:uint
>> parm: uvm_perf_thrashing_max_resets:uint
>> parm: uvm_perf_thrashing_pin_msec:uint
>> parm: uvm_perf_map_remote_on_native_atomics_fault:uint
>> parm: uvm_hmm:Enable (1) or disable (0) HMM mode. Default: 0. Ignored if CONFIG_HMM is not set, or if NEXT settings conflict with HMM. (int)
>> parm: uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
>> parm: uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
>> parm: uvm_force_prefetch_fault_support:uint
>> parm: uvm_debug_enable_push_desc:Enable push description tracking (int)
>> parm: uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
>> parm: uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
>> parm: uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
>> parm: uvm_perf_access_counter_batch_count:uint
>> parm: uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
>> parm: uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
>> parm: uvm_perf_reenable_prefetch_faults_lapse_msec:uint
>> parm: uvm_perf_fault_batch_count:uint
>> parm: uvm_perf_fault_replay_policy:uint
>> parm: uvm_perf_fault_replay_update_put_ratio:uint
>> parm: uvm_perf_fault_max_batches_per_service:uint
>> parm: uvm_perf_fault_max_throttle_per_service:uint
>> parm: uvm_perf_fault_coalesce:uint
>> parm: uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
>> parm: uvm_perf_map_remote_on_eviction:int
>> parm: uvm_channel_num_gpfifo_entries:uint
>> parm: uvm_channel_gpfifo_loc:charp
>> parm: uvm_channel_gpput_loc:charp
>> parm: uvm_channel_pushbuffer_loc:charp
>> parm: uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
>> parm: uvm8_ats_mode:Override the default ATS (Address Translation Services) UVM mode by disabling (0) or enabling (1) (int)
>> parm: uvm_driver_mode:Set the uvm kernel driver mode. Choices include: 8 (charp)
>> parm: uvm_debug_prints:Enable uvm debug prints. (int)
>> parm: uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
>>
>> # ls -l /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> -rw-r--r-- 1 root root 1405808 Aug 15 16:49 /lib/modules/4.18.0-RT/video/nvidia-uvm.ko
>> (just installed minutes before)
>>
>> # uname -a
>> Linux solfire 4.18.0-RT #1 SMP PREEMPT Mon Aug 13 05:15:26 CEST 2018 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux
>> (the kernel version matches)
>>
>> # eix nvidia-cuda-toolkit
>>
>> [I] dev-util/nvidia-cuda-toolkit
>> Available versions: [M](~)6.5.14(0/6.5.14) [M](~)6.5.19-r1(0/6.5.19) [M](~)7.5.18-r2(0/7.5.18) [M](~)8.0.44(0/8.0.44) [M](~)8.0.61(0/8.0.61) (~)9.0.176(0/9.0.176) (~)9.1.85(0/9.1.85) (~)9.2.88(0/9.2.88) {debugger doc eclipse profiler}
>> Installed versions: 9.2.88(0/9.2.88)(06:31:32 PM 08/14/2018)(-debugger -doc -eclipse -profiler)
>> Homepage: https://developer.nvidia.com/cuda-zone
>> Description: NVIDIA CUDA Toolkit (compiler and friends)
>>
>> It becomes even more weird...
|
It is weird indeed... I'm running kernel 4.15.16, and I needed to disable MSI in
/etc/modprobe.d/nvidia.conf by appending "NVreg_EnableMSI=0" to the line "options nvidia ...".
That's the main difference I see between our setups on the software side.
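
For illustration only, a minimal sketch of what such a line could look like. It assumes NVreg_EnableMSI is the only option being set; the other options on the original "options nvidia ..." line were not given and are not reproduced here.

```
# /etc/modprobe.d/nvidia.conf
# Assumption: no other nvidia options are in use; if there are, keep them on the same line.
options nvidia NVreg_EnableMSI=0
```

The nvidia module has to be reloaded (or the machine rebooted) before a change in this file takes effect.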
|
This kind of error is usually due to a failed module reload (without rebooting) or a version
mismatch, according to Google, but I can't find any mismatch in the info you gave us.
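
One way to make the version comparison less error-prone is to script it. The following is only a sketch: the helper names nvrm_version and check_match are made up for this example, and it assumes the NVRM line in dmesg has the same shape as the one quoted above and that the nvidia module exposes a "version" field to modinfo.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg-style text on stdin and print the driver
# version from the first "NVRM: loading NVIDIA UNIX ... Kernel Module <ver>" line.
nvrm_version() {
    sed -n 's/.*NVRM: loading NVIDIA UNIX [^ ]* Kernel Module  *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: compare the loaded version ($1) with the on-disk one ($2).
# Prints a one-line verdict and returns non-zero on a mismatch.
check_match() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "MISMATCH: loaded=$1 on-disk=$2"
        return 1
    fi
}

# On a live system (assumption: dmesg is readable and modinfo finds the module):
#   loaded=$(dmesg | nvrm_version)
#   ondisk=$(modinfo -F version nvidia)
#   check_match "$loaded" "$ondisk"
```

A mismatch here would point to a stale module still being loaded after a driver update, which is the "failed reload" case.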
|
Good luck

--
Corentin “Nado” Pazdera |