Hi all -

I'm new to InfiniBand and still getting my feet wet. I administer a very
small cluster of 5 nodes and have recently installed InfiniBand HCAs. I
have the InfiniBand modules built into the kernel, and I am using the
openib-userspace package from the gentoo-science overlay.

The strange thing about my situation is that I have InfiniBand working with
Open MPI on 4 of my 5 nodes, but the 5th one is a mystery.

All 4 working nodes have a /dev/infiniband directory that looks roughly like
this:

crw-rw---- 1 root root 231, 64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231, 0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0

The 5th node has no such directory, which could point to the problem. It
isn't the whole problem, though: I tried creating those device nodes by hand
to match, and it didn't help. I'm just not sure what the difference is,
because I installed all the machines the same way, they all have the same
hardware, and they are all running the same kernel.

All 5 nodes have the same contents in the /sys/class/infiniband directory.
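
One way to compare the nodes more precisely might be a small script along
these lines. It is only a sketch: it assumes the kernel advertises each
character device's major:minor pair in a "dev" file under
/sys/class/infiniband_verbs and /sys/class/infiniband_mad, and the function
names (expected_nodes, actual_nodes, compare) are made up for illustration.

```python
import os
import stat

# Assumption: each device under these sysfs classes has a "dev" file
# holding "major:minor" (for example "231:192" for uverbs0).
SYSFS_CLASSES = ("infiniband_verbs", "infiniband_mad")


def expected_nodes(sysfs_root="/sys"):
    """Map device name -> (major, minor) as advertised in sysfs."""
    found = {}
    for cls in SYSFS_CLASSES:
        cls_dir = os.path.join(sysfs_root, "class", cls)
        if not os.path.isdir(cls_dir):
            continue
        for name in sorted(os.listdir(cls_dir)):
            dev_file = os.path.join(cls_dir, name, "dev")
            if os.path.isfile(dev_file):
                with open(dev_file) as fh:
                    major, minor = fh.read().strip().split(":")
                found[name] = (int(major), int(minor))
    return found


def actual_nodes(dev_dir="/dev/infiniband"):
    """Map device name -> (major, minor) for character devices in dev_dir."""
    found = {}
    if not os.path.isdir(dev_dir):
        return found
    for name in sorted(os.listdir(dev_dir)):
        st = os.lstat(os.path.join(dev_dir, name))
        if stat.S_ISCHR(st.st_mode):
            found[name] = (os.major(st.st_rdev), os.minor(st.st_rdev))
    return found


def compare(expected, actual):
    """Return {name: (expected, actual)} for missing or mismatched nodes."""
    return {
        name: (numbers, actual.get(name))
        for name, numbers in expected.items()
        if actual.get(name) != numbers
    }
```

Calling compare(expected_nodes(), actual_nodes()) on a working node and on
the 5th node would show whether hand-made device nodes carry the numbers the
kernel actually expects. An empty result means every advertised device has a
matching node.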

Here's the mpirun I am trying:

mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining
device context for mthca0 errno says No such file or directory

--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
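
Since the second warning mentions active ports, the port state can also be
read straight from sysfs, independently of Open MPI. The sketch below
assumes each port exposes a "state" file under
/sys/class/infiniband/<hca>/ports/<port>/ containing a line such as
"4: ACTIVE"; the helper names port_states and active_ports are hypothetical.

```python
import os


def port_states(sysfs_root="/sys"):
    """Map (hca, port) -> raw state string read from sysfs."""
    states = {}
    ib_dir = os.path.join(sysfs_root, "class", "infiniband")
    if not os.path.isdir(ib_dir):
        return states
    for hca in sorted(os.listdir(ib_dir)):
        ports_dir = os.path.join(ib_dir, hca, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            state_file = os.path.join(ports_dir, port, "state")
            if os.path.isfile(state_file):
                with open(state_file) as fh:
                    states[(hca, port)] = fh.read().strip()
    return states


def active_ports(states):
    """Return the (hca, port) keys whose state string ends in ACTIVE."""
    return [key for key, state in states.items() if state.endswith("ACTIVE")]
```

If active_ports(port_states()) lists a port on burn-3, the cabling and
subnet manager are probably fine, and the warning may just be a side effect
of the failed device open rather than a separate fault.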

Any help would be appreciated! Thanks.

Brian