List Archive: gentoo-cluster
Hi all -

I'm new to InfiniBand and still getting my feet wet. I am administering a very small cluster of 5 nodes and have recently installed InfiniBand HCAs. I have the InfiniBand modules built into the kernel, and I am using the openib-userspace package from the gentoo-science overlay.

The strange thing about my situation is that I have InfiniBand working with Open MPI on 4 of my 5 nodes, but the 5th one is a mystery.

All 4 working nodes have a /dev/infiniband directory that looks roughly like this:

crw-rw---- 1 root root 231, 64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231, 0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0

But the 5th node doesn't have one, which could point to the problem. It isn't the whole problem, though: I tried creating those device nodes by hand to match, and it didn't help. I'm just not sure what the difference is, because I installed all the nodes the same way, they all have the same hardware, and they are all running the same kernel.
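
The comparison between the device nodes and what the kernel advertises can be scripted so the same check runs on every node. The following is only a minimal sketch, not a known fix: it assumes the kernel exposes each device's major:minor pair in a sysfs file named dev under /sys/class/infiniband_verbs, /sys/class/infiniband_mad and /sys/class/infiniband_cm (those class names are an assumption and may differ between kernel versions), and the function names are made up for illustration.

```python
import os
import stat
from pathlib import Path

# Assumption: these sysfs class directories hold one subdirectory per
# userspace device (uverbs0, umad0, issm0, ucm0), each with a "dev" file.
SYSFS_CLASSES = ("infiniband_verbs", "infiniband_mad", "infiniband_cm")


def parse_dev_numbers(text):
    """Turn a sysfs 'dev' string such as '231:192' into (231, 192)."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def check_node(dev_path, expected):
    """Return a short status string for one device node."""
    try:
        st = os.stat(dev_path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISCHR(st.st_mode):
        return "not a character device"
    actual = (os.major(st.st_rdev), os.minor(st.st_rdev))
    return "ok" if actual == expected else f"mismatch: node is {actual}"


def report(sys_root="/sys/class", dev_root="/dev/infiniband"):
    """Compare every sysfs-advertised device against its node in dev_root."""
    results = {}
    for cls in SYSFS_CLASSES:
        # A class directory that does not exist simply yields no entries.
        for dev_file in sorted(Path(sys_root, cls).glob("*/dev")):
            name = dev_file.parent.name
            expected = parse_dev_numbers(dev_file.read_text())
            results[name] = (expected, check_node(Path(dev_root, name), expected))
    return results


if __name__ == "__main__":
    for name, (expected, status) in report().items():
        print(f"{name}: expected {expected[0]}:{expected[1]} -> {status}")
```

Running it on a working node and on the failing node and comparing the two outputs shows whether a hand-made node carries different major/minor numbers than the kernel expects, or whether the kernel on the failing node advertises no userspace devices at all.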

All 5 nodes have the same thing in the /sys/class/infiniband directory.

Here's the mpirun I am trying:

mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining device context for mthca0 errno says No such file or directory

--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------

Any help would be appreciated! Thanks.

Brian