Gentoo Archives: gentoo-cluster

From: Brian Budge <brian.budge@×××××.com>
To: gentoo-cluster@l.g.o
Subject: [gentoo-cluster] openib, no /dev/infiniband
Date: Wed, 02 Jan 2008 21:41:20
Message-Id: 5b7094580801021339n22db7c35y8580642c784d2c17@mail.gmail.com
Hi all -

I'm new to InfiniBand and still getting my feet wet. I administer a very
small cluster of 5 nodes and have recently installed InfiniBand HCAs. I
have the InfiniBand modules built into the kernel, and I am using the
openib-userspace package from the gentoo-science overlay.

The strange thing about my situation is that InfiniBand works with
Open MPI on 4 of my 5 nodes, but the 5th one is a mystery.

All 4 working nodes have a /dev/infiniband directory that looks roughly like
this:

crw-rw---- 1 root root 231,  64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231,   0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0

But the 5th node doesn't, which could point to the problem. It isn't the
whole problem, though: I tried creating those device nodes by hand to match,
and it didn't help. I'm just not sure what the difference is, because I
installed all the nodes the same way, they all have the same hardware, and
they are all running the same kernel.
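
One way to double-check hand-made nodes is to compare each entry's type and
device numbers against the listing from the working nodes. The following is
only a minimal Python sketch: it assumes the major/minor pairs shown above
are also the right ones on the 5th node, and the names EXPECTED, check_node,
and check_dir are made up for illustration.

```python
import os
import stat

# (major, minor) pairs copied from the listing on the working nodes.
# Assumption: the 5th node should use the same numbers.
EXPECTED = {
    "issm0": (231, 64),
    "ucm0": (231, 224),
    "umad0": (231, 0),
    "uverbs0": (231, 192),
}


def check_node(path, expected):
    """Return a short status string for one device node."""
    try:
        st = os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    if not stat.S_ISCHR(st.st_mode):
        return "not a character device"
    found = (os.major(st.st_rdev), os.minor(st.st_rdev))
    if found != expected:
        return "wrong device number %d,%d" % found
    return "ok"


def check_dir(dev_dir="/dev/infiniband"):
    """Check every expected node under dev_dir and return {name: status}."""
    return {
        name: check_node(os.path.join(dev_dir, name), nums)
        for name, nums in EXPECTED.items()
    }


if __name__ == "__main__":
    for name, status in sorted(check_dir().items()):
        print("%-8s %s" % (name, status))
```

The sketch does not look at ownership or permission bits, so a node that
reports "ok" could still be unreadable for a non-root MPI process.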

All 5 nodes have the same contents in the /sys/class/infiniband directory.
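
Since sysfs looks the same everywhere, it may help to compare what the kernel
advertises with what actually exists under /dev/infiniband. The Python sketch
below rests on an assumption not stated above: that the userspace character
devices appear under sibling class directories such as
/sys/class/infiniband_verbs and /sys/class/infiniband_mad, each with a "dev"
file holding "major:minor". The helper names parse_dev, node_matches, and
compare are hypothetical.

```python
import glob
import os
import stat


def parse_dev(text):
    """Turn the contents of a sysfs 'dev' file, e.g. '231:192', into (231, 192)."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def node_matches(path, devnum):
    """True if path is a character device with the given (major, minor)."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    found = (os.major(st.st_rdev), os.minor(st.st_rdev))
    return stat.S_ISCHR(st.st_mode) and found == devnum


def compare(sys_glob="/sys/class/infiniband_*/*/dev", dev_dir="/dev/infiniband"):
    """Map each sysfs device name to ((major, minor), node_is_present_and_correct)."""
    report = {}
    for dev_file in sorted(glob.glob(sys_glob)):
        name = os.path.basename(os.path.dirname(dev_file))
        with open(dev_file) as f:
            devnum = parse_dev(f.read())
        report[name] = (devnum, node_matches(os.path.join(dev_dir, name), devnum))
    return report


if __name__ == "__main__":
    for name, (devnum, ok) in compare().items():
        print("%-8s %3d,%-3d %s" % (name, devnum[0], devnum[1], "ok" if ok else "MISSING or wrong"))
```

If this prints nothing on the 5th node but lists devices on the others, the
difference is on the kernel side rather than in /dev.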

Here's the mpirun I am trying:

mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining
device context for mthca0 errno says No such file or directory

--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------

Any help would be appreciated! Thanks.

Brian