Gentoo Archives: gentoo-soc

From: wuyy <xgreenlandforwyy@×××××.com>
To: gentoo-soc <gentoo-soc@l.g.o>
Subject: [gentoo-soc] Week 5 Report for Refining ROCm Packages in Gentoo
Date: Mon, 18 Jul 2022 15:07:17
Message-Id: YtV3PGe/lSHs3gUn@HEPwuyy
Hi all,

Week 5 was mainly about using and testing the rocm.eclass I wrote:
packaging and testing the ROCm-5.1.3 libraries. I also began to land the
ROCm-5.1.3 toolchain in ::gentoo. However, new problems emerged, and I'm
a bit behind schedule. After discussing this with my mentor, I decided to
make packaging tensorflow and jax with ROCm a low-priority job.

My latest progress is at
https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3:
sci-libs/roc{BLAS,FFT,PRIM,SPARSE,Thrust}-5.1.3 now use rocm.eclass. I
have written amdgpu_targets.desc and added it to profiles/desc, so each
amdgpu_targets_ USE_EXPAND flag has its description (the name and
codename of the architecture, as well as the graphics cards it covers). I
will post a screenshot of `equery uses rocBLAS` to my blog.
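
To illustrate how such a description file maps flags to text, here is a
minimal Python sketch. It assumes the usual `flag - description` layout
of files under profiles/desc; the sample entries and the function name
parse_use_expand_desc are hypothetical and are not copied from the
actual amdgpu_targets.desc.

```python
def parse_use_expand_desc(text):
    """Map each flag in a profiles/desc-style file to its description."""
    descriptions = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        flag, sep, description = line.partition(" - ")
        if not sep:
            # Not a "flag - description" entry; ignore it.
            continue
        descriptions[flag.strip()] = description.strip()
    return descriptions


# Hypothetical sample content, for illustration only.
SAMPLE_DESC = """\
# Illustrative entries, not the real amdgpu_targets.desc
gfx906 - Vega 20 (Radeon VII)
gfx1031 - Navi 22 (Radeon RX 6700XT)
"""

if __name__ == "__main__":
    for flag, description in parse_use_expand_desc(SAMPLE_DESC).items():
        print(f"amdgpu_targets_{flag}: {description}")
```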

It turned out that rocm.eclass simplifies those ebuilds, especially
src_test. I spent some time testing the libraries on a Radeon VII and a
Radeon RX 6700XT. By running the tests I found a critical bug in
rocFFT-5.1.3 [1], which was confirmed by upstream. Users should be
cautious: until the bug is fixed, amdgpu_targets_gfx906 should be masked
for rocFFT-5.1.3. The 6700XT, on the other hand, failed several rocSPARSE
tests, which upstream has explained [2]. rocBLAS passes its tests on the
Radeon VII, but triggers an amdgpu kernel module failure for an unknown
reason. The load may be too high: after a reboot, the failed test suite
ran normally on its own, and only running the entire test suite brought
the GPU down. Other packages passed all tests on both cards.

Meanwhile, I'm also working on dev-libs/rccl and
dev-libs/rocm-opencl-runtime. dev-libs/rccl, like sci-libs/roc-*, can
use rocm.eclass and works well; however, there are build failures caused
by calling `chrpath -r` on a library without an rpath (rocm.eclass sets
-DSKIP_RPATH=ON). I will make it work in the coming week. For
rocm-opencl-runtime, I managed to turn on USE=test, but there are test
failures on the 6700XT which need further investigation. Also, some of
the tests in rocm-opencl-runtime need a DISPLAY. I tried virtualx.eclass,
as ionen suggested in the #gentoo-soc IRC channel, but it didn't work in
my docker environment. virtualx does not work in Gentoo Prefix either.
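
One possible way around the `chrpath -r` failure is to check for an
rpath before replacing it. The Python sketch below only illustrates the
idea and is not the fix planned for the ebuild: it assumes the dynamic
section has already been dumped with `readelf -d`, and the helper names
has_rpath and chrpath_command, as well as the sample paths, are
hypothetical.

```python
import re

# readelf -d prints the tag name in parentheses, e.g. "(RPATH)" or "(RUNPATH)".
_RPATH_TAG = re.compile(r"\((RPATH|RUNPATH)\)")


def has_rpath(dynamic_section):
    """Return True if `readelf -d` output contains an RPATH or RUNPATH entry."""
    return any(_RPATH_TAG.search(line) for line in dynamic_section.splitlines())


def chrpath_command(library, new_rpath, dynamic_section):
    """Return the chrpath argv, or None when there is no rpath to replace."""
    if not has_rpath(dynamic_section):
        return None
    return ["chrpath", "-r", new_rpath, library]


# Illustrative readelf -d excerpts (paths are made up).
WITH_RUNPATH = " 0x000000000000001d (RUNPATH)            Library runpath: [/opt/rocm/lib]\n"
WITHOUT_RPATH = " 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]\n"

if __name__ == "__main__":
    print(chrpath_command("librccl.so", "/usr/lib64", WITH_RUNPATH))
    print(chrpath_command("librccl.so", "/usr/lib64", WITHOUT_RPATH))
```

In an ebuild the same guard would be written in bash; the point is only
that the replacement is skipped when no rpath tag is present.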

I came across another bug when compiling rccl-5.1.3 for gfx10xx [3].
After consulting the Gentoo LLVM maintainer, I opened an issue on
llvm-project asking for approval to backport to llvm-14 a patch which
fixes this problem [4].

As I prepared to land the ROCm-5.1.3 toolchain in ::gentoo via this PR
[5], I noticed another problem. hip and rocm-comgr have the clang include
path hard-coded in their sources, so any clang upgrade (even a minor
upgrade like 14.0.5 -> 14.0.6) would cause runtime problems. I have
consulted mgorny about this. He suggested that I try hacking on the clang
Driver to see whether the include path can be extracted at runtime using
the C++ API. I'll try this in the coming week, and if it fails, adding a
subslot to clang may be plan B. Once this is fixed, I think hip-5.1.3 and
rocm-comgr-5.1.3 will be ready to land in ::gentoo.
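
To make the "look it up at runtime" idea concrete, here is a rough
Python sketch. It is not the C++ Driver API approach mgorny suggested:
it swaps in the clang command line, asking for the resource directory
with `clang -print-resource-dir` and appending `include`. The function
name clang_include_dir and its injectable `run` parameter are
hypothetical, and whether hip and rocm-comgr can use a lookup like this
is an open question.

```python
import subprocess
from pathlib import Path


def clang_include_dir(clang="clang", run=subprocess.run):
    """Ask clang for its resource directory and return the include dir in it.

    `run` is injectable so the lookup can be exercised without clang installed.
    """
    result = run(
        [clang, "-print-resource-dir"],
        capture_output=True,
        text=True,
        check=True,
    )
    return Path(result.stdout.strip()) / "include"


if __name__ == "__main__":
    try:
        print(clang_include_dir())
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"could not query clang: {err}")
```

Because the path is resolved on every call, a clang upgrade would not
leave a stale version number baked into the binary.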

Due to limited time I made little progress on rocm.eclass. I began
reading about PYTHON_USEDEP in the python eclasses to prepare for
ROCM_USEDEP. I plan to implement it in the coming week, completing the
last piece of rocm.eclass.
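
As a rough picture of what ROCM_USEDEP could expand to, the Python
sketch below builds a USE-dependency string in the style I understand
PYTHON_USEDEP to use in python-r1 (`flag(-)?` entries joined by commas).
ROCM_USEDEP is not implemented yet, so the output format, the function
name rocm_usedep and the target list are all assumptions.

```python
def rocm_usedep(targets, prefix="amdgpu_targets_"):
    """Build a USE-dependency string such as 'amdgpu_targets_gfx906(-)?'.

    Each entry asks the dependency to enable the flag when this package
    has it enabled; '(-)' treats a missing flag as disabled.
    """
    return ",".join(f"{prefix}{target}(-)?" for target in targets)


if __name__ == "__main__":
    # Hypothetical target list for illustration.
    print(rocm_usedep(["gfx906", "gfx1031"]))
```

A dependency could then be written as, for example,
sci-libs/rocBLAS[${ROCM_USEDEP}], mirroring how PYTHON_USEDEP is used.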

Here is the brief plan of future work for the following weeks, after
lowering the priority of tensorflow and jax:

week 6: finish rocm.eclass, send for review; continue packaging ROCm libs;
week 7: modify rocm.eclass according to comments; package ROCm libs, including rocWMMA;
week 8: finalize rocm.eclass; start working on cupy;
week 9: cupy ebuild; start writing wiki;
week 10: get cupy landed in ::gentoo; bump dev-util/rocprofiler to 5.1.3;
week 11: continue writing the wiki; consider ROCgdb;
week 12: finish wiki; summarize my GSoC.


[1] https://github.com/ROCmSoftwarePlatform/rocFFT/issues/369
[2] https://github.com/ROCmSoftwarePlatform/rocSPARSE/issues/258
[3] https://bugs.gentoo.org/851702#c15
[4] https://github.com/llvm/llvm-project/issues/56577
[5] https://github.com/gentoo/gentoo/pull/26441

Best regards,
--
Yiyang Wu