Hi all,

Week 5 was mainly about using and testing the rocm.eclass I wrote:
packaging and testing the ROCm-5.1.3 libraries. I also began to land the
ROCm-5.1.3 toolchain in ::gentoo. However, new problems emerged and I am
a bit behind schedule, so after discussing it with my mentor, I decided
to lower the priority of packaging tensorflow and jax with ROCm.

My latest progress is at
https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3 and consists of
sci-libs/roc{BLAS,FFT,PRIM,SPARSE,Thrust}-5.1.3, all using rocm.eclass.
I have written amdgpu_targets.desc and added it to profiles/desc, so
each amdgpu_targets_ USE_EXPAND flag has a description (the name and
codename of the architecture, as well as the graphics cards it covers).
I will post a screenshot of `equery uses rocBLAS` on my blog.

It turned out that rocm.eclass simplifies those ebuilds, especially
src_test. I spent some time testing the libraries on a Radeon VII and a
Radeon RX 6700XT. Running the tests revealed a critical bug in
rocFFT-5.1.3 [1], which upstream has confirmed. We should be cautious
here: until the bug is fixed, amdgpu_targets_gfx906 should be masked for
rocFFT-5.1.3. The 6700XT failed several rocSPARSE tests, which upstream
has explained [2]. rocBLAS passes its tests on the Radeon VII, but a
full test run caused an amdgpu kernel module failure for an unknown
reason. The load may be too high: after a reboot, the test suite that
had failed ran normally on its own, and only running the entire set of
tests brought the GPU down. The other packages passed all tests on both
cards.

Meanwhile, I am also working on dev-libs/rccl and
dev-libs/rocm-opencl-runtime. dev-libs/rccl, like sci-libs/roc*, can use
rocm.eclass, and this mostly works well; however, the build currently
fails because `chrpath -r` is called on a library without an rpath
(rocm.eclass sets -DSKIP_RPATH=ON). I will make it work in the coming
week. For rocm-opencl-runtime, I managed to turn on USE=test, but there
are test failures on the 6700XT that need further investigation. Also,
some of the tests in rocm-opencl-runtime need a DISPLAY. I tried
virtualx.eclass, as ionen suggested in the #gentoo-soc IRC channel, but
that did not work in my docker environment. virtualx does not work in
Gentoo Prefix either.

I came across another bug when compiling rccl-5.1.3 for gfx10xx [3].
After consulting the Gentoo LLVM maintainer, I opened an issue on
llvm-project asking upstream to approve backporting a patch to llvm-14
that fixes this problem [4].

While preparing to land the ROCm-5.1.3 toolchain in ::gentoo via this
PR [5], I noticed another problem. hip and rocm-comgr have the clang
include path hard-coded in their sources, so any clang upgrade (even a
minor version upgrade like 14.0.5 -> 14.0.6) would cause runtime
problems. I consulted mgorny about this. He suggested that I try
hacking on the clang Driver to see whether the include path can be
extracted through the C++ API at runtime. I will try this in the coming
week, and if it fails, adding a subslot to clang may be plan B. Once
this is fixed, I think hip-5.1.3 and rocm-comgr-5.1.3 will be ready to
land in ::gentoo.
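
To illustrate the idea of asking clang for its include path instead of
hard-coding a versioned one, here is a minimal shell sketch. Note that
it uses the `clang -print-resource-dir` command-line option, not the C++
Driver API mentioned above, so it is only a stand-in for the technique
and not the planned patch. The function name clang_include_dir is
hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: query the installed clang for its resource directory
# instead of hard-coding a versioned path. clang_include_dir is a
# hypothetical helper name, not part of hip or rocm-comgr.
clang_include_dir() {
	local clang_bin=${1:-clang} resource_dir
	# Fail cleanly if the compiler is not available.
	command -v "${clang_bin}" >/dev/null 2>&1 || return 1
	# -print-resource-dir prints the directory holding clang's own
	# resources; the builtin headers live in its include/ subdirectory.
	resource_dir=$("${clang_bin}" -print-resource-dir) || return 1
	echo "${resource_dir}/include"
}

if dir=$(clang_include_dir); then
	echo "clang builtin headers: ${dir}"
else
	echo "clang not found; cannot determine include path" >&2
fi
```

Because the path is computed when the function is called, a clang
upgrade changes the result without rebuilding anything, which is the
property the hard-coded path lacks.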

Due to limited time, I made little progress on rocm.eclass itself. I
began reading how PYTHON_USEDEP is implemented in the python eclasses,
to prepare for ROCM_USEDEP. I plan to implement it in the coming week,
which will complete the last piece of rocm.eclass.
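
As a rough idea of what a PYTHON_USEDEP-style variable could look like
for ROCm, here is a minimal bash sketch. The function name
rocm_gen_usedep and the target list are assumptions made for
illustration; the actual ROCM_USEDEP in rocm.eclass may be generated
differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT the real rocm.eclass API: build a
# USE-dependency string from a list of AMDGPU targets, in the style of
# PYTHON_USEDEP.
rocm_gen_usedep() {
	local target deps=()
	for target in "$@"; do
		# "flag(-)?" means: if the flag is enabled on this package,
		# require it on the dependency too; "(-)" treats a flag that
		# the dependency does not have as disabled.
		deps+=( "amdgpu_targets_${target}(-)?" )
	done
	# Join the entries with commas.
	local IFS=,
	echo "${deps[*]}"
}

# Example target names, chosen only for illustration.
ROCM_USEDEP=$(rocm_gen_usedep gfx906 gfx1031)
echo "${ROCM_USEDEP}"
```

An ebuild could then depend on something like
sci-libs/rocBLAS[${ROCM_USEDEP}], so that a library and its dependencies
are built for the same set of GPU architectures.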

Here is the brief plan of future work for the following weeks, after
lowering the priority of tensorflow and jax:

week 6: finish rocm.eclass, send it for review; continue packaging ROCm libs;
week 7: revise rocm.eclass according to review comments; continue packaging ROCm libs, including rocWMMA;
week 8: finalize rocm.eclass; start working on cupy;
week 9: cupy ebuild; start writing the wiki;
week 10: land cupy in ::gentoo; bump dev-util/rocprofiler to 5.1.3;
week 11: continue writing the wiki; consider ROCgdb;
week 12: finish the wiki; summarize my GSoC.

[1] https://github.com/ROCmSoftwarePlatform/rocFFT/issues/369
[2] https://github.com/ROCmSoftwarePlatform/rocSPARSE/issues/258
[3] https://bugs.gentoo.org/851702#c15
[4] https://github.com/llvm/llvm-project/issues/56577
[5] https://github.com/gentoo/gentoo/pull/26441

Best regards,
--
Yiyang Wu