Add bitvector data structure #1820

Conversation
```cpp
                device_predicate(base_i + local_i) ? 1 : 0;
            mask |= bit << local_i;
        }
    }
```
The above two conditions can be merged into `base_i + local_i < size && local_i < block_size`.
I wanted the top one to potentially allow for full unrolling, since the loop trip count is known there. But I'm not actually sure whether the compiler can always manage that.
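For illustration, a hedged sketch of the loop structure under discussion (`device_predicate`, `base_i`, `block_size`, and `size` are taken from the quoted fragment; the surrounding structure is assumed):

```cpp
// Sketch only: splitting the bounds check gives full blocks a loop with
// a compile-time-known trip count, which the compiler may fully unroll.
storage_type mask = 0;
if (base_i + block_size <= size) {
    // full block: fixed trip count of block_size, candidate for unrolling
#pragma unroll
    for (int local_i = 0; local_i < block_size; local_i++) {
        const storage_type bit = device_predicate(base_i + local_i) ? 1 : 0;
        mask |= bit << local_i;
    }
} else {
    // partial block at the end of the range: needs the runtime check
    for (int local_i = 0; base_i + local_i < size && local_i < block_size;
         local_i++) {
        const storage_type bit = device_predicate(base_i + local_i) ? 1 : 0;
        mask |= bit << local_i;
    }
}
```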
```cpp
array<IndexType> ranks{exec, num_blocks};
std::fill_n(bits.get_data(), num_blocks, 0);
std::fill_n(ranks.get_data(), num_blocks, 0);
for (auto i : irange{size}) {
```
I would prefer `index_range` over `irange`.
This is based on `boost::irange`, for familiarity.
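For readers unfamiliar with `boost::irange`, a hypothetical minimal version of such a range (not the actual implementation in this PR) could look like this:

```cpp
// Hypothetical minimal irange: a half-open integer range [0, end)
// usable in range-based for loops, mirroring boost::irange semantics.
template <typename IndexType>
class irange {
public:
    class iterator {
    public:
        explicit iterator(IndexType value) : value_{value} {}
        IndexType operator*() const { return value_; }
        iterator& operator++()
        {
            ++value_;
            return *this;
        }
        bool operator!=(const iterator& other) const
        {
            return value_ != other.value_;
        }

    private:
        IndexType value_;
    };

    explicit irange(IndexType end) : end_{end} {}
    iterator begin() const { return iterator{IndexType{0}}; }
    iterator end() const { return iterator{end_}; }

private:
    IndexType end_;
};
```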
I have a suggestion for the device type: make the data members public. This would allow implementing the current member functions as free-standing functions, which would enable backend-specific implementations. It would also allow adding a `set` function, which will require backend-specific atomics.
(This suggestion is not blocking, however.)
@MarcelKoch the device view is meant as a read-only representation of the bitvector; I don't really want to encourage people to build bitvectors by hand without the helper algorithms.
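To make the suggestion above concrete, a free-standing `set` along the proposed lines might look like the following sketch (assuming public data members, the bit layout from the PR, and C++20 `std::atomic_ref` as a stand-in for backend-specific atomics):

```cpp
#include <atomic>
#include <cstdint>

using storage_type = std::uint32_t;  // assumed storage word type
constexpr int block_size = 32;       // bits per storage word

// Sketch of the suggested free-standing set(): atomically sets bit i,
// assuming bit i lives at bit (i % block_size) of bits[i / block_size].
inline void set(storage_type* bits, std::int64_t i)
{
    const auto block = i / block_size;
    const auto mask = storage_type{1} << (i % block_size);
    std::atomic_ref<storage_type> ref{bits[block]};
    ref.fetch_or(mask, std::memory_order_relaxed);
}
```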
For me, the biggest issue is the header in `common/unified`. I don't think we should start putting headers in there, especially for kernels that are available for all backends.
This is still a blocker for me. I just don't see why we should put kernel headers (that are not used exclusively in `common/unified`) into `common/unified`.
With a uniform `transform_iterator` implementation, I can also implement #1832 in `common/unified`. Would that be sufficient justification for putting it there?
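For context, a uniform `transform_iterator` in the spirit mentioned here (a hypothetical minimal version, similar to `thrust::transform_iterator`; not necessarily what the PR implements) could look like this:

```cpp
#include <cstddef>

// Hypothetical minimal transform_iterator: wraps an iterator and
// applies a functor on every dereference.
template <typename Iterator, typename Functor>
class transform_iterator {
public:
    transform_iterator(Iterator it, Functor fn) : it_{it}, fn_{fn} {}

    auto operator*() const { return fn_(*it_); }
    auto operator[](std::ptrdiff_t i) const { return fn_(it_[i]); }

    transform_iterator& operator++()
    {
        ++it_;
        return *this;
    }
    bool operator!=(const transform_iterator& other) const
    {
        return it_ != other.it_;
    }

private:
    Iterator it_;
    Functor fn_;
};
```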
My issue is not that it includes non-uniform implementations from different backends, but that I don't really understand how this header is going to be used. If the kernels are only called from kernels implemented in `common/unified`, then it's fine with me. If this file is going to be included anywhere else, then I think it should be put under `core`.
In any case, I hope I have explained my reservations well enough, and I will not hold up the PR because of them anymore. However, if the header gets included outside `common/unified`, I will bring it up again.
For tests and kernels implemented in `unified`, I would use this header; for other tests and kernels, I would use the backend-specific one. Your reservations definitely make sense and made me reevaluate how I am using the headers.
Related to a Thrust bug: ROCm/rocThrust#352
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Some questions:
- Do we need to use an iterator for `from_sorted_indices`?
- What's the purpose of the transform iterator?
- the index check of `(base_i + block_size <= size())`
```cpp
assert(i < size());
const auto block = i / block_size;
const auto local = i % block_size;
return bool((bits_[block] >> local) & 1);
```
Maybe use `static_cast<bool>`?
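That is, the last line of the quoted snippet would become the following (equivalent behavior, just a more explicit spelling):

```cpp
return static_cast<bool>((bits_[block] >> local) & 1);
```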
```cpp
auto value = it[i];
const auto [block, mask] =
    device_bitvector<index_type>::get_block_and_mask(value);
sycl::atomic_ref<storage_type, sycl::memory_order::relaxed,
```
We have `ballot` from cooperative groups, which can eliminate the atomic usage. You need to set the sub-group size specifically.
Not sure I follow. This would require a segmented scan to get rid of some of the atomic operations.
I think it should be almost like the kernel in cuda_hip:

```cpp
mask = ballot(device_predicate(base_i + local_i));  // subgroup size needs to be 32
masks[base_i / bv::blocksize] = mask;
```
This is `from_sorted_indices`, not `from_predicate`.
I put it in the wrong place. I meant the predicate case.
Yes, I realized that after writing my response 😄 We could do that, but I wanted to avoid the complexity of subgroup size dispatch, and since subgroups are sometimes smaller than 32, we would need to assemble the mask out of multiple elements. So that definitely makes sense as a future optimization, but I wanted to keep the complexity of this PR smaller.
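As a sketch of that future optimization in CUDA/HIP style (`device_predicate` and the 32-bit block layout are assumed from the discussion, and the warp size is assumed to equal the bitvector block size):

```cpp
#include <cstdint>

using storage_type = std::uint32_t;

__device__ bool device_predicate(std::int64_t i);  // assumed, as discussed

// Sketch: each lane of a warp evaluates the predicate for one index,
// and __ballot_sync assembles the 32-bit mask without any atomics.
// Out-of-range lanes contribute 0 bits, handling the partial last block.
__global__ void from_predicate_ballot(std::int64_t size, storage_type* masks)
{
    const auto i =
        static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    const bool pred = i < size && device_predicate(i);
    const storage_type mask = __ballot_sync(0xFFFFFFFFu, pred);
    // lane 0 of each warp writes the mask for its 32-index block
    if (threadIdx.x % 32 == 0 && i < size) {
        masks[i / 32] = mask;
    }
}
```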
```cpp
array<storage_type> bits_compact{exec, num_blocks};
array<index_type> bits_position{exec, num_blocks};
array<storage_type> bits{exec, num_blocks};
array<index_type> ranks{exec, num_blocks};
```
Compared to the others, this uses twice the memory. Does it give better performance?
It has better worst-case behavior, since it doesn't suffer from atomic collisions. If we implemented our own kernel for it, we could also get rid of the second pair of arrays, but `reduce_by_key` has an interface that prevents us from scattering the results directly to the output.
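A hedged sketch of that approach using Thrust (the types and the 32-bit block layout are assumed; the actual kernels in the PR may differ):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>

struct block_of {
    __host__ __device__ int operator()(int i) const { return i / 32; }
};

struct mask_of {
    __host__ __device__ unsigned operator()(int i) const
    {
        return 1u << (i % 32);
    }
};

// Sketch: OR-reduce the per-index masks of sorted indices by block id
// into a compact array, then scatter them to their block positions.
// The extra bits_compact/bits_position arrays are needed because
// reduce_by_key cannot scatter its output directly.
void build_masks(const thrust::device_vector<int>& sorted_indices,
                 thrust::device_vector<unsigned>& bits)
{
    const auto n = sorted_indices.size();
    thrust::device_vector<int> bits_position(n);
    thrust::device_vector<unsigned> bits_compact(n);
    auto keys =
        thrust::make_transform_iterator(sorted_indices.begin(), block_of{});
    auto values =
        thrust::make_transform_iterator(sorted_indices.begin(), mask_of{});
    auto ends = thrust::reduce_by_key(
        keys, keys + n, values, bits_position.begin(), bits_compact.begin(),
        thrust::equal_to<int>{}, thrust::bit_or<unsigned>{});
    const auto num_unique = ends.first - bits_position.begin();
    thrust::scatter(bits_compact.begin(), bits_compact.begin() + num_unique,
                    bits_position.begin(), bits.begin());
}
```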
By the way, I am not sure how you plan to use it in symbolic Cholesky.

The most common use case in symbolic elimination tree computation is mapping between a set [0, n) of vertices and a subset (e.g. representatives of connected components, endpoints of tree edges, ...), and speeding up the CSR-to-COO conversion, which ran into some edge cases that caused a 100x-1000x slowdown compared to the improved algorithm in #1832.

(It is not related to this PR itself, but to the usage.)

I think you fundamentally misunderstand how this is used - it is an algorithmic component that is usually of size …

That's why I ask how you plan to use it.

Please take a look at #1832 for an example of how the data structure is used. I am not using it in symbolic Cholesky, only for the elimination forest computation and the aforementioned conversion so far. Other applications are operations similar to …

I see, it is clearer now: the index range is over the nnz, not the columns, for symbolic Cholesky.

If I'm filtering nodes or edges, the index space is of size n or nnz; if I'm representing row_ptrs, it is of size nnz + n.
Three nits left; the others look good to me, or are not necessary to block this PR.
- Some `uint32` usages should use `storage_type` from the bitvector.
- Add the unroll to the for loop with a known fixed size.
- Use `static_cast<bool>` rather than `bool()`.
- reformat documentation
- remove explicit uint32 usages
- rename member functions with get_
- unroll loops explicitly

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Yu-Hsiang M. Tsai <yhmtsai@gmail.com>
Pulled out of the symbolic Cholesky work: this adds a bitvector data structure that can be used to map from a full index range to a subset of indices in constant time $\mathcal O(1)$ instead of $\mathcal O(\log n)$. It can be used to either represent an increasing sequence of integers, or a boolean predicate on an index range. It can be used to get rid of a lot of binary searches.
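To illustrate the constant-time mapping, here is a minimal self-contained sketch of the rank idea behind such a data structure (not Ginkgo's actual API):

```cpp
#include <bit>  // std::popcount (C++20)
#include <cstdint>

// Sketch: bits stores one bit per index in 32-bit blocks, ranks stores
// the number of set bits before each block, so the rank of index i
// (its position within the subset) is computed in O(1), replacing an
// O(log n) binary search over the sorted subset.
inline std::int64_t rank(const std::uint32_t* bits, const std::int64_t* ranks,
                         std::int64_t i)
{
    const auto block = i / 32;
    const auto local = i % 32;
    const auto below = bits[block] & ((std::uint32_t{1} << local) - 1);
    return ranks[block] + std::popcount(below);
}
```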