Add bitvector data structure #1820

Conversation
```cpp
                device_predicate(base_i + local_i) ? 1 : 0;
            mask |= bit << local_i;
        }
    }
```
The above two conditions can be merged into `base_i + local_i < size && local_i < block_size`.
I wanted the top one to potentially allow for full unrolling, since the loop trip count is known there. But I'm not actually sure whether the compiler can always manage that.
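For illustration, a hedged sketch of the loop structure under discussion (`device_predicate`, `base_i`, `block_size`, and `size` are taken from the quoted fragment; the surrounding structure is assumed):

```cpp
// Sketch only: splitting the bounds check gives full blocks a loop with
// a compile-time-known trip count, which the compiler may fully unroll.
storage_type mask = 0;
if (base_i + block_size <= size) {
    // full block: fixed trip count of block_size, candidate for unrolling
#pragma unroll
    for (int local_i = 0; local_i < block_size; local_i++) {
        const storage_type bit = device_predicate(base_i + local_i) ? 1 : 0;
        mask |= bit << local_i;
    }
} else {
    // partial block at the end of the range: needs the runtime check
    for (int local_i = 0; base_i + local_i < size && local_i < block_size;
         local_i++) {
        const storage_type bit = device_predicate(base_i + local_i) ? 1 : 0;
        mask |= bit << local_i;
    }
}
```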
```cpp
array<IndexType> ranks{exec, num_blocks};
std::fill_n(bits.get_data(), num_blocks, 0);
std::fill_n(ranks.get_data(), num_blocks, 0);
for (auto i : irange{size}) {
```
I would prefer `index_range` over `irange`.
This is based on `boost::irange`, for familiarity.
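For readers unfamiliar with `boost::irange`, a hypothetical minimal version of such a range (not the actual implementation in this PR) could look like this:

```cpp
// Hypothetical minimal irange: a half-open integer range [0, end)
// usable in range-based for loops, mirroring boost::irange semantics.
template <typename IndexType>
class irange {
public:
    class iterator {
    public:
        explicit iterator(IndexType value) : value_{value} {}
        IndexType operator*() const { return value_; }
        iterator& operator++()
        {
            ++value_;
            return *this;
        }
        bool operator!=(const iterator& other) const
        {
            return value_ != other.value_;
        }

    private:
        IndexType value_;
    };

    explicit irange(IndexType end) : end_{end} {}
    iterator begin() const { return iterator{IndexType{0}}; }
    iterator end() const { return iterator{end_}; }

private:
    IndexType end_;
};
```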
I have a suggestion for the device type: make the data members public. This would allow implementing the current member functions as free-standing functions, which would enable backend-specific implementations. It would also allow adding a `set` function, which will require backend-specific atomics.
(This suggestion is not blocking, however.)
@MarcelKoch the device view is meant as a read-only representation of the bitvector; I don't really want to encourage people to build bitvectors by hand without the helper algorithms.
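To make the suggestion above concrete, a free-standing `set` along the proposed lines might look like the following sketch (assuming public data members, the bit layout from the PR, and C++20 `std::atomic_ref` as a stand-in for backend-specific atomics):

```cpp
#include <atomic>
#include <cstdint>

using storage_type = std::uint32_t;  // assumed storage word type
constexpr int block_size = 32;       // bits per storage word

// Sketch of the suggested free-standing set(): atomically sets bit i,
// assuming bit i lives at bit (i % block_size) of bits[i / block_size].
inline void set(storage_type* bits, std::int64_t i)
{
    const auto block = i / block_size;
    const auto mask = storage_type{1} << (i % block_size);
    std::atomic_ref<storage_type> ref{bits[block]};
    ref.fetch_or(mask, std::memory_order_relaxed);
}
```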
For me, the biggest issue is the header in `common/unified`. I don't think we should start putting headers in there, especially for kernels that are available for all backends.
This is still a blocker for me. I just don't see why we should put kernel headers (that are not used exclusively in `common/unified`) into `common/unified`.
With a uniform `transform_iterator` implementation, I can also implement #1832 in `common/unified`. Would that be sufficient justification for putting it there?
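For context, a uniform `transform_iterator` in the spirit mentioned here (a hypothetical minimal version, similar to `thrust::transform_iterator`; not necessarily what the PR implements) could look like this:

```cpp
#include <cstddef>

// Hypothetical minimal transform_iterator: wraps an iterator and
// applies a functor on every dereference.
template <typename Iterator, typename Functor>
class transform_iterator {
public:
    transform_iterator(Iterator it, Functor fn) : it_{it}, fn_{fn} {}

    auto operator*() const { return fn_(*it_); }
    auto operator[](std::ptrdiff_t i) const { return fn_(it_[i]); }

    transform_iterator& operator++()
    {
        ++it_;
        return *this;
    }
    bool operator!=(const transform_iterator& other) const
    {
        return it_ != other.it_;
    }

private:
    Iterator it_;
    Functor fn_;
};
```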
My issue is not that it includes non-uniform implementations from different backends, but that I don't really understand how this header is going to be used. If the kernels are only called from kernels implemented in `common/unified`, then it's fine with me. If this file is going to be included anywhere else, then I think it should be put under `core`.
In any case, I hope I have explained my reservations well enough, and I will not hold up the PR because of them anymore. However, if the header gets included outside `common/unified`, I will bring it up again.
For tests and kernels implemented in `unified`, I would use this header; for other tests and kernels, I would use the backend-specific one. Your reservations definitely make sense and made me reevaluate how I am using the headers.
Related to a Thrust bug: ROCm/rocThrust#352
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Some questions:
- Do we need to use an iterator for `from_sorted_indices`?
- What's the purpose of the transform iterator?
- the index check of `(base_i + block_size <= size())`
```cpp
assert(i < size());
const auto block = i / block_size;
const auto local = i % block_size;
return bool((bits_[block] >> local) & 1);
```
Maybe use `static_cast<bool>`?
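That is, the last line of the quoted snippet would become the following (equivalent behavior, just a more explicit spelling):

```cpp
return static_cast<bool>((bits_[block] >> local) & 1);
```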
```cpp
auto value = it[i];
const auto [block, mask] =
    device_bitvector<index_type>::get_block_and_mask(value);
sycl::atomic_ref<storage_type, sycl::memory_order::relaxed,
```
We have `ballot` from cooperative groups, which can eliminate the atomic usage. You need to set the sub-group size specifically.
Not sure I follow. This would require a segmented scan to get rid of some of the atomic operations.
I think it should be almost like the kernel in cuda_hip:

```cpp
mask = ballot(device_predicate(base_i + local_i));  // subgroup size needs to be 32
masks[base_i / bv::blocksize] = mask;
```
This is `from_sorted_indices`, not `from_predicate`.
I put it in the wrong place. I meant the predicate case.
Yes, I realized that after writing my response 😄 We could do that, but I wanted to avoid the complexity of subgroup size dispatch, and since subgroups are sometimes smaller than 32, we would need to assemble the mask out of multiple elements. So that definitely makes sense as a future optimization, but I wanted to keep the complexity of this PR smaller.
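As a sketch of that future optimization in CUDA/HIP style (`device_predicate` and the 32-bit block layout are assumed from the discussion, and the warp size is assumed to equal the bitvector block size):

```cpp
#include <cstdint>

using storage_type = std::uint32_t;

__device__ bool device_predicate(std::int64_t i);  // assumed, as discussed

// Sketch: each lane of a warp evaluates the predicate for one index,
// and __ballot_sync assembles the 32-bit mask without any atomics.
// Out-of-range lanes contribute 0 bits, handling the partial last block.
__global__ void from_predicate_ballot(std::int64_t size, storage_type* masks)
{
    const auto i =
        static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    const bool pred = i < size && device_predicate(i);
    const storage_type mask = __ballot_sync(0xFFFFFFFFu, pred);
    // lane 0 of each warp writes the mask for its 32-index block
    if (threadIdx.x % 32 == 0 && i < size) {
        masks[i / 32] = mask;
    }
}
```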
```cpp
array<storage_type> bits_compact{exec, num_blocks};
array<index_type> bits_position{exec, num_blocks};
array<storage_type> bits{exec, num_blocks};
array<index_type> ranks{exec, num_blocks};
```
Compared to the others, this uses twice the memory. Does it give better performance?
It has better worst-case behavior, since it doesn't suffer from atomic collisions. If we implemented our own kernel for it, we could also get rid of the second pair of arrays, but `reduce_by_key` has an interface that prevents us from scattering the results directly to the output.
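A hedged sketch of that approach using Thrust (the types and the 32-bit block layout are assumed; the actual kernels in the PR may differ):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>

struct block_of {
    __host__ __device__ int operator()(int i) const { return i / 32; }
};

struct mask_of {
    __host__ __device__ unsigned operator()(int i) const
    {
        return 1u << (i % 32);
    }
};

// Sketch: OR-reduce the per-index masks of sorted indices by block id
// into a compact array, then scatter them to their block positions.
// The extra bits_compact/bits_position arrays are needed because
// reduce_by_key cannot scatter its output directly.
void build_masks(const thrust::device_vector<int>& sorted_indices,
                 thrust::device_vector<unsigned>& bits)
{
    const auto n = sorted_indices.size();
    thrust::device_vector<int> bits_position(n);
    thrust::device_vector<unsigned> bits_compact(n);
    auto keys =
        thrust::make_transform_iterator(sorted_indices.begin(), block_of{});
    auto values =
        thrust::make_transform_iterator(sorted_indices.begin(), mask_of{});
    auto ends = thrust::reduce_by_key(
        keys, keys + n, values, bits_position.begin(), bits_compact.begin(),
        thrust::equal_to<int>{}, thrust::bit_or<unsigned>{});
    const auto num_unique = ends.first - bits_position.begin();
    thrust::scatter(bits_compact.begin(), bits_compact.begin() + num_unique,
                    bits_position.begin(), bits.begin());
}
```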
By the way, I am not sure how you plan to use it in symbolic Cholesky.

The most common use case in symbolic elimination tree computation is mapping between a set [0, n) of vertices and a subset (e.g. representatives of connected components, endpoints of tree edges, ...), and speeding up the CSR-to-COO conversion, which ran into some edge cases that caused a 100x-1000x slowdown compared to the improved algorithm in #1832.

(It is not related to this PR itself, but to the usage.)

I think you fundamentally misunderstand how this is used - it is an algorithmic component that is usually of size …

That's why I ask how you plan to use it.

Please take a look at #1832 for an example of how the data structure is used. I am not using it in symbolic Cholesky, only for the elimination forest computation and the aforementioned conversion so far. Other applications are operations similar to …

I see, it is clearer now: the index range is over the nnz, not the columns, for symbolic Cholesky.

If I'm filtering nodes or edges, the index space is of size n or nnz; if I'm representing row_ptrs, it is of size nnz + n.
Three nits left; the others look good to me, or are not necessary to block this PR.
- Some `uint32` usages should use `storage_type` from the bitvector.
- Add the unroll to the for loop with a known fixed size.
- Use `static_cast<bool>` rather than `bool()`.
- reformat documentation
- remove explicit uint32 usages
- rename member functions with get_
- unroll loops explicitly

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Yu-Hsiang M. Tsai <yhmtsai@gmail.com>
Pulled out of the symbolic Cholesky work: this adds a bitvector data structure that can be used to map from a full index range to a subset of indices in constant time $\mathcal O(1)$ instead of $\mathcal O(\log n)$. It can be used to either represent an increasing sequence of integers, or a boolean predicate on an index range. It can be used to get rid of a lot of binary searches.
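To illustrate the constant-time mapping, here is a minimal self-contained sketch of the rank idea behind such a data structure (not Ginkgo's actual API):

```cpp
#include <bit>  // std::popcount (C++20)
#include <cstdint>

// Sketch: bits stores one bit per index in 32-bit blocks, ranks stores
// the number of set bits before each block, so the rank of index i
// (its position within the subset) is computed in O(1), replacing an
// O(log n) binary search over the sorted subset.
inline std::int64_t rank(const std::uint32_t* bits, const std::int64_t* ranks,
                         std::int64_t i)
{
    const auto block = i / 32;
    const auto local = i % 32;
    const auto below = bits[block] & ((std::uint32_t{1} << local) - 1);
    return ranks[block] + std::popcount(below);
}
```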