
Conversation

@upsj upsj commented Apr 2, 2025

Pulled out of the symbolic Cholesky work:
This adds a bitvector data structure that can be used to map from a full index range to a subset of indices in constant time $\mathcal O(1)$ instead of $\mathcal O(\log n)$. It can represent either an increasing sequence of integers or a boolean predicate on an index range, and it can be used to get rid of a lot of binary searches.
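For intuition, here is a minimal host-side sketch of the idea (names and layout are illustrative assumptions, not the PR's actual interface; it assumes 32-bit storage blocks and a precomputed exclusive prefix popcount per block): membership tests and subset positions become two array reads plus a popcount, which is how the $\mathcal O(\log n)$ binary search turns into $\mathcal O(1)$.

```cpp
#include <bit>
#include <cstdint>
#include <vector>

// Minimal sketch, not the PR's interface: 32-bit blocks plus per-block
// exclusive prefix popcounts ("ranks").
struct bitvector_sketch {
    constexpr static int block_size = 32;
    std::vector<std::uint32_t> bits;  // one membership bit per index
    std::vector<int> ranks;           // set bits in all preceding blocks

    // O(1): is index i part of the subset?
    bool contains(int i) const
    {
        return (bits[i / block_size] >> (i % block_size)) & 1u;
    }

    // O(1): position of index i within the subset (replaces a binary search)
    int rank(int i) const
    {
        const auto block = i / block_size;
        const auto local = i % block_size;
        const auto prefix = (std::uint32_t{1} << local) - 1u;
        return ranks[block] + std::popcount(bits[block] & prefix);
    }
};
```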

@upsj upsj added the 1:ST:ready-for-review This PR is ready for review label Apr 2, 2025
@upsj upsj requested a review from a team April 2, 2025 13:41
@upsj upsj self-assigned this Apr 2, 2025
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:all This touches all Ginkgo modules. labels Apr 2, 2025
// (hunk truncated above: two bounds checks and the assignment to `bit` precede this)
device_predicate(base_i + local_i) ? 1 : 0;
mask |= bit << local_i;
}
}
Member

the above two conditions can be merged into base_i + local_i < size && local_i < block_size.

Member Author

I wanted the top one to potentially allow for full unrolling, since the loop trip count is known at compile time. But I'm not actually sure the compiler can always manage that.
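For illustration, the structure under discussion is roughly the following (a reconstruction from the review context, not the verbatim diff; `base_i`, `size`, `block_size`, and `device_predicate` follow the hunk above, the scaffolding is assumed):

```cpp
#include <cstdint>

// Reconstruction of the structure under discussion (not the verbatim diff):
// the first branch has a compile-time-constant trip count of block_size,
// the second handles the partial block at the end of the range.
template <int block_size, typename Predicate>
std::uint32_t build_mask(std::int64_t base_i, std::int64_t size,
                         Predicate device_predicate)
{
    std::uint32_t mask = 0;
    if (base_i + block_size <= size) {
        // full block: known trip count, candidate for full unrolling
        for (int local_i = 0; local_i < block_size; local_i++) {
            const std::uint32_t bit =
                device_predicate(base_i + local_i) ? 1 : 0;
            mask |= bit << local_i;
        }
    } else {
        // partial block: data-dependent trip count
        for (int local_i = 0; base_i + local_i < size; local_i++) {
            const std::uint32_t bit =
                device_predicate(base_i + local_i) ? 1 : 0;
            mask |= bit << local_i;
        }
    }
    return mask;
}
```

Merging the two conditions as suggested yields a single loop bounded by `base_i + local_i < size && local_i < block_size`, at the cost of the compile-time-constant trip count in the common case.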

array<IndexType> ranks{exec, num_blocks};
std::fill_n(bits.get_data(), num_blocks, 0);
std::fill_n(ranks.get_data(), num_blocks, 0);
for (auto i : irange{size}) {
Member

I prefer index_range rather than irange

Member Author

this is based on boost::irange for familiarity

@MarcelKoch MarcelKoch self-requested a review April 10, 2025 14:52
@MarcelKoch MarcelKoch left a comment

I have a suggestion for the device type: make the data members public. This would allow implementing the current member functions as free-standing functions, which would allow for backend-specific implementations. It would also allow adding a set function, which will require backend-specific atomics.
(This suggestion is not blocking however.)
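For concreteness, such a free-standing set function might look like this hypothetical CUDA sketch (the public view type, its member names, and the 32-bit block width are assumptions illustrating the suggestion, not code from the PR):

```cpp
// Hypothetical device view with public storage, as the suggestion proposes;
// the real device_bitvector in this PR keeps its members private.
struct device_bitvector_view {
    unsigned int* bits;
};

// Free-standing set function using a backend-specific atomic (CUDA here).
__device__ void set(device_bitvector_view bv, int i)
{
    constexpr int block_size = 32;  // assumed storage block width
    atomicOr(&bv.bits[i / block_size], 1u << (i % block_size));
}
```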

@upsj upsj commented Apr 20, 2025

@MarcelKoch the device view is meant as a read-only representation of the bitvector; I don't really want to encourage people to build them by hand without the helper algorithms.

@upsj upsj force-pushed the bitvector branch 3 times, most recently from 0386c4b to 2ee600e Compare April 20, 2025 20:47
@upsj upsj changed the base branch from develop to original_filename_in_generated_code April 20, 2025 21:04
@upsj upsj requested review from MarcelKoch and yhmtsai April 20, 2025 21:07
Base automatically changed from original_filename_in_generated_code to develop April 21, 2025 12:12
@MarcelKoch MarcelKoch left a comment

For me the biggest issue is the header in common/unified. I don't think we should start putting headers in there, especially for kernels that are available for all backends.

Member

This is still a blocker for me. I just don't see that we should put kernel headers (that are not used exclusively in common/unified) into common/unified.

Member Author

With a uniform transform_iterator implementation, I can also implement #1832 in common/unified. Would that be sufficient justification for putting it there?

Member

My issue is not that it includes non-uniform implementations from different backends, but that I don't really understand how this header is going to be used. If the kernels are only called from kernels implemented in common/unified, then it's fine with me. If this file is going to be included anywhere else, then I think it should be put under core.
In any case, I hope I have explained my reservations well enough, and I will not hold up the PR over them anymore. However, if the header gets included outside common/unified, I will bring it up again.

@upsj upsj Apr 28, 2025

For tests and kernels implemented in unified I would use this header; for other tests and kernels I would use the backend-specific one. Your reservations definitely make sense, and they made me reevaluate how I am using the headers.

@upsj upsj mentioned this pull request Apr 27, 2025
@yhmtsai yhmtsai left a comment

Some questions:

  • Do we need to use an iterator for from_sorted_indices?
  • What is the purpose of the transform iterator?
  • What about the index check (base_i + block_size <= size())?

assert(i < size());
const auto block = i / block_size;
const auto local = i % block_size;
return bool((bits_[block] >> local) & 1);
Member

maybe use static_cast<bool>?

auto value = it[i];
const auto [block, mask] =
    device_bitvector<index_type>::get_block_and_mask(value);
// (hunk truncated below: the atomic_ref sets the mask bits in the block's storage)
sycl::atomic_ref<storage_type, sycl::memory_order::relaxed,
Member

We have ballot from cooperative groups, which can eliminate the atomic usage.
You need to set the sub-group size specifically.

Member Author

Not sure I follow. This would require a segmented scan to get rid of some of the atomic operations.

Member

I think it should almost be like the kernel in cuda_hip.

mask = ballot(device_predicate(base_i + local_i))  // subgroup size needs to be 32
masks[base_i / bv::blocksize] = mask
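A hedged CUDA-style sketch of what that could look like for the predicate case (assuming the warp/subgroup size of 32 matches the storage block size; the kernel name and launch shape are illustrative, not the PR's cuda_hip kernel):

```cpp
#include <cstdint>

// One thread per index; __ballot_sync assembles the 32 predicate bits of a
// warp into one block mask, so no atomics are needed.
template <typename Predicate>
__global__ void from_predicate(std::int64_t size, Predicate device_predicate,
                               std::uint32_t* masks)
{
    const std::int64_t i =
        threadIdx.x + std::int64_t{blockIdx.x} * blockDim.x;
    // out-of-range lanes still participate in the ballot, contributing 0
    const bool pred = i < size && device_predicate(i);
    const std::uint32_t mask = __ballot_sync(0xffffffffu, pred);
    if (i < size && threadIdx.x % 32 == 0) {
        masks[i / 32] = mask;
    }
}
```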

Member Author

This is from_sorted_indices, not from_predicate

Member

I put it in the wrong place. I meant the predicate case.

Member Author

Yes, I realized that after writing my response 😄 We could do that, but I wanted to avoid the complexity of subgroup size dispatch, and since subgroups are sometimes smaller than 32, we would need to assemble the mask out of multiple elements. That definitely makes sense as a future optimization, but I wanted to keep the complexity of this PR smaller.

Comment on lines +148 to +151
array<storage_type> bits_compact{exec, num_blocks};
array<index_type> bits_position{exec, num_blocks};
array<storage_type> bits{exec, num_blocks};
array<index_type> ranks{exec, num_blocks};
Member

Compared to the others, this uses double the memory allocation.
Does it give better performance?

@upsj upsj Apr 28, 2025

It has a better worst-case behavior, since it doesn't suffer from atomic collisions. If we implemented our own kernel for it, we could also get rid of the second pair of arrays, but reduce_by_key has an interface that prevents us from scattering the results directly to the output.
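For illustration, the reduce_by_key path described here might look roughly like this Thrust-based sketch (function and variable names, index/storage types, and the 32-bit block width are assumptions, not the PR's kernel):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>

struct block_of {
    __host__ __device__ int operator()(int idx) const { return idx / 32; }
};
struct mask_of {
    __host__ __device__ unsigned operator()(int idx) const
    {
        return 1u << (idx % 32);
    }
};

// Sketch: build the bit blocks from sorted indices without atomics, at the
// cost of the two extra compacted arrays.
void build_blocks(const thrust::device_vector<int>& indices,
                  thrust::device_vector<unsigned>& bits)  // zero-initialized
{
    const auto n = indices.size();
    thrust::device_vector<int> bits_position(n);
    thrust::device_vector<unsigned> bits_compact(n);
    auto keys = thrust::make_transform_iterator(indices.begin(), block_of{});
    auto vals = thrust::make_transform_iterator(indices.begin(), mask_of{});
    // OR together the masks of all indices falling into the same block ...
    auto end = thrust::reduce_by_key(
        keys, keys + n, vals, bits_position.begin(), bits_compact.begin(),
        thrust::equal_to<int>{}, thrust::bit_or<unsigned>{});
    // ... then scatter the compacted blocks to their final positions, since
    // reduce_by_key cannot write to scattered output locations directly
    thrust::scatter(bits_compact.begin(), end.second, bits_position.begin(),
                    bits.begin());
}
```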

@yhmtsai yhmtsai commented Apr 27, 2025

By the way, I am not sure how you plan to use it in symbolic Cholesky.
I guess you will have some base as a starting point, to use a smaller size than the number of columns of the matrix.
Otherwise, the memory required will be #columns bits * #rows.

@upsj upsj commented Apr 27, 2025

The most common use case in symbolic elimination tree computation is mapping between a set [0, n) of vertices and a subset (e.g. representatives of connected components, endpoints of tree edges, ...), as well as speeding up the CSR-to-COO conversion, where the previous algorithm ran into edge cases that caused a 100x-1000x slowdown compared to the improved algorithm in #1832.

@yhmtsai yhmtsai commented Apr 28, 2025

(This is not related to this PR itself, but to its usage.)
If it always takes care of [0, n), we will need $\frac{2^{20} \cdot 2^{20}}{2^3}$ bytes $= 2^{37}$ bytes $\approx$ 128 GB for a $2^{20} \times 2^{20}$ matrix (~1e6 size).
That is quite a lot for a small matrix. Did I calculate something wrong?

@upsj upsj commented Apr 28, 2025

I think you fundamentally misunderstand how this is used: it is an algorithmic component that is usually of size n or nnz, but would never be used to represent a dense matrix.

@yhmtsai yhmtsai commented Apr 28, 2025

That's why I asked how you plan to use it.
Originally, I thought it was meant to represent the indices within a row: no matter how sparse the row is, it always takes at least n bits, right?
And then you would need a bitvector for each row, so n * n bits in total.

@upsj upsj commented Apr 28, 2025

Please take a look at #1832 for an example of how the data structure is used. I am not using it in symbolic Cholesky, only for the elimination forest computation and the aforementioned conversion so far. Other applications are operations similar to copy_if that need to map between a full index range and a subset repeatedly.
You will just have to take my word on it that this is a useful data structure 🙃
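As a small usage illustration of that copy_if-style pattern (hypothetical, reusing the bitvector_sketch from the top of the thread; `n` and `subset_size` are assumed inputs): once the structure is built over the full range, every kept element knows its output position from a constant-time rank query, with no per-element binary search.

```cpp
#include <vector>

// Hypothetical: compact the kept indices of [0, n) via O(1) rank queries.
std::vector<int> compact_indices(const bitvector_sketch& bv, int n,
                                 int subset_size)
{
    std::vector<int> output(subset_size);
    for (int i = 0; i < n; i++) {
        if (bv.contains(i)) {
            output[bv.rank(i)] = i;
        }
    }
    return output;
}
```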

@yhmtsai yhmtsai commented Apr 28, 2025

I see, it is clearer now: the index range is over the nnz, not the columns, and it is not about symbolic Cholesky itself.

@upsj upsj commented Apr 28, 2025

If I'm filtering nodes or edges, the index space is of size n or nnz; if I'm representing row_ptrs, it is of size nnz + n.

@yhmtsai yhmtsai left a comment

Three nits left. Everything else looks good to me or does not need to block this PR.

  1. Some uint32 usages should use storage_type from bitvector.
  2. Add unrolling to the for loops with known fixed trip count.
  3. Use static_cast<bool> rather than bool().

upsj and others added 2 commits April 30, 2025 15:36
- reformat documentation
- remove explicit uint32 usages
- rename member functions with get_
- unroll loops explicitly

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Yu-Hsiang M. Tsai <yhmtsai@gmail.com>
@upsj upsj added the 1:ST:no-changelog-entry Skip the wiki check for changelog update label Apr 30, 2025
@upsj upsj added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Apr 30, 2025
@upsj upsj merged commit e146f94 into develop Apr 30, 2025
8 of 11 checks passed
@upsj upsj deleted the bitvector branch April 30, 2025 18:49