parallel partitioned shuffle #50970


Open · wants to merge 2 commits into master
Conversation

@tlcz commented Aug 18, 2023

add ppshuffle, pprandperm to stdlib.Random (ppmisc.jl)

@tlcz (Author) commented Aug 18, 2023

Hi @JeffBezanson,
@bkamins mentioned some time ago that you expressed interest in a parallelized version of Random.shuffle that we developed for generating large random graphs, for possible inclusion in stdlib.Random. Here is a proposed implementation.
The method works in two steps: 1) partition the input into random partitions, then 2) shuffle the partitions in parallel if multi-threading is enabled. This has a twofold effect: 1) a speedup from better cache utilization and 2) a speedup from parallel processing.
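For intuition, the two phases can be sketched in a few lines. This is a simplified sequential illustration under names of my own choosing (`partitioned_shuffle` is not the PR's code); the per-partition loop in phase 2 is what becomes parallel in the actual implementation:

```julia
using Random

# Sketch of a partitioned shuffle (illustrative only, not the PR's code).
# Phase 1: scatter each element into one of `nparts` buckets chosen
# uniformly at random. Phase 2: shuffle each bucket independently;
# the buckets are disjoint, so this loop is the parallelizable part.
# Concatenating the shuffled buckets yields a uniform random permutation.
function partitioned_shuffle(rng::AbstractRNG, A::AbstractVector, nparts::Int)
    parts = [eltype(A)[] for _ in 1:nparts]
    for x in A                       # phase 1: random partitioning
        push!(parts[rand(rng, 1:nparts)], x)
    end
    for p in parts                   # phase 2: independent shuffles
        shuffle!(rng, p)
    end
    return reduce(vcat, parts)
end

v = partitioned_shuffle(Random.default_rng(), collect(1:10_000), 8)
isperm(v)  # true
```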
Here are examples run on my 4-core i5-9300H @2.4GHz with 4 julia threads:

```julia
julia> n = Int32(1e8);

julia> @time v = shuffle(Base.OneTo(n));
  2.953982 seconds (3 allocations: 381.470 MiB, 0.06% gc time)

julia> @time v = ppshuffle(Base.OneTo(n));
  0.408063 seconds (82 allocations: 381.485 MiB, 0.17% gc time)

julia> @time randperm!(v);
  2.761123 seconds

julia> @time pprandperm!(v);
  0.395918 seconds (80 allocations: 15.500 KiB)

julia> isperm(v)
true
```

Please let us know what you think.
Regards,
@tolcz

@tlcz (Author) commented Aug 19, 2023

Below is the rationale behind the method, recently presented at WAW2023 (slides 10-15, 18-21):
tolczak_waw2023.pdf

@oscardssmith added the performance (Must go faster) and randomness (Random number generation and the Random stdlib) labels Aug 19, 2023
@oscardssmith (Member) commented Aug 19, 2023

Imo if these are strictly better than the regular shuffle and randperm they should be the method used by default (presumably falling back to the single threaded case automatically for small arguments).

Edit: it also seems like the number of threads used should be user selectable.

Review comment on ppmisc.jl:

```julia
function ppshuffle!(r::TaskLocalRNG, B::AbstractArray{T}, A::Union{AbstractArray{T}, Base.OneTo{T}}) where {T<:Integer}
    nparts = max(2, (length(A) * sizeof(T)) >> 21)
```

(Member) commented: where does this come from?

@tlcz (Author) commented Aug 19, 2023
An experimental rule of thumb: target a partition size of 2 MiB (hence the >> 21) and a minimum partition count of 2. It should be replaced by a more robust heuristic in the future.
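Concretely, `>> 21` divides the input's byte size by 2^21 bytes = 2 MiB, so the line under review targets roughly 2 MiB per partition with a floor of two partitions. Restated as a standalone helper (the function name `nparts` is mine):

```julia
# ~2 MiB target partition size, minimum of 2 partitions
nparts(len, elsize) = max(2, (len * elsize) >> 21)

nparts(10^8, sizeof(Int32))   # 190 partitions for the 10^8 Int32 benchmark
nparts(1_000, sizeof(Int32))  # 2: small inputs still get the floor
```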

@bkamins (Member) commented Aug 19, 2023

> Imo if these are strictly better than the regular shuffle and randperm they should be the method used by default (presumably falling back to the single threaded case automatically for small arguments).

These functions use more memory than the standard functions, so there is a trade-off (though it is usually worth paying, as the overhead is small). @tolcz - can you comment please on the memory allocation comparison? Thank you!

@tlcz (Author) commented Aug 19, 2023

Hi, thank you for the review and for your comments.

> Imo if these are strictly better than the regular shuffle and randperm they should be the method used by default (presumably falling back to the single threaded case automatically for small arguments).

A fallback to sequential processing for input that is not 'large enough' for parallel processing is a good idea. In fact, it was implemented in an earlier version of the code and could easily be restored. I removed it because finding an optimal 'transition size' is platform-dependent, so I decided to separate the methods and leave the choice to the user, at least for now.
In addition, note the different signatures: shuffle! is in-place while ppshuffle! is not.

> Edit: it also seems like the number of threads used should be user selectable.

I agree, such flexibility is desirable. The current version simply uses the number of threads available to the running Julia process. I will consider this change for a subsequent release.
It is worth noting, however, that this is a limitation of the @threads macro rather than of the code itself. As far as I am aware, there is currently no way to set the number of threads @threads uses at runtime.
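One possible workaround (my own sketch, not part of this PR) is to bypass `@threads` entirely and spawn a user-chosen number of tasks over index chunks with `Threads.@spawn`; the helper name `foreach_limited` is hypothetical:

```julia
# Run `f` over 1:nitems using at most `ntasks` concurrent tasks,
# instead of letting Threads.@threads claim every available thread.
function foreach_limited(f, nitems::Int, ntasks::Int)
    chunks = Iterators.partition(1:nitems, cld(nitems, ntasks))
    tasks = [Threads.@spawn foreach(f, chunk) for chunk in chunks]
    foreach(wait, tasks)
end

acc = zeros(Int, 16)
foreach_limited(i -> (acc[i] = i), 16, 4)  # at most 4 tasks in flight
```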

@tlcz (Author) commented Aug 19, 2023

> These functions use more memory than standard functions so there is a trade-off (that is usually worth to pay though as the overhead is small). @tolcz - can you comment please on the memory allocation comparison? Thank you!

Yes, as is usual for problems that are not embarrassingly parallel, parallel processing carries some overhead, and the problem size should be 'large enough' to compensate for it. In our case there is an auxiliary nparts × nthreads() Array{Int} for tracking the decomposition and reassembly of the input array. In addition, ppshuffle! (in contrast to shuffle!) is not in-place and requires an O(n)-size output array.
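A rough back-of-envelope with the benchmark's sizes (assuming 64-bit `Int` for the bookkeeping array; the helper name `aux_bytes` is mine):

```julia
# Bookkeeping: an nparts × nthreads() Array{Int}
aux_bytes(nparts, nthreads) = nparts * nthreads * sizeof(Int)

aux_bytes(190, 4)     # 6_080 bytes: kilobytes, the same order as the
                      # ~15.5 KiB total that pprandperm! reported above
10^8 * sizeof(Int32)  # 400_000_000 bytes (~381.5 MiB): the O(n) output
                      # array ppshuffle! needs because it is not in-place
```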
