Skip to content

speed-up randperm by using our current rand(1:n) #50509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rfourquet
Copy link
Member

@rfourquet rfourquet commented Jul 11, 2023

And similarly for randcycle and shuffle.

We had a custom version of range generation for randperm, which was based on the ideas of our previous default range sampler SamplerRangeFast (generate k-bits integers using masking and reject out-of-range ones) and took advantage of the fact that randperm needs to generate rand(1:i) for i = 2:n.

But our current range sampler ("Nearly Division Less") is usually better than this hack, and makes these functions more readable. Typically, for array lengths < 2^20, the new version is faster, but gets slightly slower beyond 2^22.

Here are some speedups:
randperm-ndl-speedup

The slow down for big arrays seems fine to me, but I will see if I can find an easy workaround.

Fix #57771.

@rfourquet rfourquet added performance Must go faster randomness Random number generation and the Random stdlib labels Jul 11, 2023
## rand Less Than Masked 52 bits (helper function)

"Return a sampler generating a random `Int` (masked with `mask`) in ``[0, n)``, when `n <= 2^52`."
ltm52(n::Int, mask::Int=nextpow(2, n)-1) = LessThan(n-1, Masked(mask, UInt52Raw(Int)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is not exported, I see that it is referenced in at least two packages:

SimpleChains.jl: https://github.com/PumasAI/SimpleChains.jl/blob/f028d69679d47f11d35e7f311abdf0d1d3bfab9c/src/utils.jl#L114

REPLference.jl: https://github.com/udohjeremiah/REPLference.jl/blob/f3801aac2713ee5f19705513d8318e0923e0bee7/src/_22_random.jl#L170

Should we keep it to avoid breakage? Or just submit PRs to update those packages to remove the obsolete function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch! I pretty much prefer deleting this function, so will attempt PRs against these packages.

@rfourquet
Copy link
Member Author

Revisiting this: I lost the original benchmark code, so I don't remember if it was using an explicit rng, or the implicit one. Here is a more comprehensive graph, which shows the same tendencies:

new-shuffle

However, if #58089 would be merged, the changes here would seem to always be an improvement:

new-shuffle-ndl

@jakobnissen
Copy link
Member

Does this close #57771 ?

@rfourquet
Copy link
Member Author

Does this close #57771 ?

Yes.

@adienes
Copy link
Member

adienes commented Apr 14, 2025

The slow down for big arrays seems fine to me

me as well, given that the larger the shuffled array gets the more likely that the user will want to do it with a distributed algorithm anyway, like the hypothetical JuliaFolds2/OhMyThreads.jl#139 (Rust impl here https://github.com/manpen/rip_shuffle)

@rfourquet
Copy link
Member Author

Nice reference! But also, with #58089, I believe there won't actually be slowdowns anyway.

oscardssmith pushed a commit that referenced this pull request Apr 24, 2025
It's hard to measure the improvement with single calls, but this change
substantially improve the situation in #50509, such that these new
versions of `randperm` etc are almost always faster (even for big n).

Here are some example benchmarks. Note that biggest ranges like
`UInt(0):UInt(2)^64-2` are the ones exercising the most the "unlikely"
branch:
```julia
julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master
```

So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes
much faster, but without, `rands(1:100)` becomes slower.

Assuming #50509 was merged, `shuffle` is a good function to benchmark
`rand(1:n)`, and the changes here consistently improve performance, as
shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG
argument was passed to the function):

![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)

So although there can be slowdowns, I think this change is overall a
win.
LebedevRI pushed a commit to LebedevRI/julia that referenced this pull request May 2, 2025
It's hard to measure the improvement with single calls, but this change
substantially improve the situation in JuliaLang#50509, such that these new
versions of `randperm` etc are almost always faster (even for big n).

Here are some example benchmarks. Note that biggest ranges like
`UInt(0):UInt(2)^64-2` are the ones exercising the most the "unlikely"
branch:
```julia
julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master
```

So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes
much faster, but without, `rands(1:100)` becomes slower.

Assuming JuliaLang#50509 was merged, `shuffle` is a good function to benchmark
`rand(1:n)`, and the changes here consistently improve performance, as
shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG
argument was passed to the function):

![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)

So although there can be slowdowns, I think this change is overall a
win.
rfourquet added a commit that referenced this pull request May 2, 2025
In #58089, this method took a small performance hit in some contexts.
It turns out that by outlining unlikely branch which throw on empty
ranges, his hit can be recovered.

In #50509 (comment),
a graph of the performance improvement of the "speed-up randperm by using
our current rand(1:n)" was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was
overall slower for `TaskLocalRNG()` for very big arrays (but still faster
otherwise).

An alternative to these `@inline` annotation is to outline `throw` like here,
for equivalent benefits as `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves roughly performance by 2x for
`TaskLocalRNG()` (no change for other RNGs).
rfourquet added a commit that referenced this pull request May 9, 2025
In #58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
#50509 (comment), a
graph of the performance improvement of the "speed-up randperm by using
our current rand(1:n)" was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` for very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotation is to outline `throw` like
here, for equivalent benefits as `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves roughly performance by 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unliky throwing branches.

After that, #50509 can probably be merged, finally!
And similarly for `randcycle` and `shuffle`.

We had a custom version of range generation for `randperm`, which was based
on the ideas of our previous default range sampler `SamplerRangeFast`
(generate `k`-bits integers using masking and reject out-of-range ones) and
took advantage of the fact that `randperm` needs to generate `rand(1:i)` for
`i = 2:n`.

But our current range sampler ("Nearly Division Less") is usually better than
this hack, and makes these functions more readable.
Typically, for array lengths `< 2^20`, the new version is faster, but gets
slightly slower beyond 2^22.
mask = 3
@inbounds for i = 2:n
j = 1 + rand(r, ltm52(i, mask))
@inbounds for i = 2:length(a)
Copy link
Member

@jakobnissen jakobnissen May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this @inbounds, given that this is out-of-bounds for OffsetArrays? Benchmarking in #57771 suggests this @inbounds has almost no effect on speed.
Edit: Ah, there is a require_one_based_indexing above. Nonetheless.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently support one one-based indexing, but yes I will re-benchmark without the @inbounds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it doesn't make a big difference, but still bring some speed-up, so I will prefer not changing that here. E.g.

julia> @btime randperm!($([1:2^12;]));
  5.636 μs (0 allocations: 0 bytes) # with @inbounds
  6.562 μs (0 allocations: 0 bytes) # without @inbounds

mask = 3
@inbounds for i = 2:n
j = 1 + rand(r, ltm52(i, mask))
@inbounds for i = 2:length(a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@inbounds for i = 2:length(a)
for i = 2:length(a)

```
"""
function randperm!(r::AbstractRNG, a::Array{<:Integer})
function randperm!(rng::AbstractRNG, a::Array{<:Integer})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this, and randcycle! are not defined for AbstractArray?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I also just noticed that!

charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
)

In JuliaLang#58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
JuliaLang#50509 (comment), a
graph of the performance improvement of the "speed-up randperm by using
our current rand(1:n)" was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` for very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotation is to outline `throw` like
here, for equivalent benefits as `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves roughly performance by 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unliky throwing branches.

After that, JuliaLang#50509 can probably be merged, finally!
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
)

In JuliaLang#58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
JuliaLang#50509 (comment), a
graph of the performance improvement of the "speed-up randperm by using
our current rand(1:n)" was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` for very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotation is to outline `throw` like
here, for equivalent benefits as `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves roughly performance by 2x
for `TaskLocalRNG()` (no change for other RNGs):

![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unliky throwing branches.

After that, JuliaLang#50509 can probably be merged, finally!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shuffle! is as much as 2x slower than a naive implementation
5 participants