speed-up `randperm` by using our current `rand(1:n)` #50509

rfourquet · 2023-07-11T16:22:56Z

And similarly for randcycle and shuffle.

We had a custom version of range generation for randperm, which was based on the ideas of our previous default range sampler SamplerRangeFast (generate k-bits integers using masking and reject out-of-range ones) and took advantage of the fact that randperm needs to generate rand(1:i) for i = 2:n.

But our current range sampler ("Nearly Division Less") is usually better than this hack, and makes these functions more readable. Typically, for array lengths < 2^20, the new version is faster, but gets slightly slower beyond 2^22.

Here are some speedups:

The slow down for big arrays seems fine to me, but I will see if I can find an easy workaround.

Fix #57771.

stevengj · 2023-07-19T18:15:27Z

stdlib/Random/src/misc.jl

-## rand Less Than Masked 52 bits (helper function)
-
-"Return a sampler generating a random `Int` (masked with `mask`) in ``[0, n)``, when `n <= 2^52`."
-ltm52(n::Int, mask::Int=nextpow(2, n)-1) = LessThan(n-1, Masked(mask, UInt52Raw(Int)))


Although this is not exported, I see that it is referenced in at least two packages:

SimpleChains.jl: https://github.com/PumasAI/SimpleChains.jl/blob/f028d69679d47f11d35e7f311abdf0d1d3bfab9c/src/utils.jl#L114

REPLference.jl: https://github.com/udohjeremiah/REPLference.jl/blob/f3801aac2713ee5f19705513d8318e0923e0bee7/src/_22_random.jl#L170

Should we keep it to avoid breakage? Or just submit PRs to update those packages to remove the obsolete function?

Oh good catch! I pretty much prefer deleting this function, so will attempt PRs against these packages.

rfourquet · 2025-04-12T14:22:44Z

Revisiting this: I lost the original benchmark code, so I don't remember if it was using an explicit rng, or the implicit one. Here is a more comprehensive graph, which shows the same tendencies:

However, if #58089 would be merged, the changes here would seem to always be an improvement:

jakobnissen · 2025-04-12T16:54:16Z

Does this close #57771 ?

rfourquet · 2025-04-13T10:14:32Z

Does this close #57771 ?

Yes.

adienes · 2025-04-14T14:15:43Z

The slow down for big arrays seems fine to me

me as well, given that the larger the shuffled array gets the more likely that the user will want to do it with a distributed algorithm anyway, like the hypothetical JuliaFolds2/OhMyThreads.jl#139 (Rust impl here https://github.com/manpen/rip_shuffle)

rfourquet · 2025-04-14T17:18:57Z

Nice reference! But also, with #58089, I believe there won't actually be slowdowns anyway.

@b

It's hard to measure the improvement with single calls, but this change substantially improve the situation in #50509, such that these new versions of `randperm` etc are almost always faster (even for big n). Here are some example benchmarks. Note that biggest ranges like `UInt(0):UInt(2)^64-2` are the ones exercising the most the "unlikely" branch: ```julia julia> const xx = Xoshiro(); using Chairmarks julia> rands(rng, ns) = for i=ns rand(rng, zero(i):i) end julia> rands(ns) = for i=ns rand(zero(i):i) end julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1) (1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR (2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master julia> @b rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG (2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR (3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master julia> rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2) (95.315 ns, 132.144 ns, 7.486 ns) # PR (217.169 ns, 143.519 ns, 8.065 ns) # master julia> rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2) (235.882 ns, 162.809 ns, 10.603 ns) # PR (202.524 ns, 132.869 ns, 7.631 ns) # master ``` So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes much faster, but without, `rands(1:100)` becomes slower. Assuming #50509 was merged, `shuffle` is a good function to benchmark `rand(1:n)`, and the changes here consistently improve performance, as shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG argument was passed to the function): ![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c) So although there can be slowdowns, I think this change is overall a win.

@b

It's hard to measure the improvement with single calls, but this change substantially improve the situation in JuliaLang#50509, such that these new versions of `randperm` etc are almost always faster (even for big n). Here are some example benchmarks. Note that biggest ranges like `UInt(0):UInt(2)^64-2` are the ones exercising the most the "unlikely" branch: ```julia julia> const xx = Xoshiro(); using Chairmarks julia> rands(rng, ns) = for i=ns rand(rng, zero(i):i) end julia> rands(ns) = for i=ns rand(zero(i):i) end julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1) (1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR (2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master julia> @b rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG (2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR (3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master julia> rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2) (95.315 ns, 132.144 ns, 7.486 ns) # PR (217.169 ns, 143.519 ns, 8.065 ns) # master julia> rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2) (235.882 ns, 162.809 ns, 10.603 ns) # PR (202.524 ns, 132.869 ns, 7.631 ns) # master ``` So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes much faster, but without, `rands(1:100)` becomes slower. Assuming JuliaLang#50509 was merged, `shuffle` is a good function to benchmark `rand(1:n)`, and the changes here consistently improve performance, as shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG argument was passed to the function): ![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c) So although there can be slowdowns, I think this change is overall a win.

In #58089, this method took a small performance hit in some contexts. It turns out that by outlining unlikely branch which throw on empty ranges, his hit can be recovered. In #50509 (comment), a graph of the performance improvement of the "speed-up randperm by using our current rand(1:n)" was posted, but I realized it was only true when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was overall slower for `TaskLocalRNG()` for very big arrays (but still faster otherwise). An alternative to these `@inline` annotation is to outline `throw` like here, for equivalent benefits as `@inline` in that `randperm` PR. Assuming that PR is merged, this PR improves roughly performance by 2x for `TaskLocalRNG()` (no change for other RNGs).

In #58089, this method took a small performance hit in some contexts. It turns out that by outlining the unlikely branch which throws on empty ranges, this hit can be recovered. In #50509 (comment), a graph of the performance improvement of the "speed-up randperm by using our current rand(1:n)" was posted, but I realized it was only true when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was overall slower for `TaskLocalRNG()` for very big arrays (but still faster otherwise). An alternative to these `@inline` annotation is to outline `throw` like here, for equivalent benefits as `@inline` in that `randperm` PR. Assuming that PR is merged, this PR improves roughly performance by 2x for `TaskLocalRNG()` (no change for other RNGs): ![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7) While at it, I outlined a bunch of other unliky throwing branches. After that, #50509 can probably be merged, finally!

And similarly for `randcycle` and `shuffle`. We had a custom version of range generation for `randperm`, which was based on the ideas of our previous default range sampler `SamplerRangeFast` (generate `k`-bits integers using masking and reject out-of-range ones) and took advantage of the fact that `randperm` needs to generate `rand(1:i)` for `i = 2:n`. But our current range sampler ("Nearly Division Less") is usually better than this hack, and makes these functions more readable. Typically, for array lengths `< 2^20`, the new version is faster, but gets slightly slower beyond 2^22.

jakobnissen · 2025-05-09T12:06:37Z

stdlib/Random/src/misc.jl

-    mask = 3
-    @inbounds for i = 2:n
-        j = 1 + rand(r, ltm52(i, mask))
+    @inbounds for i = 2:length(a)


Can we delete this @inbounds, given that this is out-of-bounds for OffsetArrays? Benchmarking in #57771 suggests this @inbounds has almost no effect on speed.
Edit: Ah, there is a require_one_based_indexing above. Nonetheless.

We currently support one one-based indexing, but yes I will re-benchmark without the @inbounds.

So it doesn't make a big difference, but still bring some speed-up, so I will prefer not changing that here. E.g.

julia> @btime randperm!($([1:2^12;])); 5.636 μs (0 allocations: 0 bytes) # with @inbounds 6.562 μs (0 allocations: 0 bytes) # without @inbounds

jakobnissen · 2025-05-09T12:07:05Z

stdlib/Random/src/misc.jl

-    mask = 3
-    @inbounds for i = 2:n
-        j = 1 + rand(r, ltm52(i, mask))
+    @inbounds for i = 2:length(a)


Suggested change

@inbounds for i = 2:length(a)

for i = 2:length(a)

jakobnissen · 2025-05-09T12:09:04Z

stdlib/Random/src/misc.jl

 ```
 """
-function randperm!(r::AbstractRNG, a::Array{<:Integer})
+function randperm!(rng::AbstractRNG, a::Array{<:Integer})


Any reason this, and randcycle! are not defined for AbstractArray?

Interesting, I also just noticed that!

) In JuliaLang#58089, this method took a small performance hit in some contexts. It turns out that by outlining the unlikely branch which throws on empty ranges, this hit can be recovered. In JuliaLang#50509 (comment), a graph of the performance improvement of the "speed-up randperm by using our current rand(1:n)" was posted, but I realized it was only true when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was overall slower for `TaskLocalRNG()` for very big arrays (but still faster otherwise). An alternative to these `@inline` annotation is to outline `throw` like here, for equivalent benefits as `@inline` in that `randperm` PR. Assuming that PR is merged, this PR improves roughly performance by 2x for `TaskLocalRNG()` (no change for other RNGs): ![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7) While at it, I outlined a bunch of other unliky throwing branches. After that, JuliaLang#50509 can probably be merged, finally!

rfourquet added performance Must go faster randomness Random number generation and the Random stdlib labels Jul 11, 2023

stevengj reviewed Jul 19, 2023

View reviewed changes

rfourquet mentioned this pull request Apr 12, 2025

faster rand(1:n) by outlining unlikely branch #58089

Merged

rfourquet force-pushed the rf/randperm-ndl branch from 5d1ca6a to e142f09 Compare April 12, 2025 14:27

rfourquet force-pushed the rf/randperm-ndl branch from e142f09 to 411ab4a Compare April 26, 2025 15:59

rfourquet mentioned this pull request May 2, 2025

faster rand(TaskLocalRNG(), 1:n) by outlining throw #58306

Merged

jakobnissen suggested changes May 9, 2025

View reviewed changes

rfourquet force-pushed the rf/randperm-ndl branch from 411ab4a to cfe1c68 Compare May 9, 2025 12:17

jakobnissen mentioned this pull request Jun 10, 2025

Random: make randperm! and randcycle! accept AbstractArray #58596

Merged

Merge branch 'master' into rf/randperm-ndl

7742c6e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

speed-up `randperm` by using our current `rand(1:n)` #50509

speed-up `randperm` by using our current `rand(1:n)` #50509

Uh oh!

rfourquet commented Jul 11, 2023 •

edited by giordano

Loading

Uh oh!

stevengj Jul 19, 2023

Uh oh!

rfourquet Jul 20, 2023

Uh oh!

rfourquet commented Apr 12, 2025

Uh oh!

jakobnissen commented Apr 12, 2025

Uh oh!

rfourquet commented Apr 13, 2025

Uh oh!

adienes commented Apr 14, 2025 •

edited

Loading

Uh oh!

rfourquet commented Apr 14, 2025

Uh oh!

jakobnissen May 9, 2025 •

edited

Loading

Uh oh!

rfourquet May 9, 2025

Uh oh!

rfourquet May 9, 2025

Uh oh!

jakobnissen May 9, 2025

Uh oh!

jakobnissen May 9, 2025

Uh oh!

rfourquet May 9, 2025

Uh oh!

Uh oh!

Uh oh!

speed-up randperm by using our current rand(1:n) #50509

Are you sure you want to change the base?

speed-up randperm by using our current rand(1:n) #50509

Uh oh!

Conversation

rfourquet commented Jul 11, 2023 • edited by giordano Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stevengj Jul 19, 2023

Choose a reason for hiding this comment

Uh oh!

rfourquet Jul 20, 2023

Choose a reason for hiding this comment

Uh oh!

rfourquet commented Apr 12, 2025

Uh oh!

jakobnissen commented Apr 12, 2025

Uh oh!

rfourquet commented Apr 13, 2025

Uh oh!

adienes commented Apr 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rfourquet commented Apr 14, 2025

Uh oh!

jakobnissen May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rfourquet May 9, 2025

Choose a reason for hiding this comment

Uh oh!

rfourquet May 9, 2025

Choose a reason for hiding this comment

Uh oh!

jakobnissen May 9, 2025

Choose a reason for hiding this comment

Uh oh!

jakobnissen May 9, 2025

Choose a reason for hiding this comment

Uh oh!

rfourquet May 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

speed-up `randperm` by using our current `rand(1:n)` #50509

speed-up `randperm` by using our current `rand(1:n)` #50509

rfourquet commented Jul 11, 2023 •

edited by giordano

Loading

adienes commented Apr 14, 2025 •

edited

Loading

jakobnissen May 9, 2025 •

edited

Loading