I suspect most of the difference is due to the 32- vs 64-bit integers. I changed the function to use the type of its input throughout:
function loops(u::T)::T where {T}
    a = zeros(T, 10^4)                  # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] += r                       # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end
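As a quick sanity check that the integer width is really what changes the generated code (and not, say, bounds checking), you can compare the LLVM IR of the two specializations. This is just an inspection sketch, not part of the benchmark; the difference you'd expect to see is the inner-loop arithmetic being done on i32 vs i64 values.

using InteractiveUtils  # provides @code_llvm (loaded automatically in the REPL)

# Compare the IR generated for the Int64 and Int32 specializations of loops
@code_llvm debuginfo=:none loops(10)          # Int64 specialization
@code_llvm debuginfo=:none loops(Int32(10))   # Int32 specialization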
On an older Intel CPU, I get:
julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, ivybridge)
Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
julia> @benchmark loops(10)
BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
Range (min … max): 326.743 ms … 335.066 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 328.532 ms ┊ GC (median): 0.00%
Time (mean ± σ): 328.833 ms ± 1.961 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁▁ █ ▁▁█ ▁▁ ▁ ▁ ▁ ▁
█▁█▁██▁▁█▁▁███▁██▁▁▁█▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
327 ms Histogram: frequency by time 335 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
Range (min … max): 211.762 ms … 217.078 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 213.031 ms ┊ GC (median): 0.00%
Time (mean ± σ): 213.261 ms ± 1.170 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█ ▃ ▃ ▃
▇▇▁▇▁▇▁▇▁██▁▇▁█▇▇▁▇▇█▁▁▁▁▁▁▇▁▁▇▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
212 ms Histogram: frequency by time 217 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
The Int32 benchmark is pretty much equal to the C benchmark.
[loops] $ gcc benchmark.c loops.c -lm -O3
[loops] $ ./a.out 2000 3000 10
...
..
240.228608,2.209823,238.869205,246.366234,9,54136
There seems to be some CPU dependence. I also ran things on a much newer AMD system and got the same time for Int32 and Int64, both of which match the C version.
julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 9900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 12 default, 0 interactive, 6 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 12
julia> @benchmark loops(10)
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.854 ms … 108.336 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.958 ms ± 103.468 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ▄ █▄▄█▁ ▄ ▁
█▁█▆█▆█████▆▆▁▆█▆█▁▆▁▆▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▆▁▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
108 ms Histogram: frequency by time 108 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.177 ms … 108.378 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.295 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.338 ms ± 193.199 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂ ▅
▅▄▅██████▄▄▄▁▄▁▁▄▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
107 ms Histogram: frequency by time 108 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
[loops] $ gcc benchmark.c loops.c -O3 -lm
[loops] $ ./a.out 2000 2000 10
..
..
106.522007,0.079939,106.462386,106.836421,19,54333
Removing allocations, as in the loops_noalloc function below, makes little difference (results not shown). The loops_fast variant defined alongside it additionally uses LoopVectorization; more on that afterwards.
using StaticArrays, LoopVectorization

function loops_noalloc(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)         # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] += r                       # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end

function loops_fast(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)         # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @turbo for i in T(1):T(10000)       # Outer loop over array indices
        for j in T(1):T(10000)          # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] = a[i] + r                 # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end
However, you can make things much faster by using LoopVectorization (admittedly, it's questionable whether this is in the spirit of the original language-comparison benchmark). On my newer AMD machine, I get:
julia> @benchmark loops_fast(Int64(10))
BenchmarkTools.Trial: 2986 samples with 1 evaluation per sample.
Range (min … max): 1.672 ms … 1.775 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.674 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.674 ms ± 4.688 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇▄▇█▆▄▂▁▁ ▁
███████████▇▅▇▆▇▆▆▅▆▅▆▆▆▅▄▅▅▅▃▆▃▁▅▄▄▅▄▅▆▄▅▄▁▄▁▆▅▁▄▅▆▆▃▅▅▆ █
1.67 ms Histogram: log(frequency) by time 1.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark loops_fast(Int32(10))
BenchmarkTools.Trial: 5926 samples with 1 evaluation per sample.
Range (min … max): 841.047 μs … 1.289 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 842.871 μs ┊ GC (median): 0.00%
Time (mean ± σ): 843.270 μs ± 8.455 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄▅▂▂▂▇▇█▅▄▃▂▁ ▂
████████████████▇▇▇▆▅▇▆▅▆▆▆▅▆▅▆▆▅▆▅▆▅▇▅▆▇▇▆▆▆▄▅▁▄▅▄▃▃▅▃▄▁▄▅ █
841 μs Histogram: log(frequency) by time 855 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
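For completeness, this is roughly the setup assumed above (BenchmarkTools for @benchmark, plus the two packages from the using line); exact timings will of course depend on your CPU.

using Pkg
Pkg.add(["BenchmarkTools", "StaticArrays", "LoopVectorization"])  # one-time setup

using BenchmarkTools, StaticArrays, LoopVectorization

@benchmark loops(Int32(10))          # plain Vector version
@benchmark loops_noalloc(Int32(10))  # MVector version
@benchmark loops_fast(Int32(10))     # @turbo version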