I suspect most of the difference is due to the 32- vs 64-bit integers. I changed the function to use the type of its input throughout:
function loops(u::T)::T where {T}
    a = zeros(T, 10^4)                  # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] += r                       # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end
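As a quick sanity check that the integer width is really what changes the generated code (and not, say, bounds checking), you can compare the LLVM IR of the two specializations. This is just an inspection sketch, not part of the benchmark; the difference you'd expect to see is the inner-loop arithmetic being done on i32 vs i64 values.

using InteractiveUtils  # provides @code_llvm (loaded automatically in the REPL)

# Compare the IR generated for the Int64 and Int32 specializations of loops
@code_llvm debuginfo=:none loops(10)          # Int64 specialization
@code_llvm debuginfo=:none loops(Int32(10))   # Int32 specialization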
On an older Intel CPU, I get:
julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, ivybridge)
Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
julia> @benchmark loops(10)
BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
Range (min … max): 326.743 ms … 335.066 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 328.532 ms ┊ GC (median): 0.00%
Time (mean ± σ): 328.833 ms ± 1.961 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁▁ █ ▁▁█ ▁▁ ▁ ▁ ▁ ▁
█▁█▁██▁▁█▁▁███▁██▁▁▁█▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
327 ms Histogram: frequency by time 335 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
Range (min … max): 211.762 ms … 217.078 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 213.031 ms ┊ GC (median): 0.00%
Time (mean ± σ): 213.261 ms ± 1.170 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█ ▃ ▃ ▃
▇▇▁▇▁▇▁▇▁██▁▇▁█▇▇▁▇▇█▁▁▁▁▁▁▇▁▁▇▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
212 ms Histogram: frequency by time 217 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
The Int32 benchmark is pretty much equal to the C benchmark.
[loops] $ gcc benchmark.c loops.c -lm -O3
[loops] $ ./a.out 2000 3000 10
...
..
240.228608,2.209823,238.869205,246.366234,9,54136
There seems to be some CPU dependence. I also ran things on a much newer AMD system and got the same time for Int32 and Int64, both of which match the C version.
julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 9900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 12 default, 0 interactive, 6 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 12
julia> @benchmark loops(10)
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.854 ms … 108.336 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.958 ms ± 103.468 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ▄ █▄▄█▁ ▄ ▁
█▁█▆█▆█████▆▆▁▆█▆█▁▆▁▆▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▆▁▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
108 ms Histogram: frequency by time 108 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.177 ms … 108.378 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.295 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.338 ms ± 193.199 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂ ▅
▅▄▅██████▄▄▄▁▄▁▁▄▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
107 ms Histogram: frequency by time 108 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
[loops] $ gcc benchmark.c loops.c -O3 -lm
[loops] $ ./a.out 2000 2000 10
..
..
106.522007,0.079939,106.462386,106.836421,19,54333
Removing allocations, as in the loops_noalloc function below, makes little difference (results not shown). The loops_fast variant defined alongside it additionally uses LoopVectorization; more on that afterwards.
using StaticArrays, LoopVectorization

function loops_noalloc(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)         # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] += r                       # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end

function loops_fast(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)         # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @turbo for i in T(1):T(10000)       # Outer loop over array indices
        for j in T(1):T(10000)          # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u               # Simple sum
        end
        a[i] = a[i] + r                 # Add a random value to each element in array
    end
    return @inbounds a[r]               # Return the element at the random index
end
However, you can make things much faster by using LoopVectorization (admittedly, it's questionable whether this is in the spirit of the original language-comparison benchmark). On my newer AMD machine, I get:
julia> @benchmark loops_fast(Int64(10))
BenchmarkTools.Trial: 2986 samples with 1 evaluation per sample.
Range (min … max): 1.672 ms … 1.775 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.674 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.674 ms ± 4.688 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇▄▇█▆▄▂▁▁ ▁
███████████▇▅▇▆▇▆▆▅▆▅▆▆▆▅▄▅▅▅▃▆▃▁▅▄▄▅▄▅▆▄▅▄▁▄▁▆▅▁▄▅▆▆▃▅▅▆ █
1.67 ms Histogram: log(frequency) by time 1.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark loops_fast(Int32(10))
BenchmarkTools.Trial: 5926 samples with 1 evaluation per sample.
Range (min … max): 841.047 μs … 1.289 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 842.871 μs ┊ GC (median): 0.00%
Time (mean ± σ): 843.270 μs ± 8.455 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄▅▂▂▂▇▇█▅▄▃▂▁ ▂
████████████████▇▇▇▆▅▇▆▅▆▆▆▅▆▅▆▆▅▆▅▆▅▇▅▆▇▇▆▆▆▄▅▁▄▅▄▃▃▅▃▄▁▄▅ █
841 μs Histogram: log(frequency) by time 855 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
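For completeness, this is roughly the setup assumed above (BenchmarkTools for @benchmark, plus the two packages from the using line); exact timings will of course depend on your CPU.

using Pkg
Pkg.add(["BenchmarkTools", "StaticArrays", "LoopVectorization"])  # one-time setup

using BenchmarkTools, StaticArrays, LoopVectorization

@benchmark loops(Int32(10))          # plain Vector version
@benchmark loops_noalloc(Int32(10))  # MVector version
@benchmark loops_fast(Int32(10))     # @turbo version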