I just ran your script with two changes:
- changed the last line to @time main()
- removed using Distributions (not used)
Then I ran:
time julia trnbias.jl 0 200 0.01 200
and got:
# lots of printing
5.192750 seconds (7.82 M allocations: 6.508 GiB, 2.97% gc time, 0.09% compilation time)
real 0m6,319s
user 0m6,393s
sys 0m0,327s
Thus: 1) compilation time is not an issue; 2) Julia startup is a minor portion of the time (~1s).
Then I added the missing @views indicated above, with:
short_sum = sum(@view(x[i - ishort + 1:i + 1]))
long_sum = sum(@view(x[i - ilong + 1:i + 1]))
# and here
for i in (long_term:ncases-2)
short_mean = sum(@view(x[i - short_term + 1:i + 1])) / short_term
long_mean = sum(@view(x[i - long_term + 1:i + 1])) / long_term
And I get:
3.949312 seconds (3.67 k allocations: 289.875 KiB, 0.11% compilation time)
real 0m5,095s
user 0m5,180s
sys 0m0,318s
That is a gain of about 30% in performance for the small test (but not for the largest one, as I will show at the end).
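To illustrate where those allocations come from, here is a minimal sketch. The array and the function names sum_copy and sum_view are made up for the example and are not from your script. A plain slice x[a:b] copies the window into a new array, while @view only wraps the same memory:

```julia
# Hypothetical data, standing in for the series used in the script.
x = rand(10_000)

# Slicing with x[a:b] copies the window into a new array before summing.
sum_copy(x, a, b) = sum(x[a:b])

# @view wraps the same range without copying the elements.
sum_view(x, a, b) = sum(@view(x[a:b]))

# Call each function once so compilation is not counted below.
sum_copy(x, 1, 200)
sum_view(x, 1, 200)

bytes_copy = @allocated sum_copy(x, 1, 200)
bytes_view = @allocated sum_view(x, 1, 200)
println("copy: ", bytes_copy, " bytes, view: ", bytes_view, " bytes")
```

The exact byte counts depend on the Julia version, but the slice version should report at least the size of the copied window (200 Float64 values here), and the view version much less.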
The profiler (run @profview optimize(0, 200, 0.01, 200) in VS Code) says that most of the time is spent on these lines:
short_sum += x[i + 1] - x[i - ishort]
long_sum += x[i + 1] - x[i - ilong]
#and
ret = (short_mean > long_mean) ? x[i + 1] - x[i] :
(short_mean < long_mean) ? x[i] - x[i + 1] : 0.0
where in the first two lines the cost is in the getindex calls, and in the last expression it is in the > and < calls. Except for adding @inbounds there, which improves the performance a little, I don’t see anything else that could be done to speed up the code much, as it is.
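As a rough sketch of what I mean by adding @inbounds to that kind of incremental update (rolling_sums is a hypothetical helper written for this example, with 1-based indices, and is not the code from your script):

```julia
# Hypothetical helper, not taken from the original script.
# Returns the sum of every length-w window of x, updated incrementally.
function rolling_sums(x::Vector{Float64}, w::Int)
    n = length(x)
    1 <= w <= n || throw(ArgumentError("window must satisfy 1 <= w <= length(x)"))
    out = Vector{Float64}(undef, n - w + 1)
    s = sum(@view(x[1:w]))      # first window, summed without a copy
    out[1] = s
    # The check above guarantees that every index used below is valid,
    # so the bounds checks can be skipped safely.
    @inbounds for i in (w + 1):n
        s += x[i] - x[i - w]    # add the entering element, drop the leaving one
        out[i - w + 1] = s
    end
    return out
end

x = collect(1.0:10.0)
println(rolling_sums(x, 3))
```

Note that @inbounds is only safe when the index range is validated beforehand, as done at the top of the function. In my tests here it gives only a small improvement.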
Finally, I ran your original code (just removing using Distributions and adding @time main() at the end), with time julia trnbias.jl 0 500 0.01 500, and got:
59.313881 seconds (20.13 M allocations: 16.728 GiB, 0.43% gc time, 0.01% compilation time)
real 1m0,626s
user 1m0,513s
sys 0m0,477s
Then, with the @views in place:
58.977891 seconds (3.67 k allocations: 292.172 KiB, 0.01% compilation time)
real 1m0,339s
user 1m0,343s
sys 0m0,397s
My conclusions:
- Startup and compilation time are not relevant for your tests, as long as the tests take at least a few seconds.
- Using views significantly reduces the allocations, but that’s not the critical part for performance.
- The execution seems to be memory-bound, as getindex is taking most of the time (here).
Can you explain in more detail how to compile and run the Rust and C++ codes so we can test them here more easily? Given the tests above, I find it surprising that such a big performance difference is found (I don’t think bounds checking justifies the difference between C++ and Rust either; in Julia, disabling it gives only a minor speedup).
(ps: I got a lot of errors trying to compile the Rust or C++ versions, so clearly I don’t know what I’m doing there)