In addition to the Julia profiling I’ve mentioned before, I also have a trivial check within the program itself. The main loop runs a certain number of steps, and is then repeated 100 times. There is a PRNG involved, but I seed it specifically when running benchmarks, and check that the results are identical. The whole program is run 5 times for each given configuration, and multiple batches are analyzed to account for variability. The results are always pretty consistent and repeatable, regardless of the method used to measure.
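For context, a minimal sketch of what that check looks like (the names and the toy “simulation” here are purely illustrative, not the actual code): seed the PRNG before each batch, then verify that two seeded batches produce identical step counts.

```julia
using Random

# Illustrative stand-in for one benchmark batch: seed, run, time.
# The real simulation goes where the `sum(rand(...))` placeholder is.
function run_batch(nruns)
    rng = MersenneTwister(42)      # fixed seed, used only when benchmarking
    t0 = time()
    total_steps = sum(rand(rng, 200:280) for _ in 1:nruns)   # fake "steps"
    elapsed = time() - t0
    return total_steps, elapsed
end

steps_a, _ = run_batch(100)
steps_b, _ = run_batch(100)
steps_a == steps_b                 # identical results with the same seed
```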
As an example, two separate batches of 5 runs with `findall` would give me
```text
Execution time: 100 runs in 6.03s, Average steps: 236, Average steps per second: 3914.9253731343283
Execution time: 100 runs in 6.036s, Average steps: 236, Average steps per second: 3911.0337972167
Execution time: 100 runs in 6.085s, Average steps: 236, Average steps per second: 3879.5398520953163
Execution time: 100 runs in 6.063s, Average steps: 236, Average steps per second: 3893.617021276596
Execution time: 100 runs in 6.065s, Average steps: 236, Average steps per second: 3892.3330585325634
```
versus
```text
Execution time: 100 runs in 5.973s, Average steps: 236, Average steps per second: 3952.2852837769965
Execution time: 100 runs in 6.043s, Average steps: 236, Average steps per second: 3906.503392354791
Execution time: 100 runs in 6.063s, Average steps: 236, Average steps per second: 3893.617021276596
Execution time: 100 runs in 6.279s, Average steps: 236, Average steps per second: 3759.6751075011944
Execution time: 100 runs in 6.073s, Average steps: 236, Average steps per second: 3887.2056644162685
```
whereas with this loop

```julia
for xy in eachindex(A)
    @inbounds (A[xy] > 0) && (A[xy] -= 1)
end
```
I might get
```text
Execution time: 100 runs in 6.827s, Average steps: 236, Average steps per second: 3457.8877984473415
Execution time: 100 runs in 6.665s, Average steps: 236, Average steps per second: 3541.935483870968
Execution time: 100 runs in 6.672s, Average steps: 236, Average steps per second: 3538.2194244604316
Execution time: 100 runs in 6.738s, Average steps: 236, Average steps per second: 3503.561887800534
Execution time: 100 runs in 6.697s, Average steps: 236, Average steps per second: 3525.0111990443484
```
from one batch and
```text
Execution time: 100 runs in 6.876s, Average steps: 236, Average steps per second: 3433.246073298429
Execution time: 100 runs in 6.952s, Average steps: 236, Average steps per second: 3395.7134637514387
Execution time: 100 runs in 6.958s, Average steps: 236, Average steps per second: 3392.7852831273353
Execution time: 100 runs in 6.954s, Average steps: 236, Average steps per second: 3394.7368421052633
Execution time: 100 runs in 6.796s, Average steps: 236, Average steps per second: 3473.6609770453206
```
from another. These timings are taken inside the main loop only, and even though they use a simple stopwatch principle, they are consistent enough to get a first idea of the relative performance gains and losses. Of course, for details I will also check the profiler (in this case, for example, I see that much of the `materialize` time has shifted to indexing instead).
Both these timings and the profiling with Julia’s tools are done with the actual “production” code and data, and since they don’t contradict each other, I have no reason to suspect they might be misleading (although maybe the “where” of the bottleneck is imprecise?).
This is something I can try working on (I assume there’s no `creduce` for Julia, is there?), but I suspect that the specific array structures are responsible for the anomaly, so it’s probably mostly about the data distribution.
The code is basically a 2D cellular automaton with O(10^5) cells, but the “evolving” cells at every step are in a narrow interface band whose cell count changes during the simulation from O(10) to O(10^3), typically with a mean and median of < 500 active cells. These are not randomly spread out across the domain, but the “topology” of the distribution can get quite complex and disconnected (it’s not necessarily a contiguous band). I suspect that the number of active cells and their distribution, and the way this changes over time, are at fault here. It’s possible that approaches other than `findall` are more convenient with fewer/more cells, or with a less/more spatially coherent distribution, but the variability over time ends up making the `findall` approach more convenient somehow.
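To make the comparison concrete, here is a hedged sketch of the two variants as I’ve been describing them (the array and the dimensions are placeholders, not the production data): the `findall` version materializes the list of active indices first and touches only those, while the plain loop scans every cell and branches.

```julia
# Variant 1: materialize the active set with findall, then decrement those.
function decrement_findall!(A)
    active = findall(>(0), A)      # Vector{CartesianIndex{2}} of active cells
    for xy in active
        @inbounds A[xy] -= 1
    end
    return A
end

# Variant 2: scan every cell and branch on activity.
function decrement_scan!(A)
    for xy in eachindex(A)
        @inbounds (A[xy] > 0) && (A[xy] -= 1)
    end
    return A
end
```

Both produce the same array, which is exactly what the seeded-results check verifies; which one is faster then comes down to how many cells are active and how they’re laid out.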
I have a few optimization strategies in mind that require more extensive changes to the code (e.g. views on a bounded region of the arrays where the band is, etc.), but these are a separate matter. I’ll revisit the `findall` approach when I’ve looked at improving the computational domain boundary.
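For what it’s worth, one way the bounded-region idea could look (just a sketch under my own assumptions, not the planned implementation): take the bounding box of the active cells and operate on a view of it, so writes still land in the full array.

```julia
# Sketch: restrict the update to the bounding box of the active band.
function decrement_bounded!(A)
    act = findall(>(0), A)
    isempty(act) && return A               # nothing active this step
    lo, hi = minimum(act), maximum(act)    # corners of the bounding box
    sub = @view A[lo:hi]                   # no copy; writes go through to A
    for xy in eachindex(sub)
        @inbounds (sub[xy] > 0) && (sub[xy] -= 1)
    end
    return A
end
```

This only pays off when the band’s bounding box is much smaller than the domain, which, given how disconnected the band can get, is part of why I’m treating it as a separate experiment.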
Thanks all again for the suggestions.