Spending a lot of time in `materialize` called by `findall`

Oblomov · 2024-08-05T21:06:30.184Z

Hello all,

I’m new to Julia and as an exercise I’ve ported a medium-complex Python code to Julia. Even without particular optimizations, the Julia code is 4 to 5 times faster than Python (nice!), but I’ve decided to try additional optimizations, at least the low-hanging fruits. Using the excellent Performance Tips from the manual and the Profile module I’ve been able to achieve an overall 50% improvement to the code performance (nice!), but I’m not stuck in a somewhat annoying situation where the profiler shows that in my “hot path” most of the time is actually being spent in materialize (39% of the runtime, versus 21% of the function that does the computations I care about), that is being called indirectly —if the profiler is to be trusted in identifying the source code lines correctly— in the whereabouts of findall calls.

I wouldn’t mind it as much, if not for the fact that it feels “unfair” that the code is spending more time on administrative tasks, so to say, than in what I want it to actually do.

I have already managed to reduce the use of findall with simpler looping constructs in a couple of instances (with measurable performance benefits), but the remaining ones don’t seem to offer any advantage (and in fact sometimes provide worse performance) when findall gets replaced with something different.

The general structure for the remaining uses of findall is something like this:

for xy in findall(A .== constant) # or A .> constant or similar
work_on!(B, C, A, xy)
end

where A, B and C are (same-sized) arrays. There is also a A[findall(A .> 0)] .-= 1 which I’m sure can’t be improved any further. The reason why this approach is used is that on each turn the set of elements where work needs to be done is sparse and scattered throughout the original matrices (using a “dense” approach is an order of magnitude slower).

My understanding is that the materialize isn’t called by the findall directly, but is a byproduct of the broadcast, so I was thinking that an iterator-like argument to findall could help (I’ve seen some discussions in the forum about using broadcasted for that), but IME these approach are consistently worse, and so also things like iterating over enumerate(A) and skipping unwanted items.

Is there a way out of this that I’m not seeing or am I stuck with ~40% of my runtime going on “finding where to do the work”?

mbauman · 2024-08-05T21:33:44.394Z

You’re right you probably could optimize that further, but it’s true that what is best will be dependent upon your real applications.

Some things to try (if you’ve not tried them already) include:

for xy in findall((==)(constant), A)

for (xy, v) in pairs(A)
    v == constant || continue

There are times, though, that spending the effort to materialize an intermediate array for a computation is worth the effort. Computers are surprisingly good at dense arrays and it can sometimes pay you back. But Julia can also be surprisingly bad at lots of intermediate arrays, so it can also stab you in the back. YMMV.

stillyslalom · 2024-08-05T21:48:24.865Z

Oblomov:

There is also a A[findall(A .> 0)] .-= 1 which I’m sure can’t be improved any further

Not sure what your conditional sparsity pattern looks like, but a loop appears faster with 0.1% fill:

julia> function f!(A)
           A[findall(A .> 0)] .-= 1
           nothing
       end
f! (generic function with 1 method)

julia> function g!(A)
           for i in eachindex(A)
               (A[i] > 0) && (A[i] -= 1)
           end
       end
g! (generic function with 1 method)

julia> using BenchmarkTools

julia> A = -rand(1024, 1024) .+ 0.001;

julia> @btime f!(a) setup=(a = copy(A)) evals=1
  732.800 μs (7 allocations: 157.52 KiB)

julia> @btime g!(a) setup=(a = copy(A)) evals=1
  552.500 μs (0 allocations: 0 bytes)

It’s also relatively easy to throw multithreading at loops compared to a conditionally-indexed broadcast.

stevengj · 2024-08-05T21:52:21.451Z

Oblomov:

A[findall(A .> 0)] .-= 1 which I’m sure can’t be improved any further.

You could eliminate the index array allocation with A[A .> 0] .-= 1 (“logical” indexing), or better yet eliminate the boolean array allocation too with something like A .-= A .> 0 (which operates entirely in-place, with a single loop over A, allocating no temporary arrays).

And, as @mbauman and @stillyslalom point out, you can always just write a loop. Loops are fast in Julia (inside functions, not in global scope!). But it’s probably a mistake to focus on tiny calculations like subtracting 1 from each element:

It’s typically much better to write a single loop that does lots of work in a single pass over an array than to do lots separate passes (separate “vectorized operations” = separate loops) that do a tiny amount of work each (like subtracting 1 from some array elements).

You should look more holistically at your calculation that optimizing individual lines of a line-for-line port from Python.

(Python teaches pretty bad habits in this sense when it comes to optimizing code. Even “optimized, vectorized” code in Python leaves a lot of performance on the table.)

Oblomov · 2024-08-06T09:18:15.836Z

Thank all for your suggestions. I had explored these possibilities already, and they lead to performance drops between 10% and 30%. Interestingly, even removing the findall from the inline decrease to switch to logical indexing as suggested by @stevengj causes a small performance drop. I guess the code is at that sweet spot where the findall + broadcast materialization just happens to give the best compromise for the kind of data I’m dealing with.

The perspective of the easier paralllelizatoin mentioned by @stillyslalom does give weight to the potentially slower (but easier to parallelize) approaches, but at least in my case this code is inside an outer loop that is even more trivial to parallelize (think Monte Carlo approach, for simplicity), so I guess I will have to live with my Julia code being 7 to 10 times faster than Python

And thank you all again. This has been a very educational experience, and an excellent way to experiment with the various approaches.

DNF · 2024-08-06T10:19:49.115Z

Oblomov:

I had explored these possibilities already, and they lead to performance drops between 10% and 30%. Interestingly, even removing the findall from the inline decrease to switch to logical indexing as suggested by @stevengj causes a small performance drop.

Are you sure about this? It seems hard to believe. Benchmarking can often be fiddly, are you certain the benchmarking itself is not misleading you?

If you could share an MWE where you see a performance drop, I would be interested. It’s hard for me to imagine how findall can be faster, given that it requires first a full pass over A for comparison (including a temporary allocation), then a full pass for findall (with a second temporary allocation), and then a final pass with non-contiguous indices to do the subtraction. While the other approaches simply uses a single, non-allocating, pass, which could also be simd-vectorized. It should be faster in every instance, unless I am missing something.

stevengj · 2024-08-06T11:12:33.580Z

DNF:

Are you sure about this? It seems hard to believe. Benchmarking can often be fiddly, are you certain the benchmarking itself is not misleading you

Using @stillyslalom’s benchmark code, I find that findall is slower than @. A -= A > 0, which is slower than the loop.

However, A[A .> 0] .-= 1 is indeed slower than findall here. I think the issue is that A[i] .-= 1 is equivalent to A[i] .= A[i] .- 1, which performs the indexing A[i] twice, making it faster to use findall to cache the indices where i is true.

Oblomov · 2024-08-06T12:09:01.893Z

DNF:

Are you sure about this? It seems hard to believe. Benchmarking can often be fiddly, are you certain the benchmarking itself is not misleading you?

In addiiton to Julia profiling that I’ve mentioned before, I also have a trivial check within the program itself. The main loop runs a certain number of steps, and is then repeated 100 times. There is a PRNG involved, but I seed it specifically when running benchmarks, and check that the results are identical. The whole program 5 times for each given configuration, and multiple batches are analyzed to account for variability. The results are always pretty consistent and repeatable, regardless of the method used to measure.

As an example, two separate batches of 5 runs with findall would give me

Execution time: 100 runs in 6.03s, Average steps: 236, Average steps per second: 3914.9253731343283                    
Execution time: 100 runs in 6.036s, Average steps: 236, Average steps per second: 3911.0337972167
Execution time: 100 runs in 6.085s, Average steps: 236, Average steps per second: 3879.5398520953163
Execution time: 100 runs in 6.063s, Average steps: 236, Average steps per second: 3893.617021276596       
Execution time: 100 runs in 6.065s, Average steps: 236, Average steps per second: 3892.3330585325634

versus

Execution time: 100 runs in 5.973s, Average steps: 236, Average steps per second: 3952.2852837769965
Execution time: 100 runs in 6.043s, Average steps: 236, Average steps per second: 3906.503392354791
Execution time: 100 runs in 6.063s, Average steps: 236, Average steps per second: 3893.617021276596
Execution time: 100 runs in 6.279s, Average steps: 236, Average steps per second: 3759.6751075011944
Execution time: 100 runs in 6.073s, Average steps: 236, Average steps per second: 3887.2056644162685

whereas with this loop

                for xy in eachindex(A)
                    @inbounds (A[xy] > 0) && (A[xy] -= 1)
                end

I might get

Execution time: 100 runs in 6.827s, Average steps: 236, Average steps per second: 3457.8877984473415
Execution time: 100 runs in 6.665s, Average steps: 236, Average steps per second: 3541.935483870968
Execution time: 100 runs in 6.672s, Average steps: 236, Average steps per second: 3538.2194244604316
Execution time: 100 runs in 6.738s, Average steps: 236, Average steps per second: 3503.561887800534
Execution time: 100 runs in 6.697s, Average steps: 236, Average steps per second: 3525.0111990443484

from one batch and

Execution time: 100 runs in 6.876s, Average steps: 236, Average steps per second: 3433.246073298429
Execution time: 100 runs in 6.952s, Average steps: 236, Average steps per second: 3395.7134637514387
Execution time: 100 runs in 6.958s, Average steps: 236, Average steps per second: 3392.7852831273353
Execution time: 100 runs in 6.954s, Average steps: 236, Average steps per second: 3394.7368421052633
Execution time: 100 runs in 6.796s, Average steps: 236, Average steps per second: 3473.6609770453206

from another. These timings are taken inside the main loop only, and even though it’s using a simple stopwatch principle, they are consistent enough to get a first idea of the relative performance gains and losses. Of course for details I will also check the profiler (in this case, for example, I see that much of the materialize time has shifted to indexing instead).

Both these and the profiling with Julia’s tools is done with the actual “production” code and data, and since they’re not contradicting each other, I have no reason to suspect they might be misleading (although maybe the “where“ the bottleneck is may be imprecise?).

DNF:

If you could share an MWE where you see a performance drop, I would be interested.

This is something I can try working on (I assume there’s no creduce for Julia, is there?), but I suspect that the specific array structures are responsible for the anomaly, so it’s probably mostly about the data distribution.

The code is basically a 2D cellular automaton with O(10^5) cells, but the “evolving” cells at every step are in a narrow interface band with a number of cells that during the simulation changes from O(10) to O(10^3), typically with a mean and median of < 500 active cells. These are not randomly spread out across the domain, but the “topology” of the distribution can get quite complex and disconnected (it’s not necessarily a contiguous band). I suspect that the number of active cells and their distribution, and the way this changes over time, are at fault here. It’s possible that other approaches than findall are more convenient with less/more cells, or with less/more spatially coherent distribution, but the variability over time ends up making the findall approach more convenient somehow.

I have a few optimization strategies in mind that require more extensive changes to the code (e.g. views on a bounded region of the arrays where the band is etc), but these are a separate matter. I’ll revising the findall when I’ve looked at improving the computational domain boundary.

Thanks all again for the suggestions.

DNF · 2024-08-06T12:23:55.537Z

It might be, as you said, that your data are such that the vector found by findall is very small, for example. In that case it costs very little. But this snippet

Oblomov:

for xy in eachindex(A)
    @inbounds (A[xy] > 0) && (A[xy] -= 1)
end

should rather be

for xy in eachindex(A)
    a = A[xy]
    (a > 0) && (A[xy] = a - 1)
end

to avoid unnecessary double lookup (as mentioned by @stevengj ) That’s a 20% difference in runtime in my test.

(Btw, the @inbounds annotation is redundant when you use eachindex.)

Oblomov · 2024-08-06T16:55:25.925Z

Thanks for the hint about @inbounds. I tried the form you recommend, and it’s still not as fast as the A[findall()] version, although it’s slightly better than the one without pulling out the value in advance (> 3.5K steps per second).

I also tried

for (xy, a) in pairs(A)
    (a > 0) && (A[xy] = a - 1)
end

but this is catastrophically bad (~2.3K steps per second).

DNF · 2024-08-06T17:20:17.223Z

I cannot understand how this is possible. There must be something going on with respect to the surrounding code. Your code is objectively doing more work, and allocating as well. There’s something wrong here.

mbauman · 2024-08-06T17:32:31.459Z

Are you doing this loop in top level? Or perhaps with a non-constant global? Or perhaps you are using those variable names in wildly different manners within the same function? Any or all of those would fully explain these disparities — they’re all type instabilities, which have far worse penalties for loops than their “vectorized” equivalents.

Oblomov · 2024-08-06T18:40:47.604Z

These loops are running from a function called from the top level, although it’s a bit of on the longish one. I have tried factoring out each of the loops currently using findall but there was no benefit. I tried to strongly-type all variables, but I may have missed some, this is definitely something to check. There aren’t globals I access for writing, but I have some that I haven’t marked as const —this is definitely something I can look into, eventhough they are definitely not involved in A[A .> 0] .-= 1. I’ll now read more on type instabilities and tools to look for them, and see if I manage to find any snags.

Oblomov · 2024-08-06T20:18:00.694Z

I’ve gone through all the globals and made sure they were loaded through functions so that they could be marked const (which they are, after loading), and made sure that their type is clearly indicated. This alone has been sufficient to speed up my code by around 40%. I guess I will now go through all the recommendations in this thread again, to test them in a cleaner condition —hopefully the results will be more in tune with expectations!

FWIW, I’m really starting to appreciate Julia as a bridge between Python-like fast prototyping and excellent performance —and the way one can actually go “from here to there” by incremental improvements to the code.

EDIT: even after following @mbauman’s recommendation, the “decrement larger than zero values” works better with the findall.

logical indexing (`A[A .> 0] .-= 1 is ~2% slower;
the eachindex() version is ~10% slower;
the pairs() version is between 40% and 50% slower.

Next step will be trying if @code_warntype has something to say.

abraemer · 2024-08-07T06:01:08.167Z

Oblomov:

I tried to strongly-type all variables

Please note that this is not really a thing in Julia (except maybe for global variables which you anyways shouldn’t access inside functions). Type annotations on variables inside functions are not required for performance and in fact can even hurt performance if done incorrectly. There is basically only one case where a type annotation help and that is to help contain some type instability. But it is likely that this should be fixed elsewhere.