Alex90
1
Hello,
I have tested a simple math example in Julia using GPU computing on a GTX 960. The code is as follows:
A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
@. function math1(A,B)
C = A.^2 .+ B.^2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B;
return C
end
@. function math2(C)
D = C.^2 .+ C.^2 .+ C .* C .+ C ./ C .- C .* C .- C ./ C .+ C .* C .+ C ./ C .- C .* C .- C ./ C;
return D
end
@. function math3(D)
E= D.^2 .+ D.^2 .+ D .* D .+ D ./ D .- D .* D .- D ./ D .+ D .* D .+ D ./ D .- D .* D .- D ./ D;
return E
end
@btime for iter = 1:1000
C = math1(A,B);
D = math2(C);
E = math3(D)
end
10.619 s (1671000 allocations: 26.09 MiB)
And I ran the same code in MATLAB (2023b) and got around 2.7 seconds.
A similar performance problem was discussed in "Why Julia is much slower than MATLAB on GPU computing?"
So I am curious: is this expected performance behavior? Can Julia be faster than MATLAB without manually writing a GPU kernel?
1 Like
Here’s a more idiomatic Julia version:
A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
@. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
return C
end
function math2!(D, C)
@. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
return D
end
function math3!(E, D)
@. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
return E
end
@btime for iter = 1:1000
math1!($C, $A, $B)
math2!($D, $C)
math3!($E, $D)
end
What time do you get with this code?
Key points:
- If you want to mutate your preallocated arrays C, D, E, you need to pass them to the functions as arguments and apply a mutating operation to them. C = ... does not mutate, it creates a new variable C. However, @. C = ... mutates because it translates to C .= ..., which is syntax for in-place broadcasting (see the small sketch after this list).
- The macro @. goes in front of an expression to insert dots (i.e., fused broadcasting) at every subexpression. When you use it, you don’t have to insert any dots manually. (I’ve never seen it placed in front of an entire method definition before; I’m not sure what it would do there.)
- When benchmarking with @btime you should interpolate global variables with $, otherwise the expression can’t be type inferred and compiled to efficient code.
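For illustration, here is a tiny CPU-only sketch (arbitrary sizes and function names, not your code) of the difference between rebinding and in-place broadcasting, and of $ interpolation:
using BenchmarkTools

A = rand(Float32, 4, 4)
B = rand(Float32, 4, 4)
C = zeros(Float32, 4, 4)

rebind(C, A, B) = (C = A .+ B; C)     # rebinds the local name C; allocates a new array
mutate!(C, A, B) = (@. C = A + B; C)  # writes into the existing array C

mutate!(C, A, B)
C == A .+ B                           # true: C itself was updated in place

@btime mutate!($C, $A, $B)            # interpolate globals with $ when benchmarking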
Minor point: you don’t need to terminate lines with a semicolon in Julia.
Let us know how the updated code performs!
3 Likes
Alex90
3
Dear Daniel,
Many thanks for your response. I had tried a very similar code modification before, and it gave me the same result. For reference, the code as revised by you produces this output:
10.512 s (1698000 allocations: 26.64 MiB)
As I understand it, Julia doesn’t like long formulas. Previously, I did another performance test with no more than 3 operations (sum and double square). In that case, Julia significantly outperformed MATLAB.
1 Like
gdalle
4
Is it possible that dot fusion gives up after too many operations?
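One way to probe that (a CPU-only sketch with arbitrary names and sizes; results will depend on your machine) is to compare the allocations of a short fused broadcast against one as long as the original formula:
using BenchmarkTools

a = rand(Float32, 151, 151, 151)
b = rand(Float32, 151, 151, 151)
c = similar(a)

# short fused broadcast
@btime @. $c = $a^2 + $b^2
# long fused broadcast, same shape as math1!
@btime @. $c = $a^2 + $b^2 + $a * $b + $a / $b - $a * $b - $a / $b + $a * $b + $a / $b - $a * $b - $a / $b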
1 Like
rveltz
5
Try putting the whole computation in a function to which you pass A, B, C, D, E.
Alex90
6
Dear rveltz,
I have tried that, and it gave the same result.
rveltz
7
You start with zeros and then get NaN, right? Not sure how the GPU handles that.
@rveltz might be onto something. B is never initialized; similar returns uninitialized memory. What did you intend to fill B with before your computations?
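For what it’s worth, a minimal sketch (tiny arrays and my own names, just to illustrate) of why an uninitialized B is a problem:
using CUDA

B = similar(CUDA.zeros(Float32, 2, 2))  # allocated but *not* initialized: arbitrary contents

isnan(0f0 / 0f0)   # true -- any zeros lurking in B poison the A / B terms
isinf(1f0 / 0f0)   # true

B = CUDA.rand(Float32, 2, 2)            # initializing B avoids this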
1 Like
B = CUDA.rand(151, 151, 151) helped a little bit on my end, but not much. Perhaps there’s a limit to loop fusion, as @gdalle suggests? Broadcast support for GPU arrays, including CuArrays, is implemented using KernelAbstractions, which I’ve never used. Here’s the relevant method: GPUArrays.jl/src/host/broadcast.jl at 77d7d5104cd1deea1a22343ff6fad2b7d65b3c28 · JuliaGPU/GPUArrays.jl · GitHub
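For context, here is a rough, untested sketch of what a hand-written KernelAbstractions kernel for the first formula could look like (the kernel name and launch details are my own, not taken from the linked code):
using CUDA, KernelAbstractions

@kernel function math1_kernel!(C, @Const(A), @Const(B))
    i = @index(Global, Linear)
    @inbounds C[i] = A[i]^2 + B[i]^2 + A[i] * B[i] + A[i] / B[i] - A[i] * B[i] -
                     A[i] / B[i] + A[i] * B[i] + A[i] / B[i] - A[i] * B[i] - A[i] / B[i]
end

A = CUDA.rand(Float32, 151, 151, 151)
B = CUDA.rand(Float32, 151, 151, 151)
C = similar(A)

backend = get_backend(C)                              # CUDABackend() for CuArrays
math1_kernel!(backend)(C, A, B; ndrange = length(C))  # launch over all elements
KernelAbstractions.synchronize(backend)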
1 Like
Alex90
10
You are right. It should be
B = CUDA.rand(151,151,151)
Then there are no more NaN values. However, it only slightly affected the performance. The output is as follows:
9.379 s (1671000 allocations: 26.09 MiB)
You’re using Float64 on a pretty old GPU, which is really slow. To give the GPU a chance, you should switch to Float32.
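For reference, making the element type explicit looks like this (just the constructors, no new timings implied):
A = CUDA.rand(Float32, 151, 151, 151)
B = CUDA.rand(Float32, 151, 151, 151)
C = CUDA.zeros(Float32, 151, 151, 151)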
2 Likes
I am wondering if MATLAB is doing some Float32 conversions…
lmiq
14
Is the code non-allocating when run on the CPU?
1 Like
Alex90
15
Dear sdanisch,
I used Float32 with both Julia and MATLAB.
Alex90
16
Dear tamasgal,
Before the computations, I created gpuArrays in MATLAB with single precision (Float32).
Alex90
17
Dear lmiq,
My version of the code run on the CPU gives this output:
45.033 s (15000 allocations: 656.25 KiB)
The code revised by danielwe provides another result:
47.197 s (33000 allocations: 1.01 MiB)
gdalle
18
Then there is an issue with broadcasting fusion. Maybe the * symbols are still performing matrix multiplication? I never fully trust @. because I’m not sure how it works.
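One quick way to check is to look at the expansion (placeholder names):
# @. rewrites every call into a dot call, so * becomes the element-wise .*,
# not matrix multiplication:
@macroexpand @. C = A^2 + A * B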
1 Like
Alex90
19
Without @., I get the same results when performing element-wise operations:
9.458 s (1671000 allocations: 26.09 MiB)
lmiq
20
Then I would probably try to split that long line into a few smaller lines, to see what happens.
Edit: effectively, if you break that long line, it becomes non-allocating on the CPU:
A = rand(51,51,51);
B = rand(51,51,51);
C = zeros(51,51,51);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
@. C = A^2 + B^2 + A * B
@. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
return C
end
before:
julia> @time math1!(C,A,B);
0.000845 seconds (11 allocations: 352 bytes)
after:
julia> @time math1!(C,A,B);
0.000966 seconds
(this was with a smaller matrix, just for testing)
4 Likes