NumPy is not bad for what it does and has brought array programming into the mainstream while relieving the world of Matlab …
There are other broadcasting models which are more consistent, e.g., in modern APL or J. Built on the idea of verb rank and the leading-axis model, these provide a (more?) powerful notation for expressing many array operations concisely. For a proof of concept – based on stack and without further optimizations – you can try my JJ.jl:
julia> using JJ
julia> A = rand(5, 5, 100); # Note: switched to a trailing-axis model since Julia is column-major
julia> y = rand(5, 100);
julia> rank"2 \ 1"(A, y) |> size
(5, 100)
Here, rank"2 \ 1" indicates that \ now applies to matrices as left argument and vectors as right argument. If a batch is given, it simply broadcasts over that. A similar effect can be achieved with jax.vmap, which allows writing code for low-dimensional input arrays only and adding batch support simply when calling it. An advantage over notations with named indices, e.g., Einsum-like conventions, is that it is easier to generalize and applies equally to multiple batch dimensions:
julia> rank"2 \ 1"(rand(5, 5, 10, 8), rand(5, 10, 8)) |> size
(5, 10, 8)
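To make the batching explicit: the first example above, with a single trailing batch dimension, corresponds roughly to mapping \ over matching slices and re-stacking the results – a minimal sketch of what the rank notation automates (the eachslice/stack spelling is mine, not necessarily how JJ.jl implements it):
julia> slices = map((Ai, yi) -> Ai \ yi, eachslice(A; dims = 3), eachslice(y; dims = 2));  # solve one 5×5 system per batch element
julia> manual = stack(slices);  # reassemble the per-batch solutions
julia> size(manual)  # should agree with rank"2 \ 1"(A, y) up to floating point
(5, 100)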
Imho, the other – made up – example of D_{kn} = \frac{1}{L \cdot M} \sum_{l,m} A_{klm} B_{ln} C_{km} will probably always be easier to read (and write) with named indices, as the order of the axes is more or less arbitrary. In practice, a natural order is often available, e.g., distinguishing between batch, time, and embedding dimensions, and keeping that order consistent often makes it easier to express, e.g., via rank notation or vmap, which subset of the data each computation applies to.
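For illustration, with named indices this contraction reads almost verbatim in Julia as well – a minimal sketch assuming Tullio.jl (Einsum.jl would look much the same); the sizes are made up:
julia> using Tullio
julia> K, L, M, N = 4, 5, 6, 7;
julia> A = rand(K, L, M); B = rand(L, N); C = rand(K, M);
julia> @tullio D[k, n] := A[k, l, m] * B[l, n] * C[k, m];  # l and m appear only on the right, so they are summed over
julia> D ./= L * M;  # normalize by the number of summed terms
julia> size(D)
(4, 7)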
For what it's worth, here is a J line for the above using the provided axis order:
(+/%#)@,/"3 A * B */"1/~ C
A more consistent ordering might keep the batch dimensions K, N strictly before or after the “compute” dimensions L, M and vmap a function that is defined for L \times M matrices and L- and M-vectors only over those.
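Continuing the made-up example from above in plain Julia (the comprehension over k and n stands in for what vmap or rank notation would automate; the batch-first reordering and all names are just for illustration):
julia> Abat = A;  # already batch-first: Abat[k, l, m]
julia> Bbat = permutedims(B);  # reorder to batch-first: Bbat[n, l]
julia> Cbat = C;  # already batch-first: Cbat[k, m]
julia> kernel(mat, u, v) = sum(mat .* (u * v')) / (L * M);  # defined for one L×M matrix, one L-vector, and one M-vector only
julia> D2 = [kernel(Abat[k, :, :], Bbat[n, :], Cbat[k, :]) for k in 1:K, n in 1:N];  # map the low-rank kernel over the batch axes k and n
julia> D2 ≈ D  # same result as the named-index version above
true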