Ray Tracing in a week-end - Julia vs SIMD-optimized C++

claforte · 2021-12-11T23:58:13.678Z

Hi folks,

As an exercise to learn how to optimize Julia code for performance, I adapted from C++ to Julia a popular raytracing “book” (Ray Tracing In One Weekend by Peter Shirley) into a Pluto.jl notebook. Hopefully this will attract some other newcomers to Julia. (I plan to use this to advertise Julia’s benefits at AMD, where I work.)

https://github.com/claforte/RayTracingWeekend.jl

My initial implementation ran very slowly, but by eliminating some of my earlier newbie mistakes, then using StaticArrays, Float32, @btime and --track-allocation=user, I eliminated most of the heap allocations and I think I can soon get pretty close to the performance of a SIMD-optimized C++ implementation.

As a comparison, currently, it takes ~4.1 hours to render(scene_random_spheres(), t_cam1, 1920, 1000). I read that someone got their SIMD-optimized, multi-threaded CPU-only C++ implementation to run a very similar command in ~1 hour, on a system that is probably comparable with mine. (I plan to eventually run their code to get more precise measurements.) My version isn’t optimized for SIMD and it’s still single-threaded, and processes rays 1-by-1. I also haven’t implemented the BVH (Bounding Volume Hierarchy) raytracing acceleration structure that was likely implemented in that C++ version. So there’s still lots of space for CPU-side optimization. Finally, in the long run I hope to also try to execute on GPU, and also make a version that uses Vulkan raytracing.

If any of you expert can take a look and recommend ways to clean-up the code while maintaining (or improving?) performance, I welcome your suggestions. I’d like this project to grow into an example of how to get good performance on graphics problems with Julia.

Thank you!

jling · 2021-12-12T00:33:18.546Z

for i in 1:ny, j in 1:nx # Julia is column-major, i.e. iterate 1 column at a time
...
		img[i,j] = RGB{Float32}(r,g,b)

this is flipped, column-major means things in the same column are next to each other physically in RAM. so you should iterate 1 row at a time, which means you should make i the inner index.

lmiq · 2021-12-12T00:56:07.860Z

claforte:

to clean-up the code

Looking quickly, I think that the thing that can be done to clean the code is to remove type annotations from many functions, and making some other genetic, such that the use of different types of floats can be set by the user.

A small example:

random_vec3(min::Float32, max::Float32) = @SVector[random_between(min,max) for i∈1:3]

This can be just:

random_vec3(min, max) = @SVector[random_between(min,max) for i∈1:3]j

Another thing, here Material is an abstract type:

struct Sphere <: Hittable
	center::SVector{3,Float32}
	radius::Float32
	mat::Material
end

It would be better for performance to write:

struct Sphere{M} <: Hittable
	center::SVector{3,Float32}
	radius::Float32
	mat::M
end

claforte · 2021-12-12T03:28:06.734Z

Thanks everyone, I will read more carefully tomorrow and try these ideas!

I added @inline to most low-level functions, and got a 2.44X speed-up, on the long benchmark… so it’s down from 4.1 hour to 1.67 hour. Getting pretty close to 1h and I’m still only using 1 thread.

claforte · 2021-12-12T08:24:50.197Z

Good catch, thanks a lot @jling! I should have noticed this, it’s one of the first tips @ChrisRackauckas taught in his course!

Surprisingly, if I iterate over each column in my render() function, with a tiny scene (scene_2_spheres), iterating per column is consistently slower by a large margin… (compared 5 times). (BTW I also fixed a minor logic bug… I kept applying random offset, brownian-motion-style, instead of simply applying a delta per sample around each image pixel.)

function render(scene::HittableList, cam::Camera, image_width=400,
				n_samples=1)
	# Image
	aspect_ratio = 16.0f0/9.0f0 # TODO: use cam.aspect_ratio for consistency
	image_height = convert(Int64, floor(image_width / aspect_ratio))

	# Render
	img = zeros(RGB{Float32}, image_height, image_width)
	# Compared to C++, Julia:
	# 1. is column-major, elements in a column are contiguous in memory
	#    this is the opposite to C++.
	# 2. uses i=row, j=column as a convention.
	# 3. is 1-based, so no need to subtract 1 from image_width, etc.
	# 4. The array is Y-down, but `v` is Y-up 
	# Usually iterating over 1 column at a time is faster, but
	# surprisingly, in this case the opposite pattern arises...
	for j in 1:image_width, i in 1:image_height # iterate over each column
	#for i in 1:image_height, j in 1:image_width # iterate over each row
		accum_color = @SVector[0f0,0f0,0f0]
		u = convert(Float32, j/image_width)
		v = convert(Float32, (image_height-i)/image_height) # i is Y-down, v is Y-up!

		for s in 1:n_samples
			if s == 1 # 1st sample is always centered, for 1-sample/pixel
				δu = 0f0; δv = 0f0
			else
				# claforte: I think the C++ version had a bug, the rand offset was
				# between [0,1] instead of centered at 0, e.g. [-0.5, 0.5].
				δu = convert(Float32, (rand()-0.5f0) / image_width)
				δv = convert(Float32 ,(rand()-0.5f0) / image_height)
			end
			ray = get_ray(cam, u+δu, v+δv)
			accum_color += ray_color(ray, scene)
		end
		img[i,j] = rgb_gamma2(accum_color / n_samples)
	end
	img
end

# `for j in 1:image_width, i in 1:image_height # iterate over each column`: 10.489 ms
# `for i in 1:image_height, j in 1:image_width # iterate over each row`: 8.129 ms
@btime render(scene_2_spheres(), default_camera(), 96, 16) # 16 samples

# Iterate over each column: 614.820 μs
# Iterate over each row: 500.334 μs
@btime render(scene_2_spheres(), default_camera(), 96, 1) # 1 sample

I guess this can easily be explained by get_ray or ray_color thrashing a cache… but that was a surprisingly large margin… maybe there’s still a bug in my code somewhere?

Anyway, as expected, this change makes no noticeable difference in larger scenes that have hundreds of spheres, like scene_random_spheres.

For now I’ll keep the faster loop, and test again later, when I optimize for SIMD.

woclass · 2021-12-12T09:42:45.509Z

claforte:

δu = convert(Float32, (rand()-0.5f0) / image_width)
δv = convert(Float32, (rand()-0.5f0) / image_height)

You can use rand(Float32).

julia> @btime rand()
  7.300 ns (0 allocations: 0 bytes)
0.8462830063896598

julia> @btime rand(Float32)
  7.600 ns (0 allocations: 0 bytes)
0.3544693f0

julia> @btime Float32(rand())
  7.900 ns (0 allocations: 0 bytes)
0.8506073f0

julia> @btime convert(Float32, rand())
  7.900 ns (0 allocations: 0 bytes)
0.3169133f0

lmiq · 2021-12-12T10:58:15.710Z

There is another thread discussing the performance of a toy implementation of a ray tracer: Why is this raytracer slow? - #30 by lmiq

Maybe some of the code there can give you some ideas.

claforte · 2021-12-12T12:48:10.480Z

Yes, that’s where I got the idea to use @inline all over the place, and it provided a huge speed-up! See my comment above.

DNF · 2021-12-12T12:57:54.354Z

I thought inlining was supposed to happen automatically, and that there was rarely much use for explicit @inline anymore.

So that’s not correct, then?

lmiq · 2021-12-12T13:08:28.462Z

Like other compiler optimizations there are heuristics for that, which may not do the best choice always. I have read somewhere* recently that “code simplicity” is one of the main criteria for inlining automatically. Inlining too much may also increase compilation time.

(probably the great differences observed are associated to the inlining of one or or other of the functions - some that is called very frequently in critical code and for some reason was not inlined automatically - , ideally one should search which are the non-inlined functions for which that annotation makes a difference).

*edit… here: Inlining 101

claforte · 2021-12-12T13:18:47.794Z

Thanks! There were a few rand()s that didn’t explicitly request a Float32, I converted them all to rand(Float32) and lifted out of inner loop some convert(Float32,...).

New commit:

Speed-up (2.2%?) from using Rand(Float32)

- ... to avoid unnecessary conversions and Float64 intermediate values
- also lifted out of an inner loop some convert(Float32)
- thanks to @woclass for pointing it out!

claforte · 2021-12-12T14:37:44.732Z

Thanks! I made this commit:

random_vec3(min, max) - no need for ::Float32

- no measurable impact on performance
- thanks to @leandromartinez98 for pointing this out!

However I didn’t figure out any way to clean-up or speed up the code using your struct Sphere{M} <: Hittable recommendation…

It made the code more complicated, e.g. Sphere{Lambertian}(@SVector[0f0,0f0,-1f0], 0.5, Lambertian(@SVector[0.7f0,0.3f0,0.3f0]))) instead of Sphere(@SVector[0f0,0f0,-1f0], 0.5, Lambertian(@SVector[0.7f0,0.3f0,0.3f0])))
It seems to cause very noticeable slowdowns, e.g.
@btime render(scene_2_spheres(), default_camera(), 96, 16) went from 8.077 ms to 14.359 ms.

Maybe there’s a detail I missed?

BTW the #1 thing I’d like to clean up is the SVector{3, Float32}. The original C++ code defined Color and Vec3 classes, and I’d really like to do the same. I tried doing Vec3 = SVector{3, Float32}, then using Vec3 in other functions, but that caused tons of extra memory allocations, made the performance horrible… If you or anyone else can point me to a good example on how to manage this, that would be great!

nilshg · 2021-12-12T14:51:27.308Z

Might depend on how exactly you use Vec3 in your code but have you tried const Vec3 = SVector{3, Float32}?

woclass · 2021-12-12T16:01:47.593Z

Tried using const Vec3{T<:AbstractFloat} = SVector{3, T} and compared F32 and F64

F64 is faster.

# The original version
@btime render(scene, t_cam1, 200, 43);
  10.222 s (2399087 allocations: 110.08 MiB)

# Vec3{T} + Float32
@btime render(scene, t_cam1, 200, 43);
  10.756 s (7370438 allocations: 188.79 MiB)
# Vec3{T} + Float64
@btime render(scene, t_cam1, 200, 43);
  8.797 s (2274922 allocations: 174.08 MiB)

github.com/fork4jl/RayTracingWeekend.jl

Use Type{T}: support F64/F32

committed 03:06PM - 12 Dec 21 UTC

inkydragon

+154 -139

picked Float32 instead of Float64, because that will likely run twice as fast on GPU

Arrays on the GPU have their own type CuArray{T}.
The code running on the CPU can continue to use Floa64, the conversion only happens when copying data from CPU memory to GPU memory.

I guess the conversion overhead can be ignored, compared to the data copying overhead.

claforte · 2021-12-12T16:26:50.160Z

Awesome, thanks @woclass! I hope this perf increase will hold even after the @inline. I’ll try to merge it now…

claforte · 2021-12-12T16:49:26.750Z

@woclass, I integrated your previous commit…

https://github.com/claforte/RayTracingWeekend.jl/commit/254781f48fb7f82e95b2dfcd8926b3c0168f3d88

… in my latest code (proto.jl) and even after the @inline and other optimizations, that change gives a 6.6% speed up on @btime render(scene_random_spheres(), t_cam1, 200, 32)! Awesome! I’m going to remember this const alias trick, @nilsh and @woclass!

I’ll try the rest now…

DNF · 2021-12-12T17:09:01.215Z

BTW, if you do a lot of rand calls, you should explicitly provide the rng, and take care which one you use. That may make a significant difference.

Take a look here The need for rand speed | Blog by Bogumił Kamiński and probably, you should use the Xoshiro rng.

paulmelis · 2021-12-12T19:02:41.508Z

I did my own implementation of (only the first part) of Ray Tracing in One Weekend some time ago, see https://github.com/paulmelis/riow.jl. Probably not as optimized as yours at this point But there are a few differences, such as using a Union as the Material “type”.

lmiq · 2021-12-12T22:50:14.891Z

claforte:

It made the code more complicated, e.g. Sphere{Lambertian}(@SVector[0f0,0f0,-1f0], 0.5, Lambertian(@SVector[0.7f0,0.3f0,0.3f0]))) instead of Sphere(@SVector[0f0,0f0,-1f0], 0.5, Lambertian(@SVector[0.7f0,0.3f0,0.3f0])))

It seems to cause very noticeable slowdowns, e.g.
@btime render(scene_2_spheres(), default_camera(), 96, 16) went from 8.077 ms to 14.359 ms.

Point (1) can be solved with a constructor like Sphere(v,d,m) = Sphere{typeof(m)}(v,d,m)

I played a little with the code because of (2). I would do two things with the HitPoint type: make it immutable and parametrize the material as well. Thus, in the end I did this:

# ╔═╡ 3b570d37-f407-41d8-b8a0-a0af4d85b14d
"Record a hit between a ray and an object's surface"
struct HitRecord{M}
	t::Float32 # vector from the ray's origin to the intersection with a surface. 
	
	# If t==Inf32, there was no hit, and all following values are undefined!
	#
	p::SVector{3,Float32} # point of intersection between an object's surface and ray
	n⃗::SVector{3,Float32} # local normal (see diagram below)
	
	# If true, our ray hit from outside to the front of the surface. 
	# If false, the ray hit from within.
	front_face::Bool
	mat::M
end
HitRecord() = HitRecord{NoMaterial}(Inf32,zero(SVector{3,Float32}), zero(SVector{3,Float32}), false, NoMaterial()) # no hit!
HitRecord(t,p,n⃗,front_face,mat) = HitRecord{typeof(mat)}(t,p,n⃗,front_face,mat)

# ╔═╡ 138bb5b6-0f45-4f13-8339-5110eb7cd1ff
struct Sphere{M} <: Hittable
	center::SVector{3,Float32}
	radius::Float32
	mat::M
end
Sphere(c,r,m) = Sphere{typeof(m)}(c,r,m)

(note the constructors that solves problem 1 here as well)

That, unfortunately, didn’t improve performance, and why is that is a bit more complicated, probably. You are reaching probably a dynamic dispatch point inside the hit function, and making the types concrete is not helping. In general those would be good advice, I think.

Probably on alternative would be to make a concrete more general “Material” which might carry the properties of all materials and a flag, to avoid run time dispatch. But that is less elegant than what you have.

claforte · 2021-12-12T23:45:05.880Z

I’m almost done merging @woclass’s Vec3{T} optimization… I hope to push the change tomorrw. I confirm that Float64 runs ~7% faster. I ran the full benchmark and got this result:

Previous measurement after lots of other optimizations including @inline lots of stuff:
6018.101653 seconds (5.39 G allocations: 241.139 GiB, 0.21% gc time) (1.67 hour)

New measurement after 3 major changes:
@woclass’s rand(Float32) to avoid Float64s: (expected to provide 2.2% speed up)
@woclass’s “Use alias instead of new struct”, i.e. const HittableList = Vector{Hittable}
@woclass’s Vec3{T} with T=Float64: (7.8% speed-up!):
5268.175362 seconds (4.79 G allocations: 357.005 GiB, 0.47% gc time) (1.46 hours)

That’s ~14% speed up, and I learnt a lot from your suggestions, thanks everyone!

Next I plan to:

Run the C++ benchmark on my PC, confirm exactly what algorithmic changes they have which I don’t (to have a more fair comparison)
Read that blog on the rand performance, try that
Read on the SIMD libraries, try to figure out the best approach. (I’ll welcome any suggestion you have!)
Probably pre-allocate the ray bundles/paths (I don’t know how they are called in the litterature) to later simplify the GPU (and probably the SIMD) implementation.

Thanks!