Note (added later, to prevent any confusion): at the time this was originally posted, Julia’s run was about 600 ms, ~20x slower than Rust. This is no longer the case.
Resuming the original post…
I stumbled upon this funny benchmark with Julia at the bottom of the list.
They (whoever wrote that code - I wasn’t curious enough to check the git blame) didn’t even account for compilation time.
If somebody has some time, they might be interested in fixing this (I am not even sure the Julia code does only the work described in the benchmark).
Context: the F# version was also unfairly represented, and somebody fixed it (a little overkill there with some unidiomatic F# - but it is fast).
However, the benchmark doesn’t seem concerned with comparing apples to apples - anybody can jump on it and propose an implementation as long as they do not break the benchmark rules.
I’m betting @quinnj could make this fly, but I’m not sure it’s worth the effort. It’s a small benchmark, at least.
I think there has been an update to the code; here is the result on my machine.
It beats them all - the previous best benchmark time was Rust at 53 ms.
Yes - I did the update. It’s pretty straightforward and naive, but the goal was to replace the odd initial implementation that took about 0.6 seconds.
I’m not following… the current version of the Julia code is now the fastest? Then what’s left to do?
I did the update after writing the initial post (the official result was not yet updated - so we can still see Julia with >0.5 seconds on that table).
Benny
I guess if someone gripes about calling the function once just to compile it before timing a second call, you could replace that first call with a `precompile` and check whether that works as well. But the tendency of JIT compilation to leak into single-run benchmarks has already been adequately demonstrated, and that `precompile` wouldn’t fix this setup.
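For illustration, a minimal sketch of both approaches on a toy workload (the function and data here are made up, not taken from the benchmark repo):

```julia
# Keeping JIT compilation out of a one-shot timing.
work(xs) = sum(x -> x^2, xs)    # stand-in for the benchmark's real function

xs = rand(10^6)

# Option 1: a throwaway warm-up call compiles `work` for Vector{Float64}.
work(xs[1:1])

# Option 2: compile the method without running it.
precompile(work, (Vector{Float64},))

t = @elapsed work(xs)           # now measures only the run, not compilation
println("elapsed: $t s")
```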
@Benny, the compilation time only explained about 100-150 ms of that result (at least on my machine).
It was also a matter of implementation.
Even a pretty naive fix changed things tremendously.
From the most recent PR:
| Language | Time |
| --- | --- |
| Go | 26.39 ms |
| Zig | 38.00 ms |
| Rust | 38.91 ms |
| Java (GraalVM) | 40.00 ms |
| Julia | 42.67 ms (down from >580 ms) |
If somebody has some time and wants to play with it - it might be a nice exercise to improve the speed.
It is not an especially popular benchmark, but it tries to showcase a real-world scenario rather than some rarely encountered micro-benchmark.
jling
if you want to fix the second column (the one including I/O, which times `julia script.jl` from the outside), you need AOT compilation, I guess; not sure it’s worth the effort…
Nope - it is about the first column.
atthom
Hi there,
I’m the author of the initial Julia implementations. (Sorry for finishing last!)
My goal wasn’t to write the fastest code but to stay as close as possible to the plain Python and NumPy implementations, because Python is my main language at work. If you’re curious, the Julia version was 3x faster than plain Python and 1.5x faster than NumPy.
Since then, the repo blew up, and I wanted to write a proper version.
I was able to push it to about 70 ms on my machine with this code:
But I think I’m already outclassed by the Julia wizards of this forum.
In any case, to fix the second column, maybe a tool like DaemonMode.jl could do the job?
Not sure if it’s worth it, though - it might be considered cheating.
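For reference, DaemonMode.jl is typically driven roughly like this (per its README; `script.jl` stands in for the benchmark script):

```
# terminal 1: start a persistent Julia daemon once
julia --startup-file=no -e 'using DaemonMode; serve()'

# terminal 2: run the script through the daemon; each invocation skips
# Julia's startup and recompilation cost
julia --startup-file=no -e 'using DaemonMode; runargs()' script.jl
```

Whether the benchmark rules would accept a resident daemon is exactly the “cheating” question raised above.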
Hi @atthom,
I hope there are no hard feelings on your end. I wasn’t curious to check the git blame, since my only concern was seeing Julia join the Rust/Go/Zig club. Objectively, we can agree that the implementation was not close to the Rust level (and you had the freedom to write whatever implementation you saw fit, according to your own objectives - you had no obligation to satisfy me or anybody else).
Also, since the benchmark is not a micro-benchmark and actually resembles a scenario likely to be encountered in real-world data processing, it was somewhat painful to see Julia running 20x slower than Rust.
But that was my subjective experience - significant enough that I did something about it. I am waiting for the next benchmark run to reflect the latest changes, but I estimate that Julia will take second place. That is enough for my personal satisfaction.
However, it would be nice to see somebody from the wizard class getting their hands dirty and doing some wonders (while staying within the boundaries set by the benchmark rules).
jling
your last PR actually broke the rules, because this array type doesn’t check bounds
In my defence, the docs don’t say anything about this: Home · StrideArrays.jl (juliasimd.github.io)
Thanks for bringing this up - I’ll fix it.
Dan
Looking at the new code, it seemed a bit long. Here is a version that tries to be both short and fast (it gets close to the current version’s speed, though it still skips some of the current version’s optimizations):
```julia
using JSON3
using Dates
using DataStructures

struct PostData
    _id::String
    title::String
    tags::Vector{String}
end

struct RelatedPost
    _id::String
    tags::Vector{String}
    related::SubArray{PostData, 1, Vector{PostData}, Tuple{Vector{Int64}}, false}
end

function relatedIO()
    posts = JSON3.read("posts.json", Vector{PostData})
    start = now()
    all_related_posts = related(posts)
    println("Processing time (w/o IO): $(now() - start)")
    open("../related_posts_julia.json", "w") do f
        JSON3.write(f, all_related_posts)
    end
end

function related(posts, topn = 5)
    # map each tag to the indices of the posts carrying it
    tagmap = Dict{String,Vector{Int}}()
    for (idx, post) in enumerate(posts), tag in post.tags
        tagmap[tag] = push!(get(() -> Int[], tagmap, tag), idx)
    end
    scratch = zeros(Int, length(posts))   # shared-tag counters, never re-zeroed
    scratchbase = Ref(0)                  # rising floor that invalidates stale counts
    return map(enumerate(posts)) do (i, post)
        for t in post.tags, x in tagmap[t]
            x == i && continue
            # a value at or below scratchbase counts as zero for this post
            scratch[x] = ifelse(scratch[x] < scratchbase[], scratchbase[], scratch[x]) + 1
        end
        tops = nlargest(topn, 1:length(scratch); by = x -> scratch[x])
        scratchbase[] = scratch[tops[1]]  # raise the floor instead of zeroing
        RelatedPost(post._id, post.tags, @view posts[tops])
    end
end

const res = relatedIO()
```
If anyone wants to try and make it nicer/faster…
A new speed trick is to avoid zeroing the `scratch` array by starting new counters from a higher `scratchbase`.
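A toy illustration of that trick, with made-up indices, just to show the mechanics:

```julia
# Instead of zeroing `scratch` between rounds, keep a rising floor `base`;
# any stored value <= base counts as zero and is overwritten on first use.
function demo()
    scratch = zeros(Int, 8)
    base = 0
    for round in 1:3
        for k in (2, 5, 2)                 # pretend these indices share tags
            scratch[k] = max(scratch[k], base) + 1
        end
        counts = max.(scratch .- base, 0)  # effective counts for this round
        @show round counts
        base = maximum(scratch)            # "clear" by raising the floor
    end
end
demo()
```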
Participation is open to any contributor: feel free to fork and update the script at any point.
I fixed the `StrideArrays` mishap, and the latest PR was reverted. The previous, non-`StrideArrays` version was not bad - but the performance gain from adding `StrideArrays` clearly went against the rules.
@Dan, my current version (from the reverted PR) runs in 32 ms on my machine; the version you proposed takes around 54-56 ms. The weird thing is that, looking at your code, I intuitively expected a performance gain - which didn’t materialise when benchmarking it.
From what I see, some improvements could still be made to this function I wrote:
```julia
# Maintains the top-`topn` values of `xs` in `maxv` (kept in ascending order)
# and their indices in `maxn`; returns `maxn` ordered by descending value.
function fastmaxindex!(xs::Vector{Int}, topn, maxn, maxv)
    maxn .= 1
    maxv .= 0
    for (i, x) in enumerate(xs)
        if x > maxv[1]              # beats the smallest of the current top-n
            maxv[1] = x
            maxn[1] = i
            for j in 2:topn         # bubble the new entry up to its slot
                if maxv[j-1] > maxv[j]
                    maxv[j-1], maxv[j] = maxv[j], maxv[j-1]
                    maxn[j-1], maxn[j] = maxn[j], maxn[j-1]
                end
            end
        end
    end
    reverse!(maxn)                  # largest first
    return maxn
end
```
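For anyone playing with it, a small usage example on toy data (the `counts` vector is a made-up stand-in for the real tag counts):

```julia
counts = [3, 9, 1, 7, 7, 2, 8, 5]   # toy stand-in for taggedpostcount
topn = 5
maxn = Vector{Int}(undef, topn)
maxv = Vector{Int}(undef, topn)
fastmaxindex!(counts, topn, maxn, maxv)
# maxn == [2, 7, 4, 5, 8] - the indices of the values 9, 8, 7, 7, 5
```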
Maybe a proper usage of a really fast `PriorityQueue` might help - or maybe somebody can find a smart algorithmic trick that is more compiler-friendly.
Dan
I tried to use `PriorityQueue` from DataStructures, but it was very slow and allocating. The problem could be the underlying `Dict` backing the `PriorityQueue`.
A new `Dict` type that supports keys from a limited universe and is implemented with flat memory storage could be useful in other cases too. I think there are many situations where a `Dict` is known to have keys from a small universe and speed is paramount.
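A minimal sketch of that idea (all names here are illustrative), assuming the tag strings are interned to integer ids up front:

```julia
# A Vector standing in for a Dict whose keys come from a small, known
# universe: no hashing, just direct indexing by interned key id.
struct FlatCounter
    counts::Vector{Int}
end
FlatCounter(nkeys::Integer) = FlatCounter(zeros(Int, nkeys))

increment!(c::FlatCounter, key::Int) = (c.counts[key] += 1)
Base.getindex(c::FlatCounter, key::Int) = c.counts[key]

# usage: intern the tag strings once, then count through flat memory
tags = ["julia", "rust", "julia", "go", "julia"]
id = Dict(t => i for (i, t) in enumerate(unique(tags)))  # one-time interning
c = FlatCounter(length(id))
foreach(t -> increment!(c, id[t]), tags)
c[id["julia"]]  # == 3
```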
The Go implementation seems to do this (using the knowledge that there are at most 100 tags). `Dictionaries.jl` supports a size hint, but in the current scenario it seems slower than the native implementation.
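For comparison, the Base `Dict` side of that experiment looks roughly like this (a sketch; the 100-tag cap is the assumption the Go version exploits):

```julia
# Pre-reserving capacity avoids rehashing as the tag map grows.
tagmap = Dict{String,Vector{Int}}()
sizehint!(tagmap, 100)                        # data has at most ~100 tags
push!(get!(() -> Int[], tagmap, "julia"), 1)  # same access pattern as `related`
```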
jling
I realized that this whole function can be replaced by `partialsortperm!()` - except that it’s much slower right now:
```diff
@@ -69,27 +51,33 @@ function related(posts)
     relatedposts = Vector{RelatedPost}(undef, length(posts))
     taggedpostcount = Vector{Int64}(undef, length(posts))
-    maxn = Vector{Int64}(undef, topn)
-    maxv = Vector{Int64}(undef, topn)
+    scratch = collect(eachindex(posts))
         taggedpostcount[i] = 0
-        fastmaxindex!(taggedpostcount, topn, maxn, maxv)
+        maxn = partialsortperm!(scratch, taggedpostcount, 1:5; rev=true, initialized=true)
         relatedpost = RelatedPost(post._id, post.tags, SVector{topn}(posts[ix] for ix in maxn))
         relatedposts[i] = relatedpost
     end
     return relatedposts
 end
```
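As a self-contained illustration of the replacement (toy data; the order of tied counts may differ):

```julia
counts = [3, 9, 1, 7, 7, 2, 8, 5]   # toy stand-in for taggedpostcount
ix = collect(eachindex(counts))      # reusable index scratch
top5 = partialsortperm!(ix, counts, 1:5; rev = true)
# top5 == [2, 7, 4, 5, 8] - indices of the values 9, 8, 7, 7, 5
```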
Maybe our sorting-algorithm guru @Lilith can provide some comments here.