WebGPU "Wavefront" Path Tracing
Addison Prairie / June 23
Recently, I read the paper
Megakernels Considered Harmful on "wavefront" path tracing for GPUs, which breaks the steps of a traditional monolithic path tracer into individual compute kernels for better coherence and latency hiding. With Chrome 113 shipping with WebGPU support, I decided to try implementing some of those ideas in my own tiny path tracer.
NOTE: this code was written when I was still getting familiar with WebGPU and GPU compute in general, so it is neither particularly optimized nor well organized. The source code is still available
here. A demo is also available
here.
Path traced scan of a Hannibal sculpture by Sebastian Slodtz. Diffuse (left), glass-coated diffuse (right). Model courtesy
Scan the World.
1. BVH Building & Ray Tracing
In order to path trace 3D models, I needed to build BVHs for those models and intersect those BVHs in a GPU kernel. The code to construct the BVH is written in C and compiled to WebAssembly (WASM) using Emscripten. Using WASM not only made BVH construction significantly faster, it also let me lay out the GPU-bound buffers in C with exactly the layout the GPU would read them in, rather than assembling them with a DataView or typed array in JavaScript. The BVH is represented by two buffers: a flat array of internal and leaf nodes, and an array of primitives (triangles in this case). The memory layout for each looks like this:
typedef unsigned int uint;

struct BVH_Branch {
    float AABB_L_Lo[3];     // lower corner of the left child's AABB
    uint  RightChildIndex;  // the left child is implicitly at (this node's index + 1)
    float AABB_L_Hi[3];     // upper corner of the left child's AABB
    uint  PADDING0;
    float AABB_R_Lo[3];     // lower corner of the right child's AABB
    uint  PADDING1;
    float AABB_R_Hi[3];     // upper corner of the right child's AABB
    uint  PADDING2;
};

struct BVH_Leaf {
    uint primitiveStartIndex;  // first triangle in the primitive array
    uint primitiveEndIndex;    // one past the last triangle
    uint PADDING0;
    uint LEAF_FLAG;            // always 0; overlaps RightChildIndex, which is never 0 for a branch
    uint PADDING1_12[12];      // pads both node types to the same 64-byte size
};
A parent node stores the bounding boxes of both its children; this is so that the intersection algorithm on the GPU does not have to read from two "random" memory addresses in order to pick which child nodes to traverse or skip. Additionally, each node assumes that its left child will be at the next index in the array, so it only stores an index for its right child. A leaf node points to a list of consecutive primitives in the array of primitives, and contains a flag (LEAF_FLAG) that distinguishes a branch node from a leaf node.
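On the WGSL side, both node types can be overlaid on a single 64-byte struct, with the leaf fields recovered via bitcast. Something like the following sketch (my own reconstruction, not code from the source) is what the traversal function below assumes:

struct BVHNode {
    AABB_L_Lo : vec3f,      // leaf: primitiveStartIndex / primitiveEndIndex / padding, bitcast to u32
    RightChildIndex : u32,  // leaf: LEAF_FLAG, which is always 0
    AABB_L_Hi : vec3f,
    PADDING0 : u32,
    AABB_R_Lo : vec3f,
    PADDING1 : u32,
    AABB_R_Hi : vec3f,
    PADDING2 : u32,
}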
The ray-BVH intersection function, written in a WGSL-style syntax, looks something like this:
@group(...) @binding(...) var<storage> bvh : array<BVHNode>;
...
var<private> stack : array<u32, 64>;
fn intersect(o : vec3f, d : vec3f) -> RayHit {
    var hit : RayHit;
    var closestDist = 1e30f;
    var stackPtr = 1u;
    var currentNode = 0u;
    var newLeaf = false;
    var primitiveLoIndex = 0u;
    var primitiveHiIndex = 0u;
    while (stackPtr != 0u) {
        var node : BVHNode = bvh[currentNode];
        // a leaf stores LEAF_FLAG == 0 where a branch stores its right
        // child index, which is never 0
        let isLeaf = node.RightChildIndex == 0u;
        if (isLeaf && newLeaf) {
            newLeaf = false;
            // the leaf's primitive range lives in the first AABB's floats
            primitiveLoIndex = bitcast<u32>(node.AABB_L_Lo.x);
            primitiveHiIndex = bitcast<u32>(node.AABB_L_Lo.y);
        }
        if (primitiveLoIndex < primitiveHiIndex) {
            // ... Moller-Trumbore test against triangle primitiveLoIndex,
            // producing newDist and a normal, which are recorded in hit ...
            if (HIT && newDist < closestDist) {
                closestDist = newDist;
            }
            primitiveLoIndex++;
        } else if (isLeaf) {
            // finished this leaf's primitives: pop the next node to visit
            stackPtr--;
            currentNode = stack[stackPtr];
            newLeaf = true;
        } else {
            var leftDist  = AABBIntersect(o, d, node.AABB_L_Lo, node.AABB_L_Hi);
            var rightDist = AABBIntersect(o, d, node.AABB_R_Lo, node.AABB_R_Hi);
            // AABBIntersect returns a negative distance on a miss
            var leftValid  = leftDist >= 0. && leftDist <= closestDist;
            var rightValid = rightDist >= 0. && rightDist <= closestDist;
            if (leftValid && rightValid) {
                // visit the nearer child first and push the further one
                var furtherIndex = 0u;
                if (leftDist < rightDist) {
                    furtherIndex = node.RightChildIndex;
                    currentNode = currentNode + 1u;
                } else {
                    furtherIndex = currentNode + 1u;
                    currentNode = node.RightChildIndex;
                }
                stack[stackPtr] = furtherIndex;
                stackPtr++;
            } else if (leftValid) {
                currentNode = currentNode + 1u;  // left child is adjacent in the array
            } else if (rightValid) {
                currentNode = node.RightChildIndex;
            } else {
                // neither child was hit: pop the next node to visit
                stackPtr--;
                currentNode = stack[stackPtr];
            }
            newLeaf = true;
        }
    }
    return hit;
}
The core of the function is a while loop that, with each iteration, either descends further into the BVH looking for a leaf node or tests the ray against a triangle. The "..." in the primitive intersection portion of the function computes the distance to the triangle and its normal using the Möller-Trumbore ray-triangle intersection algorithm. There's really nothing new or innovative here, and there are probably further optimizations to be made. However, I found the performance to be good enough. I was too lazy to put together WebGPU QuerySets to measure the time it takes to intersect large models, but the path tracer does run at interactive frame rates for models with ~1M triangles on my laptop (Radeon RX 6700S).
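For reference, the elided triangle test might look something like this sketch of Möller-Trumbore in WGSL (the function name and signature are mine, not from the source):

fn intersectTriangle(o : vec3f, d : vec3f, v0 : vec3f, v1 : vec3f, v2 : vec3f) -> f32 {
    // returns the hit distance along d, or -1. on a miss
    let e1 = v1 - v0;
    let e2 = v2 - v0;
    let p = cross(d, e2);
    let det = dot(e1, p);
    if (abs(det) < 1e-8) {  // ray is (nearly) parallel to the triangle
        return -1.;
    }
    let invDet = 1. / det;
    let s = o - v0;
    let u = dot(s, p) * invDet;  // first barycentric coordinate
    if (u < 0. || u > 1.) {
        return -1.;
    }
    let q = cross(s, e1);
    let v = dot(d, q) * invDet;  // second barycentric coordinate
    if (v < 0. || u + v > 1.) {
        return -1.;
    }
    let dist = dot(e2, q) * invDet;
    return select(-1., dist, dist > 0.);
}

The geometric normal falls out of the same data as normalize(cross(e1, e2)).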
One thing I did not get to do here but would really like to do in the future is find a good way to store very large models in GPU buffers; currently, due to WebGPU's buffer size limits, the path tracer can only fit around 1M triangles into a storage buffer.
Early test of BVH ray tracing on the GPU. A ~1M triangle model ray traced in real time on a laptop.
2. Wavefront Path Tracing
Wavefront path tracing works by splitting the traditional monolithic path tracing loop (iterative rather than recursive) across multiple smaller, leaner GPU kernels. For example, consider a basic loop like the following:
// o and d hold the current ray origin and direction
var t = vec3f(0.);  // accumulated radiance
var b = vec3f(1.);  // path throughput
for (var i = 0; i < MAX_BOUNCES; i++) {
    var normal : vec3f;
    var dist : f32;
    var bHit : bool = trace(o, d, &normal, &dist);
    if (!bHit) {
        t += b * SKY_COLOR;
        break;
    }
    // step to the hit point, offset slightly along the normal to
    // avoid self-intersection
    o = o + d * dist + normal * .001;
    b *= Sample_f_lambert(normal, &d);
}
What happens when we run this loop as-is with a workgroup size of 4 threads? The workgroup can only leave the for-loop once every thread has either hit MAX_BOUNCES or missed a surface. Consider the following scenario, where T marks an iteration in which a thread's trace() hit a surface, and B marks the iteration in which its trace() missed and it broke out of the loop:
loop index: | 1 | 2 | 3 | 4 | 5 |
thread 0:   | T | T | T | T | B |
thread 1:   | T | B | - | - | - |
thread 2:   | B | - | - | - | - |
thread 3:   | T | B | - | - | - |
In this example, thread 0 bounced into a crevice and took 5 iterations to reach the sky light, while thread 2 did not hit a surface at all. Even though threads 1, 2, and 3 broke out of the loop relatively early, the workgroup could not leave the loop until thread 0 was finished, so the equivalent of 10 traces was wasted (3 from thread 1, 4 from thread 2, and 3 from thread 3). One of the goals of wavefront path tracing is to allow the loop to regenerate rays so that each call to trace() runs across as many active threads as possible. For example, if we change our code slightly:
var t = vec3f(0.);  // accumulated radiance
var b = vec3f(1.);  // path throughput
var samples = 0;
for (var i = 0; i < MAX_BOUNCES; i++) {
    var normal : vec3f;
    var dist : f32;
    var bHit : bool = trace(o, d, &normal, &dist);
    if (!bHit) {
        t += b * SKY_COLOR;
        samples++;
        // this path is finished: immediately start a new one from the camera
        regenerateCameraRay(&o, &d);
        b = vec3f(1.);
    } else {
        o = o + d * dist + normal * .001;
        b *= Sample_f_lambert(normal, &d);
    }
}
// average the completed paths; max() avoids dividing by zero, and t is
// black anyway when samples == 0
var sumColor = t / f32(max(samples, 1));
Since we can regenerate rays inside the loop, we can keep every thread busy during each call to trace(). This adds some overhead: the "if (!bHit) { ... }" branch is now larger, so threads that do not take it sit "idle" for longer. However, that cost is more than made up for by having more threads execute the significantly more expensive trace() function.
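As a concrete example of ray regeneration, regenerateCameraRay() only needs to build a fresh primary ray for the thread's pixel. A minimal pinhole-camera sketch might look like this (the Camera struct and the explicit uv parameter are illustrative assumptions on my part; the real version also models focal length and aperture, as described in section 4):

struct Camera {
    pos : vec3f,        // camera position in world space
    lowerLeft : vec3f,  // world-space lower-left corner of the film plane
    horiz : vec3f,      // film plane horizontal extent
    vert : vec3f,       // film plane vertical extent
}

@group(0) @binding(0) var<uniform> cam : Camera;

fn regenerateCameraRay(o : ptr<function, vec3f>, d : ptr<function, vec3f>, uv : vec2f) {
    // uv in [0, 1]^2 picks a (jittered) sample point on the film plane
    *o = cam.pos;
    *d = normalize(cam.lowerLeft + uv.x * cam.horiz + uv.y * cam.vert - cam.pos);
}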
Wavefront path tracing as described in the paper goes well beyond this simple optimization: it splits ray tracing, material evaluation, ray generation, and the loop logic into separate kernels that communicate through storage buffers. I chose to implement this, at least partially: the program executes four separate kernels each iteration and communicates between them with queues and atomic operations. I won't go into much more detail, as most of this is described thoroughly in the paper. One thing to note is that my current implementation does not use indirect dispatches (ExecuteIndirect in D3D12 terms, dispatchWorkgroupsIndirect() in WebGPU), meaning it has to read the size of each queue back to the CPU multiple times a frame; switching to indirect dispatches would probably be one of the simplest yet most beneficial optimizations I could make in the future.
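To give a flavor of how the kernels communicate, pushing work onto a queue only takes an atomic counter in a storage buffer. This is a simplified sketch of mine (the names QueueHeader, queue, and pushWork are illustrative, not the project's actual code):

struct QueueHeader {
    count : atomic<u32>,  // number of items currently in the queue
}

@group(0) @binding(0) var<storage, read_write> header : QueueHeader;
@group(0) @binding(1) var<storage, read_write> queue : array<u32>;

fn pushWork(item : u32) {
    // atomicAdd returns the old count, handing this thread a unique slot
    let slot = atomicAdd(&header.count, 1u);
    queue[slot] = item;
}

The per-frame readback described above amounts to reading header.count back on the CPU to size the next dispatch; an indirect dispatch would instead consume that count directly on the GPU.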
Figurine path traced with transmissive material. Model courtesy
Scan the World.
3. Image Based Lighting
It can be difficult to set up lighting that looks realistic. For this project, I decided to use an HDR image as a skylight. The one file in the entire source code not written by me is the program "hdrToFloats.cc", which is from
this site. It reads a .HDR file and outputs a buffer of floats ready to be sent to the GPU. In the future, I could probably get away with sending the raw .HDR file to the GPU, then reconstructing the RGB value from the mantissas and shared exponent using a bitcast operation.
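For example, assuming the texels are packed in the standard Radiance RGBE layout (three 8-bit mantissas and a shared 8-bit exponent biased by 128, with red in the low byte, which is my assumption here), the decode could look like this sketch:

fn decodeRGBE(packed : u32) -> vec3f {
    let r = f32(packed & 0xFFu);
    let g = f32((packed >> 8u) & 0xFFu);
    let b = f32((packed >> 16u) & 0xFFu);
    let e = (packed >> 24u) & 0xFFu;
    if (e == 0u) {
        return vec3f(0.);  // a zero exponent encodes black
    }
    // 136 = 128 exponent bias + 8 bits to normalize the mantissas
    return vec3f(r, g, b) * exp2(f32(e) - 136.);
}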
Path tracing example with Image Based Lighting.
4. Interface
The interface for interacting with the path tracer is broken into four separate editor windows: "Meshes", which lets users import .obj files and convert them to BVHs; "Objects", which lets the user position and rotate objects and change their materials and meshes; "Materials", which lets the user edit the properties of the materials applied to objects; and "Camera & Film", which lets the user change the size of the render and the focal length and aperture of the camera.
The interface is one of my least favorite parts of the project, and I hope to eventually recreate a project similar to this one with a much better interface. I would also like to re-implement wavefront path tracing and BVH ray tracing with a greater focus on performance.
A screenshot of a work-in-progress editor interface.