10,000 robots with collision avoidance in WebGPU in HTML
I wanted to run a plain, browser-based WebGPU experiment without any frameworks or toolchains, so I started with a single HTML file that does both the compute and the drawing. With that, I set a goal: move 10,000 robots on screen at a good FPS.
And it eventually worked! I was able to run all the math on the GPU in three compute passes. The robots are small triangles drawn with instancing. On my M1 Mac I see ~120 FPS at 10k robots; getting there was challenging but fun (my first version gave me just ~32 FPS). On integrated graphics it should work fine if you lower the robot count.
What I did algorithmically?
- Split the screen into a grid of cells.
- Each robot then checks for robots in its own cell and the 8 cells around it.
- Each frame runs three GPU passes:
  - Clear the cell counters (set to 0).
  - Bin robots into cells (atomic add) and store their IDs.
  - Update robots using only nearby cells.
- Then a render pass draws one triangle per robot. No CPU uploads every frame.
That’s the whole trick. Use a grid. Keep memory simple. Let the GPU do the work.
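To make that concrete, here is roughly what one frame looks like on the JavaScript side. This is a simplified sketch; the pipeline, bind group, and buffer names are placeholders I made up, not the demo's exact code.
// Helper: one compute pass = one pipeline + one dispatch (names assumed).
function runComputePass(encoder, pipeline, bindGroup, workgroups) {
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups);
  pass.end();
}

function frame() {
  const encoder = device.createCommandEncoder();
  const wg = 128; // must match @workgroup_size in the WGSL

  runComputePass(encoder, clearPipeline, gridBindGroup, Math.ceil(cellCount / wg));     // 1) gridCounts = 0
  runComputePass(encoder, binPipeline, binBindGroup, Math.ceil(agentCount / wg));       // 2) bin robots into cells
  runComputePass(encoder, updatePipeline, updateBindGroup, Math.ceil(agentCount / wg)); // 3) update from 3x3 neighbors

  const pass = encoder.beginRenderPass({
    colorAttachments: [{ view: context.getCurrentTexture().createView(),
                         clearValue: { r: 0, g: 0, b: 0, a: 1 }, loadOp: 'clear', storeOp: 'store' }],
  });
  pass.setPipeline(renderPipeline);
  pass.setBindGroup(0, renderBindGroup);
  pass.draw(3, agentCount); // instanced: 3 vertices per robot, one instance per robot
  pass.end();

  device.queue.submit([encoder.finish()]);
  [agentsRead, agentsWrite] = [agentsWrite, agentsRead]; // swap A/B (the real code re-picks bind groups too)
  requestAnimationFrame(frame);
}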
How the grid works?
Diagram A shows the grid and binning: each robot maps to a cell, atomically bumps that cell's count, and writes its ID into that cell's slice of the index buffer.
+----+----+----+----+
| 05 | 12 | 03 | 00 | gridCounts[c] == number of agents binned in cell
+----+----+----+----+
| 02 | 08 | 06 | 01 | gridIdx[c*max + 0..count-1] = agent IDs
+----+----+----+----+
| 00 | 03 | 07 | 04 |
+----+----+----+----+
| 00 | 00 | 01 | 00 |
+----+----+----+----+
Diagram A - grid + binning
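To show what the binning looks like in WGSL, here is a minimal sketch of that pass. The struct layout, field names, and binding numbers are assumptions for illustration, not the demo's exact shader.
// Binning pass sketch (names, layout and bindings assumed).
struct Agent { pos : vec2<f32>, vel : vec2<f32> }
struct Params { agentCount : u32, gridW : i32, gridH : i32, cellSize : f32, maxPerCell : u32 }

@group(0) @binding(0) var<uniform> params : Params;
@group(0) @binding(1) var<storage, read> agentsRead : array<Agent>;
@group(0) @binding(3) var<storage, read_write> gridCounts : array<atomic<u32>>;
@group(0) @binding(4) var<storage, read_write> gridIdx : array<u32>;

@compute @workgroup_size(128)
fn bin(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= params.agentCount) { return; } // count comes from a uniform, not arrayLength()
  // Map this robot's position to a cell.
  let p = agentsRead[i].pos;
  let cx = clamp(i32(p.x / params.cellSize), 0, params.gridW - 1);
  let cy = clamp(i32(p.y / params.cellSize), 0, params.gridH - 1);
  let c = u32(cy * params.gridW + cx);
  // Bump the cell's counter and record this robot's ID in the cell's slice of gridIdx.
  let slot = atomicAdd(&gridCounts[c], 1u);
  if (slot < params.maxPerCell) {
    gridIdx[c * params.maxPerCell + slot] = i;
  }
}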
Diagram B illustrates my frame pipeline: Each frame first clears counts, then bins robots, then updates using 3×3 neighbor cells, then swaps buffers and finally renders the triangles.
[agentsRead] --(compute: clear)-> [gridCounts = 0]
\
--(compute: bin)--> [gridCounts, gridIdx]
\
--(compute: update)--> [agentsWrite]
\
-- swap buffers -->
render(instanced triangles from agentsRead)
Diagram B - frame pipeline
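The update pass is where the 3×3 scan happens. Here is a rough sketch of its inner loop, reusing the assumed names from the binning sketch above; in this stage gridCounts can be bound as a plain read-only array<u32>, since only the binning pass needs atomics.
// Inside the update entry point, after computing this robot's cell (cx, cy); selfId is its index.
var force = vec2<f32>(0.0, 0.0);
for (var dy : i32 = -1; dy <= 1; dy = dy + 1) {
  for (var dx : i32 = -1; dx <= 1; dx = dx + 1) {
    let nx = cx + dx;
    let ny = cy + dy;
    if (nx < 0 || ny < 0 || nx >= params.gridW || ny >= params.gridH) { continue; } // off the grid
    let c = u32(ny * params.gridW + nx);
    let count = min(gridCounts[c], params.maxPerCell);
    for (var s : u32 = 0u; s < count; s = s + 1u) {
      let other = gridIdx[c * params.maxPerCell + s];
      if (other == selfId) { continue; }
      // accumulate separation / collision-avoidance force from agentsRead[other] into `force`
    }
  }
}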
Failures/bugs I encountered that led to a blank screen
When I first opened the HTML, the sliders and FPS counter showed up, but the canvas looked empty. It took me a few iterations to make it work.
I started with a basic plan, three compute passes (clear → bin → update), then a render pass that draws tiny triangles from the same buffer. I wired the buffers, wrote the WGSL, saved the HTML file, and opened it in Chrome.
The UI came up with the sliders and FPS counter. But the canvas was empty ☹️
First stop: DevTools console. Chrome gave me a few WGSL errors, but only at pipeline creation (good to know: WebGPU often hides shader errors until you ask it to build a pipeline). One error complained about arrayLength(). I had used it on plain array<Agent> and array<atomic<u32>>. A quick search (spec + a couple of blog posts + ChatGPT sanity check) reminded me it only works on runtime arrays at the end of a struct. I removed those calls and passed counts in via uniforms (agentCount, gridCells), and for per-cell sizes I looked at gridCounts.
Next error: atomics. I had declared:
@group(0) @binding(3) var<storage, read> gridCounts : array<atomic<u32>>;
…but I was doing atomicAdd on it. Console said atomic vars in storage must be read_write. Right, makes sense. I changed it to:
@group(0) @binding(3) var<storage, read_write> gridCounts : array<atomic<u32>>;
and that warning went away.
Then I hit a bind-group layout mismatch. The binning pass was binding @binding(1) even though the shader didn’t declare it. Classic "I refactored the shader but forgot to update the JavaScript." Removing the extra binding fixed that. Tip from a GitHub issue I found: your bind group entries must match the shader exactly; the validator is strict (and that’s a good thing).
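For illustration, the JavaScript side of that contract looks something like this (buffer names assumed): one entry per @binding the shader actually declares, and nothing extra.
// Entries must line up 1:1 with the shader's @group(0) @binding(n) declarations.
const binBindGroup = device.createBindGroup({
  layout: binPipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: paramsBuf } },     // var<uniform> params
    { binding: 1, resource: { buffer: agentsRead } },    // var<storage, read> agentsRead
    { binding: 3, resource: { buffer: gridCountsBuf } }, // var<storage, read_write> gridCounts
    { binding: 4, resource: { buffer: gridIdxBuf } },    // var<storage, read_write> gridIdx
    // An extra entry with no matching @binding (or a missing one) fails validation.
  ],
});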
With those fixed, the pipelines compiled, but still no bots. I dumped a few values and re-read my own shader. Another error in the console nudged me: WGSL for loops aren’t C style. I’d written dy++ 😄. WGSL wants var and manual increments. I rewrote the loops:
var dy: i32 = -1;
loop {
  if (dy > 1) { break; }
  // ...
  dy = dy + 1;
}
I also realized I was referencing Params and Agent in stages where I hadn’t defined them. That one I caught by skimming the code and confirming with the errors. I made a tiny commonWGSL string and included it wherever I needed those structs. That kept things consistent.
At this point it ran, but the canvas still looked empty. Time to suspect the obvious: maybe the triangles are there, just way too small. I bumped the triangle size from microscopic (~0.005 px, lol) to ~6 px, added a little velocity-based tint, and suddenly I could see dots. Progress.
But movement looked off. On resize, the bots seemed to drift into nowhere land. That turned out to be me not updating uniforms (screen width/height, grid dims) every frame. I started writing screenBuf/paramsBuf each frame and rebuilt the grid if the canvas changed size. That snapped positions back into place.
Some of these fixes came straight from the Chrome console messages; others from skimming the WebGPU/WGSL spec and a few blog posts; a couple I sanity-checked with ChatGPT (mainly the arrayLength() and atomic rules). The last bit, the invisible triangles, was just me remembering to "render something big and bright first."
If you’re debugging this stuff: keep DevTools open, read the line/column in the WGSL errors, and don’t trust your triangle size!
Why it’s fast enough
- The grid turns "check everyone" into "check 9 cells."
- Buffers are flat and friendly for the GPU.
- Only the cell counters use atomics.
- The CPU doesn’t upload per-frame vertices. The render pass reads the same buffer the compute pass wrote.
From ~32 FPS to ~120 FPS (what I changed and why)
Your numbers will differ. This is a rough guide from my tests.
I started on an M1 laptop and saw ~30-32 FPS at 1k robots, falling to 16 FPS at 20k. That told me I was blowing the frame budget on pixels (too many to shade) and work per agent (too many neighbors, atomics, etc.). I then started to adjust some knobs and fixed a few things.
1) I reduced the pixels the GPU had to shade
HiDPI is great for text, not for 10k triangles.
- I capped the device pixel ratio: set DPR cap = 1.0.
- I lowered the internal render scale: kept the CSS size the same but set render scale = 0.9–1.
Result: Instant jump. It’s basically a free 2-3x if you were at DPR 2. The scene looked the same at arm’s length.
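A minimal sketch of that sizing logic with my own variable names (the cap and scale mirror the settings above):
// Same CSS size on screen, fewer pixels for the GPU to shade.
function sizeCanvas(canvas, dprCap = 1.0, renderScale = 0.9) {
  const dpr = Math.min(window.devicePixelRatio || 1, dprCap);
  const w = canvas.clientWidth;  // CSS size stays the same
  const h = canvas.clientHeight;
  canvas.width  = Math.max(1, Math.floor(w * dpr * renderScale));
  canvas.height = Math.max(1, Math.floor(h * dpr * renderScale));
}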
2) I lowered per-agent neighbor work
The uniform grid is the whole trick, so I leaned into it.
- Raised cellSize from ~12–16 to 20 (fewer cells, fewer lookups).
- Set maxPerCell to a realistic bound (28). If it’s too high, you scan too much; too low, you drop a few neighbors under extreme clumps but keep the frame smooth.
This cut both compute time and atomic contention on the binning pass.
3) I ran physics at 30 Hz, but rendered at 60 Hz
I only run the compute passes every other frame:
// roughly 30 Hz physics, 60 Hz render
if ((tick++ & 1) === 0) runCompute(encoder);
render(encoder, view);
4) I bumped workgroup size for my GPUs
Mac likes bigger threadgroups. I went from 64 to 128 (sometimes 256).
- For me, 128 gave the best consistent results.
- The HTML now lets me switch between 64/128/256 and rebuilds pipelines on click.
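Because the workgroup size is baked into the WGSL, switching it means rebuilding the compute pipelines and keeping the dispatch count in sync. A rough sketch, assuming the commonWGSL string mentioned earlier provides the structs and bindings:
// Rebuild the update pipeline for a given workgroup size (names assumed).
function makeUpdatePipeline(workgroupSize) {
  const code = `
    ${commonWGSL}
    @compute @workgroup_size(${workgroupSize})
    fn update(@builtin(global_invocation_id) gid : vec3<u32>) {
      if (gid.x >= params.agentCount) { return; } // guard the last, partially filled workgroup
      // ... neighbor scan and integration ...
    }`;
  return device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code }), entryPoint: 'update' },
  });
}

const updatePipeline = makeUpdatePipeline(128); // or 64 / 256 from the UI
The dispatch count has to follow along: one thread per robot means Math.ceil(agentCount / workgroupSize) workgroups.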
5) I stopped making the GPU do silly work
- Turned off debug overlays while testing.
- Made triangles a sane size (6 px), and computed color cheaply from velocity.
- Kept the vertex math tiny; more math in compute, less in the vertex shader.
6) I fixed two bugs that tanked performance (and sometimes correctness)
- Pipeline creation before buffers: I was creating bind groups in resize() before the agent buffers existed, which crashed the first frame. The fix was to guard pipeline creation until buffers are ready, and also rebuild after swapping A/B.
- WGSL intrinsic name: I wrote inversesqrt; WGSL wants inverseSqrt. That one line can invalidate the compute pipeline and leave you "rendering" nothing.
7) I kept uniform writes small and simple
All per-frame changes are tiny queue.writeBuffer calls to the two uniform buffers. No mapping, no reallocation, no per-frame pipeline rebuilds.
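Roughly, the per-frame writes look like this. screenBuf and paramsBuf are the two uniform buffers from above; the packed layout is an assumption and has to match the WGSL struct field for field.
// Two small queue.writeBuffer calls per frame; no mapping, no new allocations.
device.queue.writeBuffer(screenBuf, 0, new Float32Array([canvas.width, canvas.height]));

const params = new ArrayBuffer(20);
const dv = new DataView(params);
dv.setUint32(0, state.agentCount, true);   // agentCount
dv.setInt32(4, gridW, true);               // grid dims
dv.setInt32(8, gridH, true);
dv.setFloat32(12, state.cellSize, true);   // cellSize
dv.setUint32(16, state.maxPerCell, true);  // maxPerCell
device.queue.writeBuffer(paramsBuf, 0, params);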
8) I kept the machine honest
- Plugged in and turned off Low Power Mode on macOS.
- Kept the tab visible (background tabs throttle).
- Closed other GPU-heavy windows.
The “fast preset” that hit ~120 FPS (10k robots on my laptop)
What worked best for me (your numbers will differ):
- DPR cap: 1.0
- Render scale: 0.9
- Agents: 10,000
- cellSize: 20
- maxPerCell: 28
- Physics rate: every 2 frames (~30 Hz)
- Workgroup size: 128 (try 256)
- Triangle size: 6
- Vel color: 1.0
If you want more, go f16 later (store pos/vel as vec2<f16>), but I kept the demo clean and portable for now.
Debug crumbs that actually helped
- Empty canvas but FPS showing? Check for shader errors at pipeline creation (DevTools only prints them then). It’s usually a WGSL gotcha (loop syntax, intrinsic names, missing struct).
- Nothing draws, no errors? Your triangles might be too small or off-screen from bad uniforms (screen width/height). Write those every frame; rebuild grid on resize.
- Random validation error about bind groups? 99% of the time it’s a layout mismatch: your JS entries and WGSL bindings must match exactly.
- Stutters when flocks clump? Atomics hot-spots. Raise cellSize a bit, keep maxPerCell modest (and it’s okay to drop extra entries).
Tiny diffs that mattered
Guard pipeline builds until buffers exist:
- allocGrid(); makePipelines();
+ allocGrid();
+ if (agentsA && agentsB) makePipelines();
Build pipelines after creating agents:
resize();
allocAgents(state.agentCount);
+ makePipelines();
writeParams();
Fix the WGSL intrinsic:
- let inv = inversesqrt(d2);
+ let inv = inverseSqrt(d2);
Where I got the answers
- Chrome DevTools errors (pipeline creation time) pointed to the real WGSL lines.
- WGSL spec + a couple of WebGPU blog posts (loop syntax, atomic rules).
- A ChatGPT sanity check when I got stuck on arrayLength() and storage access modes.
- Trial and error on M1 with DPR and workgroup size.
If you want to squeeze even more
- Half precision: enable f16; and put pos/vel in vec2<f16>; convert to f32 for math. Cuts bandwidth (see the sketch after this list).
- SoA buffers: split positions and velocities for more linear reads.
- Better neighbor cache: prefetch 3×3 cell bounds to registers/shared mem (more code, more speed).
- Soft-cap cells: when slot >= maxPerCell, randomly drop entries to avoid worst-case scans.
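For the half-precision idea, a rough WGSL sketch. It assumes the adapter exposes the shader-f16 feature and that it was requested on the device; the struct name and layout are illustrative, not from the demo.
enable f16;

// Storage in half precision: 8 bytes per agent instead of 16.
struct AgentF16 { pos : vec2<f16>, vel : vec2<f16> }

@group(0) @binding(1) var<storage, read> agentsRead : array<AgentF16>;

fn load_pos(i : u32) -> vec2<f32> {
  // f16 is only for storage bandwidth; convert to f32 for the actual math.
  return vec2<f32>(agentsRead[i].pos);
}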
Closing
This toy project is about how far a clean, single-HTML WebGPU baseline can go: clear compute passes, predictable performance, and laptop-friendliness. If you squeeze out more FPS or find a smarter kernel, send it over; I’ll ship it as a preset and credit you.