10,000 robots with collision avoidance in WebGPU in HTML
I wanted to run a plain, browser-based WebGPU experiment without any frameworks or toolchains, so I started with a single HTML file that does both the compute and the drawing. With that, I set a goal: move 10,000 robots on screen at a good FPS.
And it eventually worked! I was able to run all the math on the GPU in three compute passes. The robots are small triangles drawn with instancing. On my M1 Mac I see ~120 FPS at 10k robots; getting there was challenging but fun (my first version gave me just ~32 FPS). On integrated graphics it should work fine if you lower the robot count.
What I did algorithmically?
- Split the screen into a grid of cells.
- Each robot then checks for robots in its own cell and the 8 cells around it.
- Each frame runs three GPU passes:
  - Clear the cell counters (set to 0).
  - Bin robots into cells (atomic add) and store their IDs.
  - Update robots using only nearby cells.
- Then a render pass draws one triangle per robot. No CPU uploads every frame.
That’s the whole trick. Use a grid. Keep memory simple. Let the GPU do the work.
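To make that concrete, here is roughly what one frame looks like on the JavaScript side. This is a simplified sketch; the pipeline, bind group, and buffer names are placeholders I made up, not the demo's exact code.
// Helper: one compute pass = one pipeline + one dispatch (names assumed).
function runComputePass(encoder, pipeline, bindGroup, workgroups) {
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups);
  pass.end();
}

function frame() {
  const encoder = device.createCommandEncoder();
  const wg = 128; // must match @workgroup_size in the WGSL

  runComputePass(encoder, clearPipeline, gridBindGroup, Math.ceil(cellCount / wg));     // 1) gridCounts = 0
  runComputePass(encoder, binPipeline, binBindGroup, Math.ceil(agentCount / wg));       // 2) bin robots into cells
  runComputePass(encoder, updatePipeline, updateBindGroup, Math.ceil(agentCount / wg)); // 3) update from 3x3 neighbors

  const pass = encoder.beginRenderPass({
    colorAttachments: [{ view: context.getCurrentTexture().createView(),
                         clearValue: { r: 0, g: 0, b: 0, a: 1 }, loadOp: 'clear', storeOp: 'store' }],
  });
  pass.setPipeline(renderPipeline);
  pass.setBindGroup(0, renderBindGroup);
  pass.draw(3, agentCount); // instanced: 3 vertices per robot, one instance per robot
  pass.end();

  device.queue.submit([encoder.finish()]);
  [agentsRead, agentsWrite] = [agentsWrite, agentsRead]; // swap A/B (the real code re-picks bind groups too)
  requestAnimationFrame(frame);
}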
How the grid works?
Diagram A shows the grid and binning: each robot maps to a cell, atomically bumps that cell's count, and writes its ID into that cell's slice of the index buffer.
+----+----+----+----+
| 05 | 12 | 03 | 00 | gridCounts[c] == number of agents binned in cell
+----+----+----+----+
| 02 | 08 | 06 | 01 | gridIdx[c*max + 0..count-1] = agent IDs
+----+----+----+----+
| 00 | 03 | 07 | 04 |
+----+----+----+----+
| 00 | 00 | 01 | 00 |
+----+----+----+----+
Diagram A - grid + binning
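To show what the binning looks like in WGSL, here is a minimal sketch of that pass. The struct layout, field names, and binding numbers are assumptions for illustration, not the demo's exact shader.
// Binning pass sketch (names, layout and bindings assumed).
struct Agent { pos : vec2<f32>, vel : vec2<f32> }
struct Params { agentCount : u32, gridW : i32, gridH : i32, cellSize : f32, maxPerCell : u32 }

@group(0) @binding(0) var<uniform> params : Params;
@group(0) @binding(1) var<storage, read> agentsRead : array<Agent>;
@group(0) @binding(3) var<storage, read_write> gridCounts : array<atomic<u32>>;
@group(0) @binding(4) var<storage, read_write> gridIdx : array<u32>;

@compute @workgroup_size(128)
fn bin(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= params.agentCount) { return; } // count comes from a uniform, not arrayLength()
  // Map this robot's position to a cell.
  let p = agentsRead[i].pos;
  let cx = clamp(i32(p.x / params.cellSize), 0, params.gridW - 1);
  let cy = clamp(i32(p.y / params.cellSize), 0, params.gridH - 1);
  let c = u32(cy * params.gridW + cx);
  // Bump the cell's counter and record this robot's ID in the cell's slice of gridIdx.
  let slot = atomicAdd(&gridCounts[c], 1u);
  if (slot < params.maxPerCell) {
    gridIdx[c * params.maxPerCell + slot] = i;
  }
}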
Diagram B illustrates my frame pipeline: Each frame first clears counts, then bins robots, then updates using 3×3 neighbor cells, then swaps buffers and finally renders the triangles.
[agentsRead] --(compute: clear)-> [gridCounts = 0]
\
--(compute: bin)--> [gridCounts, gridIdx]
\
--(compute: update)--> [agentsWrite]
\
-- swap buffers -->
render(instanced triangles from agentsRead)
Diagram B - frame pipeline
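The update pass is where the 3×3 scan happens. Here is a rough sketch of its inner loop, reusing the assumed names from the binning sketch above; in this stage gridCounts can be bound as a plain read-only array<u32>, since only the binning pass needs atomics.
// Inside the update entry point, after computing this robot's cell (cx, cy); selfId is its index.
var force = vec2<f32>(0.0, 0.0);
for (var dy : i32 = -1; dy <= 1; dy = dy + 1) {
  for (var dx : i32 = -1; dx <= 1; dx = dx + 1) {
    let nx = cx + dx;
    let ny = cy + dy;
    if (nx < 0 || ny < 0 || nx >= params.gridW || ny >= params.gridH) { continue; } // off the grid
    let c = u32(ny * params.gridW + nx);
    let count = min(gridCounts[c], params.maxPerCell);
    for (var s : u32 = 0u; s < count; s = s + 1u) {
      let other = gridIdx[c * params.maxPerCell + s];
      if (other == selfId) { continue; }
      // accumulate separation / collision-avoidance force from agentsRead[other] into `force`
    }
  }
}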
Failures/bugs I encountered that led to a blank screen
When I first opened the HTML, the sliders and FPS counter showed up, but the canvas looked empty. It took me a few iterations to make it work.
I started with a basic plan, three compute passes (clear → bin → update), then a render pass that draws tiny triangles from the same buffer. I wired the buffers, wrote the WGSL, saved the HTML file, and opened it in Chrome.
The UI came up with the sliders and FPS counter. But the canvas was empty ☹️
First stop: DevTools console. Chrome gave me a few WGSL errors, but only at pipeline creation (good to know: WebGPU often hides shader errors until you ask it to build a pipeline). One error complained about arrayLength(). I had used it on plain array<Agent> and array<atomic<u32>>. A quick search (spec + a couple of blog posts + ChatGPT sanity check) reminded me it only works on runtime arrays at the end of a struct. I removed those calls and passed counts in via uniforms (agentCount, gridCells), and for per-cell sizes I looked at gridCounts.
Next error: atomics. I had declared:
@group(0) @binding(3) var<storage, read> gridCounts : array<atomic<u32>>;
…but I was doing atomicAdd on it. Console said atomic vars in storage must be read_write. Right, makes sense. I changed it to:
@group(0) @binding(3) var<storage, read_write> gridCounts : array<atomic<u32>>;
and that warning went away.
Then I hit a bind-group layout mismatch. The binning pass was binding @binding(1) even though the shader didn’t declare it. Classic "I refactored the shader but forgot to update the JavaScript." Removing the extra binding fixed that. Tip from a GitHub issue I found: your bind group entries must match the shader exactly; the validator is strict (and that’s a good thing).
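For illustration, the JavaScript side of that contract looks something like this (buffer names assumed): one entry per @binding the shader actually declares, and nothing extra.
// Entries must line up 1:1 with the shader's @group(0) @binding(n) declarations.
const binBindGroup = device.createBindGroup({
  layout: binPipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: paramsBuf } },     // var<uniform> params
    { binding: 1, resource: { buffer: agentsRead } },    // var<storage, read> agentsRead
    { binding: 3, resource: { buffer: gridCountsBuf } }, // var<storage, read_write> gridCounts
    { binding: 4, resource: { buffer: gridIdxBuf } },    // var<storage, read_write> gridIdx
    // An extra entry with no matching @binding (or a missing one) fails validation.
  ],
});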
With those fixed, the pipelines compiled, but still no bots. I dumped a few values and re-read my own shader. Another error in the console nudged me: WGSL for loops aren’t C style. I’d written dy++ 😄. WGSL wants var and manual increments. I rewrote the loops:
var dy: i32 = -1;
loop {
  if (dy > 1) { break; }
  // ...
  dy = dy + 1;
}
I also realized I was referencing Params and Agent in stages where I hadn’t defined them. That one I caught by skimming the code and confirming with the errors. I made a tiny commonWGSL string and included it wherever I needed those structs. That kept things consistent.
At this point it ran, but the canvas still looked empty. Time to suspect the obvious: maybe the triangles are there, just way too small. I bumped the triangle size from microscopic (~0.005 px, lol) to ~6 px, added a little velocity-based tint, and suddenly I could see dots. Progress.
But movement looked off. On resize, the bots seemed to drift into nowhere land. That turned out to be me not updating uniforms (screen width/height, grid dims) every frame. I started writing screenBuf/paramsBuf each frame and rebuilt the grid if the canvas changed size. That snapped positions back into place.
Some of these fixes came straight from the Chrome console messages; others from skimming the WebGPU/WGSL spec and a few blog posts; a couple I sanity-checked with ChatGPT (mainly the arrayLength() and atomic rules). The last bit, the invisible triangles, was just me remembering to "render something big and bright first."
If you’re debugging this stuff: keep DevTools open, read the line/column in the WGSL errors, and don’t trust your triangle size!
Why it’s fast enough
- The grid turns "check everyone" into "check 9 cells."
- Buffers are flat and friendly for the GPU.
- Only the cell counters use atomics.
- The CPU doesn’t upload per-frame vertices. The render pass reads the same buffer the compute pass wrote.
From ~32 FPS to ~120 FPS (what I changed and why)
Your numbers will differ. This is a rough guide from my tests.
I started on an M1 laptop and saw ~30-32 FPS at 1k robots, falling to 16 FPS at 20k. That told me I was blowing the frame budget on pixels (too many to shade) and work per agent (too many neighbors, atomics, etc.). I then started to adjust some knobs and fixed a few things.
1) I reduced the pixels the GPU had to shade
HiDPI is great for text, not for 10k triangles.
- I capped the device pixel ratio: set DPR cap = 1.0.
- I lowered the internal render scale: kept the CSS size the same but set render scale = 0.9–1.
Result: Instant jump. It’s basically a free 2-3x if you were at DPR 2. The scene looked the same at arm’s length.
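A minimal sketch of that sizing logic with my own variable names (the cap and scale mirror the settings above):
// Same CSS size on screen, fewer pixels for the GPU to shade.
function sizeCanvas(canvas, dprCap = 1.0, renderScale = 0.9) {
  const dpr = Math.min(window.devicePixelRatio || 1, dprCap);
  const w = canvas.clientWidth;  // CSS size stays the same
  const h = canvas.clientHeight;
  canvas.width  = Math.max(1, Math.floor(w * dpr * renderScale));
  canvas.height = Math.max(1, Math.floor(h * dpr * renderScale));
}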
2) I lowered per-agent neighbor work
The uniform grid is the whole trick, so I leaned into it.
- Raised cellSize from ~12–16 to 20 (fewer cells, fewer lookups).
- Set maxPerCell to a realistic bound (28). If it’s too high, you scan too much; too low, you drop a few neighbors under extreme clumps but keep the frame smooth.
This cut both compute time and atomic contention on the binning pass.
3) I ran physics at 30 Hz, but rendered at 60 Hz
I only run the compute passes every other frame:
// roughly 30 Hz physics, 60 Hz render
if ((tick++ & 1) === 0) runCompute(encoder);
render(encoder, view);
4) I bumped workgroup size for my GPUs
Mac likes bigger threadgroups. I went from 64 to 128 (sometimes 256).
- For me, 128 gave the best consistent results.
- The HTML now lets me switch between 64/128/256 and rebuilds pipelines on click.
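Because the workgroup size is baked into the WGSL, switching it means rebuilding the compute pipelines and keeping the dispatch count in sync. A rough sketch, assuming the commonWGSL string mentioned earlier provides the structs and bindings:
// Rebuild the update pipeline for a given workgroup size (names assumed).
function makeUpdatePipeline(workgroupSize) {
  const code = `
    ${commonWGSL}
    @compute @workgroup_size(${workgroupSize})
    fn update(@builtin(global_invocation_id) gid : vec3<u32>) {
      if (gid.x >= params.agentCount) { return; } // guard the last, partially filled workgroup
      // ... neighbor scan and integration ...
    }`;
  return device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code }), entryPoint: 'update' },
  });
}

const updatePipeline = makeUpdatePipeline(128); // or 64 / 256 from the UI
The dispatch count has to follow along: one thread per robot means Math.ceil(agentCount / workgroupSize) workgroups.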
5) I stopped making the GPU do silly work
- Turned off debug overlays while testing.
- Made triangles a sane size (6 px), and computed color cheaply from velocity.
- Kept the vertex math tiny; more math in compute, less in the vertex shader.
6) I fixed two bugs that tanked performance (and sometimes correctness)
- Pipeline creation before buffers: I was creating bind groups in resize() before the agent buffers existed, which crashed the first frame. The fix was to guard pipeline creation until buffers are ready, and also rebuild after swapping A/B.
- WGSL intrinsic name: I wrote inversesqrt; WGSL wants inverseSqrt. That one line can invalidate the compute pipeline and leave you "rendering" nothing.
7) I kept uniform writes small and simple
All per-frame changes are tiny queue.writeBuffer calls to the two uniform buffers. No mapping, no reallocation, no per-frame pipeline rebuilds.
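Roughly, the per-frame writes look like this. screenBuf and paramsBuf are the two uniform buffers from above; the packed layout is an assumption and has to match the WGSL struct field for field.
// Two small queue.writeBuffer calls per frame; no mapping, no new allocations.
device.queue.writeBuffer(screenBuf, 0, new Float32Array([canvas.width, canvas.height]));

const params = new ArrayBuffer(20);
const dv = new DataView(params);
dv.setUint32(0, state.agentCount, true);   // agentCount
dv.setInt32(4, gridW, true);               // grid dims
dv.setInt32(8, gridH, true);
dv.setFloat32(12, state.cellSize, true);   // cellSize
dv.setUint32(16, state.maxPerCell, true);  // maxPerCell
device.queue.writeBuffer(paramsBuf, 0, params);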
8) I kept the machine honest
- Plugged in and turned off Low Power Mode on macOS.
- Kept the tab visible (background tabs throttle).
- Closed other GPU-heavy windows.
The “fast preset” that hit ~120 FPS (10k robots on my laptop)
What worked best for me (your numbers will differ):
- DPR cap: 1.0
- Render scale: 0.9
- Agents: 10,000
- cellSize: 20
- maxPerCell: 28
- Physics rate: every 2 frames (~30 Hz)
- Workgroup size: 128 (try 256)
- Triangle size: 6
- Vel color: 1.0
If you want more, go f16 later (store pos/vel as vec2<f16>), but I kept the demo clean and portable for now.
Debug crumbs that actually helped
- Empty canvas but FPS showing? Check for shader errors at pipeline creation (DevTools only prints them then). It’s usually a WGSL gotcha (loop syntax, intrinsic names, missing struct).
- Nothing draws, no errors? Your triangles might be too small or off-screen from bad uniforms (screen width/height). Write those every frame; rebuild grid on resize.
- Random validation error about bind groups? 99% of the time it’s a layout mismatch: your JS entries and WGSL bindings must match exactly.
- Stutters when flocks clump? Atomics hot-spots. Raise cellSize a bit, keep maxPerCell modest (and it’s okay to drop extra entries).
Tiny diffs that mattered
Guard pipeline builds until buffers exist:
- allocGrid(); makePipelines();
+ allocGrid();
+ if (agentsA && agentsB) makePipelines();
Build pipelines after creating agents:
resize();
allocAgents(state.agentCount);
+ makePipelines();
writeParams();
Fix the WGSL intrinsic:
- let inv = inversesqrt(d2);
+ let inv = inverseSqrt(d2);
Where I got the answers
- Chrome DevTools errors (pipeline creation time) pointed to the real WGSL lines.
- WGSL spec + a couple of WebGPU blog posts (loop syntax, atomic rules).
- A ChatGPT sanity check when I got stuck on arrayLength() and storage access modes.
- Trial and error on M1 with DPR and workgroup size.
If you want to squeeze even more
- Half precision: enable f16; and put pos/vel in vec2<f16>; convert to f32 for math. Cuts bandwidth (see the sketch after this list).
- SoA buffers: split positions and velocities for more linear reads.
- Better neighbor cache: prefetch 3×3 cell bounds to registers/shared mem (more code, more speed).
- Soft-cap cells: when slot >= maxPerCell, randomly drop entries to avoid worst-case scans.
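For the half-precision idea, a rough WGSL sketch. It assumes the adapter exposes the shader-f16 feature and that it was requested on the device; the struct name and layout are illustrative, not from the demo.
enable f16;

// Storage in half precision: 8 bytes per agent instead of 16.
struct AgentF16 { pos : vec2<f16>, vel : vec2<f16> }

@group(0) @binding(1) var<storage, read> agentsRead : array<AgentF16>;

fn load_pos(i : u32) -> vec2<f32> {
  // f16 is only for storage bandwidth; convert to f32 for the actual math.
  return vec2<f32>(agentsRead[i].pos);
}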
Closing
This toy project is about how far a clean, single-HTML WebGPU baseline can go: clear compute passes, predictable performance, and laptop-friendliness. If you squeeze out more FPS or find a smarter kernel, send it over; I’ll ship it as a preset and credit you.