WebGPU Bench

How Benchmarks Work

  1. build.sh compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
  2. runner.js launches Playwright browsers and navigates to harness.html.
  3. harness.js detects JSPI support and loads the correct WASM variant.
  4. The GGUF model is downloaded from HuggingFace directly in the browser.
  5. Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API with greedy sampling for deterministic output.
  6. Performance metrics are collected via llama_perf_context() and returned to Playwright.
  7. A fresh browser instance is launched for each variant to prevent WASM memory accumulation (OOM fix).
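The variant selection in step 3 can be sketched as a simple feature test. This assumes JSPI is detected via the `WebAssembly.Suspending` constructor; the actual check in harness.js and the variant filenames are assumptions here:

```javascript
// Hypothetical sketch of step 3's variant selection.
// Engines with JS Promise Integration (JSPI) expose WebAssembly.Suspending;
// engines without it (Firefox, Safari) get the Asyncify build instead.
function supportsJSPI() {
  return typeof WebAssembly !== "undefined" &&
         typeof WebAssembly.Suspending === "function";
}

const variant = supportsJSPI() ? "jspi" : "asyncify";
// e.g. load `llama-${variant}.js` (filename is an assumption)
console.log(variant);
```

Probing for the API object is preferred over user-agent sniffing, since JSPI support changes per engine version rather than per browser brand.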

Metrics Glossary

Metric           Description
decode_tok_s     Token generation speed (tokens/sec) — main performance metric
prefill_tok_s    Prompt processing speed (tokens/sec)
t_eval_ms        Total decode time in milliseconds
t_p_eval_ms      Total prefill time in milliseconds
n_eval           Number of tokens generated
n_p_eval         Number of prompt tokens processed
buildType        jspi or asyncify — which WASM variant was used
webgpuAvailable  Whether WebGPU was available in the browser
wallTimeMs       Total wall-clock time for the benchmark run
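The two rate metrics are derived from the raw counts and timings. A quick sanity check with made-up values:

```javascript
// Made-up numbers illustrating how the rates relate to the raw fields
const m = { n_eval: 128, t_eval_ms: 6400, n_p_eval: 32, t_p_eval_ms: 400 };

const decode_tok_s  = m.n_eval   / (m.t_eval_ms   / 1000); // 128 / 6.4 s = 20 tok/s
const prefill_tok_s = m.n_p_eval / (m.t_p_eval_ms / 1000); //  32 / 0.4 s = 80 tok/s
console.log(decode_tok_s, prefill_tok_s);
```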

Error Categories

Category         Pattern                           Typical Cause
OOM              out of memory, memory allocation  Model too large for available WASM memory
WASM Abort       wasm, abort, unreachable          WASM execution error, often from unsupported operations
Timeout          timeout, timed out                Benchmark exceeded time limit (model download or inference)
Download Failed  download, fetch, 404, network     Model file not found or network error
Other            everything else                   Uncategorized errors
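A minimal classifier matching the table could look like the sketch below. The patterns mirror the table; the real matching order and regexes in the harness are not shown here, so treat this as an illustration:

```javascript
// Hypothetical error classifier mirroring the categories above.
// Order matters: OOM is checked before WASM Abort because OOM
// messages often also contain "abort".
const CATEGORIES = [
  ["OOM",             /out of memory|memory allocation/i],
  ["WASM Abort",      /wasm|abort|unreachable/i],
  ["Timeout",         /timeout|timed out/i],
  ["Download Failed", /download|fetch|404|network/i],
];

function categorize(message) {
  for (const [name, pattern] of CATEGORIES) {
    if (pattern.test(message)) return name;
  }
  return "Other";
}
```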

Consistency Measurement

The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.

How it works

For each variant, two runs are performed in the same browser:

  1. CPU baseline (n_gpu_layers=0): greedy-decodes 128 tokens and records the token ID sequence. Cached to results/cpu_baselines.json.
  2. WebGPU run (n_gpu_layers=999): performs a forced-decoding pass — feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.

Why forced decoding

Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
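Given the CPU baseline token sequence and a per-position top-1 query for the GPU run, the agreement rate reduces to a positional comparison. Here `gpuTop1At` is a hypothetical stand-in for the forced-decoding call:

```javascript
// Forced-decoding agreement sketch: at each position i the GPU has
// been fed the CPU's tokens 0..i-1, so its top-1 prediction at i is
// independent of any earlier disagreement.
function agreementRate(cpuTokens, gpuTop1At) {
  let matches = 0;
  for (let i = 0; i < cpuTokens.length; i++) {
    if (gpuTop1At(i) === cpuTokens[i]) matches++;
  }
  return matches / cpuTokens.length;
}
```

Because each position is scored against the same CPU-provided context, a single mismatch costs exactly 1/n of the rate instead of corrupting every later comparison.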

Interpreting results

agreement_rate  Interpretation
1.00            Numerically identical to CPU — no precision issues
0.95–0.99       A few tokens differ due to near-equal logits — expected for lower-precision quants
< 0.90          Systematic precision issues — GPU kernel may need investigation
0.00            First token wrong — quantization kernel likely broken