- `build.sh` compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
- `runner.js` launches Playwright browsers and navigates to `harness.html`.
- `harness.js` detects JSPI support and loads the matching WASM variant.
- Benchmark metrics are read from `llama_perf_context()` and returned to Playwright.

| Metric | Description |
|---|---|
| `decode_tok_s` | Token generation speed (tokens/sec); the main performance metric |
| `prefill_tok_s` | Prompt processing speed (tokens/sec) |
| `t_eval_ms` | Total decode time in milliseconds |
| `t_p_eval_ms` | Total prefill time in milliseconds |
| `n_eval` | Number of tokens generated |
| `n_p_eval` | Number of prompt tokens processed |
| `buildType` | `jspi` or `asyncify`; which WASM variant was used |
| `webgpuAvailable` | Whether WebGPU was available in the browser |
| `wallTimeMs` | Total wall-clock time for the benchmark run |
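The throughput metrics follow from the raw counters: `decode_tok_s` is `n_eval` divided by the decode time in seconds, and `prefill_tok_s` is the analogous ratio for prefill. A minimal sketch (the helper name is illustrative, not part of the harness):

```javascript
// Derive throughput from the raw llama_perf_context() counters.
// Field names mirror the metrics table above.
function deriveThroughput({ n_eval, t_eval_ms, n_p_eval, t_p_eval_ms }) {
  return {
    decode_tok_s: (n_eval * 1000) / t_eval_ms,      // generated tokens per second
    prefill_tok_s: (n_p_eval * 1000) / t_p_eval_ms, // prompt tokens per second
  };
}

// 128 tokens in 6.4 s and 32 prompt tokens in 0.4 s:
const m = deriveThroughput({ n_eval: 128, t_eval_ms: 6400, n_p_eval: 32, t_p_eval_ms: 400 });
console.log(m.decode_tok_s, m.prefill_tok_s); // prints 20 80
```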
| Category | Pattern | Typical Cause |
|---|---|---|
| OOM | `out of memory`, `memory allocation` | Model too large for available WASM memory |
| WASM Abort | `wasm`, `abort`, `unreachable` | WASM execution error, often from unsupported operations |
| Timeout | `timeout`, `timed out` | Benchmark exceeded the time limit (model download or inference) |
| Download Failed | `download`, `fetch`, `404`, `network` | Model file not found or network error |
| Other | everything else | Uncategorized errors |
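The categorization above can be sketched as a first-match substring check over the lowercased error message. This is an illustrative helper, not the runner's actual code; the pattern lists come straight from the table:

```javascript
// Map an error message to one of the categories in the table above.
// Order matters: OOM is checked before the broad "wasm" patterns, so
// "wasm out of memory" classifies as OOM rather than WASM Abort.
function classifyError(message) {
  const msg = String(message).toLowerCase();
  const categories = [
    ['OOM',             ['out of memory', 'memory allocation']],
    ['WASM Abort',      ['wasm', 'abort', 'unreachable']],
    ['Timeout',         ['timeout', 'timed out']],
    ['Download Failed', ['download', 'fetch', '404', 'network']],
  ];
  for (const [category, patterns] of categories) {
    if (patterns.some((p) => msg.includes(p))) return category;
  }
  return 'Other'; // everything else stays uncategorized
}

console.log(classifyError('RangeError: out of memory')); // prints OOM
```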
The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.
For each variant, two runs are performed in the same browser:
- **CPU baseline** (`n_gpu_layers=0`): greedy-decodes 128 tokens and records the token ID sequence. Cached to `results/cpu_baselines.json`.
- **WebGPU run** (`n_gpu_layers=999`): performs a forced-decoding pass: feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.

Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV-cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
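The per-position comparison reduces to counting positions where the WebGPU backend's top-1 prediction matches the CPU baseline token. A hypothetical helper (assumes both arrays have equal length, one entry per forced position):

```javascript
// Fraction of positions where the GPU's top-1 token matches the CPU
// baseline token. Each position is independent thanks to forced decoding.
function agreementRate(cpuTokens, gpuTopTokens) {
  let matches = 0;
  for (let i = 0; i < cpuTokens.length; i++) {
    if (cpuTokens[i] === gpuTopTokens[i]) matches++;
  }
  return matches / cpuTokens.length;
}

// 3 of 4 positions agree:
console.log(agreementRate([5, 9, 12, 7], [5, 9, 3, 7])); // prints 0.75
```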
| `agreement_rate` | Interpretation |
|---|---|
| 1.00 | Numerically identical to CPU; no precision issues |
| 0.95–0.99 | A few tokens differ due to near-equal logits; expected for lower-precision quants |
| < 0.90 | Systematic precision issues; the GPU kernel may need investigation |
| 0.00 | First token wrong; the quantization kernel is likely broken |
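The interpretation buckets can be expressed as a small classifier. Note the table leaves the 0.90–0.95 range unspecified; the `'borderline'` label for it is an assumption of this sketch, and the function name is illustrative:

```javascript
// Bucket an agreement_rate into the interpretations from the table above.
function interpretAgreement(rate) {
  if (rate === 1.0) return 'numerically identical';       // no precision issues
  if (rate === 0.0) return 'likely broken kernel';        // first token already wrong
  if (rate >= 0.95) return 'expected quantization noise'; // near-equal logits flip a few tokens
  if (rate < 0.90) return 'systematic precision issues';  // kernel needs investigation
  return 'borderline'; // 0.90–0.95: unspecified in the table; inspect per-token diffs
}

console.log(interpretAgreement(0.97)); // prints expected quantization noise
```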