WebGPU Bench

How Benchmarks Work

  1. build.sh compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
  2. runner.js launches Playwright browsers and navigates to harness.html.
  3. harness.js detects JSPI support and loads the correct WASM variant.
  4. The GGUF model is downloaded from HuggingFace directly in the browser.
  5. Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API with greedy sampling for deterministic output.
  6. Performance metrics are collected via llama_perf_context() and returned to Playwright.
  7. A fresh browser instance is launched for each variant to prevent WASM memory accumulation (OOM fix).
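The variant selection in step 3 can be sketched as a simple feature test. This assumes JSPI is detected via the `WebAssembly.Suspending` constructor; the actual check in harness.js and the variant filenames are assumptions here:

```javascript
// Hypothetical sketch of step 3's variant selection.
// Engines with JS Promise Integration (JSPI) expose WebAssembly.Suspending;
// engines without it (Firefox, Safari) get the Asyncify build instead.
function supportsJSPI() {
  return typeof WebAssembly !== "undefined" &&
         typeof WebAssembly.Suspending === "function";
}

const variant = supportsJSPI() ? "jspi" : "asyncify";
// e.g. load `llama-${variant}.js` (filename is an assumption)
console.log(variant);
```

Probing for the API object is preferred over user-agent sniffing, since JSPI support changes per engine version rather than per browser brand.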

Metrics Glossary

Metric           Description
decode_tok_s     Token generation speed (tokens/sec) — main performance metric
prefill_tok_s    Prompt processing speed (tokens/sec)
t_eval_ms        Total decode time in milliseconds
t_p_eval_ms      Total prefill time in milliseconds
n_eval           Number of tokens generated
n_p_eval         Number of prompt tokens processed
buildType        jspi or asyncify — which WASM variant was used
webgpuAvailable  Whether WebGPU was available in the browser
wallTimeMs       Total wall-clock time for the benchmark run
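The two rate metrics are derived from the raw counts and timings. A quick sanity check with made-up values:

```javascript
// Made-up numbers illustrating how the rates relate to the raw fields
const m = { n_eval: 128, t_eval_ms: 6400, n_p_eval: 32, t_p_eval_ms: 400 };

const decode_tok_s  = m.n_eval   / (m.t_eval_ms   / 1000); // 128 / 6.4 s = 20 tok/s
const prefill_tok_s = m.n_p_eval / (m.t_p_eval_ms / 1000); //  32 / 0.4 s = 80 tok/s
console.log(decode_tok_s, prefill_tok_s);
```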

Error Categories

Category         Pattern                           Typical Cause
OOM              out of memory, memory allocation  Model too large for available WASM memory
WASM Abort       wasm, abort, unreachable          WASM execution error, often from unsupported operations
Timeout          timeout, timed out                Benchmark exceeded time limit (model download or inference)
Download Failed  download, fetch, 404, network     Model file not found or network error
Other            everything else                   Uncategorized errors
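A minimal classifier matching the table could look like the sketch below. The patterns mirror the table; the real matching order and regexes in the harness are not shown here, so treat this as an illustration:

```javascript
// Hypothetical error classifier mirroring the categories above.
// Order matters: OOM is checked before WASM Abort because OOM
// messages often also contain "abort".
const CATEGORIES = [
  ["OOM",             /out of memory|memory allocation/i],
  ["WASM Abort",      /wasm|abort|unreachable/i],
  ["Timeout",         /timeout|timed out/i],
  ["Download Failed", /download|fetch|404|network/i],
];

function categorize(message) {
  for (const [name, pattern] of CATEGORIES) {
    if (pattern.test(message)) return name;
  }
  return "Other";
}
```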

Consistency Measurement

The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.

How it works

For each variant, two runs are performed in the same browser:

  1. CPU baseline (n_gpu_layers=0): greedy-decodes 128 tokens and records the token ID sequence. Cached to results/cpu_baselines.json.
  2. WebGPU run (n_gpu_layers=999): performs a forced-decoding pass — feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.

Why forced decoding

Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
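Given the CPU baseline token sequence and a per-position top-1 query for the GPU run, the agreement rate reduces to a positional comparison. Here `gpuTop1At` is a hypothetical stand-in for the forced-decoding call:

```javascript
// Forced-decoding agreement sketch: at each position i the GPU has
// been fed the CPU's tokens 0..i-1, so its top-1 prediction at i is
// independent of any earlier disagreement.
function agreementRate(cpuTokens, gpuTop1At) {
  let matches = 0;
  for (let i = 0; i < cpuTokens.length; i++) {
    if (gpuTop1At(i) === cpuTokens[i]) matches++;
  }
  return matches / cpuTokens.length;
}
```

Because each position is scored against the same CPU-provided context, a single mismatch costs exactly 1/n of the rate instead of corrupting every later comparison.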

Interpreting results

agreement_rate  Interpretation
1.00            Numerically identical to CPU — no precision issues
0.95–0.99       A few tokens differ due to near-equal logits — expected for lower-precision quants
< 0.90          Systematic precision issues — GPU kernel may need investigation
0.00            First token wrong — quantization kernel likely broken