Running a 205M-parameter forecasting model in the browser
How we got Chronos-Bolt-base from a Hugging Face checkpoint to a 107 MB INT4 ONNX graph running client-side via ORT Web, with notes on what broke along the way.
Amazon published Chronos-Bolt in late 2024. It is a T5 encoder-decoder, 205M parameters, trained to emit nine quantiles of a 64-step forecast in a single forward pass. It is also the kind of model that sits behind a Python notebook, a CUDA install, and a Hugging Face token. The question I wanted to answer: can someone who does not write code point a browser at a URL, paste in a CSV, and get a forecast?
The result lives at forecasting. This post is what it took to get there.
Why bother
Forecasting foundation models now hold their own against classical methods on zero-shot benchmarks (see the GIFT-Eval leaderboard for a current snapshot). The path from "model exists" to "a PM can use it on their own data" still runs through a notebook. That gap is a tooling problem, not a model problem. The browser solves three parts of it at once.
- No install. A URL is the entry point. No Python, no Docker, no API key.
- No upload. The CSV stays on the device. For anything under an NDA, that matters.
- No cost curve. The model ships once to the user and runs on their machine. A page view does not translate to a GPU-minute.
The pipeline
Hugging Face checkpoint
→ FP32 ONNX export
→ quantization (INT8, INT4)
→ Web Worker + ORT Web
→ IndexedDB cache
→ visx chart
Each arrow has a story. Notes below.
Exporting to ONNX
torch.onnx.export is the standard tool. On Chronos-Bolt it failed four times
before it worked.
torch.nanmean has no ONNX op. Chronos-Bolt's instance normalization layer
uses nanmean so that the scaler ignores padding positions. ONNX has no
nan-aware reduction. The fix is to monkey-patch InstanceNorm.forward before
export and compute the mean via a masked sum:
mask = ~torch.isnan(x)
x_safe = torch.where(mask, x, torch.zeros_like(x))
n = mask.to(torch.float32).sum(dim=-1, keepdim=True).clamp(min=1.0)
loc = x_safe.sum(dim=-1, keepdim=True) / n
Variance needs the same treatment. Forgetting it produces a graph that runs but emits garbage for any input with padding.
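The masked-sum pattern carries over to variance directly. Below is a minimal sketch over plain Python lists so it runs without torch; the name `masked_mean_var` is hypothetical, and dividing by the count of valid entries (population variance) is an assumption about what the scaler wants. Each step corresponds to a `torch.where`, `sum`, or `clamp` call in the tensor version.

```python
import math

def masked_mean_var(x):
    """Mean and variance of x over non-NaN entries, using only mask, select, and sum."""
    mask = [not math.isnan(v) for v in x]
    x_safe = [v if m else 0.0 for v, m in zip(x, mask)]
    n = max(float(sum(mask)), 1.0)  # same role as clamp(min=1.0): avoids 0/0
    loc = sum(x_safe) / n
    # Mask the squared deviations as well; otherwise every padded slot adds loc**2.
    sq = [(v - loc) ** 2 if m else 0.0 for v, m in zip(x_safe, mask)]
    var = sum(sq) / n
    return loc, var

nan = float("nan")
loc, var = masked_mean_var([nan, nan, 40.0, 50.0, 60.0])
```

The second mask is the step that is easy to forget: after `x_safe` replaces NaN with zero, the padded positions look like real zeros unless the deviations are masked again.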
torch.quantile carries a data-dependent guard. The dynamo exporter
refuses to trace torch.quantile because its guard depends on tensor values.
Replacing it with a sort, gather, and lerp equivalent gives a pure-functional
version the tracer accepts.
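As a rough illustration of the sort, gather, and lerp replacement, here is a scalar sketch in plain Python. It assumes linear interpolation between neighbours (the default for `torch.quantile`) and no NaN in the input; `quantile_lerp` is a hypothetical name, not the function used in the export script.

```python
import math

def quantile_lerp(values, q):
    """Linear-interpolation quantile for q in [0, 1], built from sort, index, and lerp."""
    s = sorted(values)                     # sort
    pos = q * (len(s) - 1)                 # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)           # stay in range when q == 1
    frac = pos - lo
    return s[lo] + (s[hi] - s[lo]) * frac  # gather both neighbours, then lerp

median = quantile_lerp([4.0, 1.0, 3.0, 2.0], 0.5)
```

Nothing here branches on the data itself: the indices depend only on `q` and the length, which is the property a tracer needs.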
The default exporter path was a dead end. Opset 17 with the legacy
TorchScript-based exporter produced a graph that crashed
InferenceSession.create. Switching to dynamo=True at opset 20 produced
one that runs. Max absolute delta versus PyTorch on a 2048-step sine:
2.86e-6. That numeric check, not onnx.checker.check_model, is the real
safety net.
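The numeric check itself is small. A sketch of the comparison, assuming both outputs have already been flattened to lists of floats; the two lists below are placeholders for illustration, not real model outputs, and the tolerance is a hypothetical choice.

```python
def max_abs_delta(reference, candidate):
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    return max(abs(r - c) for r, c in zip(reference, candidate))

# Placeholder values standing in for the PyTorch and ONNX outputs.
reference = [0.0, 1.0, 2.0]
candidate = [0.0, 1.0 + 2e-6, 2.0 - 1e-6]

delta = max_abs_delta(reference, candidate)
within_tolerance = delta < 1e-4  # hypothetical tolerance
```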
The .onnx.data sibling is part of the model. Any export over 2 GB, or
with save_as_external_data=True, writes initializers to a separate file
next to the graph. Both files have to ship. Copying only the .onnx leaves
a runtime that errors on session creation or, worse, loads a shell of a
model with no learned weights.
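A deploy script can guard against the missing sibling. The sketch below assumes the external data sits next to the graph under the graph's file name plus `.data`; the exporter records the real location inside the graph, so treat the naming rule and the helper name `files_to_ship` as assumptions.

```python
import tempfile
from pathlib import Path

def files_to_ship(onnx_path):
    """Return the graph file plus its external-data sibling, if one exists."""
    graph = Path(onnx_path)
    if not graph.is_file():
        raise FileNotFoundError(graph)
    sidecar = graph.with_name(graph.name + ".data")  # assumed naming: model.onnx.data
    return [graph, sidecar] if sidecar.is_file() else [graph]

# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    graph = Path(tmp) / "model.onnx"
    graph.write_bytes(b"")
    (Path(tmp) / "model.onnx.data").write_bytes(b"")
    shipped = [p.name for p in files_to_ship(graph)]
```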
Quantization
FP32 Chronos-Bolt-base is 786 MB. That is a non-starter for a first-visit download. Two passes get it down to a serveable size.
| Variant | Storage | File size | Quality vs FP32 |
|---|---|---|---|
| FP32 | Unpacked float32 | 786 MB | baseline |
| INT8 per-channel (in-house) | Per-channel symmetric INT8, DequantizeLinear before each consumer | 200 MB | visually overlapping FP32 |
| INT4 blockwise (MatMulNBitsQuantizer) | MatMulNBits op, 4-bit packed weights, 128-element blockwise scales | 107 MB | visually overlapping FP32 |
The INT8 pass is in-house: walk every FP32 weight initializer, compute a
per-output-channel scale, cast to INT8, insert DequantizeLinear before every
consumer. Activations stay in FP32 so calibration is not needed. MatMul
consumers still see FP32 weights at runtime, so the win is download size, not
compute.
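To make the INT8 pass concrete, here is a sketch of the per-channel arithmetic on a small nested list. It assumes each row is one output channel and clips to [-127, 127]; both are illustrative choices, and the function names are hypothetical rather than taken from the in-house script.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 per row. Returns (integer rows, one scale per row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # all-zero row: any scale works
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """What DequantizeLinear computes with a zero point of 0: q * scale per channel."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

weight = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
q_rows, scales = quantize_per_channel(weight)
restored = dequantize(q_rows, scales)
```

Because each row gets its own scale, the small-magnitude second row keeps the same relative precision as the first. The round-trip error per weight is at most half that row's scale.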
The INT4 pass uses ONNX Runtime's own tool. MatMulNBitsQuantizer replaces MatMul
nodes with a fused MatMulNBits op that packs 4-bit weights and 128-element
blockwise scales into the op itself and dequantizes on the fly:
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer
MatMulNBitsQuantizer(
    model,
    bits=4,
    block_size=128,
    is_symmetric=True,
    accuracy_level=4,
).process()
INT4 is the default variant in the demo. Both variants overlap the FP32 reference on the three example datasets shipped with the page (AIRLINE, daily-births, sunspots). I did not run a formal accuracy sweep against held-out benchmarks.
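On size, the blockwise layout can be sanity-checked with a little arithmetic. The sketch below assumes one 4-byte scale per 128-weight block, no stored zero points (symmetric), and every weight quantized; the last assumption is optimistic, since it ignores any tensors the quantizer leaves in FP32, so the estimate should land a little under the real file.

```python
def bits_per_weight(bits=4, block_size=128, scale_bytes=4):
    """Storage cost of one weight: the packed value plus its share of the block scale."""
    return bits + 8 * scale_bytes / block_size

fp32_bits = 32
int4_bits = bits_per_weight()   # 4 + 32/128 = 4.25 bits
ratio = fp32_bits / int4_bits   # roughly 7.5x smaller than FP32
estimate_mb = 786 / ratio       # starting from the 786 MB FP32 file
```

That comes out near 104 MB, in the same neighbourhood as the 107 MB INT4 file in the table.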
INT2 as a stress test
I ran INT2 as a pipeline stress test rather than a production candidate. The
asymmetric [-2, 1] range produced forecasts above 500,000 on a sine input
in [40, 60], which is the failure mode you would predict if you thought
about it first. A ternary {-1, 0, 1} representation fixed most of the
channels, but the p90 channel still produced occasional numerical
explosions. The exercise was useful for hardening the chart against
pathological outputs; it is not a serveable variant.
Serving the graph
A 107 MB download is the easy part. Three things needed attention on the serving side.
Web Worker. InferenceSession.create on a 200 MB graph blocks the thread
for a couple of seconds while it parses the protobuf and allocates WASM heap.
On the main thread this freezes the tab. Moving the session into a worker
keeps the UI responsive during load and during inference.
IndexedDB cache. On first visit the worker fetches the .onnx, streams it
into an ArrayBuffer, and hands it to the session. It also writes the buffer
to IndexedDB, keyed by the URL. Subsequent visits pull from IndexedDB with
no network round trip. The write is fire-and-forget because a 100 MB IDB
write under quota pressure can take seconds and there is no reason to block
session creation on it.
NaN front-padding for short series. Chronos-Bolt's encode builds its
attention mask via ~torch.isnan(context). Front-padding the input with NaN
leaves the mask intact, so padded positions are ignored at attention time.
My first attempt padded with the series mean, which is the obvious choice
and the wrong one: the model sees a long flat run followed by the series
and emits a forecast with no slope. The NaN route costs nothing and keeps
the signal intact.
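The padding step is short enough to sketch. This version assumes a fixed positive `context_length` (a hypothetical parameter name) and keeps only the most recent points when the series is too long; the truncation rule is an assumption, not something the demo is confirmed to do.

```python
import math

def front_pad_nan(series, context_length):
    """Left-pad with NaN up to context_length; keep the newest points if too long."""
    series = list(series)[-context_length:]  # assumes context_length > 0
    return [math.nan] * (context_length - len(series)) + series

def attention_mask(context):
    """Plain-Python equivalent of ~isnan(context): True where a real value is present."""
    return [not math.isnan(v) for v in context]

context = front_pad_nan([41.0, 47.0, 52.0], context_length=8)
mask = attention_mask(context)
```

The mask comes out as five `False` entries followed by three `True`, so only the real observations take part in attention.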
Stretching past 64 steps
Chronos-Bolt-base has a native prediction length of 64. For longer horizons the demo unrolls autoregressively: take the predicted median, append it to the context, run again, repeat up to 512 steps.
This is the honest-but-diminishing path. Every feedback step introduces a small error that compounds. The quantile ribbon widens in step, which gives a visual signal for when to stop trusting the forecast. The UI caps the horizon at 512 because the ribbon has saturated by then on most inputs.
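The unrolling loop can be sketched as follows. `predict_median` is a hypothetical stand-in for one model call that returns the median forecast for the next 64 steps; the sketch tracks only the median and leaves out the quantile ribbon.

```python
def unroll_forecast(context, predict_median, horizon, step=64, max_horizon=512):
    """Extend a fixed-length forecaster by feeding its median back in as context."""
    horizon = min(horizon, max_horizon)      # the UI cap described above
    context = list(context)
    forecast = []
    while len(forecast) < horizon:
        median = list(predict_median(context))[:step]
        if not median:
            raise ValueError("predictor returned no steps")
        forecast.extend(median)
        context.extend(median)               # predictions become inputs: errors compound
    return forecast[:horizon]

def repeat_last(context):
    """Dummy predictor for the demo: repeats the last observed value 64 times."""
    return [context[-1]] * 64

forecast = unroll_forecast([1.0, 2.0, 3.0], repeat_last, horizon=150)
```

A 150-step horizon takes three model calls here, and the final chunk is trimmed to fit.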
What broke along the way
Try it
The demo is at forecasting. Upload a CSV, pick a variant, run a forecast. Backtest mode hides the last twenty percent of the series and compares the forecast against it. A download button emits the quantiles as CSV.
All inference runs in your browser. Nothing touches a server.