Running a 205M-parameter forecasting model in the browser

Amazon published Chronos-Bolt in late 2024. It is a T5 encoder-decoder, 205M parameters, trained to emit nine quantiles of a 64-step forecast in a single forward pass. It is also the kind of model that sits behind a Python notebook, a CUDA install, and a Hugging Face token. The question I wanted to answer: can someone who does not write code point a browser at a URL, paste in a CSV, and get a forecast?

The result lives at forecasting. This post is what it took to get there.

Why bother

Forecasting foundation models now hold their own against classical methods on zero-shot benchmarks (see the GIFT-Eval leaderboard for a current snapshot). The path from "model exists" to "a PM can use it on their own data" still runs through a notebook. That gap is a tooling problem, not a model problem. The browser solves three parts of it at once.

No install. A URL is the entry point. No Python, no Docker, no API key.
No upload. The CSV stays on the device. For anything under an NDA, that matters.
No cost curve. The model ships once to the user and runs on their machine. A page view does not translate to a GPU-minute.

The pipeline

Hugging Face checkpoint
  → FP32 ONNX export
  → quantization (INT8, INT4)
  → Web Worker + ORT Web
  → IndexedDB cache
  → visx chart

Each arrow has a story. Notes below.

Exporting to ONNX

torch.onnx.export is the standard tool. On Chronos-Bolt it failed four times before it worked.

torch.nanmean has no ONNX op. Chronos-Bolt's instance normalization layer uses nanmean so that the scaler ignores padding positions. ONNX has no nan-aware reduction. The fix is to monkey-patch InstanceNorm.forward before export and compute the mean via a masked sum:

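# x: context tensor with NaN at the padded positions
mask = ~torch.isnan(x)                              # True where a real value exists
x_safe = torch.where(mask, x, torch.zeros_like(x))  # NaN -> 0 so the sum stays finite
n = mask.to(torch.float32).sum(dim=-1, keepdim=True).clamp(min=1.0)  # count of real values
loc = x_safe.sum(dim=-1, keepdim=True) / n          # masked mean, stands in for torch.nanmean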

Variance needs the same treatment. Forgetting it produces a graph that runs but emits garbage for any input with padding.
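The variance fix follows the same pattern as the mean. The sketch below writes both reductions in plain Python rather than torch, so the arithmetic can be checked by hand. The function name `masked_loc_scale` is made up for illustration, and the population (divide-by-count) variance is an assumption; this is not Chronos-Bolt's own code.

```python
import math


def masked_loc_scale(xs):
    """Mean and standard deviation over the non-NaN entries of xs.

    Mirrors the masked-sum trick: NaN positions contribute 0 to each
    sum and are left out of the count.
    """
    mask = [not math.isnan(x) for x in xs]
    # Same role as clamp(min=1.0): an all-NaN row must not divide by zero.
    n = max(sum(mask), 1)
    loc = sum(x if m else 0.0 for x, m in zip(xs, mask)) / n
    # Squared deviations are masked too; otherwise NaN leaks into the sum.
    var = sum((x - loc) ** 2 if m else 0.0 for x, m in zip(xs, mask)) / n
    return loc, math.sqrt(var)


nan = float("nan")
print(masked_loc_scale([nan, nan, 1.0, 3.0]))
```

The point to carry over to the torch version is that the deviations need the same mask as the values: subtracting `loc` from a NaN position still yields NaN.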

torch.quantile carries a data-dependent guard. The dynamo exporter refuses to trace torch.quantile because its guard depends on tensor values. Replacing it with a sort, gather, and lerp equivalent gives a pure-functional version the tracer accepts.
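A minimal sketch of the sort, gather, and lerp idea, written in plain Python for a single 1-D list. The function name `quantile_sorted_lerp` is hypothetical. It uses linear interpolation between the two nearest order statistics, which is also what torch.quantile does by default; the tensor version would use torch.sort and torch.gather in place of the list operations.

```python
import math


def quantile_sorted_lerp(values, q):
    """Linear-interpolation quantile built from sort, index lookup, and lerp."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(values)                 # sort
    pos = q * (len(ordered) - 1)             # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)       # stay in range when q == 1
    frac = pos - lo
    # gather the two neighbours, then lerp between them
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


print(quantile_sorted_lerp([4.0, 1.0, 3.0, 2.0], 0.5))
```

Nothing in the computation branches on the data values themselves, only on the length and on `q`, which is the property the tracer needs.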

The default exporter path was a dead end. Opset 17 with the legacy TorchScript-based exporter produced a graph that crashed InferenceSession.create. Switching to dynamo=True at opset 20 produced one that runs. Max absolute delta versus PyTorch on a 2048-step sine: 2.86e-6. That numeric check, not onnx.checker.check_model, is the real safety net.

The .onnx.data sibling is part of the model. Any export over 2 GB, or with save_as_external_data=True, writes initializers to a separate file next to the graph. Both files have to ship. Copying only the .onnx leaves a runtime that errors on session creation or, worse, loads a shell of a model with no learned weights.
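One way to make the sibling hard to forget is to copy both files through a single helper. This is a sketch under one assumption: the external data file is named after the graph with `.data` appended, as in this post. ONNX records the actual file name inside the graph, so other exports may use a different one. The helper name `copy_onnx_with_external_data` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_onnx_with_external_data(graph_path, dest_dir):
    """Copy model.onnx and, if present, its model.onnx.data sibling.

    Returns the list of files written, so a deploy script can log or
    assert on it.
    """
    graph = Path(graph_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    copied = [Path(shutil.copy2(graph, dest / graph.name))]

    # Assumed naming convention: "<graph file name>.data" next to the graph.
    sidecar = graph.with_name(graph.name + ".data")
    if sidecar.exists():
        copied.append(Path(shutil.copy2(sidecar, dest / sidecar.name)))
    return copied
```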

Quantization

FP32 Chronos-Bolt-base is 786 MB. That is a non-starter for a first-visit download. Two passes get it down to a serveable size.

Variant	Storage	File size	Quality vs FP32
FP32	Unpacked float32	786 MB	baseline
INT8 per-channel (in-house)	Per-channel symmetric INT8, `DequantizeLinear` before each consumer	200 MB	visually overlapping FP32
INT4 blockwise (`MatMulNBitsQuantizer`)	`MatMulNBits` op, 4-bit packed weights, 128-element blockwise scales	107 MB	visually overlapping FP32

The INT8 pass is in-house: walk every FP32 weight initializer, compute a per-output-channel scale, cast to INT8, insert DequantizeLinear before every consumer. Activations stay in FP32 so calibration is not needed. MatMul consumers still see FP32 weights at runtime, so the win is download size, not compute.
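The numeric core of that pass can be sketched in plain Python on a small 2-D weight. The names, the row-per-output-channel layout, and the symmetric ±127 range are assumptions for illustration; the real pass works on ONNX initializers and also rewires the graph, which is omitted here.

```python
def quantize_per_channel_int8(weight):
    """Symmetric per-output-channel INT8 quantization.

    weight: list of rows, one row per output channel (assumed layout).
    Returns (int8_rows, scales) with one scale per row.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # An all-zero channel gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """What DequantizeLinear computes with a zero point of 0: q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


w = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
q, s = quantize_per_channel_int8(w)
print(q, s)
print(dequantize(q, s))
```

Each reconstructed weight lands within half a scale step of the original, which is why a per-channel scale helps: a channel with small weights gets a proportionally small step.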

The INT4 pass is ONNX Runtime's own tool, from the Python quantization package rather than ORT Web. MatMulNBitsQuantizer replaces MatMul nodes with a fused MatMulNBits op that packs 4-bit weights and 128-element blockwise scales into the op itself and dequantizes on the fly:

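import onnx
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# File names below are placeholders.
model = onnx.load("model_fp32.onnx")

quantizer = MatMulNBitsQuantizer(
    model,
    bits=4,             # 4-bit packed weights
    block_size=128,     # one scale per 128 weights
    is_symmetric=True,  # no per-block zero points
    accuracy_level=4,   # allow INT8 compute inside the MatMulNBits kernel
)
quantizer.process()     # rewrites MatMul nodes to MatMulNBits in place

# process() writes nothing to disk; save the rewritten graph explicitly.
quantizer.model.save_model_to_file("model_int4.onnx", use_external_data_format=False)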

INT4 is the default variant in the demo. Both variants overlap the FP32 reference on the three example datasets shipped with the page (AIRLINE, daily-births, sunspots). I did not run a formal accuracy sweep against held-out benchmarks.

INT2 as a stress test

I ran INT2 as a pipeline stress test rather than a production candidate. The asymmetric [-2, 1] range produced forecasts above 500,000 on a sine input in [40, 60], which is the failure mode you would predict if you thought about it first. A ternary {-1, 0, 1} representation fixed most of the channels, but the p90 channel still produced occasional numerical explosions. The exercise was useful for hardening the chart against pathological outputs; it is not a serveable variant.

Serving the graph

A 107 MB download is the easy part. Three things needed attention on the serving side.

Web Worker. InferenceSession.create on a 200 MB graph (the size of the INT8 variant) blocks its thread for a couple of seconds while it parses the protobuf and allocates WASM heap. On the main thread this freezes the tab. Moving the session into a worker keeps the UI responsive during load and during inference.

IndexedDB cache. On first visit the worker fetches the .onnx, streams it into an ArrayBuffer, and hands it to the session. It also writes the same buffer to IndexedDB, keyed by the URL. Subsequent visits read from IndexedDB with no network round trip. The write is fire-and-forget: a 100 MB IndexedDB write under quota pressure can take seconds, and there is no reason to block session creation on it.

NaN front-padding for short series. Chronos-Bolt's encode builds its attention mask via ~torch.isnan(context). Front-padding the input with NaN leaves the mask intact, so padded positions are ignored at attention time. My first attempt padded with the series mean, which is the obvious choice and the wrong one: the model sees a long flat run followed by the series and emits a forecast with no slope. The NaN route costs nothing and keeps the signal intact.
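The padding step itself is small. Below is a plain-Python sketch; the function names are hypothetical, and truncating a too-long series to its most recent points is an assumption about what a caller would want. `attention_mask` is the list equivalent of ~torch.isnan(context).

```python
import math


def front_pad_with_nan(series, context_length):
    """Left-pad series with NaN up to context_length.

    A series longer than context_length is cut to its most recent
    points (an assumption, not something the model requires).
    """
    if context_length < 1:
        raise ValueError("context_length must be positive")
    recent = list(series)[-context_length:]
    return [math.nan] * (context_length - len(recent)) + recent


def attention_mask(context):
    """True for real observations, False for NaN padding."""
    return [not math.isnan(x) for x in context]


padded = front_pad_with_nan([1.0, 2.0, 3.0], 5)
print(padded)
print(attention_mask(padded))
```

Because the padding is NaN rather than a number, it drops out of both the attention mask and any NaN-aware scaling, so the real observations are the only values the model scales and attends to.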

Stretching past 64 steps

Chronos-Bolt-base has a native prediction length of 64. For longer horizons the demo unrolls autoregressively: take the predicted median, append it to the context, run again, repeat up to 512 steps.

This is the honest-but-diminishing path. Every feedback step introduces a small error that compounds. The quantile ribbon widens in step, which gives a visual signal for when to stop trusting the forecast. The UI caps the horizon at 512 because the ribbon has saturated by then on most inputs.
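The unrolling loop can be sketched as follows. `predict_median` is a hypothetical stand-in for one model call: it takes the context as a list and returns the median forecast for the next `step` points. The sketch tracks only the median; how the other quantiles are stitched together across passes is not covered here.

```python
def unroll_forecast(context, horizon, predict_median, step=64, max_horizon=512):
    """Extend a forecast past the native length by feeding the median back in."""
    if horizon < 1:
        raise ValueError("horizon must be positive")
    horizon = min(horizon, max_horizon)   # same cap as the UI
    history = list(context)
    forecast = []
    while len(forecast) < horizon:
        chunk = list(predict_median(history))[:step]
        if not chunk:
            raise ValueError("predict_median returned no values")
        forecast.extend(chunk)
        history.extend(chunk)             # predictions become context for the next pass
    return forecast[:horizon]


# Toy stand-in for the model: continue a ramp from the last observed value.
def toy_model(history):
    return [history[-1] + i + 1 for i in range(64)]


print(unroll_forecast([0.0, 1.0], 100, toy_model)[:5])
```

With a real model, each pass conditions on its own earlier output, which is where the compounding error comes from.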

What broke along the way

Try it

The demo is at forecasting. Upload a CSV, pick a variant, run a forecast. Backtest mode hides the last twenty percent of the series and compares the forecast against it. A download button emits the quantiles as CSV.

All inference runs in your browser. Nothing touches a server.