Engine info
Full comparison — every direction × every mode
Runs all four colour-space directions (RGB→RGB, RGB→CMYK,
CMYK→RGB, CMYK→CMYK) against eight jsColorEngine modes
(no-LUT, float, int,
int16, int16-wasm-scalar,
int16-wasm-simd, int-wasm-scalar,
int-wasm-simd) plus lcms-wasm in three flag
configurations × two I/O widths (default,
HIGHRESPRECALC, NOOPTIMIZE at both 8-bit
and 16-bit). The Type column tags the hot path
as f64 (jsce no-LUT, or float's f64 CLUT), u8
(8-bit I/O with a u16 integer LUT), or u16
(lcms-wasm NOOPT — the only u16 pipeline in the
wasm build — and all 16-bit I/O rows).
Each cell reports LUT build time, cold first
transform, and the hot-run median ms/iter.
| Direction | Mode | Type | LUT | LUT build | Cold 1st | Hot ms | MPx/s | MB/s | relative speed | vs `int` |
|---|---|---|---|---|---|---|---|---|---|---|
Accuracy sweep — Lab → device → Lab round-trip
jsColorEngine only. Sweeps the full Lab colour space (L = 0…100, a = −120…120, b = −120…120), converts every Lab point to a target profile's native device space as a Float object (no 8-bit/16-bit quantisation), then converts that float device value back to Lab and measures the residual ΔE76. This is the most accurate path we ship — full f64 pipeline, no LUT, no dataFormat rounding.
We run two round-trips per click to exercise both math kernels jsColorEngine ships: RGB (matrix + per-channel curves — pure linear algebra), and CMYK (4-input tetrahedral interpolation against the profile's BToA/AToB pipeline). In this path the per-object overhead (object allocation, whitePoint carry-over, Lab staging) tends to dominate over the math, so both kernels land in the same 1–4 M round-trip/s ballpark on modern hardware — the raw math-kernel gap shows up much more starkly in the Full comparison tab's LUT-accelerated 8-bit image path, where per-pixel work drops low enough for the actual kernel cost to matter.
lcms-wasm is not included on this tab: its high-level JS
binding exposes 8-bit and 16-bit buffer APIs but not a per-object
float path, so the comparison wouldn't be like-for-like. If you want a
pure-float lcms round-trip you need to drop down to
cmsDoTransform with TYPE_Lab_DBL /
TYPE_CMYK_DBL buffers — not currently wired up.
RGB round-trip · matrix kernel —
| Metric | Value | Notes |
|---|---|---|
CMYK round-trip · 4D tetrahedral kernel —
| Metric | Value | Notes |
|---|---|---|
What the ΔE numbers mean. When the starting Lab point is inside the target profile's gamut, the round-trip dE is the engine's residual numerical error — typically < 0.2 dE for jsColorEngine. When the starting Lab is outside gamut, the dE mostly reflects how far the gamut-mapping had to travel to land on the boundary; dE values of 20–80+ are normal for highly out-of-gamut Lab (e.g. pure a = ±120) and are a property of the profile, not the engine. Tick "in-gamut only" to filter those and see the pure math-fidelity number.
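For reference, ΔE76 is just the Euclidean distance between two Lab triples — a minimal sketch (the function name is ours for illustration, not jsColorEngine's API):

```javascript
// ΔE76: straight-line distance in Lab space between the original
// Lab point and the round-tripped result.
function deltaE76(lab1, lab2) {
  const dL = lab1[0] - lab2[0];
  const da = lab1[1] - lab2[1];
  const db = lab1[2] - lab2[2];
  return Math.sqrt(dL * dL + da * da + db * db);
}
```

Values below ~0.2 for in-gamut points are numerical residue; tens of ΔE for out-of-gamut points are the gamut-mapping distance, not engine error.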
What the two kernels actually do. A Lab→RGB
forward() is a matrix profile: 3 curve evaluations +
a 3×3 matrix multiply + 3 inverse curves. That's < 30 scalar
ops of pure linear algebra. A Lab→CMYK forward() is a
4-input tetrahedral interpolation against the profile's BToA LUT
(17⁴ or 33⁴ grid): find the enclosing
tetrahedron (sort 4 weights), fetch 5 grid corners with 4 output
channels each (20 memory loads), barycentric blend. ∼80–100
ops, memory-bound.
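The sort-and-walk procedure just described can be sketched in plain JavaScript. This is a simplified illustration, not jsColorEngine's actual kernel — `tetra4`, the flat-array grid layout, and the parameter names are all ours:

```javascript
// 4-input simplex ("tetrahedral") interpolation on an n^4 grid.
// `grid` is a flat Float64Array of n*n*n*n*outCh values; `p` is [c,m,y,k] in [0,1].
function tetra4(grid, n, outCh, p) {
  const idx = new Array(4), frac = new Array(4);
  for (let i = 0; i < 4; i++) {
    const v = Math.min(p[i], 1) * (n - 1);
    idx[i] = Math.min(Math.floor(v), n - 2); // base grid cell
    frac[i] = v - idx[i];                    // fractional position inside it
  }
  const s = [n * n * n * outCh, n * n * outCh, n * outCh, outCh]; // axis strides
  // Sort axes by descending fraction: this picks the enclosing simplex
  const order = [0, 1, 2, 3].sort((a, b) => frac[b] - frac[a]);
  // 5 barycentric weights: 1-f0, f0-f1, f1-f2, f2-f3, f3 (sorted fractions)
  const w = [1 - frac[order[0]]];
  for (let i = 0; i < 3; i++) w.push(frac[order[i]] - frac[order[i + 1]]);
  w.push(frac[order[3]]);
  let base = idx[0] * s[0] + idx[1] * s[1] + idx[2] * s[2] + idx[3] * s[3];
  const out = new Float64Array(outCh);
  for (let corner = 0; corner < 5; corner++) {
    for (let ch = 0; ch < outCh; ch++) out[ch] += w[corner] * grid[base + ch];
    if (corner < 4) base += s[order[corner]]; // step one unit along the next sorted axis
  }
  return out;
}
```

A useful property for sanity tests: simplex interpolation reproduces any affine function of the inputs exactly, so a grid baked from a linear formula round-trips with zero error.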
In this per-object float path both kernels are dominated by the surrounding JS overhead (object allocation, whitePoint carry-over, Lab staging), so you'll see similar numbers for the two targets. The kernel cost difference is front-and-centre in the Full comparison tab's 8-bit image path — that's where jsColorEngine's SIMD kernel pulls 2–3× ahead on CMYK→CMYK and roughly matches lcms-wasm on RGB→RGB (because matrix math doesn't vectorise much beyond pairing loads, but 4D tetra does).
JIT warmup curve — watch TurboFan tier up
Runs the same transform N times and plots the per-iteration
time. The first 50–300 iters typically run in Ignition / Sparkplug
(slower); once V8 collects enough type feedback, TurboFan compiles the
kernel and times drop sharply. WASM modes show much less ramp because
the compiler runs once at create() time, not per-iter.
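The measurement loop behind the plot is conceptually just per-iteration timestamping — a minimal sketch, with illustrative names rather than the bench's actual API:

```javascript
// Run `fn` `iters` times and record each iteration's wall-clock time.
// Plotting the returned array shows the tier-up curve: early iterations
// slow (Ignition/Sparkplug), later ones fast (TurboFan).
function warmupCurve(fn, iters) {
  const times = new Float64Array(iters);
  for (let i = 0; i < iters; i++) {
    const t0 = performance.now();
    fn();
    times[i] = performance.now() - t0;
  }
  return times;
}
```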
Pixel-count sweep — throughput vs buffer size
Same direction + mode, swept across pixel counts from 4 K to 4 M. Throughput should be roughly flat — if it falls off at large sizes the bottleneck is memory bandwidth, not the kernel. (4 K = 64×64, 4 M = 2048×2048.) Useful for sanity-checking that small-buffer cache effects aren't dominating the headline MPx/s number.
| Buffer (W×H) | Pixels | Hot ms / iter | MPx/s | MB/s | relative |
|---|---|---|---|---|---|
About this bench
What gets measured
- LUT build — wall-clock time of `new Transform(…); .create(src, dst, intent)`. Includes the optimiser pass + integer-mirror LUT bake (for `int*` modes) + WebAssembly module compile (for `int-wasm-*` modes). One-shot, paid once per Transform; for lcms it's `cmsCreateTransform(…)`.
- Cold first transform — the very first `transformArray()` call. Captures Ignition → Sparkplug → TurboFan tier-up cost on the JS path, and any first-touch page faults / cache misses on the WASM linear-memory path. Always slower than the steady-state hot number; the gap is "JIT warmup tax", not "the kernel is slow".
- Hot run — 200+ warmup iters (so V8 has stabilised), then the median of 5 timed batches. This is the number to report if you want to know how fast the kernel actually runs in steady state.
- MPx/s (Mega-pixels per second) — the primary throughput metric. One pixel = one colour sample regardless of channel count or bit depth. This is what the table sorts on and what the bar chart visualises. Use MPx/s when comparing kernels within the same direction (same channel layout), because the per-pixel work is identical and the number directly answers “how many pixels can I process per second?”
- MB/s (Megabytes per second) — total data throughput, calculated as `MPx/s × (inputChannels + outputChannels) × bytesPerSample`. For u8 modes `bytesPerSample = 1`; for u16 modes `bytesPerSample = 2`. This counts both input and output bytes (the CPU has to read one and write the other, so both contribute to memory pressure). Use MB/s when comparing across different channel counts (e.g. RGB vs CMYK), across bit depths (u8 vs u16), or against other engines that report throughput in bytes — for example LittleCMS’s Fast Float plug-in benchmarks use MB/s.
- Type (Full comparison table only) — `f64`, `u8`, or `u16`. `f64` = jsce `no-LUT` or `float` (f64 path or f64 CLUT); `u8` = 8-bit I/O with u16 integer CLUT; `u16` = lcms-wasm `NOOPT` rows (the wasm port exposes the u8/u16 pipeline, not f64) and 16-bit I/O rows (jsce `int16*`, lcms 16-bit). Lets you see f64 vs u8 vs u16 at a glance.
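The MB/s arithmetic is simple enough to state as code — a hypothetical helper, not part of the bench:

```javascript
// Convert the pixel-throughput metric to a byte-throughput metric.
// Counts both input and output bytes, since the CPU reads one buffer
// and writes the other.
function mbPerSec(mpxPerSec, inChannels, outChannels, bytesPerSample) {
  return mpxPerSec * (inChannels + outChannels) * bytesPerSample;
}
// e.g. 100 MPx/s on RGB→CMYK at u8: 100 × (3 + 4) × 1 = 700 MB/s
```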
Modes
This bench tests every lutMode explicitly, one row per
kernel, so you can see each cell. In real-world code you'll
usually just write
new Transform({ dataFormat: 'int8', buildLut: true })
and let the new v1.2 default lutMode: 'auto' pick
int-wasm-simd for you (with automatic demotion to
scalar WASM or JS int on older hosts). The explicit
rows below are what 'auto' is choosing from.
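The demotion chain can be illustrated with a toy resolver. This is not jsColorEngine's actual implementation — just the ordering the text describes, with a hypothetical `caps` capability object:

```javascript
// Pick the fastest available integer kernel, demoting when the host
// lacks WASM or WASM SIMD support.
function resolveAutoLutMode(caps) {
  if (caps.wasm && caps.wasmSimd) return 'int-wasm-simd';
  if (caps.wasm) return 'int-wasm-scalar';
  return 'int'; // pure-JS integer kernel, always available
}
```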
- jsColorEngine no-LUT (f64) — `buildLut: false`. No CLUT is built; every pixel walks the full per-stage pipeline (curves → matrix → LUT → PCS → inverse…) in 64-bit float. This is the accuracy configuration — use it when you need the most faithful math jsColorEngine can produce. Slowest by design; roughly comparable to `lcms-wasm NOOPTIMIZE`.
- jsColorEngine float — `buildLut: true` + `lutMode: 'float'`. Float64Array CLUT (grid size shown in the LUT column, typically 33³ for RGB and 33⁴ for CMYK), tetrahedral interp in f64. Same float math as no-LUT for the interp step, but the pipeline is pre-collapsed to a LUT so it's much faster. If you want "float accuracy without the f64-pipeline tax", this is it.
- jsColorEngine int — Uint16Array CLUT (Q0.16 weights), int32-specialised tetrahedral kernel via `Math.imul`. 8-bit I/O, bit-exact vs the float path. Added in v1.1 as the first non-float kernel.
- jsColorEngine int16 (u16 I/O) — v1.3 16-bit-I/O JS kernel: `Uint16Array` CLUT scaled to the full [0..0xFFFF] range with Q0.13 fractional weights. Identity transforms round-trip to within ≤1 LSB of u16. Triggered by `{ dataFormat: 'int16', buildLut: true, lutMode: 'int16' }`; with no explicit `lutMode` the auto-resolver demotes through `int16-wasm-simd` → `int16-wasm-scalar` → `int16` based on host capability. Targeted at HDR / 16-bit TIFF / measurement workflows where banding from u8 quantisation is unacceptable.
- jsColorEngine int16-wasm-scalar — v1.3 WASM port of the JS u16 kernel. Same Q0.13 CLUT, same arithmetic, bit-exact with the JS u16 row (0 LSB delta). Headline win: ~1.4× over `int16` on 3D, modest on 4D.
- jsColorEngine int16-wasm-simd — v1.3 WASM SIMD u16 kernel (channel-parallel `v128`, Q0.13). Bit-exact with both `int16` and `int16-wasm-scalar` — SIMD is a pure speed lift, not a precision trade. ~1.7–2.4× over `int16-wasm-scalar`, ~2.0–2.6× over JS `int16`, ~3.9–4.9× over the fastest `lcms 16-bit` row. Falls back to `int16-wasm-scalar` on hosts without WASM SIMD.
- jsColorEngine int-wasm-scalar — same int math compiled to WebAssembly. ~1.4× over `int` on 3D, ~1.2× on 4D.
- jsColorEngine int-wasm-simd — channel-parallel WASM SIMD. ~3.0–3.5× over `int` on 3D, ~2.1–2.6× on 4D. Falls back to scalar if your browser lacks WASM SIMD.
- lcms-wasm default / HIGHRES / NOOPT (8-bit) — LittleCMS 2.16 with `TYPE_*_8` formats. Pinned heap buffers via `_cmsDoTransform`. `default` = `flags=0`; `HIGHRES` = `cmsFLAGS_HIGHRESPRECALC` (49³ RGB / 23⁴ CMYK grid — LUT-path accuracy ceiling); `NOOPT` = `cmsFLAGS_NOOPTIMIZE` (no precalc LUT; the wasm build does not expose the f64 paths, so NOOPT runs the u16 pipeline rather than f64 as stock native lcms would — hence the Type column shows `u16`, and the jsce `no-LUT` row remains the f64 accuracy reference).
- lcms-wasm default / HIGHRES / NOOPT (16-bit) — same three flag variants but with `TYPE_*_16` formats. The lcms precalc-LUT grid is identical to the corresponding 8-bit row (lcms picks grid by colourspace, not by I/O bit depth), but the format converters at the pipeline edges flip from u8 to u16. Headline MPx/s typically drops ~10–15% vs the 8-bit row purely from the wider buffers. The `NOOPT` row shares the same u16-pipeline note as 8-bit NOOPT. The apples-to-apples rows to read jsce `int16` against are the lcms default / HIGHRES 16-bit rows, not `NOOPT`.
The four benchmark directions, and why RGB→RGB is sRGB→AdobeRGB
Each of the four directions exercises a genuinely distinct math path. The choice of endpoints matters — particularly on RGB→RGB, which is the one place where a naive choice lets one engine skew results by collapsing to a passthrough when both profiles are the same.
- RGB → RGB — sRGB → AdobeRGB1998. Both profiles are matrix + tone-curve type, so the transform is three 1D curves + a 3×3 matrix + three inverse 1D curves. Pure linear algebra, < 30 scalar ops per pixel. This direction must NOT be sRGB→sRGB: when source and destination are the same profile, lcms's `_cmsOptimizePipeline` detects the identity during `cmsCreateTransform()` and collapses the hot path to a byte copy — measured impact: roughly +80% on lcms-wasm RGB→RGB throughput, purely from the passthrough optimisation. Matrix-shaper collapse is a legitimate lcms optimisation on RGB→RGB, so to measure performance we exercise it against a different RGB profile, making it real work. Firefox drops from 165 → 91 MPx/s moving from sRGB→sRGB to sRGB→AdobeRGB, same machine, same session — that's the size of the difference. jsColorEngine has no equivalent identity-elision pass for now, so sRGB→sRGB was not apples-to-apples; we want to measure work done. Our `samples/profiles/AdobeRGB1998.icc` is the 560-byte reference Adobe RGB (1998) profile. lcms-wasm doesn't export `cmsCreateRGBProfile` / `cmsBuildGamma`, so we load the ICC bytes rather than synthesise the profile in memory.
- RGB → CMYK — sRGB → CoatedGRACoL2006. Matrix-shaper on input, 3D tetrahedral interp into a CMYK LUT on output. Real print-production workload.
- CMYK → RGB — GRACoL → sRGB. 4D tetrahedral interp on input, matrix-shaper on output. Identical profile on both sides is fine here (verified: CMYK→sRGB and CMYK→AdobeRGB land within 1.5% of each other on the same host, so lcms isn't getting an identity shortcut this way round).
- CMYK → CMYK — GRACoL → GRACoL. Yes, same profile. No identity collapse happens even though the profiles are pointer-equal: a CMYK profile's AToB (device→PCS) and BToA (PCS→device) tags aren't mathematical inverses (intent, black-point compensation and quantisation all break the symmetry), so the round-trip is genuinely two tetrahedral LUT traversals. Verified at 18 MPx/s on node — slower than CMYK→sRGB, which would be impossible if a passthrough were hitting.
How the LUT column is derived
The LUT column in the results table is ground-truth, not a
guess. For jsColorEngine we read it straight from the Transform
(xform.lut.g1 / .inputChannels /
.intLut.CLUT). For lcms-wasm, which doesn't expose the
precalc-LUT shape via its public API, we mirror the exact rule from
lcms2-2.18/src/cmspcs.c :: _cmsReasonableGridpointsByColorspace:
| Input channels | default | HIGHRESPRECALC | LOWRESPRECALC |
|---|---|---|---|
| 1 (mono) | 33 | 49 | 33 |
| 3 (RGB, Lab) | 33 | 49 | 17 |
| 4 (CMYK) | 17 | 23 | 17 |
| >4 (HiFi) | 7 | 7 | 6 |
So the RGB→CMYK row with lcms default shows
33×33×33 u16 (grid indexed by the 3-channel
RGB input, picked by the default branch for 3-channel colour spaces),
and the CMYK→RGB row with lcms HIGHRES shows
23×23×23×23 u16 (4-channel CMYK input,
HIGHRES branch). jsColorEngine’s own grid picks happen to line
up with lcms’s defaults — 33³ for
RGB input, 17⁴ for CMYK — which is a
convergent-evolution thing, not coordinated. Both engines arrive at
the same size/fidelity tradeoff.
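The mirrored rule is small enough to restate as code — a sketch of the table above, with flag names as plain strings rather than the real `cmsFLAGS_*` constants:

```javascript
// Grid points per axis, following _cmsReasonableGridpointsByColorspace:
// the grid is picked by input channel count and the precalc flag.
function lcmsGridPoints(inputChannels, flags) {
  if (inputChannels > 4) return flags === 'LOWRESPRECALC' ? 6 : 7;   // HiFi
  if (inputChannels === 4) return flags === 'HIGHRESPRECALC' ? 23 : 17; // CMYK
  // mono and 3-channel (RGB, Lab)
  if (flags === 'HIGHRESPRECALC') return 49;
  if (flags === 'LOWRESPRECALC') return inputChannels === 1 ? 33 : 17;
  return 33; // default
}
```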
Accuracy vs image: which mode for which job?
The four summary cards at the top of the Full comparison panel answer this directly. In short:
- Accuracy case (colour-critical, proofs, measurement) — 8-bit I/O, full pipeline per pixel, float math throughout. `jsce no-LUT` vs `lcms NOOPTIMIZE`. Slowest throughput; most faithful colour.
- Image case (display, export, batch ICC conversion) — 8-bit I/O, LUT-baked. Any of `jsce int*` / `jsce float` vs `lcms default` / `lcms HIGHRES`. Orders of magnitude faster; the grid quantisation error is typically well under 1 delta-E for a 33³/33⁴ LUT. This is what you want for every non-colour-critical workflow.
Worth noting: jsce float is also a LUT-backed mode
(Float64Array CLUT), so it lives in the image family even
though the interp math is f64. The precision difference vs the
int (Q0.16 u16) path is below the noise floor for 8-bit
output — use whichever is faster on your host, or reach for
no-LUT when you need no-LUT precision.
Why RGB and CMYK sit on different performance curves
The single most useful lens for reading the results table is "which math kernel is this row using?". Every colour conversion is either:
- Matrix + curves (RGB ↔ RGB, RGB ↔ Lab). The profile is a 3×3 matrix and three per-channel curves. A forward() call is < 30 scalar float ops — trivially fast, and trivially fast in any reasonable implementation, including scalar-only builds like stock lcms-wasm. SIMD still roughly doubles this because jsColorEngine packs 3 channels into one v128 register and collapses the 3×3 matrix + curves into 4–5 fused operations, but the absolute gap is smaller in ops than on the tetra paths, and the pressure on the memory subsystem is near-zero either way.
- 3D tetrahedral LUT (RGB ↔ CMYK, any RGB-profile conversion that can't be collapsed to matrix). A 3-input tetrahedral interp reads 4 corners of a 33×33×33 grid per pixel and does a barycentric blend. ∼40 ops, memory-ish bound. SIMD pays off meaningfully (v128 unit-weight mul + load-pair makes a 4-channel output a single fused op). jsColorEngine's int-wasm-simd kernel wins here by 2–3× over scalar, and by 3–4× over lcms-wasm default.
- 4D tetrahedral LUT (CMYK ↔ CMYK, CMYK as the input side of any non-matrix transform). 4-input tetra interpolation on a 17⁴ or 33⁴ grid: sort 4 weights to find the enclosing tetrahedron, fetch 5 corners with 4 output channels each (20 memory loads), barycentric blend. ∼80–100 ops per pixel, heavily memory-bound. This is where scalar implementations fall off a cliff — lcms-wasm's CMYK→CMYK number drops to 25–30 MPx/s on the same host where its RGB→RGB number is in the 90–130 range. jsColorEngine's SIMD kernel stays above 100 MPx/s because the channel-parallel v128 layout hides the memory-port pressure.
So when you see lcms-wasm being ~2× behind jsColorEngine on RGB→RGB and 4×–5× behind on CMYK→CMYK: that's not two different stories. It's the same story (scalar kernel runs fine on the cheap math path, suffers on the expensive one), reading differently because the two directions exercise fundamentally different code.
History note. An earlier version of this doc claimed
lcms-wasm was within 1.2× of jsColorEngine's SIMD kernel
on RGB→RGB. That number was wrong — we were benchmarking
sRGB→sRGB (identity, optimised to memcpy by lcms),
which gave lcms a ∼+80% boost on that row only. Switching to
sRGB→AdobeRGB (real matrix + curves) fixed it;
the gap is now ~2×, consistent with the 3D / 4D tetra directions
and with the general "SIMD lets us compete even where scalar
implementations are already decent" thesis.
Is lcms-wasm really running at full speed?
Yes. Verified by construction:
- Input and output buffers are `_malloc`'d once at `create()` time and reused across every call (pinned heap buffers).
- The hot path calls `_cmsDoTransform(xf, inPtr, outPtr, pixelCount)` directly — the raw C export. We deliberately skip `lcms.cmsDoTransform(...)` (no underscore), because that high-level wrapper does `_malloc` + `new Uint8Array(HEAPU8, …).set()` + `ccall` + `.slice()` + `_free` ×2 per call — an order-of-magnitude slowdown. We give lcms the same pinned-heap path a production app would use.
- Three flag variants are tested: `flags=0`, `HIGHRESPRECALC`, and `NOOPTIMIZE`. The best of the three is the "lcms at its best" number.
- Same seeded-PRNG input bytes as jsColorEngine, same `INTENT_RELATIVE_COLORIMETRIC`, same `TYPE_*_8` formats, same GRACoL profile, same pixel count.
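The pinned-buffer pattern looks roughly like this — a sketch, with `lcms` standing in for the emscripten Module object (`makePinnedTransform` is our illustrative name, not the bench's API):

```javascript
// Allocate the heap buffers once, then reuse the pointers on every call.
function makePinnedTransform(lcms, xf, pixelCount, inBytesPerPx, outBytesPerPx) {
  const inPtr = lcms._malloc(pixelCount * inBytesPerPx);   // paid once
  const outPtr = lcms._malloc(pixelCount * outBytesPerPx); // paid once
  return function transform(inBytes) {
    lcms.HEAPU8.set(inBytes, inPtr);                     // copy input into wasm memory
    lcms._cmsDoTransform(xf, inPtr, outPtr, pixelCount); // raw C export, no wrapper
    // Return a view (not a copy); the caller consumes it before the next call.
    return lcms.HEAPU8.subarray(outPtr, outPtr + pixelCount * outBytesPerPx);
  };
}
```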
Where does the speed gap come from then? lcms-wasm is a stock
emscripten build of LittleCMS 2.16 without -msimd128,
so every lcms kernel runs scalar regardless of host SIMD support.
jsColorEngine's int-wasm-simd path ships hand-tuned
channel-parallel v128 kernels for 3D and 4D tetrahedral interpolation.
That's a real capability gap, not a benchmark bias — the "lcms-wasm"
line in the engine info panel says stock scalar build
precisely to flag this. (Stock means an emscripten build without
-msimd128; informational, not a problem — the cell
stays green when lcms loads.)
A SIMD-compiled lcms-wasm (same source, built with -msimd128 -O3)
would close a lot of the gap. We'd welcome that comparison if anyone
wants to rebuild lcms with SIMD on and ship it back to
samples/lcms-wasm-dist/lcms.wasm.
How to read the result
The right-hand bar in the results table is normalised per direction so
the fastest mode for that direction fills the bar. The
vs `int` column is the speedup vs jsColorEngine
int for the same direction — this is the
apples-to-apples "how much does the WASM/SIMD port buy us" number,
and the one most directly comparable to the figures in
docs/Performance.md.
Caveats
- Run with the browser tab focused. Background tabs throttle `setTimeout` and (sometimes) `performance.now()` precision.
- Close other heavy tabs / apps. CPU contention shows up directly in MPx/s.
- First run after page load includes WASM compile + profile decode — hit "Run benchmark" twice for stable numbers.
- Numbers in the same browser are highly reproducible (±5%); cross-browser they can swing 30%+. Use a single browser when comparing modes.
- Mobile CPUs throttle aggressively under sustained load. The "Hot iters" setting is conservative by default — bumping it on mobile can show throttling kick in.
Source: samples/bench/ in the
jsColorEngine repo.
Stop the server with Ctrl+C in the terminal that ran
npm run serve (or node samples/bench/serve.js).