Engine info
Full comparison — every direction × every mode
Runs all four colour-space directions (RGB→RGB, RGB→CMYK,
CMYK→RGB, CMYK→CMYK) against eight jsColorEngine modes
(no-LUT, float, int,
int16, int16-wasm-scalar,
int16-wasm-simd, int-wasm-scalar,
int-wasm-simd) plus lcms-wasm in three flag
configurations × two I/O widths (default,
HIGHRESPRECALC, NOOPTIMIZE at both 8-bit
and 16-bit). The Type column tags the hot path
as f64 (jsce no-LUT, or float's f64 CLUT), u8
(8-bit I/O with a u16 integer LUT), or u16
(lcms-wasm NOOPT — the only u16 pipeline in the
wasm build — and all 16-bit I/O rows).
Each cell reports LUT build time, cold first
transform, and the hot-run median ms/iter.
| Direction | Mode | Type | LUT | LUT build | Cold 1st | Hot ms | MPx/s | MB/s | relative speed | vs `int` |
|---|---|---|---|---|---|---|---|---|---|---|
Accuracy sweep — Lab → device → Lab round-trip
jsColorEngine only. Sweeps the full Lab colour space (L = 0…100, a = −120…120, b = −120…120), converts every Lab point to a target profile's native device space as a Float object (no 8-bit/16-bit quantisation), then converts that float device value back to Lab and measures the residual ΔE76. This is the most accurate path we ship — full f64 pipeline, no LUT, no dataFormat rounding.
We run two round-trips per click to exercise both math kernels jsColorEngine ships: RGB (matrix + per-channel curves — pure linear algebra), and CMYK (4-input tetrahedral interpolation against the profile's BToA/AToB pipeline). In this path the per-object overhead (object allocation, whitePoint carry-over, Lab staging) tends to dominate over the math, so both kernels land in the same 1–4 M round-trip/s ballpark on modern hardware — the raw math-kernel gap shows up much more starkly in the Full comparison tab's LUT-accelerated 8-bit image path, where per-pixel work drops low enough for the actual kernel cost to matter.
lcms-wasm is not included on this tab: its high-level JS
binding exposes 8-bit and 16-bit buffer APIs but not a per-object
float path, so the comparison wouldn't be like-for-like. If you want a
pure-float lcms round-trip you need to drop down to
cmsDoTransform with TYPE_Lab_DBL /
TYPE_CMYK_DBL buffers — not currently wired up.
RGB round-trip · matrix kernel —
| Metric | Value | Notes |
|---|---|---|
CMYK round-trip · 4D tetrahedral kernel —
| Metric | Value | Notes |
|---|---|---|
What the ΔE numbers mean. When the starting Lab point is inside the target profile's gamut, the round-trip dE is the engine's residual numerical error — typically < 0.2 dE for jsColorEngine. When the starting Lab is outside gamut, the dE mostly reflects how far the gamut-mapping had to travel to land on the boundary; dE values of 20–80+ are normal for highly out-of-gamut Lab (e.g. pure a = ±120) and are a property of the profile, not the engine. Tick "in-gamut only" to filter those and see the pure math-fidelity number.
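For reference, ΔE76 is just the Euclidean distance between two Lab triples — a minimal sketch (the function name is ours for illustration, not jsColorEngine's API):

```javascript
// ΔE76: straight-line distance in Lab space between the original
// Lab point and the round-tripped result.
function deltaE76(lab1, lab2) {
  const dL = lab1[0] - lab2[0];
  const da = lab1[1] - lab2[1];
  const db = lab1[2] - lab2[2];
  return Math.sqrt(dL * dL + da * da + db * db);
}
```

Values below ~0.2 for in-gamut points are numerical residue; tens of ΔE for out-of-gamut points are the gamut-mapping distance, not engine error.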
What the two kernels actually do. A Lab→RGB
forward() is a matrix profile: 3 curve evaluations +
a 3×3 matrix multiply + 3 inverse curves. That's < 30 scalar
ops of pure linear algebra. A Lab→CMYK forward() is a
4-input tetrahedral interpolation against the profile's BToA LUT
(17⁴ or 33⁴ grid): find the enclosing
tetrahedron (sort 4 weights), fetch 5 grid corners with 4 output
channels each (20 memory loads), barycentric blend. ∼80–100
ops, memory-bound.
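The sort-and-walk procedure just described can be sketched in plain JavaScript. This is a simplified illustration, not jsColorEngine's actual kernel — `tetra4`, the flat-array grid layout, and the parameter names are all ours:

```javascript
// 4-input simplex ("tetrahedral") interpolation on an n^4 grid.
// `grid` is a flat Float64Array of n*n*n*n*outCh values; `p` is [c,m,y,k] in [0,1].
function tetra4(grid, n, outCh, p) {
  const idx = new Array(4), frac = new Array(4);
  for (let i = 0; i < 4; i++) {
    const v = Math.min(p[i], 1) * (n - 1);
    idx[i] = Math.min(Math.floor(v), n - 2); // base grid cell
    frac[i] = v - idx[i];                    // fractional position inside it
  }
  const s = [n * n * n * outCh, n * n * outCh, n * outCh, outCh]; // axis strides
  // Sort axes by descending fraction: this picks the enclosing simplex
  const order = [0, 1, 2, 3].sort((a, b) => frac[b] - frac[a]);
  // 5 barycentric weights: 1-f0, f0-f1, f1-f2, f2-f3, f3 (sorted fractions)
  const w = [1 - frac[order[0]]];
  for (let i = 0; i < 3; i++) w.push(frac[order[i]] - frac[order[i + 1]]);
  w.push(frac[order[3]]);
  let base = idx[0] * s[0] + idx[1] * s[1] + idx[2] * s[2] + idx[3] * s[3];
  const out = new Float64Array(outCh);
  for (let corner = 0; corner < 5; corner++) {
    for (let ch = 0; ch < outCh; ch++) out[ch] += w[corner] * grid[base + ch];
    if (corner < 4) base += s[order[corner]]; // step one unit along the next sorted axis
  }
  return out;
}
```

A useful property for sanity tests: simplex interpolation reproduces any affine function of the inputs exactly, so a grid baked from a linear formula round-trips with zero error.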
In this per-object float path both kernels are dominated by the surrounding JS overhead (object allocation, whitePoint carry-over, Lab staging), so you'll see similar numbers for the two targets. The kernel cost difference is front-and-centre in the Full comparison tab's 8-bit image path — that's where jsColorEngine's SIMD kernel pulls 2–3× ahead on CMYK→CMYK and roughly matches lcms-wasm on RGB→RGB (because matrix math doesn't vectorise much beyond pairing loads, but 4D tetra does).
JIT warmup curve — watch TurboFan tier up
Runs the same transform N times and plots the per-iteration
time. The first 50–300 iters typically run in Ignition / Sparkplug
(slower); once V8 collects enough type feedback, TurboFan compiles the
kernel and times drop sharply. WASM modes show much less ramp because
the compiler runs once at create() time, not per-iter.
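The measurement loop behind the plot is conceptually just per-iteration timestamping — a minimal sketch, with illustrative names rather than the bench's actual API:

```javascript
// Run `fn` `iters` times and record each iteration's wall-clock time.
// Plotting the returned array shows the tier-up curve: early iterations
// slow (Ignition/Sparkplug), later ones fast (TurboFan).
function warmupCurve(fn, iters) {
  const times = new Float64Array(iters);
  for (let i = 0; i < iters; i++) {
    const t0 = performance.now();
    fn();
    times[i] = performance.now() - t0;
  }
  return times;
}
```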
Pixel-count sweep — throughput vs buffer size
Same direction + mode, swept across pixel counts from 4 K to 4 M. Throughput should be roughly flat — if it falls off at large sizes the bottleneck is memory bandwidth, not the kernel. (4 K = 64×64, 4 M = 2048×2048.) Useful for sanity-checking that small-buffer cache effects aren't dominating the headline MPx/s number.
| Buffer (W×H) | Pixels | Hot ms / iter | MPx/s | MB/s | relative |
|---|---|---|---|---|---|
About this bench
What gets measured
- LUT build — wall-clock time of `new Transform(…); .create(src, dst, intent)`. Includes the optimiser pass + integer-mirror LUT bake (for `int*` modes) + WebAssembly module compile (for `int-wasm-*` modes). One-shot, paid once per Transform; for lcms it's `cmsCreateTransform(…)`.
- Cold first transform — the very first `transformArray()` call. Captures Ignition → Sparkplug → TurboFan tier-up cost on the JS path, and any first-touch page faults / cache misses on the WASM linear-memory path. Always slower than the steady-state hot number; the gap is "JIT warmup tax", not "the kernel is slow".
- Hot run — 200+ warmup iters (so V8 has stabilised), then the median of 5 timed batches. This is the number to report if you want to know how fast the kernel actually runs in steady state.
- MPx/s (Mega-pixels per second) — the primary throughput metric. One pixel = one colour sample regardless of channel count or bit depth. This is what the table sorts on and what the bar chart visualises. Use MPx/s when comparing kernels within the same direction (same channel layout), because the per-pixel work is identical and the number directly answers “how many pixels can I process per second?”
- MB/s (Megabytes per second) — total data throughput, calculated as `MPx/s × (inputChannels + outputChannels) × bytesPerSample`. For u8 modes `bytesPerSample = 1`; for u16 modes `bytesPerSample = 2`. This counts both input and output bytes (the CPU has to read one and write the other, so both contribute to memory pressure). Use MB/s when comparing across different channel counts (e.g. RGB vs CMYK), across bit depths (u8 vs u16), or against other engines that report throughput in bytes — for example LittleCMS’s Fast Float plug-in benchmarks use MB/s.
- Type (Full comparison table only) — `f64`, `u8`, or `u16`. `f64` = jsce `no-LUT` or `float` (f64 path or f64 CLUT); `u8` = 8-bit I/O with u16 integer CLUT; `u16` = lcms-wasm `NOOPT` rows (the wasm port exposes the u8/u16 pipeline, not f64) and 16-bit I/O rows (jsce `int16*`, lcms 16-bit). Lets you see f64 vs u8 vs u16 at a glance.
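The MB/s arithmetic is simple enough to state as code — a hypothetical helper, not part of the bench:

```javascript
// Convert the pixel-throughput metric to a byte-throughput metric.
// Counts both input and output bytes, since the CPU reads one buffer
// and writes the other.
function mbPerSec(mpxPerSec, inChannels, outChannels, bytesPerSample) {
  return mpxPerSec * (inChannels + outChannels) * bytesPerSample;
}
// e.g. 100 MPx/s on RGB→CMYK at u8: 100 × (3 + 4) × 1 = 700 MB/s
```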
Modes
This bench tests every lutMode explicitly, one row per
kernel, so you can see each cell. In real-world code you'll
usually just write
new Transform({ dataFormat: 'int8', buildLut: true })
and let the new v1.2 default lutMode: 'auto' pick
int-wasm-simd for you (with automatic demotion to
scalar WASM or JS int on older hosts). The explicit
rows below are what 'auto' is choosing from.
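The demotion chain can be illustrated with a toy resolver. This is not jsColorEngine's actual implementation — just the ordering the text describes, with a hypothetical `caps` capability object:

```javascript
// Pick the fastest available integer kernel, demoting when the host
// lacks WASM or WASM SIMD support.
function resolveAutoLutMode(caps) {
  if (caps.wasm && caps.wasmSimd) return 'int-wasm-simd';
  if (caps.wasm) return 'int-wasm-scalar';
  return 'int'; // pure-JS integer kernel, always available
}
```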
- jsColorEngine no-LUT (f64) — `buildLut: false`. No CLUT is built; every pixel walks the full per-stage pipeline (curves → matrix → LUT → PCS → inverse…) in 64-bit float. This is the accuracy configuration — use it when you need the most faithful math jsColorEngine can produce. Slowest by design; roughly comparable to `lcms-wasm NOOPTIMIZE`.
- jsColorEngine float — `buildLut: true` + `lutMode: 'float'`. Float64Array CLUT (grid size shown in the LUT column, typically 33³ for RGB and 33⁴ for CMYK), tetrahedral interp in f64. Same float math as no-LUT for the interp step, but the pipeline is pre-collapsed to a LUT so it's much faster. If you want "float accuracy without the f64-pipeline tax", this is it.
- jsColorEngine int — Uint16Array CLUT (Q0.16 weights), int32-specialised tetrahedral kernel via `Math.imul`. 8-bit I/O, bit-exact vs the float path. Added in v1.1 as the first non-float kernel.
- jsColorEngine int16 (u16 I/O) — v1.3 16-bit-I/O JS kernel: `Uint16Array` CLUT scaled to the full [0..0xFFFF] range with Q0.13 fractional weights. Identity transforms round-trip to within ≤1 LSB of u16. Triggered by `{ dataFormat: 'int16', buildLut: true, lutMode: 'int16' }`; with no explicit `lutMode` the auto-resolver demotes through `int16-wasm-simd` → `int16-wasm-scalar` → `int16` based on host capability. Targeted at HDR / 16-bit TIFF / measurement workflows where banding from u8 quantisation is unacceptable.
- jsColorEngine int16-wasm-scalar — v1.3 WASM port of the JS u16 kernel. Same Q0.13 CLUT, same arithmetic, bit-exact with the JS u16 row (0 LSB delta). Headline win: ~1.4× over `int16` on 3D, modest on 4D.
- jsColorEngine int16-wasm-simd — v1.3 WASM SIMD u16 kernel (channel-parallel `v128`, Q0.13). Bit-exact with both `int16` and `int16-wasm-scalar` — SIMD is a pure speed lift, not a precision trade. ~1.7–2.4× over `int16-wasm-scalar`, ~2.0–2.6× over JS `int16`, ~3.9–4.9× over the fastest `lcms 16-bit` row. Falls back to `int16-wasm-scalar` on hosts without WASM SIMD.
- jsColorEngine int-wasm-scalar — same int math compiled to WebAssembly. ~1.4× over `int` on 3D, ~1.2× on 4D.
- jsColorEngine int-wasm-simd — channel-parallel WASM SIMD. ~3.0–3.5× over `int` on 3D, ~2.1–2.6× on 4D. Falls back to scalar if your browser lacks WASM SIMD.
- lcms-wasm default / HIGHRES / NOOPT (8-bit) — LittleCMS 2.16 with `TYPE_*_8` formats. Pinned heap buffers via `_cmsDoTransform`. `default` = `flags=0`; `HIGHRES` = `cmsFLAGS_HIGHRESPRECALC` (49³ RGB / 23⁴ CMYK grid — LUT-path accuracy ceiling); `NOOPT` = `cmsFLAGS_NOOPTIMIZE` (no precalc LUT; the wasm build does not expose the f64 paths, so NOOPT runs the u16 pipeline rather than f64 as stock native lcms would — hence the Type column shows `u16`, and the jsce `no-LUT` row remains the f64 accuracy reference).
- lcms-wasm default / HIGHRES / NOOPT (16-bit) — same three flag variants but with `TYPE_*_16` formats. The lcms precalc-LUT grid is identical to the corresponding 8-bit row (lcms picks grid by colourspace, not by I/O bit depth), but the format converters at the pipeline edges flip from u8 to u16. Headline MPx/s typically drops ~10–15% vs the 8-bit row purely from the wider buffers. The `NOOPT` row shares the same u16-pipeline note as 8-bit NOOPT. The apples-to-apples rows to read jsce `int16` against are the lcms default / HIGHRES 16-bit rows, not `NOOPT`.
The four benchmark directions, and why RGB→RGB is sRGB→AdobeRGB
Each of the four directions exercises a genuinely distinct math path. The choice of endpoints matters — particularly on RGB→RGB, which is the one place where a naive choice lets one engine skew results by collapsing to a passthrough when both profiles are the same.
- RGB → RGB — sRGB → AdobeRGB1998. Both profiles are matrix + tone-curve type, so the transform is three 1D curves + a 3×3 matrix + three inverse 1D curves. Pure linear algebra, < 30 scalar ops per pixel. This direction must NOT be sRGB→sRGB: when source and destination are the same profile, lcms's `_cmsOptimizePipeline` detects the identity during `cmsCreateTransform()` and collapses the hot path to a byte copy — measured impact: roughly +80% on lcms-wasm RGB→RGB throughput, purely from the passthrough optimisation. Matrix-shaper collapse is a legitimate lcms optimisation on RGB→RGB, so to measure performance we exercise it against a different RGB profile, making it real work. Firefox drops from 165 → 91 MPx/s moving from sRGB→sRGB to sRGB→AdobeRGB, same machine, same session — that's the size of the difference. jsColorEngine has no equivalent identity-elision pass for now, so sRGB→sRGB was not apples-to-apples; we want to measure work done. Our `samples/profiles/AdobeRGB1998.icc` is the 560-byte reference Adobe RGB (1998) profile. lcms-wasm doesn't export `cmsCreateRGBProfile` / `cmsBuildGamma`, so we load the ICC bytes rather than synthesise the profile in memory.
- RGB → CMYK — sRGB → CoatedGRACoL2006. Matrix-shaper on input, 3D tetrahedral interp into a CMYK LUT on output. Real print-production workload.
- CMYK → RGB — GRACoL → sRGB. 4D tetrahedral interp on input, matrix-shaper on output. Identical profile on both sides is fine here (verified: CMYK→sRGB and CMYK→AdobeRGB land within 1.5% of each other on the same host, so lcms isn't getting an identity shortcut this way round).
- CMYK → CMYK — GRACoL → GRACoL. Yes, same profile. No identity collapse happens even though the profiles are pointer-equal: a CMYK profile's AToB (device→PCS) and BToA (PCS→device) tags aren't mathematical inverses (intent, black-point compensation and quantisation all break the symmetry), so the round-trip is genuinely two tetrahedral LUT traversals. Verified at 18 MPx/s on node — slower than CMYK→sRGB, which would be impossible if a passthrough were hitting.
How the LUT column is derived
The LUT column in the results table is ground-truth, not a
guess. For jsColorEngine we read it straight from the Transform
(xform.lut.g1 / .inputChannels /
.intLut.CLUT). For lcms-wasm, which doesn't expose the
precalc-LUT shape via its public API, we mirror the exact rule from
lcms2-2.18/src/cmspcs.c :: _cmsReasonableGridpointsByColorspace:
| Input channels | default | HIGHRESPRECALC | LOWRESPRECALC |
|---|---|---|---|
| 1 (mono) | 33 | 49 | 33 |
| 3 (RGB, Lab) | 33 | 49 | 17 |
| 4 (CMYK) | 17 | 23 | 17 |
| >4 (HiFi) | 7 | 7 | 6 |
So the RGB→CMYK row with lcms default shows
33×33×33 u16 (grid indexed by the 3-channel
RGB input, picked by the default branch for 3-channel colour spaces),
and the CMYK→RGB row with lcms HIGHRES shows
23×23×23×23 u16 (4-channel CMYK input,
HIGHRES branch). jsColorEngine’s own grid picks happen to line
up with lcms’s defaults — 33³ for
RGB input, 17⁴ for CMYK — which is a
convergent-evolution thing, not coordinated. Both engines arrive at
the same size/fidelity tradeoff.
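The mirrored rule is small enough to restate as code — a sketch of the table above, with flag names as plain strings rather than the real `cmsFLAGS_*` constants:

```javascript
// Grid points per axis, following _cmsReasonableGridpointsByColorspace:
// the grid is picked by input channel count and the precalc flag.
function lcmsGridPoints(inputChannels, flags) {
  if (inputChannels > 4) return flags === 'LOWRESPRECALC' ? 6 : 7;   // HiFi
  if (inputChannels === 4) return flags === 'HIGHRESPRECALC' ? 23 : 17; // CMYK
  // mono and 3-channel (RGB, Lab)
  if (flags === 'HIGHRESPRECALC') return 49;
  if (flags === 'LOWRESPRECALC') return inputChannels === 1 ? 33 : 17;
  return 33; // default
}
```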
Accuracy vs image: which mode for which job?
The four summary cards at the top of the Full comparison panel answer this directly. In short:
- Accuracy case (colour-critical, proofs, measurement) — 8-bit I/O, full pipeline per pixel, float math throughout. `jsce no-LUT` vs `lcms NOOPTIMIZE`. Slowest throughput; most faithful colour.
- Image case (display, export, batch ICC conversion) — 8-bit I/O, LUT-baked. Any of `jsce int*` / `jsce float` vs `lcms default` / `lcms HIGHRES`. Orders of magnitude faster; the grid quantisation error is typically well under 1 delta-E for a 33³/33⁴ LUT. This is what you want for every non-colour-critical workflow.
Worth noting: jsce float is also a LUT-backed mode
(Float64Array CLUT), so it lives in the image family even
though the interp math is f64. The precision difference vs the
int (Q0.16 u16) path is below the noise floor for 8-bit
output — use whichever is faster on your host, or reach for
no-LUT when you need no-LUT precision.
Why RGB and CMYK sit on different performance curves
The single most useful lens for reading the results table is "which math kernel is this row using?". Every colour conversion is either:
- Matrix + curves (RGB ↔ RGB, RGB ↔ Lab). The profile is a 3×3 matrix and three per-channel curves. A forward() call is < 30 scalar float ops — trivially fast, and trivially fast in any reasonable implementation, including scalar-only builds like stock lcms-wasm. SIMD still roughly doubles this because jsColorEngine packs 3 channels into one v128 register and collapses the 3×3 matrix + curves into 4–5 fused operations, but the absolute gap is smaller in ops than on the tetra paths, and the pressure on the memory subsystem is near-zero either way.
- 3D tetrahedral LUT (RGB ↔ CMYK, any RGB-profile conversion that can't be collapsed to matrix). A 3-input tetrahedral interp reads 4 corners of a 33×33×33 grid per pixel and does a barycentric blend. ∼40 ops, memory-ish bound. SIMD pays off meaningfully (v128 unit-weight mul + load-pair makes a 4-channel output a single fused op). jsColorEngine's int-wasm-simd kernel wins here by 2–3× over scalar, and by 3–4× over lcms-wasm default.
- 4D tetrahedral LUT (CMYK ↔ CMYK, CMYK as the input side of any non-matrix transform). 4-input tetra interpolation on a 17⁴ or 33⁴ grid: sort 4 weights to find the enclosing tetrahedron, fetch 5 corners with 4 output channels each (20 memory loads), barycentric blend. ∼80–100 ops per pixel, heavily memory-bound. This is where scalar implementations fall off a cliff — lcms-wasm's CMYK→CMYK number drops to 25–30 MPx/s on the same host where its RGB→RGB number is in the 90–130 range. jsColorEngine's SIMD kernel stays above 100 MPx/s because the channel-parallel v128 layout hides the memory-port pressure.
So when you see lcms-wasm being ~2× behind jsColorEngine on RGB→RGB and 4×–5× behind on CMYK→CMYK: that's not two different stories. It's the same story (scalar kernel runs fine on the cheap math path, suffers on the expensive one), reading differently because the two directions exercise fundamentally different code.
History note. An earlier version of this doc claimed
lcms-wasm was within 1.2× of jsColorEngine's SIMD kernel
on RGB→RGB. That number was wrong — we were benchmarking
sRGB→sRGB (identity, optimised to memcpy by lcms),
which gave lcms a ∼+80% boost on that row only. Switching to
sRGB→AdobeRGB (real matrix + curves) fixed it;
the gap is now ~2×, consistent with the 3D / 4D tetra directions
and with the general "SIMD lets us compete even where scalar
implementations are already decent" thesis.
Is lcms-wasm really running at full speed?
Yes. Verified by construction:
- Input and output buffers are `_malloc`'d once at `create()` time and reused across every call (pinned heap buffers).
- The hot path calls `_cmsDoTransform(xf, inPtr, outPtr, pixelCount)` directly — the raw C export. We deliberately skip `lcms.cmsDoTransform(...)` (no underscore), because that high-level wrapper does `_malloc` + `new Uint8Array(HEAPU8, …).set()` + `ccall` + `.slice()` + `_free` ×2 per call — an order-of-magnitude slowdown. We give lcms the same pinned-heap path a production app would use.
- Three flag variants are tested: `flags=0`, `HIGHRESPRECALC`, and `NOOPTIMIZE`. The best of the three is the "lcms at its best" number.
- Same seeded-PRNG input bytes as jsColorEngine, same `INTENT_RELATIVE_COLORIMETRIC`, same `TYPE_*_8` formats, same GRACoL profile, same pixel count.
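The pinned-buffer pattern looks roughly like this — a sketch, with `lcms` standing in for the emscripten Module object (`makePinnedTransform` is our illustrative name, not the bench's API):

```javascript
// Allocate the heap buffers once, then reuse the pointers on every call.
function makePinnedTransform(lcms, xf, pixelCount, inBytesPerPx, outBytesPerPx) {
  const inPtr = lcms._malloc(pixelCount * inBytesPerPx);   // paid once
  const outPtr = lcms._malloc(pixelCount * outBytesPerPx); // paid once
  return function transform(inBytes) {
    lcms.HEAPU8.set(inBytes, inPtr);                     // copy input into wasm memory
    lcms._cmsDoTransform(xf, inPtr, outPtr, pixelCount); // raw C export, no wrapper
    // Return a view (not a copy); the caller consumes it before the next call.
    return lcms.HEAPU8.subarray(outPtr, outPtr + pixelCount * outBytesPerPx);
  };
}
```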
Where does the speed gap come from then? lcms-wasm is a stock
emscripten build of LittleCMS 2.16 without -msimd128,
so every lcms kernel runs scalar regardless of host SIMD support.
jsColorEngine's int-wasm-simd path ships hand-tuned
channel-parallel v128 kernels for 3D and 4D tetrahedral interpolation.
That's a real capability gap, not a benchmark bias — the "lcms-wasm"
line in the engine info panel says stock scalar build
precisely to flag this. (Stock means an emscripten build without
-msimd128; informational, not a problem — the cell
stays green when lcms loads.)
A SIMD-compiled lcms-wasm (same source, built with -msimd128 -O3)
would close a lot of the gap. We'd welcome that comparison if anyone
wants to rebuild lcms with SIMD on and ship it back to
samples/lcms-wasm-dist/lcms.wasm.
How to read the result
The right-hand bar in the results table is normalised per direction so
the fastest mode for that direction fills the bar. The
vs `int` column is the speedup vs jsColorEngine
int for the same direction — this is the
apples-to-apples "how much does the WASM/SIMD port buy us" number,
and the one most directly comparable to the figures in
docs/Performance.md.
Caveats
- Run with the browser tab focused. Background tabs throttle `setTimeout` and (sometimes) `performance.now()` precision.
- Close other heavy tabs / apps. CPU contention shows up directly in MPx/s.
- First run after page load includes WASM compile + profile decode — hit "Run benchmark" twice for stable numbers.
- Numbers in the same browser are highly reproducible (±5%); cross-browser they can swing 30%+. Use a single browser when comparing modes.
- Mobile CPUs throttle aggressively under sustained load. The "Hot iters" setting is conservative by default — bumping it on mobile can show throttling kick in.
Source: samples/bench/ in the
jsColorEngine repo.
Stop the server with Ctrl+C in the terminal that ran
npm run serve (or node samples/bench/serve.js).