OneCharacterCode Benchmark V3

Three-tier symbolic prototype with iterative refinement, reported alongside a separately labeled system-level bandwidth simulation. These are two different claims, and both are shown honestly.


Required disclosure. This page reports two separate tests that are not the same claim:

Test 1 (File compression) measures how many bytes the V3 prototype symbolic encoder saves on individual files. The results are compared honestly against gzip and against the earlier V1/V2 prototypes.
Test 2 (System-level bandwidth) is a sync-model simulation: cloud full-download per session vs. install-once-locally + tiny update packet per session. The update packet is a simulated placeholder at 5% of initial size; it is not a measured compression ratio.

Reconstruction (SHA-256 round-trip) must PASS for any compression number in Test 1 to count.
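
The PASS rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark script itself: `round_trip_passes`, `encode`, and `decode` are hypothetical names, and `zlib` stands in for the OCC V3 codec only so the example runs on its own.

```python
import hashlib
import zlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def round_trip_passes(
    raw: bytes,
    encode: Callable[[bytes], bytes],
    decode: Callable[[bytes], bytes],
) -> bool:
    """PASS only if decode(encode(raw)) has the same SHA-256 as raw."""
    restored = decode(encode(raw))
    return sha256_hex(restored) == sha256_hex(raw)


if __name__ == "__main__":
    # zlib is a stand-in codec (assumption); any encode/decode pair fits.
    sample = b"example input " * 10
    print("PASS" if round_trip_passes(sample, zlib.compress, zlib.decompress) else "FAIL")
```

A compression number would only be recorded for a file when this check returns True.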

What changed in V3 (vs V2)

Three-tier tokens: Tier 1 = 16 single-byte tokens, Tier 2 = 512 two-byte tokens, Tier 3 = up to 256 three-byte tokens (long-tail dictionary).
Iterative refinement: after each accepted entry, rescan the working text and re-rank remaining candidates. V2 only ranked once.
Longer phrase pool: 3, 4, 5, 6, 8, 10, 12, 16, 24, 32, 48, 64, 96, 128 bytes (V2 capped at 32).
Per-file adaptive mode: the encoder runs in two configurations (dict-only Tier 1+2 vs. hybrid Tier 1+2+3) and picks the smaller output, provided it round-trips.
Compact header: 1-byte length prefixes per entry; magic OCC3; 16-bit tier counts; 32-bit body length.
Reserved-byte escape moved to 0x17 to free additional Tier-1 byte slots.
Same honest reporting rule: SHA-256 PASS required; gzip(OCC V3) is measured separately so the reader can see whether the prototype helps or hurts gzip downstream.
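
The per-file adaptive mode in the list above can be sketched as follows. This is an illustration under assumptions: `pick_smaller` and the `Codec` tuple are hypothetical names, and the round-trip check compares bytes directly, which accepts and rejects the same outputs as comparing SHA-256 digests.

```python
from typing import Callable, Optional, Sequence, Tuple

# (mode name, encode function, decode function); all hypothetical.
Codec = Tuple[str, Callable[[bytes], bytes], Callable[[bytes], bytes]]


def pick_smaller(raw: bytes, codecs: Sequence[Codec]) -> Optional[Tuple[str, bytes]]:
    """Encode with every configuration and keep the smallest output
    that reconstructs the input exactly. Returns None if none round-trips."""
    best: Optional[Tuple[str, bytes]] = None
    for name, encode, decode in codecs:
        out = encode(raw)
        if decode(out) != raw:
            continue  # failed reconstruction: this mode is disqualified
        if best is None or len(out) < len(best[1]):
            best = (name, out)
    return best
```

In the V3 setting the two entries would be the dict-only (Tier 1+2) and hybrid (Tier 1+2+3) configurations; the function itself does not care how many modes it is given.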

Test 1 — File compression results

Per-file compression of the same three sample inputs used in V1 and V2. All numbers are reported as measured, including the cases where gzip beats the prototype.

Raw / Gzip / OCC V1 / OCC V2 / OCC V3 / gzip(OCC V3) in bytes. *V3 reduction %* is OCC V3 vs. raw (positive = smaller). *V3 vs V2 %* is OCC V3 vs. OCC V2 (positive = V3 made the prototype smaller than V2). Reconstruction must PASS.
Test File	Raw	Gzip (raw)	OCC V1	OCC V2	OCC V3	Gzip (OCC V3)	V3 reduction	V3 vs V2	Mode	Reconstruction	SHA-256

Test 2 — System-level bandwidth simulation

Scope of this test. This is not file compression. It is a sync-model comparison: a cloud user who pulls the full package on every session vs. a user who installs once locally and pulls a small update packet on subsequent sessions. The update-packet size used here is a simulated placeholder set to 5% of the initial install. Substitute your own measured update size to model a real deployment.

*Cloud-full-download bytes* = sessions × initial package. *Local install + updates bytes* = initial package + (sessions − 1) × simulated update. The savings percentage is a model output, not a compression result.
Sessions	Cloud full download (bytes)	Local install + update packets (bytes)	Bytes saved	Savings %
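
The sync-model arithmetic for Test 2 can be reproduced with a short function. This is a sketch of the two formulas stated for this table; the function name is hypothetical, and the 1,000,000-byte package in the usage line is an illustrative value, not a measured one.

```python
from typing import Tuple


def bandwidth_model(
    sessions: int, initial_bytes: int, update_fraction: float = 0.05
) -> Tuple[int, int, int, float]:
    """Return (cloud_bytes, local_bytes, bytes_saved, savings_percent).

    cloud = sessions * initial package
    local = initial package + (sessions - 1) * simulated update
    update_fraction defaults to the page's 5% placeholder.
    """
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    update_bytes = round(initial_bytes * update_fraction)
    cloud = sessions * initial_bytes
    local = initial_bytes + (sessions - 1) * update_bytes
    saved = cloud - local
    return cloud, local, saved, 100.0 * saved / cloud


if __name__ == "__main__":
    # Hypothetical 1,000,000-byte package over 20 sessions.
    print(bandwidth_model(20, 1_000_000))
```

To model a real deployment, pass a measured update size as `update_fraction` instead of the placeholder.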

Flow

Raw Input → Multi-length Scan (3..128 bytes) → Greedy Pick + Rescan (iterative refinement) → Tier Assign (1B / 2B / 3B tokens) → Adaptive Mode (pick smaller, PASS) → SHA-256 Round-trip

Downloads

benchmark-results-v3.json: Machine-readable file-compression results
system-level-bandwidth-results-v3.json: Machine-readable bandwidth simulation
benchmark-test-run-v3.txt: Human-readable V3 run report
run_benchmark_v3.ps1: The V3 benchmark script (PS 5.1 compatible)
README_REPRODUCE_V3.txt: How to reproduce V3 locally
SHA256_MANIFEST_V3.txt: Hashes of every V3 file
← V2 benchmark page (unchanged, prior optimized prototype)
← V1 benchmark page (unchanged, original prototype)

Technical notes

What V3's encoder is doing. The V3 encoder reads the input as bytes and treats it as a Latin-1 string for fast substring counting (Latin-1 maps every byte to exactly one character, so there is no UTF-8 ambiguity). It scans candidate substrings at 14 phrase lengths (3 through 128 bytes), counts non-overlapping occurrences, and greedily accepts the candidate with the highest net savings. After each acceptance, the substring is replaced in a working buffer with a private-use sentinel and the scan repeats; this is the iterative refinement that V2 lacked. Accepted entries are then ranked by total bytes saved and assigned to tiers: the top 16 get Tier 1 single-byte tokens (bytes 0x01..0x08, 0x0B, 0x0C, 0x0E..0x13, chosen to avoid TAB / LF / CR / NUL); the next 512 get Tier 2 two-byte tokens (0x14 or 0x15 + index); the next 256 get Tier 3 three-byte tokens (0x16 + 16-bit index). The encoder also reserves 0x17 as a literal escape, so any input byte that happens to collide with a token byte can pass through losslessly.
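
The tier layout described above can be sketched as a mapping from an entry's rank (0 = most bytes saved) to its token bytes. The byte values come from the paragraph above; two details are assumptions made for illustration only: that Tier 2 uses prefix 0x14 for its first 256 entries and 0x15 for the next 256, and that the Tier 3 16-bit index is big-endian. `token_for_rank` is a hypothetical name.

```python
# Tier 1 token bytes: 0x01..0x08, 0x0B, 0x0C, 0x0E..0x13 (skips NUL, TAB, LF, CR).
TIER1_BYTES = [*range(0x01, 0x09), 0x0B, 0x0C, *range(0x0E, 0x14)]
assert len(TIER1_BYTES) == 16

TIER2_COUNT = 512   # prefix 0x14 or 0x15 + one index byte
TIER3_COUNT = 256   # prefix 0x16 + 16-bit index
ESCAPE_BYTE = 0x17  # literal escape, never used as a token prefix


def token_for_rank(rank: int) -> bytes:
    """Map a dictionary entry's rank to its 1-, 2-, or 3-byte token."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < len(TIER1_BYTES):
        return bytes([TIER1_BYTES[rank]])
    rank -= len(TIER1_BYTES)
    if rank < TIER2_COUNT:
        # Assumption: high bit of the 9-bit index selects 0x14 vs 0x15.
        return bytes([0x14 + (rank >> 8), rank & 0xFF])
    rank -= TIER2_COUNT
    if rank < TIER3_COUNT:
        # Assumption: big-endian 16-bit index.
        return bytes([0x16]) + rank.to_bytes(2, "big")
    raise ValueError("rank exceeds the 784-entry capacity of the three tiers")
```

Under this layout the three tiers together hold at most 16 + 512 + 256 = 784 entries.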

What V3 fixes from V2. V2 ranked candidates once; V3 rescans after every accept, which catches savings that emerge after early picks remove longer common patterns. V2 capped phrase length at 32 bytes; V3 goes to 128, catching long structural repeats in HTML and JSON. V2 used two tiers; V3 adds a third (3-byte) tier for the long tail. V2 picked a single encoding mode per file; V3 runs the encoder twice (dict-only vs. full hybrid) and keeps the smaller output as long as it round-trips. The combined effect is roughly 28-34% reduction vs. raw on these three inputs, compared to V2's 20-23%.
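
The rescan-after-every-accept loop can be sketched as below. This is a simplified illustration, not the benchmark script: the net-savings formula (a flat 2-byte token cost during selection, plus the phrase and a 1-byte length prefix as dictionary cost) is an assumption, and the function names are hypothetical. It is also deliberately unoptimized, which is only tolerable on KB-scale inputs.

```python
from typing import List, Optional, Tuple

PHRASE_LENGTHS = (3, 4, 5, 6, 8, 10, 12, 16, 24, 32, 48, 64, 96, 128)
SENTINEL = "\ue000"  # private-use code point; cannot occur in Latin-1 text


def best_candidate(text: str, token_cost: int = 2) -> Tuple[Optional[str], int]:
    """Return the phrase with the highest positive net savings, or (None, 0)."""
    best, best_net = None, 0
    for n in PHRASE_LENGTHS:
        seen = set()
        for i in range(len(text) - n + 1):
            phrase = text[i:i + n]
            if phrase in seen or SENTINEL in phrase:
                continue
            seen.add(phrase)
            count = text.count(phrase)  # str.count is non-overlapping
            # Assumed model: bytes removed minus dictionary entry cost.
            net = count * (n - token_cost) - (n + 1)
            if net > best_net:
                best, best_net = phrase, net
    return best, best_net


def greedy_dictionary(raw: bytes, max_entries: int = 784) -> List[Tuple[bytes, int]]:
    """Accept one phrase at a time, rescanning the working text after each."""
    text = raw.decode("latin-1")  # one character per input byte
    accepted: List[Tuple[bytes, int]] = []
    while len(accepted) < max_entries:
        phrase, net = best_candidate(text)
        if phrase is None:
            break
        accepted.append((phrase.encode("latin-1"), net))
        text = text.replace(phrase, SENTINEL)  # mask so later scans skip it
    return accepted
```

A rank-once design would call `best_candidate`-style scoring a single time and take the top entries; here every accepted phrase changes the working text, so the scores of the remaining candidates are recomputed before the next pick.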

Honest limitations for file compression (Test 1). V3 still does not match gzip on these inputs. Gzip combines LZ77 + Huffman and is mature; a dictionary-substitution prototype with no entropy coder cannot match it on short English-like inputs. The gzip(OCC V3) column lets the reader see whether OCC V3 made gzip's job easier or harder; in this run it makes gzip's job slightly harder, because V3's substitutions remove some of the redundancy gzip would otherwise have exploited.
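
The gzip(OCC V3) comparison can be reproduced with Python's standard `gzip` module. This is a sketch under assumptions: the compression level used by the benchmark script is not stated on this page, so level 9 is an illustrative choice, and `downstream_effect` is a hypothetical helper name.

```python
import gzip


def gzip_size(data: bytes) -> int:
    """Size in bytes of gzip(data); mtime=0 keeps the output deterministic."""
    return len(gzip.compress(data, compresslevel=9, mtime=0))


def downstream_effect(raw: bytes, encoded: bytes) -> int:
    """gzip(encoded) size minus gzip(raw) size.

    Positive: the encoder made gzip's output larger (hurt gzip).
    Negative: the encoder made gzip's output smaller (helped gzip).
    """
    return gzip_size(encoded) - gzip_size(raw)
```

Here `raw` is the original file and `encoded` is the OCC V3 output for the same file; a positive result corresponds to the "slightly harder" outcome described above.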

What the bandwidth test is and isn't (Test 2). The bandwidth test models a deployment pattern: a SaaS user who pulls the full app over the network on every session vs. a user who installs once and only receives small updates after that. The math is straightforward arithmetic: with the update packet fixed at 5% of the initial package, the modeled savings are 0.95 × (1 − 1/sessions), which reaches 90% at 19 sessions and approaches, but never exceeds, 95%. The savings come from not redownloading the same bytes, not from compressing those bytes more tightly. A 90%+ savings number at high session counts is honest for that sync model, but a reader who reads it as "OCC compresses 90%" would be misreading the page. That is why the two tables are kept physically separate and labeled this way.

What this benchmark is not. Not the production OneCharacterCode engine. Not a universal compression claim. Not a comparison against the full modern compressor landscape (zstd / xz / lzma still need to be added). Not a generalization to inputs much larger or differently structured than these three KB-scale samples.