OneCharacterCode Benchmark V3
Three-tier symbolic prototype with iterative refinement — reported alongside a separately-labeled system-level bandwidth simulation. Two different claims, both shown honestly.
- Test 1 (File compression) measures how many bytes the V3 prototype symbolic encoder saves on individual files. Compared honestly against gzip and against the earlier V1/V2 prototypes.
- Test 2 (System-level bandwidth) is a sync-model simulation: cloud full-download per session vs. install-once-locally + tiny update packet per session. The update packet is a simulated placeholder at 5% of initial size; it is not a measured compression ratio.
What changed in V3 (vs V2)
- Three-tier tokens: Tier 1 = 16 single-byte tokens, Tier 2 = 512 two-byte tokens, Tier 3 = up to 256 three-byte tokens (long-tail dictionary).
- Iterative refinement: after each accepted entry, rescan the working text and re-rank remaining candidates. V2 only ranked once.
- Longer phrase pool: 3, 4, 5, 6, 8, 10, 12, 16, 24, 32, 48, 64, 96, 128 bytes (V2 capped at 32).
- Per-file adaptive mode: the encoder runs in two configurations (dict-only Tier 1+2 vs. hybrid Tier 1+2+3) and picks the smaller output, provided it round-trips.
- Compact header: 1-byte length prefixes per entry; magic OCC3; 16-bit tier counts; 32-bit body length.
- Reserved-byte escape moved to 0x17 to free additional Tier-1 byte slots.
- Same honest reporting rule: SHA-256 PASS required; gzip(OCC V3) is measured separately so the reader can see whether the prototype helps or hurts gzip downstream.
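The iterative-refinement step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's code: the names `best_candidate` and `build_dictionary`, the `SENTINEL` code point, and the cost model (a flat 2-byte token cost and 1 byte of per-entry overhead for the length prefix) are all assumptions made for the example.

```python
# Phrase lengths taken from the V3 description above.
PHRASE_LENGTHS = (3, 4, 5, 6, 8, 10, 12, 16, 24, 32, 48, 64, 96, 128)
SENTINEL = "\ue000"  # assumed private-use code point; cannot occur in Latin-1 text


def best_candidate(text, token_cost=2, entry_overhead=1):
    """Return (net_savings, phrase) for the best remaining candidate, or None.

    Assumed cost model: each occurrence shrinks from len(phrase) to token_cost
    bytes, and the dictionary entry costs len(phrase) + entry_overhead bytes.
    """
    best = None
    for n in PHRASE_LENGTHS:
        if n > len(text):
            break
        seen = set()
        for i in range(len(text) - n + 1):
            phrase = text[i:i + n]
            if phrase in seen or SENTINEL in phrase:
                continue  # already scored, or spans an earlier replacement
            seen.add(phrase)
            count = text.count(phrase)  # str.count is non-overlapping
            if count < 2:
                continue
            net = count * (n - token_cost) - (n + entry_overhead)
            if net > 0 and (best is None or net > best[0]):
                best = (net, phrase)
    return best


def build_dictionary(data, max_entries=16 + 512 + 256):
    """Greedy selection that rescans the working text after every accept."""
    text = data.decode("latin-1")  # one character per input byte
    entries = []
    while len(entries) < max_entries:
        found = best_candidate(text)
        if found is None:
            break
        net, phrase = found
        entries.append((phrase.encode("latin-1"), net))
        text = text.replace(phrase, SENTINEL)  # hide accepted phrase, then re-rank
    return entries
```

Because the working text changes after every accept, a phrase that looked attractive in the first pass can drop out of contention, and shorter leftovers can become worthwhile. The sketch is quadratic in the input size, so it is only suitable for KB-scale samples.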
Test 1 — File compression results
Per-file compression of the same three sample inputs used in V1 and V2. All numbers are honest, including gzip beating the prototype.
| Test File | Raw | Gzip (raw) | OCC V1 | OCC V2 | OCC V3 | Gzip (OCC V3) | V3 reduction | V3 vs V2 | Mode | Reconstruction | SHA-256 |
|---|---|---|---|---|---|---|---|---|---|---|---|
Test 2 — System-level bandwidth simulation
| Sessions | Cloud full download (bytes) | Local install + update packets (bytes) | Bytes saved | Savings % |
|---|---|---|---|---|
Flow
3..128 bytes → iterative refinement → 1B / 2B / 3B tokens → pick smaller (PASS)
Technical notes
What V3's encoder is doing. The V3 encoder reads the input as bytes and decodes it as a Latin-1 string, so each byte maps to exactly one character and substring counting is fast and free of UTF-8 ambiguity. It scans candidate substrings at 14 phrase lengths (3 through 128 bytes), counts non-overlapping occurrences, and greedily accepts the candidate with the highest net savings. After each acceptance, the substring is replaced in a working buffer with a private-use sentinel and the scan repeats; this is the iterative refinement that V2 lacked. Accepted entries are then ranked by total bytes saved and assigned to tiers: the top 16 get Tier 1 single-byte tokens (bytes 0x01..0x08, 0x0B, 0x0C, 0x0E..0x13, chosen to avoid NUL, TAB, LF, and CR); the next 512 get Tier 2 two-byte tokens (0x14 or 0x15 + index); the next 256 get Tier 3 three-byte tokens (0x16 + 16-bit index). The encoder also reserves 0x17 as a literal escape, so any input byte that happens to collide with a token byte can pass through losslessly.
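The tier layout described above can be written down as a small rank-to-token mapping. The byte values come from the paragraph; everything else is an assumption made for illustration: the function names, the big-endian 16-bit index, the rule that the Tier 2 prefix is chosen by `rank // 256`, and the escape format (0x17 followed by the literal byte).

```python
# Byte values from the description above; layout details below are assumed.
TIER1_BYTES = [*range(0x01, 0x09), 0x0B, 0x0C, *range(0x0E, 0x14)]  # 16 slots
TIER2_PREFIXES = (0x14, 0x15)  # 2 prefixes x 256 indices = 512 tokens
TIER3_PREFIX = 0x16            # followed by a 16-bit index
ESCAPE = 0x17

RESERVED = set(TIER1_BYTES) | set(TIER2_PREFIXES) | {TIER3_PREFIX, ESCAPE}


def token_for_rank(rank):
    """Map a dictionary rank (0 = most bytes saved) to its token bytes."""
    if rank < 16:
        return bytes([TIER1_BYTES[rank]])
    rank -= 16
    if rank < 512:
        # Assumed: first 256 entries use 0x14, next 256 use 0x15.
        return bytes([TIER2_PREFIXES[rank // 256], rank % 256])
    rank -= 512
    if rank < 256:
        # Assumed: big-endian 16-bit index.
        return bytes([TIER3_PREFIX]) + rank.to_bytes(2, "big")
    raise ValueError("rank exceeds the 784-entry dictionary")


def escape_literals(data):
    """Prefix any input byte that collides with a reserved byte with ESCAPE."""
    out = bytearray()
    for b in data:
        if b in RESERVED:
            out.append(ESCAPE)
        out.append(b)
    return bytes(out)
```

Under these assumptions, `token_for_rank(0)` yields the single byte 0x01 and `escape_literals(b"a\x01b")` yields `b"a\x17\x01b"`, so a literal 0x01 in the input is never mistaken for a Tier 1 token.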
What V3 fixes from V2. V2 ranked candidates once; V3 rescans after every accept, which catches savings that emerge after early picks remove longer common patterns. V2 capped phrase length at 32 bytes; V3 goes to 128, catching long structural repeats in HTML and JSON. V2 used two tiers; V3 adds a third (3-byte) tier for the long tail. V2 picked a single encoding mode per file; V3 runs the encoder twice (dict-only vs. full hybrid) and keeps the smaller output as long as it round-trips. The combined effect is roughly 28-34% reduction vs. raw on these three inputs, compared to V2's 20-23%.
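The "run both modes and keep the smaller output that round-trips" rule is easy to express generically. The sketch below is an illustration under assumptions: `pick_smaller` and the `codecs` mapping are hypothetical names, and the encoders passed in would be the dict-only and hybrid configurations, which are not reproduced here.

```python
import hashlib


def pick_smaller(data, codecs):
    """Return (name, blob) for the smallest output that round-trips, or None.

    `codecs` maps a mode name to an (encode, decode) pair of callables.
    A mode is eligible only if SHA-256 of the decoded output matches the input.
    """
    want = hashlib.sha256(data).digest()
    best = None
    for name, (encode, decode) in codecs.items():
        blob = encode(data)
        if hashlib.sha256(decode(blob)).digest() != want:
            continue  # reconstruction failed: never eligible, however small
        if best is None or len(blob) < len(best[1]):
            best = (name, blob)
    return best
```

Checking the hash before comparing sizes means a mode that produces a tiny but lossy output can never win the comparison.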
Honest limitations for file compression (Test 1). V3 still does not match gzip on these inputs. Gzip combines LZ77 + Huffman and is mature; a dictionary-substitution prototype with no entropy coder cannot match it on short English-like inputs. The gzip(OCC V3) column lets the reader see whether OCC V3 made gzip's job easier or harder; in this run it makes gzip's job slightly harder, because V3's substitutions remove some of the redundancy gzip would otherwise have exploited.
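The gzip(OCC V3) comparison described above amounts to gzipping both the raw file and the encoded file and comparing sizes. A minimal sketch using Python's standard `gzip` module follows; `gzip_size` and `downstream_effect` are hypothetical helper names, and compression level 9 is an assumption rather than the setting used for the table.

```python
import gzip


def gzip_size(data):
    """Size in bytes of the gzip stream for `data` (mtime fixed for repeatability)."""
    return len(gzip.compress(data, compresslevel=9, mtime=0))


def downstream_effect(raw, encoded):
    """Compare gzip(raw) with gzip(encoded) to see whether pre-encoding helps gzip."""
    raw_gz = gzip_size(raw)
    enc_gz = gzip_size(encoded)
    return {
        "gzip_raw": raw_gz,
        "gzip_encoded": enc_gz,
        "delta_bytes": enc_gz - raw_gz,  # positive: encoding made gzip's job harder
        "helps_gzip": enc_gz < raw_gz,
    }
```

A positive `delta_bytes` corresponds to the case reported above, where the substitutions remove redundancy that gzip would otherwise have exploited.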
What the bandwidth test is and isn't (Test 2). The bandwidth test models a deployment pattern: a SaaS user who pulls the full app over the network on every session vs. a user who installs once and only receives small updates after that. The math is straightforward arithmetic: the savings come from not redownloading the same bytes, not from compressing those bytes more tightly. A 90%+ savings number at high session counts is honest for that sync model, but a reader who takes it to mean "OCC compresses 90%" would be misreading the page. That is why the two tables are kept physically separate and labeled this way.
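The arithmetic behind the sync model can be reproduced in a few lines. This sketch assumes one update packet per session (including the first) and uses the 5% placeholder stated for Test 2; the function name `simulate` and the 1,000,000-byte install size in the test values are hypothetical.

```python
def simulate(initial_bytes, sessions, update_fraction=0.05):
    """Return (cloud, local, saved, savings_pct) for the two sync models.

    cloud: full download on every session.
    local: one install plus one update packet per session (assumed).
    """
    cloud = initial_bytes * sessions
    update = round(initial_bytes * update_fraction)  # simulated placeholder size
    local = initial_bytes + update * sessions
    saved = cloud - local
    return cloud, local, saved, 100 * saved / cloud
```

Under these assumptions the savings fraction is 1 - f - 1/N for update fraction f and N sessions, so with f = 0.05 it is negative for a single session, passes 90% only beyond 20 sessions, and can never exceed 95%. None of that depends on how well any encoder compresses.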
What this benchmark is not. Not the production OneCharacterCode engine. Not a universal compression claim. Not a comparison against the full modern compressor landscape (zstd / xz / lzma still need to be added). Not a generalization to inputs much larger or differently structured than these three KB-scale samples.