OneCharacterCode Benchmark V3

Three-tier symbolic prototype with iterative refinement — reported alongside a separately-labeled system-level bandwidth simulation. Two different claims, both shown honestly.

Required disclosure. This page reports two separate tests that are not the same claim: Test 1 (file compression) and Test 2 (system-level bandwidth simulation). In Test 1, reconstruction (SHA-256 round-trip) must PASS for any compression number to count.

What changed in V3 (vs V2)

Test 1 — File compression results

Per-file compression of the same three sample inputs used in V1 and V2. All numbers are reported as measured, including the cases where gzip beats the prototype.

Raw / Gzip / OCC V1 / OCC V2 / OCC V3 / gzip(OCC V3) in bytes. V3 reduction % is OCC V3 vs. raw (positive = smaller). V3 vs V2 % is OCC V3 vs. OCC V2 (positive = V3 made the prototype smaller than V2). Reconstruction must PASS.
Test File | Raw | Gzip (raw) | OCC V1 | OCC V2 | OCC V3 | Gzip (OCC V3) | V3 reduction | V3 vs V2 | Mode | Reconstruction | SHA-256
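The two percentage columns follow directly from byte counts. A minimal sketch of that arithmetic, assuming Python's standard gzip module for the baseline; the function names and the sample byte counts are illustrative and are not taken from the benchmark's own harness:

```python
import gzip


def reduction_pct(before: int, after: int) -> float:
    """Percent size reduction; positive means `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be a positive byte count")
    return (before - after) * 100.0 / before


def gzip_size(data: bytes) -> int:
    """Size in bytes of the gzip-compressed form of `data` (library default level)."""
    return len(gzip.compress(data))


# Illustrative byte counts only; these are NOT results from this page.
raw_bytes, occ_v2_bytes, occ_v3_bytes = 1000, 800, 700
v3_reduction = reduction_pct(raw_bytes, occ_v3_bytes)  # "V3 reduction": OCC V3 vs. raw
v3_vs_v2 = reduction_pct(occ_v2_bytes, occ_v3_bytes)   # "V3 vs V2": OCC V3 vs. OCC V2
```

A negative value from `reduction_pct` means the second size is larger than the first, which is how an encoder that expands its input would show up in either column.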

Test 2 — System-level bandwidth simulation

Scope of this test. This is not file compression. It is a sync-model comparison: a cloud user who pulls the full package on every session vs. a user who installs once locally and pulls a small update packet on subsequent sessions. The update-packet size used here is a simulated placeholder set to 5% of the initial install. Substitute your own measured update size to model a real deployment.

 

Cloud-full-download bytes = sessions × initial package. Local install + updates bytes = initial package + (sessions − 1) × simulated update. The savings percentage is a model output, not a compression result.
Sessions | Cloud full download (bytes) | Local install + update packets (bytes) | Bytes saved | Savings %
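The model behind this table is plain arithmetic and can be reproduced in a few lines. The sketch below assumes the 5% placeholder described above as its default; the function names and the 1,000,000-byte package size are hypothetical and chosen only for illustration:

```python
def cloud_full_download(sessions: int, initial_package: int) -> int:
    """Bytes pulled when the full package is downloaded on every session."""
    return sessions * initial_package


def local_install_with_updates(sessions: int, initial_package: int,
                               update_packet: int) -> int:
    """Bytes pulled for one install plus an update packet on each later session."""
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    return initial_package + (sessions - 1) * update_packet


def savings_row(sessions: int, initial_package: int,
                update_fraction: float = 0.05) -> dict:
    """One table row; update_fraction is the simulated placeholder, not a measurement."""
    update_packet = round(initial_package * update_fraction)
    cloud = cloud_full_download(sessions, initial_package)
    local = local_install_with_updates(sessions, initial_package, update_packet)
    saved = cloud - local
    return {
        "sessions": sessions,
        "cloud_bytes": cloud,
        "local_bytes": local,
        "bytes_saved": saved,
        "savings_pct": saved * 100.0 / cloud,
    }


# Hypothetical 1,000,000-byte package over 100 sessions.
row = savings_row(100, 1_000_000)
```

With these assumed inputs the local path transfers 5,950,000 bytes against 100,000,000 for the cloud path. To model a real deployment, pass a measured ratio as `update_fraction` instead of the 0.05 placeholder.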

Flow

1. Raw Input
2. Multi-length Scan: 3..128 bytes
3. Greedy Pick + Rescan: iterative refinement
4. Tier Assign: 1B / 2B / 3B tokens
5. Adaptive Mode: pick smaller (PASS)
6. SHA-256 Round-trip
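The scan and greedy-pick stages of this flow can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: the phrase-length list is a made-up subset (the page does not list the 14 lengths), the net-savings formula and its cost constants are assumed, and the sentinel scheme uses Unicode private-use code points, which cannot occur in a Latin-1 decoded buffer.

```python
from typing import List, Tuple

# Assumption: illustrative subset of lengths in the 3..128 range.
PHRASE_LENGTHS = (3, 4, 6, 8, 12, 16, 32, 64, 128)
TOKEN_COST = 1          # assumed bytes emitted per replaced occurrence
ENTRY_OVERHEAD = 1      # assumed per-entry dictionary overhead in bytes
SENTINEL_BASE = 0xE000  # start of the Unicode private-use area
MAX_ENTRIES = 16 + 512 + 256  # total token slots across the three tiers


def best_candidate(buf: str) -> Tuple[int, str]:
    """Return (net_savings, phrase) for the best candidate, or (0, "") if none pays off."""
    best = (0, "")
    for n in PHRASE_LENGTHS:
        if n > len(buf):
            break
        for phrase in {buf[i:i + n] for i in range(len(buf) - n + 1)}:
            if any(ord(c) >= SENTINEL_BASE for c in phrase):
                continue  # do not span a region that was already replaced
            count = buf.count(phrase)  # str.count counts non-overlapping matches
            net = count * (n - TOKEN_COST) - (n + ENTRY_OVERHEAD)
            if (net, phrase) > best:  # tuple compare gives a deterministic tie-break
                best = (net, phrase)
    return best


def greedy_dictionary(data: bytes) -> Tuple[List[bytes], str]:
    """Accept one phrase at a time, replace it with a sentinel, then rescan."""
    buf = data.decode("latin-1")  # one character per input byte
    accepted: List[bytes] = []
    while len(accepted) < MAX_ENTRIES:
        net, phrase = best_candidate(buf)
        if net <= 0:
            break  # no remaining candidate saves bytes
        buf = buf.replace(phrase, chr(SENTINEL_BASE + len(accepted)))
        accepted.append(phrase.encode("latin-1"))
    return accepted, buf
```

The rescan happens implicitly: every loop iteration calls `best_candidate` on the buffer as it stands after the previous replacement, so counts always reflect what is still left to save. The exhaustive scan here is quadratic or worse and is only meant for KB-scale inputs.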

Technical notes

What V3's encoder is doing. The V3 encoder reads the input as bytes and treats it as a Latin-1 string for fast substring counting (no UTF-8 ambiguity). It scans candidate substrings at 14 phrase lengths (3 through 128 bytes), counts non-overlapping occurrences, and greedily accepts the candidate with the highest net savings. After each acceptance, the substring is replaced in a working buffer with a private-use sentinel and the scan repeats; this is the iterative refinement that V2 lacked. Accepted entries are then ranked by total bytes saved and assigned to tiers: the top 16 get Tier 1 single-byte tokens (bytes 0x01..0x08, 0x0B, 0x0C, 0x0E..0x13, chosen to avoid TAB / LF / CR / NUL); the next 512 get Tier 2 two-byte tokens (0x14 or 0x15 + index); the next 256 get Tier 3 three-byte tokens (0x16 + 16-bit index). The encoder also reserves 0x17 as a literal escape so that any input byte that collides with a token byte can pass through losslessly.
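The tier layout above can be written down as a small lookup. In this sketch the byte values come from the description, but several details are assumptions because the page does not specify them: that the Tier 2 prefix selects which half of the 512 entries is addressed, that the Tier 3 index is big-endian, and that an escaped literal is encoded as 0x17 followed by the original byte. The function names are hypothetical.

```python
# Tier 1: 16 single-byte tokens, skipping NUL (0x00), TAB (0x09), LF (0x0A), CR (0x0D).
TIER1_BYTES = [*range(0x01, 0x09), 0x0B, 0x0C, *range(0x0E, 0x14)]
TIER2_PREFIXES = (0x14, 0x15)  # each prefix is followed by a 1-byte index
TIER3_PREFIX = 0x16            # followed by a 16-bit index
ESCAPE = 0x17                  # literal escape
RESERVED = set(TIER1_BYTES) | set(TIER2_PREFIXES) | {TIER3_PREFIX, ESCAPE}


def token_for_rank(rank: int) -> bytes:
    """Map a dictionary rank (0 = most bytes saved) to its token bytes."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < 16:
        return bytes([TIER1_BYTES[rank]])
    rank -= 16
    if rank < 512:
        # Assumption: prefix 0x14 covers indices 0..255, prefix 0x15 covers 256..511.
        return bytes([TIER2_PREFIXES[rank // 256], rank % 256])
    rank -= 512
    if rank < 256:
        # Assumption: big-endian 16-bit index.
        return bytes([TIER3_PREFIX]) + rank.to_bytes(2, "big")
    raise ValueError("rank exceeds the 16 + 512 + 256 available slots")


def escape_literal(byte: int) -> bytes:
    """Emit an input byte, prefixing it with ESCAPE if it collides with a token byte."""
    return bytes([ESCAPE, byte]) if byte in RESERVED else bytes([byte])
```

Note that 0x17 itself has to be treated as reserved, otherwise a literal 0x17 in the input would be read back as the start of an escape sequence.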

What V3 fixes from V2. V2 ranked candidates once; V3 rescans after every accept, which catches savings that emerge after early picks remove longer common patterns. V2 capped phrase length at 32 bytes; V3 goes to 128, catching long structural repeats in HTML and JSON. V2 used two tiers; V3 adds a third (3-byte) tier for the long tail. V2 picked a single encoding mode per file; V3 runs the encoder twice (dict-only vs. full hybrid) and keeps the smaller output as long as it round-trips. The combined effect is roughly 28-34% reduction vs. raw on these three inputs, compared to V2's 20-23%.
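The adaptive-mode step (encode more than once, keep the smaller output only if it round-trips) can be sketched as below. The selection logic is the point; the codecs in the demo list are stand-ins (identity, zlib, and a deliberately lossy one), not the prototype's dict-only and full-hybrid modes, and all names are hypothetical.

```python
import hashlib
import zlib
from typing import Callable, List, Optional, Tuple

# (mode name, encode function, decode function)
Codec = Tuple[str, Callable[[bytes], bytes], Callable[[bytes], bytes]]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def pick_smallest_passing(data: bytes,
                          codecs: List[Codec]) -> Optional[Tuple[str, bytes]]:
    """Return (mode, encoded) for the smallest output whose round-trip hash matches."""
    expected = sha256_hex(data)
    best: Optional[Tuple[str, bytes]] = None
    for name, encode, decode in codecs:
        encoded = encode(data)
        if sha256_hex(decode(encoded)) != expected:
            continue  # reconstruction FAILED, so this mode's size does not count
        if best is None or len(encoded) < len(best[1]):
            best = (name, encoded)
    return best


# Stand-in codecs for demonstration only.
DEMO_CODECS: List[Codec] = [
    ("identity", lambda d: d, lambda e: e),
    ("zlib", zlib.compress, zlib.decompress),
    ("lossy", lambda d: d[:1], lambda e: e),  # smallest output, but fails round-trip
]

sample = b"hello " * 50
choice = pick_smallest_passing(sample, DEMO_CODECS)
```

The lossy stand-in produces the smallest output but is rejected by the hash check, which is the behavior the PASS requirement is meant to guarantee.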

Honest limitations for file compression (Test 1). V3 still does not match gzip on these inputs. Gzip combines LZ77 + Huffman and is mature; a dictionary-substitution prototype with no entropy coder cannot match it on short English-like inputs. The gzip(OCC V3) column lets the reader see whether OCC V3 made gzip's job easier or harder; in this run it makes gzip's job slightly harder, because V3's substitutions remove some of the redundancy gzip would otherwise have exploited.

What the bandwidth test is and isn't (Test 2). The bandwidth test models a deployment pattern: a SaaS user who pulls the full app over the network on every session vs. a user who installs once and only receives small updates after that. The math is straightforward arithmetic. The savings come from not redownloading the same bytes, not from compressing those bytes more tightly. A 90%+ savings number at high session counts is honest for that sync model, but a reader who takes it to mean "OCC compresses 90%" would be misreading the page. That is why the two tables are kept physically separate and labeled as separate tests.

What this benchmark is not. Not the production OneCharacterCode engine. Not a universal compression claim. Not a comparison against the full modern compressor landscape (zstd / xz / lzma still need to be added). Not a generalization to inputs much larger or differently structured than these three KB-scale samples.