OneCharacterCode Benchmark V2

Optimized Prototype Encoder — reproducible compression and reconstruction test, honestly reported.

Required disclosure. V2 is still a prototype symbolic dictionary encoder, not the final patented OneCharacterCode engine. It improves candidate selection and dictionary-overhead handling but reports all results honestly, including comparisons against gzip and against the V1 prototype. Reconstruction must PASS for any compression number to count.

What changed in V2

Tier-1 tokens (1 byte each) for the top 8 highest-savings entries.
Tier-2 tokens (2 bytes each, ESC 0x0E + index) for the next 256 entries.
Net-gain threshold per entry: skip if (savings_per_use × count) - dict_cost ≤ 0.
Greedy savings-ranked acceptance with overlap rejection; recount in working text after each pick.
Multiple phrase lengths tried in one pass: 3, 4, 5, 6, 8, 10, 12, 16, 24, 32.
Reserved-byte escape (0x0F prefix) so the source can include any byte without confusing the decoder.
gzip-after-OCC measured separately so the reader can see whether the prototype helped or hurt gzip.
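
The net-gain threshold and the greedy, overlap-rejecting selection listed above can be sketched as follows. This is a minimal illustration, not the V2 script: the dictionary cost model (phrase length plus one overhead byte), the helper names `net_gain` and `select_entries`, the use of the prospective tier's token length when scoring, and the modelling of "already replaced" regions by splitting the working text into segments are all assumptions made for the example.

```python
from collections import Counter

LENGTHS = (3, 4, 5, 6, 8, 10, 12, 16, 24, 32)  # phrase lengths listed above
TIER1_SLOTS, TIER2_SLOTS = 8, 256


def net_gain(phrase_len: int, count: int, token_len: int, entry_overhead: int = 1) -> int:
    """(savings_per_use x count) - dict_cost, in bytes."""
    savings_per_use = phrase_len - token_len
    # ASSUMPTION: an entry costs its phrase bytes plus one overhead byte.
    # The real dictionary header layout is not described on this page.
    dict_cost = phrase_len + entry_overhead
    return savings_per_use * count - dict_cost


def select_entries(text: bytes) -> list[bytes]:
    """Greedy pick of the best-scoring phrase, recounting after each pick."""
    segments = [text]  # pieces of the working text not yet covered by a pick
    chosen: list[bytes] = []
    while len(chosen) < TIER1_SLOTS + TIER2_SLOTS:
        # Hypothetical choice: score with the token length of the next free slot.
        token_len = 1 if len(chosen) < TIER1_SLOTS else 2
        counts: Counter = Counter()
        for seg in segments:
            phrases = {seg[i:i + n] for n in LENGTHS for i in range(len(seg) - n + 1)}
            for phrase in phrases:
                counts[phrase] += seg.count(phrase)  # non-overlapping occurrences
        if not counts:
            break
        # Tie-break on the phrase bytes so the result is deterministic.
        best = max(counts, key=lambda p: (net_gain(len(p), counts[p], token_len), p))
        if net_gain(len(best), counts[best], token_len) <= 0:
            break  # nothing profitable remains
        chosen.append(best)
        # Splitting on the pick removes it, so later picks cannot overlap it.
        segments = [part for seg in segments for part in seg.split(best) if part]
    return chosen
```

The scan is quadratic in the input length, which is acceptable for a sketch on short inputs but would need a smarter index for large files.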

Results

The Raw, Gzip (raw), OCC V1, OCC V2, and Gzip (OCC V2) columns are in bytes. *V2 reduction %* is OCC V2 vs. raw; positive means OCC V2 is smaller than the raw input. *V2 vs V1 reduction %* is OCC V2 vs. OCC V1; positive means V2 produced a smaller output than V1. Reconstruction must PASS for any number to count.
Test File	Raw	Gzip (raw)	OCC V1	OCC V2	Gzip (OCC V2)	V2 reduction	V2 vs V1	Reconstruction	SHA-256
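
The two percentage columns follow the same arithmetic. A small sketch, assuming the reduction is computed as the byte difference divided by the baseline size (the function name `reduction_pct` is hypothetical, and the byte counts in the usage lines are made-up illustration values, not benchmark results):

```python
def reduction_pct(baseline_bytes: int, encoded_bytes: int) -> float:
    """Percent reduction relative to the baseline; positive means smaller."""
    if baseline_bytes <= 0:
        raise ValueError("baseline must be a positive byte count")
    return 100.0 * (baseline_bytes - encoded_bytes) / baseline_bytes


# Illustrative numbers only (not taken from the benchmark):
v2_vs_raw = reduction_pct(1000, 700)   # OCC V2 vs. raw
v2_vs_v1 = reduction_pct(1000, 1100)   # negative: the output grew
```

For "V2 reduction %" the baseline is the raw size; for "V2 vs V1 reduction %" the baseline is the OCC V1 size.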

Flow

Raw Input → Candidate Scan (3..32 byte phrases) → Net-gain Filter (only profitable entries) → Greedy Assign (8 tier-1 + 256 tier-2) → Local Reconstruction → Hash Match

Downloads

benchmark-results-v2.json: Machine-readable V2 results
benchmark-test-run-v2.txt: Human-readable V2 run report
run_benchmark_v2.ps1: The V2 benchmark script (PS 5.1 compatible)
README_REPRODUCE_V2.txt: How to reproduce V2 locally
SHA256_MANIFEST_V2.txt: Hashes of every V2 file
← V1 benchmark page (unchanged, original results)

Technical notes

What the prototype is doing. The encoder reads the input as UTF-8 text and looks for recurring substrings at ten target lengths (3, 4, 5, 6, 8, 10, 12, 16, 24, 32 chars). It scores each candidate by how many bytes it would save if assigned a 1-byte token vs. how many bytes the dictionary entry itself costs. It then greedily picks the top scorer that doesn't overlap an already-picked entry, replaces it in the working copy of the text with a private placeholder, and repeats. The top 8 entries get 1-byte tokens (bytes 0x01…0x08). The next 256 get 2-byte tokens (escape byte 0x0E followed by an index). Anything else in the source passes through as its original UTF-8 bytes. A 0x0F escape is reserved so the encoder can losslessly handle source bytes that would otherwise collide with token bytes.
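
The token layout described above (1-byte tier-1 tokens 0x01 to 0x08, a 0x0E escape plus index for tier-2, and a 0x0F escape for colliding literals) can be illustrated with a small encode/decode pair. This is a sketch under assumptions: it takes an already-chosen dictionary, uses longest-match substitution at each position, and leaves out how the dictionary itself is serialized, since the page does not describe the header. The names `encode` and `decode` are hypothetical.

```python
TIER1_FIRST, TIER1_LAST = 0x01, 0x08
TIER2_ESC = 0x0E
LITERAL_ESC = 0x0F
RESERVED = set(range(TIER1_FIRST, TIER1_LAST + 1)) | {TIER2_ESC, LITERAL_ESC}


def encode(data: bytes, tier1: list[bytes], tier2: list[bytes]) -> bytes:
    """Substitute dictionary phrases with tokens; escape reserved literal bytes."""
    assert len(tier1) <= 8 and len(tier2) <= 256
    entries = [(p, bytes([TIER1_FIRST + i])) for i, p in enumerate(tier1)]
    entries += [(p, bytes([TIER2_ESC, i])) for i, p in enumerate(tier2)]
    entries.sort(key=lambda e: len(e[0]), reverse=True)  # try longest phrases first
    out = bytearray()
    pos = 0
    while pos < len(data):
        for phrase, token in entries:
            if phrase and data.startswith(phrase, pos):
                out += token
                pos += len(phrase)
                break
        else:
            byte = data[pos]
            if byte in RESERVED:
                out.append(LITERAL_ESC)  # mark the next byte as a plain literal
            out.append(byte)
            pos += 1
    return bytes(out)


def decode(stream: bytes, tier1: list[bytes], tier2: list[bytes]) -> bytes:
    """Invert encode(): expand tokens and drop literal escapes."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        byte = stream[pos]
        if TIER1_FIRST <= byte <= TIER1_LAST:
            out += tier1[byte - TIER1_FIRST]
            pos += 1
        elif byte == TIER2_ESC:
            out += tier2[stream[pos + 1]]  # the index byte is never re-interpreted
            pos += 2
        elif byte == LITERAL_ESC:
            out.append(stream[pos + 1])
            pos += 2
        else:
            out.append(byte)
            pos += 1
    return bytes(out)
```

Because the byte after 0x0E or 0x0F is always consumed as an index or a literal, the decoder never confuses source bytes with tokens, which is the property the reserved escape is meant to provide.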

What V2 fixes from V1. V1 used a single 3-byte Unicode private-use token per entry. The break-even on V1 required 5+ recurrences of a 5+ byte string. V2's 1-byte and 2-byte tokens dramatically lower that break-even, and the net-gain threshold prevents the dictionary header from outweighing the in-body savings. The result for these three inputs is roughly 30–39% smaller carriers than V1 produced. Reconstruction still passes SHA-256 round-trip on every input.
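
The effect of token size on break-even can be checked with a few lines of arithmetic. The sketch below finds the smallest use count at which savings exceed the dictionary cost. The 8-byte dictionary cost in the usage lines is a hypothetical figure chosen only for illustration, not the actual per-entry cost of either prototype, and `min_uses_for_gain` is an invented helper name.

```python
from typing import Optional


def min_uses_for_gain(phrase_len: int, token_len: int, dict_cost: int) -> Optional[int]:
    """Smallest count with (phrase_len - token_len) * count - dict_cost > 0."""
    savings_per_use = phrase_len - token_len
    if savings_per_use <= 0:
        return None  # the token is no shorter than the phrase; never profitable
    return dict_cost // savings_per_use + 1


# Hypothetical 5-byte phrase with an assumed 8-byte dictionary cost:
uses_3_byte_token = min_uses_for_gain(5, 3, 8)  # V1-style 3-byte token
uses_2_byte_token = min_uses_for_gain(5, 2, 8)  # V2 tier-2 token
uses_1_byte_token = min_uses_for_gain(5, 1, 8)  # V2 tier-1 token
```

Under these assumed costs the 3-byte token needs noticeably more recurrences than either V2 token before an entry pays for itself, which is the direction of the change described above.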

Honest limitations. V2 still does not match standard gzip on these inputs. Gzip combines an LZ77 sliding window with Huffman coding and is mature; a dictionary-substitution prototype isn't expected to match it on short, English-like inputs. The gzip(OCC V2) column lets the reader see whether OCC made gzip's job easier or harder — in this run it makes gzip's job slightly harder, because the prototype's substitutions remove some of the redundancy that gzip would have exploited.
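
The gzip(OCC V2) comparison can be reproduced in spirit with Python's standard `gzip` module. This is a sketch, not the benchmark script: the compression level, the zeroed timestamp, and the helper names are assumptions, and byte counts from Python's gzip may differ from those produced by the PowerShell script's gzip implementation.

```python
import gzip


def gzip_size(data: bytes) -> int:
    """Size in bytes of the gzip-compressed data (level 9, fixed mtime)."""
    # mtime=0 keeps the header constant so repeated runs give identical bytes.
    return len(gzip.compress(data, compresslevel=9, mtime=0))


def gzip_effect(raw: bytes, occ_encoded: bytes) -> dict:
    """Compare gzip on the raw input with gzip on the OCC-encoded output."""
    gz_raw = gzip_size(raw)
    gz_occ = gzip_size(occ_encoded)
    return {
        "gzip_raw": gz_raw,
        "gzip_occ": gz_occ,
        "occ_helped_gzip": gz_occ < gz_raw,  # False means OCC made gzip's job harder
    }
```

Passing the raw file bytes and the OCC V2 output bytes to `gzip_effect` gives the two numbers needed to tell whether the substitution step helped or hurt gzip.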

What this benchmark is for. Two things. First, an honest reproducible measurement: any reader with PowerShell can rerun the V2 script on the same inputs and get the same byte counts and SHA-256 hashes. Second, a public diff between V1 and V2 of the prototype, so the next iteration can be measured against the same inputs the same way.
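
The reconstruction gate can be expressed as a SHA-256 comparison between the original input and the decoded output. A minimal sketch using Python's `hashlib`; the function names are hypothetical and the PASS/FAIL strings simply mirror the wording of the Reconstruction column.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def reconstruction_status(original: bytes, reconstructed: bytes) -> str:
    """PASS only when the decoded output hashes identically to the input."""
    return "PASS" if sha256_hex(original) == sha256_hex(reconstructed) else "FAIL"
```

A run would report compression numbers for a file only when `reconstruction_status` returns "PASS" for it.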

What this benchmark is not. Not a comparison against the production OneCharacterCode engine. Not a claim of universal compression superiority. Not a result that should be generalized to inputs much larger or much different in structure than these three samples.