OneCharacterCode Benchmark V2

Optimized Prototype Encoder — reproducible compression and reconstruction test, honestly reported.

Required disclosure. V2 is still a prototype symbolic dictionary encoder, not the final patented OneCharacterCode engine. It improves candidate selection and dictionary-overhead handling over V1. All results are reported as measured, including comparisons against gzip and against the V1 prototype. A compression number counts only if reconstruction passes.

What changed in V2

Results

Raw, Gzip, OCC V1, OCC V2, and gzip(OCC V2) are sizes in bytes. V2 reduction % compares OCC V2 against raw; positive means OCC V2 is smaller than the raw input. V2 vs V1 reduction % compares OCC V2 against OCC V1; positive means V2 produced a smaller output than V1. Reconstruction must PASS for any number to count.
Test File | Raw | Gzip (raw) | OCC V1 | OCC V2 | Gzip (OCC V2) | V2 reduction | V2 vs V1 | Reconstruction | SHA-256

Flow

1. Raw Input
2. Candidate Scan: 3..32 byte phrases
3. Net-gain Filter: only profitable entries
4. Greedy Assign: 8 tier-1 + 256 tier-2
5. Local Reconstruction
6. Hash Match

Technical notes

What the prototype is doing. The encoder reads the input as UTF-8 text and looks for recurring substrings at ten target lengths (3, 4, 5, 6, 8, 10, 12, 16, 24, 32 chars). It scores each candidate by how many bytes it would save if assigned a 1-byte token vs. how many bytes the dictionary entry itself costs. It then greedily picks the top scorer that doesn't overlap an already-picked entry, replaces it in the working copy of the text with a private placeholder, and repeats. The top 8 entries get 1-byte tokens (bytes 0x01…0x08). The next 256 get 2-byte tokens (escape byte 0x0E followed by an index). Anything else in the source passes through as its original UTF-8 bytes. A 0x0F escape is reserved so the encoder can losslessly handle source bytes that would otherwise collide with token bytes.
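The steps in this paragraph can be sketched in Python. This is a minimal illustration, not the prototype's script. Several things here are assumptions: it works on raw bytes rather than decoded characters, the per-entry dictionary cost is taken as the phrase length plus a hypothetical `entry_overhead` of 1 byte, tier-2 indices start at 0, and all function names are invented for the example. Splitting the working text on each chosen phrase stands in for the private placeholder, so later picks cannot overlap earlier ones.

```python
TARGET_LENGTHS = (3, 4, 5, 6, 8, 10, 12, 16, 24, 32)
TIER1_FIRST, TIER1_COUNT = 0x01, 8       # 1-byte tokens 0x01..0x08
TIER2_ESCAPE, TIER2_COUNT = 0x0E, 256    # 2-byte tokens: 0x0E + index byte
LITERAL_ESCAPE = 0x0F                    # prefix for source bytes that collide
RESERVED = set(range(TIER1_FIRST, TIER1_FIRST + TIER1_COUNT)) | {TIER2_ESCAPE, LITERAL_ESCAPE}


def net_gain(phrase_len, count, token_len, entry_overhead=1):
    """Body bytes saved minus the assumed cost of storing the dictionary entry."""
    return count * (phrase_len - token_len) - (phrase_len + entry_overhead)


def build_dictionary(data):
    """Greedily pick the best-scoring phrase until nothing is profitable."""
    segments, chosen = [data], []
    while len(chosen) < TIER1_COUNT + TIER2_COUNT:
        token_len = 1 if len(chosen) < TIER1_COUNT else 2
        candidates = {seg[i:i + n] for seg in segments for n in TARGET_LENGTHS
                      for i in range(len(seg) - n + 1)}
        # bytes.count() counts non-overlapping occurrences, matching split() below.
        scored = [(net_gain(len(p), sum(s.count(p) for s in segments), token_len), p)
                  for p in candidates]
        gain, best = max(scored, default=(0, b""))
        if gain <= 0:
            break
        chosen.append(best)
        # Remove the chosen phrase from the working text (the "placeholder" step).
        segments = [part for seg in segments for part in seg.split(best) if part]
    return chosen


def token_for(index):
    if index < TIER1_COUNT:
        return bytes([TIER1_FIRST + index])
    return bytes([TIER2_ESCAPE, index - TIER1_COUNT])


def encode_body(data, phrases):
    pieces = [(False, data)]             # (is_token, bytes), in document order
    for index, phrase in enumerate(phrases):
        next_pieces = []
        for is_token, chunk in pieces:
            if is_token:
                next_pieces.append((True, chunk))
                continue
            for k, part in enumerate(chunk.split(phrase)):
                if k:
                    next_pieces.append((True, token_for(index)))
                if part:
                    next_pieces.append((False, part))
        pieces = next_pieces
    out = bytearray()
    for is_token, chunk in pieces:
        if is_token:
            out += chunk
            continue
        for b in chunk:
            if b in RESERVED:            # literal byte that looks like a token
                out.append(LITERAL_ESCAPE)
            out.append(b)
    return bytes(out)


def decode_body(body, phrases):
    out, i = bytearray(), 0
    while i < len(body):
        b = body[i]
        if b == LITERAL_ESCAPE:
            out.append(body[i + 1])
            i += 2
        elif b == TIER2_ESCAPE:
            out += phrases[TIER1_COUNT + body[i + 1]]
            i += 2
        elif TIER1_FIRST <= b < TIER1_FIRST + TIER1_COUNT:
            out += phrases[b - TIER1_FIRST]
            i += 1
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Calling `phrases = build_dictionary(data)` and then `decode_body(encode_body(data, phrases), phrases)` should return `data` unchanged. The sketch leaves out serializing the dictionary header, so its body sizes are not comparable to the byte counts in the results table.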

What V2 fixes from V1. V1 used a single 3-byte Unicode private-use token per entry, so an entry only broke even with 5+ recurrences of a 5+ byte string. V2's 1-byte and 2-byte tokens lower that break-even, and the net-gain threshold keeps the dictionary header from outweighing the in-body savings. For these three inputs, V2's carriers are roughly 30–39% smaller than V1's. Reconstruction still passes the SHA-256 round-trip check on every input.
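The effect of token width on break-even can be worked out with a small helper. The sketch below assumes an entry costs its own length plus a hypothetical `entry_overhead` in the dictionary header. The prototype's real header layout is not described here, so the absolute thresholds may differ from the figures quoted above. Only the direction of the change is the point.

```python
def break_even_occurrences(phrase_len, token_len, entry_overhead=1):
    """Smallest occurrence count that gives a strictly positive net gain.

    Returns None when the token is not shorter than the phrase, since no
    number of occurrences can then pay for the dictionary entry.
    """
    saved_per_use = phrase_len - token_len
    if saved_per_use <= 0:
        return None
    dictionary_cost = phrase_len + entry_overhead  # assumed cost model
    # Need n * saved_per_use > dictionary_cost, so n = floor(cost / saved) + 1.
    return dictionary_cost // saved_per_use + 1


if __name__ == "__main__":
    for phrase_len in (3, 5, 8, 16):
        print(
            phrase_len,
            break_even_occurrences(phrase_len, token_len=3),  # V1-style token
            break_even_occurrences(phrase_len, token_len=2),  # V2 tier-2 token
            break_even_occurrences(phrase_len, token_len=1),  # V2 tier-1 token
        )
```

Under this cost model a 3-byte phrase can never pay off with a 3-byte token, while a 1-byte token makes it profitable after a few repeats.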

Honest limitations. V2 still does not match standard gzip on these inputs. Gzip combines LZ77 sliding-window matching with Huffman coding and is a mature format; a dictionary-substitution prototype isn't expected to match it on short, English-like inputs. The gzip(OCC V2) column lets the reader see whether OCC made gzip's job easier or harder. In this run it makes gzip's job slightly harder, because the prototype's substitutions remove some of the redundancy that gzip would have exploited.
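The gzip columns can be reproduced with Python's standard `gzip` module. This is a sketch for checking the comparison independently, not the benchmark's own script. The helper names are invented, and the compression level is an assumption, so sizes may differ by a few bytes from a run that uses other settings.

```python
import gzip


def gzip_size(data: bytes) -> int:
    # mtime=0 makes the output bytes identical across runs; it does not
    # change the size, since the timestamp field has a fixed width.
    return len(gzip.compress(data, compresslevel=9, mtime=0))


def gzip_effect(raw: bytes, occ_encoded: bytes) -> int:
    """Positive means gzip(OCC) is larger than gzip(raw): OCC made gzip's job harder."""
    return gzip_size(occ_encoded) - gzip_size(raw)
```

Here `raw` is the original file content and `occ_encoded` is the full OCC V2 output, dictionary header included.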

What this benchmark is for. Two things. First, an honest, reproducible measurement: any reader with PowerShell can rerun the V2 script on the same inputs and get the same byte counts and SHA-256 hashes. Second, a public comparison of the V1 and V2 prototypes, so the next iteration can be measured against the same inputs in the same way.
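For readers who want to cross-check a run outside PowerShell, the pass/fail rule and the reduction percentages can be sketched with Python's standard `hashlib`. The function names are illustrative, not taken from the benchmark script.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def reconstruction_status(original: bytes, reconstructed: bytes) -> str:
    """PASS only when the decoded output hashes to the same SHA-256 as the input."""
    return "PASS" if sha256_hex(original) == sha256_hex(reconstructed) else "FAIL"


def reduction_percent(baseline_size: int, new_size: int) -> float:
    """Positive means new_size is smaller than baseline_size."""
    return 100.0 * (baseline_size - new_size) / baseline_size
```

`reduction_percent(raw_size, occ_v2_size)` corresponds to the V2 reduction column, and `reduction_percent(occ_v1_size, occ_v2_size)` to the V2 vs V1 column.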

What this benchmark is not. Not a comparison against the production OneCharacterCode engine. Not a claim of universal compression superiority. Not a result that should be generalized to inputs much larger or much different in structure than these three samples.