OneCharacterCode Benchmark V4
Persistent dictionary, delta updates, system-level data savings — three separate claims, reported separately.
Each of the three tables on this page supports a different claim. A row's numbers count only if reconstruction (SHA-256 round-trip) reports PASS for that row.
1. What V4 tests
- Mode A — Standalone file compression. Per-file V4 encoder vs gzip. Cumulative bytes per file set.
- Mode B — Persistent shared dictionary. Build one dictionary from a training subset, install it once, then ship later files as body-only symbolic packets. The dictionary is not re-transmitted. Compares cumulative wire bytes against cumulative raw-download cost.
- Mode C — Delta updates. For versioned files, transmit only the changed middle bytes between common prefix and common suffix, inside an OCC carrier. Receiver applies the delta locally and verifies SHA-256.
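The Mode C description above can be sketched in a few lines of Python. This is a minimal illustration of a common-prefix / common-suffix delta with a SHA-256 check, not the V4 implementation: the names `make_delta` and `apply_delta` and the dictionary layout are hypothetical, and the OCC carrier that would wrap the middle bytes is not modeled.

```python
import hashlib


def make_delta(old: bytes, new: bytes) -> dict:
    """Describe `new` as: shared prefix + changed middle + shared suffix of `old`."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # The suffix may not overlap the prefix in either version.
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return {
        "prefix_len": prefix,
        "suffix_len": suffix,
        "middle": new[prefix:len(new) - suffix],   # only these bytes are sent
        "sha256": hashlib.sha256(new).hexdigest(),  # lets the receiver verify
    }


def apply_delta(old: bytes, delta: dict) -> bytes:
    """Rebuild the new version locally and verify it against the shipped hash."""
    p, s = delta["prefix_len"], delta["suffix_len"]
    rebuilt = old[:p] + delta["middle"] + old[len(old) - s:]
    if hashlib.sha256(rebuilt).hexdigest() != delta["sha256"]:
        raise ValueError("reconstruction FAIL: SHA-256 mismatch")
    return rebuilt


if __name__ == "__main__":
    v1 = b'{"v":1,"name":"app"}'
    v2 = b'{"v":2,"name":"app"}'
    delta = make_delta(v1, v2)
    print(len(delta["middle"]), "middle byte(s) for a", len(v2), "byte file")
    assert apply_delta(v1, delta) == v2
```

The overlap guard in the suffix loop matters for inputs such as `b"aaa"` to `b"aaaa"`, where an unguarded prefix and suffix would both claim the same bytes.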
2. Standalone compression results (Mode A)
Cumulative bytes per set. OCC V4 may not beat gzip here; this table makes the same kind of comparison as the V1/V2/V3 pages.
| File Set | Files | Raw Bytes | Gzip(raw) Bytes | OCC V4 Bytes | Winner | OCC % | Reconstruction |
|---|---|---|---|---|---|---|---|
| Loading standalone results… | | | | | | | |
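As a rough sketch of how one row of the Mode A table could be computed: cumulative raw, gzip, and encoder bytes over a file set, with the smaller total named as winner. The `encode` argument is a hypothetical stand-in for the V4 encoder, which is not shown on this page. The "OCC %" column is left out because the page does not define it.

```python
import gzip
from typing import Callable, Sequence


def mode_a_row(files: Sequence[bytes],
               encode: Callable[[bytes], bytes]) -> dict:
    """Cumulative byte counts for one file set (each file encoded on its own)."""
    raw_bytes = sum(len(f) for f in files)
    gzip_bytes = sum(len(gzip.compress(f)) for f in files)
    occ_bytes = sum(len(encode(f)) for f in files)  # hypothetical encoder
    if occ_bytes < gzip_bytes:
        winner = "OCC V4"
    elif gzip_bytes < occ_bytes:
        winner = "gzip"
    else:
        winner = "tie"
    return {
        "files": len(files),
        "raw_bytes": raw_bytes,
        "gzip_bytes": gzip_bytes,
        "occ_bytes": occ_bytes,
        "winner": winner,
    }


if __name__ == "__main__":
    # Placeholder encoder that returns its input unchanged; it only
    # exercises the accounting and says nothing about real OCC output.
    sample = [b"log line frame " * 50, b'{"schema": "v1"}' * 40]
    print(mode_a_row(sample, encode=lambda data: data))
```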
3. Persistent dictionary results (Mode B)
Wire bytes (sender side) = one-time install + sum of per-file packets. Baseline = sum of raw file bytes pulled on every session.
| File Set | Files | Normal Full Download (bytes) | Initial Install + Symbolic Packets (bytes) | Bytes Saved | Percent Saved | Reconstruction |
|---|---|---|---|---|---|---|
| Loading persistent-dictionary results… | | | | | | |
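The Mode B accounting defined above (wire bytes = one-time install + sum of per-file packets, against a baseline of summed raw file bytes) can be written out as follows. The function name and the numbers in the usage example are made up for illustration and are not benchmark results.

```python
from typing import Sequence


def mode_b_row(raw_sizes: Sequence[int],
               install_bytes: int,
               packet_sizes: Sequence[int]) -> dict:
    """Cumulative Mode B accounting for one file set."""
    baseline = sum(raw_sizes)                  # normal full download
    wire = install_bytes + sum(packet_sizes)   # install once, then packets
    saved = baseline - wire                    # negative if the install dominates
    percent = 100.0 * saved / baseline if baseline else 0.0
    return {
        "baseline_bytes": baseline,
        "wire_bytes": wire,
        "bytes_saved": saved,
        "percent_saved": percent,
    }


if __name__ == "__main__":
    # Illustrative numbers only: three 1000-byte files, a 500-byte
    # dictionary install, and 100-byte packets.
    print(mode_b_row([1000, 1000, 1000], 500, [100, 100, 100]))
    # With a single file the install cost has not been amortized yet.
    print(mode_b_row([1000], 1500, [100]))
```

Because the install cost is paid once, the percentage depends on how many files follow it. That is why the figure describes the sync model and not any single file.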
4. Delta update results (Mode C)
| File Set | Versions | Full Download (bytes) | Delta Sync (bytes) | Bytes Saved | Percent Saved | Reconstruction |
|---|---|---|---|---|---|---|
| Loading delta-update results… | | | | | | |
5. Reconstruction verification
Every row in the three tables above is gated by SHA-256 reconstruction. For every transmitted carrier, packet, or delta, the receiver decodes back to bytes and the hash is compared to the original. If reconstruction FAILs, the byte counts are reported but the row does not constitute a claim.
- Mode A: per-file self-contained OCC4 carrier → decode → SHA-256 match.
- Mode B: install dictionary, OCC4P packet → decode using installed dict → SHA-256 match per file.
- Mode C: previous version + delta → apply delta + decode middle → SHA-256 match against the new version.
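A minimal sketch of the gate described above: decode whatever was transmitted, hash the result, and compare it with the hash of the original. The `decode` callable is a hypothetical placeholder for the relevant OCC decoder. The usage example substitutes `zlib` purely so the snippet runs on its own.

```python
import hashlib
import zlib
from typing import Callable


def reconstruction_status(original: bytes,
                          transmitted: bytes,
                          decode: Callable[[bytes], bytes]) -> str:
    """Return "PASS" only if decoding reproduces the original byte-for-byte."""
    try:
        rebuilt = decode(transmitted)
    except Exception:
        # A decoder error counts as a failed reconstruction.
        return "FAIL"
    same = hashlib.sha256(rebuilt).digest() == hashlib.sha256(original).digest()
    return "PASS" if same else "FAIL"


if __name__ == "__main__":
    data = b"agent state envelope " * 20
    packet = zlib.compress(data)  # stand-in codec, not OCC
    print(reconstruction_status(data, packet, zlib.decompress))
    print(reconstruction_status(data, packet[:-4], zlib.decompress))
```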
6. Why this matters for offline apps
The honest finding from V1, V2, V3, and the stacked test was that on short, single-file inputs, gzip's LZ77 + Huffman pipeline already extracts most of the redundancy. A dictionary-substitution prototype can match gzip, or improve on it slightly, in cumulative numbers, but it cannot generally beat gzip on per-file size. That is still true in V4 Mode A.
The much stronger claim — and the reason OneCharacterCode is built the way it is — is what Modes B and C measure. Once a device has installed the shared dictionary, it does not need to re-receive that dictionary on subsequent transfers. Repeated structural payloads (UI chrome, JSON schemas, log line frames, agent state envelopes) can ship as small symbolic packets, and versioned files can ship as deltas. The cumulative wire savings across many sessions can be very high even though no individual file shrinks dramatically.
This is also why these tables are kept physically separate. “X percent saved” in Mode B is a sync-model result, not a compression ratio for any one file, and should not be read as one.
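The effect of a dictionary that is installed once and never re-sent can be seen with a different, well-known mechanism: the preset dictionary (`zdict`) support in Python's `zlib`. This is an analogy only. It is DEFLATE with a preset dictionary, not OneCharacterCode symbolic packets, and the dictionary and message below are invented examples.

```python
import zlib


def encode_packet(data: bytes, shared_dict: bytes) -> bytes:
    """Compress `data` against a dictionary both sides already hold."""
    comp = zlib.compressobj(level=9, zdict=shared_dict)
    return comp.compress(data) + comp.flush()


def decode_packet(packet: bytes, shared_dict: bytes) -> bytes:
    """Decompress using the same dictionary; it is never sent with the packet."""
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(packet) + decomp.flush()


if __name__ == "__main__":
    # Invented structural payload of the kind a trained dictionary might hold.
    shared = b'{"type":"agent_state","version":1,"status":"ok","payload":{"id":0}}'
    message = b'{"type":"agent_state","version":1,"status":"ok","payload":{"id":42}}'

    with_dict = encode_packet(message, shared)
    without_dict = zlib.compress(message, 9)

    assert decode_packet(with_dict, shared) == message
    print("raw:", len(message),
          "zlib:", len(without_dict),
          "zlib + preset dict:", len(with_dict))
```

The short message barely compresses on its own, but most of it can be expressed as a reference into the preinstalled dictionary. That is the same amortization argument Mode B makes at the sync level.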
7. Honest limitations
- Mode A standalone compression: still a prototype encoder with no entropy coder downstream. Gzip is mature and usually wins on short inputs.
- Mode B persistent dictionary: the savings depend on how much cross-file redundancy the training corpus actually shares with the later corpus. Heterogeneous corpora will show much smaller gains than the structured ones generated here.
- Mode C delta updates: the prefix/suffix delta scheme is intentionally simple. Sophisticated delta encoders (xdelta, bsdiff) would do better on files where the change is in the middle of repeated regions.
- All inputs here are either synthetic (produced by deterministic generators) or the V3 sample inputs. Real-world corpora will differ.
- Only gzip was used as the raw-side baseline. zstd / xz / lzma still need to be compared.
- The encoder is the V4 prototype, not the production OneCharacterCode engine.
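To make the Mode C limitation concrete, here is one case where a prefix/suffix scheme sends far more than the edit itself: two small changes far apart. The helper `middle_len` is a hypothetical name and the inputs are contrived. It mirrors the scheme as described on this page, not the V4 code.

```python
def middle_len(old: bytes, new: bytes) -> int:
    """Bytes a common-prefix / common-suffix delta would have to transmit."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return len(new) - prefix - suffix


if __name__ == "__main__":
    body = b"x" * 1000

    # One edit in the middle: the delta is as small as the edit.
    one_edit = middle_len(body[:500] + b"A" + body[500:],
                          body[:500] + b"B" + body[500:])

    # Two one-byte edits at opposite ends: no shared prefix or suffix
    # remains, so the whole file lands in the "middle".
    two_edits = middle_len(b"A" + body + b"B", b"C" + body + b"D")

    print("single edit:", one_edit, "bytes; two distant edits:", two_edits, "bytes")
```

A general-purpose delta encoder would be expected to encode the unchanged run between the two edits as a copy instead of resending it.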