OneCharacterCode benchmark V2 - test run report
=================================================

Started   : 2026-05-11T12:45:11
Finished  : 2026-05-11T12:45:18
Duration  : 6.63 seconds
Machine   : LIMITLESS
PowerShell: 5.1.26100.8115

Command run:
  powershell -File run_benchmark_v2.ps1

Optimizations:
  - Tier 1 = 1-byte tokens for top 8 highest-savings entries
  - Tier 2 = 2-byte tokens (ESC 0x0E + index) for next 256 entries
  - Net-gain threshold per entry: skip if (savings_per_use * count) - dict_cost <= 0
  - Greedy savings-ranked acceptance with overlap rejection
  - Multiple phrase lengths: 32, 24, 16, 12, 10, 8, 6, 5, 4, 3
  - Reserved-byte escape (0x0F prefix) for source bytes in 0x01-0x08, 0x0E, 0x0F
  - gzip(OCC V2) measured separately so the reader can see whether OCC helped gzip
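
  The token layout listed above can be sketched as follows. This is an
  illustration only, not the harness code: it assumes Tier 1 tokens are
  the bytes 0x01-0x08 (inferred from the reserved-byte list), that the
  encoder takes the longest matching phrase at each position, and that
  the dictionary is passed out of band. The names encode and decode are
  hypothetical.

```python
from typing import Sequence

ESC_TIER2 = 0x0E      # prefix for 2-byte tokens: ESC_TIER2 + index
ESC_LITERAL = 0x0F    # prefix for a source byte that collides with a token
RESERVED = set(range(0x01, 0x09)) | {ESC_TIER2, ESC_LITERAL}


def encode(data: bytes, tier1: Sequence[bytes], tier2: Sequence[bytes]) -> bytes:
    """Replace dictionary phrases with tokens; escape reserved source bytes."""
    tokens = {}
    for i, phrase in enumerate(tier1[:8]):          # assumed: 0x01..0x08
        tokens[phrase] = bytes([0x01 + i])
    for i, phrase in enumerate(tier2[:256]):
        tokens.setdefault(phrase, bytes([ESC_TIER2, i]))
    longest_first = sorted(tokens, key=len, reverse=True)

    out = bytearray()
    pos = 0
    while pos < len(data):
        for phrase in longest_first:
            if phrase and data.startswith(phrase, pos):
                out += tokens[phrase]
                pos += len(phrase)
                break
        else:
            byte = data[pos]
            if byte in RESERVED:
                out.append(ESC_LITERAL)             # literal follows the escape
            out.append(byte)
            pos += 1
    return bytes(out)


def decode(blob: bytes, tier1: Sequence[bytes], tier2: Sequence[bytes]) -> bytes:
    """Invert encode() given the same dictionary."""
    out = bytearray()
    i = 0
    while i < len(blob):
        byte = blob[i]
        if byte == ESC_LITERAL:
            out.append(blob[i + 1])
            i += 2
        elif byte == ESC_TIER2:
            out += tier2[blob[i + 1]]
            i += 2
        elif 0x01 <= byte <= 0x08:
            out += tier1[byte - 1]
            i += 1
        else:
            out.append(byte)
            i += 1
    return bytes(out)
```

  Under these assumptions a reserved source byte costs two output bytes,
  which is why the escape only pays off when such bytes are rare in the
  input.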

Inputs tested:
  JSON_AGENT_SAMPLE.json  (5,458 bytes)
  SIMPLE_HTML_SAMPLE.html  (6,904 bytes)
  TEXT_ARTICLE_SAMPLE.txt  (8,824 bytes)

Results table (sizes in bytes; V2 red% = reduction of OCCv2 vs Raw):
  File                                  Raw     Gzip    OCCv1    OCCv2   gz(v2)  Recon  V2 red%
  ----------------------------------------------------------------------------------------------
  JSON_AGENT_SAMPLE.json              5,458    2,529    6,188    4,324    2,774   PASS   20.78%
  SIMPLE_HTML_SAMPLE.html             6,904    2,929    8,738    5,359    3,315   PASS   22.38%
  TEXT_ARTICLE_SAMPLE.txt             8,824    3,529   10,714    6,800    4,000   PASS   22.94%

V2 vs V1 (delta = size reduction relative to V1; positive number means V2 is smaller than V1, i.e. better):
  JSON_AGENT_SAMPLE.json            v1=   6,188  v2=   4,324  delta=30.12%
  SIMPLE_HTML_SAMPLE.html           v1=   8,738  v2=   5,359  delta=38.67%
  TEXT_ARTICLE_SAMPLE.txt           v1=  10,714  v2=   6,800  delta=36.53%
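
  The percentages in the results table and in this section are consistent
  with a plain relative-reduction formula, (baseline - candidate) /
  baseline. The sketch below recomputes them from the byte counts reported
  above; the function name reduction_pct is hypothetical and the harness
  may compute the figures differently.

```python
def reduction_pct(baseline: int, candidate: int) -> float:
    """Percentage by which candidate is smaller than baseline."""
    return (baseline - candidate) / baseline * 100.0


# (raw, OCCv1, OCCv2) byte counts copied from the results table.
ROWS = {
    "JSON_AGENT_SAMPLE.json": (5458, 6188, 4324),
    "SIMPLE_HTML_SAMPLE.html": (6904, 8738, 5359),
    "TEXT_ARTICLE_SAMPLE.txt": (8824, 10714, 6800),
}

for name, (raw, v1, v2) in ROWS.items():
    print(f"{name:<26} v2 vs raw={reduction_pct(raw, v2):6.2f}%  "
          f"v2 vs v1={reduction_pct(v1, v2):6.2f}%")
```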

Reconstruction status (SHA-256 round-trip):
  JSON_AGENT_SAMPLE.json            PASS  (raw=0dd8cf66dc4a8e96...  recon=0dd8cf66dc4a8e96...)
  SIMPLE_HTML_SAMPLE.html           PASS  (raw=c26af82d440daab5...  recon=c26af82d440daab5...)
  TEXT_ARTICLE_SAMPLE.txt           PASS  (raw=f410ca5401c070c2...  recon=f410ca5401c070c2...)
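
  A round-trip check of this kind can be sketched as below: hash the
  original bytes and the decoded bytes, then compare the digests. The
  function names are hypothetical, and the 16-character truncation simply
  mirrors the display format used in this report.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Full SHA-256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def roundtrip_status(raw: bytes, recon: bytes) -> str:
    """Compare full digests; show only the first 16 hex characters."""
    raw_hash = sha256_hex(raw)
    recon_hash = sha256_hex(recon)
    status = "PASS" if raw_hash == recon_hash else "FAIL"
    return f"{status}  (raw={raw_hash[:16]}...  recon={recon_hash[:16]}...)"
```

  The comparison uses the full 64-character digests; only the printed
  form is shortened.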

Limitations:
  - V2 is still a prototype symbolic dictionary encoder, NOT the final
    patented OneCharacterCode engine. Honest results only.
  - On short KB-scale inputs, gzip and Brotli are mature and very hard to
    beat. A simple dictionary-substitution prototype - even an improved
    one - typically will not match them. Results are reported as-is.
  - V2 improves on V1 in candidate selection, dictionary overhead, and
    token width. The improvement is visible in the "V2 vs V1" section.
    Compare against the Gzip column to see the gap to standard compression.
  - gzip(OCC V2) is included so the reader can see whether the prototype's
    output is more or less compressible by gzip than the raw input. In this
    run gz(v2) is larger than Gzip for all three inputs.
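
  The gzip(raw) versus gzip(OCC V2) comparison can be sketched as below.
  The report does not state which gzip implementation or level the
  harness uses, so Python's gzip module at level 9 is an assumption and
  its byte counts will not necessarily match the table. In this run the
  table gives gz(v2) minus Gzip as +245, +386 and +471 bytes.

```python
import gzip


def gzip_size(data: bytes, level: int = 9) -> int:
    """Size in bytes of data after gzip; mtime=0 keeps output deterministic."""
    return len(gzip.compress(data, compresslevel=level, mtime=0))


def gzip_effect(raw: bytes, encoded: bytes) -> int:
    """gzip(encoded) - gzip(raw); a positive value means encoding hurt gzip."""
    return gzip_size(encoded) - gzip_size(raw)
```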

Next engine improvements (recommended order):
  - Iterative refinement: after each accepted entry, recompute candidate
    counts in the working text and re-rank the remaining candidates.
  - Variable-width tokens (3 tiers): single-byte for the top, two-byte for
    the middle, three-byte for the long tail.
  - Structural pattern templates: HTML tag opens/closes, JSON key:value
    headers, common punctuation runs, indentation runs.
  - Replace the prototype with the production OneCharacterCode engine and
    rerun the same harness for an apples-to-apples comparison.
  - Independent third-party reproduction on the same inputs.
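
  The first item, iterative refinement, can be sketched together with the
  net-gain rule from the Optimizations list. This is an unoptimized
  illustration: the dictionary cost model (phrase length plus one byte of
  overhead), the tie-breaking rule, and all function names are
  assumptions, not the engine's actual design.

```python
from typing import List

PHRASE_LENGTHS = (32, 24, 16, 12, 10, 8, 6, 5, 4, 3)


def net_gain(phrase_len: int, count: int, token_width: int,
             entry_overhead: int = 1) -> int:
    """(savings_per_use * count) - dict_cost, with an assumed cost model."""
    savings_per_use = phrase_len - token_width
    dict_cost = phrase_len + entry_overhead
    return savings_per_use * count - dict_cost


def select_iteratively(text: bytes, max_entries: int = 264,
                       tier1_size: int = 8) -> List[bytes]:
    """Accept one phrase at a time, recounting candidates after each."""
    segments = [text]          # stretches of text not yet covered by a token
    accepted: List[bytes] = []
    while len(accepted) < max_entries:
        token_width = 1 if len(accepted) < tier1_size else 2
        candidates = {
            seg[i:i + n]
            for seg in segments
            for n in PHRASE_LENGTHS
            for i in range(len(seg) - n + 1)
        }
        scored = [
            (net_gain(len(p), sum(s.count(p) for s in segments), token_width), p)
            for p in candidates
        ]
        if not scored:
            break
        gain, phrase = max(scored)   # ties broken by phrase bytes
        if gain <= 0:                # net-gain threshold: skip and stop
            break
        accepted.append(phrase)
        # Remove every occurrence so later counts reflect the working text.
        segments = [part for seg in segments
                    for part in seg.split(phrase) if part]
    return accepted
```

  Because covered stretches are removed from the working text, a later
  candidate can never overlap an accepted phrase, so overlap rejection
  follows from the bookkeeping rather than from a separate check.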

All inputs and outputs are hashed in SHA256_MANIFEST_V2.txt.
Reproducibility instructions: README_REPRODUCE_V2.txt.