README_REPRODUCE_V2.txt
========================================================================
OneCharacterCode Benchmark V2 - reproducibility guide

WHAT'S DIFFERENT FROM V1
------------------------------------------------------------------------

V1 results are untouched.  V2 is a separate, optimized prototype.

V2 optimizations vs V1:

  1. Tier 1 tokens (1 byte each) for the top 8 highest-savings entries.
     V1 spent one Unicode private-use code point per entry, which costs
     3 bytes in UTF-8, so V1 only broke even on strings of roughly 5+
     bytes that recurred roughly 5+ times.  A 1-byte token saves 2 more
     bytes per use than a 3-byte one, which lowers that break-even.
  2. Tier 2 tokens (2 bytes each: 0x0E followed by a 1-byte index) for
     the next 256 entries.
  3. Per-entry net-gain threshold.  An entry is included in the
     dictionary only if (savings_per_use x count) - dict_cost > 0.
  4. Greedy savings-ranked acceptance with overlap rejection.  Highest
     net-savings candidate is picked first; subsequent candidates that
     contain or are contained by an already-accepted entry are skipped.
  5. Phrase length pool: {3, 4, 5, 6, 8, 10, 12, 16, 24, 32} in one pass.
  6. Reserved-byte escape (0x0F prefix) so the source can contain any
     byte, including the V2 token bytes themselves, without breaking
     the decoder.
  7. gzip(OCC V2) is measured separately so the reader can see whether
     the prototype made gzip's job easier or harder.
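
Items 3 to 5 above can be sketched as follows.  This is a minimal
illustration in Python, not the code in run_benchmark_v2.ps1.  The
function names are hypothetical, and dict_cost is an assumption (phrase
bytes plus one length byte); the real dictionary layout may differ.

```python
from collections import Counter

PHRASE_LENGTHS = (3, 4, 5, 6, 8, 10, 12, 16, 24, 32)  # pool from item 5
TIER1_SLOTS = 8      # 1-byte tokens (item 1)
TIER2_SLOTS = 256    # 2-byte tokens (item 2)


def count_candidates(data: bytes) -> Counter:
    """Count every substring whose length is in the pool, in one pass per length.

    Occurrences are counted at every offset, so self-overlapping
    repeats are over-counted; a real encoder would have to correct for that.
    """
    counts = Counter()
    for n in PHRASE_LENGTHS:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def net_gain(phrase: bytes, count: int, token_len: int) -> int:
    """(savings_per_use x count) - dict_cost, as in item 3."""
    savings_per_use = len(phrase) - token_len
    dict_cost = len(phrase) + 1  # ASSUMPTION: phrase bytes + 1 length byte
    return savings_per_use * count - dict_cost


def select_entries(data: bytes) -> list[bytes]:
    """Greedy, savings-ranked acceptance with overlap rejection (item 4)."""
    counts = count_candidates(data)
    # Rank once, optimistically assuming a 1-byte token for every candidate.
    ranked = sorted(counts.items(),
                    key=lambda kv: net_gain(kv[0], kv[1], 1),
                    reverse=True)
    accepted: list[bytes] = []
    for phrase, count in ranked:
        if len(accepted) >= TIER1_SLOTS + TIER2_SLOTS:
            break
        # The first 8 accepted entries get 1-byte tokens, the rest 2-byte.
        token_len = 1 if len(accepted) < TIER1_SLOTS else 2
        if net_gain(phrase, count, token_len) <= 0:
            continue  # per-entry net-gain threshold
        if any(phrase in a or a in phrase for a in accepted):
            continue  # contains, or is contained by, an accepted entry
        accepted.append(phrase)
    return accepted


if __name__ == "__main__":
    for entry in select_entries(b"hello world, " * 10):
        print(entry)
```

Under the assumed dict_cost, a 5-byte string used 5 times nets 4 bytes
with a 3-byte token and 14 bytes with a 1-byte token, which is the
break-even shift described in item 1.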

WHAT'S IN THIS FOLDER
------------------------------------------------------------------------

  run_benchmark_v2.ps1            The V2 benchmark script.
  run_benchmark_v1.ps1            Snapshot of the V1 script for reference.
  inputs\                         The three test inputs (same as V1).
  outputs\                        Per-file V2 byproducts: .gz / .occ1 /
                                  .occ2 / .occ2.gz / .occ2.reconstructed.
  benchmark-results-v2.json       Machine-readable V2 results.
  benchmark-test-run-v2.txt       Human-readable V2 run report.
  benchmark-results-v1.json       Snapshot of V1 results (for diff).
  benchmark-test-run-v1.txt       Snapshot of V1 run report (for diff).
  benchmark-v2.html / .js         The public V2 web demo page.
  SHA256_MANIFEST_V2.txt          SHA-256 of every file in this folder.
  README_REPRODUCE_V2.txt         This file.

HOW TO RUN
------------------------------------------------------------------------

Prerequisites:

  - Windows PowerShell 5.1+ (already on Windows 10/11) or PowerShell 7+.
  - No internet connection required.
  - No third-party tools required.

Steps:

  1.  Open PowerShell.
  2.  Change to this folder.
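  3.  Run:  powershell -File .\run_benchmark_v2.ps1
      (under PowerShell 7+, use pwsh instead of powershell; if the
      execution policy blocks the script, add -ExecutionPolicy Bypass
      before -File).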
  4.  The script prints per-file results and writes:
        benchmark-results-v2.json
        benchmark-test-run-v2.txt

HOW TO INTERPRET THE V2 RESULTS
------------------------------------------------------------------------

For each test file the V2 results table shows:

  Raw                Original file size.
  Gzip (raw)         gzip applied directly to the original.
  OCC V1             V1 prototype output size (for side-by-side compare).
  OCC V2             V2 prototype output size.
  Gzip (OCC V2)      gzip applied to the V2 output.
  V2 reduction %     (1 - OCC_V2 / Raw) * 100.  Positive = smaller.
  V2 vs V1 %         (1 - OCC_V2 / OCC_V1) * 100.  Positive = V2 made
                     the prototype smaller than V1.
  Reconstruction     PASS / FAIL.  Must be PASS for the number to count.
  SHA-256            First 12 hex digits of the matching hash.
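
The two percentage columns use the same formula with a different
baseline.  A minimal sketch in Python; the function name is
hypothetical and the byte sizes are made up for illustration, not
taken from any run:

```python
def reduction_pct(new_size: int, base_size: int) -> float:
    """(1 - new / base) * 100.  Positive means new is smaller than base."""
    if base_size <= 0:
        raise ValueError("base_size must be positive")
    return (1 - new_size / base_size) * 100


# HYPOTHETICAL sizes in bytes, chosen only to exercise the formula.
raw, occ_v1, occ_v2 = 4000, 4400, 3600

print(f"V1 reduction %: {reduction_pct(occ_v1, raw):.1f}")     # negative: V1 expanded
print(f"V2 reduction %: {reduction_pct(occ_v2, raw):.1f}")     # positive: V2 shrank
print(f"V2 vs V1 %:     {reduction_pct(occ_v2, occ_v1):.1f}")  # baseline is OCC V1
```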

What to expect on the three sample inputs:

  - Gzip still wins on raw input.  Gzip combines LZ77 + Huffman and is
    a mature compressor; the prototype is not expected to beat it on
    short English-like inputs.
  - V2 is significantly smaller than V1 - the prototype is now actually
    compressing (positive reduction) rather than expanding (V1 was
    negative on every sample).
  - gzip(OCC V2) is slightly LARGER than gzip(raw), because OCC's
    substitutions removed some of the redundancy structure that gzip
    would otherwise have exploited.  This is honest information about
    the prototype, not a flaw in the script.
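
The gzip(raw) versus gzip(OCC V2) comparison can be reproduced in
spirit with a few lines of Python.  This is a sketch only: the
substitution below is a hypothetical stand-in for the V2 encoder, and
Python's gzip may produce slightly different byte counts than the
gzip implementation the benchmark script uses.

```python
import gzip


def gzip_size(data: bytes) -> int:
    """Size in bytes of the gzip stream for data.

    mtime=0 keeps the gzip header identical across runs, so repeated
    measurements of the same input give the same number.
    """
    return len(gzip.compress(data, compresslevel=9, mtime=0))


# HYPOTHETICAL input and substitution, for illustration only.
raw = b"the cat sat on the mat. " * 50
substituted = raw.replace(b"the ", b"\x01")  # stand-in for a 1-byte token

print("raw bytes:          ", len(raw))
print("gzip(raw):          ", gzip_size(raw))
print("substituted bytes:  ", len(substituted))
print("gzip(substituted):  ", gzip_size(substituted))
```

Which of the two gzip sizes is larger depends on the input; the sketch
only shows how the two measurements are taken side by side.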

VERIFYING HASHES
------------------------------------------------------------------------

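Every file in this folder is listed in SHA256_MANIFEST_V2.txt with its
SHA-256 hash.  To re-verify after a copy or transfer, recompute the
hashes and compare them against the manifest.  Get-FileHash prints
uppercase hex, so compare case-insensitively.  The command below covers
top-level files only; add -Recurse to Get-ChildItem to include inputs\
and outputs\: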

  Get-ChildItem -File | ForEach-Object {
    $h = (Get-FileHash -Algorithm SHA256 -Path $_.FullName).Hash
    "$($_.Name) $h"
  }
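
The comparison against the manifest can also be automated.  The Python
sketch below assumes each manifest line has the form "<name> <hash>",
matching the output of the command above, and that the manifest is
UTF-8 text; neither is confirmed here, so adjust the parsing to the
actual file.  The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Uppercase SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest().upper()


def check_manifest(folder: Path,
                   manifest_name: str = "SHA256_MANIFEST_V2.txt") -> dict[str, str]:
    """Return {name: "OK" | "MISMATCH" | "MISSING"} for each manifest line.

    ASSUMPTION: one "<name> <hash>" pair per line, hash last.
    """
    results: dict[str, str] = {}
    text = (folder / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2:
            continue  # blank or unparseable line
        name, expected = parts
        target = folder / name
        if not target.is_file():
            results[name] = "MISSING"
        elif sha256_hex(target) == expected.upper():
            results[name] = "OK"
        else:
            results[name] = "MISMATCH"
    return results
```

Calling check_manifest(Path(".")) from this folder returns one status
per listed file; anything other than "OK" means the copy differs from
what the manifest recorded.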

The benchmark script itself performs the most important verification:
for each input file it computes the SHA-256 of the original bytes,
encodes with V2, decodes the V2 output back to bytes, and recomputes
the SHA-256.  The reconstruction status is PASS only when the two
hashes are identical.  Any compression number that comes with a FAIL
is meaningless and should be discarded.
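
The round-trip check described above can be sketched in Python.  The
codec here is a hypothetical stand-in that performs only the 0x0F
escaping from optimization 6, with no dictionary; it is not the V2
encoder.  The reserved set is an assumption: the real one would also
include the Tier 1 token bytes, whose values this README does not list.

```python
import hashlib

ESCAPE = 0x0F                        # escape prefix (optimization 6)
RESERVED = frozenset({0x0E, 0x0F})   # ASSUMPTION: Tier 1 bytes omitted


def encode(data: bytes) -> bytes:
    """Prefix every reserved byte in the source with ESCAPE."""
    out = bytearray()
    for b in data:
        if b in RESERVED:
            out.append(ESCAPE)
        out.append(b)
    return bytes(out)


def decode(blob: bytes) -> bytes:
    """Reverse encode(): the byte after ESCAPE is taken literally."""
    out = bytearray()
    i = 0
    while i < len(blob):
        b = blob[i]
        if b == ESCAPE:
            i += 1
            if i >= len(blob):
                raise ValueError("dangling escape byte at end of input")
            out.append(blob[i])
        elif b in RESERVED:
            # A bare token byte would need a dictionary lookup,
            # which this toy codec does not have.
            raise ValueError(f"unexpected token byte 0x{b:02X}")
        else:
            out.append(b)
        i += 1
    return bytes(out)


def reconstruction_status(original: bytes) -> str:
    """PASS only if SHA-256 before encoding equals SHA-256 after decoding."""
    before = hashlib.sha256(original).hexdigest()
    after = hashlib.sha256(decode(encode(original))).hexdigest()
    return "PASS" if before == after else "FAIL"


# Every possible byte value, including the reserved ones, must survive.
print(reconstruction_status(bytes(range(256))))
```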

WHAT THIS BENCHMARK IS NOT
------------------------------------------------------------------------

  - This is not the final patented OneCharacterCode engine.  V2 is
    a prototype that improves on V1.  The production engine has not
    been benchmarked here.
  - This is not a universal compression claim.  Three KB-scale text
    inputs cannot speak for all data.
  - This is not a comparison against the full landscape of modern
    compressors (zstd, xz, lzma).  Those should be added next.

NEXT IMPROVEMENTS
------------------------------------------------------------------------

  - Iterative refinement: after each accepted entry, recompute
    candidate counts in the working text and re-rank the remaining
    candidates, instead of relying on the single ranking computed up
    front.
  - Three-tier tokens (1-byte / 2-byte / 3-byte) for a longer dict tail.
  - Structural pattern templates for HTML (tag opens/closes), JSON
    (key:value frames), and English text (sentence-frame templates).
  - Larger inputs (MB and GB scale), where the fixed dictionary
    overhead is amortized over many more occurrences.
  - Independent third-party reproduction of every run.

End of README_REPRODUCE_V2.txt.