README_REPRODUCE_V2.txt
========================================================================
OneCharacterCode Benchmark V2 - reproducibility guide


WHAT'S DIFFERENT FROM V1
------------------------------------------------------------------------
V1 results are untouched. V2 is a separate, optimized prototype.

V2 optimizations vs V1:

  1. Tier 1 tokens (1 byte each) for the top 8 highest-savings entries.
     V1 used a single 3-byte Unicode private-use codepoint per entry,
     so V1's break-even point was around 5+ recurrences of a 5+ byte
     string. V2's 1-byte tokens lower that break-even significantly.

  2. Tier 2 tokens (2 bytes each: 0x0E + index) for the next 256
     entries.

  3. Per-entry net-gain threshold. An entry is included in the
     dictionary only if (savings_per_use x count) - dict_cost > 0.

  4. Greedy savings-ranked acceptance with overlap rejection. Highest
     net-savings candidate is picked first; subsequent candidates that
     contain or are contained by an already-accepted entry are skipped.

  5. Phrase length pool: {3, 4, 5, 6, 8, 10, 12, 16, 24, 32} in one
     pass.

  6. Reserved-byte escape (0x0F prefix) so the source can contain any
     byte, including the V2 token bytes themselves, without breaking
     the decoder.

  7. gzip(OCC V2) is measured separately so the reader can see whether
     the prototype made gzip's job easier or harder.


WHAT'S IN THIS FOLDER
------------------------------------------------------------------------
  run_benchmark_v2.ps1        The V2 benchmark script.
  run_benchmark_v1.ps1        Snapshot of the V1 script for reference.
  inputs\                     The three test inputs (same as V1).
  outputs\                    Per-file V2 byproducts:
                              .gz / .occ1 / .occ2 / .occ2.gz /
                              .occ2.reconstructed.
  benchmark-results-v2.json   Machine-readable V2 results.
  benchmark-test-run-v2.txt   Human-readable V2 run report.
  benchmark-results-v1.json   Snapshot of V1 results (for diff).
  benchmark-test-run-v1.txt   Snapshot of V1 run report (for diff).
  benchmark-v2.html / .js     The public V2 web demo page.
  SHA256_MANIFEST_V2.txt      SHA-256 of every file in this folder.
  README_REPRODUCE_V2.txt     This file.


HOW TO RUN
------------------------------------------------------------------------
Prerequisites:

  - Windows PowerShell 5.1+ (already on Windows 10/11) or PowerShell 7+.
  - No internet connection required.
  - No third-party tools required.

Steps:

  1. Open PowerShell.
  2. Change to this folder.
  3. powershell -File run_benchmark_v2.ps1
  4. The script prints per-file results and writes:
       benchmark-results-v2.json
       benchmark-test-run-v2.txt


HOW TO INTERPRET THE V2 RESULTS
------------------------------------------------------------------------
For each test file the V2 results table shows:

  Raw               Original file size.
  Gzip (raw)        gzip applied directly to the original.
  OCC V1            V1 prototype output size (for side-by-side compare).
  OCC V2            V2 prototype output size.
  Gzip (OCC V2)     gzip applied to the V2 output.
  V2 reduction %    (1 - OCC_V2 / Raw) * 100. Positive = smaller.
  V2 vs V1 %        (1 - OCC_V2 / OCC_V1) * 100. Positive = V2 made the
                    prototype smaller than V1.
  Reconstruction    PASS / FAIL. Must be PASS for the number to count.
  SHA-256           First 12 hex digits of the matching hash.

What to expect on the three sample inputs:

  - Gzip still wins on raw input. Gzip combines LZ77 + Huffman and is a
    mature compressor; the prototype is not expected to beat it on
    short English-like inputs.

  - V2 is significantly smaller than V1 - the prototype is now actually
    compressing (positive reduction) rather than expanding (V1 was
    negative on every sample).

  - gzip(OCC V2) is slightly LARGER than gzip(raw), because OCC's
    substitutions removed some of the redundancy structure that gzip
    would otherwise have exploited. This is honest information about
    the prototype, not a flaw in the script.


VERIFYING HASHES
------------------------------------------------------------------------
Every file in this folder is listed in SHA256_MANIFEST_V2.txt with its
SHA-256 hash.
To re-verify after a copy or transfer:

  Get-ChildItem -File | ForEach-Object {
      $h = (Get-FileHash -Algorithm SHA256 -Path $_.FullName).Hash
      "$($_.Name) $h"
  }

The benchmark script itself performs the most important verification:
for each input file it computes the SHA-256 of the original bytes,
encodes with V2, decodes the V2 output back to bytes, and recomputes
the SHA-256. The reconstruction status is PASS only when the two hashes
are identical. Any compression number that comes with a FAIL is
meaningless and should be discarded.


WHAT THIS BENCHMARK IS NOT
------------------------------------------------------------------------
  - This is not the final patented OneCharacterCode engine. V2 is a
    prototype that improves on V1. The production engine has not been
    benchmarked here.

  - This is not a universal compression claim. Three KB-scale text
    inputs cannot speak for all data.

  - This is not a comparison against the full landscape of modern
    compressors (zstd, xz, lzma). Those should be added next.


NEXT IMPROVEMENTS
------------------------------------------------------------------------
  - Iterative refinement: after each accepted entry, recompute
    candidate counts in the working text and re-rank the remaining
    candidates rather than re-checking only against an initial ranking.

  - Three-tier tokens (1-byte / 2-byte / 3-byte) for a longer dict
    tail.

  - Structural pattern templates for HTML (tag opens/closes), JSON
    (key:value frames), and English text (sentence-frame templates).

  - Larger inputs (MB and GB scale) so the dictionary overhead is
    amortized differently.

  - Independent third-party reproduction of every run.


End of README_REPRODUCE_V2.txt.