README_REPRODUCE_V3.txt
========================================================================

OneCharacterCode Benchmark V3 - reproducibility guide


WHAT'S DIFFERENT FROM V2
------------------------------------------------------------------------

V1 and V2 results are untouched. V3 is a separate, further-optimized
prototype, and it reports TWO different tests on the same page (kept
physically separate so they are not confused with each other).

V3 optimizations vs V2:

  1. Three-tier tokens. Tier 1 = 16 single-byte tokens (avoiding
     TAB / LF / CR / NUL). Tier 2 = 512 two-byte tokens. Tier 3 = up
     to 256 three-byte tokens for the long-tail dictionary.

  2. Iterative refinement. After each accepted dictionary entry, the
     working text is rescanned and the remaining candidates are
     re-ranked. V2 ranked the candidate pool only once.

  3. Longer phrase pool: 3, 4, 5, 6, 8, 10, 12, 16, 24, 32, 48, 64,
     96, 128 bytes. V2 capped at 32.

  4. Per-file adaptive mode. The encoder runs in two modes (dict-only
     Tier 1+2 vs. hybrid Tier 1+2+3) and keeps the smaller output,
     provided it passes the SHA-256 round-trip check.

  5. Compact header: magic "OCC3", 1-byte length prefixes per entry,
     16-bit tier counts, 32-bit body length.

  6. Reserved-byte escape moved to 0x17 to free additional Tier-1
     byte slots.

V3 also adds a SEPARATE system-level bandwidth simulation. This is NOT
file compression; it is a sync-model comparison (cloud full download
per session vs. install-once + small update packet per session). The
update packet is a simulated placeholder set to 5% of the initial
install size, not a measured compression ratio.


WHAT'S IN THIS FOLDER
------------------------------------------------------------------------

  run_benchmark_v3.ps1         The V3 benchmark script.
  run_benchmark_v2.ps1         Snapshot of the V2 script.
  inputs\                      The three test inputs.
  outputs\                     Per-file V3 byproducts.
  benchmark-results-v3.json    Machine-readable V3 file compression
                               results.
  system-level-bandwidth-results-v3.json
                               Machine-readable bandwidth simulation
                               results.
  benchmark-test-run-v3.txt    Human-readable V3 run report.
  benchmark-results-v2.json    Snapshot of V2 results.
  benchmark-results-v1.json    Snapshot of V1 results.
  benchmark-test-run-v2.txt    Snapshot of V2 run report.
  benchmark-v3.html / .js      The public V3 web demo page.
  SHA256_MANIFEST_V3.txt       SHA-256 of every file here.
  README_REPRODUCE_V3.txt      This file.


HOW TO RUN
------------------------------------------------------------------------

Prerequisites:

  - Windows PowerShell 5.1+ (already on Windows 10/11) or PS 7+.
  - No internet connection required.
  - No third-party tools required.

Steps:

  1. Open PowerShell.
  2. Change to this folder.
  3. powershell -File run_benchmark_v3.ps1
  4. The script prints per-file results and writes:
       benchmark-results-v3.json
       system-level-bandwidth-results-v3.json
       benchmark-test-run-v3.txt


HOW TO INTERPRET THE V3 RESULTS
------------------------------------------------------------------------

There are TWO tables. They report DIFFERENT things.

Test 1 - File compression (per-file table):

    Raw               Original file size.
    Gzip (raw)        gzip applied directly to the original.
    OCC V1            V1 prototype output (for comparison).
    OCC V2            V2 prototype output (for comparison).
    OCC V3            V3 prototype output.
    Gzip (OCC V3)     gzip applied to the V3 output.
    V3 reduction %    (1 - OCC_V3 / Raw) * 100. Positive = smaller.
    V3 vs V2 %        (1 - OCC_V3 / OCC_V2) * 100. Positive = V3 made
                      the prototype smaller than V2.
    Mode              "dict-only" (Tier 1+2 only) or "hybrid-3tier"
                      (all three tiers).
    Reconstruction    PASS / FAIL. Must be PASS for the number to
                      count.
    SHA-256           First 12 hex digits of the matching hash.

Test 2 - System-level bandwidth simulation:

    Sessions                         Number of user sessions modeled.
    Cloud full download bytes        sessions * initial_package_bytes.
    Local install + updates bytes    initial_package_bytes
                                     + (sessions - 1) * simulated_update.
    Bytes saved                      Cloud - Local.
    Savings %                        Saved / Cloud * 100.
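
The two percentage formulas above can be sketched in a few lines. This
is an illustrative Python sketch, not the code in run_benchmark_v3.ps1.
The function names, the update_percent parameter, and the rounding of
the update size down to whole bytes are assumptions made for the
example; the sample figures are hypothetical, not measured results.

```python
def reduction_pct(new_bytes: int, base_bytes: int) -> float:
    """Test 1 percentage: (1 - new / base) * 100.

    Written as (base - new) * 100 / base, which is algebraically the
    same. Positive means the new output is smaller than the base.
    """
    if base_bytes <= 0:
        raise ValueError("base_bytes must be positive")
    return (base_bytes - new_bytes) * 100 / base_bytes


def bandwidth_model(sessions: int, initial_package_bytes: int,
                    update_percent: int = 5) -> dict:
    """Test 2 model: cloud full download vs. install once plus updates."""
    if sessions < 1 or initial_package_bytes <= 0:
        raise ValueError("need sessions >= 1 and a positive package size")
    # Assumption: the simulated update is rounded down to whole bytes.
    simulated_update = initial_package_bytes * update_percent // 100
    cloud = sessions * initial_package_bytes
    local = initial_package_bytes + (sessions - 1) * simulated_update
    saved = cloud - local
    return {
        "cloud_bytes": cloud,
        "local_bytes": local,
        "bytes_saved": saved,
        "savings_pct": saved * 100 / cloud,
    }


# Hypothetical figures: 10 sessions, 1,000,000-byte initial package.
example = bandwidth_model(sessions=10, initial_package_bytes=1_000_000)
print(example)
```

With these hypothetical figures the model gives 10,000,000 cloud bytes
against 1,450,000 local bytes, a savings of 85.5%. With a single
session the two strategies cost the same and the savings are 0%, which
shows that the figure depends on the session count and the assumed
update size rather than on any compression result.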

The savings percentage in Test 2 is a model output describing a sync
strategy, NOT a compression ratio. Replace the simulated 5% update size
with a real measurement to model a real deployment.

What to expect on the three sample inputs (Test 1):

  - Gzip still wins on raw input. Gzip combines LZ77 + Huffman and is
    mature; a substitution prototype with no entropy coder is not
    expected to beat it on short English-like inputs.

  - V3 is smaller than V2. Reductions are roughly 28-34% on these
    inputs, vs. V2's roughly 20-23%.

  - gzip(OCC V3) is slightly LARGER than gzip(raw) on these inputs,
    because OCC's substitutions remove some of the redundancy
    structure gzip would otherwise have exploited. Honest finding,
    not a script flaw.


VERIFYING HASHES
------------------------------------------------------------------------

Every file in this folder is listed in SHA256_MANIFEST_V3.txt. To
re-verify:

    Get-ChildItem -File | ForEach-Object {
        $h = (Get-FileHash -Algorithm SHA256 -Path $_.FullName).Hash
        "$($_.Name) $h"
    }

The benchmark script itself performs the most important verification:
for each input file it SHA-256s the original bytes, encodes with V3,
decodes the V3 output back to bytes, and re-hashes. The reconstruction
status is PASS only when the two hashes are identical. Any compression
number that comes with a FAIL is meaningless and should be discarded.


WHAT THIS BENCHMARK IS NOT
------------------------------------------------------------------------

  - Not the production patented OneCharacterCode engine. V3 is a
    prototype that improves on V2. The production engine is not
    represented here.

  - Not a universal compression claim. Three KB-scale text inputs
    cannot speak for all data.

  - Not a comparison against the full landscape of modern compressors
    (zstd, xz, lzma). Those should be added next.

  - Not a claim that file compression equals system-level bandwidth
    savings. Test 1 and Test 2 are different claims and are kept
    physically separate on the page.
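
The PASS / FAIL rule described under VERIFYING HASHES can be sketched
as follows. This is an illustrative Python sketch, not the PowerShell
in run_benchmark_v3.ps1. The V3 encoder is not reproduced here, so
zlib stands in as the codec; round_trip_status and lossy_decode are
hypothetical names introduced only for the example.

```python
import hashlib
import zlib
from typing import Callable, Tuple


def round_trip_status(original: bytes,
                      encode: Callable[[bytes], bytes],
                      decode: Callable[[bytes], bytes]) -> Tuple[str, str]:
    """Hash the original, encode, decode, re-hash, and compare.

    Returns ("PASS" or "FAIL", first 12 hex digits of the original hash).
    """
    before = hashlib.sha256(original).hexdigest()
    restored = decode(encode(original))
    after = hashlib.sha256(restored).hexdigest()
    status = "PASS" if before == after else "FAIL"
    return status, before[:12]


def lossy_decode(blob: bytes) -> bytes:
    """Deliberately broken decoder: drops the final byte."""
    return zlib.decompress(blob)[:-1]


sample = b"the quick brown fox jumps over the lazy dog " * 20

# Stand-in codec (zlib), not the V3 encoder: a correct round trip.
print(round_trip_status(sample, zlib.compress, zlib.decompress))

# A decoder that loses data must be reported as FAIL.
print(round_trip_status(sample, zlib.compress, lossy_decode))
```

The first call reports PASS and the second FAIL, because dropping even
one byte changes the SHA-256 digest. Any size figure produced alongside
a FAIL would describe output that cannot be turned back into the
original, which is why such numbers are discarded.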


NEXT IMPROVEMENTS
------------------------------------------------------------------------

  - Add an entropy coder (e.g. range / arithmetic / Huffman) on top of
    the dictionary so that matching gzip on these inputs becomes
    feasible.

  - Structural pattern templates for HTML (tag opens/closes), JSON
    (key:value frames), and English text (sentence-frame templates).

  - Larger inputs (MB and GB scale) so the dictionary header is
    amortized differently.

  - Replace the simulated 5% update packet with measured update deltas
    from a real deployment.

  - Independent third-party reproduction of every run.

  - Comparison against zstd / xz / lzma at multiple compression
    levels.

End of README_REPRODUCE_V3.txt.