README_STACKED_TEST_V3.txt
========================================================================
OneCharacterCode V3 STACKED Compression Test - reproducibility guide

This test answers the correction:

    "If OneCharacterCode data is transmitted, it can still be
    compressed again with gzip. The fair transport comparison is
    gzip(raw) versus gzip(OCC carrier)."

The V1/V2/V3 file-compression pages compared OCC carrier bytes vs. raw
bytes. That is the right comparison for "did the encoder compress this
file?" but it is NOT the right comparison for "would I save bandwidth by
shipping this over the wire?". Real transport would gzip the OCC carrier
on the wire anyway. This test reports the wire-side comparison honestly.

PIPELINE PER FILE
------------------------------------------------------------------------
  baseline transport:  raw -> gzip(raw)
  stacked transport:   raw -> OCC V3 -> gzip(OCC V3)
  receiver:            gzip(OCC V3) -> gunzip -> decode V3 -> raw
  verification:        SHA-256(original) == SHA-256(reconstructed)

Both paths use the same gzip implementation
(.NET System.IO.Compression.GZipStream, default level), so the
comparison is apples-to-apples.

WINNER RULE
------------------------------------------------------------------------
Winner = whichever transport size is smaller.

  If gzip(OCC V3) >= gzip(raw):  "Gzip(raw) still wins for this file."
  If gzip(OCC V3) <  gzip(raw):  "OCC V3 + gzip wins for this file."

An OCC win is reported only when BOTH
  (a) gzip(OCC V3) is strictly smaller than gzip(raw), AND
  (b) the receiver-side roundtrip (gunzip -> decode V3) recovers the
      original bytes and SHA-256 matches.

If either condition fails, gzip(raw) is declared the winner for that
file.

WHAT'S IN THIS FOLDER
------------------------------------------------------------------------
  run_stacked_compression_test.ps1
      The stacked-test script.
  run_benchmark_v3.ps1
      Copy of the V3 encoder for reference (NOT executed by this
      folder's main script).
  inputs\
      Three input files (same as V1/V2/V3).
  outputs\
      Per-file byproducts: .gz, .occ3, .occ3.gz,
      .reconstructed_from_gzip_occ3
  stacked-compression-results-v3.json
      Machine-readable results.
  stacked-compression-test-run-v3.txt
      Human-readable run report.
  benchmark-v3-stacked.html
      The public stacked page.
  SHA256_MANIFEST_STACKED_V3.txt
      SHA-256 of every file here.
  README_STACKED_TEST_V3.txt
      This file.

HOW TO RUN
------------------------------------------------------------------------
Prerequisites:
  - Windows PowerShell 5.1+ (already on Windows 10/11) or PS 7+.
  - No internet connection required.
  - No third-party tools required.

Steps:
  1. Open PowerShell.
  2. Change to this folder.
  3. powershell -File run_stacked_compression_test.ps1
  4. The script prints per-file results and writes:
       stacked-compression-results-v3.json
       stacked-compression-test-run-v3.txt
     and per-file byproducts in outputs\.

HOW TO INTERPRET THE RESULTS
------------------------------------------------------------------------
  Raw Bytes                Original file size.
  Gzip(raw) Bytes          Size after gzip applied to the raw file.
  OCC V3 Bytes             Size after the V3 OCC encoder.
  Gzip(OCC V3) Bytes       Size after gzip applied to the OCC V3 output.
  Best Transport Winner    Whichever of gzip(raw) or gzip(OCC V3) is
                           smaller, subject to a passing roundtrip.
  Gzip(OCC) vs Gzip(raw)   (1 - gzip(OCC V3) / gzip(raw)) * 100.
                           Positive = smaller; negative = larger.
  Roundtrip                PASS if gunzip(gzip(OCC V3)) decoded back to
                           the original bytes, FAIL otherwise.
  SHA-256 Match            OK if SHA-256 of reconstructed bytes equals
                           SHA-256 of original bytes.

WHY THE RESULT MAY FAVOR GZIP(RAW)
------------------------------------------------------------------------
Gzip combines an LZ77 sliding window with Huffman entropy coding. For
short English-like inputs it already extracts most of the redundancy.
The OCC V3 dictionary substitutions consume some of that same
redundancy, so the second-stage gzip starts with less to work with.

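The effect described above can be explored with a minimal Python sketch
of the pipeline, the winner rule, and the percentage formula. It is not
the test script: TOY_DICT, toy_encode, toy_decode, and
compare_transports are hypothetical names, the toy word substitution
only stands in for the OCC V3 encoder (which this README does not
specify), and Python's gzip module stands in for .NET GZipStream, so
byte counts will differ from the script's.

```python
import gzip
import hashlib

# ASSUMPTION: a toy phrase -> single-byte dictionary standing in for
# OCC V3. The real OCC V3 dictionary and carrier format are not
# described in this README.
TOY_DICT = {b" the ": b"\x01", b" and ": b"\x02", b" of ": b"\x03"}


def toy_encode(raw: bytes) -> bytes:
    """Replace each dictionary phrase with its one-byte token."""
    # The toy scheme is only reversible if the token bytes are unused.
    if any(token in raw for token in TOY_DICT.values()):
        raise ValueError("input already contains a token byte")
    out = raw
    for phrase, token in TOY_DICT.items():
        out = out.replace(phrase, token)
    return out


def toy_decode(carrier: bytes) -> bytes:
    """Expand every one-byte token back to its phrase."""
    out = carrier
    for phrase, token in TOY_DICT.items():
        out = out.replace(token, phrase)
    return out


def compare_transports(raw: bytes) -> dict:
    """Run baseline and stacked transport and apply the winner rule."""
    gz_raw = gzip.compress(raw)            # baseline: raw -> gzip(raw)
    carrier = toy_encode(raw)              # stacked: raw -> encoder
    gz_carrier = gzip.compress(carrier)    #          -> gzip(carrier)

    # Receiver side: gunzip -> decode -> compare SHA-256 digests.
    reconstructed = toy_decode(gzip.decompress(gz_carrier))
    sha_ok = (hashlib.sha256(raw).digest()
              == hashlib.sha256(reconstructed).digest())

    # Stacked wins only if strictly smaller AND the roundtrip passes.
    stacked_wins = sha_ok and len(gz_carrier) < len(gz_raw)
    return {
        "gzip_raw_bytes": len(gz_raw),
        "gzip_carrier_bytes": len(gz_carrier),
        "percent_vs_gzip_raw": (1 - len(gz_carrier) / len(gz_raw)) * 100,
        "roundtrip": "PASS" if sha_ok else "FAIL",
        "winner": "stacked" if stacked_wins else "gzip(raw)",
    }


if __name__ == "__main__":
    sample = b"the size of the file and the size of the carrier " * 40
    print(compare_transports(sample))
```

Ties go to gzip(raw) because the stacked path must be strictly smaller,
matching the winner rule in this README.
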
What could change this outcome:
  - Long-range repeats outside gzip's 32 KB window (which requires much
    larger inputs).
  - An entropy coder on top of the OCC dictionary that does not
    pre-collide with gzip's redundancy model.
  - Structured corpora where OCC's dictionary entries capture cross-file
    patterns that gzip cannot see.

VERIFYING HASHES
------------------------------------------------------------------------
Every file in this folder is listed in SHA256_MANIFEST_STACKED_V3.txt
with its SHA-256 hash. The test script itself performs the most
important verification: for each input file it SHA-256s the original
bytes, encodes with OCC V3, gzips the carrier, gunzips it back, decodes
with OCC V3, and re-hashes. PASS/FAIL is the comparison of the two
hashes.

STRICT REPORTING RULES IN THIS TEST
------------------------------------------------------------------------
  - No fake wins.
  - No hype.
  - If gzip(raw) wins, the page says
    "Gzip(raw) still wins for this file."
  - If gzip(OCC V3) wins, the page says
    "OCC V3 + gzip wins for this file."
    and only when the receiver-side roundtrip PASSES.
  - Any file whose roundtrip FAILS is treated as a non-win regardless
    of byte counts.

End of README_STACKED_TEST_V3.txt.