README_REPRODUCE_V4.txt
========================================================================
OneCharacterCode Benchmark V4 - reproducibility guide

THREE DIFFERENT TESTS ON THIS PAGE
------------------------------------------------------------------------
V4 separates standalone file compression from system-level data savings.
Three modes are measured and reported separately:

  Mode A  Standalone per-file compression.
          Per-file encoding with its own dictionary; cumulative bytes
          per file set; compared against gzip. May not beat gzip; we
          report honestly.

  Mode B  Persistent shared dictionary.
          One dictionary is built from a training subset and
          "installed" once. Later files ship as body-only symbolic
          packets that reference the installed dictionary. The
          dictionary is NOT re-transmitted on later transfers.
          Cumulative wire bytes vs. cumulative raw download.

  Mode C  Delta updates.
          For consecutive versioned files: prev + delta -> next.
          Only the changed middle bytes are sent (between longest
          common prefix and longest common suffix), inside an OCC
          carrier. Receiver applies the delta and verifies SHA-256.

Reconstruction (SHA-256 round-trip) must PASS for any byte count to
count as a claim.

WHAT'S IN THIS FOLDER
------------------------------------------------------------------------
  run_benchmark_v4.ps1                 V4 benchmark script.
  generate_inputs.ps1                  Deterministic dataset generator.
  inputs\
    SMALL_CURRENT_SET\                 3 files copied from V3.
    REPEATED_APP_SESSIONS\             100 synthetic sessions.
    WEBSITE_PAGE_SET\                  25 synthetic HTML pages.
    LOG_STYLE_SET\                     10000 log lines (1 file).
    AGENT_STATE_SET\                   100 synthetic agent states.
  outputs\                             Per-mode byproducts.
  benchmark-results-v4.json            Mode A results.
  system-persistent-results-v4.json    Mode B results.
  delta-update-results-v4.json         Mode C results.
  benchmark-test-run-v4.txt            Human-readable run report.
  benchmark-v4.html                    Public V4 page.
  README_REPRODUCE_V4.txt              This file.
  SHA256_MANIFEST_V4.txt               Hashes of every V4 file.
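
ILLUSTRATION: MODE C PREFIX/SUFFIX DELTA (PYTHON SKETCH)
------------------------------------------------------------------------
The Mode C description above (prev + delta -> next, changed middle bytes
only, SHA-256 check on the receiver) can be illustrated with a short
Python sketch. This is not the benchmark's PowerShell code and it does
not reproduce the OCC carrier format; the function names make_delta and
apply_delta, and the choice to pass the delta as a plain tuple, are
assumptions made for illustration only.

```python
import hashlib
from typing import Tuple


def make_delta(prev: bytes, nxt: bytes) -> Tuple[int, int, bytes]:
    """Return (prefix_len, suffix_len, middle) for the update prev -> nxt.

    prefix_len: length of the longest common prefix.
    suffix_len: length of the longest common suffix that does not
                overlap the prefix in either file.
    middle:     the bytes of nxt between that prefix and suffix.
    """
    limit = min(len(prev), len(nxt))

    prefix_len = 0
    while prefix_len < limit and prev[prefix_len] == nxt[prefix_len]:
        prefix_len += 1

    # Cap the suffix so it never overlaps the prefix already matched.
    max_suffix = limit - prefix_len
    suffix_len = 0
    while (suffix_len < max_suffix
           and prev[len(prev) - 1 - suffix_len]
           == nxt[len(nxt) - 1 - suffix_len]):
        suffix_len += 1

    middle = nxt[prefix_len:len(nxt) - suffix_len]
    return prefix_len, suffix_len, middle


def apply_delta(prev: bytes, prefix_len: int, suffix_len: int,
                middle: bytes, expected_sha256: str) -> bytes:
    """Rebuild the next version and verify it against a SHA-256 digest."""
    rebuilt = prev[:prefix_len] + middle + prev[len(prev) - suffix_len:]
    if hashlib.sha256(rebuilt).hexdigest() != expected_sha256:
        raise ValueError("SHA-256 mismatch: reconstruction FAIL")
    return rebuilt


# Toy input (not taken from the benchmark data sets).
prev_version = b'{"user":"ada","step":1,"status":"ok"}'
next_version = b'{"user":"ada","step":2,"status":"ok"}'

prefix_len, suffix_len, middle = make_delta(prev_version, next_version)
digest = hashlib.sha256(next_version).hexdigest()
rebuilt = apply_delta(prev_version, prefix_len, suffix_len, middle, digest)
print(len(middle), "middle byte(s) sent; round-trip PASS:",
      rebuilt == next_version)
```

In this sketch only the single changed byte travels as the middle
section; a real packet would also carry the two lengths, the digest and
whatever header the carrier format defines, so the byte counts here are
not comparable to DeltaSyncBytes.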

HOW TO RUN
------------------------------------------------------------------------
Prerequisites:
  - Windows PowerShell 5.1+ (already on Windows 10/11) or PowerShell 7+.
  - No internet connection required.
  - No third-party tools required.

Steps:
  1. Open PowerShell.
  2. Change to this folder.
  3. powershell -File generate_inputs.ps1   (only if inputs are missing)
  4. powershell -File run_benchmark_v4.ps1

The script writes:
  benchmark-results-v4.json
  system-persistent-results-v4.json
  delta-update-results-v4.json
  benchmark-test-run-v4.txt
  per-mode byproducts in outputs\.

HOW TO INTERPRET THE THREE TABLES
------------------------------------------------------------------------
Mode A - Standalone Compression:
  FileSet         Name of the input set.
  RawBytes        Sum of raw input bytes for the set.
  GzipRawBytes    Sum of gzip(raw) for the set.
  OCCV4Bytes      Sum of V4 self-contained carriers (dict + body)
                  for the set, encoded per-file.
  Winner          "Gzip(raw)" or "OCC V4", whichever cumulative total
                  is smaller. A winner is declared only when the
                  round-trip passes.
  Reconstruction  PASS / FAIL across the whole set.

Mode B - Persistent Dictionary:
  FileSet                          Name of the input set.
  Files                            Number of files in the set.
  NormalFullDownloadBytes          Sum of raw bytes (every file pulled
                                   fresh).
  InitialInstallPlusSymbolicBytes  One-time install of the shared
                                   dictionary + sum of per-file
                                   body-only OCC4P packets.
  BytesSaved                       NormalFullDownloadBytes -
                                   InitialInstallPlusSymbolicBytes.
  PercentSaved                     BytesSaved / NormalFullDownloadBytes.
  Reconstruction                   PASS / FAIL across the whole set.

Mode C - Delta Updates:
  FileSet            Name of the versioned set.
  Versions           Number of files in the version order.
  FullDownloadBytes  Sum of raw bytes for versions 2..N (the versions
                     that would otherwise be pulled in full).
  DeltaSyncBytes     Sum of OCC delta packet bytes for those same
                     versions.
  BytesSaved         FullDownloadBytes - DeltaSyncBytes.
  PercentSaved       BytesSaved / FullDownloadBytes.
  Reconstruction     PASS / FAIL across the whole set.

OPTIMIZATION SUMMARY (V4 vs V3)
------------------------------------------------------------------------
  1. Shared dictionary across files (Mode B). V3's encoder always
     embedded the dictionary in every carrier. V4 ships one installed
     dictionary, followed by many small body-only packets.
  2. Persistent symbol IDs. Tier 1 / Tier 2 / Tier 3 token bytes are
     stable across files in Mode B and Mode C.
  3. Template-class tokens. The training set for each file set is
     chosen so the resulting dictionary contains nav, footer, JSON
     schema keys, agent command frames, repeated log prefixes, etc.
  4. Delta-only packets (Mode C). The delta format (magic "OCC4X")
     transmits only the middle bytes between the common prefix and
     the common suffix.
  5. Binary compact headers (1-byte length prefixes, 16-bit tier
     counts, 32-bit body length).
  6. No dictionary resend after install. Mode B counts the install
     once and only counts packet bytes afterward.
  7. SHA-256 validation for every reconstruction.

HONEST LIMITATIONS
------------------------------------------------------------------------
- Mode A: still a dictionary-substitution prototype with no entropy
  coder. Gzip is mature; on short single-file inputs it tends to win,
  and we report that honestly.
- Mode B: the gains depend on how much cross-file redundancy the
  training corpus actually shares with the later files. These file
  sets are deliberately structured to share chrome/templates;
  heterogeneous corpora will show much smaller gains.
- Mode C: the prefix/suffix delta scheme is intentionally simple.
  Sophisticated delta encoders (xdelta, bsdiff) would do better on
  files with changes in the middle of repeated regions.
- The encoder is a prototype, not the production OneCharacterCode
  engine.
- Three input families are produced by synthetic, deterministic
  generators (plus the 3 V3 sample inputs in SMALL_CURRENT_SET).
  Real-world corpora will behave differently.
- Only gzip is compared on the raw side. zstd / xz / lzma still need
  to be added.
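
ILLUSTRATION: MODE B / MODE C COLUMN ARITHMETIC (PYTHON SKETCH)
------------------------------------------------------------------------
For readers checking the Mode B and Mode C tables by hand, the column
arithmetic described under HOW TO INTERPRET THE THREE TABLES can be
sketched in Python. This is not the benchmark's PowerShell code. The
function names are hypothetical, every number below is an invented toy
value rather than a benchmark result, and reporting PercentSaved on a
0-100 scale is an assumption (the README only gives the ratio).

```python
from typing import Sequence, Tuple


def savings(full_bytes: int, sent_bytes: int) -> Tuple[int, float]:
    """Return (BytesSaved, PercentSaved) for one table row.

    PercentSaved is scaled to 0-100 here; that scaling is an assumption.
    A negative result means more bytes were sent than a full download.
    """
    saved = full_bytes - sent_bytes
    percent = 100.0 * saved / full_bytes if full_bytes else 0.0
    return saved, percent


def mode_b_row(raw_sizes: Sequence[int], dictionary_install_bytes: int,
               packet_sizes: Sequence[int]) -> Tuple[int, int, int, float]:
    """Mode B: the dictionary install is counted once, then one
    body-only packet per file."""
    full = sum(raw_sizes)                                  # NormalFullDownloadBytes
    sent = dictionary_install_bytes + sum(packet_sizes)    # InitialInstallPlusSymbolicBytes
    saved, percent = savings(full, sent)
    return full, sent, saved, percent


def mode_c_row(raw_sizes: Sequence[int],
               delta_sizes: Sequence[int]) -> Tuple[int, int, int, float]:
    """Mode C: version 1 is excluded; versions 2..N are compared
    against their delta packets."""
    if len(delta_sizes) != len(raw_sizes) - 1:
        raise ValueError("expected one delta per version after the first")
    full = sum(raw_sizes[1:])    # FullDownloadBytes
    sent = sum(delta_sizes)      # DeltaSyncBytes
    saved, percent = savings(full, sent)
    return full, sent, saved, percent


# Toy values only: four 1000-byte files, a 400-byte dictionary install,
# 100-byte packets, and 50-byte deltas.
toy_raw = [1000, 1000, 1000, 1000]
print("Mode B:", mode_b_row(toy_raw, 400, [100, 100, 100, 100]))
print("Mode C:", mode_c_row(toy_raw, [50, 50, 50]))
```

The split into two row functions mirrors the README's point that the
modes are accounted for separately: Mode B charges the one-time install
against the whole set, while Mode C leaves version 1 out of both sides
of the comparison.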

WHAT THIS BENCHMARK IS NOT
------------------------------------------------------------------------
- Not the production patented OneCharacterCode engine.
- Not a universal compression claim.
- Not a comparison against the full modern compressor landscape.
- Not a claim that file compression equals system-level data savings.
  Modes A, B, C are kept in three separate tables for this reason.

End of README_REPRODUCE_V4.txt.