Format Evolution and Design Rationale

1 Purpose

This document records the design decisions behind version 2 of the con file format, the alternatives considered, and the reasoning for each choice. It serves as institutional memory for contributors and as a reference for implementers in other languages.

2 Version 1 to version 2 feature matrix

Feature	v1 (original eOn)	v2 (readcon-core 0.6.0+)
Machine-readable metadata	No	JSON object on line 2
Spec version in file	No	`con_spec_version` key
Column 5 semantics	Undefined	`atom_id` (pre-grouping index)
Per-direction constraints	No (single flag)	Bitmask column 4 (0-7)
Declared data sections	No	`sections` JSON key
Force data	No	`Forces of Component` blocks
Per-frame energy/potential	No	`energy`, `potential` metadata keys
Convergence tracking	No	`convergence_fmax`, `converged` keys
Unit declaration	No	`units` metadata object
Trajectory ordering	No	`frame_index`, `time` keys
NEB image identity	No	`neb_bead`, `neb_band` keys
Compression	No	Transparent gzip (.con.gz)

3 Why JSON on line 2 (not elsewhere)

3.1 Alternatives considered

New header line: adding a 10th header line breaks every existing parser that expects exactly 9 lines before atom data.
Line 5 or 6 (postbox): these carry opaque simulation state in some eOn files (“0 0”, “218 0 1”). Overwriting them risks losing data in round-trips through tools that preserve those values.
Line 1: the generator comment (“Generated by eOn”) is the most visible line. Tools and humans use it to identify the file origin.
Separate sidecar file: a .con.meta JSON file avoids changing the format but introduces file-pairing problems (lost sidecar, out-of-sync data).

3.2 Decision

Line 2 was historically “Time” or empty in eOn files. No tool assigns it semantic meaning. The Python writer in eOn emits an empty string. The C++ writer round-trips whatever was there. Placing JSON on line 2 is invisible to old readers (they just see a different comment string) and detectable by new readers (starts with {).
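The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the readcon-core implementation: the function name `parse_line2_metadata` is hypothetical, and treating malformed JSON as a legacy comment (rather than raising an error) is an assumption made here to keep the reader permissive.

```python
import json


def parse_line2_metadata(line: str) -> dict:
    """Return the metadata object from header line 2, or {} for a legacy line.

    Hypothetical helper: a line that does not start with "{" is treated as
    a v1 comment string ("Time", empty, or anything else).
    """
    text = line.strip()
    if not text.startswith("{"):
        return {}
    try:
        meta = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: fall back to "no metadata" instead of failing the read.
        return {}
    # Only a JSON object counts as metadata; anything else is ignored.
    return meta if isinstance(meta, dict) else {}
```

A v1 file with "Time" or an empty string on line 2 yields an empty dictionary, so callers can use the same code path for both versions.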

4 Why bitmask for constraints (not 3 separate columns)

4.1 Alternatives considered

Three separate 0/1 columns: changes atom lines from 5 columns to 7, breaking all existing parsers.
JSON metadata per-atom: storing fixed_directions as a list of 3-tuples in the metadata line would work but makes per-atom access expensive and splits constraint info across two locations.
Separate constraint section: a “Constraints of Component” section adds complexity without benefit – the constraint is a property of the atom, not a separate dataset.

4.2 Decision

A 3-bit bitmask in the existing column 4 preserves the 5-column line format. Legacy value 1 (meaning “fully fixed”) maps to all-three-set ([true, true, true]). The canonical all-fixed value for new files is 7. Old readers that check != 0 for “is fixed” will treat any non-zero bitmask value as fixed, which is a safe degradation.
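The decoding rule can be illustrated with a short Python sketch. The bit-to-axis assignment used here (bit 0 = x, bit 1 = y, bit 2 = z) is an assumption for illustration only; this document does not state it, so check the format specification before relying on it. The function names are hypothetical.

```python
def decode_fixed_mask(value: int) -> tuple:
    """Decode column 4 into (fix_x, fix_y, fix_z).

    Assumed bit layout (not confirmed by this document):
    bit 0 = x, bit 1 = y, bit 2 = z.
    """
    if not 0 <= value <= 7:
        raise ValueError(f"constraint mask out of range: {value}")
    if value == 1:
        # Legacy v1 flag: "fully fixed" maps to all three directions.
        return (True, True, True)
    return (bool(value & 1), bool(value & 2), bool(value & 4))


def v1_is_fixed(value: int) -> bool:
    """What an old reader sees: any non-zero value counts as fixed."""
    return value != 0
```

Under that assumed layout, a partially constrained atom such as mask 6 decodes to per-direction flags in a v2 reader, while a v1 reader simply reports it as fixed.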

5 Why JSON-declared sections (not positional)

5.1 Alternatives considered

Fixed order: coordinates, then velocities, then forces – always in that order, detected by blank separators. Simple but rigid: new section types require all implementations to update their order.
Tagged sections: each section starts with a line like #SECTION:forces. Adds a new syntactic element to the format.
File extension encoding: .convelforce for velocity+force files. Combinatorial explosion of extensions.

5.2 Decision

The sections key in the JSON metadata declares which sections exist and their order. The parser reads exactly those sections. Benefits:

New section types require no format-level changes – just add a string to the array.
Section order is explicit, not implicit.
Legacy files without sections fall back to blank-separator velocity detection (backward compatible).
The writer auto-populates sections from the frame data, so users never need to set it manually.

The optional validate=true metadata flag lets producers ask readers to verify that declared sections match coordinate blocks exactly: component symbols, labels, fixed masks, atom ids, numeric finiteness, metadata schema, and physical header invariants must all agree. This keeps the default reader permissive for existing files while giving v2 producers a strict interoperability contract.
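How a reader might choose between declared sections and the legacy fallback is sketched below. The function name `resolve_sections` and the section label "velocities" are assumptions for illustration; only "forces" appears as a section string elsewhere in this document.

```python
def resolve_sections(meta: dict, has_blank_separated_block: bool) -> list:
    """Return the ordered list of extra sections a reader should expect.

    meta: parsed line-2 metadata ({} for a legacy file).
    has_blank_separated_block: True if a blank line followed by another
    per-atom block was seen after the coordinates.
    """
    declared = meta.get("sections")
    if declared is not None:
        # v2 path: the declaration is authoritative, including its order.
        return list(declared)
    # Legacy path: a blank-separated block is interpreted as velocities.
    # The label "velocities" is an assumed name for that block.
    return ["velocities"] if has_blank_separated_block else []
```

Keeping the fallback in one place means a new section type only changes the contents of the declared list, not the control flow of the reader.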

6 Why per-frame energy in metadata (not per-atom section)

6.1 Alternatives considered

Per-atom energy section: an “Energies of Component” section with one scalar per atom. Useful for ML potentials that provide local energy decomposition.
Both: per-frame in metadata, optional per-atom section.

6.2 Decision

Most potentials (EMT, EAM, DFT) produce a total energy, not per-atom decomposition. Storing per-frame energy in the JSON metadata (energy key) is sufficient for the primary use case. Per-atom energies can be added as a future section type if ML potential adoption demands it.

The potential metadata key provides structured provenance: {"type":"EMT","params":{"cutoff":6.0}}. This makes energy and force values interpretable without external context.
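As an illustration of reading these keys, the sketch below pulls the energy and its provenance out of a metadata line. The energy value is made up for the example, and the helper name `energy_with_provenance` is hypothetical; only the key names and the `potential` object shape come from the text above.

```python
import json

# Illustrative line-2 content; the energy value is invented for this example.
LINE2 = (
    '{"con_spec_version":2,"energy":-3.25,'
    '"potential":{"type":"EMT","params":{"cutoff":6.0}}}'
)


def energy_with_provenance(line2: str):
    """Return (energy, potential_type, params) from a v2 metadata line.

    Missing keys come back as None or {} so that callers can tell
    "absent" apart from a genuine 0.0 energy.
    """
    meta = json.loads(line2)
    potential = meta.get("potential") or {}
    return meta.get("energy"), potential.get("type"), potential.get("params", {})
```

Returning the potential type alongside the number keeps an energy from being compared against values computed with a different model.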

7 Why gzip (not zstd, lz4, or bzip2)

7.1 Alternatives considered

zstd: better compression ratio and speed, but less ubiquitous. Not available in Python stdlib. Would require an additional dependency for every implementation.
lz4: fastest decompression, weakest compression. Not in Python stdlib.
bzip2: best compression ratio, slowest. In Python stdlib but rarely used for scientific data.

7.2 Decision

Gzip is available everywhere: Python stdlib, Rust flate2, C zlib, Fortran, Julia. Every Unix system has gzip and zcat. The magic bytes (0x1f 0x8b) are universally recognized. For the con format’s typical file sizes (KiB to low MiB), gzip’s compression ratio is adequate (60-80% reduction).

zstd support is available behind the optional zstd Cargo feature since v0.10.0, using the same magic-byte detection pattern (28 b5 2f fd for zstd frames; 1f 8b for gzip). .con.zst files are read and written through ConFrameWriter::from_path_zstd and the same read_all_frames entry point. Builds without the feature still detect zstd magic bytes and return a clear error pointing at the feature flag, so consumers never see a corrupt parse on a zstd file.
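The magic-byte detection pattern can be sketched in Python using only the standard library. This is not the readcon-core code; the function names are hypothetical, and the sketch mirrors the "clear error instead of a corrupt parse" behaviour by refusing zstd input, since zstd is assumed to be unavailable here.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def sniff_compression(head: bytes) -> str:
    """Classify a file by its leading bytes: "gzip", "zstd", or "plain"."""
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return "plain"


def decode_con_bytes(raw: bytes) -> str:
    """Return the text of a con file from raw bytes, decompressing gzip."""
    kind = sniff_compression(raw[:4])
    if kind == "gzip":
        return gzip.decompress(raw).decode("utf-8")
    if kind == "zstd":
        # Fail loudly rather than parsing compressed bytes as text.
        raise RuntimeError("zstd-compressed input; no zstd support in this sketch")
    return raw.decode("utf-8")
```

Because detection looks at content rather than the file extension, a renamed or extension-less file is still handled correctly.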

8 Migration guide for existing tools

8.1 Reading v2 files in a v1 reader

A v1 reader will:

Parse line 2 as a comment string (harmless).
Parse column 4 values: 0 (free) and 7 (fixed) work as intended. Values 2-6 are treated as “fixed” by any != 0 check. Value 1 still works.
Ignore the sections key (no JSON parsing).
Stop at the first blank line (velocity section), missing any force section that follows.

Degradation is safe: the reader gets coordinates and constraints (with reduced per-direction granularity) but misses forces.

8.2 Upgrading a v1 writer to v2

Emit {"con_spec_version":2} on line 2.
Emit 7 (not 1) for fully-fixed atoms.
Store and preserve atom_id (column 5) through read-write cycles.
If writing forces, add ,"sections":["forces"] to the JSON and append a force section after coordinates.
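The writer-side steps above that touch line 2 and column 4 can be sketched as follows. The helper names are hypothetical and the sketch covers only those two steps; preserving `atom_id` and emitting the force block itself are left to the surrounding writer.

```python
import json


def v2_metadata_line(has_forces: bool) -> str:
    """Build the JSON string for header line 2 of a v2 file."""
    meta = {"con_spec_version": 2}
    if has_forces:
        # Declare the force section that follows the coordinates.
        meta["sections"] = ["forces"]
    # Compact separators keep the metadata on one short line.
    return json.dumps(meta, separators=(",", ":"))


def upgrade_fixed_flag(v1_flag: int) -> int:
    """Map a v1 fixed flag to the canonical v2 value: 0 stays 0, non-zero becomes 7."""
    return 7 if v1_flag != 0 else 0
```

Without forces the function reproduces the minimal `{"con_spec_version":2}` line; with forces it appends the `sections` key in the form shown in the last step.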

8.3 Reference implementations

Implementation	Source	Lines	Language
eOn C++	`addl/referenceImpls/eon_cpp/` (archived)	592	C++
eOn Python	`pip install eon` / eOn repo `eon/fileio.py`	760	Python
ASE	`pip install ase` / `ase.io.eon`	307	Python

The C++ implementation is archived (March 2026 snapshot) because it requires the full eOn build system. The Python implementations are installable packages and do not need archiving.

All reference implementations support v1 only (no JSON metadata, no sections, no bitmask constraints).