Format Evolution and Design Rationale

1 Purpose

This document records the design decisions behind version 2 of the con file format, the alternatives considered, and the reasoning for each choice. It serves as institutional memory for contributors and as a reference for implementers in other languages.

2 Version 1 to version 2 feature matrix

Feature                     v1 (original eOn)  v2 (readcon-core 0.6.0+)
--------------------------  -----------------  --------------------------------
Machine-readable metadata   No                 JSON object on line 2
Spec version in file        No                 con_spec_version key
Column 5 semantics          Undefined          atom_id (pre-grouping index)
Per-direction constraints   No (single flag)   Bitmask in column 4 (0-7)
Declared data sections      No                 sections JSON key
Force data                  No                 Forces of Component blocks
Per-frame energy/potential  No                 energy, potential metadata keys
Convergence tracking        No                 convergence_fmax, converged keys
Unit declaration            No                 units metadata object
Trajectory ordering         No                 frame_index, time keys
NEB image identity          No                 neb_bead, neb_band keys
Compression                 No                 Transparent gzip (.con.gz)

3 Why JSON on line 2 (not elsewhere)

3.1 Alternatives considered

  1. New header line: adding a 10th header line breaks every existing parser that expects exactly 9 lines before atom data.

  2. Line 5 or 6 (postbox): these carry opaque simulation state in some eOn files (“0 0”, “218 0 1”). Overwriting them risks losing data in round-trips through tools that preserve those values.

  3. Line 1: the generator comment (“Generated by eOn”) is the most visible line. Tools and humans use it to identify the file origin, so replacing it with JSON would remove that cue.

  4. Separate sidecar file: a .con.meta JSON file avoids changing the format but introduces file-pairing problems (lost sidecar, out-of-sync data).

3.2 Decision

Line 2 was historically “Time” or empty in eOn files. No tool assigns it semantic meaning. The Python writer in eOn emits an empty string. The C++ writer round-trips whatever was there. Placing JSON on line 2 is invisible to old readers (they just see a different comment string) and detectable by new readers (starts with {).
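The detection rule can be sketched in a few lines of Python. This is a minimal illustration, not the readcon-core implementation: the function name parse_line2_metadata is hypothetical, and treating a malformed object as a plain comment is an assumption rather than something this document specifies.

```python
import json
from typing import Optional


def parse_line2_metadata(line: str) -> Optional[dict]:
    """Return the metadata dict if line 2 holds a JSON object, else None."""
    text = line.strip()
    if not text.startswith("{"):
        # v1 file: "Time", an empty string, or some other comment.
        return None
    try:
        meta = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: a malformed object is treated like a plain comment.
        return None
    return meta if isinstance(meta, dict) else None
```

A reader that gets None back simply continues as a v1 parser would.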

4 Why bitmask for constraints (not 3 separate columns)

4.1 Alternatives considered

  1. Three separate 0/1 columns: changes atom lines from 5 columns to 7, breaking all existing parsers.

  2. JSON metadata per-atom: storing fixed_directions as a list of 3-tuples in the metadata line would work but makes per-atom access expensive and splits constraint info across two locations.

  3. Separate constraint section: a “Constraints of Component” section adds complexity without benefit – the constraint is a property of the atom, not a separate dataset.

4.2 Decision

A 3-bit bitmask in the existing column 4 preserves the 5-column line format. Legacy value 1 (meaning “fully fixed”) maps to all-three-set ([true, true, true]). The canonical all-fixed value for new files is 7. Old readers that check != 0 for “is fixed” will treat any non-zero bitmask value as fixed, which is a safe degradation.
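A minimal sketch of decoding the column 4 value follows. The bit order (bit 0 for x, bit 1 for y, bit 2 for z) is an assumption made for illustration, since this document does not state it, and both function names are hypothetical. Under that assumed order the value 1 would otherwise read as "x only", so the sketch applies the legacy rule for 1 first.

```python
from typing import List


def decode_fixed(flag: int) -> List[bool]:
    """Decode column 4 into [fix_x, fix_y, fix_z]."""
    if not 0 <= flag <= 7:
        raise ValueError(f"constraint flag out of range: {flag}")
    if flag == 1:
        # Legacy v1 value: fully fixed.
        return [True, True, True]
    # Assumed bit order: bit 0 -> x, bit 1 -> y, bit 2 -> z.
    return [bool(flag & (1 << axis)) for axis in range(3)]


def v1_is_fixed(flag: int) -> bool:
    """The check an old reader applies: any non-zero value means fixed."""
    return flag != 0
```

With these definitions, a partially constrained atom such as flag 6 is reported as fixed by the v1 check, which is the degradation described above.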

5 Why JSON-declared sections (not positional)

5.1 Alternatives considered

  1. Fixed order: coordinates, then velocities, then forces – always in that order, detected by blank separators. Simple but rigid: new section types require all implementations to update their order.

  2. Tagged sections: each section starts with a line like #SECTION:forces. Adds a new syntactic element to the format.

  3. File extension encoding: .convelforce for velocity+force files. Combinatorial explosion of extensions.

5.2 Decision

The sections key in the JSON metadata declares which sections exist and their order. The parser reads exactly those sections. Benefits:

  • New section types require no format-level changes – just add a string to the array.

  • Section order is explicit, not implicit.

  • Legacy files without sections fall back to blank-separator velocity detection (backward compatible).

  • The writer auto-populates sections from the frame data, so users never need to set it manually.
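The two sides of this decision might look roughly like the sketch below. The frame is modelled as a plain dict, the section name "velocities" and the ordering in KNOWN_SECTIONS are assumptions (only "forces" appears in this document), and both function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Hypothetical section names and order; only "forces" appears in this document.
KNOWN_SECTIONS = ("velocities", "forces")


def sections_for_frame(frame: Dict[str, Any]) -> List[str]:
    """Writer side: list the optional sections a frame actually carries."""
    return [name for name in KNOWN_SECTIONS if frame.get(name) is not None]


def declared_sections(meta: Optional[Dict[str, Any]]) -> Optional[List[str]]:
    """Reader side: return the declared sections, or None for legacy fallback."""
    if not meta or "sections" not in meta:
        # Caller falls back to blank-separator velocity detection.
        return None
    return list(meta["sections"])
```

Returning None rather than an empty list keeps "no declaration" (legacy file) distinct from "declared, but no optional sections".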

6 Why per-frame energy in metadata (not per-atom section)

6.1 Alternatives considered

  1. Per-atom energy section: an “Energies of Component” section with one scalar per atom. Useful for ML potentials that provide local energy decomposition.

  2. Both: per-frame in metadata, optional per-atom section.

6.2 Decision

Most potentials (EMT, EAM, DFT) are used to produce a total energy, not a per-atom decomposition. Storing per-frame energy in the JSON metadata (energy key) is sufficient for the primary use case. Per-atom energies can be added as a future section type if ML potential adoption demands it.

The potential metadata key provides structured provenance: {"type":"EMT","params":{"cutoff":6.0}}. This makes energy and force values interpretable without external context.
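As an illustration, a metadata line carrying both keys could be assembled as follows. The function name build_metadata_line is hypothetical, and the energy value in the usage note is a placeholder rather than a computed result.

```python
import json


def build_metadata_line(energy: float, potential_type: str, params: dict) -> str:
    """Serialize per-frame energy and potential provenance for line 2."""
    meta = {
        "con_spec_version": 2,
        "energy": energy,
        "potential": {"type": potential_type, "params": params},
    }
    # Compact separators keep the object on a single line.
    return json.dumps(meta, separators=(",", ":"))
```

Calling build_metadata_line(-3.25, "EMT", {"cutoff": 6.0}) yields a single-line object whose potential entry matches the example above.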

7 Why gzip (not zstd, lz4, or bzip2)

7.1 Alternatives considered

  1. zstd: better compression ratio and speed, but less ubiquitous. It entered the Python standard library only in 3.14 (compression.zstd), so older interpreters and most other implementations would need an additional dependency.

  2. lz4: fastest decompression, weakest compression. Not in Python stdlib.

  3. bzip2: best compression ratio, slowest. In Python stdlib but rarely used for scientific data.

7.2 Decision

Gzip is available everywhere: Python stdlib, Rust flate2, C zlib, Fortran, Julia. Every Unix system has gzip and zcat. The magic bytes (0x1f 0x8b) are universally recognized. For the con format’s typical file sizes (KiB to low MiB), gzip’s compression ratio is adequate (60-80% reduction).

zstd support may be added behind an optional feature flag in the future, using the same magic-byte detection pattern.
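Magic-byte detection with the Python standard library can be sketched as below. The function name open_con is hypothetical, and UTF-8 text encoding is an assumption; the point is that the decision is made from the file content, not from the .gz extension.

```python
import gzip
from typing import TextIO

GZIP_MAGIC = b"\x1f\x8b"


def open_con(path: str) -> TextIO:
    """Open a plain or gzip-compressed con file for text reading."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    if magic == GZIP_MAGIC:
        # Compressed: decompress transparently while reading.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A second compression format would add another magic-byte comparison before the plain-text fallback.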

8 Migration guide for existing tools

8.1 Reading v2 files in a v1 reader

A v1 reader will:

  • Parse line 2 as a comment string (harmless).

  • Parse column 4 values: 0 (free) and 7 (fixed) work. Values 2-6 are treated as “fixed” by any != 0 check. Value 1 still works.

  • Ignore the sections key (no JSON parsing).

  • Stop at the first blank line (velocity section), missing any force section that follows.

Degradation is safe: the reader gets coordinates and constraints (with reduced per-direction granularity) but misses forces.

8.2 Upgrading a v1 writer to v2

  1. Emit {"con_spec_version":2} on line 2.

  2. Emit 7 (not 1) for fully-fixed atoms.

  3. Store and preserve atom_id (column 5) through read-write cycles.

  4. If writing forces, add ,"sections":["forces"] to the JSON and append a force section after coordinates.
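Steps 1, 2, and 4 can be sketched as two small helpers. Both function names are hypothetical, and step 3 (preserving atom_id) is not shown because it depends on how a given tool stores its atoms.

```python
import json


def v2_metadata_line(has_forces: bool = False) -> str:
    """Steps 1 and 4: the JSON object to emit on line 2."""
    meta = {"con_spec_version": 2}
    if has_forces:
        meta["sections"] = ["forces"]
    return json.dumps(meta, separators=(",", ":"))


def v2_fixed_flag(v1_flag: int) -> int:
    """Step 2: write 7 instead of the legacy 1 for fully fixed atoms."""
    return 7 if v1_flag == 1 else v1_flag
```

Existing v1 writers that only ever emit 0 or 1 in column 4 need no other change to their constraint handling.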

8.3 Reference implementations

Implementation  Source                                    Lines  Language
--------------  ----------------------------------------  -----  --------
eOn C++         addl/referenceImpls/eon_cpp/ (archived)   592    C++
eOn Python      pip install eon / eOn repo eon/fileio.py  760    Python
ASE             pip install ase / ase.io.eon              307    Python

The C++ implementation is archived (March 2026 snapshot) because it requires the full eOn build system. The Python implementations are installable packages and do not need archiving.

All reference implementations support v1 only (no JSON metadata, no sections, no bitmask constraints).