Format Evolution and Design Rationale
1 Purpose
This document records the design decisions behind the con file
format version 2, the alternatives considered, and the reasoning for
each choice. It serves as an institutional memory for contributors
and a reference for implementers in other languages.
2 Version 1 to version 2 feature matrix
| Feature | v1 (original eOn) | v2 (readcon-core 0.6.0+) |
|---|---|---|
| Machine-readable metadata | No | JSON object on line 2 |
| Spec version in file | No | |
| Column 5 semantics | Undefined | |
| Per-direction constraints | No (single flag) | Bitmask column 4 (0-7) |
| Declared data sections | No | |
| Force data | No | |
| Per-frame energy/potential | No | |
| Convergence tracking | No | |
| Unit declaration | No | |
| Trajectory ordering | No | |
| NEB image identity | No | |
| Compression | No | Transparent gzip (.con.gz) |
3 Why JSON on line 2 (not elsewhere)
3.1 Alternatives considered
New header line: adding a 10th header line breaks every existing parser that expects exactly 9 lines before atom data.
Line 5 or 6 (postbox): these carry opaque simulation state in some eOn files (“0 0”, “218 0 1”). Overwriting them risks losing data in round-trips through tools that preserve those values.
Line 1: the generator comment (“Generated by eOn”) is the most visible line. Tools and humans use it to identify the file origin.
Separate sidecar file: a `.con.meta` JSON file avoids changing the format but introduces file-pairing problems (lost sidecar, out-of-sync data).
3.2 Decision
Line 2 was historically “Time” or empty in eOn files. No tool assigns
it semantic meaning. The Python writer in eOn emits an empty string.
The C++ writer round-trips whatever was there. Placing JSON on line 2
is invisible to old readers (they just see a different comment string)
and detectable by new readers (it starts with `{`).
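
The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the readcon-core implementation: the function name `parse_line2` is hypothetical, and treating malformed JSON as a plain comment is an assumption rather than documented behaviour.

```python
import json


def parse_line2(line: str) -> dict:
    """Return v2 metadata from header line 2, or {} for a v1 comment line.

    Hypothetical helper for illustration only.
    """
    text = line.strip()
    if not text.startswith("{"):
        # v1 file: "Time", an empty string, or any other comment text.
        return {}
    try:
        meta = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: a line that starts with "{" but is not valid JSON
        # is handled like an ordinary comment.
        return {}
    return meta if isinstance(meta, dict) else {}
```

A v1 reader needs no equivalent of this function: it keeps treating line 2 as an opaque string.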
4 Why bitmask for constraints (not 3 separate columns)
4.1 Alternatives considered
Three separate 0/1 columns: changes atom lines from 5 columns to 7, breaking all existing parsers.
JSON metadata per-atom: storing `fixed_directions` as a list of 3-tuples in the metadata line would work but makes per-atom access expensive and splits constraint info across two locations.
Separate constraint section: a “Constraints of Component” section adds complexity without benefit – the constraint is a property of the atom, not a separate dataset.
4.2 Decision
A 3-bit bitmask in the existing column 4 preserves the 5-column line
format. Legacy value 1 (meaning “fully fixed”) maps to all-three-set
([true, true, true]). The canonical all-fixed value for new files is 7.
Old readers that check `!= 0` for “is fixed” will treat any non-zero
bitmask value as fixed, which is a safe degradation.
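
As a rough sketch of how the bitmask might be decoded and encoded in Python: the bit-to-axis assignment used here (bit 0 = x, bit 1 = y, bit 2 = z) is an assumption, since this document does not state it, and the `legacy` switch for interpreting value 1 is likewise hypothetical. Consult the format specification for the authoritative mapping.

```python
def decode_fixed(flag: int, legacy: bool = False) -> list:
    """Expand a column-4 value into [fix_x, fix_y, fix_z].

    Assumed bit layout (not confirmed by the text): 1 = x, 2 = y, 4 = z.
    """
    if not 0 <= flag <= 7:
        raise ValueError(f"constraint flag out of range: {flag}")
    if legacy and flag == 1:
        # Legacy files use 1 to mean "fully fixed".
        return [True, True, True]
    return [bool(flag & 1), bool(flag & 2), bool(flag & 4)]


def encode_fixed(fixed) -> int:
    """Pack [fix_x, fix_y, fix_z] into a 3-bit value; all-fixed gives 7."""
    fix_x, fix_y, fix_z = fixed
    return int(fix_x) | (int(fix_y) << 1) | (int(fix_z) << 2)


def v1_is_fixed(flag: int) -> bool:
    """What an old reader does: any non-zero value counts as fixed."""
    return flag != 0
```

The last function shows why the degradation is safe: every partially constrained atom is reported as fixed by an old reader, never as free.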
5 Why JSON-declared sections (not positional)
5.1 Alternatives considered
Fixed order: coordinates, then velocities, then forces – always in that order, detected by blank separators. Simple but rigid: new section types require all implementations to update their order.
Tagged sections: each section starts with a line like `#SECTION:forces`. Adds a new syntactic element to the format.
File extension encoding: `.convelforce` for velocity+force files. Combinatorial explosion of extensions.
5.2 Decision
The `sections` key in the JSON metadata declares which sections exist
and their order. The parser reads exactly those sections. Benefits:
New section types require no format-level changes – just add a string to the array.
Section order is explicit, not implicit.
Legacy files without `sections` fall back to blank-separator velocity detection (backward compatible).
The writer auto-populates `sections` from the frame data, so users never need to set it manually.
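
The reader and writer sides of this decision could look roughly like the following. Both function names are hypothetical, and the section name `"velocities"` and the velocities-before-forces order are assumptions; only `"forces"` appears as a section name in this document.

```python
def sections_to_read(meta: dict, blank_line_follows_coords: bool) -> list:
    """Return the ordered section names a reader should expect after coordinates."""
    declared = meta.get("sections")
    if declared is not None:
        # v2: the metadata is authoritative, including the order.
        return list(declared)
    # Legacy fallback: a blank separator after the coordinates
    # signals a velocity section ("velocities" is an assumed name).
    return ["velocities"] if blank_line_follows_coords else []


def sections_for_frame(has_velocities: bool, has_forces: bool) -> list:
    """Writer side: derive the `sections` value from what the frame carries."""
    names = []
    if has_velocities:
        names.append("velocities")  # assumed name and position
    if has_forces:
        names.append("forces")
    return names
```

Because the reader iterates over whatever names the metadata lists, supporting a new section type means registering one more name rather than changing the file layout.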
6 Why per-frame energy in metadata (not per-atom section)
6.1 Alternatives considered
Per-atom energy section: an “Energies of Component” section with one scalar per atom. Useful for ML potentials that provide local energy decomposition.
Both: per-frame in metadata, optional per-atom section.
6.2 Decision
Most potentials (EMT, EAM, DFT) produce a total energy, not per-atom
decomposition. Storing per-frame energy in the JSON metadata (`energy`
key) is sufficient for the primary use case. Per-atom energies can be
added as a future section type if ML potential adoption demands it.
The `potential` metadata key provides structured provenance:
`{"type":"EMT","params":{"cutoff":6.0}}`. This makes energy and force
values interpretable without external context.
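
To illustrate how a consumer might combine the two keys, here is a small Python sketch. The helper `describe_energy` and the energy value in the usage example are hypothetical; the `energy` and `potential` keys and the `type`/`params` layout follow the text above.

```python
def describe_energy(meta: dict) -> str:
    """Summarise a frame's energy together with its provenance."""
    energy = meta.get("energy")
    if energy is None:
        return "no energy recorded"
    potential = meta.get("potential") or {}
    kind = potential.get("type", "unknown potential")
    params = potential.get("params", {})
    # Sort parameters so the summary is stable across runs.
    detail = ", ".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{energy} from {kind}" + (f" ({detail})" if detail else "")


# Hypothetical metadata for one frame; the energy value is made up.
meta = {"energy": -3.5, "potential": {"type": "EMT", "params": {"cutoff": 6.0}}}
summary = describe_energy(meta)
```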
7 Why gzip (not zstd, lz4, or bzip2)
7.1 Alternatives considered
zstd: better compression ratio and speed, but less ubiquitous. Not available in Python stdlib. Would require an additional dependency for every implementation.
lz4: fastest decompression, weakest compression. Not in Python stdlib.
bzip2: best compression ratio, slowest. In Python stdlib but rarely used for scientific data.
7.2 Decision
Gzip is available everywhere: Python stdlib, Rust flate2, C zlib,
Fortran, Julia. Every Unix system has gzip and zcat. The magic
bytes (0x1f 0x8b) are universally recognized. For the con format’s
typical file sizes (KiB to low MiB), gzip’s compression ratio is
adequate (60-80% reduction).
zstd support may be added behind an optional feature flag in the future, using the same magic-byte detection pattern.
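
Magic-byte detection needs only the Python standard library. The sketch below is one way to do it, not the readcon-core code; `read_con_text` is a hypothetical name, and UTF-8 is an assumed encoding.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_con_text(path: str) -> str:
    """Read a con file as text, decompressing transparently if it is gzipped.

    Detection looks at the file content, not the extension, so a
    compressed file with a plain .con name is still handled.
    """
    with open(path, "rb") as fh:
        head = fh.read(2)
    opener = gzip.open if head == GZIP_MAGIC else open
    # Assumption: con files are UTF-8 (or plain ASCII) text.
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()
```

A future zstd path could slot into the same function by comparing the leading bytes against a second magic constant.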
8 Migration guide for existing tools
8.1 Reading v2 files in a v1 reader
A v1 reader will:
Parse line 2 as a comment string (harmless).
Parse column 4 values: 0 and 7 work (free and fixed). Values 2-6 are treated as “fixed” by any `!= 0` check. Value 1 still works.
Ignore the `sections` key (no JSON parsing).
Stop at the first blank line (velocity section), missing any force section that follows.
Degradation is safe: the reader gets coordinates and constraints (with reduced per-direction granularity) but misses forces.
8.2 Upgrading a v1 writer to v2
Emit `{"con_spec_version":2}` on line 2.
Emit 7 (not 1) for fully-fixed atoms.
Store and preserve `atom_id` (column 5) through read-write cycles.
If writing forces, add `,"sections":["forces"]` to the JSON and append a force section after coordinates.
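
The first, second, and fourth steps can be sketched in Python as follows. The helper names `v2_line2` and `v2_fixed_flag` are hypothetical, and compact JSON separators are used only so the output matches the strings shown above; preserving `atom_id` depends on the writer's own data model and is not shown.

```python
import json


def v2_line2(has_forces: bool = False) -> str:
    """Build the JSON metadata string for header line 2."""
    meta = {"con_spec_version": 2}
    if has_forces:
        meta["sections"] = ["forces"]
    # Compact separators keep the metadata on one short line.
    return json.dumps(meta, separators=(",", ":"))


def v2_fixed_flag(v1_flag: int) -> int:
    """Map a v1 fixed flag (0 or 1) to its v2 value (0 or 7)."""
    return 7 if v1_flag == 1 else v1_flag
```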
8.3 Reference implementations
| Implementation | Source | Lines | Language |
|---|---|---|---|
| eOn C++ | | 592 | C++ |
| eOn Python | | 760 | Python |
| ASE | | 307 | Python |
The C++ implementation is archived (March 2026 snapshot) because it requires the full eOn build system. The Python implementations are installable packages and do not need archiving.
All reference implementations support v1 only (no JSON metadata, no sections, no bitmask constraints).