This project provides tools to:
- Convert multi-sequence FASTA files into individual compressed `.seq` binary files.
- Encode DNA/RNA sequences using 2-bit representations.
- Preserve metadata, type (DNA/RNA), and support versioning.
- Reconstruct `.fasta` files from `.seq` binaries.
Offset | Size | Description |
---|---|---|
0 | 4 | Signature: "SEQ\x01" |
4 | 1 | Version (currently 0x01) |
5 | 4 | Metadata length (uint32_t) |
9 | N | Metadata string (FASTA header) |
9+N | 1 | Type: 1 = DNA, 2 = RNA |
10+N | 8 | Sequence length in bases (uint64_t) |
18+N | ? | Sequence data (2 bits per base, packed) |
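
The layout above can be sketched as a small header writer. This is a minimal illustration, not the project's actual code: the function names `put_le`, `seq_packed_bytes`, and `seq_write_header` are hypothetical, and little-endian byte order for the integer fields is an assumption, since the layout does not state one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store an unsigned integer in little-endian order.
   ASSUMPTION: the layout does not specify a byte order. */
void put_le(uint8_t *dst, uint64_t value, size_t nbytes) {
    for (size_t i = 0; i < nbytes; i++) {
        dst[i] = (uint8_t)(value >> (8 * i));
    }
}

/* Bytes needed for packed sequence data: 4 bases per byte, rounded up. */
uint64_t seq_packed_bytes(uint64_t n_bases) {
    return n_bases / 4 + (n_bases % 4 != 0);
}

/* Hypothetical helper: writes the 18+N header bytes into buf.
   Returns the header size, or 0 if buf is too small. */
size_t seq_write_header(uint8_t *buf, size_t cap,
                        const char *meta, uint32_t meta_len,
                        uint8_t type, uint64_t n_bases) {
    size_t total = 18 + (size_t)meta_len;
    if (cap < total) return 0;
    memcpy(buf, "SEQ\x01", 4);                /* offset 0: signature */
    buf[4] = 0x01;                            /* offset 4: version */
    put_le(buf + 5, meta_len, 4);             /* offset 5: metadata length */
    memcpy(buf + 9, meta, meta_len);          /* offset 9: metadata string */
    buf[9 + meta_len] = type;                 /* offset 9+N: 1 = DNA, 2 = RNA */
    put_le(buf + 10 + meta_len, n_bases, 8);  /* offset 10+N: length in bases */
    return total;                             /* sequence data starts at 18+N */
}
```

With a 5-byte header such as `>seq1`, the sequence data would start at offset 23, and a 57-base sequence would occupy 15 packed bytes if the final byte is padded.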
Bases are encoded as follows:
Base | Bits | Decimal |
---|---|---|
A | 00 | 0 |
C | 01 | 1 |
G | 10 | 2 |
T/U | 11 | 3 |
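
As a rough sketch of how the table translates into code, the functions below pack and unpack bases four to a byte. The names `base_to_bits`, `pack_bases`, and `unpack_bases` are hypothetical, and the bit order within each byte (first base in the two most significant bits) is an assumption that the table does not settle.

```c
#include <stddef.h>
#include <stdint.h>

/* Map a base to its 2-bit code from the table; -1 for anything else. */
int base_to_bits(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': case 'U': return 3;
        default:  return -1;
    }
}

/* Pack n bases into out, which must hold (n + 3) / 4 bytes.
   ASSUMPTION: the first base of each group of four occupies the two
   most significant bits. Returns 0 on success, -1 on an unsupported base. */
int pack_bases(const char *bases, size_t n, uint8_t *out) {
    for (size_t i = 0; i < (n + 3) / 4; i++) out[i] = 0;
    for (size_t i = 0; i < n; i++) {
        int code = base_to_bits(bases[i]);
        if (code < 0) return -1;
        out[i / 4] |= (uint8_t)(code << (6 - 2 * (i % 4)));
    }
    return 0;
}

/* Unpack n bases into out (n + 1 bytes, NUL-terminated).
   is_rna selects 'U' instead of 'T' for code 3. */
void unpack_bases(const uint8_t *in, size_t n, int is_rna, char *out) {
    const char *alphabet = is_rna ? "ACGU" : "ACGT";
    for (size_t i = 0; i < n; i++) {
        unsigned code = (in[i / 4] >> (6 - 2 * (i % 4))) & 3u;
        out[i] = alphabet[code];
    }
    out[n] = '\0';
}
```

Under this bit order, `ACGT` packs to the single byte `0x1B` (binary `00 01 10 11`). Unused bits in the final byte are left as zero.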
- A sequence is treated as RNA if it contains `U`, and its type field is set to 2.
- The encoder replaces `U` with `T` internally for 2-bit packing.
- Metadata is the entire FASTA header line (starting with `>`).
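
The detection rule can be illustrated with a single pass over the sequence. This is only a sketch: the function name `detect_type_and_normalize` and the `SEQ_TYPE_*` constants are made up for the example, and it assumes the sequence has already been converted to uppercase.

```c
#include <stddef.h>

enum { SEQ_TYPE_DNA = 1, SEQ_TYPE_RNA = 2 };

/* Returns SEQ_TYPE_RNA if any 'U' is present, otherwise SEQ_TYPE_DNA.
   Every 'U' is rewritten to 'T' in place so the packer only sees A/C/G/T.
   ASSUMPTION: seq is already uppercase. */
int detect_type_and_normalize(char *seq, size_t n) {
    int type = SEQ_TYPE_DNA;
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == 'U') {
            seq[i] = 'T';
            type = SEQ_TYPE_RNA;
        }
    }
    return type;
}
```

Because the type byte records that the input was RNA, a decoder can emit `U` again for code `11` when reconstructing the FASTA file.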
This program supports two modes based on the number of arguments:
# Converts FASTA to .seq files
./program <input.fasta>
# Converts a .seq file back to FASTA
./program <input.seq> <output.fasta>
- If only one argument is given, it treats it as a FASTA file and parses it into .seq files.
- If two arguments are given, it decodes the .seq file into a valid FASTA file.
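
A minimal sketch of this argument-count dispatch is shown below. The enum and the function name `select_mode` are hypothetical, and falling back to a usage message for any other argument count is an assumption; the behavior in that case is not described here.

```c
enum run_mode { MODE_USAGE, MODE_ENCODE, MODE_DECODE };

/* argc includes the program name, so one user argument means argc == 2. */
enum run_mode select_mode(int argc) {
    if (argc == 2) return MODE_ENCODE;  /* ./program <input.fasta> */
    if (argc == 3) return MODE_DECODE;  /* ./program <input.seq> <output.fasta> */
    return MODE_USAGE;                  /* ASSUMPTION: anything else prints usage */
}
```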
>example_sequence_53
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTA
The following demonstrates projected compression results when scaling to a large dataset:
- Original FASTA file size: 39.9 GB
- After `.seq` encoding: 9.8 GB
- After 7zip compression: 8.2 GB
This shows that with 2-bit encoding and efficient archiving:
- The encoded file is ~75% smaller than the original FASTA (9.8 GB vs. 39.9 GB), as expected when each 8-bit character is replaced by a 2-bit code.
- The archived `.dna` file further reduces storage requirements.
- Metadata remains accessible without decompressing the entire archive.
This scale makes the format practical for handling large genomic datasets such as entire chromosomes or transcriptome libraries.
- Combine multiple `.seq` files into a single `.dna` archive.
- Add `.dna` archive reader/unpacker.
- Optional compression for `.dna` archives with metadata.
- GUI tools.
Note: 7zip is a good fit because of its robust archiving capabilities, especially its support for bundling both binary `.seq` files and accompanying metadata (e.g., JSON) into a `.dna` archive. This allows efficient storage and distribution, while also enabling users, GUIs, or other tools to quickly extract metadata files without unpacking the entire (potentially large) archive.
FASTA files from real-world sources often contain edge cases that must be handled with care. Our format currently uses a compact 2-bit encoding scheme for DNA/RNA sequences, which assumes only four standard bases: `A`, `C`, `G`, and `T`/`U`.
However, actual sequence data frequently includes the following quirks:
- `N` is a standard IUPAC nucleotide code representing any base (A/C/G/T).
- Common in genome gaps, unresolved regions, and low-quality reads.
- Solution: We store the positions of `N` bases in the metadata to preserve full sequence information.
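
One possible way to track `N` positions is sketched below. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, and substituting the placeholder `A` for each `N` before packing is an assumption, since how the packed data fills those positions is not specified here. How the positions are then serialized into the metadata is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Records the index of every 'N' in seq into positions (capacity cap) and
   overwrites it with the placeholder 'A' so a 2-bit packer accepts it.
   ASSUMPTION: 'A' as the placeholder base.
   Returns the number of 'N' bases found, or (size_t)-1 if cap is too small
   (in which case seq may already be partially modified). */
size_t extract_n_positions(char *seq, size_t n,
                           uint64_t *positions, size_t cap) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (seq[i] != 'N') continue;
        if (count == cap) return (size_t)-1;
        positions[count++] = (uint64_t)i;
        seq[i] = 'A';
    }
    return count;
}

/* Puts 'N' back at the recorded positions after decoding. */
void restore_n_positions(char *seq, size_t n,
                         const uint64_t *positions, size_t count) {
    for (size_t k = 0; k < count; k++) {
        if (positions[k] < n) seq[positions[k]] = 'N';
    }
}
```

For long runs of `N` (such as genome gaps), storing start/length pairs instead of individual positions would likely be more compact, at the cost of a slightly more involved decoder.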
- Codes like `R`, `Y`, `S`, `W`, etc., represent multiple possible bases.
- These are less common but appear in some variant-rich datasets.
- Current Handling: These bases are not yet supported and will cause the encoder to skip or abort. A warning is logged.
- Future Plan: Consider expanding support or filtering them pre-encoding.
- Some FASTA files use lowercase letters to represent soft-masked regions (e.g., repeats).
- Current Handling: All bases are normalized to uppercase.
- Lines may be wrapped at different lengths (60, 80, or unwrapped).
- Files may include Windows (`\r\n`) or UNIX (`\n`) line endings.
- Current Handling: The parser reads sequences continuously, ignoring line breaks.
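
The continuous-read behavior, together with the uppercase normalization mentioned earlier, could look roughly like the following. The function name `join_sequence_lines` is hypothetical, and treating all whitespace (not only `\r` and `\n`) as ignorable is an assumption made for the sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies sequence characters from raw to out, dropping '\r', '\n' and any
   other whitespace, and uppercasing letters. out must hold raw_len + 1
   bytes. Returns the number of bases written.
   ASSUMPTION: raw holds only the sequence lines of one record, without
   the '>' header line. */
size_t join_sequence_lines(const char *raw, size_t raw_len, char *out) {
    size_t n = 0;
    for (size_t i = 0; i < raw_len; i++) {
        unsigned char c = (unsigned char)raw[i];
        if (isspace(c)) continue;
        out[n++] = (char)toupper(c);
    }
    out[n] = '\0';
    return n;
}
```

Because line breaks are dropped before encoding, the original wrap width is not recoverable from the sequence data alone; a decoder has to pick its own line length when writing FASTA.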
- Many FASTA files contain multiple records.
- Handling: Each sequence is extracted and encoded into a separate `.seq` file with metadata and type information.
Note: Our `.seq` encoder is intentionally strict to ensure reliable 2-bit packing. However, these quirks must be resolved (filtered, tracked, or logged) to avoid losing information or causing encoding failures.
Compile with a C compiler:
make
Apache 2.0 License