This Python script searches for occurrences of a given DNA motif in a FASTA genome file and outputs the results in BED format.
- Searches for motif occurrences on the forward strand by default.
- Optionally searches the reverse complement strand as well.
- Supports DNA motifs with IUPAC ambiguity codes (e.g. N, V, H, D, B, M, R, W, S, Y, K).
- Outputs BED file with motif locations and matched sequences.
- Supports output compression with the `.gz` extension.
- Progress bar shows processing progress based on total bases for smooth tracking.
- Prints summary of total matches and elapsed runtime.
- Minimal dependencies (`biopython` required).
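Searching the reverse strand with an ambiguous motif means complementing the ambiguity codes as well as the four plain bases. The snippet below is a minimal sketch of that step using only the standard library; the names `IUPAC_COMPLEMENT` and `reverse_complement` are hypothetical and are not taken from `find_motif.py`, which may rely on Biopython for this instead.

```python
# Illustrative helper, not the script's actual code.
# Each IUPAC code maps to the code that covers the complementary bases:
# A<->T, C<->G, V<->B, H<->D, M<->K, R<->Y; N, W and S are self-complementary.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTNVHDBMRWSYK",
    "TGCANBDHVKYWSRM",
)


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a motif written with IUPAC codes."""
    return motif.upper().translate(IUPAC_COMPLEMENT)[::-1]


if __name__ == "__main__":
    for motif in ("GATC", "NVH", "AAC"):
        print(motif, "->", reverse_complement(motif))
```

A palindromic motif such as GATC is its own reverse complement, so in that case a reverse-strand search would find the same positions as the forward search.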

Requires Python 3.x. Install the dependencies with:

`pip install biopython tqdm`

Run the script with:

`python find_motif.py --fasta genome.fa --out motifs.bed --motif GATC`

| Parameter | Description |
|---|---|
| `--fasta` | Input genome FASTA file (required) |
| `--out` | Output BED file path (required). Use `.gz` for gzip output |
| `--motif` | Motif sequence with IUPAC codes (required). Examples: GATC, NVH |
| `--reverse` | If set, also searches reverse complement strand |

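For readers who want to see how a command line like this is typically wired up, here is a minimal `argparse` sketch that mirrors the four parameters in the table. It is an assumption about the general shape, not a copy of the script's parser; the function name `build_parser` and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four options described in the table above."""
    parser = argparse.ArgumentParser(
        description="Find motif occurrences in a FASTA genome and write BED output."
    )
    parser.add_argument("--fasta", required=True, help="Input genome FASTA file")
    parser.add_argument(
        "--out", required=True, help="Output BED file path; use .gz for gzip output"
    )
    parser.add_argument(
        "--motif", required=True, help="Motif sequence with IUPAC codes, e.g. GATC or NVH"
    )
    # store_true makes --reverse a flag that defaults to False when omitted.
    parser.add_argument(
        "--reverse", action="store_true", help="Also search the reverse complement strand"
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--fasta", "genome.fa", "--out", "motifs.bed", "--motif", "GATC"]
)
print(args.fasta, args.out, args.motif, args.reverse)
```
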
| Code | Meaning | Bases Included |
|---|---|---|
| A | Adenine | A |
| C | Cytosine | C |
| G | Guanine | G |
| T (U) | Thymine (Uracil) | T |
| N | Any base | A, C, G, T |
| V | Not T | A, C, G |
| H | Not G | A, C, T |
| D | Not C | A, G, T |
| B | Not A | C, G, T |
| M | Amino | A, C |
| R | Purine | A, G |
| W | Weak | A, T |
| S | Strong | C, G |
| Y | Pyrimidine | C, T |
| K | Keto | G, T |
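
One common way to search with the codes in this table is to expand each code into a regular-expression character class. The sketch below shows that approach under stated assumptions: `IUPAC_BASES`, `motif_to_regex`, and `find_matches` are hypothetical names, and reporting overlapping matches (via a lookahead) is a design choice made here, not something the script is confirmed to do.

```python
import re

# Bases covered by each IUPAC code, as listed in the table above.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "N": "ACGT", "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC motif into a regex that also finds overlapping hits."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_BASES[code]  # raises KeyError for non-IUPAC characters
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    # A zero-width lookahead lets the scan advance one base at a time,
    # so matches that overlap each other are all reported.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_matches(sequence: str, motif: str) -> list:
    """Return (start, end, matched_sequence) tuples; start is 0-based, end exclusive."""
    pattern = motif_to_regex(motif)
    return [
        (m.start(), m.start() + len(motif), m.group(1))
        for m in pattern.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    print(find_matches("GATCGATC", "GATC"))
    print(find_matches("AAGT", "NVH"))
```

In the second call, only the window `AGT` at position 1 matches NVH: the window `AAG` at position 0 fails because H excludes G.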
The output file contains tab-separated columns:
`chrom  start  end  name  score  strand  matched_sequence`

- `start` is a 0-based coordinate.
- `end` is non-inclusive.
- `name` includes the motif, chromosome, and match index.
- `score` is always 0.
- `strand` is `+` or `-`.
- `matched_sequence` is the actual matched motif sequence from the genome.

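
The column layout above can be illustrated with a short sketch that formats one record and picks a plain or gzip writer from the output path. The helper names `open_output` and `format_bed_line` are hypothetical, and the exact `name` layout (`motif_chrom_index`) is an assumption: the source only says that the field includes the motif, chromosome, and match index.

```python
import gzip


def open_output(path: str):
    """Open a text-mode writer, gzip-compressed when the path ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "wt")
    return open(path, "w")


def format_bed_line(chrom, start, end, motif, index, strand, matched):
    """Format one tab-separated record with the seven columns described above."""
    name = f"{motif}_{chrom}_{index}"  # assumed layout of the name field
    score = "0"  # the score column is always 0
    return "\t".join([chrom, str(start), str(end), name, score, strand, matched])


if __name__ == "__main__":
    # A GATC hit covering bases 10..13 (0-based) is written as start=10, end=14.
    print(format_bed_line("chr1", 10, 14, "GATC", 1, "+", "GATC"))
```
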
Search for motif "GATC" on the plus strand and save compressed output:

`python find_motif.py --fasta dm6.fa --out dm6_GATC_motifs.bed.gz --motif GATC`

Search for a motif with ambiguity codes on both strands:

`python find_motif.py --fasta genome.fa --out motifs.bed --motif NVH --reverse`

MIT License

For suggestions or issues, please submit a GitHub issue.