Skip to content

feat: add circle annotations #24

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 30 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
62f8f36
feat: attempt at downloading via biomart (already out-dated, will dow…
dlaehnemann Jun 24, 2024
18afbb6
feat: use ensembl GFF3s instead of biomart for cirecle annotations, d…
dlaehnemann Jul 3, 2024
cf9f4bb
feat: naive circle annotation, all into the same table (will move to …
dlaehnemann Jul 3, 2024
e8ac6ec
feat: clean up transcript and gene annotations with biomart-based map…
dlaehnemann Jul 4, 2024
0ebe540
fix: integer conversion of columns in pandas
dlaehnemann Jul 4, 2024
2463c7f
adapt annotation and clean up related stuff
dlaehnemann Jul 4, 2024
a555717
minor cleanup
dlaehnemann Jul 4, 2024
c490f64
fix: typo in input name
dlaehnemann Jul 10, 2024
2b0d315
fix: use latest ref download wrappers, with fix for ensembl-regulation
dlaehnemann Jul 10, 2024
dd80d0b
perf: use latest ensembl version in config.yaml, to have stable avail…
dlaehnemann Jul 10, 2024
5c882bf
fix: update column names to new consistent naming scheme
dlaehnemann Jul 16, 2024
720c77f
fix: typo in emsembl_id column name
dlaehnemann Jul 16, 2024
56eb1e6
fix: try dynamic resource specification for rule create_annotation_gff
dlaehnemann Aug 16, 2024
18649c0
chore: rename circles_overview table in datavzrd report template
dlaehnemann Aug 16, 2024
f29db36
chore: snakefmt
dlaehnemann Jun 11, 2025
745687b
perf: make all download rules localrule: True
dlaehnemann Jun 11, 2025
ba4da0e
ci: update main github actions jobs
dlaehnemann Jun 11, 2025
db089ac
ci: also update super-linter
dlaehnemann Jun 12, 2025
e3315e3
ci: fetch depth 0 for super-linter repo checkout
dlaehnemann Jun 12, 2025
8f07ad2
perf: update datavzrd
dlaehnemann Jun 12, 2025
cdcefd3
Merge branch 'feat/add-circle-annotations' of github.com:snakemake-wo…
dlaehnemann Jun 12, 2025
24d860e
feat: use new biomart-table and ensembl-table wrappers to download an…
dlaehnemann Jun 12, 2025
930090d
feat: start expanding datavzrd rendering of circles
dlaehnemann Jun 12, 2025
eee26e3
Merge branch 'feat/add-circle-annotations' of github.com:snakemake-wo…
dlaehnemann Jun 12, 2025
b8ec1f9
fix: file naming cleanup
dlaehnemann Jun 12, 2025
bf658b9
chore: remove unused script
dlaehnemann Jun 12, 2025
a20bb6a
chore: snakefmt
dlaehnemann Jun 12, 2025
d01eb49
ci: avoid ensembl release 112, as it seems to be missing some biomart…
dlaehnemann Jun 12, 2025
88bdb58
fix: ensure directory creation and correct file paths during annotation
dlaehnemann Jun 16, 2025
dc6b69d
fix: typo in datavzrd config
dlaehnemann Jun 16, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .test/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9,13 +9,13 @@ ref:
# Ensembl species name
species: homo_sapiens
# Ensembl release
release: 108
release: 112
# Genome build
build: GRCh38
# Optionally, instead of downloading the whole reference from Ensembl via the
# parameters above, specify a specific chromosome below and uncomment the line.
# This is usually only relevant for testing.
chromosome: 7
#chromosome: 7

# These filters mostly correspond to the output columns of Circle-Map:
# https://github.com/iprada/Circle-Map/wiki/Circle-Map-Realign-output-files
Expand Down
2 changes: 1 addition & 1 deletion config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ ref:
# Ensembl species name
species: homo_sapiens
# Ensembl release
release: 108
release: 112
# Genome build
build: GRCh38
# Optionally, instead of downloading the whole reference from Ensembl via the
Expand Down
7 changes: 7 additions & 0 deletions workflow/envs/annotatr.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
channels:
- conda-forge
- bioconda
- nodefaults
dependencies:
- bioconductor-annotatr =1.28
- r-tidyverse = 2.0
7 changes: 7 additions & 0 deletions workflow/envs/biomart.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
channels:
- conda-forge
- bioconda
- nodefaults
dependencies:
- bioconductor-biomart =2.58
- r-tidyverse = 2.0
7 changes: 7 additions & 0 deletions workflow/envs/rtracklayer.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
channels:
- conda-forge
- bioconda
- nodefaults
dependencies:
- bioconductor-rtracklayer =1.62
- r-tidyverse = 2.0
16 changes: 16 additions & 0 deletions workflow/rules/circle_map.smk
Original file line number Diff line number Diff line change
Expand Up @@ -76,3 +76,19 @@ rule clean_circle_map_realign_output:
max_circle_length=config["circle_filtering"]["max_circle_length"],
script:
"../scripts/clean_circle_map_realign_output.py"


rule annotate_cleaned_circles:
input:
all_annotations="resources/all_annotations.harmonized.gff3.gz",
tsv="results/circle-map/{sample}.circles.cleaned.tsv",
output:
tsvs=directory("results/circle-map/{sample}.circles.cleaned.annotated/"),
log:
"logs/circle-map/{sample}.circles.cleaned.annotated.logs",
conda:
"../envs/annotatr.yaml"
params:
build=config["ref"]["build"],
script:
"../scripts/annotate_cleaned_circles.R"
6 changes: 6 additions & 0 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,12 @@ def get_adapters(wildcards):
].get("adapters", ""),


def get_bioc_species_name():
first_letter = config["ref"]["species"][0]
subspecies = config["ref"]["species"].split("_")[1]
return first_letter + subspecies


def get_bwa_extra(wildcards):
"""
Denote sample name and platform in read group.
Expand Down
66 changes: 66 additions & 0 deletions workflow/rules/ref.smk
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ rule get_genome:
release=config["ref"]["release"],
chromosome=config["ref"].get("chromosome"),
cache: "omit-software"
localrule: True
wrapper:
"v1.21.2/bio/reference/ensembl-sequence"

Expand Down Expand Up @@ -69,6 +70,7 @@ rule get_known_variants:
type="all",
chromosome=config["ref"].get("chromosome"),
cache: "omit-software"
localrule: True
wrapper:
"v1.21.2/bio/reference/ensembl-variation"

Expand Down Expand Up @@ -99,3 +101,67 @@ rule tabix_known_variants:
"-p vcf",
wrapper:
"v1.21.2/bio/tabix/index"


rule get_annotation:
output:
"resources/genomic_annotations.gff3.gz",
params:
species=config["ref"]["species"],
release=config["ref"]["release"],
build=config["ref"]["build"],
log:
"logs/get-annotation.log",
cache: "omit-software"
localrule: True
wrapper:
"v3.13.6/bio/reference/ensembl-annotation"


rule get_regulatory_features_gff3_gz:
output:
"resources/regulatory_annotations.gff3.gz", # presence of .gz determines if downloaded is kept compressed
params:
species=config["ref"]["species"],
release=config["ref"]["release"],
build=config["ref"]["build"],
log:
"logs/get_regulatory_features.log",
cache: "omit-software" # save space and time with between workflow caching (see docs)
localrule: True
wrapper:
"v3.13.6/bio/reference/ensembl-regulation"


rule create_transcripts_to_genes_mappings:
output:
mapping="resources/transcripts_to_genes_mappings.tsv.gz",
params:
species=get_bioc_species_name(),
release=config["ref"]["release"],
build=config["ref"]["build"],
chromosome=config["ref"].get("chromosome", ""),
log:
"logs/transcripts_to_genes_mappings.log",
conda:
"../envs/biomart.yaml"
cache: "omit-software" # save space and time with between workflow caching (see docs)
script:
"../scripts/create_transcripts_to_genes_mappings.R"


rule create_annotation_gff:
input:
genomic_annotations="resources/genomic_annotations.gff3.gz",
mapping="resources/transcripts_to_genes_mappings.tsv.gz",
regulatory_annotations="resources/regulatory_annotations.gff3.gz",
output:
all_annotations="resources/all_annotations.harmonized.gff3.gz",
log:
"logs/all_annotations.harmonized.gff3.log",
conda:
"../envs/rtracklayer.yaml"
params:
build=config["ref"]["build"],
script:
"../scripts/create_annotation_gff3.R"
108 changes: 108 additions & 0 deletions workflow/scripts/annotate_cleaned_circles.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
rlang::global_entrace()

library("annotatr")
library("GenomicRanges")

circles <- read_tsv(
snakemake@input[["tsv"]]
)

genome_build <- snakemake@params[["build"]]

circles_gr <- GRanges(
seqnames = pull(circles, region),
ecDNA_status = "ecDNA"
)

genome(circles_gr) <- genome_build

Comment on lines +17 to +23
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

GRanges construction is missing start/end coordinates → runtime error

GRanges() requires a ranges argument (IRanges). Supplying only seqnames will throw
argument "ranges" is missing, with no default”.

Parse the region column first:

-regions = pull(circles, region),
-ecDNA_status = "ecDNA"
+seqnames = sub(":.*", "", circles$region),
+ranges   = IRanges(
+    start = as.integer(sub(".*:(\\d+)-.*", "\\1", circles$region)),
+    end   = as.integer(sub(".*-(\\d+)$", "\\1", circles$region))
+),
+ecDNA_status = "ecDNA"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In workflow/scripts/annotate_cleaned_circles.R around lines 17 to 23, the
GRanges object is constructed without specifying the required start and end
coordinates, causing a runtime error. Extract the start and end positions from
the 'region' column (which likely contains genomic coordinates) and create an
IRanges object to pass as the 'ranges' argument in GRanges. This will fix the
missing argument error and correctly define the genomic ranges.

overlapping_annotations <- rtracklayer::import(
snakemake@input[["all_annotations"]],
which = circles_gr, #import only intervals overlapping circles
genome = genome_build
)

annotated_circles <- annotate_regions(
regions = circles_gr,
annotations = overlapping_annotations,
ignore.strand = TRUE,
quiet = FALSE
) |>
as_tibble() |>
dplyr::select(
-c(
strand,
width,
ecDNA_status,
annot.width,
annot.source,
annot.score,
annot.phase
)
) |>
arrange(
annot.seqnames,
annot.start,
annot.end
) |>
mutate(
circle_region = str_c(
seqnames,
":",
start,
"-",
end
),
region = str_c(
annot.seqnames,
":",
annot.start,
"-",
annot.end
)
) |>
dplyr::select(
-c(
seqnames,
start,
end,
annot.seqnames,
annot.start,
annot.end
)
) |>
dplyr::rename(
exon_rank = annot.rank
) |>
dplyr::rename_with(
~ str_replace(.x, fixed("annot."), "")
) |>
dplyr::select(
region,
type,
id,
name,
parent_type,
parent_id,
exon_rank,
circle_region,
) |>
group_by(
circle_region
) |>
group_walk(
~ write_tsv(
.x,
file = file.path(
str_c(
str_replace_all(.y$circle_region, "[:-]", "_"),
".tsv"
)
)
)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Files written outside the declared output directory

The call

file = file.path(str_c(str_replace_all(.y$circle_region, "[:-]", "_"), ".tsv"))

creates files in getwd(). The Snakemake rule expects them under the output directory.

Add:

outdir <- snakemake@params[["outdir"]]
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
...
file = file.path(outdir,
                 str_c(str_replace_all(.y$circle_region, "[:-]", "_"), ".tsv"))
🤖 Prompt for AI Agents
In workflow/scripts/annotate_cleaned_circles.R around lines 98 to 108, the code
writes output files directly to the working directory instead of the Snakemake
output directory. To fix this, retrieve the output directory path from
snakemake@params[["outdir"]], create the directory if it doesn't exist using
dir.create with recursive=TRUE and showWarnings=FALSE, and update the file path
in write_tsv to write files inside this output directory by prepending outdir to
the file path.

2 changes: 1 addition & 1 deletion workflow/scripts/clean_circle_map_realign_output.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@
]

# turn int cols into int
circles.loc[:, int_cols] = circles.loc[:, int_cols].round(0).applymap(lambda v: int(v) if not pd.isna(v) else pd.NA)
circles[int_cols] = circles[int_cols].round(0).map(lambda v: int(v) if not pd.isna(v) else pd.NA)

Comment on lines 35 to 37
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

.map() on a DataFrame raises AttributeError – use .applymap() or column-wise logic

DataFrame objects do not implement .map; only Series do.
This line will therefore crash before any output is written.

-circles[int_cols] = circles[int_cols].round(0).map(lambda v: int(v) if not pd.isna(v) else pd.NA)
+# convert each value to integer while keeping NA
+circles[int_cols] = (
+    circles[int_cols]
+        .round(0)
+        .applymap(lambda v: pd.NA if pd.isna(v) else int(v))
+)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# turn int cols into int
circles.loc[:, int_cols] = circles.loc[:, int_cols].round(0).applymap(lambda v: int(v) if not pd.isna(v) else pd.NA)
circles[int_cols] = circles[int_cols].round(0).map(lambda v: int(v) if not pd.isna(v) else pd.NA)
# turn int cols into int
# convert each value to integer while keeping NA
circles[int_cols] = (
circles[int_cols]
.round(0)
.applymap(lambda v: pd.NA if pd.isna(v) else int(v))
)
🤖 Prompt for AI Agents
In workflow/scripts/clean_circle_map_realign_output.py around lines 35 to 37,
the code incorrectly uses .map() on a DataFrame, which causes an AttributeError
because .map() is only available on Series. To fix this, replace .map() with
.applymap() to apply the lambda function element-wise across the DataFrame, or
alternatively apply the lambda function column-wise to each Series in int_cols.

# filter out low-quality circles, according to:
# https://github.com/iprada/Circle-Map/wiki/Circle-Map-Realign-output-files
Expand Down
Loading