
Commit d4af227

Dev (#773)
* Delete "family" tasks (#761) * Delete project tasks * cleanup * ruff format * well * rename * hacking away * almost there! * ruff * Fix missing updates change * ruff * Remove debug code * remove bad merge * more precision in test * project table * allow for missing project * remove some unnecessary checks * test already deleted family * add comment
* Delete Project & Family Table Tasks. (#767) * Delete project tasks * cleanup * ruff format * well * rename * hacking away * almost there! * ruff * Fix missing updates change * ruff * Remove debug code * remove bad merge * more precision in test * project table * allow for missing project * remove some unnecessary checks * test already deleted family * Lots of renames * More updates * Sketch * Flesh out test * fix paths * Rename base hail table * a bunch more renames * delete project table * Add delete project families * add comment * test it! * Fix * add dep * some missing tasks
* [optimization] read family tables directly from project table. (#769) * Delete project tasks * cleanup * ruff format * well * rename * hacking away * almost there! * ruff * Fix missing updates change * ruff * Remove debug code * remove bad merge * more precision in test * project table * allow for missing project * remove some unnecessary checks * test already deleted family * Lots of renames * More updates * Sketch * Flesh out test * fix paths * Rename base hail table * a bunch more renames * delete project table * Add delete project families * is it that simple? * add comment * test it! * Fix * add dep
* Ensure rows are deleted after deleting samples! (#770) * Delete project tasks * cleanup * ruff format * well * rename * hacking away * almost there! * ruff * Fix missing updates change * ruff * Remove debug code * remove bad merge * more precision in test * project table * allow for missing project * remove some unnecessary checks * test already deleted family * Lots of renames * More updates * Sketch * Flesh out test * fix paths * Rename base hail table * a bunch more renames * delete project table * Add delete project families * add comment * test it! * Fix * add dep * Lookup table filtering * Ensure rows with no projects/families defined are removed * ruff * remove mock * Remove mocks from args * tweak tests
* VEP 110 docker image and dataproc init script (#758) * Add VEP docker image * simplify * bump version * Add cloudbuild * first pass * a bit of cleanup * ws * ws * A few tweaks * twiddle options * Bunch of config * working! * Update vep-GRCh38.json * Update vep-110-GRCh38.sh * missing slash * more VEP * some vep cleanup * Remove genesplicer
1 parent 1b9c5a5 commit d4af227

50 files changed: +1369 -246 lines changed

.cloudbuild/vep-docker.cloudbuild.yaml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# Run locally with:
+#
+# gcloud builds submit --quiet --substitutions='_VEP_VERSION=110' --config .cloudbuild/vep-docker.cloudbuild.yaml v03_pipeline/
+steps:
+  - name: 'gcr.io/kaniko-project/executor:v1.3.0'
+    args:
+      - --destination=gcr.io/seqr-project/vep-docker-image:${_VEP_VERSION}
+      - --dockerfile=deploy/Dockerfile.vep
+      - --cache=true
+      - --cache-ttl=168h
+
+timeout: 1800s

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ version = {attr = "v03_pipeline.__version__"}

 [tool.setuptools.packages.find]
 include = ["v03_pipeline*"]
-exclude = ["v03_pipeline.bin", "v03_pipeline*test*"]
+exclude = ["v03_pipeline.bin", "v03_pipeline.deploy", "v03_pipeline*test*"]
 namespaces = false

 [tool.mypy]

v03_pipeline/bin/vep-110-GRCh38.sh

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+#
+# VEP init action for dataproc
+#
+# adapted/copied from
+# https://github.com/broadinstitute/gnomad_methods/blob/main/init_scripts/vep105-init.sh
+# and gs://hail-common/hailctl/dataproc/0.2.128/vep-GRCh38.sh
+#
+
+set -x
+
+export PROJECT="$(gcloud config get-value project)"
+export VEP_CONFIG_PATH="$(/usr/share/google/get_metadata_value attributes/VEP_CONFIG_PATH)"
+export VEP_REPLICATE="$(/usr/share/google/get_metadata_value attributes/VEP_REPLICATE)"
+export ASSEMBLY=GRCh38
+export VEP_DOCKER_IMAGE=gcr.io/seqr-project/vep-docker-image
+
+mkdir -p /vep_data
+
+# Install docker
+apt-get update
+apt-get -y install \
+    apt-transport-https \
+    ca-certificates \
+    curl \
+    gnupg2 \
+    software-properties-common \
+    tabix
+curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
+sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
+sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
+apt-get update
+apt-get install -y --allow-unauthenticated docker-ce
+
+# https://github.com/hail-is/hail/issues/12936
+sleep 60
+sudo service docker restart
+
+# Copied from the repo at v03_pipeline/var/vep_config
+gcloud storage cp --billing-project $PROJECT gs://seqr-reference-data/vep/110/vep-${ASSEMBLY}.json $VEP_CONFIG_PATH
+
+# Copied from the UTRAnnotator repo (https://github.com/ImperialCardioGenetics/UTRannotator/tree/master)
+gcloud storage cp --billing-project $PROJECT gs://seqr-reference-data/vep/110/uORF_5UTR_${ASSEMBLY}_PUBLIC.txt /vep_data/ &
+
+# Raw data files copied from the bucket (https://console.cloud.google.com/storage/browser/dm_alphamissense;tab=objects?prefix=&forceOnObjectsSortingFiltering=false)
+# Some investigation led us to want to combine the canonical and non-canonical transcript tsvs (run inside the VEP docker container):
+# cat AlphaMissense_hg38.tsv.gz | gunzip | grep -v '#' | awk 'BEGIN { OFS = "\t" };{$6=""; print $0}' > AlphaMissense_combined_hg38.tsv
+# cat AlphaMissense_isoforms_hg38.tsv.gz | gunzip | grep -v '#' >> AlphaMissense_combined_hg38.tsv
+# cat AlphaMissense_combined_hg38.tsv | sort --parallel=12 --buffer-size=20G -k1,1 -k2,2n > AlphaMissense_combined_sorted_hg38.tsv
+# cat AlphaMissense_combined_sorted_hg38.tsv | sed '1i #CHROM\tPOS\tREF\tALT\tgenome\ttranscript_id\tprotein_variant\tam_pathogenicity\tam_class' > AlphaMissense_hg38.tsv
+# bgzip AlphaMissense_hg38.tsv
+# tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz
+gcloud storage cp --billing-project $PROJECT 'gs://seqr-reference-data/vep/110/AlphaMissense_hg38.tsv.*' /vep_data/ &
+
+gcloud storage cat --billing-project $PROJECT gs://seqr-reference-data/vep_data/loftee-beta/${ASSEMBLY}.tar | tar -xf - -C /vep_data/ &
+
+# Copied from ftp://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_merged_vep_110_${ASSEMBLY}.tar.gz
+gcloud storage cat --billing-project $PROJECT gs://seqr-reference-data/vep/110/homo_sapiens_vep_110_${ASSEMBLY}.tar.gz | tar -xzf - -C /vep_data/ &
+
+# Generated with:
+# curl -O ftp://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa.gz > Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa.gz
+# gzip -d Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa.gz
+# bgzip Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa
+# samtools faidx Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa.gz
+gcloud storage cp --billing-project $PROJECT 'gs://seqr-reference-data/vep/110/Homo_sapiens.${ASSEMBLY}.dna.primary_assembly.fa.*' /vep_data/ &
+docker pull ${VEP_DOCKER_IMAGE} &
+wait
+
+cat >/vep.c <<EOF
+#include <unistd.h>
+#include <stdio.h>
+
+int
+main(int argc, char *const argv[]) {
+    if (setuid(geteuid()))
+        perror( "setuid" );
+
+    execv("/vep.sh", argv);
+    return 0;
+}
+EOF
+gcc -Wall -Werror -O2 /vep.c -o /vep
+chmod u+s /vep
+
+cat >/vep.sh <<EOF
+#!/bin/bash
+
+docker run -i -v /vep_data/:/opt/vep/.vep/:ro ${VEP_DOCKER_IMAGE} \
+  /opt/vep/src/ensembl-vep/vep "\$@"
+EOF
+chmod +x /vep.sh
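
The init action above only provisions the cluster; annotation itself is requested through Hail. The snippet below is a minimal sketch (not part of this commit) of how a pipeline step might call VEP on such a cluster, assuming Hail's hl.vep entry point; the input table path and the config location are placeholder assumptions (the real VEP_CONFIG_PATH is a cluster metadata attribute not shown in this diff).

import hail as hl

# Hypothetical usage sketch; both paths below are placeholders, not values from this commit.
ht = hl.read_table('gs://my-bucket/new_variants.ht')
# Assume the init action staged the VEP config JSON at /vep_data/vep-GRCh38.json.
annotated_ht = hl.vep(ht, config='file:///vep_data/vep-GRCh38.json')
annotated_ht.describe()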

v03_pipeline/deploy/Dockerfile.vep

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+FROM ubuntu:18.04 as build
+
+# Adapted from https://hub.docker.com/layers/konradjk/vep95_loftee/latest/images/sha256-d5f1a155293412acb5af4811142ba6907bad1cd708ca4000528f6317b784440e?context=explore
+# and https://github.com/broadinstitute/gnomad_methods/blob/main/docker_files/Dockerfile_VEP105
+RUN apt-get update && apt-get -y install wget libncurses5-dev libncursesw5-dev libbz2-dev liblzma-dev build-essential libz-dev git
+RUN wget https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 && tar xjvf samtools-1.7.tar.bz2 && cd samtools-1.7 && make && make install
+RUN git clone -b grch38 https://github.com/konradjk/loftee.git
+
+FROM ensemblorg/ensembl-vep:release_110.1 as runtime
+RUN cpanm DBD::SQLite
+COPY --from=build /usr/local/bin/samtools /usr/local/bin/samtools
+# semantics of mv vs COPY are different such that we don't need the '*' when moving files.
+COPY --from=build /loftee/ /plugins

v03_pipeline/lib/misc/lookup.py

Lines changed: 7 additions & 6 deletions
@@ -98,15 +98,15 @@ def remove_family_guids(
             )
         ),
     )
-    project_i = ht.project_guids.index(
-        project_guid,
-    ) # double reference because new expression
+    ht = ht.filter(
+        hl.any(ht.project_stats.map(lambda fs: hl.any(fs.map(hl.is_defined)))),
+    )
     return ht.annotate_globals(
         project_families=hl.dict(
-            hl.enumerate(ht.project_families.items()).starmap(
-                lambda i, item: (
+            ht.project_families.items().map(
+                lambda item: (
                     hl.if_else(
-                        i != project_i,
+                        item[0] != project_guid,
                         item,
                         (
                             item[0],
@@ -140,6 +140,7 @@ def remove_project(
             )
         ),
     )
+    ht = ht.filter(hl.any(ht.project_stats.map(hl.is_defined)))
    return ht.annotate_globals(
        project_guids=project_indexes_to_keep.map(
            lambda i: ht.project_guids[i],
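
The two new ht.filter calls implement the "Ensure rows with no projects/families defined are removed" item from the commit message: a lookup row survives only while at least one stats entry is still defined. Below is a small self-contained sketch (not from the repo) of the nested predicate used in remove_family_guids, run on a toy table whose shape mirrors the lookup table.

import hail as hl

# Toy lookup rows: project_stats is an array (per project) of arrays (per family)
# of stats structs, with None marking an already-deleted family.
ht = hl.Table.parallelize(
    [
        {'id': 0, 'project_stats': [[None, None], [None]]},
        {'id': 1, 'project_stats': [[None, hl.Struct(ref_samples=1)], [None]]},
    ],
    hl.tstruct(
        id=hl.tint32,
        project_stats=hl.tarray(hl.tarray(hl.tstruct(ref_samples=hl.tint32))),
    ),
)
# Same predicate as the new filter in remove_family_guids: keep a row only if any
# family stats struct in any project is still defined.
ht = ht.filter(
    hl.any(ht.project_stats.map(lambda fs: hl.any(fs.map(hl.is_defined)))),
)
ht.show()  # only the row with id=1 survives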

v03_pipeline/lib/misc/lookup_test.py

Lines changed: 10 additions & 48 deletions
@@ -89,23 +89,15 @@ def test_compute_callset_lookup_ht(self) -> None:
             ],
         )

-    def test_remove_new_callset_family_guids(self) -> None:
+    def test_remove_family_guids(self) -> None:
         lookup_ht = hl.Table.parallelize(
             [
                 {
                     'id': 0,
                     'project_stats': [
                         [
-                            hl.Struct(
-                                ref_samples=0,
-                                heteroplasmic_samples=0,
-                                homoplasmic_samples=0,
-                            ),
-                            hl.Struct(
-                                ref_samples=1,
-                                heteroplasmic_samples=1,
-                                homoplasmic_samples=1,
-                            ),
+                            None,
+                            None,
                             hl.Struct(
                                 ref_samples=2,
                                 heteroplasmic_samples=2,
@@ -125,11 +117,7 @@ def test_remove_new_callset_family_guids(self) -> None:
                     'id': 1,
                     'project_stats': [
                         [
-                            hl.Struct(
-                                ref_samples=0,
-                                heteroplasmic_samples=0,
-                                homoplasmic_samples=0,
-                            ),
+                            None,
                             hl.Struct(
                                 ref_samples=1,
                                 heteroplasmic_samples=1,
@@ -179,6 +167,11 @@ def test_remove_new_callset_family_guids(self) -> None:
             'project_a',
             hl.set(['3', '1']),
         )
+        lookup_ht = remove_family_guids(
+            lookup_ht,
+            'project_a',
+            hl.set(['1']),
+        )
         lookup_ht = remove_family_guids(
             lookup_ht,
             'project_b',
@@ -196,19 +189,6 @@ def test_remove_new_callset_family_guids(self) -> None:
         self.assertCountEqual(
             lookup_ht.collect(),
             [
-                hl.Struct(
-                    id=0,
-                    project_stats=[
-                        [
-                            hl.Struct(
-                                ref_samples=1,
-                                heteroplasmic_samples=1,
-                                homoplasmic_samples=1,
-                            ),
-                        ],
-                        [],
-                    ],
-                ),
                 hl.Struct(
                     id=1,
                     project_stats=[
@@ -248,13 +228,7 @@ def test_remove_project(self) -> None:
                                 homoplasmic_samples=2,
                             ),
                         ],
-                        [
-                            hl.Struct(
-                                ref_samples=3,
-                                heteroplasmic_samples=3,
-                                homoplasmic_samples=3,
-                            ),
-                        ],
+                        None,
                     ],
                 },
                 {
@@ -325,18 +299,6 @@ def test_remove_project(self) -> None:
         self.assertCountEqual(
             lookup_ht.collect(),
             [
-                hl.Struct(
-                    id=0,
-                    project_stats=[
-                        [
-                            hl.Struct(
-                                ref_samples=3,
-                                heteroplasmic_samples=3,
-                                homoplasmic_samples=3,
-                            ),
-                        ],
-                    ],
-                ),
                 hl.Struct(
                     id=1,
                     project_stats=[

v03_pipeline/lib/tasks/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -4,10 +4,16 @@
 from v03_pipeline.lib.tasks.update_lookup_table import (
     UpdateLookupTableTask,
 )
+from v03_pipeline.lib.tasks.update_lookup_table_with_deleted_families import (
+    UpdateLookupTableWithDeletedFamiliesTask,
+)
 from v03_pipeline.lib.tasks.update_lookup_table_with_deleted_project import (
     UpdateLookupTableWithDeletedProjectTask,
 )
 from v03_pipeline.lib.tasks.update_project_table import UpdateProjectTableTask
+from v03_pipeline.lib.tasks.update_variant_annotations_table_with_deleted_families import (
+    UpdateVariantAnnotationsTableWithDeletedFamiliesTask,
+)
 from v03_pipeline.lib.tasks.update_variant_annotations_table_with_deleted_project import (
     UpdateVariantAnnotationsTableWithDeletedProjectTask,
 )
@@ -23,8 +29,10 @@
     'UpdateProjectTableTask',
     'UpdateLookupTableTask',
     'UpdateLookupTableWithDeletedProjectTask',
+    'UpdateLookupTableWithDeletedFamiliesTask',
     'UpdateVariantAnnotationsTableWithNewSamplesTask',
     'UpdateVariantAnnotationsTableWithDeletedProjectTask',
+    'UpdateVariantAnnotationsTableWithDeletedFamiliesTask',
     'WriteCachedReferenceDatasetQuery',
     'WriteMetadataForRunTask',
     'WriteProjectFamilyTablesTask',
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+import hailtop.fs as hfs
+
+from v03_pipeline.lib.logger import get_logger
+from v03_pipeline.lib.tasks.base.base_hail_table import BaseHailTableTask
+from v03_pipeline.lib.tasks.files import GCSorLocalFolderTarget, GCSorLocalTarget
+
+logger = get_logger(__name__)
+
+
+class BaseDeleteTableTask(BaseHailTableTask):
+    def complete(self) -> bool:
+        logger.info(f'DeleteTableTask: checking if {self.output().path} exists')
+        return (
+            not GCSorLocalTarget(self.output().path).exists()
+            and not GCSorLocalFolderTarget(self.output().path).exists()
+        )
+
+    def run(self) -> None:
+        hfs.rmtree(self.output().path)
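
To show how this base class is meant to be consumed, here is a hypothetical subclass (not part of this commit), assuming the pipeline's luigi-style task interface: complete() is inherited and reports the task done once the target is gone, so a concrete delete task only needs to name the table it removes. The module path for BaseDeleteTableTask and the table_path parameter are assumptions, since the new file's path is not shown in this view.

import luigi

# Assumed module path for the new base class; the actual file name is not shown in this diff.
from v03_pipeline.lib.tasks.base.base_delete_table import BaseDeleteTableTask
from v03_pipeline.lib.tasks.files import GCSorLocalTarget


class DeleteExampleTableTask(BaseDeleteTableTask):
    # Hypothetical parameter naming the Hail table to remove.
    table_path = luigi.Parameter()

    def output(self) -> luigi.Target:
        return GCSorLocalTarget(self.table_path)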

v03_pipeline/lib/tasks/base/base_update_task.py renamed to v03_pipeline/lib/tasks/base/base_update.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 import hail as hl

 from v03_pipeline.lib.misc.io import write
-from v03_pipeline.lib.tasks.base.base_hail_table_task import BaseHailTableTask
+from v03_pipeline.lib.tasks.base.base_hail_table import BaseHailTableTask


 class BaseUpdateTask(BaseHailTableTask):
