autosome-ru/greco-bit-data-processing

This repository contains various scripts used when preprocessing data at different stages of the MEX project.

IMPORTANT

Note that these scripts do not contain production-ready, polished code, and they were not intended to be used outside of the context of the MEX/Codebook project. They depend on multiple interim files and require a specific arrangement of the source files; thus, we do not guarantee that the code will be functional on its own.

During the course of the project we had to act iteratively: include new experiments and motifs, add new benchmarking strategies, fix mistakes detected in metadata, and perform manual curation that excluded some data. Thus, the scripts in this repository bear the scars of multiple corrections and hotfixes.

Yet, we consider this repo a useful reference to showcase the pipelines and preprocessing strategies used when assembling MEX and curating the underlying data.

In case of particular questions, please contact Ilya Vorontsov (@VorontsovIE, vorontsov.i.e@gmail.com).

Please refer to the MEX manuscript [https://www.biorxiv.org/content/10.1101/2024.11.11.622097v1] and MEX Zenodo repo [https://zenodo.org/records/15667805] for polished production-ready data and a detailed description of the underlying procedures.

General structure

In the first stage we prepared the datasets. We then ran the benchmarks, and finally performed multiple stages of postprocessing.
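For orientation, here is a minimal sketch of the top-level order of operations, assuming everything is run from the repository root (exact arguments, interim files, and parallelization are omitted):

```bash
# Stage 1: dataset preparation (detailed below)
./process.sh

# Stage 2: benchmarking (generates runner files with benchmark commands)
./postprocessing/motif_metrics.sh

# Stage 3: postprocessing (collect metrics, compute ranks, pack artifacts)
ruby postprocessing/reformat_metrics.rb
./postprocessing/final_pack.sh
```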

Dataset preparation starts with the process.sh file. In this file we create a pool of random dataset names, which are later assigned to the datasets. We then run the process_data.sh / process.rb files from process_peaks_CHS_AFS, process_reads_HTS_SMS_AFS, and process_PBM, as well as the process_data_AFS_peaks.sh, process_data_AFS_reads.sh, and process_data_CHS.sh files. The collect_metadata.rb script aggregates information about the datasets from metadata files and other sources.
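The name pool itself is generated inside process.sh; the following sketch only illustrates the idea, with the pool size, name length, and output file chosen for illustration rather than taken from the actual script:

```bash
# Generate a pool of unique random lowercase names; the names are later
# assigned to datasets. Pool size (1000), name length (8), and the output
# path are illustrative values, not the ones used in process.sh.
for i in $(seq 1 1000); do
  tr -dc 'a-z' < /dev/urandom | head -c 8
  echo
done | sort -u > dataset_name_pool.txt
```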

For PBM files we also derive our own motifs (process_PBM/process_motifs.sh). The other motifs were obtained by other groups using their own custom code and tools; that code is not provided here.

Then we collect these and all the remaining motifs under proper names. Motifs that were named incorrectly have to be renamed, see postprocessing/rename_motifs.rb.
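The actual renaming logic lives in rename_motifs.rb; as a purely hypothetical illustration of a mapping-based rename, assuming a two-column tab-separated table of old and new basenames:

```bash
# Hypothetical sketch: rename motif files according to a tab-separated
# mapping file (old_basename <TAB> new_basename); not the real procedure.
while IFS=$'\t' read -r old new; do
  [ -f "motifs/${old}" ] && mv "motifs/${old}" "motifs/${new}"
done < rename_mapping.tsv
```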

The postprocessing/motif_metrics.sh file contains a set of commands that create benchmark runners: lists of shell commands invoking the benchmarks, which are meant to be run in parallel. The scripts ./postprocessing/filter_motif_in_flanks.sh and ./calculate_artifact_similarities.sh filter out some bad motifs. Then postprocessing/reformat_metrics.rb and make_ranks are used to collect the metrics and rankings of motifs. Finally, postprocessing/final_pack.sh collects all the artifacts into a single file.
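A runner is just a plain list of independent shell commands, one per line, which makes it trivial to execute with GNU parallel (see Software requirements). A sketch of the pattern, where run_benchmark.sh and the file layout are hypothetical stand-ins:

```bash
# Emit one benchmark command per motif into a runner file.
# run_benchmark.sh and the *.ppm layout are placeholders for illustration.
for motif in motifs/*.ppm; do
  echo "./run_benchmark.sh ${motif}"
done > benchmark_runner.sh
```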

Software requirements

Most scripts are written in Ruby, shell (bash dialect), and Python. The Gemfile and requirements.txt files list the libraries necessary to run these scripts.
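Assuming the standard Bundler and pip workflows, the dependencies can be installed as follows:

```bash
# Ruby libraries listed in the Gemfile
bundle install

# Python libraries listed in requirements.txt
pip install -r requirements.txt
```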

Benchmarks are containerized, so Docker has to be installed. The corresponding Docker images are pulled automatically from Docker Hub; their source files are stored in the motif_benchmarks repository.
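A containerized benchmark is typically invoked along these lines; the image name, mount layout, and arguments below are placeholders, since the real invocations are generated by the scripts above:

```bash
# Placeholder image and arguments; the actual images are on Docker Hub and
# are built from the motif_benchmarks repository.
docker run --rm -v "$PWD":/workdir example/motif-benchmark:latest \
  /workdir/motifs/EXAMPLE.ppm /workdir/datasets/EXAMPLE.fa
```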

Many calculations are parallelized via GNU parallel. We adjust the number of threads based on the available computational resources.
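For example, a runner file produced at the benchmarking step can be executed with a job count matched to the machine:

```bash
# Run one command per line from the runner file, with as many concurrent
# jobs as there are CPU cores; lower --jobs on shared machines.
parallel --jobs "$(nproc)" < benchmark_runner.sh
```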