GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.
Install the package from PyPI
pip install genomicarraysBuilding a GenomicArray generates 3 TileDB files in the specified output directory:
feature_annotation: A TileDB file containing input feature intervals.sample_metadata: A TileDB file containing sample metadata, each BigWig file is considered a sample.- A matrix TileDB file named by the
layer_matrix_nameparameter. This allows the package to store multiple different matrices, e.g. 'coverage', 'some_computed_statistic', for the same interval, and sample metadata attributes.
The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
To build a GenomicArray from a collection of BigWig files:
import numpy as np
import tempfile
import genomicarrays as garr
# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()
# List BigWig paths
bw_dir = "your/biwig/dir"
files = os.listdir(bw_dir)
bw_files = [f"{bw_dir}/{f}" for f in files]
features = pd.DataFrame({
"seqnames": ["chr1", "chr1"],
"starts": [1000, 2000],
"ends": [1500, 2500]
})
# Build GenomicArray
dataset = garr.build_genomicarray(
files=bw_files,
output_path=tempdir,
features=features,
# Specify a fasta file to extract sequences
# for each region in features
genome_fasta="path/to/genome.fasta",
# agg function to summarize mutiple values
# from bigwig within an input feature interval.
feature_annotation_options=garr.FeatureAnnotationOptions(
aggregate_function = np.nanmean
),
# for parallel processing multiple bigwig files
num_threads=4
)Note
- The aggregate function is expected to return either a scalar value or a 1-dimensional NumPy ndarray. If the later, users need to specify the expected dimension of the return array. e.g.
feature_annotation_options=garr.FeatureAnnotationOptions( aggregate_function = my_custom_func, expected_agg_function_length = 10, ),
- The build process stores missing intervals from a bigwig file as
np.nan. The default is to choose an aggregate functions that works withnp.nan.
Users have the option to reuse the dataset object retuned when building the arrays or by creating a GenomicArrayDataset object by initializing it to the path where the files were created.
# Create a GenomicArrayDataset object from the existing dataset
dataset = GenomicArrayDataset(dataset_path=tempdir)
# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]
print(expression_data.matrix)
print(expression_data.feature_annotation) ## output 1
array([[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , nan]], dtype=float32)
## output 2
seqnames starts ends genarr_feature_index
0 chr1 300 315 0
1 chr1 320 335 1
2 chr1 340 355 2
3 chr1 360 375 3
4 chr1 380 395 4
5 chr1 400 415 5
6 chr1 420 435 6
7 chr1 440 455 7
8 chr1 460 475 8
9 chr1 480 495 9
10 chr1 500 515 10
This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
