indexing-using-pyspark

In this file I index a VCF-format file into Elasticsearch. VCF (Variant Call Format) is a text file format, usually stored compressed. It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
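
To make the file layout concrete, here is a minimal sketch of turning one VCF data line into a Python dict. It relies only on the eight fixed, tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_line` and the dict keys are illustrative assumptions, not necessarily what indexvcf.py uses.

```python
def parse_vcf_line(line):
    """Turn one VCF data line into a dict; return None for meta, header, or blank lines."""
    line = line.rstrip("\n")
    # Meta-information lines start with "##" and the header line with "#".
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        return None  # not a complete data line
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": flt,
        "info": info,
    }


sample = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"
record = parse_vcf_line(sample)
```

The INFO column is kept as a raw string here; splitting it into key/value pairs is a natural next step if individual annotations need to be searchable.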

Here I use the elasticsearch-hadoop JAR as the connector between Spark and Elasticsearch (http://download.elastic.co/hadoop/elasticsearch-hadoop-6.1.3.zip), together with the Elasticsearch client library for Python (pip install elasticsearch).
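
The connector is driven by a small set of `es.*` settings passed to Spark. Below is a rough sketch of building that configuration; the helper name `build_es_conf`, the host, the index name `vcf_index`, and the type name `variant` are placeholders I made up for illustration.

```python
# The JAR from the zip above is usually handed to Spark at launch time, e.g.
#   spark-submit --jars path/to/elasticsearch-hadoop-6.1.3.jar indexvcf.py
# (the path is a placeholder).

def build_es_conf(nodes, port, index, doc_type):
    """Build the es-hadoop settings dict used when writing an RDD to Elasticsearch."""
    return {
        "es.nodes": nodes,                                # Elasticsearch host(s)
        "es.port": str(port),                             # REST port, as a string
        "es.resource": "{}/{}".format(index, doc_type),   # target index/type
        "es.input.json": "yes",                           # values are JSON strings
    }


es_conf = build_es_conf("localhost", 9200, "vcf_index", "variant")
```

Setting `es.input.json` tells the connector that each record is already a serialized JSON document, which keeps the Python side simple.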

I first read the raw file from HDFS, transform it with Spark RDD operations, and convert the records into a format that the Elasticsearch connector can write.
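
The read-and-transform step can be sketched roughly as follows. This assumes an existing SparkContext `sc` and a connector configured to accept JSON strings; the function names, the HDFS path, and the choice of fields are hypothetical and may differ from indexvcf.py.

```python
import json


def to_es_pair(line):
    """Convert one VCF data line into a (key, JSON string) pair for the connector."""
    fields = line.rstrip("\n").split("\t")
    doc = {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4],
    }
    # The key is ignored when writing; only the JSON value is indexed.
    return ("key", json.dumps(doc))


def build_es_rdd(sc, hdfs_path):
    """Read a VCF file from HDFS and return an RDD of (key, JSON) pairs."""
    lines = sc.textFile(hdfs_path)
    # Drop blank lines plus the meta-information and header lines (they start with "#").
    data_lines = lines.filter(lambda l: l and not l.startswith("#"))
    return data_lines.map(to_es_pair)


# Example call (path is a placeholder):
# es_rdd = build_es_rdd(sc, "hdfs:///data/sample_vcf_file.vcf")
```

Both `filter` and `map` are lazy, so nothing is read from HDFS until the RDD is saved.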

After that, I check whether the index already exists in Elasticsearch using the Python elasticsearch library; if it does not, I create it with settings suited to a low-memory cluster. Finally, I save the RDD into Elasticsearch through the ES-Hadoop connector.
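
A minimal sketch of these two steps is shown below. It assumes `es` is an `elasticsearch.Elasticsearch` client from a 6.x-era release (where `indices.create` accepts `body`) and that the RDD holds (key, JSON string) pairs. The single-shard, zero-replica settings are my guess at what "low-memory" means here, and the helper names are hypothetical.

```python
def ensure_index(es, index, shards=1, replicas=0):
    """Create the index if it is missing; return True when it was created."""
    if es.indices.exists(index=index):
        return False
    # Assumed low-memory settings: one primary shard, no replicas.
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    es.indices.create(index=index, body=body)
    return True


def save_to_es(rdd, es_conf):
    """Write an RDD of (key, value) pairs to Elasticsearch via es-hadoop."""
    rdd.saveAsNewAPIHadoopFile(
        path="-",  # unused by the connector, but required by the API
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf,
    )


# Example wiring (host, index, and type names are placeholders):
# from elasticsearch import Elasticsearch
# es = Elasticsearch(["localhost:9200"])
# ensure_index(es, "vcf_index")
# save_to_es(es_rdd, {"es.nodes": "localhost", "es.port": "9200",
#                     "es.resource": "vcf_index/variant", "es.input.json": "yes"})
```

Creating the index up front lets you control shard and replica counts, rather than relying on the defaults that Elasticsearch applies when the connector auto-creates an index.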

aashishchauhan06/indexing-using-pyspark