Open-sourcing an internal extension for Apache Spark that we developed at DynamicYield.
- The extension addresses the problem of loading persisted data while still providing its partitioner.
- Usually, when you save a partitioned RDD to HDFS/S3 and load it the next day (or on the next iteration) in a different Spark context, the information about its partitioner is lost, as the sketch below shows.
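A minimal sketch of the symptom, using object files for brevity (the path is a placeholder; the same happens with Avro or sequence files):

import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PartitionerLostExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partitioner-lost"));

        JavaPairRDD<String, Integer> counts = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, Integer>("a", 1),
                new Tuple2<String, Integer>("b", 2)));

        // Partition by key and persist to storage.
        JavaPairRDD<String, Integer> partitioned = counts.partitionBy(new HashPartitioner(16));
        System.out.println(partitioned.rdd().partitioner().isDefined()); // true
        partitioned.saveAsObjectFile("hdfs:///data/day1"); // placeholder path

        // Next day, in a fresh Spark context: the data comes back, the partitioner does not.
        JavaRDD<Tuple2<String, Integer>> raw = sc.objectFile("hdfs:///data/day1");
        JavaPairRDD<String, Integer> reloaded = JavaPairRDD.fromJavaRDD(raw);
        System.out.println(reloaded.rdd().partitioner().isDefined()); // false, so the next join/groupByKey shuffles again
    }
}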
- The extension is based on a Spark PR by Imran Rashid that was never merged into the master branch.
- Important detail: you must disable input splits first. Usually you take your InputFormat, extend it, and override the isSplitable method, e.g.:
import org.apache.avro.mapreduce.AvroKeyValueInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;

public class AvroKeyValueInputFormatNotSplittable<K, V> extends AvroKeyValueInputFormat<K, V> {
    // Never split a file, so each input file maps to exactly one partition.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
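Splits have to be disabled because the extension assumes each saved partition comes back as exactly one input partition; if a file were split, one saved partition would turn into several, and the assumed key-to-partition mapping would no longer hold.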
- For a basic usage example, see AssumedPartitionsPairRDDFunctionsSuite.scala; a rough sketch of the load path follows.
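Purely as an illustration, loading with the non-splittable format might look roughly like this; the path, the Avro key/value types, and the final partitioner-attaching step are assumptions on our part (the suite above shows the real API):

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LoadWithAssumedPartitionerSketch {
    @SuppressWarnings("unchecked")
    public static JavaPairRDD<AvroKey<String>, AvroValue<Long>> load(JavaSparkContext sc) {
        Configuration conf = new Configuration(); // Avro reader schemas would be set here via AvroJob

        // Load with the non-splittable format so each file becomes exactly one partition.
        JavaPairRDD<AvroKey<String>, AvroValue<Long>> pairs = sc.newAPIHadoopFile(
                "hdfs:///data/day1", // placeholder path
                (Class<AvroKeyValueInputFormatNotSplittable<String, Long>>) (Class<?>) AvroKeyValueInputFormatNotSplittable.class,
                (Class<AvroKey<String>>) (Class<?>) AvroKey.class,
                (Class<AvroValue<Long>>) (Class<?>) AvroValue.class,
                conf);

        // Hypothetical final step: the extension re-attaches the original partitioner to `pairs`.
        // See AssumedPartitionsPairRDDFunctionsSuite.scala for the actual call.
        return pairs;
    }
}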
- The current code works with Spark 1.5.1.
- In general, each Spark version may require changes to this code.
We are not Scala experts, so please be gentle with our Scala code. In general, our Spark code base is in Java 7, and we write Scala only in special cases: when we need to add a Java API wrapper over a Scala API, or to implement an extension to core Spark. Thanks for understanding, and suggestions are welcome!