Spark Data Source for OpenStreetMap Protobuf (aka OSM PBF) files.
This library was heavily inspired by:
The main differences between spark-osmpbf and spark-osm-datasource are:
spark-osmpbfprocesses the input in the main thread whilespark-osm-datasourceis multithreaded.spark-osmpbfapplies early stage filtering based on selected OSM entities.
Include spark-osmpbf dependency and write:
val df = spark.read
.format("io.github.igorgatis.spark.osmpbf")
.load("path/to/file.pbf")The spark-osmpbf default schema was based on osm-parquetizer. It includes
an extra entity_type string field which is self-explanatory:
root
|-- entity_type: string (nullable = false)
|-- id: long (nullable = false)
|-- version: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- changeset: long (nullable = true)
|-- uid: integer (nullable = true)
|-- user_sid: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- index: integer (nullable = false)
| | |-- nodeId: long (nullable = false)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- member_id: long (nullable = false)
| | |-- role: string (nullable = true)
If you use osm-parquetizer, you can probably use spark-osmpbf without
changing much code, reading directly from .pbf files.
spark-osmpbf comes with OsmPbfOptions class which allows easy
configuration. In the example below, dataset will read only Node entities,
it will ignore metadata fields such as version, changeset, etc. It will
also read tags as a map.
val df = spark.read
.format(OsmPbfOptions.FORMAT)
.options(new OsmPbfOptions
.withType("node")
.withTagsAsMap(true)
.withExcludeMetadata(true)
.toMap)
.load("path/to/file.pbf")
df.printSchemaHere is the output schema:
root
|-- entity_type: string (nullable = false)
|-- id: long (nullable = false)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)