
Commit 70159b2

Add JSON Reader (#10)

* Add WIP JSON Reader
* mark progress with reader
* remove JsonType and directly use Arrow's DataType
* support arrays
* use JSON reader from Arrow
* remove redundant JSON test files

2 parents: a83aa0b + c4f7c44

File tree (4 files changed: +90 −3 lines)

- Cargo.toml
- README.md
- src/dataframe.rs
- src/io/JSON.md

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ authors = ["Neville Dipale <nevilledips@gmail.com>"]
 edition = "2018"
 
 [dependencies]
-arrow = { git = "https://github.com/apache/arrow"}
+arrow = { git = "https://github.com/nevi-me/arrow", branch="rust/json-reader"}
 # arrow = { path = "../../arrow/rust/arrow"}
 num = "0.2"
 num-traits = "0.2"

README.md

Lines changed: 22 additions & 2 deletions
@@ -29,8 +29,28 @@ One can think of this library partly as a playground for features that could for
 
 ## Status
 
+### IO
+
+Building this library while Arrow still has very limited IO options has been painful. To that end, we are implementing IO for some formats ourselves, and we may contribute some of these upstream once we are happy with the implementation details. For now we're targeting CSV, JSON, and perhaps other simple file formats.
+
+**Note on Feather:** Feather file format support can be considered deprecated in favour of Arrow IPC. Though we have implemented Feather, it is meant as a stop-gap until Arrow supports IPC (in Rust). We'll try to tackle this in the coming months.
+
+- IO Support
+  - [ ] CSV
+    - [X] Read
+    - [ ] Write
+  - [ ] JSON
+    - [X] Read (submitted to Arrow)
+    - [ ] Write
+  - [ ] Feather
+    - [X] Read
+    - [X] Write (**do not use**: the current limitation with slicing arrays means we write each record batch as a separate file, instead of a single file for all the data)
+
+### Functionality
+
 - DataFrame Operations
-  - [x] Read CSV into dataframe
+  <!-- - [x] Read CSV into dataframe -->
   - [X] Select single column
   - [ ] Select subset of columns, drop columns
   - [X] Add or remove columns
@@ -67,4 +87,4 @@ One can think of this library partly as a playground for features that could for
 
 ## Performance
 
-We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk.
+We plan on providing simple benchmarks in the near future, especially once we gain the ability to save dataframes to disk; specifically, once the CSV and JSON writers are implemented.

src/dataframe.rs

Lines changed: 44 additions & 0 deletions
@@ -8,8 +8,11 @@ use arrow::csv::Reader as CsvReader;
 use arrow::csv::ReaderBuilder as CsvReaderBuilder;
 use arrow::datatypes::*;
 use arrow::error::ArrowError;
+use arrow::json::Reader as JsonReader;
+use arrow::json::ReaderBuilder as JsonReaderBuilder;
 use arrow::record_batch::RecordBatch;
 use std::fs::File;
+use std::io::BufReader;
 use std::sync::Arc;
 
 use crate::error::DataFrameError;
@@ -363,6 +366,47 @@ impl DataFrame {
         })
     }
 
+    pub fn from_json(path: &str, schema: Option<Arc<Schema>>) -> Self {
+        let file = File::open(path).unwrap();
+        // use the caller's schema if one was provided, otherwise infer it from the file
+        let mut reader = match schema {
+            Some(schema) => JsonReader::new(BufReader::new(file), schema, 1024, None),
+            None => {
+                let builder = JsonReaderBuilder::new()
+                    .infer_schema(None)
+                    .with_batch_size(1024);
+                builder.build::<File>(file).unwrap()
+            }
+        };
+        // collect record batches until the reader is exhausted (or errors)
+        let mut batches: Vec<RecordBatch> = vec![];
+        let mut has_next = true;
+        while has_next {
+            match reader.next() {
+                Ok(Some(batch)) => {
+                    batches.push(batch);
+                }
+                Ok(None) => {
+                    has_next = false;
+                }
+                Err(_) => {
+                    has_next = false;
+                }
+            }
+        }
+
+        // take the schema from the first batch; this panics if the file yielded no batches
+        let schema: Arc<Schema> = batches[0].schema().clone();
+
+        // convert to an arrow table
+        let table = crate::table::Table::from_record_batches(schema.clone(), batches);
+
+        DataFrame {
+            schema,
+            columns: table.columns,
+        }
+    }
+
     /// Write dataframe to a feather file
     ///
     /// Data is currently written as individual batches (as Arrow doesn't yet support slicing).
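
For reference, here is a minimal sketch of how the new constructor might be called. This is an illustration rather than code from the commit: the file path, field names, and types are hypothetical, and the explicit-schema variant simply exercises the `Option<Arc<Schema>>` parameter shown above.

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};

use crate::dataframe::DataFrame;

fn example() {
    // Let the reader infer the schema by scanning the file (path is hypothetical).
    let _inferred = DataFrame::from_json("data/people.json", None);

    // Or supply an explicit schema to skip inference; the fields are made up.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, true),
    ]));
    let _explicit = DataFrame::from_json("data/people.json", Some(schema));
}
```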

src/io/JSON.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+## JSON IO
+
+Keeping track of what's supported upstream in Arrow's JSON reader.
+
+### Reader
+
+- [X] Read user schema
+- [X] Support numeric types other than `Float64, Int64`
+- [X] Infer schema (basic)
+  - [X] Infer list schema
+  - [ ] Infer struct schema
+  - [ ] Coerce fields that have both scalars and lists to lists
+- [X] Support projection using field names
+  - [ ] Add option for dealing with case sensitivity
+- [X] Coerce fields that can't be cast to the provided schema (e.g. if a field can't be read as int because some values are floats, convert the ints to floats instead of leaving nulls)
+- [ ] Reduce repetition where possible
+- [ ] Parity with the C++ implementation (there's a Google Doc that has the spec)
+- [ ] Add comprehensive tests
+  - [X] Nulls at various places
+  - [ ] *All* supported Arrow types
+  - [X] Corrupt files and non-line-delimited files
+  - [ ] Files where schemas don't match at all
+  - [ ] Performance testing?
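
To make the checklist concrete, here is a minimal sketch of driving the reader directly, using only the builder calls that appear in `from_json` above. The path is hypothetical, and this is the API on the `rust/json-reader` branch, so it may change before landing upstream.

```rust
use std::fs::File;

use arrow::json::ReaderBuilder;

fn print_batch_sizes(path: &str) {
    let file = File::open(path).unwrap();
    // Infer the schema by scanning the file (None = no limit on records read),
    // then read the data back in batches of 1024 rows.
    let mut reader = ReaderBuilder::new()
        .infer_schema(None)
        .with_batch_size(1024)
        .build::<File>(file)
        .unwrap();
    // next() yields Ok(Some(batch)) until the file is exhausted.
    while let Ok(Some(batch)) = reader.next() {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
}
```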
