
Commit 482fb66 (1 parent: a83aa0b)

Add WIP JSON Reader

7 files changed: +611 lines, -3 lines

Cargo.toml (3 additions & 1 deletion)

```diff
@@ -12,4 +12,6 @@ num-traits = "0.2"
 csv = "1"
 byteorder = "1"
 flatbuffers = "0.5"
-array_tool = "1"
+array_tool = "1"
+# TODO: using this to return schema with preserved order
+serde_json = {version = "1", features = ["preserve_order"]}
```
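The `preserve_order` feature replaces `serde_json`'s default sorted map with an insertion-ordered one, which is what lets an inferred schema keep fields in the order they appear in the file. A minimal sketch of the difference (the sample record is made up):

```rust
use serde_json::Value;

fn main() {
    // Keys deliberately out of alphabetical order.
    let line = r#"{"b": 1.5, "a": 2, "c": "x"}"#;
    let parsed: Value = serde_json::from_str(line).unwrap();

    if let Value::Object(map) = parsed {
        // With `preserve_order` enabled this prints b, a, c (document order);
        // without it, serde_json sorts keys and prints a, b, c.
        for key in map.keys() {
            println!("{}", key);
        }
    }
}
```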

README.md (22 additions & 2 deletions)

```diff
@@ -29,8 +29,28 @@ One can think of this library partly as a playground for features that could form
 
 ## Status
 
+### IO
+
+We found it painful to build this library while Arrow still has very limited IO options. To address this, we are implementing IO for some formats ourselves, some of which we can contribute upstream once we are happy with the implementation details.
+
+For now, we're trying to support CSV, JSON, and perhaps other simple file formats.
+**Note on Feather:** support for the Feather file format should be considered deprecated in favour of Arrow IPC. Though we have implemented Feather, it is meant as a stop-gap measure until Arrow supports IPC in Rust. We'll try to tackle this in the coming months.
+
+- IO Support
+  - [ ] CSV
+    - [X] Read
+    - [ ] Write
+  - [ ] JSON
+    - [X] Read
+    - [ ] Write
+  - [ ] Feather
+    - [X] Read
+    - [X] Write (**do not use**: the current limitation with slicing arrays means we write each record batch as a separate file, instead of a single file for all the data)
+
+### Functionality
+
 - DataFrame Operations
-  - [x] Read CSV into dataframe
+  <!-- - [x] Read CSV into dataframe -->
   - [X] Select single column
   - [ ] Select subset of columns, drop columns
   - [X] Add or remove columns
@@ -67,4 +87,4 @@ One can think of this library partly as a playground for features that could form
 
 ## Performance
 
-We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk.
+We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk, specifically after we implement CSV and JSON writers.
```

src/dataframe.rs (44 additions & 0 deletions)

```diff
@@ -1,3 +1,5 @@
+use crate::io::json::Reader as JsonReader;
+use crate::io::json::ReaderBuilder as JsonReaderBuilder;
 use crate::table::Column;
 use crate::utils;
 use arrow::array;
@@ -11,6 +13,7 @@ use arrow::error::ArrowError;
 use arrow::record_batch::RecordBatch;
 use std::fs::File;
 use std::sync::Arc;
+use std::io::BufReader;
 
 use crate::error::DataFrameError;
 
@@ -363,6 +366,47 @@ impl DataFrame {
         })
     }
 
+    pub fn from_json(path: &str, schema: Option<Arc<Schema>>) -> Self {
+        let file = File::open(path).unwrap();
+        let mut reader = match schema {
+            Some(schema) => JsonReader::new(BufReader::new(file), schema, 1024, None),
+            None => {
+                let builder = JsonReaderBuilder::new()
+                    .infer_schema(None)
+                    .with_batch_size(1024);
+                builder.build::<File>(file).unwrap()
+            }
+        };
+        let mut batches: Vec<RecordBatch> = vec![];
+        let mut has_next = true;
+        while has_next {
+            match reader.next() {
+                Ok(batch) => match batch {
+                    Some(batch) => {
+                        batches.push(batch);
+                    }
+                    None => {
+                        has_next = false;
+                    }
+                },
+                Err(_) => {
+                    has_next = false;
+                }
+            }
+        }
+
+        let schema: Arc<Schema> = batches[0].schema().clone();
+
+        // convert to an arrow table
+        let table = crate::table::Table::from_record_batches(schema.clone(), batches);
+
+        // DataFrame::from_table(table)
+        DataFrame {
+            schema,
+            columns: table.columns,
+        }
+    }
+
     /// Write dataframe to a feather file
     ///
     /// Data is currently written as individual batches (as Arrow doesn't yet support slicing).
```
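For context, a usage sketch of the new reader; the file path and field names are hypothetical, and `Schema`, `Field`, and `DataType` come from the `arrow` crate. Note that `from_json` panics at `batches[0]` if the source yields no batches, which is worth guarding against before this leaves WIP:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};

fn example() {
    // Let the reader infer the schema from the file (hypothetical path).
    let _inferred = DataFrame::from_json("data/people.json", None);

    // Or pass a known schema to skip inference. Per the WIP notes, only
    // Int64 and Float64 are handled among the numeric types so far.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, true),
    ]));
    let _with_schema = DataFrame::from_json("data/people.json", Some(schema));
}
```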

src/io/JSON.md (22 additions & 0 deletions)

```diff
@@ -0,0 +1,22 @@
+## JSON IO Progress
+
+This is a brain-dump of what we want to achieve with JSON IO.
+
+### Reader
+
+- [ ] Read user schema
+- [ ] Support numeric types other than `Float64, Int64`
+- [ ] Infer schema (basic)
+- [ ] Infer schema (lists and structs)
+- [ ] Coerce fields that contain both scalars and lists to lists
+- [ ] Support projection using field names
+- [ ] Add option for dealing with case sensitivity
+- [ ] Coerce fields that can't be cast to the provided schema (e.g. if a field can't be read as int because a float appears, convert the ints to float instead of leaving nulls)
+- [ ] Reduce repetition where possible
+- [ ] Parity with the C++ implementation (there's a Google Doc that has the spec)
+- [ ] Add comprehensive tests
+  - [ ] Nulls in various places
+  - [ ] *All* supported Arrow types
+  - [ ] Corrupt files and non-line-delimited files
+  - [ ] Files where schemas don't match at all
+  - [ ] Performance testing?
```
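To make the unchecked "Infer schema (basic)" item concrete, here is a minimal sketch, not the reader's actual code: sample the first N line-delimited records, map each JSON value to an Arrow `DataType`, and widen `Int64` to `Float64` when a field mixes integers and floats. All names here are illustrative.

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Read};

use arrow::datatypes::{DataType, Field, Schema};
use serde_json::Value;

/// Infer a flat schema by sampling up to `max_records` line-delimited records.
fn infer_json_schema<R: Read>(input: R, max_records: usize) -> Schema {
    // Field names in first-seen order, plus the inferred type per field.
    let mut names: Vec<String> = vec![];
    let mut types: HashMap<String, DataType> = HashMap::new();

    for line in BufReader::new(input).lines().take(max_records) {
        let record: Value = serde_json::from_str(&line.unwrap()).unwrap();
        if let Value::Object(map) = record {
            for (key, value) in map.iter() {
                let inferred = match value {
                    Value::Bool(_) => DataType::Boolean,
                    Value::Number(n) if n.is_i64() => DataType::Int64,
                    Value::Number(_) => DataType::Float64,
                    // Strings, nulls, lists, and structs all fall back to Utf8
                    // in this sketch; the real reader would treat them properly.
                    _ => DataType::Utf8,
                };
                match types.get(key) {
                    // Widen Int64 to Float64 when a field mixes the two.
                    Some(DataType::Int64) if inferred == DataType::Float64 => {
                        types.insert(key.clone(), DataType::Float64);
                    }
                    Some(_) => {} // otherwise keep the first inferred type
                    None => {
                        names.push(key.clone());
                        types.insert(key.clone(), inferred);
                    }
                }
            }
        }
    }

    // Every field is nullable, since any record may omit a key.
    let fields = names
        .iter()
        .map(|name| Field::new(name, types[name].clone(), true))
        .collect();
    Schema::new(fields)
}
```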
