
Commit 70159b2

Add JSON Reader (#10)

* Add WIP JSON Reader
* mark progress with reader
* remove JsonType and directly use Arrow's DataType
* support arrays
* use JSON reader from Arrow
* remove redundant JSON test files

2 parents: a83aa0b + c4f7c44

File tree (4 files changed: +90 −3 lines)

- Cargo.toml
- README.md
- src/dataframe.rs
- src/io/JSON.md

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ authors = ["Neville Dipale <nevilledips@gmail.com>"]
 edition = "2018"
 
 [dependencies]
-arrow = { git = "https://github.com/apache/arrow"}
+arrow = { git = "https://github.com/nevi-me/arrow", branch="rust/json-reader"}
 # arrow = { path = "../../arrow/rust/arrow"}
 num = "0.2"
 num-traits = "0.2"

README.md

Lines changed: 22 additions & 2 deletions
@@ -29,8 +29,28 @@ One can think of this library partly as a playground for features that could for
 
 ## Status
 
+### IO
+
+Building this library while Arrow still has very limited IO options has been painful. To that end, we are implementing IO for some formats ourselves, and we may contribute some of these upstream once we are happy with the implementation details. For now we're targeting CSV, JSON, and perhaps other simple file formats.
+
+**Note on Feather:** Feather file format support can be considered deprecated in favour of Arrow IPC. Though we have implemented Feather, it is meant as a stop-gap until Arrow supports IPC (in Rust). We'll try to tackle this in the coming months.
+
+- IO Support
+  - [ ] CSV
+    - [X] Read
+    - [ ] Write
+  - [ ] JSON
+    - [X] Read (submitted to Arrow)
+    - [ ] Write
+  - [ ] Feather
+    - [X] Read
+    - [X] Write (**do not use**: the current limitation with slicing arrays means we write each record batch as a separate file, instead of a single file for all the data)
+
+### Functionality
+
 - DataFrame Operations
-  - [x] Read CSV into dataframe
+  <!-- - [x] Read CSV into dataframe -->
   - [X] Select single column
   - [ ] Select subset of columns, drop columns
   - [X] Add or remove columns
@@ -67,4 +87,4 @@ One can think of this library partly as a playground for features that could for
 
 ## Performance
 
-We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk.
+We plan on providing simple benchmarks in the near future, especially once we gain the ability to save dataframes to disk; specifically, once the CSV and JSON writers are implemented.

src/dataframe.rs

Lines changed: 44 additions & 0 deletions
@@ -8,8 +8,11 @@ use arrow::csv::Reader as CsvReader;
 use arrow::csv::ReaderBuilder as CsvReaderBuilder;
 use arrow::datatypes::*;
 use arrow::error::ArrowError;
+use arrow::json::Reader as JsonReader;
+use arrow::json::ReaderBuilder as JsonReaderBuilder;
 use arrow::record_batch::RecordBatch;
 use std::fs::File;
+use std::io::BufReader;
 use std::sync::Arc;
 
 use crate::error::DataFrameError;
@@ -363,6 +366,47 @@ impl DataFrame {
         })
     }
 
+    pub fn from_json(path: &str, schema: Option<Arc<Schema>>) -> Self {
+        let file = File::open(path).unwrap();
+        // use the caller's schema if one was provided, otherwise infer it from the file
+        let mut reader = match schema {
+            Some(schema) => JsonReader::new(BufReader::new(file), schema, 1024, None),
+            None => {
+                let builder = JsonReaderBuilder::new()
+                    .infer_schema(None)
+                    .with_batch_size(1024);
+                builder.build::<File>(file).unwrap()
+            }
+        };
+        // collect record batches until the reader is exhausted (or errors)
+        let mut batches: Vec<RecordBatch> = vec![];
+        let mut has_next = true;
+        while has_next {
+            match reader.next() {
+                Ok(Some(batch)) => {
+                    batches.push(batch);
+                }
+                Ok(None) => {
+                    has_next = false;
+                }
+                Err(_) => {
+                    has_next = false;
+                }
+            }
+        }
+
+        // take the schema from the first batch; this panics if the file yielded no batches
+        let schema: Arc<Schema> = batches[0].schema().clone();
+
+        // convert to an arrow table
+        let table = crate::table::Table::from_record_batches(schema.clone(), batches);
+
+        DataFrame {
+            schema,
+            columns: table.columns,
+        }
+    }
+
     /// Write dataframe to a feather file
     ///
     /// Data is currently written as individual batches (as Arrow doesn't yet support slicing).
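
For reference, here is a minimal sketch of how the new constructor might be called. This is an illustration rather than code from the commit: the file path, field names, and types are hypothetical, and the explicit-schema variant simply exercises the `Option<Arc<Schema>>` parameter shown above.

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};

use crate::dataframe::DataFrame;

fn example() {
    // Let the reader infer the schema by scanning the file (path is hypothetical).
    let _inferred = DataFrame::from_json("data/people.json", None);

    // Or supply an explicit schema to skip inference; the fields are made up.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, true),
    ]));
    let _explicit = DataFrame::from_json("data/people.json", Some(schema));
}
```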

src/io/JSON.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+## JSON IO
+
+Keeping track of what's supported upstream in Arrow's JSON reader.
+
+### Reader
+
+- [X] Read user schema
+- [X] Support numeric types other than `Float64, Int64`
+- [X] Infer schema (basic)
+  - [X] Infer list schema
+  - [ ] Infer struct schema
+  - [ ] Coerce fields that have both scalars and lists to lists
+- [X] Support projection using field names
+  - [ ] Add option for dealing with case sensitivity
+- [X] Coerce fields that can't be cast to the provided schema (e.g. if a field can't be read as int because some values are floats, convert the ints to floats instead of leaving nulls)
+- [ ] Reduce repetition where possible
+- [ ] Parity with the C++ implementation (there's a Google Doc that has the spec)
+- [ ] Add comprehensive tests
+  - [X] Nulls at various places
+  - [ ] *All* supported Arrow types
+  - [X] Corrupt files and non-line-delimited files
+  - [ ] Files where schemas don't match at all
+  - [ ] Performance testing?
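
To make the checklist concrete, here is a minimal sketch of driving the reader directly, using only the builder calls that appear in `from_json` above. The path is hypothetical, and this is the API on the `rust/json-reader` branch, so it may change before landing upstream.

```rust
use std::fs::File;

use arrow::json::ReaderBuilder;

fn print_batch_sizes(path: &str) {
    let file = File::open(path).unwrap();
    // Infer the schema by scanning the file (None = no limit on records read),
    // then read the data back in batches of 1024 rows.
    let mut reader = ReaderBuilder::new()
        .infer_schema(None)
        .with_batch_size(1024)
        .build::<File>(file)
        .unwrap();
    // next() yields Ok(Some(batch)) until the file is exhausted.
    while let Ok(Some(batch)) = reader.next() {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
}
```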
