
Commit 482fb66 (1 parent: a83aa0b)

Add WIP JSON Reader

7 files changed: +611 lines, -3 lines

Cargo.toml (3 additions & 1 deletion)

```diff
@@ -12,4 +12,6 @@ num-traits = "0.2"
 csv = "1"
 byteorder = "1"
 flatbuffers = "0.5"
-array_tool = "1"
+array_tool = "1"
+# TODO: using this to return schema with preserved order
+serde_json = {version = "1", features = ["preserve_order"]}
```
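The `preserve_order` feature replaces `serde_json`'s default sorted map with an insertion-ordered one, which is what lets an inferred schema keep fields in the order they appear in the file. A minimal sketch of the difference (the sample record is made up):

```rust
use serde_json::Value;

fn main() {
    // Keys deliberately out of alphabetical order.
    let line = r#"{"b": 1.5, "a": 2, "c": "x"}"#;
    let parsed: Value = serde_json::from_str(line).unwrap();

    if let Value::Object(map) = parsed {
        // With `preserve_order` enabled this prints b, a, c (document order);
        // without it, serde_json sorts keys and prints a, b, c.
        for key in map.keys() {
            println!("{}", key);
        }
    }
}
```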

README.md (22 additions & 2 deletions)

```diff
@@ -29,8 +29,28 @@ One can think of this library partly as a playground for features that could form
 
 ## Status
 
+### IO
+
+We found it painful to build this library while Arrow still has very limited IO options. To address this, we are implementing IO for some formats ourselves, some of which we can contribute upstream once we are happy with the implementation details.
+
+For now, we're trying to support CSV, JSON, and perhaps other simple file formats.
+**Note on Feather:** support for the Feather file format should be considered deprecated in favour of Arrow IPC. Though we have implemented Feather, it is meant as a stop-gap measure until Arrow supports IPC in Rust. We'll try to tackle this in the coming months.
+
+- IO Support
+  - [ ] CSV
+    - [X] Read
+    - [ ] Write
+  - [ ] JSON
+    - [X] Read
+    - [ ] Write
+  - [ ] Feather
+    - [X] Read
+    - [X] Write (**do not use**: the current limitation with slicing arrays means we write each record batch as a separate file, instead of a single file for all the data)
+
+### Functionality
+
 - DataFrame Operations
-  - [x] Read CSV into dataframe
+  <!-- - [x] Read CSV into dataframe -->
   - [X] Select single column
   - [ ] Select subset of columns, drop columns
   - [X] Add or remove columns
@@ -67,4 +87,4 @@ One can think of this library partly as a playground for features that could form
 
 ## Performance
 
-We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk.
+We plan on providing simple benchmarks in the near future, especially after we gain the ability to save dataframes to disk, specifically after we implement CSV and JSON writers.
```

src/dataframe.rs (44 additions & 0 deletions)

```diff
@@ -1,3 +1,5 @@
+use crate::io::json::Reader as JsonReader;
+use crate::io::json::ReaderBuilder as JsonReaderBuilder;
 use crate::table::Column;
 use crate::utils;
 use arrow::array;
@@ -11,6 +13,7 @@ use arrow::error::ArrowError;
 use arrow::record_batch::RecordBatch;
 use std::fs::File;
 use std::sync::Arc;
+use std::io::BufReader;
 
 use crate::error::DataFrameError;
 
@@ -363,6 +366,47 @@ impl DataFrame {
         })
     }
 
+    pub fn from_json(path: &str, schema: Option<Arc<Schema>>) -> Self {
+        let file = File::open(path).unwrap();
+        let mut reader = match schema {
+            Some(schema) => JsonReader::new(BufReader::new(file), schema, 1024, None),
+            None => {
+                let builder = JsonReaderBuilder::new()
+                    .infer_schema(None)
+                    .with_batch_size(1024);
+                builder.build::<File>(file).unwrap()
+            }
+        };
+        let mut batches: Vec<RecordBatch> = vec![];
+        let mut has_next = true;
+        while has_next {
+            match reader.next() {
+                Ok(batch) => match batch {
+                    Some(batch) => {
+                        batches.push(batch);
+                    }
+                    None => {
+                        has_next = false;
+                    }
+                },
+                Err(_) => {
+                    has_next = false;
+                }
+            }
+        }
+
+        let schema: Arc<Schema> = batches[0].schema().clone();
+
+        // convert to an arrow table
+        let table = crate::table::Table::from_record_batches(schema.clone(), batches);
+
+        // DataFrame::from_table(table)
+        DataFrame {
+            schema,
+            columns: table.columns,
+        }
+    }
+
     /// Write dataframe to a feather file
     ///
     /// Data is currently written as individual batches (as Arrow doesn't yet support slicing).
```
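For context, a usage sketch of the new reader; the file path and field names are hypothetical, and `Schema`, `Field`, and `DataType` come from the `arrow` crate. Note that `from_json` panics at `batches[0]` if the source yields no batches, which is worth guarding against before this leaves WIP:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};

fn example() {
    // Let the reader infer the schema from the file (hypothetical path).
    let _inferred = DataFrame::from_json("data/people.json", None);

    // Or pass a known schema to skip inference. Per the WIP notes, only
    // Int64 and Float64 are handled among the numeric types so far.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, true),
    ]));
    let _with_schema = DataFrame::from_json("data/people.json", Some(schema));
}
```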

src/io/JSON.md (22 additions & 0 deletions)

```diff
@@ -0,0 +1,22 @@
+## JSON IO Progress
+
+This is a brain-dump of what we want to achieve with JSON IO.
+
+### Reader
+
+- [ ] Read user schema
+- [ ] Support numeric types other than `Float64, Int64`
+- [ ] Infer schema (basic)
+- [ ] Infer schema (lists and structs)
+- [ ] Coerce fields that contain both scalars and lists to lists
+- [ ] Support projection using field names
+- [ ] Add option for dealing with case sensitivity
+- [ ] Coerce fields that can't be cast to the provided schema (e.g. if a field can't be read as int because a float appears, convert the ints to float instead of leaving nulls)
+- [ ] Reduce repetition where possible
+- [ ] Parity with the C++ implementation (there's a Google Doc that has the spec)
+- [ ] Add comprehensive tests
+  - [ ] Nulls in various places
+  - [ ] *All* supported Arrow types
+  - [ ] Corrupt files and non-line-delimited files
+  - [ ] Files where schemas don't match at all
+  - [ ] Performance testing?
```
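To make the unchecked "Infer schema (basic)" item concrete, here is a minimal sketch, not the reader's actual code: sample the first N line-delimited records, map each JSON value to an Arrow `DataType`, and widen `Int64` to `Float64` when a field mixes integers and floats. All names here are illustrative.

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Read};

use arrow::datatypes::{DataType, Field, Schema};
use serde_json::Value;

/// Infer a flat schema by sampling up to `max_records` line-delimited records.
fn infer_json_schema<R: Read>(input: R, max_records: usize) -> Schema {
    // Field names in first-seen order, plus the inferred type per field.
    let mut names: Vec<String> = vec![];
    let mut types: HashMap<String, DataType> = HashMap::new();

    for line in BufReader::new(input).lines().take(max_records) {
        let record: Value = serde_json::from_str(&line.unwrap()).unwrap();
        if let Value::Object(map) = record {
            for (key, value) in map.iter() {
                let inferred = match value {
                    Value::Bool(_) => DataType::Boolean,
                    Value::Number(n) if n.is_i64() => DataType::Int64,
                    Value::Number(_) => DataType::Float64,
                    // Strings, nulls, lists, and structs all fall back to Utf8
                    // in this sketch; the real reader would treat them properly.
                    _ => DataType::Utf8,
                };
                match types.get(key) {
                    // Widen Int64 to Float64 when a field mixes the two.
                    Some(DataType::Int64) if inferred == DataType::Float64 => {
                        types.insert(key.clone(), DataType::Float64);
                    }
                    Some(_) => {} // otherwise keep the first inferred type
                    None => {
                        names.push(key.clone());
                        types.insert(key.clone(), inferred);
                    }
                }
            }
        }
    }

    // Every field is nullable, since any record may omit a key.
    let fields = names
        .iter()
        .map(|name| Field::new(name, types[name].clone(), true))
        .collect();
    Schema::new(fields)
}
```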
