This repository was archived by the owner on Jan 11, 2021. It is now read-only.

CSV to Parquet #183

@nevi-me

Description


Hi, I'm experimenting with creating a CSV to Parquet writer, and I have a few questions.

My end goal for the experiment is to create a crate that converts various file formats (CSV, BSON, JSON) to Parquet. That would also lend itself to a possible CSV to Apache Arrow reader.

  1. How can I write strings? I used message schema {REQUIRED BYTE_ARRAY name;}, but when I read the Parquet file in Python, the strings come back as bytes. I then found the annotated form, {REQUIRED BYTE_ARRAY name (UTF8);}

If I have a csv file that looks like:

Id,Name,Age
123-adf,John Doe,25
sfd-ge2,Jane Doe,26

I have written the below:

use std::error::Error;
use std::fs::File;
use std::rc::Rc;

use parquet::column::writer::ColumnWriter;
use parquet::data_type::ByteArray;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, RowGroupWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn write_parquet() -> Result<(), Box<dyn Error>> {

    // TODO let message_type = build_parquet_schema();
    let message_type = "
        message schema {REQUIRED BYTE_ARRAY Id;REQUIRED BYTE_ARRAY Name;REQUIRED INT32 Age;}
    ";

    let schema = Rc::new(parse_message_type(message_type)?);
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("./data/file1.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;
    let mut row_group_writer = writer.next_row_group()?;
    let mut col_number = 0;

    // Columns are written one at a time, in schema order.
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        col_number += 1;
        match col_writer {
            ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                println!("writing a byte array");
                // I can remove this if-else when I start taking fn parameters of my schema and columns
                if col_number == 1 {
                    typed_writer.write_batch(
                        &[ByteArray::from("123-adf"), ByteArray::from("sdf-ge2")],
                        None,
                        None,
                    )?;
                } else {
                    typed_writer.write_batch(
                        &[ByteArray::from("John Doe"), ByteArray::from("Jane Doe")],
                        None,
                        None,
                    )?;
                }
            }
            ColumnWriter::Int32ColumnWriter(ref mut typed_writer) => {
                println!("writing an integer");
                typed_writer.write_batch(&[25, 26], None, None)?;
            }
            _ => {}
        }
        row_group_writer.close_column(col_writer)?;
    }
    writer.close_row_group(row_group_writer)?;
    writer.close()?;
    Ok(())
}
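To make the reads-as-bytes question concrete, here is a minimal sketch of the same schema string with the UTF8 annotation attached to each string field, so readers like pyarrow/pandas decode those columns as strings rather than raw bytes. Whether parse_message_type accepts this exact spelling is an assumption worth checking against the crate's parser.

```rust
// Hedged sketch: attach (UTF8) to each BYTE_ARRAY field in the message type.
fn utf8_schema() -> &'static str {
    "message schema {
        REQUIRED BYTE_ARRAY Id (UTF8);
        REQUIRED BYTE_ARRAY Name (UTF8);
        REQUIRED INT32 Age;
    }"
}

fn main() {
    let schema = utf8_schema();
    // Each string column carries the UTF8 annotation; Age stays a plain INT32.
    assert!(schema.contains("Id (UTF8)"));
    assert!(schema.contains("Name (UTF8)"));
    assert!(!schema.contains("Age (UTF8)"));
    println!("{}", schema);
}
```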
  2. Similar to Q1, am I writing my strings properly, or is there a better way?

  3. From reading through the conversation on How to write a None value to a column #174, it looks like I have to specify definition levels marking where my values are. So, does typed_writer.write_batch(&[24,25,24,26,27,28], None, None) produce a less compact file? Is it even valid?

  4. In general, does this library allow appending to existing Parquet files? I haven't tried it yet.
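On the write_batch/None question above: my understanding from #174 is that for an OPTIONAL column the nulls are not stored as values at all; write_batch receives only the present values plus a definition-levels slice (1 = present, 0 = null, assuming a max definition level of 1). A sketch of that flattening, with a helper name of my own invention:

```rust
// Hypothetical helper: split Option<i32> rows into the dense values slice and
// the definition levels that write_batch(&values, Some(&def_levels), None)
// would take for an OPTIONAL column (1 = value present, 0 = null).
fn to_values_and_def_levels(rows: &[Option<i32>]) -> (Vec<i32>, Vec<i16>) {
    let mut values = Vec::new();
    let mut def_levels = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Some(v) => {
                values.push(*v);
                def_levels.push(1);
            }
            None => def_levels.push(0),
        }
    }
    (values, def_levels)
}

fn main() {
    let rows = [Some(24), None, Some(26)];
    let (values, defs) = to_values_and_def_levels(&rows);
    assert_eq!(values, vec![24, 26]); // nulls take no slot in the values
    assert_eq!(defs, vec![1, 0, 1]); // levels record where the null was
}
```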

I'm trying a naive approach: first read the CSV files and generate the schema with a string builder. If that works, I can look at macros.
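The string-builder idea could start as simple as the sketch below; the function name and the (column name, physical type) input shape are my own assumptions for illustration, not an API of the parquet crate.

```rust
// Hypothetical: assemble a Parquet message-type string from inferred columns.
fn build_message_type(columns: &[(&str, &str)]) -> String {
    let mut s = String::from("message schema {");
    for (name, physical_type) in columns {
        s.push_str(&format!(" REQUIRED {} {};", physical_type, name));
    }
    s.push_str(" }");
    s
}

fn main() {
    let schema = build_message_type(&[
        ("Id", "BYTE_ARRAY"),
        ("Name", "BYTE_ARRAY"),
        ("Age", "INT32"),
    ]);
    assert_eq!(
        schema,
        "message schema { REQUIRED BYTE_ARRAY Id; REQUIRED BYTE_ARRAY Name; REQUIRED INT32 Age; }"
    );
    println!("{}", schema);
}
```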

The other thing that I'll explore separately is how to convert the csv crate's StringRecord values into Parquet types. I might use regex to figure out whether values are strings, i32, i64, timestamps, etc. If someone knows of an existing way to do this, that'd also be welcome.
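Instead of regex, str::parse can do much of that inference directly by trying progressively wider types; a sketch (the Kind enum and the i32 → i64 → f64 → string precedence are my own assumptions, and timestamps would need a separate check):

```rust
// Hypothetical type inference for a single CSV field: try narrow numeric
// types first and fall back to treating the field as a UTF8 string.
#[derive(Debug, PartialEq)]
enum Kind {
    Int32,
    Int64,
    Float64,
    Utf8,
}

fn infer_kind(field: &str) -> Kind {
    if field.parse::<i32>().is_ok() {
        Kind::Int32
    } else if field.parse::<i64>().is_ok() {
        Kind::Int64
    } else if field.parse::<f64>().is_ok() {
        Kind::Float64
    } else {
        Kind::Utf8
    }
}

fn main() {
    assert_eq!(infer_kind("25"), Kind::Int32);
    assert_eq!(infer_kind("9999999999"), Kind::Int64); // larger than i32::MAX
    assert_eq!(infer_kind("3.14"), Kind::Float64);
    assert_eq!(infer_kind("123-adf"), Kind::Utf8);
}
```

For a whole column you would fold infer_kind over every row and keep the widest Kind seen, so one stray "N/A" correctly demotes a numeric column to strings.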
