This repository was archived by the owner on Jan 11, 2021. It is now read-only.

CSV to Parquet #183

@nevi-me

Description


Hi, I'm experimenting with creating a CSV to Parquet writer, and I have a few questions.

My end goal for the experiment is to create a crate that converts various file formats (CSV, BSON, JSON) to Parquet. That would also lend itself to a possible CSV to Apache Arrow reader.

  1. How can I write strings? I used message schema {REQUIRED BYTE_ARRAY name;}, but when I read the Parquet file in Python, the strings come back as bytes. I then found the annotated form, {REQUIRED BYTE_ARRAY name (UTF8);}

If I have a csv file that looks like:

Id,Name,Age
123-adf,John Doe,25
sfd-ge2,Jane Doe,26

I have written the below:

use std::error::Error;
use std::fs::File;
use std::rc::Rc;

use parquet::column::writer::ColumnWriter;
use parquet::data_type::ByteArray;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, RowGroupWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn write_parquet() -> Result<(), Box<dyn Error>> {

    // TODO let message_type = build_parquet_schema();
    let message_type = "
        message schema {REQUIRED BYTE_ARRAY Id;REQUIRED BYTE_ARRAY Name;REQUIRED INT32 Age;}
    ";

    let schema = Rc::new(parse_message_type(message_type)?);
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("./data/file1.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;
    let mut row_group_writer = writer.next_row_group()?;
    let mut col_number = 0;

    // Columns are written one at a time, in schema order.
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        col_number += 1;
        match col_writer {
            ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                println!("writing a byte array");
                // I can remove this if-else when I start taking fn parameters of my schema and columns
                if col_number == 1 {
                    typed_writer.write_batch(
                        &[ByteArray::from("123-adf"), ByteArray::from("sdf-ge2")],
                        None,
                        None,
                    )?;
                } else {
                    typed_writer.write_batch(
                        &[ByteArray::from("John Doe"), ByteArray::from("Jane Doe")],
                        None,
                        None,
                    )?;
                }
            }
            ColumnWriter::Int32ColumnWriter(ref mut typed_writer) => {
                println!("writing an integer");
                typed_writer.write_batch(&[25, 26], None, None)?;
            }
            _ => {}
        }
        row_group_writer.close_column(col_writer)?;
    }
    writer.close_row_group(row_group_writer)?;
    writer.close()?;
    Ok(())
}
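To make the reads-as-bytes question concrete, here is a minimal sketch of the same schema string with the UTF8 annotation attached to each string field, so readers like pyarrow/pandas decode those columns as strings rather than raw bytes. Whether parse_message_type accepts this exact spelling is an assumption worth checking against the crate's parser.

```rust
// Hedged sketch: attach (UTF8) to each BYTE_ARRAY field in the message type.
fn utf8_schema() -> &'static str {
    "message schema {
        REQUIRED BYTE_ARRAY Id (UTF8);
        REQUIRED BYTE_ARRAY Name (UTF8);
        REQUIRED INT32 Age;
    }"
}

fn main() {
    let schema = utf8_schema();
    // Each string column carries the UTF8 annotation; Age stays a plain INT32.
    assert!(schema.contains("Id (UTF8)"));
    assert!(schema.contains("Name (UTF8)"));
    assert!(!schema.contains("Age (UTF8)"));
    println!("{}", schema);
}
```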
  2. Similar to Q1, am I writing my strings properly, or is there a better way?

  3. From reading through the conversation on How to write a None value to a column #174, it looks like I have to specify definition levels marking where my values are. So, does typed_writer.write_batch(&[24,25,24,26,27,28], None, None) produce a less compact file? Is it even valid?

  4. In general, does this library allow appending to existing Parquet files? I haven't tried it yet.
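On the write_batch/None question above: my understanding from #174 is that for an OPTIONAL column the nulls are not stored as values at all; write_batch receives only the present values plus a definition-levels slice (1 = present, 0 = null, assuming a max definition level of 1). A sketch of that flattening, with a helper name of my own invention:

```rust
// Hypothetical helper: split Option<i32> rows into the dense values slice and
// the definition levels that write_batch(&values, Some(&def_levels), None)
// would take for an OPTIONAL column (1 = value present, 0 = null).
fn to_values_and_def_levels(rows: &[Option<i32>]) -> (Vec<i32>, Vec<i16>) {
    let mut values = Vec::new();
    let mut def_levels = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Some(v) => {
                values.push(*v);
                def_levels.push(1);
            }
            None => def_levels.push(0),
        }
    }
    (values, def_levels)
}

fn main() {
    let rows = [Some(24), None, Some(26)];
    let (values, defs) = to_values_and_def_levels(&rows);
    assert_eq!(values, vec![24, 26]); // nulls take no slot in the values
    assert_eq!(defs, vec![1, 0, 1]); // levels record where the null was
}
```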

I'm trying a naive approach: first read the CSV files and generate the schema with a string builder. If that works, I can look at macros.
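The string-builder idea could start as simple as the sketch below; the function name and the (column name, physical type) input shape are my own assumptions for illustration, not an API of the parquet crate.

```rust
// Hypothetical: assemble a Parquet message-type string from inferred columns.
fn build_message_type(columns: &[(&str, &str)]) -> String {
    let mut s = String::from("message schema {");
    for (name, physical_type) in columns {
        s.push_str(&format!(" REQUIRED {} {};", physical_type, name));
    }
    s.push_str(" }");
    s
}

fn main() {
    let schema = build_message_type(&[
        ("Id", "BYTE_ARRAY"),
        ("Name", "BYTE_ARRAY"),
        ("Age", "INT32"),
    ]);
    assert_eq!(
        schema,
        "message schema { REQUIRED BYTE_ARRAY Id; REQUIRED BYTE_ARRAY Name; REQUIRED INT32 Age; }"
    );
    println!("{}", schema);
}
```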

The other thing that I'll explore separately is how to convert the csv crate's StringRecord values into Parquet types. I might use regex to figure out whether values are strings, i32, i64, timestamps, etc. If someone knows of an existing way to do this, that'd also be welcome.
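Instead of regex, str::parse can do much of that inference directly by trying progressively wider types; a sketch (the Kind enum and the i32 → i64 → f64 → string precedence are my own assumptions, and timestamps would need a separate check):

```rust
// Hypothetical type inference for a single CSV field: try narrow numeric
// types first and fall back to treating the field as a UTF8 string.
#[derive(Debug, PartialEq)]
enum Kind {
    Int32,
    Int64,
    Float64,
    Utf8,
}

fn infer_kind(field: &str) -> Kind {
    if field.parse::<i32>().is_ok() {
        Kind::Int32
    } else if field.parse::<i64>().is_ok() {
        Kind::Int64
    } else if field.parse::<f64>().is_ok() {
        Kind::Float64
    } else {
        Kind::Utf8
    }
}

fn main() {
    assert_eq!(infer_kind("25"), Kind::Int32);
    assert_eq!(infer_kind("9999999999"), Kind::Int64); // larger than i32::MAX
    assert_eq!(infer_kind("3.14"), Kind::Float64);
    assert_eq!(infer_kind("123-adf"), Kind::Utf8);
}
```

For a whole column you would fold infer_kind over every row and keep the widest Kind seen, so one stray "N/A" correctly demotes a numeric column to strings.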
