CSV to Parquet #183
Description
Hi, I'm experimenting with creating a CSV to Parquet writer, and I have a few questions.
My end goal for the experiment is to create a crate that converts various file formats (CSV, BSON, JSON) to Parquet. This would also lend itself to a possible CSV to Apache Arrow reader.
Q1. How can I write strings? I used this one, `message schema {REQUIRED BYTE_ARRAY name;}`, but when I read the parquet file in Python, the strings are shown as bytes. I also found the annotated form `message schema {REQUIRED BYTE_ARRAY name (UTF8);}` — is that what I should be using?
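For context, the Parquet format defines a `UTF8` converted-type annotation that marks a `BYTE_ARRAY` as string data, which is what Python readers generally use to decode the bytes back into strings. Annotated, the field from Q1 would read:

```
message schema {
  REQUIRED BYTE_ARRAY name (UTF8);
}
```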
If I have a CSV file that looks like:

```
Id,Name,Age
123-adf,John Doe,25
sfd-ge2,Jane Doe,26
```
I have written the below:
```rust
use std::{error::Error, fs::File, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    data_type::ByteArray,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn write_parquet() -> Result<(), Box<dyn Error>> {
    // TODO: let message_type = build_parquet_schema();
    let message_type = "
        message schema {
            REQUIRED BYTE_ARRAY Id;
            REQUIRED BYTE_ARRAY Name;
            REQUIRED INT32 Age;
        }
    ";
    let schema = Rc::new(parse_message_type(message_type)?);
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("./data/file1.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;
    let mut row_group_writer = writer.next_row_group()?;
    let mut col_number = 0;
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        col_number += 1;
        match col_writer {
            ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                println!("writing a byte array");
                // I can remove this if-else when I start taking fn parameters
                // of my schema and columns.
                if col_number == 1 {
                    typed_writer.write_batch(
                        &[ByteArray::from("123-adf"), ByteArray::from("sfd-ge2")],
                        None,
                        None,
                    )?;
                } else {
                    typed_writer.write_batch(
                        &[ByteArray::from("John Doe"), ByteArray::from("Jane Doe")],
                        None,
                        None,
                    )?;
                }
            }
            ColumnWriter::Int32ColumnWriter(ref mut typed_writer) => {
                println!("writing an integer");
                typed_writer.write_batch(&[25, 26], None, None)?;
            }
            _ => {}
        }
        row_group_writer.close_column(col_writer)?;
    }
    writer.close_row_group(row_group_writer)?;
    writer.close()?;
    Ok(())
}
```
Q2. Similar to Q1, am I writing my strings properly, or is there a better way?
Q3. From reading through the conversation on How to write a None value to a column #174, it looks like I have to specify definition levels indicating where my values are. So, does `typed_writer.write_batch(&[24, 25, 24, 26, 27, 28], None, None)` produce a less compact file? Is it even valid?
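My current understanding from #174 (an assumption, not verified against the library) is that for an OPTIONAL column the values slice holds only the non-null values, while the definition levels record which slots were null. A minimal, parquet-free sketch of that flattening (`to_batch` is a hypothetical helper, not part of the parquet crate):

```rust
/// Flattens a column of optional values into the (values, definition levels)
/// pair that a `write_batch(&values, Some(&def_levels), None)` call would
/// expect for an OPTIONAL (max definition level = 1) column.
/// Hypothetical helper, not an existing parquet-crate API.
fn to_batch(column: &[Option<i32>]) -> (Vec<i32>, Vec<i16>) {
    let mut values = Vec::new();
    let mut def_levels = Vec::new();
    for entry in column {
        match entry {
            Some(v) => {
                values.push(*v);
                def_levels.push(1); // value is present at this slot
            }
            None => def_levels.push(0), // null: no value stored
        }
    }
    (values, def_levels)
}

fn main() {
    let (values, def_levels) = to_batch(&[Some(24), None, Some(26)]);
    assert_eq!(values, vec![24, 26]);
    assert_eq!(def_levels, vec![1, 0, 1]);
    println!("values={:?} def_levels={:?}", values, def_levels);
}
```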
Q4. In general, does this library allow appending to existing parquet files? I haven't tried it yet.
I'm trying a naive approach of first reading csv files and generating the schema with a string builder. If that works, I can look at macros.
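To illustrate the string-builder idea, here is a rough sketch; `build_parquet_schema` and its column-pair input are hypothetical, not an existing API:

```rust
/// Builds a Parquet message-type string from (column name, physical type)
/// pairs. A naive sketch of the string-builder approach; the result is meant
/// to be fed to `parse_message_type`.
fn build_parquet_schema(columns: &[(&str, &str)]) -> String {
    let mut schema = String::from("message schema {");
    for (name, physical_type) in columns {
        schema.push_str(&format!("REQUIRED {} {};", physical_type, name));
    }
    schema.push('}');
    schema
}

fn main() {
    let schema = build_parquet_schema(&[
        ("Id", "BYTE_ARRAY"),
        ("Name", "BYTE_ARRAY"),
        ("Age", "INT32"),
    ]);
    println!("{}", schema);
    // → message schema {REQUIRED BYTE_ARRAY Id;REQUIRED BYTE_ARRAY Name;REQUIRED INT32 Age;}
}
```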
The other thing that I'll explore separately is how to convert csv's StringRecord values into parquet Types. I might use regex to figure out whether values are strings, i32, i64, timestamps, etc. If someone knows of an existing way, that'd also be welcome.
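As a rough sketch of that inference step, using `str::parse` rather than regex (`infer_type` is a hypothetical helper, not part of the csv or parquet crates):

```rust
/// Guesses a Parquet physical type for a CSV field by attempting successively
/// wider numeric parses, falling back to BYTE_ARRAY for anything non-numeric.
fn infer_type(value: &str) -> &'static str {
    if value.parse::<i32>().is_ok() {
        "INT32"
    } else if value.parse::<i64>().is_ok() {
        "INT64"
    } else if value.parse::<f64>().is_ok() {
        "DOUBLE"
    } else {
        "BYTE_ARRAY"
    }
}

fn main() {
    assert_eq!(infer_type("25"), "INT32");
    assert_eq!(infer_type("9999999999"), "INT64"); // too big for i32
    assert_eq!(infer_type("3.14"), "DOUBLE");
    assert_eq!(infer_type("123-adf"), "BYTE_ARRAY");
    println!("ok");
}
```

In practice you'd run this over every value in a column and widen the type as needed; timestamps would still need a dedicated parse or regex pass.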