I deal with a lot of sensor data, which I partition by several attributes to gain flexibility. For example, partitioning allows me to restrict which files are loaded and processed using regular expressions on the paths. In a nutshell, the data is structured like this:
.
+-- sensor=1
|   +-- year=2017
|   |   +-- part.0.parquet
|   +-- year=2018
|       +-- part.0.parquet
+-- sensor=2
    +-- year=2017
    |   +-- part.0.parquet
    +-- year=2018
        +-- part.0.parquet
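For context, loading such a regex-restricted subset might look roughly like this (a minimal sketch assuming a hypothetical root directory data/ containing the tree above; the pattern and paths are illustrative, not my actual code):

import glob
import re

import fastparquet

# Collect all partition files below the (hypothetical) root directory.
paths = sorted(glob.glob('data/sensor=*/year=*/part.0.parquet'))

# Restrict the selection with a regular expression, e.g. only sensor 1.
pattern = re.compile(r'sensor=1/')
paths = [p for p in paths if pattern.search(p)]

# fastparquet accepts a list of files and merges them into one dataset.
df = fastparquet.ParquetFile(paths).to_pandas()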
All parquet files have the same columns. One of them is the column value, which stores the actual sensor readings. All values are floats, but for historical reasons the dtype of value is object (floats stored as strings) for the year 2017 and float for 2018.
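The mixed layout is easy to reproduce (a minimal sketch; the paths and values are illustrative, not my real data):

from pathlib import Path

import fastparquet
import pandas as pd

# 2017: floats stored as strings, hence object dtype.
df_2017 = pd.DataFrame({'value': ['1.0', '2.5', '3.7']})
# 2018: genuine floats, hence float64 dtype.
df_2018 = pd.DataFrame({'value': [1.0, 2.5, 3.7]})

for year, frame in [(2017, df_2017), (2018, df_2018)]:
    directory = Path(f'data/sensor=1/year={year}')
    directory.mkdir(parents=True, exist_ok=True)
    fastparquet.write(str(directory / 'part.0.parquet'), frame)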
When I use fastparquet to load the data into a pandas DataFrame, whether the application works or fails with an error depends on the order in which the files are fed to fastparquet:
- Loading the float data first, followed by the string data, works.
- Loading the string data first, followed by the float data, fails with
  TypeError: expected array of bytes
To demonstrate the issue I wrote a unit test in test_fastparquet.py. The script run.sh creates a virtual environment with all required packages and executes the tests. The essential test logic is sketched below.
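In essence, the test boils down to something like this (a hedged sketch built on the files written above; TestMixedDtypes and the path constants are hypothetical names, not the contents of the attached test_fastparquet.py):

import unittest

import fastparquet


class TestMixedDtypes(unittest.TestCase):
    # Hypothetical paths matching the layout sketched above.
    STRING_FILE = 'data/sensor=1/year=2017/part.0.parquet'  # object column
    FLOAT_FILE = 'data/sensor=1/year=2018/part.0.parquet'   # float64 column

    def test_float_then_string_works(self):
        df = fastparquet.ParquetFile([self.FLOAT_FILE, self.STRING_FILE]).to_pandas()
        self.assertFalse(df.empty)

    def test_string_then_float_raises(self):
        with self.assertRaises(TypeError):
            fastparquet.ParquetFile([self.STRING_FILE, self.FLOAT_FILE]).to_pandas()


if __name__ == '__main__':
    unittest.main()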
Modifying the schema of the loaded fastparquet.ParquetFile makes the translation to a pandas.DataFrame more robust:
from thrift.Thrift import TType  # TType from the Apache Thrift bindings

for schema in pf.fmd.schema:
    if schema.name == 'value':
        # Patch the column's type metadata so fastparquet
        # no longer assumes floats when decoding.
        schema.converted_type = TType.VOID
        schema.type = TType.UTF8
With this modification, the resulting dtype of the pandas column 'value' is object:
print(df['value'].dtype)
>> object
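Since the values are conceptually floats, the unified object column can then be converted back with plain pandas afterwards (assuming the patched load yields string or float entries):

import pandas as pd

# Coerce the mixed string/float entries back to float64.
df['value'] = pd.to_numeric(df['value'])
print(df['value'].dtype)
>> float64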