asenfter/fp-test

Testing fastparquet

I deal with a lot of sensor data, which I partition by several attributes to increase flexibility. For example, partitioning lets me use regular expressions to limit which data is loaded and processed. In a nutshell, the data is structured like this:

.
+-- sensor=1
|   +-- year=2017
|       +-- part.0.parquet
|   +-- year=2018
|       +-- part.0.parquet   
+-- sensor=2
    +-- year=2017
        +-- part.0.parquet
    +-- year=2018
        +-- part.0.parquet
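
As a minimal sketch of the regex-based selection mentioned above, the part files can be filtered by matching a pattern against their path relative to the dataset root. The helper name `select_parts` and the example patterns are hypothetical and not part of this repository; only the standard library is used.

```python
import re
from pathlib import Path


def select_parts(root, pattern):
    """Return the part files under root whose relative path matches pattern.

    Hypothetical helper: the regex is searched in the POSIX-style path
    relative to root, e.g. 'sensor=1/year=2018/part.0.parquet'.
    """
    root = Path(root)
    regex = re.compile(pattern)
    return sorted(
        p for p in root.rglob("*.parquet")
        if regex.search(p.relative_to(root).as_posix())
    )
```

For the layout shown above, `select_parts(root, r"^sensor=1/")` would pick both years of sensor 1, and `select_parts(root, r"year=2018")` would pick 2018 for all sensors.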

All parquet files have the same columns. One of them is the column value, which stores the actual sensor values. All values are floats. For historical reasons, however, the dtype of value is object (floats stored as strings) for the year 2017 and float for 2018.

When I use fastparquet to load the data into a pandas DataFrame, the outcome depends on the order in which the files are fed to fastparquet. The app either works or ends in an error:

  • Loading float data first, followed by string data, works

  • Loading string data first, followed by float data, ends in a TypeError

     TypeError: expected array of bytes
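
Because the failure described above depends only on the file order, one possible stopgap is to sort the paths so that the float partitions come first. The sketch below is not part of this repository: the helper name `float_first` is hypothetical, and the set `FLOAT_YEARS` simply encodes the assumption from the text that 2018 holds floats and 2017 holds strings. Whether this ordering avoids the error in every fastparquet version has not been checked here.

```python
import re

# Assumption taken from the description above: files for 2018 store
# 'value' as float, files for 2017 store it as strings.
FLOAT_YEARS = {"2018"}


def float_first(paths, float_years=FLOAT_YEARS):
    """Order paths so that partitions with a float 'value' column come first."""
    def key(path):
        match = re.search(r"year=(\d{4})", str(path))
        is_float = match is not None and match.group(1) in float_years
        # False sorts before True, so negate to put float partitions first.
        return (not is_float, str(path))

    return sorted(paths, key=key)
```

The ordered list would then be handed to fastparquet in place of the unordered one.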
    

To demonstrate the issue, I wrote a unit test in test_fastparquet.py.

The script run.sh creates a virtual environment with all required packages and runs the tests.

Work-around

Modifying the schema of the loaded fastparquet.ParquetFile makes the conversion to pandas.DataFrame more robust:

# pf is the loaded fastparquet.ParquetFile; override the schema entry of 'value'
for schema in pf.fmd.schema:
    if schema.name == 'value':
        schema.converted_type = TType.VOID
        schema.type = TType.UTF8

With this modification, the resulting dtype of the pandas column 'value' is object.

print(df['value'].dtype)
>> object
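
If a float column is needed afterwards, the object column can be converted once the DataFrame is loaded. This is a minimal sketch using `pandas.to_numeric`. It assumes the object column holds numeric strings and/or floats, which is not verified against the work-around above. The helper name `values_to_float` and the sample values are made up for illustration.

```python
import pandas as pd


def values_to_float(df, column="value"):
    """Return a copy of df with the given column converted to a numeric dtype."""
    out = df.copy()
    # Raises ValueError if an entry cannot be parsed as a number.
    out[column] = pd.to_numeric(out[column])
    return out


# Stand-in for the DataFrame produced by the work-around (values are made up).
df = pd.DataFrame({"value": ["1.5", "2.25", 3.0]}, dtype=object)
converted = values_to_float(df)
print(converted["value"].dtype)
```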
