Converter class does not convert Athena string data to pandas str type #148

@krishanunandy

First of all, thank you for creating this library! It's been immensely helpful. I've used it in multiple contexts over several years and would love to contribute, especially if that helps solve my current problem!

With pyathena==1.10.7 and pandas==1.0.5, I am running the following code, expecting the converter class to cast Athena string columns to the pandas str dtype.

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter

class CustomPandasTypeConverter(Converter):

    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str
            }
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass
    
cur = connect(s3_staging_dir='<staging_directory_url>',
              region_name='<aws_region>',
              cursor_class=PandasCursor,
              converter=CustomPandasTypeConverter(),
              work_group='<workgroup_name>').cursor()

query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes

When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats, and strings are consistently returned as object dtype. However, Athena string NULLs come back as NaN, which requires explicit column-by-column fillna operations. This is particularly inconvenient because I subsequently convert the pandas DataFrame to a Spark DataFrame. Now that I've typed all this out, I'm guessing this is related to #118?
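
For reference, the workaround I am using for now looks roughly like the sketch below. It is plain pandas and does not touch PyAthena at all. The helper name fill_string_nulls and the sample frame are made up for illustration, and it assumes the string columns arrive with object dtype, as described above.

```python
import numpy as np
import pandas as pd


def fill_string_nulls(df, fill_value=""):
    """Return a copy of df with NaN in object-dtype columns replaced.

    Hypothetical helper, not part of PyAthena. Numeric columns are left
    untouched so their NaN values keep their usual meaning.
    """
    out = df.copy()
    # Select every object-dtype column in one go instead of column by column.
    str_cols = out.select_dtypes(include="object").columns
    out[str_cols] = out[str_cols].fillna(fill_value)
    return out


# Stand-in for the frame returned by as_pandas(); the data is invented.
df = pd.DataFrame({
    "name": pd.Series(["a", np.nan, "c"], dtype=object),
    "qty": [1, 2, 3],
    "price": [1.5, np.nan, 3.0],
})
cleaned = fill_string_nulls(df)
```

This fills every object column in a single pass, but it still has to run after every query, which is why I would prefer the converter to handle it.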

Also, I'm not sure where the right place to ask this is, but are there any plans to implement a PySparkCursor for PyAthena? If not, can I help by contributing one?
