Description
First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!
With pyathena=1.10.7 and pandas=1.0.5, I am running the following code with the expectation that the converter class will cast the Athena string data type as an str pandas dtype.
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter

class CustomPandasTypeConverter(Converter):

    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str
            }
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass

cur = connect(s3_staging_dir='<staging_directory_url>',
              region_name='<aws_region>',
              cursor_class=PandasCursor,
              converter=CustomPandasTypeConverter(),
              work_group='<workgroup_name>').cursor()
query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes
When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats, and strings are consistently returned as object dtypes. However, Athena string NULLs are cast as NaNs, which require explicit column-by-column fillna operations. This is particularly inconvenient, since I'm trying to subsequently convert the pandas dataframe to a Spark dataframe. Now that I've typed all this out, I'm guessing this is related to #118?
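For reference, the column-by-column fillna step can be collapsed into a single pass over all object-dtype columns. This is only a sketch of the workaround on the pandas side, not a change to PyAthena itself; the helper name fill_string_nulls and the sample frame are hypothetical, and it assumes an empty string is an acceptable stand-in for NULL.

```python
import numpy as np
import pandas as pd


def fill_string_nulls(df: pd.DataFrame, fill_value: str = "") -> pd.DataFrame:
    """Return a copy of df with missing values in object-dtype columns replaced.

    Hypothetical helper: numeric columns are left untouched, so their NaNs
    keep meaning "missing number".
    """
    out = df.copy()
    # Select only the columns pandas stores as generic Python objects,
    # which is how the string columns show up in the result above.
    string_cols = out.select_dtypes(include=["object"]).columns
    out[string_cols] = out[string_cols].fillna(fill_value)
    return out


# Stand-in for the frame returned by cur.execute(query).as_pandas().
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": pd.Series(["a", np.nan, "c"], dtype=object),
    "price": [1.5, np.nan, 3.0],
})

cleaned = fill_string_nulls(df)
```

This only hides the symptom: an empty string is not the same as NULL, so whether it is a safe replacement depends on how the Spark side treats it.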
Also, I'm not sure where the right place to ask this is, but are there any plans to implement a PySparkCursor for PyAthena? If not, can I help by contributing?