Questions about operationalising autokeras on big datasets #1502
robinvanschaik asked this question in Q&A
Hi all,
For a current project I am training a TensorFlow/Keras model for churn prediction on Google Analytics data stored as a large dataset in BigQuery. I would also like to test AutoKeras for this use case.
I believe the current best practice is to export the dataset to Google Cloud Storage as sharded CSVs and to stream the shards in with the tf.data interleave transformation.
Since AutoKeras accepts tf.data datasets, I believe this approach should work.
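For context, this is roughly how I stream the shards today; the bucket path, column defaults and batch size below are placeholders rather than the real project values:

```python
import tensorflow as tf

CSV_PATTERN = "gs://my-bucket/churn-export/train-*.csv"  # placeholder path
# One record default per CSV column; the real schema is wider than this.
COLUMN_DEFAULTS = [tf.float32, tf.float32, tf.string, tf.int32]
BATCH_SIZE = 512

def make_dataset(pattern):
    # List the shards, then interleave their rows so several files are read in parallel.
    files = tf.data.Dataset.list_files(pattern, shuffle=True)
    dataset = files.interleave(
        lambda path: tf.data.experimental.CsvDataset(
            path, record_defaults=COLUMN_DEFAULTS, header=True),
        cycle_length=4,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    return dataset.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

train_ds = make_dataset(CSV_PATTERN)
```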
However, I still have some questions about the inner workings of AutoKeras before I dive in.
For instance, I currently use the tf.feature_column API to construct my features and pass them to the Keras DenseFeatures layer.
Because the dataset is streamed in batches, I query my BigQuery training dataset at model-creation time via utility functions.
This lets me calculate the min/max, means and standard deviations of my numeric features and pass them to scaling functions. Similarly, I retrieve the list of classes for each categorical feature.
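Simplified, the wiring looks like the sketch below; the statistics and vocabulary are placeholder values standing in for the results of those BigQuery queries:

```python
import tensorflow as tf

# Placeholder statistics, pretending they came back from the BigQuery utilities.
PAGEVIEWS_MEAN, PAGEVIEWS_STD = 12.3, 4.5
DEVICE_CLASSES = ["desktop", "mobile", "tablet"]

def zscore(mean, std):
    # Returns the scaling function handed to the numeric column.
    return lambda x: (x - mean) / std

feature_columns = [
    tf.feature_column.numeric_column(
        "pageviews", normalizer_fn=zscore(PAGEVIEWS_MEAN, PAGEVIEWS_STD)),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "device_category", vocabulary_list=DEVICE_CLASSES)),
]

dense_features = tf.keras.layers.DenseFeatures(feature_columns)
```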
How does AutoKeras handle these kinds of transformations when the data is being streamed in batches?
The nice part of the tf.feature_column API is that you can specify default values that apply both during training and at inference.
This helps in production, when values might be missing, without having to write extensive checks.
For instance, like this:
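(A simplified sketch; the column names, vocabulary and defaults are placeholders rather than my actual schema.)

```python
import tensorflow as tf

feature_columns = [
    # default_value fills in the numeric feature when it is missing from the
    # parsed input, so a dropped field at inference time does not break serving.
    tf.feature_column.numeric_column("sessions", default_value=0.0),
    tf.feature_column.indicator_column(
        # Missing or unseen categories are mapped to the "unknown" entry
        # (index 0) instead of raising an error.
        tf.feature_column.categorical_column_with_vocabulary_list(
            "traffic_source",
            vocabulary_list=["unknown", "organic", "paid", "referral"],
            default_value=0)),
]
```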
Thanks!