
Handling NaNs from ElementProperty #898

@gbrunin

When using an ElementProperty featurizer, some elemental data may be missing; for example, pymatgen's Element for Ga has no bulk modulus. In such a case, the featurizer returns a NaN.
There are different ways to deal with this, and for now it is left to the user to handle it however they prefer. The possible approaches are:

  • ignore such a feature entirely, which is a pity if only a small fraction of the dataset has a NaN for this feature
  • replace the missing value with a constant, either one that is completely outside the range of possible values, or the mean of the feature over the dataset. The former is simple to set up, while for the latter the mean of the feature has to be stored somewhere so that it can be reused when the feature is NaN for a new prediction (which may not have access to the original dataset).

I believe that these two possibilities could be implemented in matminer as an optional post-processing step that the user can apply or not. This is debatable, because it could also be left to the user to handle them.
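
As a rough illustration of the constant-fill and dataset-mean options, here is a minimal sketch in plain Python. The helper names `fit_feature_means` and `impute` are hypothetical and not part of matminer; the feature rows are toy numbers. The point is only that the means are computed once on the training data and then stored for reuse on new samples.

```python
import math


def fit_feature_means(rows):
    """Column-wise means over the training rows, ignoring NaN entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if not math.isnan(row[j])]
        # A column that is NaN everywhere has no usable mean.
        means.append(sum(present) / len(present) if present else math.nan)
    return means


def impute(rows, fill_values):
    """Replace each NaN with the fill value of its column."""
    return [
        [fill_values[j] if math.isnan(value) else value
         for j, value in enumerate(row)]
        for row in rows
    ]


# Toy training features (made-up numbers); one entry is missing.
train = [[1.0, 10.0], [3.0, math.nan], [5.0, 30.0]]

# Dataset-mean option: fit once, keep `means` alongside the model.
means = fit_feature_means(train)
train_filled = impute(train, means)

# A new sample is imputed with the stored means, not with its own data.
new_filled = impute([[math.nan, 40.0]], means)

# Constant option: a value chosen to lie outside the plausible range.
train_flagged = impute(train, [-1.0, -1.0])
```

The stored `means` list is what would have to travel with a trained model so that predictions made without the original dataset are imputed consistently.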

I see another option that matminer could implement and that users cannot easily do themselves. When a value for an element is missing from the data, ElementProperty could replace it with the mean of that property over all other elements. This differs from using the mean of the feature over the dataset in two ways: it is not biased by the dataset (the user may want one or the other), and nothing extra needs to be done for new predictions, since the treatment is internal to the featurizer. ElementProperty would then never return NaNs because of missing data. This could be enabled with an optional argument, e.g., ElementProperty(missing_is_mean=True).
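
To make the proposal concrete, here is a minimal sketch of the fallback logic, independent of matminer and pymatgen. The property table uses hypothetical element labels and placeholder numbers, and the function names are made up for illustration; only the `missing_is_mean` argument name comes from the proposal above. It is not how ElementProperty is implemented.

```python
import math

# Hypothetical property table: labels and numbers are placeholders,
# not real elemental data. None marks a missing entry.
TOY_PROPERTY = {"A": 100.0, "B": 200.0, "C": 300.0, "X": None}


def lookup_with_mean_fallback(table, element, missing_is_mean=False):
    """Return the tabulated value, or a fallback when it is missing."""
    value = table.get(element)
    if value is not None:
        return value
    if not missing_is_mean:
        return math.nan  # current behaviour: the NaN propagates
    # Mean over every element that does have a value in the table.
    known = [v for v in table.values() if v is not None]
    return sum(known) / len(known)


def mean_property(table, composition, missing_is_mean=False):
    """Fraction-weighted mean of the property over {element: fraction}."""
    total = sum(composition.values())
    weighted = sum(
        fraction * lookup_with_mean_fallback(table, element, missing_is_mean)
        for element, fraction in composition.items()
    )
    return weighted / total


composition = {"A": 0.5, "X": 0.5}
without_fallback = mean_property(TOY_PROPERTY, composition)
with_fallback = mean_property(TOY_PROPERTY, composition, missing_is_mean=True)
```

With the fallback disabled, the missing entry for "X" makes the whole feature NaN; with it enabled, "X" is treated as having the table-wide mean, so the result depends only on the elemental table and not on any dataset.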

If you think this is a good addition to matminer, I would be happy to submit a PR with whichever solution you think is best. I would actually favor implementing all of them, to leave the choice to the user while making the user's life easier.
