Materials Project time split dataset - load_data_from_json returns None during debugging (conditionally) #832

@sgbaird

Sorted by earliest year of reference, limited to experimental entries with fewer than 52 sites: https://figshare.com/articles/dataset/Materials_Project_Time_Split_Data/19991516

How does this seem as a matminer dataset contribution? See "How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018" and sparks-baird/xtal2png#12 (comment) for additional context. I'm starting to feel like I'm reinventing the wheel by trying to host it myself.
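For anyone unfamiliar with the idea, here is a minimal sketch of a year-based split using pandas. The column names (`material_id`, `year`), the toy rows, and the helper `time_split` are hypothetical stand-ins, not the actual schema of the figshare file, and placing the cutoff year itself in the test set is an assumption.

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are hypothetical.
df = pd.DataFrame(
    {
        "material_id": ["mp-1", "mp-2", "mp-3", "mp-4"],
        "year": [2019, 1998, 2018, 2005],
    }
)


def time_split(frame, cutoff_year, year_col="year"):
    """Return (train, test): rows before cutoff_year, and rows from cutoff_year on."""
    # Sort chronologically so both halves keep the time ordering.
    ordered = frame.sort_values(year_col, kind="stable").reset_index(drop=True)
    is_train = ordered[year_col] < cutoff_year
    return ordered[is_train], ordered[~is_train]


# Assumption: entries from 2018 itself go to the test set.
train, test = time_split(df, cutoff_year=2018)
print(train)
print(test)
```

Sorting first means the same helper could also be used for rolling or expanding splits by changing `cutoff_year`.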

In my own code, I've been running into a strange issue where if I use:

    import json
    import sys

    import pandas
    from monty.io import zopen
    from monty.json import MontyDecoder
    from tqdm import tqdm


    def load_dataframe_from_json(filename, pbar=True, decode=True):
        """Load pandas dataframe from a json file.

        Automatically decodes and instantiates pymatgen objects in the dataframe.

        Args:
            filename (str): Path to json file. Can be a compressed file (gz and bz2
                are supported).
            pbar (bool): If true, shows an ASCII progress bar for loading data from disk.
            decode (bool): If true, will automatically decode objects (slow, convenient).
                If false, will return json representations of the objects (fast, inconvenient).

        Returns:
            (Pandas.DataFrame): A pandas dataframe.
        """
        # Progress bar for reading file with hook
        pbar1 = tqdm(desc=f"Reading file {filename}", position=0, leave=True, ascii=True, disable=not pbar)

        def is_monty_object(o):
            """
            Determine if an object can be decoded into json
            by monty.

            Args:
                o (object): An object in dict-form.

            Returns:
                (bool)
            """
            if isinstance(o, dict) and "@class" in o:
                return True
            else:
                return False

        def pbar_hook(obj):
            """
            A hook for a pbar reading the raw data from json, not
            using monty decoding to decode the object.

            Args:
                obj (object): A dict-like

            Returns:
                obj (object)
            """
            if is_monty_object(obj):
                pbar1.update(1)
                sys.stderr.flush()
            return obj

        # Progress bar for decoding objects
        pbar2 = tqdm(desc=f"Decoding objects from {filename}", position=0, leave=True, ascii=True, disable=not pbar)

        class MontyDecoderPbar(MontyDecoder):
            """
            A pbar-friendly version of MontyDecoder.
            """

            def process_decoded(self, d):
                if isinstance(d, dict) and "data" in d and "index" in d and "columns" in d:
                    # total number of objects to decode
                    # is the number of @class mentions
                    pbar2.total = str(d).count("@class")
                elif is_monty_object(d):
                    pbar2.update(1)
                    sys.stderr.flush()
                return super().process_decoded(d)

        if decode:
            decoder = MontyDecoderPbar if pbar else MontyDecoder
        else:
            decoder = None
        hook = pbar_hook if pbar else lambda x: x

        with zopen(filename, "rb") as f:
            dataframe_data = json.load(f, cls=decoder, object_hook=hook)

        pbar1.close()
        pbar2.close()

        # if only keys are data, columns, index then orient=split
        if isinstance(dataframe_data, dict):
            if set(dataframe_data.keys()) == {"data", "columns", "index"}:
                return pandas.DataFrame(**dataframe_data)
        else:
            return pandas.DataFrame(dataframe_data)

It returns None during an uninterrupted debugging run, but if I set a breakpoint and run the line manually in the debug console (VS Code) then it returns the expected DataFrame.
See https://github.com/sparks-baird/mp-time-split/runs/6739787243?check_suite_focus=true/#step:5:1
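One thing that might be worth checking, though it is not a confirmed diagnosis: the final `if`/`else` in the function has at least one path with no `return`, so Python returns `None` implicitly. If the trailing `else` pairs with the outer `isinstance` check, a dict whose keys are not exactly `data`, `columns`, and `index` falls through; if it pairs with the inner check, a non-dict payload does. Below is a rough sketch of a stricter final step that raises instead of returning `None`. The helper name `to_dataframe` and the sample payloads are hypothetical, and this is not matminer's code.

```python
import pandas as pd


def to_dataframe(dataframe_data):
    """Hypothetical stand-in for the loader's last step, with no implicit None."""
    if isinstance(dataframe_data, dict):
        if set(dataframe_data.keys()) == {"data", "columns", "index"}:
            # orient="split" layout: pass the three parts straight through.
            return pd.DataFrame(**dataframe_data)
        # Any other dict: let pandas treat it as column -> values.
        return pd.DataFrame(dataframe_data)
    if isinstance(dataframe_data, list):
        # A list of records or rows.
        return pd.DataFrame(dataframe_data)
    # Fail loudly so an unexpected payload is visible at the call site.
    raise TypeError(f"Unexpected JSON payload of type {type(dataframe_data).__name__}")


split_payload = {"data": [[1, 2], [3, 4]], "columns": ["a", "b"], "index": [0, 1]}
column_payload = {"a": [1, 3], "b": [2, 4]}

df_split = to_dataframe(split_payload)
df_cols = to_dataframe(column_payload)
print(df_split)
print(df_cols)
```

Printing `type(dataframe_data)` and its keys just before the final branch in both the uninterrupted run and the debug console would show whether the two runs actually see different payloads.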
