DistributedEmbedding: Add a way to record FDO stats and update input config (max_ids, max_unique_ids, etc.) #145

@abheesht17

We often get warnings while training, similar to:

  08:44:12.046114: W jax_tpu_embedding/sparsecore/lib/core/input_preprocessing_util.cc:251] No Coo Buffer Size provided for table cat_14_table_cat_15_table_cat_23_table_cat_24_table_cat_25_table_cat_33_table_cat_34_table_cat_35_table_cat_36_table, the default value (6144) may be too large and can cause OOM. Utilize the stats returned from the sparse dense matmul preprocessing API.

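Recording the FDO stats and using them to update the input config (max_ids, max_unique_ids, etc.), similar to what the RecML DLRM example does, will fix these warnings: https://github.com/AI-Hypercomputer/RecML/blob/1821350b346b66479baaa0ab624aa67929305dea/examples/dlrm/dlrm_main.py#L675-L680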
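
For illustration, here is a minimal, library-independent sketch of the requested behaviour: keep a running maximum of the per-table stats seen during preprocessing, then derive padded limits from them. Every name in it (`TableLimits`, `StatsRecorder`, `record`, `updated_limits`, the `headroom` factor, the shortened table name) is hypothetical and is not part of the jax_tpu_embedding or DistributedEmbedding API; the real stats layout and update call may differ.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TableLimits:
    """Hypothetical per-table limits, stand-ins for max_ids / max_unique_ids."""
    max_ids: int
    max_unique_ids: int


@dataclass
class StatsRecorder:
    """Keeps the running maximum of the stats observed for each table."""
    observed: dict = field(default_factory=dict)

    def record(self, batch_stats):
        # batch_stats: {table_name: (ids_in_batch, unique_ids_in_batch)}
        for table, (ids, unique_ids) in batch_stats.items():
            prev_ids, prev_unique = self.observed.get(table, (0, 0))
            self.observed[table] = (
                max(prev_ids, ids),
                max(prev_unique, unique_ids),
            )

    def updated_limits(self, headroom=1.25):
        # Pad the observed maxima so slightly larger batches still fit.
        return {
            table: TableLimits(
                max_ids=math.ceil(ids * headroom),
                max_unique_ids=math.ceil(unique_ids * headroom),
            )
            for table, (ids, unique_ids) in self.observed.items()
        }


# Record stats from two (made-up) batches, then derive new limits.
recorder = StatsRecorder()
recorder.record({"cat_14_table": (120, 80)})
recorder.record({"cat_14_table": (200, 64)})
limits = recorder.updated_limits()
```

In an actual implementation, the derived limits would be written back into the table or feature configuration at some interval (for example, every N training steps) instead of relying on the default buffer size mentioned in the warning.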
