Summary:
Pull Request resolved: #4304
X-link: facebookresearch/FBGEMM#1380
# change set
## Eviction Related
1. Move the eviction trigger to the beginning of each get() call, since get() is called once per iteration.
2. Move resume() to the end of each set() call. There are two set() calls per training iteration: one at the end of prefetch during the forward pass, the other when the SP embedding is updated during the backward pass.
3. Move the DRAM KV iteration counter to the get() call, so it is bumped only once per training iteration.
4. Have the last shard to finish update evict_flag_ automatically, which gives clearer, deterministic state transitions across the different cases.
5. Each eviction round now launches num_shards long-running threads that can be paused and resumed, instead of destroying and creating a thread-pool work item on every pause/resume.
6. Add eviction trigger scheduling logic that enforces a fixed wait interval between two consecutive evictions. This avoids repeated, intensive but ineffective eviction rounds (scanning every item but evicting nothing).
7. Fix get_feature_evict_metric: copy the metrics before leaving the mutex scope; otherwise eviction threads may update them asynchronously.
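
The interaction between the trigger in get(), the flag cleared by the last finished shard, and the minimum wait between rounds can be illustrated with a minimal sketch. This is not the FBGEMM implementation: the class `EvictionSchedulerSketch`, its method names, and the choice to measure the wait interval in training iterations are all assumptions made for illustration.

```python
import threading
from typing import Optional


class EvictionSchedulerSketch:
    """Hypothetical model of the trigger logic; not the real FBGEMM classes."""

    def __init__(self, num_shards: int, min_interval_iters: int) -> None:
        self.num_shards = num_shards
        # Assumption: the forced wait is counted in training iterations.
        self.min_interval_iters = min_interval_iters
        self.iteration = 0
        self.evict_flag = False
        self._shards_remaining = 0
        self._last_finished_iter: Optional[int] = None
        self._lock = threading.Lock()

    def on_get(self, want_evict: bool) -> bool:
        """Called once per training iteration; returns True if a round starts."""
        with self._lock:
            self.iteration += 1  # the counter is bumped only here
            if not want_evict or self.evict_flag:
                return False  # nothing to do, or a round is still running
            too_soon = (
                self._last_finished_iter is not None
                and self.iteration - self._last_finished_iter
                < self.min_interval_iters
            )
            if too_soon:
                return False  # enforce the wait between consecutive rounds
            self.evict_flag = True
            self._shards_remaining = self.num_shards
            return True

    def on_shard_finished(self) -> None:
        """Called by each shard worker when it has scanned its shard."""
        with self._lock:
            self._shards_remaining -= 1
            if self._shards_remaining == 0:
                # The last shard to finish clears the flag.
                self.evict_flag = False
                self._last_finished_iter = self.iteration
```

In this model a new round cannot start while `evict_flag` is set, and after a round completes, further trigger requests are ignored until `min_interval_iters` iterations have passed.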
## Miscellaneous
1. Wrap all eviction configs in FeatureEvictConfig and pass it all the way down from TBE to feature_evict. All future eviction configs will be added to FeatureEvictConfig.
2. Make state_dict wait until any ongoing eviction finishes.
3. Fix the mix-up between the feature hash cumsum and the table hash cumsum: feature eviction needs the table-level hash cumsum, not the feature-level one.
4. Add unit tests for eviction corner cases to verify that the state transitions are as expected.
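
As a rough illustration of item 2, the sketch below shows one way a snapshot call could block until an in-flight eviction completes. The class `EvictingStoreSketch` and its methods are hypothetical and use a plain `threading.Event`; the actual change may synchronize differently.

```python
import threading
from typing import Dict, Optional


class EvictingStoreSketch:
    """Hypothetical key-value store whose snapshot waits for eviction."""

    def __init__(self, data: Dict[str, int]) -> None:
        self._data = dict(data)
        # Set means "no eviction in progress".
        self._eviction_idle = threading.Event()
        self._eviction_idle.set()

    def begin_eviction(self) -> None:
        self._eviction_idle.clear()

    def end_eviction(self) -> None:
        self._eviction_idle.set()

    def state_dict(self, timeout: Optional[float] = None) -> Dict[str, int]:
        # Block until the ongoing eviction (if any) has finished.
        if not self._eviction_idle.wait(timeout):
            raise TimeoutError("eviction still in progress")
        return dict(self._data)  # return a copy, not the live mapping
```

Returning a copy mirrors the get_feature_evict_metric fix above: the caller never holds a reference that eviction threads might mutate later.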
Reviewed By: emlin
Differential Revision: D76244371
fbshipit-source-id: 96b8e0f0563d5615e56d31d0f91c779be1ba1be5