torch.save() for large training Dataset #56

Hello, 

I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs, and preprocessing them takes 4 hours before the actual model training even starts. So I used the provided `--cache_data` option, but when the code reaches `torch.save(self.examples, cache_fn)` it takes forever and uses all my RAM: 60 GB is not enough on a VM, and on my PC with 32 GB of RAM it swaps to the HDD and then takes forever to complete. Do you know of any alternative way to save the Dataset and reuse it before training, so that I can skip the loading overhead?
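
To make the question concrete, here is a rough sketch of the kind of alternative I have in mind: writing the examples out in fixed-size shards instead of serializing everything in one call, so that only one shard is held in memory while saving. This is not awesome-align code. The names `save_shards`, `load_shards`, and `shard_size` are hypothetical, it uses plain `pickle` rather than `torch.save`, and it only helps with memory if the examples can be produced incrementally (for example from a generator) rather than as one big list.

```python
import pickle
from itertools import islice
from pathlib import Path


def save_shards(examples, cache_dir, shard_size=100_000):
    """Write examples to numbered shard files, one shard in memory at a time.

    `examples` can be any iterable; a generator keeps peak memory low.
    Returns the number of shard files written.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    it = iter(examples)
    n_shards = 0
    while True:
        # Pull at most `shard_size` examples from the iterator.
        shard = list(islice(it, shard_size))
        if not shard:
            break
        # Zero-padded names so that sorting by name preserves the order.
        with open(cache_dir / f"shard_{n_shards:05d}.pkl", "wb") as f:
            pickle.dump(shard, f, protocol=pickle.HIGHEST_PROTOCOL)
        n_shards += 1
    return n_shards


def load_shards(cache_dir):
    """Yield examples back in their original order, one shard at a time."""
    for path in sorted(Path(cache_dir).glob("shard_*.pkl")):
        with open(path, "rb") as f:
            yield from pickle.load(f)
```

With something like this, preprocessing would call `save_shards(example_generator, "cache_dir")` once, and later runs would iterate over `load_shards("cache_dir")`. I am not sure how well this fits the existing Dataset class, which is part of what I am asking.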

Many thanks
