
How to reduce the memory overhead when processing millions of samples? #734


Description

@DonaldRR

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch, run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

My first custom operator loads a label file or image into memory as a cache for each sample, so that several later OPs can operate on that cached value. This means that during the first OP's `map` pass, the file or image is held in memory for every sample, which causes heavy memory overhead and OOM once there are over 1 million samples. How can I resolve that? Does Data-Juicer support splitting the dataset into meta-batches (e.g., 10k samples at a time out of 1M) and applying the series of OPs to each meta-batch in turn, or do I have to write that loop manually? (I would rather not rely on on-disk caching, because it takes a HUGE amount of disk space.)

My pipeline of OPs works as: read_label_filenames -> load_label_files -> LABEL_PROCESS1 -> LABEL_PROCESS2 -> ...
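
For context, this is the kind of manual meta-batch loop I mean, sketched with Hugging Face `datasets` (which Data-Juicer builds on). Everything here is a hypothetical placeholder: the file name, the `label_path`/`label_cache` columns, and the OP stubs stand in for the custom operators above and are not real Data-Juicer APIs:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical stand-ins for the custom OPs in the pipeline above.
def load_label_files(sample):
    # Load the per-sample label file into memory as a cache for later OPs.
    with open(sample["label_path"], "rb") as f:
        sample["label_cache"] = f.read()
    return sample

def label_process1(sample):
    # ... real label processing on sample["label_cache"] goes here ...
    return sample

NUM_SHARDS = 100  # ~10k samples per shard for a 1M-sample dataset

ds = load_dataset("json", data_files="samples.jsonl", split="train")

processed = []
for i in range(NUM_SHARDS):
    shard = ds.shard(num_shards=NUM_SHARDS, index=i, contiguous=True)
    # keep_in_memory=True avoids writing map() results to the on-disk
    # cache; peak memory stays bounded by one shard at a time.
    shard = shard.map(load_label_files, keep_in_memory=True)
    shard = shard.map(label_process1, keep_in_memory=True)
    # Drop the heavy cache column before keeping the shard around.
    shard = shard.remove_columns("label_cache")
    processed.append(shard)

result = concatenate_datasets(processed)
```

A possibly simpler alternative, if the OPs can tolerate it, is `load_dataset(..., streaming=True)`, which returns an `IterableDataset` and processes samples lazily instead of materializing them all; I am not sure how well Data-Juicer's OPs work with streaming datasets, though.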

Additional

No response
