
How to reduce the memory overhead when processing millions of samples? #734


Description

@DonaldRR

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch, run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

My first custom operator loads a label file or image into memory as a cache for each sample, so that several later OPs can operate on that cached value. This means that during the first OP's `map` pass, the file or image is held in memory for every sample, which causes heavy memory overhead and OOM once there are over 1 million samples. How can I resolve that? Does Data-Juicer support splitting the dataset into meta-batches (e.g., 10k samples at a time out of 1M) and applying the series of OPs to each meta-batch in turn, or do I have to write that loop manually? (I would rather not rely on on-disk caching, because it takes a HUGE amount of disk space.)

My pipeline of OPs works as: read_label_filenames -> load_label_files -> LABEL_PROCESS1 -> LABEL_PROCESS2 -> ...
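
For context, this is the kind of manual meta-batch loop I mean, sketched with Hugging Face `datasets` (which Data-Juicer builds on). Everything here is a hypothetical placeholder: the file name, the `label_path`/`label_cache` columns, and the OP stubs stand in for the custom operators above and are not real Data-Juicer APIs:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical stand-ins for the custom OPs in the pipeline above.
def load_label_files(sample):
    # Load the per-sample label file into memory as a cache for later OPs.
    with open(sample["label_path"], "rb") as f:
        sample["label_cache"] = f.read()
    return sample

def label_process1(sample):
    # ... real label processing on sample["label_cache"] goes here ...
    return sample

NUM_SHARDS = 100  # ~10k samples per shard for a 1M-sample dataset

ds = load_dataset("json", data_files="samples.jsonl", split="train")

processed = []
for i in range(NUM_SHARDS):
    shard = ds.shard(num_shards=NUM_SHARDS, index=i, contiguous=True)
    # keep_in_memory=True avoids writing map() results to the on-disk
    # cache; peak memory stays bounded by one shard at a time.
    shard = shard.map(load_label_files, keep_in_memory=True)
    shard = shard.map(label_process1, keep_in_memory=True)
    # Drop the heavy cache column before keeping the shard around.
    shard = shard.remove_columns("label_cache")
    processed.append(shard)

result = concatenate_datasets(processed)
```

A possibly simpler alternative, if the OPs can tolerate it, is `load_dataset(..., streaming=True)`, which returns an `IterableDataset` and processes samples lazily instead of materializing them all; I am not sure how well Data-Juicer's OPs work with streaming datasets, though.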

Additional

No response
