Description
Before Asking
- I have pulled the latest code of the main branch and run it again, and the problem still exists.
Search before asking
Question
My first custom operator loads label files and images into each sample as a cache for later OPs to process. That means the first OP's map step loads the file or image into memory for every sample. With over 1 million samples, this causes heavy memory overhead and eventually OOM. How can I resolve that? Does Data-Juicer support splitting the samples into meta-batches and applying the whole series of OPs to each meta-batch? (I would rather not rely on caching because it takes a huge amount of disk space.)
My pipeline of OPs works as follows: read_label_filenames -> load_label_files -> LABEL_PROCESS1 -> LABEL_PROCESS2 -> ...
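To put it concretely: when my operator loads label files or images as a cache or a new value for several later operators to work on, the result of that dataset.map step consumes a very large amount of memory, especially with data on the order of 1 million samples. Is there a way to automatically split the dataset from 1 million samples into chunks of 10k and process only 10k at a time? Does that require writing the loop by hand, or can the framework support it?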
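To illustrate the kind of meta-batching I mean, here is a minimal hand-written sketch in plain Python. It is not Data-Juicer's API: `iter_meta_batches`, `run_pipeline`, the `label_cache` and `label_path` keys, and the two stand-in OPs are all hypothetical names used only for illustration. The idea is that only one meta-batch holds loaded label data at a time, and the cache field is dropped before moving on.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, object]
Op = Callable[[Sample], Sample]


def iter_meta_batches(samples: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items from `samples`."""
    it = iter(samples)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_pipeline(samples: Iterable[Sample], ops: List[Op],
                 size: int = 10_000) -> Iterator[Sample]:
    """Apply every OP to one meta-batch before touching the next one."""
    for batch in iter_meta_batches(samples, size):
        for op in ops:
            batch = [op(sample) for sample in batch]
        for sample in batch:
            # Drop the heavy cache (hypothetical key) so it does not accumulate.
            sample.pop("label_cache", None)
            yield sample


def load_label_file(sample: Sample) -> Sample:
    # Stand-in for the real loader; real code would read sample["label_path"].
    return {**sample, "label_cache": "contents of " + str(sample["label_path"])}


def label_process1(sample: Sample) -> Sample:
    # Stand-in for LABEL_PROCESS1: derive a small value from the cached data.
    return {**sample, "label_len": len(str(sample["label_cache"]))}


if __name__ == "__main__":
    lazy_samples = ({"label_path": f"{i}.txt"} for i in range(25))
    for result in run_pipeline(lazy_samples, [load_label_file, label_process1], size=10):
        print(result)
```

Because `run_pipeline` is a generator, peak memory is bounded by one meta-batch as long as the caller consumes results incrementally (for example by writing each one to disk) instead of collecting them all in a list. What I am asking is whether the framework can do the equivalent of this loop for me.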
Additional
No response