
Streaming load fails because extra arguments are passed: TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc' #57

@CxsGhost

Reminder

  • I have read the README and searched the existing issues.

System Info

With streaming enabled, loading fails with: TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
The problem is at: https://github.com/Qihoo360/360-LLaMA-Factory/blob/3bc07289eefcf8c8ea05f553e4ef0b82008419e4/src/llamafactory/data/loader.py#L224
On inspection, the map function of IterableDataset in the Datasets library does not accept the three arguments in kwargs:

    kwargs = dict(
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=(not data_args.overwrite_cache) or (training_args.local_process_index != 0),
        desc="Running sequence parallel split on dataset",
    )

Reproduction

Enabling streaming is enough to trigger it: --streaming True

Expected behavior

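The map function of a regular Dataset accepts these arguments:
Image
The map function of the streaming IterableDataset:
Image

Fix: adding an extra check in _get_sequence_parallel_dataset is enough; with that change it currently runs fine for me locally.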
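
A minimal sketch of the kind of check described above, not the actual patch. The helper name `build_map_kwargs` and its parameters are hypothetical; it only mirrors the kwargs quoted in this issue and returns an empty dict when streaming, so nothing unsupported reaches IterableDataset.map():

```python
from typing import Any, Dict, Optional


def build_map_kwargs(
    streaming: bool,
    num_proc: Optional[int],
    overwrite_cache: bool,
    local_process_index: int,
) -> Dict[str, Any]:
    """Return keyword arguments that are safe to forward to `.map()`.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    if streaming:
        # IterableDataset.map() rejects num_proc, load_from_cache_file and desc,
        # so pass none of them in streaming mode.
        return {}
    # Regular Dataset.map() accepts all three arguments.
    return dict(
        num_proc=num_proc,
        load_from_cache_file=(not overwrite_cache) or (local_process_index != 0),
        desc="Running sequence parallel split on dataset",
    )
```

Inside _get_sequence_parallel_dataset the result could then be unpacked into the existing call, for example `dataset.map(..., **build_map_kwargs(data_args.streaming, data_args.preprocessing_num_workers, data_args.overwrite_cache, training_args.local_process_index))`, assuming `data_args.streaming` is the flag set by `--streaming True`.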

Others

No response
