[Ready] align sft formats & new ops #454

drcege · 2024-10-21T11:20:52Z

Summary

Unified SFT data format
- add query_key, response_key, history_key in base_op
New OPs & unit-tests
- generate_qa_from_examples_mapper
- generate_qa_from_text_mapper
- optimize_qa_mapper
- optimize_query_mapper
- optimize_response_mapper
Fixed model inference (use chat_template)
- use chat interface for vllm
- use pipeline for HF
Updated docs
Optimized model loading
- load model directly to GPU instead of moving it to CPU first
- properly configure vllm to load only in tensor parallel mode
- preload only models without file locks
- sort prepare_*_model functions

tests/ops/mapper/test_extract_qa_mapper.py

tests/ops/mapper/test_generate_instruction_mapper.py

drcege · 2024-10-23T08:47:51Z

Note: extract_qa_mapper 改为 generate_qa_from_text_mapper 后输出格式变化，不再产出 chatml 的 json string，而是分解为具体的 sft 样本，注意后续对齐。

data_juicer/ops/mapper/optimize_query_mapper.py

configs/config_all.yaml

data_juicer/ops/mapper/generate_qa_from_examples_mapper.py

data_juicer/ops/mapper/generate_qa_from_text_mapper.py

BeachWang

LGTM

align sft formats

134f61f

drcege requested review from BeachWang, Cathy0908, HYLcool, garyzhang99, yxdyc and zhijianma October 21, 2024 11:20

drcege self-assigned this Oct 21, 2024

drcege had a problem deploying to Testing October 21, 2024 11:20 — with GitHub Actions Failure

fix test

a68d925

drcege temporarily deployed to Testing October 22, 2024 03:06 — with GitHub Actions Inactive

drcege marked this pull request as ready for review October 22, 2024 07:40

Merge branch 'main' into sft/align_ops

c5e6a6b

drcege had a problem deploying to Testing October 22, 2024 11:27 — with GitHub Actions Failure

minor fix

87fc4bb

drcege temporarily deployed to Testing October 22, 2024 11:29 — with GitHub Actions Inactive

BeachWang reviewed Oct 23, 2024

View reviewed changes

tests/ops/mapper/test_extract_qa_mapper.py Outdated Show resolved Hide resolved

BeachWang reviewed Oct 23, 2024

View reviewed changes

tests/ops/mapper/test_generate_instruction_mapper.py Outdated Show resolved Hide resolved

drcege added 4 commits October 23, 2024 15:42

improve tests assert

d309428

pre-commit

df3610c

sort

cf521cc

Merge branch 'main' into sft/align_ops

8463815

drcege temporarily deployed to Testing October 23, 2024 08:35 — with GitHub Actions Inactive

drcege added the enhancement New feature or request label Oct 23, 2024

drcege requested a review from BeachWang October 23, 2024 08:48

add associated ops

cc39ef7

drcege had a problem deploying to Testing October 24, 2024 07:51 — with GitHub Actions Failure

add tests

f2201e2

drcege had a problem deploying to Testing October 24, 2024 08:06 — with GitHub Actions Failure

Merge branch 'main' into sft/align_ops

4f7866f

drcege had a problem deploying to Testing October 29, 2024 03:03 — with GitHub Actions Error

drcege marked this pull request as ready for review October 29, 2024 03:04

drcege temporarily deployed to Testing October 29, 2024 07:00 — with GitHub Actions Inactive

refine model loading

dad81cd

drcege had a problem deploying to Testing October 31, 2024 09:14 — with GitHub Actions Failure

fix empty history schema

222512c

drcege had a problem deploying to Testing October 31, 2024 10:41 — with GitHub Actions Failure

drcege added 2 commits October 31, 2024 10:56

fix device

ba4a788

ensure with_rank is set properly

47f6b8e

drcege temporarily deployed to Testing October 31, 2024 11:17 — with GitHub Actions Inactive

drcege changed the title ~~align sft formats & new ops~~ [Ready] align sft formats & new ops Oct 31, 2024

BeachWang reviewed Oct 31, 2024

View reviewed changes

data_juicer/ops/mapper/optimize_query_mapper.py Outdated Show resolved Hide resolved

fix diffusion model_params

da8b254

drcege temporarily deployed to Testing November 1, 2024 03:11 — with GitHub Actions Inactive

drcege mentioned this pull request Nov 1, 2024

[Ready] Add API Call & Example OPs #463

Merged

1 task

BeachWang reviewed Nov 1, 2024

View reviewed changes

minor fix

a265398

drcege temporarily deployed to Testing November 1, 2024 08:31 — with GitHub Actions Inactive

Merge branch 'main' into sft/align_ops

872b08d

drcege had a problem deploying to Testing November 4, 2024 02:16 — with GitHub Actions Error

TODO: new OP tests to be checked

c6d5147

drcege had a problem deploying to Testing November 4, 2024 02:20 — with GitHub Actions Failure

drcege had a problem deploying to Testing November 4, 2024 06:05 — with GitHub Actions Failure

drcege temporarily deployed to Testing November 4, 2024 11:17 — with GitHub Actions Inactive

drcege requested a review from BeachWang November 5, 2024 02:04

BeachWang approved these changes Nov 5, 2024

View reviewed changes

drcege merged commit 65d7c91 into main Nov 5, 2024
3 checks passed

yxdyc mentioned this pull request Dec 10, 2024

sharegpt format support #488

Closed

3 tasks

HYLcool deleted the sft/align_ops branch February 24, 2025 06:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Ready] align sft formats & new ops #454

[Ready] align sft formats & new ops #454

Uh oh!

drcege commented Oct 21, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

drcege commented Oct 23, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BeachWang left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Ready] align sft formats & new ops #454

[Ready] align sft formats & new ops #454

Uh oh!

Conversation

drcege commented Oct 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Uh oh!

drcege commented Oct 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BeachWang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

drcege commented Oct 21, 2024 •

edited

Loading

drcege commented Oct 23, 2024 •

edited

Loading