
boostcampaitech7/level1-semantictextsimilarity-nlp-15


πŸ† LV.1 NLP 기초 ν”„λ‘œμ νŠΈ : λ¬Έλ§₯적 μœ μ‚¬λ„ μΈ‘μ • (STS)

✏️ λŒ€νšŒ μ†Œκ°œ

| Item | Description |
| --- | --- |
| Competition topic | Level 1 domain-basics competition of the NLP track, Naver Boostcamp AI Tech 7th cohort |
| Competition description | Given two sentences, infer their STS (Semantic Text Similarity); run in a competition format similar to Kaggle and Dacon |
| Data composition | Sentences from Slack conversations, Naver movie reviews, and national petitions. Train (9,324), Dev (550), Test (1,100) |
| Evaluation metric | Model performance is measured with the Pearson correlation coefficient |
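
As a rough illustration of the evaluation metric, the sketch below computes the Pearson correlation coefficient between gold similarity scores and model predictions using only the standard library. The function name `pearson` and the sample values are assumptions made for illustration; they are not taken from the competition's scoring code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold labels (0 to 5) and model predictions.
labels = [0.0, 1.2, 2.4, 3.6, 5.0]
preds = [0.3, 1.0, 2.9, 3.1, 4.8]
score = pearson(labels, preds)
print(f"pearson = {score:.4f}")
```

Because the metric only measures linear agreement, predictions that are shifted or scaled consistently can still score well; what matters is that they rise and fall with the gold labels.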

πŸŽ–οΈ Leader Board

πŸ₯ˆ Public Leader Board (2μœ„)

πŸ₯‰ Private Leader Board (3μœ„)

πŸ‘¨β€πŸ’» 15μ‘°κ°€μ‹­μ˜€μ‘° 멀버

κΉ€μ§„μž¬ λ°•κ·œνƒœ μœ€μ„ μ›… 이정민 μž„ν•œνƒ

👼 Roles

| Team member | Role |
| --- | --- |
| 김진재 | EDA, methodology proposals, collaboration environment and baseline management, model exploration, augmentation and preprocessing experiments, ensemble code and experiments |
| 박규태 | EDA, model exploration, experiments on data augmentation and ensemble techniques, Bagging code and experiments |
| 윤선웅 | EDA, collaboration environment and baseline management, lead on data distribution and re-splitting, model exploration, data augmentation experiments, ensemble code and experiments |
| 이정민 | EDA, model exploration, per-model augmentation and preprocessing experiments, KoEDA augmentation experiments, K-Fold Validation experiments |
| 임한택 | EDA, model exploration, per-model augmentation and preprocessing experiments, ensemble code and experiments, Stacking model experiments |

πŸƒ ν”„λ‘œμ νŠΈ

πŸ–₯️ ν”„λ‘œμ νŠΈ κ°œμš”

| Item | Description |
| --- | --- |
| Topic | STS (Semantic Text Similarity): a task that infers the degree of similarity between two sentences as a number |
| Goal | Build an AI model that, given two sentences (sentence1, sentence2), infers their similarity as a score between 0 and 5 |
| Evaluation metric | Pearson correlation coefficient between the actual and predicted values |
| Development environment | GPU: 4 Tesla V100 servers; IDE: VS Code, Jupyter Notebook |
| Collaboration environment | Notion (progress sharing), GitHub (code and data sharing), Slack (real-time communication) |

📅 Project Timeline

  • The project ran from 2024-09-11 to 2024-09-27.

🕵️ Project Process

  • The techniques we experimented with and applied at each stage of the project are listed below.

| Process | Description |
| --- | --- |
| EDA | Data distribution analysis; analysis of the gap between Baseline model predictions and actual values |
| Preprocessing | Synonym replacement, word order change, random deletion |
| Augmentation | label 0 - undersampling, label 5 - copied sentence, swapping sentence |
| Model selection | upskyy/kf-deberta-multitask, team-lucid/deberta-v3-xlarge-korean, snunlp/KR-ELECTRA-discriminator, kykim/electra-kor-base, monologg/ko-electra-base-v3-discriminator, jhgan/ko-sroberta-multitask, FacebookAI/roberta-large-rtt, deliciouscat/kf-deberta-base-cross-sts, sorryhyun-sentence-embedding-klue-large |
| Ensemble | soft voting, Nested Ensemble, Bagging |

📊 Dataset

  • During data augmentation, we adjusted the augmentation ratio for each label to balance the label distribution.

| Version | Description | Size |
| --- | --- | --- |
| original_train_V1 | Original data | 9324 |
| augmentation_train_V2 | SWAP, label 0 undersampling + label 5 oversampling | 28722 |
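
The three augmentation ideas named above (undersampling label 0, swapping the sentence pair, and creating label 5 pairs from a copied sentence) could look roughly like the following. This is a minimal sketch, not the contents of `data_augmentation.py`: the function `augment`, its parameters `zero_keep_ratio` and `n_copied`, and the row keys `sentence1`, `sentence2`, and `label` are all assumptions for illustration, and the ratios do not reproduce the dataset sizes in the table.

```python
import random


def augment(rows, zero_keep_ratio=0.5, n_copied=2, seed=0):
    """Return an augmented copy of `rows` (a list of dicts).

    Hypothetical row format: {"sentence1": str, "sentence2": str, "label": float}.
    """
    rng = random.Random(seed)

    # 1) Undersample label 0: keep only a fraction of the label-0 pairs.
    zeros = [r for r in rows if r["label"] == 0.0]
    others = [r for r in rows if r["label"] != 0.0]
    kept = others + rng.sample(zeros, int(len(zeros) * zero_keep_ratio))

    # 2) Swapping sentence: similarity is symmetric, so (s2, s1) keeps the label.
    swapped = [
        {"sentence1": r["sentence2"], "sentence2": r["sentence1"], "label": r["label"]}
        for r in kept
    ]

    # 3) Copied sentence: a sentence paired with itself is maximally similar.
    picked = rng.sample(kept, min(n_copied, len(kept)))
    copied = [
        {"sentence1": r["sentence1"], "sentence2": r["sentence1"], "label": 5.0}
        for r in picked
    ]

    return kept + swapped + copied


# Toy data: four label-0 pairs and two other pairs.
rows = [{"sentence1": f"a{i}", "sentence2": f"b{i}", "label": 0.0} for i in range(4)]
rows += [
    {"sentence1": "x", "sentence2": "y", "label": 3.2},
    {"sentence1": "p", "sentence2": "q", "label": 5.0},
]
augmented = augment(rows)
print(len(rows), "->", len(augmented))
```

Augmentation of this kind should be applied to the training split only, so that the dev score still reflects the original label distribution.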

🤖 Ensemble Model

  • In the end, 16 models were used in the ensemble.

| Model | val_pearson | learning_rate | batch_size | Training data |
| --- | --- | --- | --- | --- |
| upskyy/kf-deberta-multitask | 0.9289 | 1e-5 | 16 | augmentation_train_V2 |
| team-lucid/deberta-v3-xlarge-korean | 0.9378 | 1e-5 | 16 | augmentation_train_V2 |
| team-lucid/deberta-v3-xlarge-korean | 0.9377 | 1e-5 | 16 | original_train_V1 |
| snunlp/KR-ELECTRA-discriminator | 0.9325 | 1e-5 | 16 | original_train_V1 |
| snunlp/KR-ELECTRA-discriminator | 0.9313 | 1e-5 | 32 | original_train_V1 |
| kykim/electra-kor-base | 0.9255 | 1e-5 | 16 | original_train_V1 |
| monologg/ko-electra-base-v3-discriminator | 0.9252 | 1e-5 | 16 | original_train_V1 |
| kykim/electra-kor-base | 0.9252 | 1e-5 | 16 | augmentation_train_V2 |
| jhgan/ko-sroberta-multitask | 0.9249 | 1e-5 | 16 | original_train_V1 |
| FacebookAI/roberta-large-rtt | 0.9249 | 1e-5 | 16 | original_train_V1 |
| snunlp/KR-ELECTRA-discriminator | 0.9223 | 1e-5 | 16 | augmentation_train_V2 |
| sorryhyun-sentence-embedding-klue-large | 0.9301 | 1e-5 | 16 | augmentation_train_V2 |
| FacebookAI/xlm-roberta-large | 0.9287 | 1e-5 | 16 | augmentation_train_V2 |
| deliciouscat/kf-deberta-base-cross-sts | 0.929 | 1e-5 | 16 | augmentation_train_V2 |
| team-lucid-deberta-v3-xlarge-korean | 0.9399 | 1e-5 | 16 | augmentation_train_V2 |
| snunlp-KR-ELECTRA-discriminator | 0.9336 | 1e-5 | 16 | augmentation_train_V2 |

πŸ“ ν”„λ‘œμ νŠΈ ꡬ쑰

πŸ“ level1-semantictextsimilarity-nlp-15
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
└── src
    β”œβ”€β”€ config.yaml
    β”œβ”€β”€ csv_ensemble
    β”œβ”€β”€ checkpoint
    β”œβ”€β”€ data
    β”œβ”€β”€ model
    β”‚Β Β  └── model.py
    β”œβ”€β”€ output
    β”œβ”€β”€ run.py
    β”œβ”€β”€ train.py
    β”œβ”€β”€ inference.py
    β”œβ”€β”€ bagging.py
    β”œβ”€β”€ ensemble.py
    └── util
        β”œβ”€β”€ data_augmentation.py
        └── util.py

📦 src Folder Structure

  • checkpoint : folder where checkpoint files (ckpt) are saved
  • csv_ensemble : folder where ensembled csv results are saved
  • config : yaml file with model settings
  • data : folder for training and inference data (put the train, dev, and test files here)
  • model : code containing the model class, plus model .pt files
  • output : csv files with model training results
  • util : other utility code (dataset, dataloader, tokenizer)
  • run.py : runs training and inference
  • train.py : runs training
  • inference.py : runs inference
  • ensemble.py : runs the ensemble

πŸ“ 보좩 μ„€λͺ…

  1. path, ν•˜μ΄νΌνŒŒλΌλ―Έν„° κ°’κ³Ό 같은 것은 μ „λΆ€ config.yamlμ—μ„œ κ΄€λ¦¬ν•©λ‹ˆλ‹€.
  2. config.yaml에 μ‘΄μž¬ν•˜λŠ” λͺ¨λΈ λͺ©λ‘μ΄ μ „λΆ€ run.pyμ—μ„œ for문을 λŒλ €μ„œ ν•™μŠ΅μ„ μ§„ν–‰ν•©λ‹ˆλ‹€.
    λ”°λΌμ„œ λͺ¨λΈμ„ λ³€κ²½ν•  λ•Œ yaml에 주석을 μ΄μš©ν•΄μ£Όμ„Έμš”
  3. 앙상블은 config.yaml의 ensemble_weight을 잘 μ‘°μ ˆν•΄ μ£Όμ„Έμš”. 길이가 λ‹€λ₯΄λ©΄ μžλ™μœΌλ‘œ Soft Voting을 μ§„ν–‰ν•©λ‹ˆλ‹€.
  4. 였λ₯˜λ‚˜ μ§ˆλ¬Έμ€ git issueλ₯Ό 톡해 λ‚¨κ²¨μ£Όμ„Έμš”
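
The ensemble_weight behaviour described in note 3 might be sketched as below. This is an illustration under assumptions rather than the code in `ensemble.py`: the function `ensemble` is hypothetical, "length does not match" is read as the weight list not having one entry per model, and the Soft Voting fallback is taken to mean an equal-weight average of the predictions.

```python
def ensemble(predictions, weights=None):
    """Combine per-model predictions into one score per sample.

    predictions: list of prediction lists, one list per model, all the same length.
    weights: optional list with one weight per model.
    """
    n_models = len(predictions)
    if n_models == 0:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    # Assumed fallback: if the weights do not line up with the models,
    # use equal weights, i.e. a plain average (soft voting).
    if weights is None or len(weights) != n_models:
        weights = [1.0] * n_models

    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


model_a = [1.0, 2.0]
model_b = [3.0, 4.0]
print(ensemble([model_a, model_b], [3, 1]))  # weighted average
print(ensemble([model_a, model_b], [1.0]))   # wrong length: equal weights
```

Dividing by the sum of the weights keeps the combined score on the same 0 to 5 scale as the individual predictions, whatever the weights add up to.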

📦 Installation

  1. Run pip install -r requirements.txt
  2. Put the train, dev, and test csv files in the /src/data directory
  3. Put sample_submission.csv in the /src/output directory
  4. Set the models and augmentation methods in config.yaml
  5. Execute run.py

About

level1-semantictextsimilarity-nlp-15 created by GitHub Classroom
