Feature | Description |
---|---|
Competition topic | Level 1 domain-fundamentals competition of the NLP track, Naver Boostcamp AI Tech 7th cohort |
Competition description | Given two sentences, infer their STS (Semantic Text Similarity); run in a competition format similar to Kaggle and Dacon |
Data composition | Sentences from Slack conversations, Naver movie reviews, and national petitions. Train (9324), Dev (550), Test (1100) |
Evaluation metric | Model performance is measured with the Pearson correlation coefficient |
Team member | Role |
---|---|
김진재 | EDA, methodology proposals, collaboration environment and baseline management, model exploration, augmentation and preprocessing experiments, ensemble code and experiments |
박규태 | EDA, model exploration, experiments on data augmentation and ensemble techniques, Bagging code and experiments |
윤선웅 | EDA, collaboration environment and baseline management, oversight of data distribution and re-splitting, model exploration, data augmentation experiments, ensemble code and experiments |
이정민 | EDA, model exploration, per-model augmentation and preprocessing experiments, KoEDA augmentation experiments, K-Fold Validation experiments |
임한택 | EDA, model exploration, per-model augmentation and preprocessing experiments, ensemble code and experiments, Stacking model experiments |
Overview | Description |
---|---|
Topic | STS (Semantic Text Similarity): a task that infers the degree of similarity between two sentences as a numeric score |
Goal | Given two sentences (sentence1, sentence2), build an AI model that infers their similarity as a score between 0 and 5 |
Evaluation metric | Pearson correlation coefficient between the actual and predicted values |
Development environment | GPU: 4 Tesla V100 servers, IDE: VS Code, Jupyter Notebook |
Collaboration environment | Notion (progress sharing), GitHub (code and data sharing), Slack (real-time communication) |
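The evaluation metric in the table above can be illustrated with a small standard-library sketch. The function name `pearson` and the sample scores are made up for illustration and are not taken from this repository's code.

```python
import math


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


# Illustrative similarity scores in the 0-5 range (not real competition data).
labels = [0.0, 1.2, 2.4, 3.6, 5.0]
shifted = [x + 1.0 for x in labels]  # same ordering, constant offset
print(pearson(labels, shifted))
```

Because the coefficient ignores constant offsets and positive rescaling, it rewards predictions that track the labels linearly rather than predictions with a small absolute error.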
- The project ran from 2024-09-11 to 2024-09-27.
- The experiments run and applied at each stage of the project are summarized below.
Process | Description |
---|---|
EDA | Data distribution analysis; analysis of the gap between Baseline model predictions and actual values |
Preprocessing | Synonym replacement, word-order change, random deletion |
Augmentation | label 0 - undersampling , label 5 - copied sentence , swapping sentence |
Model selection | upskyy/kf-deberta-multitask , team-lucid/deberta-v3-xlarge-korean , snunlp/KR-ELECTRA-discriminator , kykim/electra-kor-base , monologg/ko-electra-base-v3-discriminator , jhgan/ko-sroberta-multitask , FacebookAI/roberta-large-rtt , deliciouscat/kf-deberta-base-cross-sts , sorryhyun-sentence-embedding-klue-large |
Ensemble | soft voting , Nested Ensemble , Bagging |
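Two of the text-level operations listed in the table, word-order change and random deletion, can be sketched as follows. This is a minimal illustration with hypothetical function names and whitespace tokenization; it is not the repository's `data_augmentation.py`, and synonym replacement is left out because it needs a synonym dictionary.

```python
import random


def random_deletion(sentence, p=0.1, rng=None):
    """Drop each whitespace-separated word with probability p (keep at least one)."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    if not kept:
        # Never return an empty sentence: fall back to one random word.
        kept = [rng.choice(words)]
    return " ".join(kept)


def random_swap(sentence, n_swaps=1, rng=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so repeated runs give the same result
print(random_deletion("the movie was much better than I expected", p=0.2, rng=rng))
print(random_swap("the movie was much better than I expected", n_swaps=2, rng=rng))
```

Whitespace splitting is a simplification; Korean text would more likely be handled at the morpheme or token level.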
- During data augmentation, the per-label augmentation ratio was adjusted to balance the label distribution.
Version | Description | Size |
---|---|---|
original_train_V1 | Original data | 9324 |
augmentation_train_V2 | SWAP , label 0 undersampling + label 5 oversampling | 28722 |
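The three augmentation steps named for `augmentation_train_V2` might look roughly like the sketch below. Everything here is an assumption for illustration: the `augment` function, its parameters, the dictionary keys, and the reading of "copied sentence" as pairing a sentence with itself at label 5. The actual ratios used by the team are not given in this README.

```python
import random


def augment(rows, keep_label0=0.5, n_copied=2, seed=42):
    """Return an augmented copy of rows (dicts with sentence1, sentence2, label)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        # Undersampling: keep each label-0 pair only with probability keep_label0.
        if row["label"] == 0.0 and rng.random() >= keep_label0:
            continue
        out.append(dict(row))
        # Swapping: similarity is symmetric, so the reversed pair keeps its label.
        out.append({
            "sentence1": row["sentence2"],
            "sentence2": row["sentence1"],
            "label": row["label"],
        })
    # Oversampling label 5: pair a sentence with an exact copy of itself.
    for row in rng.sample(rows, min(n_copied, len(rows))):
        out.append({
            "sentence1": row["sentence1"],
            "sentence2": row["sentence1"],
            "label": 5.0,
        })
    return out


# Toy rows, not real competition data.
rows = [
    {"sentence1": "I liked the film", "sentence2": "It is raining", "label": 0.0},
    {"sentence1": "I liked the film", "sentence2": "The film was fine", "label": 2.5},
    {"sentence1": "Please fix this bug", "sentence2": "Fix this bug please", "label": 4.0},
]
for r in augment(rows, keep_label0=0.5, n_copied=1):
    print(r)
```

Tuning `keep_label0` and `n_copied` per label is one way to move the label histogram toward a flatter shape.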
- In the end, 16 models were used in the ensemble.
Model | val_pearson | learning_rate | batch_size | Data used |
---|---|---|---|---|
upskyy/kf-deberta-multitask | 0.9289 | 1e-5 | 16 | augmentation_train_V2 |
team-lucid/deberta-v3-xlarge-korean | 0.9378 | 1e-5 | 16 | augmentation_train_V2 |
team-lucid/deberta-v3-xlarge-korean | 0.9377 | 1e-5 | 16 | original_train_V1 |
snunlp/KR-ELECTRA-discriminator | 0.9325 | 1e-5 | 16 | original_train_V1 |
snunlp/KR-ELECTRA-discriminator | 0.9313 | 1e-5 | 32 | original_train_V1 |
kykim/electra-kor-base | 0.9255 | 1e-5 | 16 | original_train_V1 |
monologg/ko-electra-base-v3-discriminator | 0.9252 | 1e-5 | 16 | original_train_V1 |
kykim/electra-kor-base | 0.9252 | 1e-5 | 16 | augmentation_train_V2 |
jhgan/ko-sroberta-multitask | 0.9249 | 1e-5 | 16 | original_train_V1 |
FacebookAI/roberta-large-rtt | 0.9249 | 1e-5 | 16 | original_train_V1 |
snunlp/KR-ELECTRA-discriminator | 0.9223 | 1e-5 | 16 | augmentation_train_V2 |
sorryhyun-sentence-embedding-klue-large | 0.9301 | 1e-5 | 16 | augmentation_train_V2 |
FacebookAI/xlm-roberta-large | 0.9287 | 1e-5 | 16 | augmentation_train_V2 |
deliciouscat/kf-deberta-base-cross-sts | 0.929 | 1e-5 | 16 | augmentation_train_V2 |
team-lucid-deberta-v3-xlarge-korean | 0.9399 | 1e-5 | 16 | augmentation_train_V2 |
snunlp-KR-ELECTRA-discriminator | 0.9336 | 1e-5 | 16 | augmentation_train_V2 |
📁 level1-semantictextsimilarity-nlp-15
├── README.md
├── requirements.txt
└── src
    ├── config.yaml
    ├── csv_ensemble
    ├── checkpoint
    ├── data
    ├── model
    │   └── model.py
    ├── output
    ├── run.py
    ├── train.py
    ├── inference.py
    ├── bagging.py
    ├── ensemble.py
    └── util
        ├── data_augmentation.py
        └── util.py
- checkpoint : folder where checkpoint files (ckpt) are saved
- csv_ensemble : folder holding the csv outputs to be ensembled
- config : yaml file with model settings
- data : folder for training and inference data (put the train, dev, and test files here)
- model : code containing the model class, plus model .pt files
- output : csv files with model training results
- util : other utility code (dataset, dataloader, tokenizer)
- run.py : runs training and inference
- train.py : runs training
- inference.py : runs inference
- ensemble.py : runs the ensemble
- Paths, hyperparameter values, and similar settings are all managed in config.yaml.
- run.py loops over every model listed in config.yaml and trains each one, so to change which models are trained, comment entries in or out in the yaml file.
- For ensembling, tune ensemble_weight in config.yaml carefully. If the length differs, Soft Voting is performed automatically.
- Please report errors or questions through a git issue.
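The ensemble note above can be read as a weighted average with a fallback, sketched below. The function `ensemble_predictions` is hypothetical and is not the repository's `ensemble.py`. It assumes that "length differs" means the number of weights does not match the number of models, and that Soft Voting means an equal-weight average of the predicted scores.

```python
def ensemble_predictions(predictions, weights=None):
    """Combine per-model score lists into one list of averaged scores.

    predictions: one list of predicted scores per model, all the same length.
    weights: one weight per model; on a length mismatch, fall back to equal
    weights (plain soft voting).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_models = len(predictions)
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    if weights is None or len(weights) != n_models:
        weights = [1.0] * n_models  # fallback: equal-weight soft voting
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Toy predictions from two hypothetical models on two test pairs.
model_a = [1.0, 2.0]
model_b = [3.0, 4.0]
print(ensemble_predictions([model_a, model_b], weights=[3.0, 1.0]))
print(ensemble_predictions([model_a, model_b], weights=[1.0]))  # mismatch
```

Dividing by the weight sum keeps the result in the same 0 to 5 range as the inputs, whatever scale the weights use.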
- pip install -r requirements.txt
- Put the train, dev, and test csv files in the /src/data directory
- Put sample_submission.csv in the /src/output directory
- Set the models and augmentation methods in config.yaml
- Execute run.py