Fine-Tuning Embedding Models for Retrieval on AWS Documentation (MRL)

Fine-tuned ModernBERT on chunked AWS documentation with Matryoshka embeddings (768→64d) to power RAG retrieval, delivering +40–65% gains on nDCG/MRR/MAP and a +63.6% boost to the 64d seq_score.

Links: Dataset · Model · Space


At a Glance

  • Domain: AWS service documentation (top services), chunked + synthetic Q→chunk pairs
  • Dataset size: 3,680 positive pairs → 3,312 train / 368 test (90/10 split)
  • Model: nomic-ai/modernbert-embed-base (ModernBERT) → fine-tuned with MRL
  • Dims supported: 768, 512, 256, 128, 64 (see the encoding sketch after this list)
  • Evaluator: Sentence-Transformers InformationRetrievalEvaluator (cosine)
  • Hardware: Hugging Face Spaces NVIDIA T4 (Small)
  • Training time: ~276s (4.6 min) for 4 epochs (train_runtime: 275.7s)
  • Global batch size: 512 (32 per device × grad_accum=16)
  • Best single-number score (seq_score = 64d nDCG@10): 0.1280 → 0.2095 (+63.6%)
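
For a sense of how the multi-dimension support is used, here is a minimal encoding sketch with Sentence-Transformers (the truncate_dim argument does the Matryoshka truncation; the search_query:/search_document: prefixes follow the base model's card, and the example strings are invented):

```python
from sentence_transformers import SentenceTransformer

# Load the model with output embeddings truncated to 64 Matryoshka dims.
# The fine-tuned checkpoint loads the same way.
model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=64)

# modernbert-embed-base expects task prefixes on queries and documents.
queries = ["search_query: How do I enable versioning on an S3 bucket?"]
docs = [
    "search_document: You can enable versioning on a bucket from the S3 console ...",
    "search_document: Amazon EC2 Auto Scaling helps maintain application availability ...",
]

q_emb = model.encode(queries)  # shape: (1, 64)
d_emb = model.encode(docs)     # shape: (2, 64)
print(model.similarity(q_emb, d_emb))  # cosine similarity by default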

Key Results (Same-Dimension Comparisons)

Average (across all dims) percentage improvements:

  • nDCG@10: +52.6% (avg Δ +0.0917)
  • MRR@10: +54.3% (avg Δ +0.0555)
  • MAP@100: +43.7% (avg Δ +0.0526)
  • Accuracy@10: +50.6% (avg Δ +0.2060)

Selected callouts:

  • 768d nDCG@10: 0.1989 → 0.3032 (+52.4%)
  • 768d MRR@10: 0.1152 → 0.1802 (+56.4%)
  • 768d MAP@100: 0.1337 → 0.1932 (+44.5%)
  • 64d nDCG@10 (Matryoshka low-dim): 0.1280 → 0.2095 (+63.7%)
  • Accuracy@10 (256d): 0.4375 → 0.6658 (+52.2%)

Full Evaluation Tables

Base Model (ModernBERT-embed-base)

Base Model Evaluation Results
-------------------------------------------------------------------------------------
Metric                   768d          512d          256d          128d           64d
-------------------------------------------------------------------------------------
ndcg@10               0.1989       0.2049       0.1870       0.1683       0.1280
mrr@10                0.1152       0.1215       0.1100       0.0985       0.0763
map@100               0.1337       0.1396       0.1291       0.1173       0.0940
accuracy@1            0.0027       0.0082       0.0054       0.0000       0.0054
accuracy@3            0.1196       0.1304       0.1168       0.1196       0.0951
accuracy@5            0.3261       0.3288       0.2962       0.2745       0.2011
accuracy@10           0.4701       0.4755       0.4375       0.3940       0.2962
precision@1           0.0027       0.0082       0.0054       0.0000       0.0054
precision@3           0.0399       0.0435       0.0389       0.0399       0.0317
precision@5           0.0652       0.0658       0.0592       0.0549       0.0402
precision@10          0.0470       0.0476       0.0438       0.0394       0.0296
recall@1              0.0027       0.0082       0.0054       0.0000       0.0054
recall@3              0.1196       0.1304       0.1168       0.1196       0.0951
recall@5              0.3261       0.3288       0.2962       0.2745       0.2011
recall@10             0.4701       0.4755       0.4375       0.3940       0.2962
-------------------------------------------------------------------------------------
seq_score: 0.128045

Fine-Tuned Model (MRL on AWS Docs)

Fine-Tuned Model Evaluation Results
-------------------------------------------------------------------------------------
Metric                   768d          512d          256d          128d           64d
-------------------------------------------------------------------------------------
ndcg@10               0.3032       0.2884       0.2899       0.2544       0.2095
mrr@10                0.1802       0.1680       0.1731       0.1516       0.1263
map@100               0.1932       0.1823       0.1877       0.1693       0.1443
accuracy@1            0.0027       0.0000       0.0082       0.0027       0.0082
accuracy@3            0.2255       0.1712       0.1875       0.1875       0.1576
accuracy@5            0.5082       0.4973       0.4946       0.4402       0.3370
accuracy@10           0.6984       0.6766       0.6658       0.5842       0.4783
precision@1           0.0027       0.0000       0.0082       0.0027       0.0082
precision@3           0.0752       0.0571       0.0625       0.0625       0.0525
precision@5           0.1016       0.0995       0.0989       0.0880       0.0674
precision@10          0.0698       0.0677       0.0666       0.0584       0.0478
recall@1              0.0027       0.0000       0.0082       0.0027       0.0082
recall@3              0.2255       0.1712       0.1875       0.1875       0.1576
recall@5              0.5082       0.4973       0.4946       0.4402       0.3370
recall@10             0.6984       0.6766       0.6658       0.5842       0.4783
-------------------------------------------------------------------------------------
seq_score: 0.209524

Base vs Fine-Tuned — Same-Dimension Improvements

nDCG@10

Dimension        Base   Fine-tuned      Abs Δ       % Δ
-------------------------------------------------------
768d           0.1989       0.3032    +0.1043    +52.4%
512d           0.2049       0.2884    +0.0835    +40.8%
256d           0.1870       0.2899    +0.1029    +55.0%
128d           0.1683       0.2544    +0.0861    +51.2%
64d            0.1280       0.2095    +0.0815    +63.7%

MRR@10

Dimension        Base   Fine-tuned      Abs Δ       % Δ
-------------------------------------------------------
768d           0.1152       0.1802    +0.0650    +56.4%
512d           0.1215       0.1680    +0.0465    +38.3%
256d           0.1100       0.1731    +0.0631    +57.4%
128d           0.0985       0.1516    +0.0531    +53.9%
64d            0.0763       0.1263    +0.0500    +65.5%

MAP@100

Dimension        Base   Fine-tuned      Abs Δ       % Δ
-------------------------------------------------------
768d           0.1337       0.1932    +0.0595    +44.5%
512d           0.1396       0.1823    +0.0427    +30.6%
256d           0.1291       0.1877    +0.0586    +45.4%
128d           0.1173       0.1693    +0.0520    +44.3%
64d            0.0940       0.1443    +0.0503    +53.5%

Accuracy@10

Dimension        Base   Fine-tuned      Abs Δ       % Δ
-------------------------------------------------------
768d           0.4701       0.6984    +0.2283    +48.6%
512d           0.4755       0.6766    +0.2011    +42.3%
256d           0.4375       0.6658    +0.2283    +52.2%
128d           0.3940       0.5842    +0.1902    +48.3%
64d            0.2962       0.4783    +0.1821    +61.5%

Why This Matters

  • Domain-fit retrieval: AWS documentation uses precise terminology and structure—fine-tuning dramatically improves the model’s ability to surface the right chunk for real-world queries.
  • Matryoshka flexibility: One model, many sizes. Use 64d for fast initial recall (still +63.7% nDCG@10) and re-rank with 768d for maximum quality, with no separate models required (see the sketch after this list).
  • Production efficiency: Lower-dim embeddings mean less memory and faster ANN search; after fine-tuning, even the 64d embeddings (0.2095 nDCG@10) beat the base model at full 768d (0.1989).
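
Here is a minimal NumPy sketch of that coarse-to-fine pattern, using random stand-in vectors (in practice corpus_emb and q_emb come from the fine-tuned model; note that sliced Matryoshka vectors must be re-normalized before cosine scoring):

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in data; real embeddings would come from the fine-tuned model at 768d.
rng = np.random.default_rng(0)
corpus_emb = normalize(rng.standard_normal((10_000, 768)).astype(np.float32))
q_emb = normalize(rng.standard_normal(768).astype(np.float32))

# Stage 1: coarse recall on the first 64 Matryoshka dims (re-normalized after slicing).
coarse_scores = normalize(corpus_emb[:, :64]) @ normalize(q_emb[:64])
candidates = np.argsort(-coarse_scores)[:100]

# Stage 2: exact re-ranking of the shortlist at the full 768 dims.
fine_scores = corpus_emb[candidates] @ q_emb
top10 = candidates[np.argsort(-fine_scores)[:10]]
print(top10)
```

In production the Stage-1 scan would be an ANN index built over the 64d vectors; the exhaustive dot product here is only for illustration.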

Method

Task: Information retrieval over chunked AWS documentation.
Data: Synthetic (question, positive chunk) pairs, 3,680 total.
Split: 90/10 train/test → 3,312 / 368 pairs.
Evaluator: Sentence-Transformers InformationRetrievalEvaluator with cosine similarity (a wiring sketch follows this list):

  • queries: test questions
  • corpus: all chunks (train + test)
  • relevant_docs: mapped via shared global_chunk_id (multiple questions may map to a single chunk)
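
A sketch of how that evaluator is wired up, with toy placeholder IDs and strings (the real queries are the 368 test questions, and relevant_docs is built from the shared global_chunk_id):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Toy placeholders; the real data comes from the chunked AWS docs dataset.
queries = {"q_0": "How do I enable versioning on an S3 bucket?"}            # test questions
corpus = {"chunk_42": "To enable versioning, call PutBucketVersioning ..."} # all chunks
relevant_docs = {"q_0": {"chunk_42"}}  # query id -> set of relevant chunk ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    accuracy_at_k=[1, 3, 5, 10],
    precision_recall_at_k=[1, 3, 5, 10],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    name="aws-docs-test",
)

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
print(evaluator(model))  # cosine similarity is the default score function
```

Per-dimension scores come from running the same evaluator with truncate_dim set to each of 768/512/256/128/64, as in the training sketch below.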

Training:

  • Loss: MultipleNegativesRankingLoss wrapped by MatryoshkaLoss
  • Dims: [768, 512, 256, 128, 64]
  • Epochs: 4
  • LR: 2e-5, scheduler: cosine
  • Precision: bf16, tf32 enabled, SDPA attention
  • Batch: 32 per device × grad_accum 16 = 512 global
  • Eval/save each epoch, metric_for_best_model = eval_dim_128_cosine_ndcg@10

Runtime: train_runtime = 275.7s (~4.6 min) on an NVIDIA T4 (HF Spaces Small)
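
The recipe above maps onto the Sentence-Transformers v3 trainer roughly as follows. This is a sketch under stated assumptions (toy in-memory dataset, assumed column names anchor/positive, assumed output_dir), not the repo's exact script:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer(
    "nomic-ai/modernbert-embed-base",
    model_kwargs={"attn_implementation": "sdpa"},  # SDPA attention, per the recipe
)

# In-batch negatives, applied at every Matryoshka dimension.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

# Toy one-row dataset; assumed column names "anchor" (question) and "positive" (chunk).
train_dataset = Dataset.from_dict({
    "anchor": ["How do I enable versioning on an S3 bucket?"],
    "positive": ["To enable versioning on a bucket, call PutBucketVersioning ..."],
})

# One IR evaluator per dimension; the evaluator name "dim_128" is what yields
# the metric key "eval_dim_128_cosine_ndcg@10" used for best-model selection.
queries = {"q_0": "How do I enable versioning on an S3 bucket?"}
corpus = {"chunk_42": "To enable versioning on a bucket, call PutBucketVersioning ..."}
relevant_docs = {"q_0": {"chunk_42"}}
evaluator = SequentialEvaluator([
    InformationRetrievalEvaluator(
        queries=queries, corpus=corpus, relevant_docs=relevant_docs,
        truncate_dim=d, name=f"dim_{d}",
    )
    for d in [768, 512, 256, 128, 64]
])

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-aws-mrl",  # assumed path
    num_train_epochs=4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,   # 32 x 16 = 512 effective batch
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    tf32=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
```

One detail worth noting: with MultipleNegativesRankingLoss the in-batch negatives come from the per-device batch (32 here); gradient accumulation raises the effective batch for optimization to 512 but does not add negatives per query.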
