This project implements the Word2Vec (Skip-Gram with Negative Sampling) model from scratch using PyTorch, with the aim of examining it through the modern lens of contrastive self-supervised learning.
The objective is to learn meaningful word representations from unlabeled data (WikiText-2) and to evaluate the transferability of these representations on a supervised downstream task (AG News classification).
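The SGNS objective can be read as a binary contrastive loss: observed (center, context) pairs are positives to pull together, and randomly sampled words are negatives to push apart. A minimal PyTorch sketch of this loss (class and variable names are illustrative, not taken from the notebooks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramSGNS(nn.Module):
    """Skip-Gram with Negative Sampling, written as a contrastive objective."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.center_emb = nn.Embedding(vocab_size, embed_dim)   # "input" vectors
        self.context_emb = nn.Embedding(vocab_size, embed_dim)  # "output" vectors

    def forward(self, center, pos_context, neg_context):
        # center: (B,), pos_context: (B,), neg_context: (B, K)
        c = self.center_emb(center)          # (B, D)
        pos = self.context_emb(pos_context)  # (B, D)
        neg = self.context_emb(neg_context)  # (B, K, D)

        pos_score = (c * pos).sum(dim=-1)                        # (B,)
        neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)  # (B, K)

        # Maximize sigmoid(score) for positives, sigmoid(-score) for negatives.
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()
```

The two embedding tables mirror the original word2vec formulation; after training, the center-word table is typically the one kept as the word representation.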
```
word2vec-as-contrastive-learning/
├── assets/                   # Result plots and figures
├── notebooks/
│   ├── word2vec.ipynb        # Word2Vec model training on WikiText-2
│   └── classification.ipynb  # Downstream classification experiments (AG News)
├── checkpoints/              # Saved embedding weights (.ckpt)
│   └── *.ckpt
└── requirements.txt          # Python dependencies
```
The experiments confirm the effectiveness of pre-training:
- Pre-training vs. Vanilla: Initializing the classifier with pre-trained Word2Vec embeddings outperforms a randomly initialized model, leading to lower validation loss and higher accuracy.

Validation accuracy and loss comparison between the classifier initialized with pre-trained Word2Vec embeddings and Vanilla baseline (25,000 labeled samples).
- Effect of Data Scarcity: The advantage of pre-training increases when fewer labeled samples are available (e.g., 5k vs. 25k), confirming that self-supervised pre-training is most valuable in low-resource regimes.

Validation accuracy and loss comparison between the classifier initialized with pre-trained Word2Vec embeddings and Vanilla baseline (5,000 labeled samples).
- Ablation on `R` (Context Radius): For this specific classification task, smaller context windows (e.g., `R` = 5) yielded the best performance. This suggests that local semantic information was more valuable than a wider, more general context.

Impact of context radius R on validation accuracy and loss.
- Ablation on `K` (Negative Sample Ratio): A smaller ratio of negative samples (e.g., `K` = 1) produced the best classification results. Larger values of `K` seemed to focus the optimization too much on distinguishing random pairs, resulting in less informative embeddings for this task.

Impact of negative samples K on validation accuracy and loss.
First, clone the repository and install the required dependencies with `pip install -r requirements.txt`.
Run the `word2vec.ipynb` notebook cells in order. It will:
- Train the Word2Vec model on the WikiText-2 dataset.
- Train all the model configurations required for the next step (e.g., varying `R` and `K`).
- Create a `checkpoints/` directory and populate it with the trained embedding weights (`.ckpt` files).
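The pair-construction step behind this training can be sketched as follows: every word within radius `R` of a center word becomes a positive context, and `K` negatives are drawn per pair. This is a simplified sketch with uniform negative sampling (the function name is hypothetical, and word2vec classically samples negatives from a smoothed unigram distribution instead):

```python
import random

def make_sgns_pairs(tokens, radius, num_negatives, vocab_size, seed=0):
    """Build (center, context, negatives) triples from a token-id sequence."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            # Uniform sampling here for simplicity; a unigram^0.75 table
            # is the standard word2vec choice.
            negatives = [rng.randrange(vocab_size) for _ in range(num_negatives)]
            pairs.append((center, tokens[j], negatives))
    return pairs

# Example: 5 tokens, radius 2, one negative per pair.
pairs = make_sgns_pairs([3, 1, 4, 1, 5], radius=2, num_negatives=1, vocab_size=10)
```

Increasing `radius` grows the number of positive pairs roughly linearly, which is why `R` directly trades off local vs. broad context in the ablation above.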
Run the `classification.ipynb` notebook cells in order. It will:
- Load the pre-trained embeddings from the `checkpoints/` folder.
- Train the `ClassAttentionModel` on the AG News dataset.
- Run the experiment comparing Word2Vec initialization vs. Vanilla (random) initialization.
- Run the ablation studies on the `R` and `K` hyperparameters.
- Display the final plots in the notebook itself.
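Initializing the classifier's embedding layer from a saved checkpoint might look like the sketch below. The state-dict key `center_emb.weight` and the checkpoint layout are assumptions for illustration, not the notebooks' actual format:

```python
import torch
import torch.nn as nn

def load_pretrained_embedding(ckpt_path: str, freeze: bool = False) -> nn.Embedding:
    """Build an nn.Embedding initialized from saved Word2Vec weights."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Assumed key; adapt to however the checkpoint was actually saved.
    weights = state["center_emb.weight"]  # (vocab_size, embed_dim)
    return nn.Embedding.from_pretrained(weights, freeze=freeze)
```

Passing `freeze=False` lets the downstream task fine-tune the embeddings, which is the usual choice when labeled data is not extremely scarce.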