This project implements the Word2Vec (Skip-Gram with Negative Sampling) model from scratch using PyTorch, with the aim of examining it through the modern lens of contrastive self-supervised learning.
The objective is to learn meaningful word representations from unlabeled data (WikiText-2) and to evaluate the transferability of these representations on a supervised downstream task (AG News classification).
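The SGNS objective can be read as a binary contrastive loss: observed (center, context) pairs are positives to pull together, and randomly sampled words are negatives to push apart. A minimal PyTorch sketch of this loss (class and variable names are illustrative, not taken from the notebooks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramSGNS(nn.Module):
    """Skip-Gram with Negative Sampling, written as a contrastive objective."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.center_emb = nn.Embedding(vocab_size, embed_dim)   # "input" vectors
        self.context_emb = nn.Embedding(vocab_size, embed_dim)  # "output" vectors

    def forward(self, center, pos_context, neg_context):
        # center: (B,), pos_context: (B,), neg_context: (B, K)
        c = self.center_emb(center)          # (B, D)
        pos = self.context_emb(pos_context)  # (B, D)
        neg = self.context_emb(neg_context)  # (B, K, D)

        pos_score = (c * pos).sum(dim=-1)                        # (B,)
        neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)  # (B, K)

        # Maximize sigmoid(score) for positives, sigmoid(-score) for negatives.
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()
```

The two embedding tables mirror the original word2vec formulation; after training, the center-word table is typically the one kept as the word representation.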
```
word2vec-as-contrastive-learning/
├── assets/                   # Result plots and figures
├── notebooks/
│   ├── word2vec.ipynb        # Word2Vec model training on WikiText-2
│   └── classification.ipynb  # Downstream classification experiments (AG News)
├── checkpoints/              # Saved embedding weights (.ckpt)
│   └── *.ckpt
└── requirements.txt          # Python dependencies
```
The experiments confirm the effectiveness of pre-training:
- Pre-training vs. Vanilla: Initializing the classifier with pre-trained Word2Vec embeddings outperforms a randomly initialized model, leading to lower validation loss and higher accuracy.

Validation accuracy and loss comparison between the classifier initialized with pre-trained Word2Vec embeddings and Vanilla baseline (25,000 labeled samples).
- Effect of Data Scarcity: The advantage of pre-training increases when fewer labeled samples are available (e.g., 5k vs. 25k), confirming that self-supervised pre-training is most valuable in low-resource regimes.

Validation accuracy and loss comparison between the classifier initialized with pre-trained Word2Vec embeddings and Vanilla baseline (5,000 labeled samples).
- Ablation on `R` (Context Radius): For this specific classification task, smaller context windows (e.g., `R` = 5) yielded the best performance. This suggests that local semantic information was more valuable than a wider, more general context.

Impact of context radius R on validation accuracy and loss.
- Ablation on `K` (Negative Sample Ratio): A smaller ratio of negative samples (e.g., `K` = 1) produced the best classification results. Larger values of `K` seemed to focus the optimization too much on distinguishing random pairs, resulting in less informative embeddings for this task.

Impact of negative samples K on validation accuracy and loss.
First, clone the repository and install the required dependencies with `pip install -r requirements.txt`.
Run the `word2vec.ipynb` notebook cells in order. It will:
- Train the Word2Vec model on the WikiText-2 dataset.
- Train all the model configurations required for the next step (e.g., varying `R` and `K`).
- Create a `checkpoints/` directory and populate it with the trained embedding weights (`.ckpt` files).
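The pair-construction step behind this training can be sketched as follows: every word within radius `R` of a center word becomes a positive context, and `K` negatives are drawn per pair. This is a simplified sketch with uniform negative sampling (the function name is hypothetical, and word2vec classically samples negatives from a smoothed unigram distribution instead):

```python
import random

def make_sgns_pairs(tokens, radius, num_negatives, vocab_size, seed=0):
    """Build (center, context, negatives) triples from a token-id sequence."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            # Uniform sampling here for simplicity; a unigram^0.75 table
            # is the standard word2vec choice.
            negatives = [rng.randrange(vocab_size) for _ in range(num_negatives)]
            pairs.append((center, tokens[j], negatives))
    return pairs

# Example: 5 tokens, radius 2, one negative per pair.
pairs = make_sgns_pairs([3, 1, 4, 1, 5], radius=2, num_negatives=1, vocab_size=10)
```

Increasing `radius` grows the number of positive pairs roughly linearly, which is why `R` directly trades off local vs. broad context in the ablation above.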
Run the `classification.ipynb` notebook cells in order. It will:
- Load the pre-trained embeddings from the `checkpoints/` folder.
- Train the `ClassAttentionModel` on the AG News dataset.
- Run the experiment comparing Word2Vec initialization vs. Vanilla (random) initialization.
- Run the ablation studies on the `R` and `K` hyperparameters.
- Display the final plots in the notebook itself.
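Initializing the classifier's embedding layer from a saved checkpoint might look like the sketch below. The state-dict key `center_emb.weight` and the checkpoint layout are assumptions for illustration, not the notebooks' actual format:

```python
import torch
import torch.nn as nn

def load_pretrained_embedding(ckpt_path: str, freeze: bool = False) -> nn.Embedding:
    """Build an nn.Embedding initialized from saved Word2Vec weights."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Assumed key; adapt to however the checkpoint was actually saved.
    weights = state["center_emb.weight"]  # (vocab_size, embed_dim)
    return nn.Embedding.from_pretrained(weights, freeze=freeze)
```

Passing `freeze=False` lets the downstream task fine-tune the embeddings, which is the usual choice when labeled data is not extremely scarce.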