KengoA/wordvec

wordvec

OCaml implementation of word embedding algorithms with minimal dependencies.

Supported algorithms:

Getting started

For a reasonably large example corpus, I recommend the Simple English Wikipedia dumps available at https://dumps.wikimedia.org/simplewiki/.

Training on the Simple English Wikipedia corpus (around 1.3 GB of XML) yields a Pearson correlation of 0.50 on the WordSim-353 benchmark.
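
For context, a word-similarity score of this kind is usually obtained by comparing the model's cosine similarity for each word pair with the human rating for that pair. The sketch below shows that computation in Python under stated assumptions: the function names, the in-memory data layout, and the toy vectors and ratings are all hypothetical, and it is not the code behind bin/benchmark.exe.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def benchmark(embeddings, rated_pairs):
    """Correlate model similarities with human ratings.

    Pairs containing an out-of-vocabulary word are skipped (an assumption;
    other tools may handle them differently).
    """
    model_scores, human_scores = [], []
    for word_a, word_b, rating in rated_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(rating)
    return pearson(model_scores, human_scores)


# Toy data for illustration only: neither real embeddings nor real ratings.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "bus": [0.2, 0.8, 0.3],
}
toy_pairs = [
    ("cat", "dog", 8.0),
    ("car", "bus", 7.5),
    ("cat", "car", 2.0),
    ("dog", "unicorn", 1.0),  # skipped: "unicorn" is out of vocabulary
]
print(round(benchmark(toy_embeddings, toy_pairs), 3))
```

A correlation near 1 means the model ranks pairs much as the human raters do, and a value near 0 means there is no linear relationship between the two.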

# Install opam, the OCaml package manager (macOS, via Homebrew)
brew install opam

# Download and extract the Simple Wikipedia dataset
# (mkdir -p is a no-op if data/train already exists)
mkdir -p data/train
curl -L https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2 | bunzip2 > data/train/simplewikipedia.xml

dune build
dune exec bin/train.exe -- --input data/train/simplewikipedia.xml --dim 150 --window-size 15 --epochs 2 --workers 8

# Show the top 5 most similar words for a set of common words
dune exec bin/evaluate.exe
# Evaluate against the WordSim-353 dataset
dune exec bin/benchmark.exe
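
To illustrate what a "top 5 similar words" lookup involves, here is a minimal Python sketch of nearest-neighbour search by cosine similarity. The function names and the toy two-dimensional vectors are hypothetical, and the sketch does not read the embeddings CSV or reproduce how bin/evaluate.exe works internally.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def most_similar(embeddings, query, k=5):
    """Return the k words closest to `query` by cosine similarity.

    After normalization, the cosine similarity of two vectors is simply
    their dot product.
    """
    unit = {word: normalize(vec) for word, vec in embeddings.items()}
    query_vec = unit[query]
    scored = (
        (sum(a * b for a, b in zip(query_vec, vec)), word)
        for word, vec in unit.items()
        if word != query
    )
    return [(word, score) for score, word in heapq.nlargest(k, scored)]


# Toy vectors for illustration only.
toy_embeddings = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
    "pear": [0.2, 0.8],
}
for word, score in most_similar(toy_embeddings, "king", k=2):
    print(f"{word}\t{score:.3f}")
```

For a full vocabulary, normalizing every vector once up front and reusing the result across queries avoids repeating that work on each lookup.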

Options

The training script supports the following command-line options:

  • --input: Input file path (.xml or .txt)
  • --dim: Embedding dimension - default: 150
  • --window-size: Context window size - default: 15
  • --neg-samples: Number of negative samples - default: 5
  • --epochs: Number of training epochs - default: 1
  • --neg-table-size: Negative sampling table size - default: 1000000
  • --chunk-size: Chunk size in MB for processing - default: 100
  • --workers: Number of parallel workers - default: 10
  • --min-freq: Minimum vocabulary frequency - default: 10
  • --vocab-output: Output vocabulary file - default: data/artifacts/vocab_freq.csv
  • --embed-output: Output embeddings file - default: data/artifacts/embeddings.csv
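
The sketch below illustrates, in Python, how options such as --min-freq, --window-size, --neg-samples, and --neg-table-size typically interact. It assumes a word2vec-style skip-gram setup with negative sampling, which the option names suggest but this README does not confirm. The function names are hypothetical, and the 0.75 smoothing exponent is the conventional word2vec choice, not a value taken from this project.

```python
import random
from collections import Counter


def build_vocab(tokens, min_freq):
    """Keep only words that occur at least `min_freq` times."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= min_freq}


def build_neg_table(vocab, table_size, power=0.75):
    """Fill a table in which each word appears in proportion to count**power.

    Drawing a uniform random entry then samples from the smoothed unigram
    distribution. The exponent 0.75 is an assumption (the word2vec convention).
    """
    words = sorted(vocab)
    weights = [vocab[word] ** power for word in words]
    total = sum(weights)
    table, cumulative = [], 0.0
    for word, weight in zip(words, weights):
        cumulative += weight
        target_len = round(cumulative / total * table_size)
        table.extend([word] * (target_len - len(table)))
    return table


def training_pairs(tokens, vocab, window_size):
    """Yield (center, context) pairs within `window_size` words on each side."""
    kept = [tok for tok in tokens if tok in vocab]
    for i, center in enumerate(kept):
        start = max(0, i - window_size)
        stop = min(len(kept), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                yield center, kept[j]


def sample_negatives(table, positive, k, rng):
    """Draw k table entries that differ from the positive context word.

    Assumes the table holds at least one word other than `positive`.
    """
    negatives = []
    while len(negatives) < k:
        candidate = rng.choice(table)
        if candidate != positive:
            negatives.append(candidate)
    return negatives


# Toy corpus for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(tokens, min_freq=2)
table = build_neg_table(vocab, table_size=100)
rng = random.Random(0)
for center, context in training_pairs(tokens, vocab, window_size=1):
    print(center, context, sample_negatives(table, context, 3, rng))
```

In this sketch, a larger table approximates the smoothed distribution more finely at the cost of memory, and a higher minimum frequency shrinks both the vocabulary and the table's coverage of rare words.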
