FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, which is a poor separation of concerns)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together
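
A minimal sketch of how these pieces fit together, assuming the gensim 3.x API that these notes describe (toy corpus, arbitrary parameters):

```python
from gensim.models import FastText

# Any iterable of tokenized sentences works here.
sentences = [["human", "interface", "computer"], ["survey", "user", "computer"]]

model = FastText(sentences, size=8, min_count=1, min_n=3, max_n=5)

# The key classes hang off the model object:
model.vocabulary  # FastTextVocab: the vocabulary
model.wv          # FastTextKeyedVectors: the vectors
model.trainables  # FastTextTrainables: the underlying neural network
```

Later sketches reuse this toy model.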

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.
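
If that's right, the subclass body is effectively empty, i.e. something like:

```python
from gensim.models.word2vec import Word2VecVocab

class FastTextVocab(Word2VecVocab):
    """Adds nothing to the parent; presumably kept so existing code and pickles keep working."""
    pass
```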

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors: implements word similarity, e.g. cosine similarity, WMD, etc. (see the example below)
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad
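
Because of this hierarchy, the similarity machinery from WordEmbeddingsKeyedVectors is available on every FastText model. Continuing with the toy model from the first sketch:

```python
# Cosine similarity between two in-vocabulary words.
print(model.wv.similarity("computer", "interface"))

# Nearest neighbours by cosine similarity.
print(model.wv.most_similar("computer", topn=2))

# Word Mover's Distance between two documents (requires the pyemd package):
# print(model.wv.wmdistance(["human", "interface"], ["survey", "user"]))
```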

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). The number of columns equals the vector dimensionality.
- vocab: a dictionary. Keys are words. Values are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the index of the term in the entire vocab; the latter is the number of times the term occurs (see the example below).
- vector_size (dimensionality)
- index2entity
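
A quick illustration of how vectors, vocab and index2entity relate, continuing with the toy model:

```python
entry = model.wv.vocab["computer"]
print(entry.index, entry.count)  # position in the vocab, corpus frequency

vec = model.wv.vectors[entry.index]  # the row holding this word's vector
print(vec.shape)                     # (8,) == (vector_size,)

print(model.wv.index2entity[entry.index])  # maps the index back to "computer"
```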

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are the vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see the _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in the init_ngrams_weights function, and in the _load_vectors method when reading from a native FB binary. Modified in the reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see the _clear_post_train method.
- buckets_word: a hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in the init_ngrams_weights function (see the bucketing sketch below).
- hash2index: a hashmap. Keys are hashes of ngrams; the values look like indices into vectors_ngrams, so that only buckets actually in use take up rows (?). Initialized in the init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function
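
A self-contained sketch of the bucketing scheme, written from how Facebook's fastText defines it: boundary markers plus an FNV-1a hash modulo the bucket count. Signed-byte details and edge cases are simplified here, and the helper names are mine, not gensim's; gensim additionally remaps the hashes actually in use through hash2index.

```python
def ft_ngrams(word, min_n=3, max_n=6):
    """All character ngrams of a word, with < and > boundary markers."""
    word = "<%s>" % word
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]

def ft_hash(ngram):
    """FNV-1a over the ngram's UTF-8 bytes, truncated to 32 bits."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

num_buckets = 2000000  # gensim's default bucket parameter
buckets = [ft_hash(ng) % num_buckets for ng in ft_ngrams("where")]
# buckets_word[word_index] holds (roughly) this list, and each bucket
# selects a row of vectors_ngrams.
```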

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText model constructor.

The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext, which is another indication of poor separation of concerns.
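
Rough initialization flow, as I read the gensim 3.x sources (simplified):

```python
# FastText.__init__()                             # models/fasttext.py
#   -> BaseWordEmbeddingsModel.__init__()         # models/base_any2vec.py
#     -> FastTextTrainables.prepare_weights(...)  # via Word2VecTrainables
#       -> FastTextTrainables.init_ngrams_weights(wv, ...)
#          # allocates wv.vectors_vocab, wv.vectors_ngrams,
#          # wv.buckets_word, wv.hash2index, ...
```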

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation? (See the lockf sketch in the FastTextTrainables section below.)
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for the entire vocabulary.
vectors_ngrams contains vectors for each _bucket_ of hashed ngrams.
They combine as sketched below.
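
My reading of how they combine: for a vocabulary word, the final vector is the average of the word's own row in vectors_vocab and the rows of vectors_ngrams for all its buckets. A sketch, not the literal gensim code:

```python
def combined_word_vector(wv, word):
    idx = wv.vocab[word].index
    bucket_rows = wv.vectors_ngrams[wv.buckets_word[idx]]
    return (wv.vectors_vocab[idx] + bucket_rows.sum(axis=0)) / (1 + len(bucket_rows))

# For out-of-vocabulary words there is no vectors_vocab row, so only
# the ngram bucket vectors get averaged.
```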

FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
It mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
It adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash().
- layer1_size: the size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: the random generator seed used in reset_weights and update_weights.
- syn1: the inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical softmax is used.
- syn1neg: similar to syn1, but only set if negative sampling is used.
- vectors_lockf: a one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: similar to vectors_lockf, initialized as ones(len(model.trainables.vectors), dtype=REAL).
- vectors_ngrams_lockf: initialized as ones((self.bucket, wv.vector_size), dtype=REAL).

The lockf stuff looks like it gets used by the fast C implementation.
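
My understanding of the lockf ("lock factor") arrays: they scale each gradient update per vector, so 1.0 means train normally and 0.0 freezes a vector. In pure Python, what the C code does is roughly this (the function name is mine):

```python
def apply_update(vectors_vocab, vectors_vocab_lockf, word_index, grad, alpha):
    # The lock factor gates the SGD step for this word's vector:
    # 1.0 applies it in full, 0.0 leaves the vector untouched.
    vectors_vocab[word_index] += alpha * grad * vectors_vocab_lockf[word_index]
```

That would also explain why only the optimized C paths touch it: the slower pure-Python fallback presumably never implemented the feature.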

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: a FastTextKeyedVectors instance. Used instead of .kv (see the example below).
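
One practical consequence, continuing with the toy model: lookups through model.wv keep working for out-of-vocabulary words, via their ngram buckets:

```python
print("computer" in model.wv.vocab)   # True: a vocabulary word
print("computery" in model.wv.vocab)  # False: not in the vocab...
print(model.wv["computery"][:4])      # ...but it still gets a vector
```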

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.