1 | 1 | Changes |
2 | 2 | =========== |
| 3 | + |
| 4 | +## Unreleased |
| 5 | + |
| 6 | +### :star2: New Features |
| 7 | + |
| 8 | +- `gensim.models.fasttext.load_facebook_model` function: load full model (slower, more CPU/memory intensive, supports training continuation) |
| 9 | +- `gensim.models.fasttext.load_facebook_vectors` function: load embeddings only (faster, less CPU/memory usage, does not support training continuation) |
| 10 | + |
| 11 | +### :red_circle: Bug fixes |
| 12 | + |
| 13 | +* Fix unicode error when loading FastText vocabulary (__[@mpenkov](https://github.com/mpenkov)__, [#2390](https://github.com/RaRe-Technologies/gensim/pull/2390)) |
| 14 | +* Avoid division by zero in fasttext_inner.pyx (__[@mpenkov](https://github.com/mpenkov)__, [#2404](https://github.com/RaRe-Technologies/gensim/pull/2404)) |
| 15 | +* Avoid incorrect filename inference when loading model (__[@mpenkov](https://github.com/mpenkov)__, [#2408](https://github.com/RaRe-Technologies/gensim/pull/2408)) |
| 16 | +* Handle invalid unicode when loading native FastText models (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411)) |
| 17 | +* Avoid divide by zero when calculating vectors for terms with no ngrams (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411)) |
| 18 | + |
| 19 | +### :books: Tutorial and doc improvements |
| 20 | + |
| 21 | +* Add link to bindr (__[rogueleaderr](https://github.com/rogueleaderr)__, [#2387](https://github.com/RaRe-Technologies/gensim/pull/2387)) |
| 22 | + |
| 23 | +### :+1: Improvements |
| 24 | + |
| 25 | +* Undo the hash2index optimization (__[mpenkov](https://github.com/mpenkov)__, [#2370](https://github.com/RaRe-Technologies/gensim/pull/2370))
| 26 | + |
| 27 | +### :warning: Changes in FastText behavior |
| 28 | + |
| 29 | +#### Out-of-vocab word handling |
| 30 | + |
| 31 | +To achieve consistency with the reference implementation from Facebook, |
| 32 | +a `FastText` model will now always report any word, out-of-vocabulary or |
| 33 | +not, as being in the model, and always return some vector for any word |
| 34 | +looked up. Specifically:
| 35 | + |
| 36 | +1. `'any_word' in ft_model` will always return `True`. Previously, it |
| 37 | +returned `True` only if the full word was in the vocabulary. (To test if a |
| 38 | +full word is in the known vocabulary, you can consult the `wv.vocab` |
| 39 | +property: `'any_word' in ft_model.wv.vocab` will return `False` if the full |
| 40 | +word wasn't learned during model training.) |
| 41 | +2. `ft_model['any_word']` will always return a vector. Previously, it |
| 42 | +raised `KeyError` for OOV words when the model had no vectors |
| 43 | +for **any** ngrams of the word. |
| 44 | +3. If no ngrams from the term are present in the model,
| 45 | +or when no ngrams could be extracted from the term, a zero vector (all
| 46 | +components equal to 0) will be returned. Previously, a vector of NaN (not a number)
| 47 | +was returned as a consequence of a divide-by-zero problem.
| 48 | +4. Models may use more memory, or take longer for word-vector
| 49 | +lookup, especially after training on smaller corpora where the previous
| 50 | +non-compliant behavior discarded some ngrams from consideration.
| 51 | + |
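The lookup rules above can be sketched with a small, self-contained toy class. This is an illustrative assumption, not gensim's implementation: `ToyFastText`, `char_ngrams`, and the plain-dict storage are hypothetical names, and the real model hashes ngrams into buckets rather than keeping them in a dictionary.

```python
# Toy illustration only; none of these names exist in gensim.
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams of '<word>' with lengths min_n..max_n."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


class ToyFastText:
    def __init__(self, vocab_vectors, ngram_vectors, dim):
        self.vocab = vocab_vectors    # full word -> vector (list of floats)
        self.ngrams = ngram_vectors   # ngram -> vector (list of floats)
        self.dim = dim

    def __contains__(self, word):
        # Rule 1: every word is reported as being in the model.
        return True

    def __getitem__(self, word):
        # In-vocabulary words return their stored vector.
        if word in self.vocab:
            return list(self.vocab[word])
        found = [self.ngrams[g] for g in char_ngrams(word) if g in self.ngrams]
        if not found:
            # Rule 3: no usable ngrams, so return a zero vector
            # instead of dividing by zero and producing NaN.
            return [0.0] * self.dim
        # Rule 2: otherwise average the vectors of the known ngrams.
        return [sum(column) / len(found) for column in zip(*found)]
```

In this sketch, membership tests succeed for any string, and a word with no known ngrams yields `[0.0] * dim` rather than NaN values.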
| 52 | +#### Loading models in Facebook .bin format |
| 53 | + |
| 54 | +The `gensim.models.FastText.load_fasttext_format` function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation. |
| 55 | +Loading this neural network requires more CPU and RAM than before.
| 56 | + |
| 57 | +Since this function is deprecated, consider using one of its alternatives (see below). |
| 58 | + |
| 59 | +Furthermore, you must now pass the full path to the file to load, **including the file extension.** |
| 60 | +Previously, if you specified a model path that ended with anything other than .bin, the code automatically appended .bin to the path before loading the model.
| 61 | +This behavior was [confusing](https://github.com/RaRe-Technologies/gensim/issues/2407), so we removed it. |
| 62 | + |
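The path change described above can be illustrated with two tiny helper functions. Both `legacy_model_path` and `current_model_path` are hypothetical names used only for this sketch; they mirror the described behavior and are not taken from the gensim source.

```python
def legacy_model_path(path):
    """Old behavior as described above: append '.bin' unless already present."""
    return path if path.endswith(".bin") else path + ".bin"


def current_model_path(path):
    """New behavior: the path is used exactly as given by the caller."""
    return path
```

Under this sketch, a caller who previously passed a path without the extension now needs to add `.bin` explicitly.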
| 63 | +### :warning: Deprecations (will be removed in the next major release) |
| 64 | + |
| 65 | +Remove: |
| 66 | + |
| 67 | +- `gensim.models.FastText.load_fasttext_format`: use `load_facebook_vectors` to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and `load_facebook_model` to load the full model (slower, more CPU/memory intensive, supports training continuation)
| 68 | + |
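A minimal migration sketch follows, assuming the two replacement functions are importable from `gensim.models.fasttext` as listed under New Features. The wrapper name `load_facebook_fasttext` and its `full_model` flag are hypothetical and exist only for illustration.

```python
def load_facebook_fasttext(path, full_model=False):
    """Load a Facebook .bin file with one of the two replacement functions.

    `path` should be the full path to the file, including the extension.
    """
    # Imported lazily so that gensim is only required when this is called.
    from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

    if full_model:
        # Slower and more CPU/memory intensive; supports training continuation.
        return load_facebook_model(path)
    # Faster and lighter; embeddings only, no training continuation.
    return load_facebook_vectors(path)
```

Code that only looks up word vectors can probably use the default, while code that continues training would pass `full_model=True`.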
3 | 69 | ## 3.7.1, 2019-01-31 |
4 | 70 |
5 | 71 | ### :+1: Improvements |