
Commit e4537c3

Merge branch 'release-3.7.2'
2 parents 4f8c03c + a9b6d33 commit e4537c3

38 files changed, with 5028 additions and 26851 deletions

CHANGELOG.md

Lines changed: 66 additions & 0 deletions
@@ -1,5 +1,71 @@
 Changes
 ===========
+
+## Unreleased
+
+### :star2: New Features
+
+- `gensim.models.fasttext.load_facebook_model` function: load full model (slower, more CPU/memory intensive, supports training continuation)
+- `gensim.models.fasttext.load_facebook_vectors` function: load embeddings only (faster, less CPU/memory usage, does not support training continuation)
+
+### :red_circle: Bug fixes
+
+* Fix unicode error when loading FastText vocabulary (__[@mpenkov](https://github.com/mpenkov)__, [#2390](https://github.com/RaRe-Technologies/gensim/pull/2390))
+* Avoid division by zero in fasttext_inner.pyx (__[@mpenkov](https://github.com/mpenkov)__, [#2404](https://github.com/RaRe-Technologies/gensim/pull/2404))
+* Avoid incorrect filename inference when loading model (__[@mpenkov](https://github.com/mpenkov)__, [#2408](https://github.com/RaRe-Technologies/gensim/pull/2408))
+* Handle invalid unicode when loading native FastText models (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
+* Avoid divide by zero when calculating vectors for terms with no ngrams (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
+
+### :books: Tutorial and doc improvements
+
+* Add link to bindr (__[rogueleaderr](https://github.com/rogueleaderr)__, [#2387](https://github.com/RaRe-Technologies/gensim/pull/2387))
+
+### :+1: Improvements
+
+* Undo the hash2index optimization (__[mpenkov](https://github.com/mpenkov)__, [#2370](https://github.com/RaRe-Technologies/gensim/pull/2387))
+
+### :warning: Changes in FastText behavior
+
+#### Out-of-vocab word handling
+
+To achieve consistency with the reference implementation from Facebook,
+a `FastText` model will now always report any word, out-of-vocabulary or
+not, as being in the model, and always return some vector for any word
+looked-up. Specifically:
+
+1. `'any_word' in ft_model` will always return `True`. Previously, it
+returned `True` only if the full word was in the vocabulary. (To test if a
+full word is in the known vocabulary, you can consult the `wv.vocab`
+property: `'any_word' in ft_model.wv.vocab` will return `False` if the full
+word wasn't learned during model training.)
+2. `ft_model['any_word']` will always return a vector. Previously, it
+raised `KeyError` for OOV words when the model had no vectors
+for **any** ngrams of the word.
+3. If no ngrams from the term are present in the model,
+or when no ngrams could be extracted from the term, a vector pointing
+to the origin will be returned. Previously, a vector of NaN (not a number)
+was returned as a consequence of a divide-by-zero problem.
+4. Models may use more more memory, or take longer for word-vector
+lookup, especially after training on smaller corpuses where the previous
+non-compliant behavior discarded some ngrams from consideration.
+
+#### Loading models in Facebook .bin format
+
+The `gensim.models.FastText.load_fasttext_format` function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation.
+Loading this NN requires more CPU and RAM than previously required.
+
+Since this function is deprecated, consider using one of its alternatives (see below).
+
+Furthermore, you must now pass the full path to the file to load, **including the file extension.**
+Previously, if you specified a model path that ends with anything other than .bin, the code automatically appended .bin to the path before loading the model.
+This behavior was [confusing](https://github.com/RaRe-Technologies/gensim/issues/2407), so we removed it.
+
+### :warning: Deprecations (will be removed in the next major release)
+
+Remove:
+
+- `gensim.models.FastText.load_fasttext_format`: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load full model (slower, more CPU/memory intensive, supports training continuation)
+
 ## 3.7.1, 2019-01-31
 
 ### :+1: Improvements
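
Note on the out-of-vocab change described in the CHANGELOG.md diff above: the following is a minimal sketch of the "return the origin instead of NaN" behavior, assuming an out-of-vocabulary vector is computed as the plain average of the vectors of its known ngrams. The function name `oov_vector` and its parameters are hypothetical and chosen for illustration; this is not gensim's actual code.

```python
# Illustrative sketch only, not gensim's implementation.
# `oov_vector`, `ngram_vectors` and `dim` are hypothetical names.

def oov_vector(ngram_vectors, dim):
    """Average the vectors of the ngrams found for an out-of-vocab word.

    When no ngram vectors are available, return the zero vector (the
    origin) instead of dividing by zero, which is what used to yield NaN.
    """
    total = [0.0] * dim
    for vec in ngram_vectors:
        for i, value in enumerate(vec):
            total[i] += value

    count = len(ngram_vectors)
    if count == 0:
        # No ngrams present or extractable: vector pointing to the origin.
        return total
    return [value / count for value in total]


# Two known ngram vectors are averaged component-wise.
print(oov_vector([[1.0, 3.0], [3.0, 5.0]], dim=2))  # [2.0, 4.0]
# No ngram vectors at all: the origin is returned, not NaN.
print(oov_vector([], dim=2))  # [0.0, 0.0]
```

Guarding on the ngram count before dividing is what keeps the result well defined for terms that yield no ngrams.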

ISSUE_TEMPLATE.md

Lines changed: 14 additions & 33 deletions
@@ -1,48 +1,29 @@
 <!--
-If your issue is a usage or a general question, please submit it here instead:
-- Mailing List: https://groups.google.com/forum/#!forum/gensim
-For more information, see Recipes&FAQ: https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ
--->
-
-<!-- Instructions For Filing a Bug: https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md -->
+**IMPORTANT**:
 
-#### Description
-TODO: change commented example
-<!-- Example: Vocabulary size is not what I expected when training Word2Vec. -->
-
-#### Steps/Code/Corpus to Reproduce
-<!--
-Example:
-```
-from gensim.models import word2vec
+- Use the [Gensim mailing list](https://groups.google.com/forum/#!forum/gensim) to ask general or usage questions. Github issues are only for bug reports.
+- Check [Recipes&FAQ](https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ) first for common answers.
 
-sentences = ['human', 'machine']
-model = word2vec.Word2Vec(sentences)
-print(model.syn0.shape)
-```
-If the code is too long, feel free to put it in a public gist and link
-it in the issue: https://gist.github.com
+Github bug reports that do not include relevant information and context will be closed without an answer. Thanks!
 -->
 
-#### Expected Results
-<!-- Example: Expected shape of (100,2).-->
+#### Problem description
 
-#### Actual Results
-<!-- Example: Actual shape of (100,5).
+What are you trying to achieve? What is the expected result? What are you seeing instead?
 
-Please paste or specifically describe the actual output or traceback. -->
+#### Steps/code/corpus to reproduce
+
+Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").
 
 #### Versions
-<!--
-Please run the following snippet and paste the output below.
+
+Please provide the output of:
+
+```python
 import platform; print(platform.platform())
 import sys; print("Python", sys.version)
 import numpy; print("NumPy", numpy.__version__)
 import scipy; print("SciPy", scipy.__version__)
 import gensim; print("gensim", gensim.__version__)
 from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
--->
-
-
-<!-- Thanks for contributing! -->
-
+```

MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ include COPYING.LESSER
 include ez_setup.py
 
 include gensim/models/voidptr.h
+include gensim/models/stdint_wrapper.h
 include gensim/models/fast_line_sentence.h
 
 include gensim/models/word2vec_inner.c
