
Commit 69877c5

Merge branch 'release-3.7.3'
2 parents 7631b3e + d2634a5 commit 69877c5

33 files changed: +4187, -3003 lines

CHANGELOG.md

Lines changed: 48 additions & 2 deletions

@@ -1,5 +1,51 @@
 Changes
-===========
+=======
+
+## 3.7.3, 2019-05-06
+
+### :red_circle: Bug fixes
+
+* Fix fasttext model loading from gzip files ([mpenkov](https://api.github.com/users/mpenkov), [#2476](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2476))
+* Fix misleading Doc2Vec.docvecs comment ([gojomo](https://api.github.com/users/gojomo), [#2472](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2472))
+* Nmf bugfix ([mpenkov](https://api.github.com/users/mpenkov), [#2466](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2466))
+* Fix WordEmbeddingsKeyedVectors.most_similar ([Witiko](https://api.github.com/users/Witiko), [#2461](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2461))
+* fix backwards compatibility ([mpenkov](https://api.github.com/users/mpenkov), [#2457](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2457))
+* Fix Lda Sequence model by updating to num_documents ([Bharat123rox](https://api.github.com/users/Bharat123rox), [#2410](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2410))
+* Make termsim matrix positive definite even with negative similarities ([Witiko](https://api.github.com/users/Witiko), [#2397](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2397))
+* Fix the off-by-one bug in the TFIDF model. ([AMR-KELEG](https://api.github.com/users/AMR-KELEG), [#2392](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2392))
+* update legacy model loading, fix #2453 ([mpenkov](https://api.github.com/users/mpenkov), [#2454](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2454))
+* Make matutils.unitvec always return float norm when requested ([Witiko](https://api.github.com/users/Witiko), [#2419](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2419))
+
+### :books: Tutorial and doc improvements
+
+* Update word2vec.ipynb ([asyabo](https://api.github.com/users/asyabo), [#2423](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2423))
+
+### :+1: Improvements
+
+* Adding type check for corpus_file argument ([saraswatmks](https://api.github.com/users/saraswatmks), [#2469](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2469))
+* Clean up FastText Cython code, fix division by zero ([mpenkov](https://api.github.com/users/mpenkov), [#2382](https://api.github.com/repos/RaRe-Technologies/gensim/pulls/2382))
+
+### :warning: Deprecations (will be removed in the next major release)
+
+* Remove
+    - `gensim.models.FastText.load_fasttext_format`: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load full model (slower, more CPU/memory intensive, supports training continuation)
+    - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
+    - `gensim.examples`
+    - `gensim.nosy`
+    - `gensim.scripts.word2vec_standalone`
+    - `gensim.scripts.make_wiki_lemma`
+    - `gensim.scripts.make_wiki_online`
+    - `gensim.scripts.make_wiki_online_lemma`
+    - `gensim.scripts.make_wiki_online_nodebug`
+    - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
+    - "deprecated" functions and attributes
+
+* Move
+    - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
+    - `gensim.summarization` ➡ `gensim.models.summarization`
+    - `gensim.topic_coherence` ➡ `gensim.models._coherence`
+    - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
+    - `gensim.parsing.*` ➡ `gensim.utils.text_utils`
 
 ## 3.7.2, 2019-04-06
 
@@ -22,7 +68,7 @@ Changes
 
 ### :+1: Improvements
 
-* Undo the hash2index optimization (__[mpenkov](https://github.com/mpenkov)__, [#2370](https://github.com/RaRe-Technologies/gensim/pull/2387))
+* Undo the hash2index optimization (__[mpenkov](https://github.com/mpenkov)__, [#2370](https://github.com/RaRe-Technologies/gensim/pull/2370))
 
 ### :warning: Changes in FastText behavior
 
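
For context on the `load_fasttext_format` deprecation noted above: the suggested replacements already live in `gensim.models.fasttext`. A hedged usage sketch (the model path is hypothetical; after the gzip fix in #2476 it may also point at a `.gz`-compressed file):

```python
from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# Embeddings only: faster and lighter, but no training continuation.
wv = load_facebook_vectors('cc.en.300.bin')        # hypothetical path
print(wv.most_similar('vacation', topn=3))

# Full model: slower and more memory-hungry, supports training continuation.
model = load_facebook_model('cc.en.300.bin')
print(model.wv['breakfast'][:5])
```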

docs/notebooks/word2vec.ipynb

Lines changed: 1 addition & 1 deletion

@@ -116,7 +116,7 @@
 " \n",
 " def __iter__(self):\n",
 " for fname in os.listdir(self.dirname):\n",
-" for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n",
+" for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n",
 " yield line.split()"
 ]
 },
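
For readers skimming the diff: the patched notebook cell defines a streaming corpus iterator, and the `'rb'` → `'r'` change makes it yield `str` tokens rather than `bytes`. A rough, self-contained sketch of that cell (class name and directory layout are assumed here, not copied verbatim from the notebook):

```python
import os

from smart_open import smart_open


class MySentences(object):
    """Stream one tokenized line at a time from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # Text mode ('r'), so line.split() yields str tokens, not bytes.
            for line in smart_open(os.path.join(self.dirname, fname), 'r'):
                yield line.split()
```

An instance can then be passed straight to `Word2Vec(MySentences('/path/to/corpus'))` (path illustrative).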

docs/src/conf.py

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@
 # The short X.Y version.
 version = '3.7'
 # The full version, including alpha/beta/rc tags.
-release = '3.7.2'
+release = '3.7.3'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

gensim/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@
 from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
 import logging
 
-__version__ = '3.7.2'
+__version__ = '3.7.3'
 
 
 logger = logging.getLogger('gensim')

gensim/matutils.py

Lines changed: 10 additions & 4 deletions

@@ -734,15 +734,18 @@ def unitvec(vec, norm='l2', return_norm=False):
             return vec
         else:
             if return_norm:
-                return vec, 1.
+                return vec, 1.0
             else:
                 return vec
 
     if isinstance(vec, np.ndarray):
         if norm == 'l1':
             veclen = np.sum(np.abs(vec))
         if norm == 'l2':
-            veclen = blas_nrm2(vec)
+            if vec.size == 0:
+                veclen = 0.0
+            else:
+                veclen = blas_nrm2(vec)
         if veclen > 0.0:
             if np.issubdtype(vec.dtype, np.integer):
                 vec = vec.astype(np.float)
@@ -752,14 +755,17 @@ def unitvec(vec, norm='l2', return_norm=False):
             return blas_scal(1.0 / veclen, vec).astype(vec.dtype)
         else:
             if return_norm:
-                return vec, 1
+                return vec, 1.0
             else:
                 return vec
 
     try:
         first = next(iter(vec))  # is there at least one element?
     except StopIteration:
-        return vec
+        if return_norm:
+            return vec, 1.0
+        else:
+            return vec
 
     if isinstance(first, (tuple, list)) and len(first) == 2:  # gensim sparse format
         if norm == 'l1':
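
A minimal sketch (not part of the diff) of the behaviour this patch guarantees: with `return_norm=True`, `unitvec` now always returns the norm as a float, including for all-zero and empty inputs that previously returned the integer `1` or a bare vector:

```python
import numpy as np

from gensim import matutils

vec, norm = matutils.unitvec(np.array([3.0, 4.0]), return_norm=True)
print(vec, norm)            # roughly [0.6 0.8] and 5.0

_, norm = matutils.unitvec(np.zeros(3), return_norm=True)
print(norm, type(norm))     # 1.0 <class 'float'> (this branch used to return an int)
```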

gensim/models/_fasttext_bin.py

Lines changed: 57 additions & 7 deletions

@@ -31,6 +31,7 @@
 
 import codecs
 import collections
+import gzip
 import io
 import logging
 import struct
@@ -74,6 +75,14 @@
     ('t', 'd'),
 ]
 
+_FLOAT_SIZE = struct.calcsize('@f')
+if _FLOAT_SIZE == 4:
+    _FLOAT_DTYPE = np.dtype(np.float32)
+elif _FLOAT_SIZE == 8:
+    _FLOAT_DTYPE = np.dtype(np.float64)
+else:
+    _FLOAT_DTYPE = None
+
 
 def _yield_field_names():
     for name, _ in _OLD_HEADER_FORMAT + _NEW_HEADER_FORMAT:
@@ -220,24 +229,65 @@ def _load_matrix(fin, new_format=True):
     The number of columns of the array will correspond to the vector size.
 
     """
+    if _FLOAT_DTYPE is None:
+        raise ValueError('bad _FLOAT_SIZE: %r' % _FLOAT_SIZE)
+
     if new_format:
         _struct_unpack(fin, '@?')  # bool quant_input in fasttext.cc
 
     num_vectors, dim = _struct_unpack(fin, '@2q')
+    count = num_vectors * dim
 
-    float_size = struct.calcsize('@f')
-    if float_size == 4:
-        dtype = np.dtype(np.float32)
-    elif float_size == 8:
-        dtype = np.dtype(np.float64)
+    #
+    # numpy.fromfile doesn't play well with gzip.GzipFile as input:
+    #
+    # - https://github.com/RaRe-Technologies/gensim/pull/2476
+    # - https://github.com/numpy/numpy/issues/13470
+    #
+    # Until they fix it, we have to apply a workaround. We only apply the
+    # workaround when it's necessary, because np.fromfile is heavily optimized
+    # and very efficient (when it works).
+    #
+    if isinstance(fin, gzip.GzipFile):
+        logger.warning(
+            'Loading model from a compressed .gz file. This can be slow. '
+            'This is a work-around for a bug in NumPy: https://github.com/numpy/numpy/issues/13470. '
+            'Consider decompressing your model file for a faster load. '
+        )
+        matrix = _fromfile(fin, _FLOAT_DTYPE, count)
     else:
-        raise ValueError("Incompatible float size: %r" % float_size)
+        matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
 
-    matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
+    assert matrix.shape == (count,), 'expected (%r,), got %r' % (count, matrix.shape)
     matrix = matrix.reshape((num_vectors, dim))
     return matrix
 
 
+def _batched_generator(fin, count, batch_size=1e6):
+    """Read `count` floats from `fin`.
+
+    Batches up read calls to avoid I/O overhead. Keeps no more than batch_size
+    floats in memory at once.
+
+    Yields floats.
+
+    """
+    while count > batch_size:
+        batch = _struct_unpack(fin, '@%df' % batch_size)
+        for f in batch:
+            yield f
+        count -= batch_size
+
+    batch = _struct_unpack(fin, '@%df' % count)
+    for f in batch:
+        yield f
+
+
+def _fromfile(fin, dtype, count):
+    """Reimplementation of numpy.fromfile."""
+    return np.fromiter(_batched_generator(fin, count), dtype=dtype)
+
+
 def load(fin, encoding='utf-8', full_model=True):
     """Load a model from a binary stream.
 
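
The workaround above amounts to: when the input stream is a `gzip.GzipFile`, read the raw floats with `struct` and feed them to `np.fromiter` instead of calling `np.fromfile` directly. A self-contained sketch of that idea, independent of gensim's private helpers (the file name and sizes are made up for illustration, and native floats are assumed to be 4 bytes):

```python
import gzip
import struct

import numpy as np


def read_floats(fin, count, batch_size=1000000):
    """Read `count` native C floats from a file-like object, in batches."""
    def gen():
        remaining = count
        while remaining > 0:
            n = min(batch_size, remaining)
            buf = fin.read(n * struct.calcsize('@f'))
            for value in struct.unpack('@%df' % n, buf):
                yield value
            remaining -= n
    return np.fromiter(gen(), dtype=np.float32, count=count)


# Round-trip a small matrix through gzip, the case where np.fromfile misbehaves
# (https://github.com/numpy/numpy/issues/13470).
data = np.random.rand(10, 300).astype(np.float32)
with gzip.open('matrix.f32.gz', 'wb') as fout:      # hypothetical file name
    fout.write(data.tobytes())

with gzip.open('matrix.f32.gz', 'rb') as fin:
    matrix = read_floats(fin, 10 * 300).reshape(10, 300)

assert np.allclose(matrix, data)
```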

gensim/models/doc2vec.py

Lines changed: 19 additions & 11 deletions

@@ -70,7 +70,7 @@
 except ImportError:
     from Queue import Queue  # noqa:F401
 
-from collections import namedtuple, defaultdict
+from collections import namedtuple, defaultdict, Iterable
 from timeit import default_timer
 
 from numpy import zeros, float32 as REAL, empty, ones, \
@@ -447,18 +447,13 @@ class Doc2Vec(BaseWordEmbeddingsModel):
         directly to query those embeddings in various ways. See the module level docstring for examples.
 
     docvecs : :class:`~gensim.models.keyedvectors.Doc2VecKeyedVectors`
-        This object contains the paragraph vectors. Remember that the only difference between this model and
-        :class:`~gensim.models.word2vec.Word2Vec` is that besides the word vectors we also include paragraph embeddings
-        to capture the paragraph.
+        This object contains the paragraph vectors learned from the training data. There will be one such vector
+        for each unique document tag supplied during training. They may be individually accessed using the tag
+        as an indexed-access key. For example, if one of the training documents used a tag of 'doc003':
 
-        In this way we can capture the difference between the same word used in a different context.
-        For example we now have a different representation of the word "leaves" in the following two sentences ::
-
-            1. Manos leaves the office every day at 18:00 to catch his train
-            2. This season is called Fall, because leaves fall from the trees.
+        .. sourcecode:: pycon
 
-        In a plain :class:`~gensim.models.word2vec.Word2Vec` model the word would have exactly the same representation
-        in both sentences, in :class:`~gensim.models.doc2vec.Doc2Vec` it will not.
+            >>> model.docvecs['doc003']
 
     vocabulary : :class:`~gensim.models.doc2vec.Doc2VecVocab`
         This object represents the vocabulary (sometimes called Dictionary in gensim) of the model.
@@ -794,6 +789,19 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor
 
         """
         kwargs = {}
+
+        if corpus_file is None and documents is None:
+            raise TypeError("Either one of corpus_file or documents value must be provided")
+
+        if corpus_file is not None and documents is not None:
+            raise TypeError("Both corpus_file and documents must not be provided at the same time")
+
+        if documents is None and not os.path.isfile(corpus_file):
+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)
+
+        if documents is not None and not isinstance(documents, Iterable):
+            raise TypeError("documents must be an iterable of list, got %r instead" % documents)
+
         if corpus_file is not None:
             # Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file)
             offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers)
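
A small illustration (not from the diff) of the two user-visible effects: paragraph vectors are keyed by the tags supplied at training time, and `train()` now fails fast with a `TypeError` when neither `documents` nor `corpus_file` is given. Tag names and hyperparameters below are made up:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=['machine', 'learning', 'with', 'gensim'], tags=['doc003']),
    TaggedDocument(words=['topic', 'modelling', 'and', 'word', 'vectors'], tags=['doc004']),
]
model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=5)

print(model.docvecs['doc003'])   # the paragraph vector for tag 'doc003'

try:
    model.train(total_examples=model.corpus_count, epochs=model.epochs)  # no documents, no corpus_file
except TypeError as err:
    print(err)                   # "Either one of corpus_file or documents value must be provided"
```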

gensim/models/fasttext.py

Lines changed: 20 additions & 13 deletions

@@ -280,10 +280,12 @@
 """
 
 import logging
+import os
 
 import numpy as np
 from numpy import ones, vstack, float32 as REAL, sum as np_sum
 import six
+from collections import Iterable
 
 import gensim.models._fasttext_bin
 
@@ -901,6 +903,19 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor
             >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
 
         """
+
+        if corpus_file is None and sentences is None:
+            raise TypeError("Either one of corpus_file or sentences value must be provided")
+
+        if corpus_file is not None and sentences is not None:
+            raise TypeError("Both corpus_file and sentences must not be provided at the same time")
+
+        if sentences is None and not os.path.isfile(corpus_file):
+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)
+
+        if sentences is not None and not isinstance(sentences, Iterable):
+            raise TypeError("sentences must be an iterable of list, got %r instead" % sentences)
+
         super(FastText, self).train(
             sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
             epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
@@ -1023,30 +1038,22 @@ def load(cls, *args, **kwargs):
         """
         try:
             model = super(FastText, cls).load(*args, **kwargs)
-            if hasattr(model.wv, 'hash2index'):
-                gensim.models.keyedvectors._rollback_optimization(model.wv)
 
             if not hasattr(model.trainables, 'vectors_vocab_lockf') and hasattr(model.wv, 'vectors_vocab'):
                 model.trainables.vectors_vocab_lockf = ones(model.wv.vectors_vocab.shape, dtype=REAL)
             if not hasattr(model.trainables, 'vectors_ngrams_lockf') and hasattr(model.wv, 'vectors_ngrams'):
                 model.trainables.vectors_ngrams_lockf = ones(model.wv.vectors_ngrams.shape, dtype=REAL)
 
-            if not hasattr(model.wv, 'compatible_hash'):
-                logger.warning(
-                    "This older model was trained with a buggy hash function. "
-                    "The model will continue to work, but consider training it "
-                    "from scratch."
-                )
-                model.wv.compatible_hash = False
-
             if not hasattr(model.wv, 'bucket'):
                 model.wv.bucket = model.trainables.bucket
-
-            return model
         except AttributeError:
             logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
             from gensim.models.deprecated.fasttext import load_old_fasttext
-            return load_old_fasttext(*args, **kwargs)
+            model = load_old_fasttext(*args, **kwargs)
+
+        gensim.models.keyedvectors._try_upgrade(model.wv)
+
+        return model
 
     @deprecated("Method will be removed in 4.0.0, use self.wv.accuracy() instead")
     def accuracy(self, questions, restrict_vocab=30000, most_similar=None, case_insensitive=True):
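
Finally, a hedged note on the `load()` change: models saved by earlier gensim versions should remain loadable through the same call, with missing legacy attributes (lockf arrays, `bucket`, hash-compatibility flags) reconstructed during loading. The file name below is hypothetical:

```python
from gensim.models import FastText

# Works for models pickled by older gensim releases as well as current ones;
# legacy attributes are filled in by FastText.load() during the upgrade step.
model = FastText.load('fasttext_model_saved_with_gensim_3.4')   # hypothetical path
print(model.wv.most_similar('computer', topn=3))
```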
