Commit 42e47a3

Merge branch 'release-3.7.0'

2 parents: 355ecc6 + 7d84b7e

File tree: 187 files changed (+57324, -16406 lines)


.circleci/config.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ jobs:
           name: Build documentation
           command: |
             source venv/bin/activate
-            tox -e docs -vv
+            tox -e compile,docs -vv

       - store_artifacts:
           path: docs/src/_build
```

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -7,6 +7,8 @@
 *.o
 *.so
 *.pyc
+*.pyo
+*.pyd

 # Packages #
 ############
```
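The two new patterns extend the existing Python-bytecode ignore (`*.pyc`) to optimized bytecode (`*.pyo`) and Windows compiled extension modules (`*.pyd`). As an illustrative aside (not part of the commit), such glob patterns can be sanity-checked from Python with the standard `fnmatch` module:

```python
from fnmatch import fnmatch

# The patterns from the updated .gitignore section shown above.
patterns = ["*.o", "*.so", "*.pyc", "*.pyo", "*.pyd"]

def is_ignored(filename):
    """Return True if the filename matches any ignore pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)

print(is_ignored("word2vec_inner.pyd"))  # True: compiled Windows extension
print(is_ignored("word2vec_inner.pyx"))  # False: Cython source stays tracked
```

Note that git's own pattern matching has extra rules (directory-anchored patterns, `!` negation), so this is only a rough approximation of the basename-glob case.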

.travis.yml

Lines changed: 12 additions & 1 deletion
```diff
@@ -13,7 +13,10 @@ language: python
 matrix:
   include:
     - python: '2.7'
-      env: TOXENV="flake8"
+      env: TOXENV="flake8,flake8-docs"
+
+    - python: '3.6'
+      env: TOXENV="flake8,flake8-docs"

     - python: '2.7'
       env: TOXENV="py27-linux"
@@ -24,5 +27,13 @@ matrix:
     - python: '3.6'
       env: TOXENV="py36-linux"

+    - python: '3.7'
+      env:
+        - TOXENV="py37-linux"
+        - BOTO_CONFIG="/dev/null"
+      dist: xenial
+      sudo: true
+
 install: pip install tox
 script: tox -vv
```

CHANGELOG.md

Lines changed: 217 additions & 5 deletions
(Diff not shown: too large to render.)

MANIFEST.in

Lines changed: 19 additions & 0 deletions
```diff
@@ -4,17 +4,36 @@ include CHANGELOG.md
 include COPYING
 include COPYING.LESSER
 include ez_setup.py
+
 include gensim/models/voidptr.h
+include gensim/models/fast_line_sentence.h
+
 include gensim/models/word2vec_inner.c
 include gensim/models/word2vec_inner.pyx
 include gensim/models/word2vec_inner.pxd
+include gensim/models/word2vec_corpusfile.cpp
+include gensim/models/word2vec_corpusfile.pyx
+include gensim/models/word2vec_corpusfile.pxd
+
 include gensim/models/doc2vec_inner.c
 include gensim/models/doc2vec_inner.pyx
+include gensim/models/doc2vec_inner.pxd
+include gensim/models/doc2vec_corpusfile.cpp
+include gensim/models/doc2vec_corpusfile.pyx
+
 include gensim/models/fasttext_inner.c
 include gensim/models/fasttext_inner.pyx
+include gensim/models/fasttext_inner.pxd
+include gensim/models/fasttext_corpusfile.cpp
+include gensim/models/fasttext_corpusfile.pyx
+
 include gensim/models/_utils_any2vec.c
 include gensim/models/_utils_any2vec.pyx
 include gensim/corpora/_mmreader.c
 include gensim/corpora/_mmreader.pyx
 include gensim/_matutils.c
 include gensim/_matutils.pyx
+
+include gensim/models/nmf_pgd.c
+include gensim/models/nmf_pgd.pyx
```

README.md

Lines changed: 17 additions & 23 deletions
```diff
@@ -119,29 +119,23 @@ Documentation
 Adopters
 --------

-| Name | Logo | URL | Description |
-|------|------|-----|-------------|
-| RaRe Technologies | ![rare](docs/src/readme_images/rare.png) | [rare-technologies.com](http://rare-technologies.com) | Machine learning & NLP consulting and training. Creators and maintainers of Gensim. |
-| Mindseye | ![mindseye](docs/src/readme_images/mindseye.png) | [mindseye.com](http://www.mindseyesolutions.com/) | Similarities in legal documents |
-| Talentpair | ![talent-pair](docs/src/readme_images/talent-pair.png) | [talentpair.com](http://talentpair.com) | Data science driving high-touch recruiting |
-| Tailwind | ![tailwind](docs/src/readme_images/tailwind.png) | [Tailwindapp.com](https://www.tailwindapp.com/) | Post interesting and relevant content to Pinterest |
-| Issuu | ![issuu](docs/src/readme_images/issuu.png) | [Issuu.com](https://issuu.com/) | Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. |
-| Sports Authority | ![sports-authority](docs/src/readme_images/sports-authority.png) | [sportsauthority.com](https://en.wikipedia.org/wiki/Sports_Authority) | Text mining of customer surveys and social media sources |
-| Search Metrics | ![search-metrics](docs/src/readme_images/search-metrics.png) | [searchmetrics.com](http://www.searchmetrics.com/) | Gensim word2vec used for entity disambiguation in Search Engine Optimisation |
-| Cisco Security | ![cisco](docs/src/readme_images/cisco.png) | [cisco.com](http://www.cisco.com/c/en/us/products/security/index.html) | Large-scale fraud detection |
-| 12K Research | ![12k](docs/src/readme_images/12k.png) | [12k.co](https://12k.co/) | Document similarity analysis on media articles |
-| National Institutes of Health | ![nih](docs/src/readme_images/nih.png) | [github/NIHOPA](https://github.com/NIHOPA/pipeline_word2vec) | Processing grants and publications with word2vec |
-| Codeq LLC | ![codeq](docs/src/readme_images/codeq.png) | [codeq.com](https://codeq.com) | Document classification with word2vec |
-| Mass Cognition | ![mass-cognition](docs/src/readme_images/mass-cognition.png) | [masscognition.com](http://www.masscognition.com/) | Topic analysis service for consumer text data and general text data |
-| Stillwater Supercomputing | ![stillwater](docs/src/readme_images/stillwater.png) | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
-| Channel 4 | ![channel4](docs/src/readme_images/channel4.png) | [channel4.com](http://www.channel4.com/) | Recommendation engine |
-| Amazon | ![amazon](docs/src/readme_images/amazon.png) | [amazon.com](http://www.amazon.com/) | Document similarity |
-| SiteGround Hosting | ![siteground](docs/src/readme_images/siteground.png) | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
-| Juju | ![juju](docs/src/readme_images/juju.png) | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
-| NLPub | ![nlpub](docs/src/readme_images/nlpub.png) | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
-| Capital One | ![capitalone](docs/src/readme_images/capitalone.png) | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |
+| Company | Logo | Industry | Use of Gensim |
+|---------|------|----------|---------------|
+| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
+| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
+| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
+| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
+| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
+| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
+| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
+| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
+| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
+| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
+| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
+| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png) | Media | Document similarity analysis on media articles. |
+| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
+| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
+| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

 -------
```

appveyor.yml

Lines changed: 5 additions & 0 deletions
```diff
@@ -28,6 +28,11 @@ environment:
     PYTHON_ARCH: "64"
     TOXENV: "py36-win"

+  - PYTHON: "C:\\Python37-x64"
+    PYTHON_VERSION: "3.7.0"
+    PYTHON_ARCH: "64"
+    TOXENV: "py37-win"
+
 init:
   - "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
   - "ECHO \"%APPVEYOR_SCHEDULED_BUILD%\""
```

docs/fasttext-notes.md

Lines changed: 152 additions & 0 deletions
(New file.)

FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, which is not ideal)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together

FastTextVocab
-------------

Seems to be an entirely redundant class: it inherits from models.word2vec.Word2VecVocab and adds no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors: implements word similarity, e.g. cosine similarity, WMD, etc.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad
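As context for the word-similarity functionality mentioned above: the core measure is cosine similarity between vectors. A plain-Python sketch of the formula (gensim's real implementation is vectorized numpy over unit-normalized matrices, not per-pair loops like this):

```python
import math

def cosine_similarity(a, b):
    # similarity = (a . b) / (|a| * |b|); in [-1, 1] for nonzero vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```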
There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default); the number of columns equals the vector dimensionality.
- vocab: a dictionary. Keys are words; values are Vocab instances, essentially namedtuples containing an index and a count. The former is the index of the term in the entire vocab; the latter is the number of times the term occurs.
- vector_size (dimensionality)
- index2entity

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors; columns correspond to vector dimensions. These are the vectors for every word in the vocabulary. Initialized in FastTextTrainables.init_ngrams_weights and reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx.
- vectors_vocab_norm: looks unused; see the _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket; columns correspond to vector dimensions. Initialized in the init_ngrams_weights function, and in the _load_vectors method when reading from a native FB binary. Modified in the reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused; see the _clear_post_train method.
- buckets_word: a hashmap keyed by the index of a term in the vocab. Each value is an array whose elements are integers corresponding to buckets. Initialized in the init_ngrams_weights function.
- hash2index: a hashmap. Keys are hashes of ngrams; values are the number of ngrams (?). Initialized in the init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function

The init_ngrams_weights method looks like an internal method of FastTextTrainables. It gets called as part of the prepare_weights method, which is effectively part of the FastText model constructor.

The above attributes are initialized to None in the FastTextKeyedVectors constructor. Unfortunately, their real initialization happens in an entirely different module, models.fasttext: another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for the entire vocabulary. vectors_ngrams contains vectors for each _bucket_.
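The ngram-to-bucket scheme can be made concrete with a small sketch. This is an illustrative reimplementation, not gensim's actual code (gensim hashes ngrams in Cython with a custom hash function, not Python's built-in `hash`); the word is wrapped in angle brackets, as in the original FastText paper, so prefixes and suffixes are distinguishable from interior ngrams:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams a FastText-style model derives from a word."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

def bucket_ngrams(ngrams, num_buckets=2_000_000):
    # Each ngram maps to a row of vectors_ngrams via hash modulo the
    # bucket count; hash collisions are tolerated by design.
    return [hash(ng) % num_buckets for ng in ngrams]

print(char_ngrams("where", min_n=3, max_n=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```

A word's vector is then the sum of its full-word vector (from vectors_vocab) and the bucket vectors of its ngrams (from vectors_ngrams), which is what lets FastText produce vectors for out-of-vocabulary words.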
FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding. It mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables), adding logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash().
- layer1_size: the size of the inner layer of the NN, equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: the random generator seed used in reset_weights and update_weights.
- syn1: the inner layer of the NN. Each row corresponds to a term in the vocabulary; columns correspond to weights of the inner layer (there are layer1_size such weights). Set in the reset_weights and update_weights methods, only if hierarchical sampling is used.
- syn1neg: similar to syn1, but only set if negative sampling is used.
- vectors_lockf: a one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: similar to vectors_lockf: ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf = ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad
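What the lockf ("lock factor") arrays appear to do can be sketched in plain Python. This is a hypothetical illustration of the idea, not gensim's actual update code (which lives in the Cython/C inner loops): each vector's gradient is scaled by its lock factor, so 0.0 freezes a vector while 1.0 leaves it fully trainable.

```python
vector_size = 4
vocab = ["cat", "dog", "fish"]

# Toy stand-ins for the weight matrix and its lock-factor array.
vectors = {w: [0.0] * vector_size for w in vocab}
vectors_lockf = {w: 1.0 for w in vocab}  # all trainable by default

vectors_lockf["dog"] = 0.0  # freeze "dog": its updates are multiplied by 0

def apply_gradient(word, gradient):
    """Apply a training update, gated by the word's lock factor."""
    lock = vectors_lockf[word]
    vectors[word] = [v + lock * g for v, g in zip(vectors[word], gradient)]

apply_gradient("cat", [0.5] * vector_size)
apply_gradient("dog", [0.5] * vector_size)
print(vectors["cat"])  # [0.5, 0.5, 0.5, 0.5]
print(vectors["dog"])  # [0.0, 0.0, 0.0, 0.0]
```

This gating is useful, for example, when pre-trained vectors should stay fixed while the rest of the model trains.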
FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv.

Logging
-------

The logging seems to be inheritance-based. It may be better to refactor this using aggregation instead of inheritance in the future: the benefits would be leaner classes with fewer responsibilities and better separation of concerns.

docs/notebooks/FastText_Tutorial.ipynb

Lines changed: 1 addition & 1 deletion
```diff
@@ -134,7 +134,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the folllowing parameters from the original word2vec - \n",
+    "Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec - \n",
     " - model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)\n",
     " - size: Size of embeddings to be learnt (Default 100)\n",
     " - alpha: Initial learning rate (Default 0.025)\n",
```

docs/notebooks/Poincare Evaluation.ipynb

Lines changed: 1 addition & 1 deletion
```diff
@@ -1706,7 +1706,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguitities in communication with the authors.\n",
+    "1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguities in communication with the authors.\n",
     "2. Optimizing the training process further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 45 seconds to complete an epoch (~15k relations per second), whereas the open source C++ implementation takes around 1/6th the time (~95k relations per second).\n",
     "3. Implementing the variant of the model mentioned in the paper for symmetric graphs and evaluating on the scientific collaboration datasets described earlier in the report."
     ]
```
