Commit 6dd8ae7

Merge branch 'release-3.2.0'
2 parents b6234e7 + 25014fc commit 6dd8ae7

174 files changed (+131793 -3073 lines changed)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ Thumbs.db
 
 # Other #
 #########
+.tox/
+.cache/
 .project
 .pydevproject
 .ropeproject

.travis.yml

Lines changed: 13 additions & 7 deletions
@@ -5,18 +5,24 @@ cache:
   directories:
     - $HOME/.cache/pip
     - $HOME/.ccache
-
+    - $HOME/.pip-cache
 dist: trusty
 language: python
 
 
 matrix:
   include:
-    - env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="yes"
-    - env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
-    - env: PYTHON_VERSION="3.5" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
-    - env: PYTHON_VERSION="3.6" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
+    - python: '2.7'
+      env: TOXENV="flake8, docs"
+
+    - python: '2.7'
+      env: TOXENV="py27-linux"
+
+    - python: '3.5'
+      env: TOXENV="py35-linux"
 
+    - python: '3.6'
+      env: TOXENV="py36-linux"
 
-install: source continuous_integration/travis/install.sh
-script: bash continuous_integration/travis/run.sh
+install: pip install tox
+script: tox -vv

CHANGELOG.md

Lines changed: 146 additions & 0 deletions
@@ -1,5 +1,151 @@
 Changes
 ===========
+## 3.2.0, 2017-12-09
+
+:star2: New features:
+
+* New download API for corpora and pre-trained models (__[@chaitaliSaini](https://github.com/chaitaliSaini)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1705](https://github.com/RaRe-Technologies/gensim/pull/1705) & [#1632](https://github.com/RaRe-Technologies/gensim/pull/1632) & [#1492](https://github.com/RaRe-Technologies/gensim/pull/1492))
+  - Download large NLP datasets in one line of Python, then use with memory-efficient data streaming:
+    ```python
+    import gensim.downloader as api
+
+    for article in api.load("wiki-english-20171001"):
+        pass
+
+    ```
+  - Don’t waste time searching for good word embeddings, use the curated ones we included:
+    ```python
+    import gensim.downloader as api
+
+    model = api.load("glove-twitter-25")
+    model.most_similar("engineer")
+
+    # [('specialist', 0.957542896270752),
+    # ('developer', 0.9548177123069763),
+    # ('administrator', 0.9432312846183777),
+    # ('consultant', 0.93915855884552),
+    # ('technician', 0.9368376135826111),
+    # ('analyst', 0.9342101216316223),
+    # ('architect', 0.9257484674453735),
+    # ('engineering', 0.9159940481185913),
+    # ('systems', 0.9123805165290833),
+    # ('consulting', 0.9112802147865295)]
+    ```
+  - [Blog post](https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/) introducing the API and design decisions.
+  - [Notebook with examples](https://github.com/RaRe-Technologies/gensim/blob/be4500e4f0616ec2864c2ce70cb5d4db4b46512d/docs/notebooks/downloader_api_tutorial.ipynb)
+
+* New model: Poincaré embeddings (__[@jayantj](https://github.com/jayantj)__, [#1696](https://github.com/RaRe-Technologies/gensim/pull/1696) & [#1700](https://github.com/RaRe-Technologies/gensim/pull/1700) & [#1757](https://github.com/RaRe-Technologies/gensim/pull/1757) & [#1734](https://github.com/RaRe-Technologies/gensim/pull/1734))
+  - Embed a graph (taxonomy) in the same way as word2vec embeds words:
+    ```python
+    from gensim.models.poincare import PoincareRelations, PoincareModel
+    from gensim.test.utils import datapath
+
+    data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
+    model = PoincareModel(data)
+    model.kv.most_similar("cat.n.01")
+
+    # [('kangaroo.n.01', 0.010581353439700418),
+    # ('gib.n.02', 0.011171531439892076),
+    # ('striped_skunk.n.01', 0.012025106076442395),
+    # ('metatherian.n.01', 0.01246679759214648),
+    # ('mammal.n.01', 0.013281303506525968),
+    # ('marsupial.n.01', 0.013941330203709653)]
+    ```
+  - [Tutorial notebook on Poincaré embeddings](https://github.com/RaRe-Technologies/gensim/blob/920c029ca97f961c8df264672c34936607876694/docs/notebooks/Poincare%20Tutorial.ipynb)
+  - [Model introduction and the journey of its implementation](https://rare-technologies.com/implementing-poincare-embeddings/)
+  - [Original paper](https://arxiv.org/abs/1705.08039) on arXiv
+
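Review note on the Poincaré entry: the scores in the `most_similar` output above are sorted in ascending order, which suggests they are distances in the Poincaré ball (smaller means closer) rather than cosine similarities. The sketch below computes that distance as defined in the linked arXiv paper, in plain Python. It is an illustration only, not gensim's implementation, and `poincare_distance` is a hypothetical helper name.

```python
import math


def poincare_distance(u, v):
    """Distance between two points that lie strictly inside the unit ball.

    Hypothetical helper for illustration; inputs are plain sequences of floats.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))


origin = [0.0, 0.0]
near_centre = [0.1, 0.0]
near_edge = [0.9, 0.0]

# The same Euclidean step costs far more distance close to the boundary,
# which is what leaves room for the leaves of a large taxonomy.
print(poincare_distance(origin, near_centre))
print(poincare_distance(origin, near_edge))
```

Points on or outside the unit sphere make the denominator zero or negative, so a real implementation has to keep vectors strictly inside the ball.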
+* Optimized FastText (__[@manneshiva](https://github.com/manneshiva)__, [#1742](https://github.com/RaRe-Technologies/gensim/pull/1742))
+  - New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation.
+    ```python
+    import gensim.downloader as api
+    from gensim.models import FastText
+
+    model = FastText(api.load("text8"))
+    model.most_similar("cat")
+
+    # [('catnip', 0.8538144826889038),
+    # ('catwalk', 0.8136177062988281),
+    # ('catchy', 0.7828493118286133),
+    # ('caf', 0.7826495170593262),
+    # ('bobcat', 0.7745151519775391),
+    # ('tomcat', 0.7732658386230469),
+    # ('moat', 0.7728310823440552),
+    # ('caye', 0.7666271328926086),
+    # ('catv', 0.7651021480560303),
+    # ('caveat', 0.7643581628799438)]
+
+
+    ```
+
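Review note on the FastText entry: most neighbours of "cat" in the output above share a substring with it ("catnip", "bobcat", "tomcat"). That is expected, because FastText builds a word's vector from its character n-grams. The sketch below extracts such n-grams in plain Python. It is a simplified illustration, not the code added in #1742; `char_ngrams` and `ngram_overlap` are hypothetical helpers, and the 3 to 6 n-gram range is an assumed default.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, including boundary markers."""
    wrapped = "<" + word + ">"  # '<' and '>' mark the start and end of the word
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(char_ngrams("cat")))
print(ngram_overlap("cat", "catnip"))
print(ngram_overlap("cat", "dog"))
```

Shared n-grams only explain part of the similarity score, since the n-gram vectors themselves are learned from context during training.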
+* Binary pre-compiled wheels for Windows, OSX and Linux (__[@menshikh-iv](https://github.com/menshikh-iv)__, [MacPython/gensim-wheels/#7](https://github.com/MacPython/gensim-wheels/pull/7))
+  - Users no longer need to have a C compiler for using the fast (Cythonized) version of word2vec, doc2vec, etc.
+  - Faster Gensim pip installation
+
+* Added `DeprecationWarnings` to deprecated methods and parameters, with a clear schedule for removal.
+
+:+1: Improvements:
+* Add Montemurro and Zanette's entropy based keyword extraction algorithm. Fix #665 (__[@PeteBleackley](https://github.com/PeteBleackley)__, [#1738](https://github.com/RaRe-Technologies/gensim/pull/1738))
+* Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (__[@horpto](https://github.com/horpto)__, [#1689](https://github.com/RaRe-Technologies/gensim/pull/1689))
+* Reduce distribution size. Fix #1698 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1699](https://github.com/RaRe-Technologies/gensim/pull/1699))
+* Improve `scan_vocab` speed, `build_vocab_from_freq` method (__[@jodevak](https://github.com/jodevak)__, [#1695](https://github.com/RaRe-Technologies/gensim/pull/1695))
+* Improve `segment_wiki` script (__[@piskvorky](https://github.com/piskvorky)__, [#1707](https://github.com/RaRe-Technologies/gensim/pull/1707))
+* Add custom `dtype` support for `LdaModel`. Partially fix #1576 (__[@xelez](https://github.com/xelez)__, [#1656](https://github.com/RaRe-Technologies/gensim/pull/1656))
+* Add `doc2idx` method for `gensim.corpora.Dictionary`. Fix #1634 (__[@roopalgarg](https://github.com/roopalgarg)__, [#1720](https://github.com/RaRe-Technologies/gensim/pull/1720))
+* Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1721](https://github.com/RaRe-Technologies/gensim/pull/1721))
+* Add flag for hiding outdated data for `gensim.downloader.info` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1736](https://github.com/RaRe-Technologies/gensim/pull/1736))
+* Add reproducible order between python versions for `gensim.corpora.Dictionary` (__[@formi23](https://github.com/formi23)__, [#1715](https://github.com/RaRe-Technologies/gensim/pull/1715))
+* Update `tox.ini`, `setup.cfg`, `README.md` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1741](https://github.com/RaRe-Technologies/gensim/pull/1741))
+* Add custom `logsumexp` for `LdaModel` (__[@arlenk](https://github.com/arlenk)__, [#1745](https://github.com/RaRe-Technologies/gensim/pull/1745))
+
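Review note on the `logsumexp` entry: the code from #1745 is not part of the visible diff, so the following is only a sketch of the standard max-shift technique that such a function usually relies on, written in plain Python. The name and signature here are illustrative and may differ from what the PR adds.

```python
import math


def logsumexp(values):
    """Compute log(sum(exp(v) for v in values)) without overflowing.

    Illustrative helper; expects a non-empty sequence of floats.
    """
    largest = max(values)
    if math.isinf(largest):
        # All inputs are -inf (result -inf) or one is +inf (result +inf).
        return largest
    # Shifting by the maximum keeps every exponent at or below zero.
    return largest + math.log(sum(math.exp(v - largest) for v in values))


print(logsumexp([0.0, 0.0]))        # log(2)
print(logsumexp([1000.0, 1000.0]))  # naive exp(1000) would overflow
```

A naive `math.log(sum(math.exp(v) ...))` raises `OverflowError` on the second call, which is the case the shift is there to avoid.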
+:red_circle: Bug fixes:
+* Fix ranking formula in `gensim.summarization.bm25`. Fix #1718 (__[@souravsingh](https://github.com/souravsingh)__, [#1726](https://github.com/RaRe-Technologies/gensim/pull/1726))
+* Fixed incompatibility in persistence for `FastText` wrapper. Fix #1642 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1723](https://github.com/RaRe-Technologies/gensim/pull/1723))
+* Fix `gensim.sklearn_api` bug with `documents_columns` parameter. Fix #1676 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1704](https://github.com/RaRe-Technologies/gensim/pull/1704))
+* Fix slowdown of CI, remove pytest-cov (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1728](https://github.com/RaRe-Technologies/gensim/pull/1728))
+* Replace outdated packages in Dockerfile (__[@rbahumi](https://github.com/rbahumi)__, [#1730](https://github.com/RaRe-Technologies/gensim/pull/1730))
+* Replace `num_words` to `topn` in `LdaMallet.show_topics`. Fix #1747 (__[@apoorvaeternity](https://github.com/apoorvaeternity)__, [#1749](https://github.com/RaRe-Technologies/gensim/pull/1749))
+* Fix `os.rename` from `gensim.downloader` when 'src' and 'dst' on different partitions (__[@anotherbugmaster](https://github.com/anotherbugmaster)__, [#1733](https://github.com/RaRe-Technologies/gensim/pull/1733))
+* Fix `DeprecationWarning` from `logsumexp` (__[@dreamgonfly](https://github.com/dreamgonfly)__, [#1703](https://github.com/RaRe-Technologies/gensim/pull/1703))
+* Fix backward compatibility problem in `Phrases.load`. Fix #1751 (__[@alexgarel](https://github.com/alexgarel)__, [#1758](https://github.com/RaRe-Technologies/gensim/pull/1758))
+* Fix `load_word2vec_format` from `FastText`. Fix #1743 (__[@manneshiva](https://github.com/manneshiva)__, [#1755](https://github.com/RaRe-Technologies/gensim/pull/1755))
+* Fix ipython kernel version in `Dockerfile`. Fix #1762 (__[@rbahumi](https://github.com/rbahumi)__, [#1764](https://github.com/RaRe-Technologies/gensim/pull/1764))
+* Fix writing in `segment_wiki` (__[@horpto](https://github.com/horpto)__, [#1763](https://github.com/RaRe-Technologies/gensim/pull/1763))
+* Fix write method of file requires byte-like object in `segment_wiki` (__[@horpto](https://github.com/horpto)__, [#1750](https://github.com/RaRe-Technologies/gensim/pull/1750))
+* Fix incorrect vectors learned during online training for `FastText`. Fix #1752 (__[@manneshiva](https://github.com/manneshiva)__, [#1756](https://github.com/RaRe-Technologies/gensim/pull/1756))
+* Fix `dtype` of `model.wv.syn0_vocab` on updating `vocab` for `FastText`. Fix #1759 (__[@manneshiva](https://github.com/manneshiva)__, [#1760](https://github.com/RaRe-Technologies/gensim/pull/1760))
+* Fix hashing-trick from `FastText.build_vocab`. Fix #1765 (__[@manneshiva](https://github.com/manneshiva)__, [#1768](https://github.com/RaRe-Technologies/gensim/pull/1768))
+* Add explicit `DeprecationWarning` for all outdated stuff. Fix #1753 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1769](https://github.com/RaRe-Technologies/gensim/pull/1769))
+* Fix epsilon according to `dtype` in `LdaModel` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1770](https://github.com/RaRe-Technologies/gensim/pull/1770))
+
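Review note on the `os.rename` entry: `os.rename` fails with `EXDEV` when source and destination are on different filesystems, which matches the "different partitions" wording above. The diff does not show how #1733 resolves this, so the sketch below shows one common approach, assuming a fallback to `shutil.move`; `move_file` is a hypothetical helper, not gensim code.

```python
import errno
import os
import shutil
import tempfile


def move_file(src, dst):
    """Move `src` to `dst`, falling back to copy-and-delete across filesystems."""
    try:
        os.rename(src, dst)  # atomic, but only works within one filesystem
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        shutil.move(src, dst)  # copies the data, then removes the source


# Small demonstration inside a temporary directory (same filesystem here,
# so only the os.rename path is exercised).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "model.tmp")
target = os.path.join(workdir, "model.bin")
with open(source, "wb") as handle:
    handle.write(b"payload")
move_file(source, target)
print(os.path.exists(source), os.path.exists(target))
shutil.rmtree(workdir)
```

Calling `shutil.move` directly is a simpler alternative, since it already tries a rename first and copies only when that fails.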
+:books: Tutorial and doc improvements:
+* Update perf numbers of `segment_wiki` (__[@piskvorky](https://github.com/piskvorky)__, [#1708](https://github.com/RaRe-Technologies/gensim/pull/1708))
+* Update docstring for `gensim.summarization.summarize`. Fix #1575 (__[@fbarrios](https://github.com/fbarrios)__, [#1702](https://github.com/RaRe-Technologies/gensim/pull/1702))
+* Refactor API Reference for `gensim.parsing`. Fix #1664 (__[@CLearERR](https://github.com/CLearERR)__, [#1684](https://github.com/RaRe-Technologies/gensim/pull/1684))
+* Fix typos in doc2vec-wikipedia notebook (__[@youqad](https://github.com/youqad)__, [#1727](https://github.com/RaRe-Technologies/gensim/pull/1727))
+* Fix PyPI long description rendering (__[@edigaryev](https://github.com/edigaryev)__, [#1739](https://github.com/RaRe-Technologies/gensim/pull/1739))
+* Fix twitter badge src (__[@menshikh-iv](https://github.com/menshikh-iv)__)
+* Fix maillist badge color (__[@menshikh-iv](https://github.com/menshikh-iv)__)
+
+:warning: Deprecations (will be removed in the next major release)
+* Remove
+  - `gensim.examples`
+  - `gensim.nosy`
+  - `gensim.scripts.word2vec_standalone`
+  - `gensim.scripts.make_wiki_lemma`
+  - `gensim.scripts.make_wiki_online`
+  - `gensim.scripts.make_wiki_online_lemma`
+  - `gensim.scripts.make_wiki_online_nodebug`
+  - `gensim.scripts.make_wiki`
+
+* Move
+  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
+  - `gensim.summarization` ➡ `gensim.models.summarization`
+  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
+  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
+  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`
+
+
 ## 3.1.0, 2017-11-06
 
 
MANIFEST.in

Lines changed: 2 additions & 4 deletions
@@ -1,8 +1,4 @@
-recursive-include docs *
 recursive-include gensim/test/test_data *
-recursive-include . *.sh
-prune docs/src*
-prune docs/notebooks/datasets
 include README.md
 include CHANGELOG.md
 include COPYING
@@ -14,3 +10,5 @@ include gensim/models/word2vec_inner.pyx
 include gensim/models/word2vec_inner.pxd
 include gensim/models/doc2vec_inner.c
 include gensim/models/doc2vec_inner.pyx
+include gensim/models/fasttext_inner.c
+include gensim/models/fasttext_inner.pyx

README.md

Lines changed: 2 additions & 2 deletions
@@ -5,9 +5,9 @@ gensim – Topic Modelling in Python
 [![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases)
 [![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/gensim)
 [![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847)
-[![Mailing List](https://img.shields.io/badge/-Mailing%20List-lightgrey.svg)](https://groups.google.com/forum/#!forum/gensim)
+[![Mailing List](https://img.shields.io/badge/-Mailing%20List-brightgreen.svg)](https://groups.google.com/forum/#!forum/gensim)
 [![Gitter](https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg)](https://gitter.im/RaRe-Technologies/gensim)
-[![Follow](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/gensim_py)
+[![Follow](https://img.shields.io/twitter/follow/gensim_py.svg?style=social&label=Follow)](https://twitter.com/gensim_py)
 
 
 

appveyor.yml

Lines changed: 5 additions & 46 deletions
@@ -13,29 +13,20 @@ environment:
       secure: qXqY3dFmLOqvxa3Om2gQi/BjotTOK+EP2IPLolBNo0c61yDtNWxbmE4wH3up72Be
 
   matrix:
-    # - PYTHON: "C:\\Python27"
-    # PYTHON_VERSION: "2.7.12"
-    # PYTHON_ARCH: "32"
-
     - PYTHON: "C:\\Python27-x64"
       PYTHON_VERSION: "2.7.12"
       PYTHON_ARCH: "64"
-
-    # - PYTHON: "C:\\Python35"
-    # PYTHON_VERSION: "3.5.2"
-    # PYTHON_ARCH: "32"
+      TOXENV: "py27-win"
 
     - PYTHON: "C:\\Python35-x64"
       PYTHON_VERSION: "3.5.2"
       PYTHON_ARCH: "64"
-
-    # - PYTHON: "C:\\Python36"
-    # PYTHON_VERSION: "3.6.0"
-    # PYTHON_ARCH: "32"
+      TOXENV: "py35-win"
 
     - PYTHON: "C:\\Python36-x64"
       PYTHON_VERSION: "3.6.0"
       PYTHON_ARCH: "64"
+      TOXENV: "py36-win"
 
 init:
   - "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
@@ -57,48 +48,16 @@ install:
   # not already installed.
   - "powershell ./continuous_integration/appveyor/install.ps1"
   - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
-  - "python -m pip install -U pip"
+  - "python -m pip install -U pip tox"
 
   # Check that we have the expected version and architecture for Python
   - "python --version"
   - "python -c \"import struct; print(struct.calcsize('P') * 8)\""
 
-  # Install the build and runtime dependencies of the project.
-  - "%CMD_IN_ENV% pip install --timeout=60 --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r continuous_integration/appveyor/requirements.txt"
-  - "%CMD_IN_ENV% python setup.py bdist_wheel bdist_wininst"
-  - ps: "ls dist"
-
-  # Install the genreated wheel package to test it
-  - "pip install --pre --no-index --find-links dist/ gensim"
-
-# Not a .NET project, we build scikit-learn in the install step instead
 build: false
 
 test_script:
-  # Change to a non-source folder to make sure we run the tests on the
-  # installed library.
-  - "mkdir empty_folder"
-  - "cd empty_folder"
-  - "pip install pyemd testfixtures sklearn Morfessor==2.0.2a4"
-  - "pip freeze"
-  - "python -c \"import nose; nose.main()\" -s -v gensim"
-  # Move back to the project folder
-  - "cd .."
-
-artifacts:
-  # Archive the generated wheel package in the ci.appveyor.com build report.
-  - path: dist\*
-on_success:
-  # Upload the generated wheel package to Rackspace
-  # On Windows, Apache Libcloud cannot find a standard CA cert bundle so we
-  # disable the ssl checks.
-  - "python -m wheelhouse_uploader upload --no-ssl-check --local-folder=dist gensim-windows-wheels"
-
-notifications:
-  - provider: Webhook
-    url: https://webhooks.gitter.im/e/62c44ad26933cd7ed7e8
-    on_build_success: false
-    on_build_failure: True
+  - tox -vv
 
 cache:
   # Use the appveyor cache to avoid re-downloading large archives such
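Review note on the architecture check kept in the `install` step: `struct.calcsize('P') * 8` reports the pointer width of the running interpreter, which is how the build confirms it got the 64-bit Python requested by `PYTHON_ARCH`. A standalone version of the same check, where `pointer_bits` is a hypothetical helper name:

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter in bits."""
    # 'P' is the struct format code for a C void pointer; its size is in bytes.
    return struct.calcsize("P") * 8


bits = pointer_bits()
print("Python %d.%d, %d-bit" % (sys.version_info[0], sys.version_info[1], bits))

# sys.maxsize gives the same answer by a different route on CPython.
print("64-bit according to sys.maxsize:", sys.maxsize > 2 ** 32)
```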
