1 | 1 | Changes |
2 | 2 | =========== |
| 3 | +## 3.2.0, 2017-12-09 |
| 4 | + |
| 5 | +:star2: New features: |
| 6 | + |
| 7 | +* New download API for corpora and pre-trained models (__[@chaitaliSaini](https://github.com/chaitaliSaini)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1705](https://github.com/RaRe-Technologies/gensim/pull/1705) & [#1632](https://github.com/RaRe-Technologies/gensim/pull/1632) & [#1492](https://github.com/RaRe-Technologies/gensim/pull/1492)) |
| 8 | + - Download large NLP datasets in one line of Python, then use with memory-efficient data streaming: |
| 9 | + ```python |
| 10 | + import gensim.downloader as api |
| 11 | + |
| 12 | + for article in api.load("wiki-english-20171001"): |
| 13 | +     pass |
| 14 | + |
| 15 | + ``` |
| 16 | + - Don’t waste time searching for good word embeddings; use the curated ones we included: |
| 17 | + ```python |
| 18 | + import gensim.downloader as api |
| 19 | + |
| 20 | + model = api.load("glove-twitter-25") |
| 21 | + model.most_similar("engineer") |
| 22 | + |
| 23 | + # [('specialist', 0.957542896270752), |
| 24 | + # ('developer', 0.9548177123069763), |
| 25 | + # ('administrator', 0.9432312846183777), |
| 26 | + # ('consultant', 0.93915855884552), |
| 27 | + # ('technician', 0.9368376135826111), |
| 28 | + # ('analyst', 0.9342101216316223), |
| 29 | + # ('architect', 0.9257484674453735), |
| 30 | + # ('engineering', 0.9159940481185913), |
| 31 | + # ('systems', 0.9123805165290833), |
| 32 | + # ('consulting', 0.9112802147865295)] |
| 33 | + ``` |
| 34 | + - [Blog post](https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/) introducing the API and design decisions. |
| 35 | + - [Notebook with examples](https://github.com/RaRe-Technologies/gensim/blob/be4500e4f0616ec2864c2ce70cb5d4db4b46512d/docs/notebooks/downloader_api_tutorial.ipynb) |
| 36 | + |
| 37 | +* New model: Poincaré embeddings (__[@jayantj](https://github.com/jayantj)__, [#1696](https://github.com/RaRe-Technologies/gensim/pull/1696) & [#1700](https://github.com/RaRe-Technologies/gensim/pull/1700) & [#1757](https://github.com/RaRe-Technologies/gensim/pull/1757) & [#1734](https://github.com/RaRe-Technologies/gensim/pull/1734)) |
| 38 | + - Embed a graph (taxonomy) in the same way as word2vec embeds words: |
| 39 | + ```python |
| 40 | + from gensim.models.poincare import PoincareRelations, PoincareModel |
| 41 | + from gensim.test.utils import datapath |
| 42 | + |
| 43 | + data = PoincareRelations(datapath('poincare_hypernyms.tsv')) |
| 44 | + model = PoincareModel(data) |
| 45 | + model.kv.most_similar("cat.n.01") |
| 46 | + |
| 47 | + # [('kangaroo.n.01', 0.010581353439700418), |
| 48 | + # ('gib.n.02', 0.011171531439892076), |
| 49 | + # ('striped_skunk.n.01', 0.012025106076442395), |
| 50 | + # ('metatherian.n.01', 0.01246679759214648), |
| 51 | + # ('mammal.n.01', 0.013281303506525968), |
| 52 | + # ('marsupial.n.01', 0.013941330203709653)] |
| 53 | + ``` |
| 54 | + - [Tutorial notebook on Poincaré embeddings](https://github.com/RaRe-Technologies/gensim/blob/920c029ca97f961c8df264672c34936607876694/docs/notebooks/Poincare%20Tutorial.ipynb) |
| 55 | + - [Model introduction and the journey of its implementation](https://rare-technologies.com/implementing-poincare-embeddings/) |
| 56 | + - [Original paper](https://arxiv.org/abs/1705.08039) on arXiv |
| 57 | + |
| 58 | +* Optimized FastText (__[@manneshiva](https://github.com/manneshiva)__, [#1742](https://github.com/RaRe-Technologies/gensim/pull/1742)) |
| 59 | + - New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation. |
| 60 | + ```python |
| 61 | + import gensim.downloader as api |
| 62 | + from gensim.models import FastText |
| 63 | + |
| 64 | + model = FastText(api.load("text8")) |
| 65 | + model.most_similar("cat") |
| 66 | + |
| 67 | + # [('catnip', 0.8538144826889038), |
| 68 | + # ('catwalk', 0.8136177062988281), |
| 69 | + # ('catchy', 0.7828493118286133), |
| 70 | + # ('caf', 0.7826495170593262), |
| 71 | + # ('bobcat', 0.7745151519775391), |
| 72 | + # ('tomcat', 0.7732658386230469), |
| 73 | + # ('moat', 0.7728310823440552), |
| 74 | + # ('caye', 0.7666271328926086), |
| 75 | + # ('catv', 0.7651021480560303), |
| 76 | + # ('caveat', 0.7643581628799438)] |
| 77 | + |
| 78 | + |
| 79 | + ``` |
| 80 | + |
| 81 | +* Binary pre-compiled wheels for Windows, OSX and Linux (__[@menshikh-iv](https://github.com/menshikh-iv)__, [MacPython/gensim-wheels/#7](https://github.com/MacPython/gensim-wheels/pull/7)) |
| 82 | + - Users no longer need a C compiler to use the fast (Cythonized) versions of word2vec, doc2vec, etc. |
| 83 | + - Faster Gensim pip installation |
| 84 | + |
| 85 | +* Added `DeprecationWarnings` to deprecated methods and parameters, with a clear schedule for removal. |
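
As a rough illustration of the deprecation pattern mentioned in the last item, here is a minimal sketch that uses only the standard `warnings` module. The names `old_api` and `new_api` and the message wording are hypothetical and are not taken from Gensim's code.

```python
import warnings


def new_api(x):
    """Replacement function (hypothetical name)."""
    return x * 2


def old_api(x):
    """Deprecated alias that still works but tells callers to migrate."""
    warnings.warn(
        "old_api is deprecated and will be removed in a future release; "
        "use new_api instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_api
    )
    return new_api(x)


# DeprecationWarning is often hidden by default, so record it explicitly here.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(21)

deprecation_seen = any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Keeping the old entry point as a thin wrapper lets existing code keep running while the warning announces the removal schedule.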
| 86 | + |
| 87 | +:+1: Improvements: |
| 88 | +* Add Montemurro and Zanette's entropy-based keyword extraction algorithm. Fix #665 (__[@PeteBleackley](https://github.com/PeteBleackley)__, [#1738](https://github.com/RaRe-Technologies/gensim/pull/1738)) |
| 89 | +* Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (__[@horpto](https://github.com/horpto)__, [#1689](https://github.com/RaRe-Technologies/gensim/pull/1689)) |
| 90 | +* Reduce distribution size. Fix #1698 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1699](https://github.com/RaRe-Technologies/gensim/pull/1699)) |
| 91 | +* Improve `scan_vocab` speed, add `build_vocab_from_freq` method (__[@jodevak](https://github.com/jodevak)__, [#1695](https://github.com/RaRe-Technologies/gensim/pull/1695)) |
| 92 | +* Improve `segment_wiki` script (__[@piskvorky](https://github.com/piskvorky)__, [#1707](https://github.com/RaRe-Technologies/gensim/pull/1707)) |
| 93 | +* Add custom `dtype` support for `LdaModel`. Partially fix #1576 (__[@xelez](https://github.com/xelez)__, [#1656](https://github.com/RaRe-Technologies/gensim/pull/1656)) |
| 94 | +* Add `doc2idx` method for `gensim.corpora.Dictionary`. Fix #1634 (__[@roopalgarg](https://github.com/roopalgarg)__, [#1720](https://github.com/RaRe-Technologies/gensim/pull/1720)) |
| 95 | +* Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1721](https://github.com/RaRe-Technologies/gensim/pull/1721)) |
| 96 | +* Add flag for hiding outdated data for `gensim.downloader.info` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1736](https://github.com/RaRe-Technologies/gensim/pull/1736)) |
| 97 | +* Add reproducible order between python versions for `gensim.corpora.Dictionary` (__[@formi23](https://github.com/formi23)__, [#1715](https://github.com/RaRe-Technologies/gensim/pull/1715)) |
| 98 | +* Update `tox.ini`, `setup.cfg`, `README.md` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1741](https://github.com/RaRe-Technologies/gensim/pull/1741)) |
| 99 | +* Add custom `logsumexp` for `LdaModel` (__[@arlenk](https://github.com/arlenk)__, [#1745](https://github.com/RaRe-Technologies/gensim/pull/1745)) |
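
For context on the `logsumexp` item above, here is a minimal sketch of the numerically stable log-sum-exp trick in plain Python. It shows the general technique only and is not the implementation added to `LdaModel`; the function name `stable_logsumexp` is hypothetical.

```python
import math


def stable_logsumexp(values):
    """Return log(sum(exp(v) for v in values)) without overflow.

    Subtracting the maximum before exponentiating keeps every exponent <= 0.
    """
    values = list(values)
    if not values:
        return -math.inf  # log of an empty sum, i.e. log(0)
    peak = max(values)
    if math.isinf(peak):
        return peak  # all -inf gives -inf; any +inf gives +inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


# math.exp(1000.0) would overflow a float, yet the shifted form is fine.
big = stable_logsumexp([1000.0, 1000.0])
small = stable_logsumexp([0.0, 0.0])
```

Both calls reduce to the shift plus `log(2)`, because the two inputs are equal in each case.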
| 100 | + |
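
The `doc2idx` item in the list above can be pictured with a small standalone sketch. It assumes the method maps each token to its integer id and uses a placeholder index for unknown tokens. The free function, the toy vocabulary and the `unknown_word_index` default shown here are illustrative assumptions, not Gensim's actual code.

```python
def doc2idx(token2id, document, unknown_word_index=-1):
    """Map each token in `document` to its integer id.

    Tokens missing from `token2id` get `unknown_word_index`.
    Unlike a bag-of-words conversion, order and repeats are preserved.
    """
    return [token2id.get(token, unknown_word_index) for token in document]


token2id = {"human": 0, "computer": 1, "interface": 2}  # toy vocabulary
indices = doc2idx(token2id, ["computer", "human", "computer", "graph"])
```

Because the result keeps token order, it suits sequence models that need positions rather than counts.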
| 101 | +:red_circle: Bug fixes: |
| 102 | +* Fix ranking formula in `gensim.summarization.bm25`. Fix #1718 (__[@souravsingh](https://github.com/souravsingh)__, [#1726](https://github.com/RaRe-Technologies/gensim/pull/1726)) |
| 103 | +* Fixed incompatibility in persistence for `FastText` wrapper. Fix #1642 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1723](https://github.com/RaRe-Technologies/gensim/pull/1723)) |
| 104 | +* Fix `gensim.sklearn_api` bug with `documents_columns` parameter. Fix #1676 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1704](https://github.com/RaRe-Technologies/gensim/pull/1704)) |
| 105 | +* Fix slowdown of CI, remove pytest-cov (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1728](https://github.com/RaRe-Technologies/gensim/pull/1728)) |
| 106 | +* Replace outdated packages in Dockerfile (__[@rbahumi](https://github.com/rbahumi)__, [#1730](https://github.com/RaRe-Technologies/gensim/pull/1730)) |
| 107 | +* Replace `num_words` with `topn` in `LdaMallet.show_topics`. Fix #1747 (__[@apoorvaeternity](https://github.com/apoorvaeternity)__, [#1749](https://github.com/RaRe-Technologies/gensim/pull/1749)) |
| 108 | +* Fix `os.rename` in `gensim.downloader` when `src` and `dst` are on different partitions (__[@anotherbugmaster](https://github.com/anotherbugmaster)__, [#1733](https://github.com/RaRe-Technologies/gensim/pull/1733)) |
| 109 | +* Fix `DeprecationWarning` from `logsumexp` (__[@dreamgonfly](https://github.com/dreamgonfly)__, [#1703](https://github.com/RaRe-Technologies/gensim/pull/1703)) |
| 110 | +* Fix backward compatibility problem in `Phrases.load`. Fix #1751 (__[@alexgarel](https://github.com/alexgarel)__, [#1758](https://github.com/RaRe-Technologies/gensim/pull/1758)) |
| 111 | +* Fix `load_word2vec_format` for `FastText`. Fix #1743 (__[@manneshiva](https://github.com/manneshiva)__, [#1755](https://github.com/RaRe-Technologies/gensim/pull/1755)) |
| 112 | +* Fix ipython kernel version in `Dockerfile`. Fix #1762 (__[@rbahumi](https://github.com/rbahumi)__, [#1764](https://github.com/RaRe-Technologies/gensim/pull/1764)) |
| 113 | +* Fix writing in `segment_wiki` (__[@horpto](https://github.com/horpto)__, [#1763](https://github.com/RaRe-Technologies/gensim/pull/1763)) |
| 114 | +* Fix file writing in `segment_wiki`, where `write` requires a bytes-like object (__[@horpto](https://github.com/horpto)__, [#1750](https://github.com/RaRe-Technologies/gensim/pull/1750)) |
| 115 | +* Fix incorrect vectors learned during online training for `FastText`. Fix #1752 (__[@manneshiva](https://github.com/manneshiva)__, [#1756](https://github.com/RaRe-Technologies/gensim/pull/1756)) |
| 116 | +* Fix `dtype` of `model.wv.syn0_vocab` on updating `vocab` for `FastText`. Fix #1759 (__[@manneshiva](https://github.com/manneshiva)__, [#1760](https://github.com/RaRe-Technologies/gensim/pull/1760)) |
| 117 | +* Fix hashing trick in `FastText.build_vocab`. Fix #1765 (__[@manneshiva](https://github.com/manneshiva)__, [#1768](https://github.com/RaRe-Technologies/gensim/pull/1768)) |
| 118 | +* Add explicit `DeprecationWarning` for all outdated stuff. Fix #1753 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1769](https://github.com/RaRe-Technologies/gensim/pull/1769)) |
| 119 | +* Fix epsilon according to `dtype` in `LdaModel` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1770](https://github.com/RaRe-Technologies/gensim/pull/1770)) |
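
The `os.rename` item above refers to a general pitfall: a rename can raise `OSError` when source and destination are on different filesystems. Below is a minimal sketch of one common remedy, falling back to `shutil.move`. The helper name `move_across_filesystems` is hypothetical, and this is not necessarily the exact change made in #1733.

```python
import os
import shutil
import tempfile


def move_across_filesystems(src, dst):
    """Move src to dst, even when they live on different partitions."""
    try:
        os.rename(src, dst)  # fast path: plain rename on the same filesystem
    except OSError:
        shutil.move(src, dst)  # fallback: copy to dst, then remove src


# Usage with temporary directories (these may or may not share a filesystem).
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src_path = os.path.join(src_dir, "model.bin")
dst_path = os.path.join(dst_dir, "model.bin")
with open(src_path, "wb") as handle:
    handle.write(b"payload")

move_across_filesystems(src_path, dst_path)

with open(dst_path, "rb") as handle:
    moved_payload = handle.read()
source_gone = not os.path.exists(src_path)

# Clean up the temporary directories.
shutil.rmtree(src_dir)
shutil.rmtree(dst_dir)
```

Calling `shutil.move` directly would also work, since it attempts a rename first; the explicit `try` only makes the two paths visible.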
| 120 | + |
| 121 | +:books: Tutorial and doc improvements: |
| 122 | +* Update perf numbers of `segment_wiki` (__[@piskvorky](https://github.com/piskvorky)__, [#1708](https://github.com/RaRe-Technologies/gensim/pull/1708)) |
| 123 | +* Update docstring for `gensim.summarization.summarize`. Fix #1575 (__[@fbarrios](https://github.com/fbarrios)__, [#1702](https://github.com/RaRe-Technologies/gensim/pull/1702)) |
| 124 | +* Refactor API Reference for `gensim.parsing`. Fix #1664 (__[@CLearERR](https://github.com/CLearERR)__, [#1684](https://github.com/RaRe-Technologies/gensim/pull/1684)) |
| 125 | +* Fix typos in doc2vec-wikipedia notebook (__[@youqad](https://github.com/youqad)__, [#1727](https://github.com/RaRe-Technologies/gensim/pull/1727)) |
| 126 | +* Fix PyPI long description rendering (__[@edigaryev](https://github.com/edigaryev)__, [#1739](https://github.com/RaRe-Technologies/gensim/pull/1739)) |
| 127 | +* Fix twitter badge src (__[@menshikh-iv](https://github.com/menshikh-iv)__) |
| 128 | +* Fix maillist badge color (__[@menshikh-iv](https://github.com/menshikh-iv)__) |
| 129 | + |
| 130 | +:warning: Deprecations (will be removed in the next major release): |
| 131 | +* Remove |
| 132 | + - `gensim.examples` |
| 133 | + - `gensim.nosy` |
| 134 | + - `gensim.scripts.word2vec_standalone` |
| 135 | + - `gensim.scripts.make_wiki_lemma` |
| 136 | + - `gensim.scripts.make_wiki_online` |
| 137 | + - `gensim.scripts.make_wiki_online_lemma` |
| 138 | + - `gensim.scripts.make_wiki_online_nodebug` |
| 139 | + - `gensim.scripts.make_wiki` |
| 140 | + |
| 141 | +* Move |
| 142 | + - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py` |
| 143 | + - `gensim.summarization` ➡ `gensim.models.summarization` |
| 144 | + - `gensim.topic_coherence` ➡ `gensim.models._coherence` |
| 145 | + - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work) |
| 146 | + - `gensim.parsing.*` ➡ `gensim.utils.text_utils` |
| 147 | + |
| 148 | + |
3 | 149 | ## 3.1.0, 2017-11-06 |
4 | 150 |
|
5 | 151 |