Commit 75d9164

committed
update changelog to 3.1.0
1 parent 53ae984 commit 75d9164

1 file changed

+98
-0
lines changed

CHANGELOG.md

Lines changed: 98 additions & 0 deletions
@@ -1,5 +1,103 @@
Changes
===========
## 3.1.0, 2017-11-06


:star2: New features:
* Massive optimizations to LSI model training (__[@isamaru](https://github.com/isamaru)__, [#1620](https://github.com/RaRe-Technologies/gensim/pull/1620) & [#1622](https://github.com/RaRe-Technologies/gensim/pull/1622))
    - The LSI model can now use single precision (float32), consuming *40% less memory* while being *40% faster*.
    - The LSI model can now also accept a CSC matrix as input, for a further memory and speed boost.
    - Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
    ```python
    import gensim
    import numpy as np
    from gensim.models import LsiModel

    # just an example; the corpus stream is up to you
    streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

    # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
    in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

    # then pass the CSC to LsiModel directly
    model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
    ```
    - Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):
    ```python
    model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
    ```
* Add common terms to Phrases. Fix #1258 (__[@alexgarel](https://github.com/alexgarel)__, [#1568](https://github.com/RaRe-Technologies/gensim/pull/1568))
    - Phrases can now include common terms inside detected phrases. Previously, if you wanted to detect ngrams such as `car_with_driver` and `car_without_driver`, you could either remove stop words before processing, in which case you would only find `car_driver`, or keep them and find none of those forms (because they have three words, and because the high frequency of "with" prevents them from being scored correctly). Inspired by the [ES common grams token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-common-grams-tokenfilter.html).
    ```python
    from gensim.models.phrases import Phrases
    from nltk.corpus import stopwords  # assumption: the stop word list comes from NLTK

    phr_old = Phrases(corpus)  # `corpus` is your own iterable of tokenized sentences
    phr_new = Phrases(corpus, common_terms=stopwords.words('english'))

    print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with", "driver"]
    print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
    ```
* New [segment_wiki.py](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py) script (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1483](https://github.com/RaRe-Technologies/gensim/pull/1483) & [#1694](https://github.com/RaRe-Technologies/gensim/pull/1694))
    - CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by Wikimedia) to extract its articles in a plain text format. It extracts each article's title, section names and section content, and saves them in JSON Lines format:
    ```bash
    python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz
    ```
    Processing the entire English Wikipedia dump (13.5 GB, link [here](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)) takes about 2.5 hours (i7-6700HQ, SSD).

    The output format is one article per line, serialized into JSON:
    ```python
    import json
    from smart_open import smart_open

    for line in smart_open('enwiki-latest-pages-articles.json.gz'):  # read the file we just created
        article = json.loads(line)
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
    ```
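If `smart_open` is not installed, the same output can probably be read with the standard library alone. The following is a minimal sketch, assuming only the three fields shown above (`title`, `section_titles`, `section_texts`); the helper name `iter_articles` and the sample record are hypothetical, and a tiny file is written first so the example runs without a real dump.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines, if any
                yield json.loads(line)


# Hypothetical sample record, using only the fields shown in the changelog.
sample = {
    "title": "Example article",
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Intro text.", "History text."],
}

# Write a one-article file so the sketch is runnable without a Wikipedia dump.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as out:
    out.write(json.dumps(sample) + "\n")

articles = list(iter_articles(sample_path))
for article in articles:
    print("Article title: %s" % article["title"])
    for section_title, section_text in zip(article["section_titles"], article["section_texts"]):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
```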
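As a rough illustration of the `common_terms` idea described for `Phrases` above, the sketch below shows how candidate phrases could be formed so that common terms may appear inside a phrase but never at its start or end. It is not gensim's implementation: the function name `candidate_phrases` is hypothetical, and the scoring step that decides which candidates are actually joined is left out entirely.

```python
def candidate_phrases(tokens, common_terms):
    """Return candidate phrases as tuples of tokens.

    Each candidate starts and ends with an uncommon word; any common
    terms found between two uncommon words are kept inside the candidate.
    Illustration only: no scoring or thresholding is applied.
    """
    candidates = []
    last_uncommon = None  # most recent word that is not a common term
    in_between = []       # common terms seen since last_uncommon

    for token in tokens:
        if token in common_terms:
            # A common term can only sit inside a phrase, never start one.
            if last_uncommon is not None:
                in_between.append(token)
            continue
        if last_uncommon is not None:
            candidates.append(tuple([last_uncommon] + in_between + [token]))
        last_uncommon = token
        in_between = []

    return candidates


sentence = ["we", "provide", "car", "with", "driver"]
print(candidate_phrases(sentence, common_terms={"with"}))
```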
:+1: Improvements:
* Speed up FastText tests (__[@horpto](https://github.com/horpto)__, [#1686](https://github.com/RaRe-Technologies/gensim/pull/1686))
* Add optimization for `SlicedCorpus.__len__` (__[@horpto](https://github.com/horpto)__, [#1679](https://github.com/RaRe-Technologies/gensim/pull/1679))
* Make `word_vec` return immutable vector. Fix #1651 (__[@CLearERR](https://github.com/CLearERR)__, [#1662](https://github.com/RaRe-Technologies/gensim/pull/1662))
* Drop Win x32 support & add rolling builds (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1652](https://github.com/RaRe-Technologies/gensim/pull/1652))
* Fix scoring function in Phrases. Fix #1533, #1635 (__[@michaelwsherman](https://github.com/michaelwsherman)__, [#1573](https://github.com/RaRe-Technologies/gensim/pull/1573))
* Add configuration for flake8 to setup.cfg (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1636](https://github.com/RaRe-Technologies/gensim/pull/1636))
* Add `build_vocab_from_freq` to Word2Vec, speed up `scan_vocab` (__[@jodevak](https://github.com/jodevak)__, [#1599](https://github.com/RaRe-Technologies/gensim/pull/1599))
* Add `most_similar_to_given` method for KeyedVectors (__[@TheMathMajor](https://github.com/TheMathMajor)__, [#1582](https://github.com/RaRe-Technologies/gensim/pull/1582))
* Add `__getitem__` method to Sparse2Corpus to allow direct queries (__[@isamaru](https://github.com/isamaru)__, [#1621](https://github.com/RaRe-Technologies/gensim/pull/1621))


:red_circle: Bug fixes:
* Add single core mode to CoherenceModel. Fix #1683 (__[@horpto](https://github.com/horpto)__, [#1685](https://github.com/RaRe-Technologies/gensim/pull/1685))
* Fix ResourceWarnings in tests. Partially fix #1519 (__[@horpto](https://github.com/horpto)__, [#1660](https://github.com/RaRe-Technologies/gensim/pull/1660))
* Fix DeprecationWarnings generated by deprecated assertEquals. Partially fix #1519 (__[@poornagurram](https://github.com/poornagurram)__, [#1658](https://github.com/RaRe-Technologies/gensim/pull/1658))
* Fix DeprecationWarnings for regex string literals. Fix #1646 (__[@franklsf95](https://github.com/franklsf95)__, [#1649](https://github.com/RaRe-Technologies/gensim/pull/1649))
* Fix pagerank algorithm. Fix #805 (__[@xelez](https://github.com/xelez)__, [#1653](https://github.com/RaRe-Technologies/gensim/pull/1653))
* Fix FastText inconsistent dtype. Fix #1637 (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1638](https://github.com/RaRe-Technologies/gensim/pull/1638))
* Fix `test_filename_filtering` test (__[@nehaljwani](https://github.com/nehaljwani)__, [#1647](https://github.com/RaRe-Technologies/gensim/pull/1647))


:books: Tutorial and doc improvements:
* Fix code/docstring style (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1650](https://github.com/RaRe-Technologies/gensim/pull/1650))
* Update error message for supervised FastText. Fix #1498 (__[@ElSaico](https://github.com/ElSaico)__, [#1645](https://github.com/RaRe-Technologies/gensim/pull/1645))
* Add "DOI badge" to README. Fix #1610 (__[@dphov](https://github.com/dphov)__, [#1639](https://github.com/RaRe-Technologies/gensim/pull/1639))
* Remove duplicate annoy notebook. Fix #1415 (__[@Karamax](https://github.com/Karamax)__, [#1640](https://github.com/RaRe-Technologies/gensim/pull/1640))
* Fix duplication and wrong markup in docs (__[@horpto](https://github.com/horpto)__, [#1633](https://github.com/RaRe-Technologies/gensim/pull/1633))
* Refactor dendrogram & topic network notebooks (__[@parulsethi](https://github.com/parulsethi)__, [#1571](https://github.com/RaRe-Technologies/gensim/pull/1571))
* Fix release badge (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1631](https://github.com/RaRe-Technologies/gensim/pull/1631))


:warning: Deprecations (will come into force in the next major release)
* Remove
    - `gensim.examples`
    - `gensim.nosy`
    - `gensim.scripts.word2vec_standalone`
    - `gensim.scripts.make_wiki_lemma`
    - `gensim.scripts.make_wiki_online`
    - `gensim.scripts.make_wiki_online_lemma`
    - `gensim.scripts.make_wiki_online_nodebug`
    - `gensim.scripts.make_wiki`

* Move
    - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
    - `gensim.summarization` ➡ `gensim.models.summarization`
    - `gensim.topic_coherence` ➡ `gensim.models._coherence`
    - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
    - `gensim.parsing.*` ➡ `gensim.utils.text_utils`

Also, we'll create an `experimental` subpackage for unstable models. Specific lists will be available in the next major release.


## 3.0.1, 2017-10-12