Commit b6234e7: Merge branch 'release-3.1.0'
2 parents: 86e0618 + 75d9164

54 files changed: +14018 −9867 lines

CHANGELOG.md

Lines changed: 98 additions & 0 deletions
@@ -1,5 +1,103 @@
Changes
===========

## 3.1.0, 2017-11-06

:star2: New features:
* Massive optimizations to LSI model training (__[@isamaru](https://github.com/isamaru)__, [#1620](https://github.com/RaRe-Technologies/gensim/pull/1620) & [#1622](https://github.com/RaRe-Technologies/gensim/pull/1622))
- The LSI model now supports single precision (float32), consuming *40% less memory* while training *40% faster*.
- The LSI model can also accept a CSC matrix as input, for a further memory and speed boost.
- Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
```python
import gensim
import numpy as np
from gensim.models import LsiModel

# just an example; the corpus stream is up to you
streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

# convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

# then pass the CSC matrix to LsiModel directly
model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
```
- Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):
```python
model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
```
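For intuition on the float32 option, here is a quick back-of-the-envelope check of the raw per-array saving (a generic NumPy sketch, not gensim code; the matrix shape is made up for illustration):

```python
import numpy as np

# a hypothetical dense projection matrix: 100,000 terms x 500 topics
float64_matrix = np.zeros((100_000, 500), dtype=np.float64)
float32_matrix = float64_matrix.astype(np.float32)

print(float64_matrix.nbytes // 2**20)  # MB at double precision
print(float32_matrix.nbytes // 2**20)  # exactly half at single precision
```

The end-to-end LSI saving (~40%) is smaller than this raw 2x, presumably because not all of the model's memory lives in float arrays.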
25+
* Add common terms to Phrases. Fix #1258 (__[@alexgarel](https://github.com/alexgarel)__, [#1568](https://github.com/RaRe-Technologies/gensim/pull/1568))
26+
- Phrases allows to use common terms in bigrams. Before, if you are searching to reveal ngrams like `car_with_driver` and `car_without_driver`, you can either remove stop words before processing, but you will only find `car_driver`, or you won't find any of those forms (because they have three words, but also because high frequency of with will avoid them to be scored correctly), inspired by [ES common grams token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-common-grams-tokenfilter.html).
27+
```python
from gensim.models.phrases import Phrases
from nltk.corpus import stopwords

# `corpus` is your iterable of tokenized sentences
phr_old = Phrases(corpus)
phr_new = Phrases(corpus, common_terms=stopwords.words('english'))  # NLTK's English stop word list

print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with", "driver"]
print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
```
* New [segment_wiki.py](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py) script (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1483](https://github.com/RaRe-Technologies/gensim/pull/1483) & [#1694](https://github.com/RaRe-Technologies/gensim/pull/1694))
- CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by Wikimedia) and extracting its articles as plain text. It extracts each article's title, section names and section contents, and saves them in JSON-lines format:
```bash
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz
```
Processing the entire English Wikipedia dump (13.5 GB, link [here](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)) takes about 2.5 hours (i7-6700HQ, SSD).

The output format is one article per line, serialized into JSON:
```python
import json

from smart_open import smart_open

for line in smart_open('enwiki-latest-pages-articles.json.gz'):  # read the file we just created
    article = json.loads(line)
    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
```

:+1: Improvements:
* Speed up FastText tests (__[@horpto](https://github.com/horpto)__, [#1686](https://github.com/RaRe-Technologies/gensim/pull/1686))
* Optimize `SlicedCorpus.__len__` (__[@horpto](https://github.com/horpto)__, [#1679](https://github.com/RaRe-Technologies/gensim/pull/1679))
* Make `word_vec` return an immutable vector. Fix #1651 (__[@CLearERR](https://github.com/CLearERR)__, [#1662](https://github.com/RaRe-Technologies/gensim/pull/1662))
* Drop Win x32 support & add rolling builds (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1652](https://github.com/RaRe-Technologies/gensim/pull/1652))
* Fix scoring function in Phrases. Fix #1533, #1635 (__[@michaelwsherman](https://github.com/michaelwsherman)__, [#1573](https://github.com/RaRe-Technologies/gensim/pull/1573))
* Add flake8 configuration to setup.cfg (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1636](https://github.com/RaRe-Technologies/gensim/pull/1636))
* Add `build_vocab_from_freq` to Word2Vec, speed up `scan_vocab` (__[@jodevak](https://github.com/jodevak)__, [#1599](https://github.com/RaRe-Technologies/gensim/pull/1599))
* Add `most_similar_to_given` method to KeyedVectors (__[@TheMathMajor](https://github.com/TheMathMajor)__, [#1582](https://github.com/RaRe-Technologies/gensim/pull/1582))
* Add `__getitem__` method to Sparse2Corpus to allow direct queries (__[@isamaru](https://github.com/isamaru)__, [#1621](https://github.com/RaRe-Technologies/gensim/pull/1621))
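The `word_vec` immutability change above can be illustrated with a minimal NumPy sketch; the `word_vec` helper and toy array below are hypothetical stand-ins, not gensim's actual code:

```python
import numpy as np

def word_vec(vectors, index):
    # return a read-only view so callers cannot mutate the stored vector in place
    vec = vectors[index].view()
    vec.setflags(write=False)
    return vec

vectors = np.ones((3, 4), dtype=np.float32)  # toy embedding matrix
vec = word_vec(vectors, 0)
try:
    vec[0] = 42.0
except ValueError:
    print("vector is read-only")
```

A read-only view costs nothing to create and protects the model's internal weights from accidental in-place edits.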

:red_circle: Bug fixes:
* Add single-core mode to CoherenceModel. Fix #1683 (__[@horpto](https://github.com/horpto)__, [#1685](https://github.com/RaRe-Technologies/gensim/pull/1685))
* Fix ResourceWarnings in tests. Partial fix for #1519 (__[@horpto](https://github.com/horpto)__, [#1660](https://github.com/RaRe-Technologies/gensim/pull/1660))
* Fix DeprecationWarnings generated by the deprecated assertEquals. Partial fix for #1519 (__[@poornagurram](https://github.com/poornagurram)__, [#1658](https://github.com/RaRe-Technologies/gensim/pull/1658))
* Fix DeprecationWarnings for regex string literals. Fix #1646 (__[@franklsf95](https://github.com/franklsf95)__, [#1649](https://github.com/RaRe-Technologies/gensim/pull/1649))
* Fix the pagerank algorithm. Fix #805 (__[@xelez](https://github.com/xelez)__, [#1653](https://github.com/RaRe-Technologies/gensim/pull/1653))
* Fix inconsistent FastText dtype. Fix #1637 (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1638](https://github.com/RaRe-Technologies/gensim/pull/1638))
* Fix the `test_filename_filtering` test (__[@nehaljwani](https://github.com/nehaljwani)__, [#1647](https://github.com/RaRe-Technologies/gensim/pull/1647))

:books: Tutorial and doc improvements:
* Fix code/docstring style (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1650](https://github.com/RaRe-Technologies/gensim/pull/1650))
* Update error message for supervised FastText. Fix #1498 (__[@ElSaico](https://github.com/ElSaico)__, [#1645](https://github.com/RaRe-Technologies/gensim/pull/1645))
* Add DOI badge to README. Fix #1610 (__[@dphov](https://github.com/dphov)__, [#1639](https://github.com/RaRe-Technologies/gensim/pull/1639))
* Remove duplicate annoy notebook. Fix #1415 (__[@Karamax](https://github.com/Karamax)__, [#1640](https://github.com/RaRe-Technologies/gensim/pull/1640))
* Fix duplication and wrong markup in docs (__[@horpto](https://github.com/horpto)__, [#1633](https://github.com/RaRe-Technologies/gensim/pull/1633))
* Refactor dendrogram & topic network notebooks (__[@parulsethi](https://github.com/parulsethi)__, [#1571](https://github.com/RaRe-Technologies/gensim/pull/1571))
* Fix release badge (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1631](https://github.com/RaRe-Technologies/gensim/pull/1631))

:warning: Deprecations (will come into force in the next major release):
* Remove:
- `gensim.examples`
- `gensim.nosy`
- `gensim.scripts.word2vec_standalone`
- `gensim.scripts.make_wiki_lemma`
- `gensim.scripts.make_wiki_online`
- `gensim.scripts.make_wiki_online_lemma`
- `gensim.scripts.make_wiki_online_nodebug`
- `gensim.scripts.make_wiki`

* Move:
- `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
- `gensim.summarization` ➡ `gensim.models.summarization`
- `gensim.topic_coherence` ➡ `gensim.models._coherence`
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`

We will also create an `experimental` subpackage for unstable models. Specific lists will be published in the next major release.
## 3.0.1, 2017-10-12

README.md

Lines changed: 4 additions & 2 deletions
```diff
@@ -1,14 +1,16 @@
 gensim – Topic Modelling in Python
 ==================================

-[![Build Status](https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop)](https://travis-ci.org/RaRe-Technologies/gensim)[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=2592000)]()[![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/gensim)
+[![Build Status](https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop)](https://travis-ci.org/RaRe-Technologies/gensim)
+[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases)
+[![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/gensim)
+[![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847)
 [![Mailing List](https://img.shields.io/badge/-Mailing%20List-lightgrey.svg)](https://groups.google.com/forum/#!forum/gensim)
 [![Gitter](https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg)](https://gitter.im/RaRe-Technologies/gensim)
 [![Follow](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/gensim_py)


-
 Gensim is a Python library for *topic modelling*, *document indexing*
 and *similarity retrieval* with large corpora. Target audience is the
 *natural language processing* (NLP) and *information retrieval* (IR)
```

appveyor.yml

Lines changed: 23 additions & 9 deletions
```diff
@@ -13,30 +13,44 @@ environment:
     secure: qXqY3dFmLOqvxa3Om2gQi/BjotTOK+EP2IPLolBNo0c61yDtNWxbmE4wH3up72Be

 matrix:
-  - PYTHON: "C:\\Python27"
-    PYTHON_VERSION: "2.7.12"
-    PYTHON_ARCH: "32"
+  # - PYTHON: "C:\\Python27"
+  #   PYTHON_VERSION: "2.7.12"
+  #   PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python27-x64"
     PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "64"

-  - PYTHON: "C:\\Python35"
-    PYTHON_VERSION: "3.5.2"
-    PYTHON_ARCH: "32"
+  # - PYTHON: "C:\\Python35"
+  #   PYTHON_VERSION: "3.5.2"
+  #   PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python35-x64"
     PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "64"

-  - PYTHON: "C:\\Python36"
-    PYTHON_VERSION: "3.6.0"
-    PYTHON_ARCH: "32"
+  # - PYTHON: "C:\\Python36"
+  #   PYTHON_VERSION: "3.6.0"
+  #   PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python36-x64"
     PYTHON_VERSION: "3.6.0"
     PYTHON_ARCH: "64"

+init:
+  - "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
+  - "ECHO \"%APPVEYOR_SCHEDULED_BUILD%\""
+  # If there is a newer build queued for the same PR, cancel this one.
+  # The AppVeyor 'rollout builds' option is supposed to serve the same
+  # purpose but it is problematic because it tends to cancel builds pushed
+  # directly to master instead of just PR builds (or the converse).
+  # credits: JuliaLang developers.
+  - ps: if ($env:APPVEYOR_PULL_REQUEST_NUMBER -and $env:APPVEYOR_BUILD_NUMBER -ne ((Invoke-RestMethod `
+        https://ci.appveyor.com/api/projects/$env:APPVEYOR_ACCOUNT_NAME/$env:APPVEYOR_PROJECT_SLUG/history?recordsNumber=50).builds | `
+        Where-Object pullRequestId -eq $env:APPVEYOR_PULL_REQUEST_NUMBER)[0].buildNumber) { `
+        Write-Host "There are newer queued builds for this pull request, skipping build."
+        Exit-AppveyorBuild
+      }

 install:
   # Install Python (from the official .msi of http://python.org) and pip when
```

continuous_integration/travis/flake8_diff.sh

Lines changed: 4 additions & 3 deletions
```diff
@@ -20,6 +20,7 @@ set -o pipefail

 PROJECT=RaRe-Technologies/gensim
 PROJECT_URL=https://github.com/${PROJECT}.git
+FLAKE_CONFIG_FILE=setup.cfg

 # Find the remote with the project name (upstream in most cases)
 REMOTE=$(git remote -v | grep ${PROJECT} | cut -f1 | head -1 || echo '')
@@ -133,14 +134,14 @@ check_files() {
     if [ -n "$files" ]; then
         # Conservative approach: diff without context (--unified=0) so that code
         # that was not changed does not create failures
-        git diff --unified=0 ${COMMIT_RANGE} -- ${files} | flake8 --diff --show-source ${options}
+        git diff --unified=0 ${COMMIT_RANGE} -- ${files} | flake8 --config ${FLAKE_CONFIG_FILE} --diff --show-source ${options}
     fi
 }

 if [[ "$MODIFIED_PY_FILES" == "no_match" ]]; then
     echo "No .py files has been modified"
 else
-    check_files "$(echo "$MODIFIED_PY_FILES" )" "--ignore=E501,E731,E12,W503"
+    check_files "$(echo "$MODIFIED_PY_FILES" )"
 fi
 echo -e "No problem detected by flake8\n"

@@ -150,7 +151,7 @@ else
     for fname in ${MODIFIED_IPYNB_FILES}
     do
         echo "File: $fname"
-        jupyter nbconvert --to script --stdout ${fname} | flake8 - --show-source --ignore=E501,E731,E12,W503,E402 --builtins=get_ipython || true
+        jupyter nbconvert --to script --stdout ${fname} | flake8 --config ${FLAKE_CONFIG_FILE} --show-source --builtins=get_ipython || true
     done
 fi
```

docs/notebooks/Topic_dendrogram.ipynb

Lines changed: 3305 additions & 3590 deletions
Large diffs are not rendered by default.
