
Commit 0ef66aa

Commit message: refine ridmi
Parent commit: 19ed31c


README.md: 117 additions & 56 deletions
</a>
</div>

A high-level interface developed by [Machine Intelligence Laboratory](https://mipt.ai/en) for the [BigARTM](https://github.com/bigartm/bigartm) library.

## What is TopicNet?

The `TopicNet` library was created to assist in the task of building topic models.
It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.

Consider using TopicNet if:

* you want to build a good topic model quickly (out-of-box, with default parameters).
* you have an ARTM model at hand and you want to explore its topics.

`TopicNet` provides an infrastructure for your prototyping with the help of the `Experiment` class and helps to observe the results of your actions via the `viewers` module.

<p>
<div align="center">
</p>


## How to start

Define a `TopicModel` from an ARTM model at hand, or with help from the `model_constructor` module, where you can set the model's main parameters. Then create an `Experiment`, assigning a root position to this model and a path to store your experiment. Further, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.

You can read the documentation [here](https://machine-intelligence-laboratory.github.io/TopicNet/).
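
In short, a typical workflow looks like the outline below (a sketch only; the names follow the Usage section further down, where every step is shown in full):

```python
# A sketch of the typical TopicNet workflow (see Usage below for details):
#
# data = Dataset('path/to/prepared/data.csv')            # 1. wrap the data
# model_artm = init_simple_default_model(...)            # 2. build an initial ARTM model
# topic_model = TopicModel(model_artm, ...)              # 3. wrap it into a TopicModel
# experiment = Experiment(..., topic_model=topic_model)  # 4. start an experiment
# RegularizersModifierCube(...)(topic_model, data)       # 5. train via a cube
# best_model = experiment.select(...)                    # 6. pick the best model
```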


# Installation

**Core library functionality is based on the BigARTM library**, so BigARTM should also be installed on the machine.
Fortunately, the installation process should not be too difficult now.
Below are the detailed explanations.


## Via pip

The easiest way to install everything is via `pip` (but currently it works fine only for Linux users!):

```bash
pip install topicnet
```

The command above also installs the BigARTM library, not only TopicNet.

If working on Windows or Mac, you should install BigARTM by yourself first; after that, `pip install topicnet` will work just fine.
We are hoping to bring all-in-`pip` installation support to these systems as well.
However, right now you may find the following guide useful.

To avoid installing BigARTM yourself, you can use [docker images](https://hub.docker.com/r/xtonev/bigartm/tags) with different preinstalled versions of the BigARTM library:

```bash
docker pull xtonev/bigartm:v0.10.0
docker run -t -i xtonev/bigartm:v0.10.0
```

Checking if everything installed successfully (start `python3` inside the container and run):

```python
import artm

artm.version()
```

Alternatively, you can follow the [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
After setting up the environment, you can fork this repository or use `pip install topicnet` to install the library.


## From source

One can also install the library from GitHub, which may give more flexibility in development (for example, making one's own viewers or regularizers a part of the module as .py files):

```bash
git clone https://github.com/machine-intelligence-laboratory/TopicNet.git
cd TopicNet
pip install .
```
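
If you plan to modify the library code, an editable install (`pip install -e .` instead of `pip install .`) may be more convenient: changes to the sources are then picked up without reinstalling the package.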


# Usage

Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them.
Where should you start?

## Data Preparation

Every ML problem starts with a data preprocessing step.
TopicNet does not perform data preprocessing itself.
Instead, it expects the data to be prepared by the user and loaded via the [Dataset](topicnet/cooking_machine/dataset.py) class.
Here is a basic example of how one can achieve that: [rtl_wiki_preprocessing](topicnet/demos/RTL-WIKI-PREPROCESSING.ipynb).
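
For a rough idea of what the prepared data looks like, here is a minimal sketch (the column names `id`, `raw_text`, and `vw_text` follow the preprocessing demo above, which remains the authoritative reference):

```python
import pandas as pd

# Each document gets an id, its raw text, and a Vowpal Wabbit-style
# representation of its tokens (modalities are marked with '@'):
documents = pd.DataFrame({
    'id': ['doc_1', 'doc_2'],
    'raw_text': ['first document text', 'second document text'],
    'vw_text': [
        'doc_1 |@lemmatized first document text',
        'doc_2 |@lemmatized second document text',
    ],
})
documents.to_csv('/Wiki_raw_set/wiki_data.csv', index=False)
```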

## Training topic model

Here we can finally get to the main part: making your own, best of them all, manually crafted Topic Model.

### Get your data

We need to load the data prepared previously with Dataset:

```python
data = Dataset('/Wiki_raw_set/wiki_data.csv')
```

### Make initial model

In case you want to start from a fresh model, we suggest you use this code:

```python
from topicnet.cooking_machine.model_constructor import init_simple_default_model

model_artm = init_simple_default_model(
    # ... (the dataset and topic arguments are not shown in this diff)
    background_topics=1,
)
```

Note that here we have a model with two modalities: `'@lemmatized'` and `'@bigram'`.
Further, if needed, one can define a custom score to be calculated during the model training:

```python
import numpy as np

from topicnet.cooking_machine.models.base_score import BaseScore


class CustomScore(BaseScore):
    # ... (the beginning of the class is not shown in this diff)

    def call(self, model,
             eps=1e-5,
             n_specific_topics=14):

        phi = model.get_phi().values[:, :n_specific_topics]
        # share of (almost) zero elements in the specific-topics part of Phi:
        specific_sparsity = np.sum(phi < eps) / np.sum(phi < 1)

        return specific_sparsity
```

Now, a `TopicModel` with the custom score can be defined:

```python
from topicnet.cooking_machine.models.topic_model import TopicModel

custom_scores = {'SpecificSparsity': CustomScore()}
topic_model = TopicModel(model_artm, model_id='Groot', custom_scores=custom_scores)
```
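
The score values are computed during training along with the built-in ones; afterwards one could inspect them, for instance, like this (a sketch, assuming the `scores` attribute of `TopicModel`):

```python
# One value is recorded per training iteration:
print(topic_model.scores['SpecificSparsity'])
```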

### Define experiment

For further model training and tuning, an `Experiment` is necessary:

```python
from topicnet.cooking_machine.experiment import Experiment

experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model)
```

### Toy with the cubes

Defining the next stage of the model training, to select a decorrelator parameter:

```python
from topicnet.cooking_machine.cubes import RegularizersModifierCube

my_first_cube = RegularizersModifierCube(
    # ... (the cube and regularizer arguments are partly not shown in this diff)
    regularizer_parameters={
        'tau_grid': [0, 1, 2, 3, 4, 5],
    },
    reg_search='grid',
    verbose=True,
)

my_first_cube(topic_model, data)
```
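
Applying the cube trains one candidate model for every value in `tau_grid`; the resulting models are attached to the experiment tree, which is what makes the selection step below possible.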

Selecting a model with the best perplexity score:

```python
perplexity_criterion = 'PerplexityScore@lemmatized -> min COLLECT 1'
best_model = experiment.select(perplexity_criterion)
```

### View the results

Browsing the model is easy: create a viewer and call its `view()` method:

```python
from IPython.display import HTML

from topicnet.viewers.top_tokens_viewer import TopTokensViewer

threshold = 1e-5
viewer = TopTokensViewer(best_model, num_top_tokens=10, method='phi')
html_view = viewer.to_html(viewer.view(), thresh=threshold)

HTML(html_view)
```

# FAQ

## In the example, modalities are written like **@modality**. Is it the Vowpal Wabbit format?

It is a convention, which TopicNet takes from BigARTM, to designate modalities in the data with the @ sign.
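
So the token streams for different modalities are simply marked with `@`-prefixed names inside the Vowpal Wabbit file. A hypothetical line for the two modalities used above might look like this (a sketch; see the preprocessing demo for the real format):

```python
# Document id first, then one '|@modality' section per modality:
vw_line = 'doc_1 |@lemmatized good topic model |@bigram good_topic topic_model'
```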

## CubeCreator helps to perform a grid search over initial model parameters. How can I do it with modalities?

The modality search space can be defined using standard library logic like:

```python
class_ids_cube = CubeCreator(
    num_iter=5,
    parameters=[
        {
            'name': 'class_ids',
            'values': {
                '@text': [1, 2, 3],
                '@ngrams': [4, 5, 6],
            },
        },
    ],
    reg_search='grid',
    verbose=True,
)
```

However, for the case of modalities, a couple of slightly more convenient methods are available:

```python
parameters=[
    {
        'name': 'class_ids@text',
        # ... (the rest of the snippet is cut off in this diff)
```
