The app is currently hosted here: https://true-real-michael.github.io/tg-message-search
- Export a Telegram chat in JSON format
- Upload the `result.json` file (the website runs in the browser and no data is saved)
- Search and browse threads and messages
- Install Trunk and add the wasm32 target:

      cargo install trunk
      rustup target add wasm32-unknown-unknown

- Download the `lemmatization-ru.tsv.gz` file from releases and place it under the `/data` directory. Alternatively, download the morphological dictionary from OpenCorpora's website, place it under `/data`, run the `scripts/preprocess_opcorpora.py` script, and gzip the result.
- Run the project:

      trunk serve --port 3000 --release

- The project will be available at localhost:3000/tg-message-search
If you want to use this project for a different language, you should replace the lemmatization dictionary with the one for your language.
If you want more complex lemmatization, stemming, or embedding logic, take a look at the `Lemmatizer` struct in `src/analysis/lemmatizer.rs` and modify it accordingly.
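To give a rough idea of what a replacement could look like, here is a minimal sketch of a dictionary-backed lemmatizer. It is not the project's actual `Lemmatizer`: the `DictLemmatizer` name, its methods, and the `form<TAB>lemma` line layout are all assumptions made for illustration, and the real dictionary file may be laid out differently.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary-backed lemmatizer (illustration only,
/// not the struct from src/analysis/lemmatizer.rs).
pub struct DictLemmatizer {
    /// Maps a lowercased word form to its lowercased lemma.
    forms: HashMap<String, String>,
}

impl DictLemmatizer {
    /// Builds the lookup table from TSV text, assuming one
    /// `form<TAB>lemma` pair per line. Lines without a tab are skipped.
    pub fn from_tsv(tsv: &str) -> Self {
        let forms = tsv
            .lines()
            .filter_map(|line| line.split_once('\t'))
            .map(|(form, lemma)| (form.trim().to_lowercase(), lemma.trim().to_lowercase()))
            .collect();
        Self { forms }
    }

    /// Returns the lemma for `word`, falling back to the lowercased
    /// word itself when the dictionary has no entry for it.
    pub fn lemmatize(&self, word: &str) -> String {
        let key = word.to_lowercase();
        self.forms.get(&key).cloned().unwrap_or(key)
    }
}
```

With a table built by `DictLemmatizer::from_tsv("cats\tcat\nran\trun\n")`, calling `lemmatize("Cats")` looks up the lowercased form and returns its lemma, while an unknown word is returned lowercased and otherwise unchanged. Swapping in another language would then mostly be a matter of supplying a different table.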
- The Telegram native message search was not convenient for me, especially:
  - When I wanted to search for synonyms.
  - When I wanted to search for combinations of words.
  - When the info is scattered across multiple messages that form a reply chain.
  - When there are many results and it is inconvenient to scroll through them in a tiny search results bar.
- Why WASM?
  - To maintain privacy by keeping all data client-side.
  - To avoid round-trips for queries and data upload.
  - I didn't want to spend money on a backend.
  - Because it is a cool technology and I wanted to try it out.
- Why no embedded db?
  - Because I wanted bespoke lemmatization logic.
  - I also wanted to keep the app lightweight and minimalistic.
- Why Leptos?
  - No reason at all, just wanted to try it out.
  - This project used to use `wasm-bindgen` + vanilla JS + HTML, but I tried doing reactive UI with Leptos and it worked well.
  - Language unification was a nice bonus.
- Why dictionary-based lemmatization?
  - I initially considered using word embeddings, but I could not find a suitable model for Russian.
  - The dictionary gets the job done and does not take too much space (arguably): ≈9 MB compressed, ≈300 MB uncompressed.
- How to use WASM in a web application.
- How to use Leptos for building a reactive UI.
- Refreshed memories on parsing.
- Web Workers for background initialization. Currently, initialization blocks the main thread.
- Revise the code because it contains a lot of clones and unwraps.
All the code is licensed under the MIT License.
The file `lemmatization-ru.tsv.gz` in this repository's GitHub releases is a derivative of OpenCorpora's Russian morphological dictionary and is licensed under Creative Commons Attribution-ShareAlike 3.0.