The purpose of this repo is to perform a yearly survey of major machine learning conferences and arXiv. It extracts the metadata, abstracts, and other information from all the papers and looks at how frequently topics show up.
Current product version is 0.3.2. __main__.py in the TUI folder is operational. Current working search models are Fuzzy, Cosine, Word2vec, Marco, and Specter.
The two search parameters that work best are title and abstract, as those have the fewest missing values. (Scraped data isn't always perfect.)
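As a rough way to see which fields are safe to search on, the sketch below counts missing values per field. It assumes each scraped file is a JSON list of paper records (dicts); the field names and sample records are hypothetical, and the actual schema in this repo may differ.

```python
import json
from collections import Counter


def count_missing(papers, fields):
    """Count records where each field is absent, None, or blank."""
    missing = Counter({field: 0 for field in fields})
    for paper in papers:
        for field in fields:
            value = paper.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return dict(missing)


# Hypothetical records standing in for a scraped conference file.
sample = json.loads("""[
  {"title": "Paper A", "abstract": "First abstract", "keywords": null},
  {"title": "Paper B", "abstract": "", "keywords": "optimization"},
  {"title": "Paper C", "abstract": "Third abstract"}
]""")

print(count_missing(sample, ["title", "abstract", "keywords"]))
# {'title': 0, 'abstract': 1, 'keywords': 2}
```

Fields with low counts are the better candidates for search parameters.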
- Python >= 3.11
- numpy
- pandas
- rich
- textual
- requests
- matplotlib
- spacy
- scikit-learn
- beautifulsoup4
- pyzotero (eventually)
In VSCode, press CTRL + SHIFT + ~ to open a terminal.
Navigate to the directory where you want to clone the repo.
Launch VSCode if that is your IDE of choice.
In your terminal, navigate to your root folder.
If Poetry is not installed, install it before continuing.
On Windows
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
On Linux/Mac
curl -sSL https://install.python-poetry.org | python3 -
To check whether Poetry is installed on your system, type the following into your terminal
poetry -V
If you see a version returned, you have Poetry installed. If it is already installed, updating it is always a good idea. If not, follow this link and run the installation commands for your system's requirements. If on Windows, we recommend the PowerShell option for the easiest installation. Using pip to install Poetry will lead to problems down the road, and we do not recommend that option. Poetry needs to be installed separately from your standard Python installation so it can manage your many Python installations. Note: Python 2.7 is not supported. You are more than welcome to go the pip route, but I can't guarantee your dependencies won't clash.
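If you would rather check from Python than from the terminal, the following is a minimal sketch. The helper name poetry_version is made up for illustration; it only looks for a poetry executable on your PATH and asks it for its version.

```python
import shutil
import subprocess


def poetry_version():
    """Return the output of `poetry --version`, or None if Poetry is not on PATH."""
    exe = shutil.which("poetry")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


version = poetry_version()
print(version if version else "Poetry not found on PATH")
```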
Some prefer Poetry's default method of storing all environments in one location on your system. By default, environments are nested under {cache_dir}/virtualenvs.
If you want to store your virtual environment locally, set the global configuration flag below once Poetry is installed. Poetry will then look for an environment in the project root folder before trying any global environments in the cache.
poetry config virtualenvs.in-project true
For general instructions on Poetry's functionality and commands, please read through Poetry's CLI documentation.
To create a new venv
python -m venv .venv
or
This command tells Poetry to create an environment with Python 3.12 (if one does not exist yet) and use it for this project
poetry env use python3.12
To activate the venv manually
On Windows
.venv\Scripts\activate
Mac/Linux
source .venv/bin/activate
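While in the root directory, run the commands below. The -p flag creates any missing parent folders (such as data and data/models) and is available on Mac/Linux.
mkdir -p data/logs data/logs/scrape data/logs/tui searches
mkdir -p data/searches data/models/marco data/models/specter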
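If the mkdir commands do not behave the same way in your shell (PowerShell, for example, handles multiple arguments differently), a cross-platform alternative is to create the folders from Python. This is a sketch, not a script shipped with the repo; the folder names are copied from the commands above and the function name make_folders is made up for illustration.

```python
import tempfile
from pathlib import Path

# Folder names copied from the mkdir commands in this README.
FOLDERS = [
    "data/logs", "data/logs/scrape", "data/logs/tui", "searches",
    "data/searches", "data/models/marco", "data/models/specter",
]


def make_folders(root="."):
    """Create each folder under root, including any missing parents."""
    created = []
    for name in FOLDERS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(path)
    return created


# Demo in a throwaway directory so nothing is written to your project.
with tempfile.TemporaryDirectory() as tmp:
    print(len(make_folders(tmp)), "folders created")
```

From the repo root, calling make_folders() with no arguments creates the folders in place.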
To use your GPU, or not to use your GPU. That is the question. If you're lucky enough to have a workhorse GPU on your rig, you might be inclined to use it when selecting the "Marco" and "Specter" models. To do so requires... a few extra annoying steps. Hopefully you bought into the NVIDIA hype and have one of their GPUs, as most of PyTorch's implementations are based on the NVIDIA CUDA drivers.
First order of business is to see which CUDA version your current NVIDIA driver supports.
nvidia-smi
After running the above, look at the top right for CUDA Version: xx.x
This will be the maximum CUDA version you can use with your current installation. If you want to install PyTorch, you'll need to install a CUDA toolkit that is at or BELOW that max version. If you go over it... well that's on you.
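The version check can also be scripted. The sketch below pulls the CUDA version out of nvidia-smi header text and picks the newest toolkit that does not exceed it. The function names are made up for illustration, the header string is a simplified stand-in for real nvidia-smi output, and the version numbers are example values only.

```python
import re


def max_cuda_version(smi_output):
    """Pull the 'CUDA Version: xx.x' value out of nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_toolkit(max_version, available):
    """Return the newest available toolkit that does not exceed max_version."""
    def as_tuple(version):
        major, minor = version.split(".")
        return int(major), int(minor)

    usable = [v for v in available if as_tuple(v) <= max_version]
    return max(usable, key=as_tuple) if usable else None


# Simplified stand-in for the nvidia-smi header; real output has more around it.
header = "Driver Version: 000.00    CUDA Version: 12.5"
limit = max_cuda_version(header)
print(limit)                                  # (12, 5)
print(pick_toolkit(limit, ["11.8", "12.6"]))  # 11.8
```

To use it on a real machine, pass in the text captured from running nvidia-smi, for example via subprocess.run with capture_output=True.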
Now you'll need to head over to PyTorch's Get Started page.
Go through the selections and see which align with your system. My only options were 11.8 or 12.6. Since the max CUDA version my NVIDIA driver supports is 12.5, 11.8 it is! Because Poetry is a bit extra, we'll have to add the source for whichever CUDA version fits below your GPU's current NVIDIA driver.
poetry source add --priority=explicit pytorch-cuda "https://download.pytorch.org/whl/cu118"
After the source is added, you should see something like this in your pyproject.toml file.
[[tool.poetry.source]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu118"
priority = "explicit"
Now you can install the specific versions of what you'll need to run SBert models on your GPU. In my case, these were the available versions from the 11.8 CUDA Toolkit.
poetry add torch==2.7.0+cu118 torchaudio==2.7.0+cu118 torchvision==0.22.0+cu118 --source pytorch-cuda
poetry add sentence-transformers
Before you run the commands below, go into the pyproject.toml file and delete lines 23-25 and 34-44. Then update the lock file (first) and install the libraries with the following
poetry lock
poetry install --no-root
This will read from the pyproject.toml file that is included in this repo and install all necessary package versions. Should other versions be needed, the pyproject.toml file will be used and packages updated according to your system requirements. To view the current libraries installed
poetry show
To view only top level library requirements
poetry show -T
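If you went the GPU route, a quick check after installing is to ask PyTorch whether it can see CUDA. This is a minimal sketch (the function name describe_device is made up) and it degrades gracefully when torch is not installed.

```python
def describe_device():
    """Report which device PyTorch would use, without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"cuda ({name}, built for CUDA {torch.version.cuda})"
    return "cpu (CUDA is not available to this torch build)"


print(describe_device())
```

Run it inside the project environment, for example with poetry run python. A cpu result after a GPU install usually points to a mismatch between the installed torch build and the driver.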
If you'd like to use word2vec to do your asymmetric semantic search, you'll need to do a few things before starting. With your environment activated, type the following in your terminal. This should install the model in your activated environment. You can check by looking for something like en_core_web_md-3.8.0.... in your .venv/Lib/site-packages folder.
python -m spacy download en_core_web_md
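Instead of browsing site-packages, you can check for the model from Python. The sketch below is not part of the repo; model_installed is a made-up helper, and the two example phrases are arbitrary. It only loads spaCy when the model package is importable.

```python
import importlib.util


def model_installed(name="en_core_web_md"):
    """True if the spaCy model package can be imported from this environment."""
    return importlib.util.find_spec(name) is not None


if model_installed():
    import spacy

    nlp = spacy.load("en_core_web_md")
    # spaCy averages token vectors into a document vector and compares by cosine.
    score = nlp("graph neural networks").similarity(nlp("message passing on graphs"))
    print(f"similarity: {score:.2f}")
else:
    print("en_core_web_md not found; run the download command first")
```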
This repo also comes with a TUI (Terminal User Interface) that allows you to explore the JSON objects for each conference / year. This repo was forked from here. Thank you to oleksis for creating the initial structure!! 🎉
To run the TUI with poetry
poetry run python tui/__main__.py data/scraped/2024_ICML.json
#replace year/conf
With python
python tui/__main__.py data/scraped/2024_ICML.json
#replace year/conf
With no file args, like a madman. This will launch a file picking application that scans the data/conferences folder and shows you a list of available files. Enter the number of the conference you want, and you're good to go.
poetry run python tui/__main__.py
python tui/__main__.py
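The file picker idea can be sketched in a few lines. This is an illustration of the approach, not the code in tui/__main__.py; the function names are made up, and the demo uses a temporary folder with placeholder files rather than your real data folder.

```python
import tempfile
from pathlib import Path


def list_json_files(folder):
    """Return the .json files in folder, sorted by name."""
    return sorted(Path(folder).glob("*.json"))


def show_menu(files):
    """Print a numbered list, starting at 1."""
    for number, path in enumerate(files, start=1):
        print(f"{number}. {path.name}")


def pick_file(files, choice):
    """Map a 1-based menu number to a file, or None if it is out of range."""
    if 1 <= choice <= len(files):
        return files[choice - 1]
    return None


# Demo with placeholder files in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2024_ICML.json", "2023_ICML.json", "notes.txt"]:
        (Path(tmp) / name).write_text("[]")
    files = list_json_files(tmp)
    show_menu(files)
    print(pick_file(files, 2).name)  # 2024_ICML.json
```

Returning None for an out-of-range number leaves it to the caller to decide whether to prompt again.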
- Search with word2vec takes longer to run. Patience, Iago.
- Fuzzy search on abstract will take even longer
Suggested operation ranges
- Fuzzy => 1 to 10
  - Best results around 5
- Cosine => -1 to 1
  - Best results around 0.40
- Word2vec => -1 to 1
  - Best results around 0.85
- Marco => -1 to 1
  - Best results around 0.85
- Specter => -1 to 1
  - Best results around 0.85
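To show how a threshold filters results, here is a small cosine search sketch. It is not the repo's implementation: it uses raw word counts in pure Python (so scores fall between 0 and 1 rather than -1 and 1), and the titles are hypothetical. Only the 0.40 cutoff comes from the list above.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, titles, threshold=0.40):
    """Return (score, title) pairs at or above the threshold, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(title)), title) for title in titles]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical titles for illustration.
titles = [
    "Graph neural networks for molecules",
    "Diffusion models for image synthesis",
    "Scaling graph transformers",
]
for score, title in search("graph neural networks", titles):
    print(f"{score:.2f} {title}")
# 0.77 Graph neural networks for molecules
```

The third title shares only one word with the query and scores about 0.33, so the 0.40 cutoff drops it. Lowering the threshold widens the result set at the cost of weaker matches.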
With the TUI running, it should look something like this.