II-Commons is a platform for collaboratively developing large, shared knowledge bases. It offers tools for distributed data handling, embedding computation, index creation, and information retrieval. Organizations and individuals can use it to create private or public knowledge resources.
For more details about our project, please visit our blog post.
This repository contains tools for managing text and image datasets, including loading, fetching, and embedding large datasets.
Datasets processed by these tools are suitable for model training, fine-tuning, RAG, MCP, and other applications.
- PostgreSQL for metadata and vector storage (PostgreSQL License)
- VectorChord for vector indexing (ELv2, AGPLv3)
- pg_search for BM25 indexing (AGPLv3)
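If you assemble your own PostgreSQL instance instead of restoring one of our backups, the components listed above surface as PostgreSQL extensions. A minimal sketch of enabling them, assuming the extensions are already installed in your PostgreSQL build (the prebuilt backups ship with them created); the database name is a placeholder:

```bash
# Sketch only: extension names match those referenced later in this README.
# Replace "your_database" with your target database.
psql -U postgres -d your_database -c "CREATE EXTENSION IF NOT EXISTS vchord CASCADE;"
psql -U postgres -d your_database -c "CREATE EXTENSION IF NOT EXISTS pg_search;"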
You can build your own dataset from scratch or quickly begin experimenting with our pre-prepared datasets.
This section shows how to restore our pre-computed database backup and run a vector similarity search instance.
We use `pg_basebackup` for comprehensive database backups. Refer to the official documentation for detailed information about `pg_basebackup`.
Download a database backup from Hugging Face: Wikipedia English or PD12M.
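If you prefer the command line, the backup can be fetched with `huggingface-cli` (a sketch only; the dataset repository id is whatever the links above point to, shown here as a placeholder):

```bash
# <org>/<dataset> is a placeholder for the backup repository linked above.
# The downloaded layout may differ; the steps below expect basebackup/basebackup.tar.part.* files.
huggingface-cli download <org>/<dataset> --repo-type dataset --local-dir .
```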
Untar all the tar files:

```bash
cat basebackup/basebackup.tar.part.* | tar -xvf -
```

Remove all the `tar.part` files, or move them to another directory as a backup:

```bash
rm -f basebackup/basebackup.tar.*
```

The `basebackup` directory will then look like this:

```
PG_VERSION
backup_label.old
backup_manifest
base
global
pg_commit_ts
pg_dynshmem
...
```
Use our Docker image to run a PostgreSQL node. For example, suppose the Wikipedia English download directory is `/data/wikipedia_en`.

Note: the default postgres password is `postgres.1234`; please change it!
```bash
sudo docker run --rm -it \
  --name postgres-localvector \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres.1234 \
  -e POSTGRES_DB=localvector \
  -e PGDATA=/var/lib/postgresql/data/basebackup \
  -v /data/wikipedia_en:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres-17-parade-vchord
```
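To check that the node came up and finished recovery, something like the following should work (assuming the image ships the standard PostgreSQL client tools):

```bash
# Prints "accepting connections" once the server is ready.
sudo docker exec postgres-localvector pg_isready -U postgres
```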
Use the `psql` command to connect to the PostgreSQL node, then switch to the `localvector` database:

```
postgres=# \c localvector
```
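Alternatively, connect to `localvector` directly from the host in one step (host, port, and credentials as configured in the `docker run` command above):

```bash
psql -h localhost -p 5432 -U postgres -d localvector
```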
Run the `\dx` command to make sure the `pg_search` and `vchord` extensions are available.
Set `probes` for VectorChord queries; you can try a higher value to balance query accuracy and performance:

```sql
ALTER SYSTEM SET vchordrq.probes = 100;
```

Then restart PostgreSQL. Congratulations, the database is ready to use.
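With the Docker setup above, the restart and a quick check of the new setting might look like this:

```bash
sudo docker restart postgres-localvector
# Confirm the setting took effect.
psql -h localhost -p 5432 -U postgres -d localvector -c "SHOW vchordrq.probes;"
```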
Next, try running the benchmark or the API server.
Note: warm the index to improve performance:

```sql
SELECT vchordrq_prewarm('ts_wikipedia_en_embed_vector_index');
```
```bash
$ git clone https://github.com/Intelligent-Internet/ii-commons.git
$ cd ii-commons
$ pip install -r requirements.txt
```
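Optionally, isolate the dependencies in a virtual environment first (a common Python setup, not something the repository requires):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```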
Create a `.env` file from `sample.env` and configure the necessary parameters.
Be sure to configure the PostgreSQL and S3 related environment variables; most features depend on them. The easiest way is to run our Docker image, or build your own.
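As a rough illustration only, a `.env` might contain entries along these lines; the variable names here are hypothetical, so copy the authoritative names from `sample.env`:

```bash
# Hypothetical names for illustration; see sample.env for the real ones.
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres.1234
POSTGRES_DB=localvector
S3_ENDPOINT=http://localhost:8333
S3_ACCESS_KEY=your_access_key
S3_SECRET_KEY=your_secret_key
S3_BUCKET=ii-commons
```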
We provide prebuilt datasets for your use. You can import, index and use them out of the box.
- 🤗 Wikipedia English
- 🤗 PD12M
Skip the preparation steps and go to the Query section if you want to use these prebuilt versions.
More prebuilt datasets are under construction and will be released soon.
Evaluation: NDCG@10 on TREC-DL 2019 with the MS MARCO v1.1 dataset. Each query retrieves 30 results (hybrid search combines 30 embedding results and 30 BM25 results), ranked either by similarity score or by a reranker model.
| Approach | Similarity | ms-marco-MiniLM-L12-v2 | bge-reranker-v2-m3 |
|---|---|---|---|
| BM25 (pg_search) | 0.302 | 0.418 | 0.415 |
| Embedding (VectorChord) | 0.661 | 0.712 | 0.700 |
| emb * 0.8 + bm25 * 0.2 | 0.598 | 0.723 | 0.726 |
| emb * 1.0 + bm25 * 0 | 0.661 | 0.733 | 0.723 |
Running 500 random queries on Google Cloud e2-standard-2 (2 vCPU, 1 core, 8 GB memory):
- Database directory on a 100 GB SSD persistent disk: average 0.13 s/query (cost ~US$67/month)
- Database directory on a 100 GB balanced persistent disk: average 0.32 s/query (cost ~US$60/month)
II-Commons supports multiple image datasets, for example PD12M, CC12M, cc12m-cleaned, and so on. It also supports custom datasets in Parquet, JSONL, or CSV format. In this demonstration, we will use a sample mini dataset consisting of the first 100,000 entries of PD12M for the sake of speed.
First, the dataset metadata must be loaded into the database:

```bash
$ python . -w load -d pd12m -p ./meta/PD12M/metadata
```
Then we need to fetch the raw data items and save them to object storage. S3 and S3-compatible object storage services are supported. For local deployments, SeaweedFS is recommended.

```bash
$ python . -w fetch -d pd12m
```
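For a local test, a single-node SeaweedFS with its S3 gateway can be started roughly like this (flags from the SeaweedFS quick start; the data directory and port are just examples):

```bash
# Starts a master, volume server, filer, and S3 gateway (port 8333).
weed server -dir=/data/seaweedfs -filer -s3 -s3.port=8333
```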
After the data items are fetched, we can embed the images.
We use google/siglip2-so400m-patch16-naflex as the default image embedding model.

```bash
$ python . -w embed_image -d pd12m
```
You can run the above command multiple times in parallel to speed up the embedding process, either on a single machine or across a distributed environment; see the sketch below. II-Commons automatically divides the dataset into multiple parts and embeds them in parallel. Workers can also come and go dynamically: II-Commons manages the workers and the dataset parts for you, so you don't need to worry about coordination.
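For example, to run several embedding workers on one machine (the worker count here is arbitrary):

```bash
# Each worker claims its own dataset parts; logs go to separate files.
for i in 1 2 3 4; do
  nohup python . -w embed_image -d pd12m > embed_image_$i.log 2>&1 &
done
```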
II-Commons is designed to support text-based datasets such as Wikipedia, arXiv, and so on. We will use the Wikipedia English dataset for demonstration. Full support for arXiv is coming soon.
Navigate to the Wikipedia dump directory. Download the dump file `pages-articles-multistream` in `xml.bz2` format, for example enwiki-20250501-pages-articles-multistream.xml.bz2, and extract the `xml` file from the `bz2` archive.
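For reference, downloading and extracting that dump might look like the following (the URL follows the standard dumps.wikimedia.org layout; older snapshots are rotated out, so pick a currently available date):

```bash
wget https://dumps.wikimedia.org/enwiki/20250501/enwiki-20250501-pages-articles-multistream.xml.bz2
# -d decompresses, -k keeps the original .bz2 archive.
bzip2 -dk enwiki-20250501-pages-articles-multistream.xml.bz2
```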
You can use the sample mini dataset for testing; if so, jump to the Load the Dataset to Database section.
The best way to extract pages from the raw dataset is to use the wikiextractor tool.
Be sure to apply this patch to the wikiextractor tool to fix this issue before extracting pages.

```bash
$ wikiextractor enwiki-20250501-pages-articles-multistream.xml --json --no-templates -o /path/to/wikipedia_en
```
Extract pages with links if you need them:

```bash
$ wikiextractor enwiki-20250501-pages-articles-multistream.xml --json --no-templates --links -o /path/to/wikipedia_en
```
This step analyzes all the pages extracted from the raw dataset, uploads them to object storage, and saves the metadata to the database:

```bash
$ python . -w load -d wikipedia_en -p ./meta/wikipedia_en
```
This step splits the pages into chunks of a certain size, saves the chunks to the chunking database, and embeds them.
We use Snowflake/snowflake-arctic-embed-m-v2.0 as the default text embedding model.

```bash
$ python . -w embed_text -d wikipedia_en
```
You can run the above command multiple times in parallel to speed up the embedding process, either on a single machine or across a distributed environment. II-Commons automatically divides the dataset into multiple parts and processes them in parallel. Workers can also come and go dynamically: II-Commons manages the workers and the dataset parts for you, so you don't need to worry about coordination.
```bash
$ python . -q [TOPIC]
```
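For example, with an arbitrary topic string:

```bash
$ python . -q "large language models"
```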
```bash
$ docker build -t ii-commons .
$ docker run --rm --gpus all -v ./.env:/app/.env ii-commons
```
Check out the documentation for API/MCP services and more details.
- Simplify installation and operation.
- Offer more pre-computed indexes for modalities like PDFs, video and audio.
- Create more AI-assisted generated knowledge bases for public good.
- Provide API services for datasets.
- Establish a knowledge base hub for easier sharing and downloading.
- Develop a desktop version for personal everyday data retrieval.