II-Commons

II-Commons is a platform for collaboratively developing large, shared knowledge bases. It offers tools for distributed data handling, embedding computation, index creation, and information retrieval. Organizations and individuals can use it to create private or public knowledge resources.

For more details about our project, please visit our blog post.

Features

This repository II-Commons contains tools for managing text and image datasets, including loading, fetching, and embedding large datasets.

The datasets processed by these tools are suitable for model training, fine-tuning, RAG, MCP, and other applications.

Requirements

Quick start

You can build your own dataset from scratch or quickly begin experimenting with our pre-prepared datasets.

This section shows how to restore one of our pre-computed database backups and run a vector similarity search instance.

We utilize pg_basebackup for comprehensive database backups. Refer to the official documentation for detailed information about pg_basebackup.

Download a database backup from Hugging Face: Wikipedia English or PD12M.
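
For example, with the huggingface_hub CLI the download might look like this; the repository id and target directory below are placeholders, use the actual dataset linked above:

$ pip install -U huggingface_hub
$ huggingface-cli download <org>/<backup-dataset> --repo-type dataset --local-dir /data/wikipedia_en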

Untar all tar files:

cat basebackup/basebackup.tar.part.* | tar -xvf -

Remove all tar.part files, or move them to another directory as a backup:

rm -f basebackup/basebackup.tar.*

The basebackup directory will then look like this:

PG_VERSION
backup_label.old
backup_manifest
base
global
pg_commit_ts
pg_dynshmem
...

Use our Docker image to run a PostgreSQL node. In this example, the Wikipedia English backup has been downloaded to /data/wikipedia_en.

Note

The default postgres password is postgres.1234; please change it!

sudo docker run --rm -it \
  --name postgres-localvector \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres.1234 \
  -e POSTGRES_DB=localvector \
  -e PGDATA=/var/lib/postgresql/data/basebackup \
  -v /data/wikipedia_en:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres-17-parade-vchord

Use the psql command to connect to the PostgreSQL node, then switch to the localvector database:

postgres=# \c localvector

Run the \dx command to make sure the pg_search and vchord extensions are available.
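
This is also a good time to replace the default password noted above; a minimal sketch, run inside the same psql session (choose your own strong password):

localvector=# ALTER USER postgres WITH PASSWORD 'your-strong-password';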

Set the number of probes for VectorChord queries; you can try a higher value to balance query quality and performance.

ALTER SYSTEM SET vchordrq.probes = 100;

Then restart PostgreSQL. Congratulations, the database is ready to use.
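
If you are using the Docker example above (started with --rm), stop the container, start it again with the same docker run command, and then verify the setting; a sketch:

$ sudo docker stop postgres-localvector
$ # ... start it again with the docker run command above ...
$ psql -h 127.0.0.1 -p 5432 -U postgres -d localvector -c "SHOW vchordrq.probes;"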

As a next step, try running the benchmark or the API server.

Note

Warm the index to improve performance:

SELECT vchordrq_prewarm('ts_wikipedia_en_embed_vector_index');

Installation

$ git clone https://github.com/Intelligent-Internet/ii-commons.git
$ cd ii-commons
$ pip install -r requirements.txt

Configuration

Create a .env file from sample.env and configure the necessary parameters.

Be sure to configure the PostgreSQL and S3 related environment variables; most features depend on them. The easiest way is to run our Docker image, or build your own.
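
A minimal sketch (the exact variable names are listed in sample.env; fill in the values for your PostgreSQL and S3 endpoints):

$ cp sample.env .env
$ $EDITOR .env    # set the PostgreSQL and S3 related variables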

Prebuilt datasets

We provide prebuilt datasets for your use. You can import, index and use them out of the box.

Skip the preparation steps and go to the Query section if you want to use these prebuilt versions.

More prebuilt datasets are under construction and will be released soon.

Evaluation

NDCG@10 evaluated on TREC-DL 2019 with the MS MARCO v1.1 dataset. 30 results are retrieved per query (hybrid search combines 30 embedding results and 30 BM25 results), then ranked by similarity sorting or by a reranker model.

| Approach | Similarity | Ms-marco-MiniLM-L12-v2 [1] | Bge-reranker-v2-m3 |
| --- | --- | --- | --- |
| BM25 (pg_search) | 0.302 | 0.418 | 0.415 |
| embedding (VectorChord) | 0.661 | 0.712 | 0.700 |
| emb * 0.8 + bm25 * 0.2 | 0.598 | 0.723 | 0.726 |
| emb * 1.0 + bm25 * 0 [2] | 0.661 | 0.733 | 0.723 |

Benchmark and Cost

500 random queries were run on:

Google Cloud, e2-standard-2 (2 vCPU, 1 core, 8 GB memory)

  • database dir on 100 GB SSD persistent disk: average 0.13 s/query (cost ~US$67/month)
  • database dir on 100 GB balanced persistent disk: average 0.32 s/query (cost ~US$60/month)

Prepare an Image Dataset

II-Commons supports multiple image datasets, such as PD12M, CC12M, and cc12m-cleaned. It also supports custom datasets in parquet, jsonl, or csv format. For speed, this demonstration uses a sample mini dataset consisting of the first 100,000 entries of PD12M.

1. Load Metadata to Database

First, the dataset metadata must be loaded into the database.

$ python . -w load -d pd12m -p ./meta/PD12M/metadata

2. Fetch Data from Source

Next, we need to fetch the raw data items and save them to object storage. S3 and S3-compatible object storage services are supported; for local deployments, SeaweedFS is recommended (see the sketch after the command below).

$ python . -w fetch -d pd12m
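
For a quick local S3-compatible store, SeaweedFS can serve its S3 gateway from a single process; a minimal sketch, assuming the weed binary is installed (the gateway listens on port 8333 by default):

$ weed server -dir=/data/seaweedfs -s3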

3. Embed Images in a Dataset

After the data items are fetched, we can embed the images.

We use google/siglip2-so400m-patch16-naflex as the default image embedding model.

$ python . -w embed_image -d pd12m

You can run the above command multiple times in parallel, on a single machine or in a distributed environment, to speed up the embedding process. II-Commons automatically divides the dataset into parts and embeds them in parallel. Workers can also join and leave dynamically; II-Commons manages the workers and dataset parts for you, so there is nothing extra to coordinate.
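
For example, a minimal sketch of launching four workers on one machine:

$ for i in 1 2 3 4; do python . -w embed_image -d pd12m & done; wait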

Prepare a Text Dataset

II-Commons is designed to support text-based datasets such as Wikipedia and arXiv. We will use the Wikipedia English dataset for demonstration. Full support for arXiv is coming soon.

1. Get the Raw Dataset

Navigate to the Wikipedia dump directory. Download the pages-articles-multistream dump file in xml.bz2 format, e.g. enwiki-20250501-pages-articles-multistream.xml.bz2, and extract the XML file from the bz2 archive.
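
For example, assuming the standard dumps.wikimedia.org layout for the 20250501 snapshot:

$ wget https://dumps.wikimedia.org/enwiki/20250501/enwiki-20250501-pages-articles-multistream.xml.bz2
$ bunzip2 -k enwiki-20250501-pages-articles-multistream.xml.bz2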

Alternatively, you can use the sample mini dataset for testing; in that case, jump to the Load the Dataset to Database section.

2. Extract Pages from the Raw Dataset

The best way to extract pages from the raw dataset is to use the wikiextractor tool.

Be sure to apply this patch to the wikiextractor tool to fix this issue before extracting pages.

$ wikiextractor enwiki-20250501-pages-articles-multistream.xml --json --no-templates -o /path/to/wikipedia_en

Extract pages with links if you need them:

$ wikiextractor enwiki-20250501-pages-articles-multistream.xml --json --no-templates --links -o /path/to/wikipedia_en

3. Load the Dataset to Database

This step will analyze all the pages extracted from the raw dataset, upload them to the object storage, and save the metadata to the database.

$ python . -w load -d wikipedia_en -p ./meta/wikipedia_en

4. Split Pages into Chunks, and Embed the Chunks

This step will split the pages into chunks of a certain size, save the chunks to the chunking database, and embed the chunks.

We use Snowflake/snowflake-arctic-embed-m-v2.0 as the default text embedding model.

$ python . -w embed_text -d wikipedia_en

As with image embedding, you can run the above command multiple times in parallel, on a single machine or in a distributed environment. II-Commons automatically divides the dataset into parts, processes them in parallel, and manages workers joining or leaving dynamically, so there is nothing extra to coordinate.

Query

$ python . -q [TOPIC]
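
For example, any topic string can be used in place of [TOPIC]:

$ python . -q "history of the printing press"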

Docker

Build

$ docker build -t ii-commons .

Run

$ docker run --rm --gpus all -v ./.env:/app/.env ii-commons

Try the demo API/MCP services

Check out the documentation for API/MCP services and more details.

FAQ

What's Next: Our Roadmap

  • Simplify installation and operation.
  • Offer more pre-computed indexes for modalities like PDFs, video and audio.
  • Create more AI-assisted generated knowledge bases for public good.
  • Provide API services for datasets.
  • Establish a knowledge base hub for easier sharing and downloading.
  • Develop a desktop version for personal everyday data retrieval.

Footnotes

  1. Ms-marco-MiniLM-L12-v2 is trained on the MS MARCO Passage Ranking task. ↩

  2. emb * 1.0 + bm25 * 0: BM25 results are included in the search, but their scores are weighted by 0. ↩
