Chilka is a corpus builder and server library with a pluggable document database backend. Two plugins are provided as examples: one for MongoDB and another for a serverless ChromaDB. For more details, read the documentation.
- Easy interface that hides the complexity of the database.
- Direct ingestion of text files helps automate your corpus generation task.
- Flexible read interface that lets you retrieve documents:
  - by sentence number range.
  - by keyword filtering.
  - by a combination of both of the above.
  - as a text blob.
- Simple database schema allows you to access the database directly if necessary.
- Database backend lets your corpus scale and keeps it easy to access.
- Flexible plugin interface allows passing custom database arguments and/or retrieving data in the DB-native format (see the sketch after this list).
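To make the plugin idea concrete, here is a purely illustrative sketch of what a backend plugin might look like. The class and method names are hypothetical and are not Chilka's actual plugin API; the point is only that database-specific arguments can be forwarded to the underlying driver and that results can be handed back in the driver's native format.

```python
# Hypothetical sketch only -- see the Chilka documentation for the real plugin contract.
from pymongo import MongoClient


class ExampleMongoPlugin:
    """Illustrative backend plugin that wraps pymongo directly."""

    def __init__(self, db_name, conn_str, **db_kwargs):
        # Custom database arguments (e.g. serverSelectionTimeoutMS) are
        # forwarded untouched to the underlying driver.
        self.client = MongoClient(conn_str, **db_kwargs)
        self.db = self.client[db_name]

    def insert_sentences(self, collection, docs):
        # 'docs' are dicts following whatever schema the plugin enforces.
        return self.db[collection].insert_many(docs)

    def read_native(self, collection, query):
        # Returns a pymongo Cursor, i.e. data in the DB-native format.
        return self.db[collection].find(query)
```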
Chilka sentencizes a text document and stores the sentences as individual database documents using the following default schema:
```
{'n': <sentence-number>, 'sent': <sentence-text>}
```
Each file gets its own collection, named after the file (assuming MongoDB).
Note: It is the responsibility of the plugin to choose or enforce a particular schema. For databases that do not have the concept of a collection, the plugin needs to ensure that this abstraction is realized suitably.
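Because the default schema is this small, you can also query a MongoDB-backed corpus directly with the database driver when you need to. The snippet below is a sketch using plain pymongo; it assumes the 'corpus' database and the mayon_volcano.txt collection created in the quick-start further down, with field names taken from the default schema above.

```python
from pymongo import MongoClient

# Connect to the same MongoDB instance Chilka writes to.
client = MongoClient("mongodb://localhost:27017/")
collection = client["corpus"]["mayon_volcano.txt"]  # one collection per ingested file

# Sentences 15-19, ordered by sentence number.
for doc in collection.find({"n": {"$gte": 15, "$lt": 20}}).sort("n"):
    print(doc["n"], doc["sent"])

# Sentences matching a keyword, via a regular expression on the 'sent' field.
for doc in collection.find({"sent": {"$regex": "Sun.+"}}):
    print(doc["sent"])
```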
Get going with just a few lines of code.
```python
from chilka import CorpusClient
from pprint import pprint

# Assumes a MongoDB instance is running on localhost. Replace with your
# connection string if this is not applicable.
# Connect to the MongoDB server and create a 'corpus' database.
# A reference to the corpus DB is returned if it already exists.
my_corpus = CorpusClient("corpus", "mongodb://localhost:27017/", db_plugin="mongodb")

# List all the collections present in the DB, one collection per file
print("-" * 79)
print(f"List of collections in DB: {my_corpus.list()}")
print("-" * 79)

# Add files to the DB
filefolder = "./Text/"
for filename in ["mayon_volcano.txt", "ukraine_dam.txt"]:
    print(f"Add file {filename}: {my_corpus.add(filefolder + filename)}")
    print("-" * 40)

# List all the collections present in the DB
print(f"List of collections/files in DB: {my_corpus.list()}")
print("-" * 79)

# List sentences in each collection using filters. This also returns object IDs
# and sentence numbers. Extract the 'sent' key to get just the sentences.
print("Sentences matching 'Sun.+' in file mayon_volcano.txt:")
pprint(list(my_corpus.readSents('mayon_volcano.txt', range_filter=None, kw_filter='Sun.+')))
print("-" * 79)

print("Sentences in the range (15, 20) in file mayon_volcano.txt:")
pprint(list(my_corpus.readSents('mayon_volcano.txt', range_filter=(15, 20), kw_filter=None)))
print("-" * 79)

print(f"Sentences in the form of a text blob:\n{my_corpus.readBlob('ukraine_dam.txt')}")
print("-" * 79)

# Remove unwanted collections from the DB
print(f"Remove collection ukraine_dam.txt: {my_corpus.remove('ukraine_dam.txt')}")
print("-" * 79)

# List all collections in the DB
print(f"List of collections in DB: {my_corpus.list()}")
```
Use the provided requirements.txt with pip to create/verify the installation environment. Then, copy or clone this repository to start using Chilka.