Commit ab269e2

Merge pull request #3 from shcherbak-ai/dev
docs: Updated documentation and project development setup
2 parents 827b104 + 5ba469a commit ab269e2

33 files changed, with 3644 additions and 1213 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -7,8 +7,10 @@ env
 venv
 .venv
 .coverage
+.cz.msg
 
 notebooks
+!dev/notebooks
 docs/build
 dist
 .DS_Store

.pre-commit-config.yaml

Lines changed: 16 additions & 0 deletions
@@ -1,4 +1,11 @@
 repos:
+
+  # Commitizen hook for conventional commits
+  - repo: https://github.com/commitizen-tools/commitizen
+    rev: v4.5.1
+    hooks:
+      - id: commitizen
+        stages: [commit-msg]
 
   # Custom local hooks
   - repo: local
@@ -58,3 +65,12 @@ repos:
         pass_filenames: false
         always_run: true
         stages: [pre-commit]
+
+      # Generate example notebooks
+      - id: generate-notebooks
+        name: Generate example notebooks
+        entry: python dev/generate_notebooks.py
+        language: system
+        pass_filenames: false
+        always_run: true
+        stages: [pre-commit]
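
The `generate-notebooks` hook runs `python dev/generate_notebooks.py`, but that script is not shown in this part of the diff. Purely as an illustration of what such a step could involve, the sketch below wraps a Python snippet in a minimal notebook structure using only the standard library. The helper names `snippet_to_notebook` and `write_notebook` are hypothetical, and the real script may instead rely on the `nbformat` package that this commit adds to NOTICE.

```python
import json
import tempfile
from pathlib import Path


def snippet_to_notebook(title: str, code: str) -> dict:
    """Build a minimal notebook dict in the nbformat 4 layout (hypothetical helper)."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": f"# {title}"},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": code,
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebook(notebook: dict, path: Path) -> None:
    """Serialize the notebook dict as JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")


if __name__ == "__main__":
    # Write to a temporary folder so the sketch never touches a real repo.
    target = Path(tempfile.gettempdir()) / "example_notebooks" / "quickstart.ipynb"
    write_notebook(snippet_to_notebook("Quick start", "print('hello')"), target)
    print(f"wrote {target}")
```

Because the hook sets `always_run: true` and `pass_filenames: false`, a script like this would run on every commit regardless of which files changed, so it is worth keeping it fast and idempotent.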

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -5,5 +5,5 @@ authors:
     given-names: Sergii
     email: sergii@shcherbak.ai
 title: "ContextGem: Easier and faster way to build LLM extraction workflows through powerful abstractions"
-date-released: 2024-04-02
+date-released: 2025-04-02
 url: "https://github.com/shcherbak-ai/contextgem"

CONTRIBUTING.md

Lines changed: 18 additions & 3 deletions
@@ -53,7 +53,11 @@ To sign the agreement:
 
 3. **Install pre-commit hooks**:
 ```bash
+# Install pre-commit hooks
 pre-commit install
+
+# Install commit-msg hooks (for commitizen)
+pre-commit install --hook-type commit-msg
 ```
 
 
@@ -102,12 +106,23 @@ To sign the agreement:
 
 Please note that we use pytest-vcr to record and replay LLM API interactions. Your changes may require re-recording VCR cassettes for the tests. See [VCR Cassette Management](#vcr-cassette-management) section below for details.
 
-4. **Commit your changes** with a descriptive commit message:
+4. **Commit your changes** using Conventional Commits format:
 
-For example:
+We use [Conventional Commits](https://www.conventionalcommits.org/) format for our commit messages, which enables automatic changelog generation and semantic versioning. Instead of using regular git commit, please use commitizen:
 
 ```bash
-git commit -m "Add feature: description of your changes"
+poetry run cz commit
+```
+
+This will guide you through an interactive prompt to create a properly formatted commit message with:
+- Type of change (feat, fix, docs, style, refactor, etc.)
+- Optional scope (e.g., api, cli, docs)
+- Short description
+- Optional longer description and breaking change notes
+
+Example of resulting commit message:
+```
+docs(readme): update installation instructions
 ```
 
 
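
The updated contributing guide describes the commit header as a type, an optional scope, and a short description. To make that shape concrete, here is a minimal sketch of a header check written against the Conventional Commits layout. It is not commitizen's own validation logic: `ALLOWED_TYPES`, `HEADER_RE`, and `parse_header` are hypothetical names, and the accepted type list is an assumption built from the types named in the guide plus a few common ones.

```python
import re
from typing import Optional

# Assumed type list: the guide names feat, fix, docs, style, refactor "etc.";
# the remaining entries are common additions, not taken from this diff.
ALLOWED_TYPES = {
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore",
}

# type, optional "(scope)", optional "!" for breaking changes, then ": description"
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(header: str) -> Optional[dict]:
    """Return the parsed parts of a commit header, or None if it does not conform."""
    match = HEADER_RE.match(header)
    if match is None or match["type"] not in ALLOWED_TYPES:
        return None
    return {
        "type": match["type"],
        "scope": match["scope"],
        "breaking": match["breaking"] is not None,
        "description": match["description"],
    }


if __name__ == "__main__":
    # The new example from the guide parses; the old free-form one does not.
    print(parse_header("docs(readme): update installation instructions"))
    print(parse_header("Add feature: description of your changes"))
```

The old example message from the guide is rejected here because it starts with a capitalized phrase rather than a lowercase type token, which is the kind of message the `commit-msg` hook is meant to catch.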

NOTICE

Lines changed: 2 additions & 0 deletions
@@ -35,8 +35,10 @@ Core Dependencies:
 
 Development Dependencies:
 - black: Code formatting
+- commitizen: Conventional commit tool and release management
 - coverage: Test coverage measurement
 - isort: Sorting imports
+- nbformat: Notebook format utilities
 - pip-tools: Dependency management
 - pre-commit: Pre-commit hooks
 - pytest: Testing framework

README.md

Lines changed: 84 additions & 8 deletions
@@ -28,9 +28,8 @@ ContextGem addresses this challenge by providing a flexible, intuitive framework
 Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.
 
 
-## 💡 What can you do with ContextGem?
+## 💡 With ContextGem, you can:
 
-With ContextGem, you can:
 - **Extract structured data** from documents (text, images) with minimal code
 - **Identify and analyze key aspects** (topics, themes, categories) within documents
 - **Extract specific concepts** (entities, facts, conclusions, assessments) from documents
@@ -173,15 +172,74 @@ pip install -U contextgem
 
 ## 🚀 Quick start
 
+### Aspect extraction
+
+Aspect is a defined area or topic within a document (or another aspect). Each aspect reflects a specific subject or theme.
+
+```python
+# Quick Start Example - Extracting payment terms from a document
+
+import os
+
+from contextgem import Aspect, Document, DocumentLLM
+
+# Sample document text (shortened for brevity)
+doc = Document(
+    raw_text=(
+        "SERVICE AGREEMENT\n"
+        "SERVICES. Provider agrees to provide the following services to Client: "
+        "Cloud-based data analytics platform access and maintenance...\n"
+        "PAYMENT. Client agrees to pay $5,000 per month for the services. "
+        "Payment is due on the 1st of each month. Late payments will incur a 2% fee per month...\n"
+        "CONFIDENTIALITY. Both parties agree to keep all proprietary information confidential "
+        "for a period of 5 years following termination of this Agreement..."
+    ),
+)
+
+# Define the aspects to extract
+doc.aspects = [
+    Aspect(
+        name="Payment Terms",
+        description="Payment terms and conditions in the contract",
+        # see the docs for more configuration options, e.g. sub-aspects, concepts, etc.
+    ),
+    # Add more aspects as needed
+]
+# Or use `doc.add_aspects([...])`
+
+# Define an LLM for extracting information from the document
+llm = DocumentLLM(
+    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
+    api_key=os.environ.get(
+        "CONTEXTGEM_OPENAI_API_KEY"
+    ),  # your API key for the LLM provider, e.g. OpenAI, Anthropic, etc.
+    # see the docs for more configuration options
+)
+
+# Extract information from the document
+doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`
+
+# Access extracted information in the document object
+for item in doc.aspects[0].extracted_items:
+    print(f"{item.value}")
+# or `doc.get_aspect_by_name("Payment Terms").extracted_items`
+
+```
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shcherbak-ai/contextgem/blob/main/dev/notebooks/readme/quickstart_aspect.ipynb)
+
+
+### Concept extraction
+
+Concept is a unit of information or an entity, derived from an aspect or the broader document context.
+
 ```python
 # Quick Start Example - Extracting anomalies from a document, with source references and justifications
 
 import os
 
 from contextgem import Document, DocumentLLM, StringConcept
 
-# Example document instance
-# Document content is shortened for brevity
+# Sample document text (shortened for brevity)
 doc = Document(
     raw_text=(
         "Consultancy Agreement\n"
@@ -203,13 +261,14 @@ doc.concepts = [
         reference_depth="sentences",
         add_justifications=True,
         justification_depth="brief",
+        # see the docs for more configuration options
     )
     # add more concepts to the document, if needed
     # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
 ]
-# Or use doc.add_concepts([...])
+# Or use `doc.add_concepts([...])`
 
-# Create an LLM for extracting data and insights from the document
+# Define an LLM for extracting information from the document
 llm = DocumentLLM(
     model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
     api_key=os.environ.get(
@@ -219,15 +278,18 @@ llm = DocumentLLM(
 )
 
 # Extract information from the document
-doc = llm.extract_all(doc)  # or use async version llm.extract_all_async(doc)
+doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`
 
 # Access extracted information in the document object
 print(
     doc.concepts[0].extracted_items
 )  # extracted items with references & justifications
-# or doc.get_concept_by_name("Anomalies").extracted_items
+# or `doc.get_concept_by_name("Anomalies").extracted_items`
 
 ```
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shcherbak-ai/contextgem/blob/main/dev/notebooks/readme/quickstart_concept.ipynb)
+
+---
 
 See more examples in the documentation:
 
@@ -305,6 +367,20 @@ This project is automatically scanned for security vulnerabilities using [CodeQL
 See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
 
 
+## 🙏 Acknowledgements
+
+ContextGem relies on these excellent open-source packages:
+
+- [pydantic](https://github.com/pydantic/pydantic): The gold standard for data validation
+- [Jinja2](https://github.com/pallets/jinja): Fast, expressive template engine that powers our dynamic prompt rendering
+- [litellm](https://github.com/BerriAI/litellm): Unified interface to multiple LLM providers with seamless provider switching
+- [wtpsplit](https://github.com/segment-any-text/wtpsplit): State-of-the-art text segmentation tool
+- [loguru](https://github.com/Delgan/loguru): Simple yet powerful logging that enhances debugging and observability
+- [python-ulid](https://github.com/mdomke/python-ulid): Efficient ULID generation
+- [PyTorch](https://github.com/pytorch/pytorch): Industry-standard machine learning framework
+- [aiolimiter](https://github.com/mjpieters/aiolimiter): Powerful rate limiting for async operations
+
+
 ## 📄 License & Contact
 
 This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.
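
Both quick-start snippets in the README diff read the key with `os.environ.get("CONTEXTGEM_OPENAI_API_KEY")`, which returns `None` when the variable is unset. Whether `DocumentLLM` rejects a missing key up front is not visible in this diff, so a reader adapting the example may prefer to fail early with a clear message. The helper below is a small standard-library sketch of that idea; `require_env` is a hypothetical name and not part of ContextGem.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


if __name__ == "__main__":
    # Demonstration with a throwaway variable name, not a real credential.
    os.environ["EXAMPLE_API_KEY"] = "dummy-value"
    print(require_env("EXAMPLE_API_KEY"))
```

In the quick-start code this would replace the `os.environ.get(...)` call, for example `api_key=require_env("CONTEXTGEM_OPENAI_API_KEY")`, so a misconfigured environment surfaces before any request is attempted.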

contextgem/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
 ContextGem - Easier and faster way to build LLM extraction workflows through powerful abstractions
 """
 
-__version__ = "0.1.1"
+__version__ = "0.1.1.post1"
 __author__ = "Shcherbak AI AS"
 
 from contextgem.public import (
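
The version moves from `0.1.1` to `0.1.1.post1`, which follows the post-release form defined in PEP 440: a post-release sorts after its base release and before the next release. The sketch below illustrates that ordering for this narrow case only. It is a simplification that handles dotted numeric releases with an optional `.postN` suffix; `sort_key` is a hypothetical helper, and real tooling would normally use a full PEP 440 parser instead.

```python
import re
from typing import Tuple

# Dotted numeric release with an optional ".postN" suffix; nothing else is supported.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:\.post(\d+))?$")


def sort_key(version: str) -> Tuple[Tuple[int, ...], int]:
    """Map a version string to a tuple that sorts in release order."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # A plain release sorts before any of its post-releases, so it gets -1.
    post = int(match.group(2)) if match.group(2) is not None else -1
    return release, post


if __name__ == "__main__":
    versions = ["0.1.2", "0.1.1.post1", "0.1.1"]
    print(sorted(versions, key=sort_key))
```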
