CONTRIBUTING.md (+18 -3: 18 additions & 3 deletions)
@@ -53,7 +53,11 @@ To sign the agreement:

 3. **Install pre-commit hooks**:
 ```bash
+# Install pre-commit hooks
 pre-commit install
+
+# Install commit-msg hooks (for commitizen)
+pre-commit install --hook-type commit-msg
 ```

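After both install commands have run, each hook type should have a script under `.git/hooks/`. The following is a minimal sketch for checking that, assuming the default repository layout (no custom `core.hooksPath`, no linked worktree). `missing_hooks` and `EXPECTED_HOOKS` are hypothetical names used for illustration and are not part of pre-commit or this project.

```python
from pathlib import Path

# Hook types installed by the two `pre-commit install` commands above.
EXPECTED_HOOKS = ("pre-commit", "commit-msg")


def missing_hooks(repo_root: str = ".") -> list[str]:
    """Return the expected hook names that have no script in .git/hooks.

    Assumption: hooks live in <repo_root>/.git/hooks, which is the default
    location when core.hooksPath is not set.
    """
    hooks_dir = Path(repo_root) / ".git" / "hooks"
    return [name for name in EXPECTED_HOOKS if not (hooks_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_hooks()
    print("All hooks installed" if not missing else f"Missing hooks: {missing}")
```

Run from the repository root, an empty result means both hook scripts are present; it does not verify that the hooks themselves pass.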
@@ -102,12 +106,23 @@ To sign the agreement:

 Please note that we use pytest-vcr to record and replay LLM API interactions. Your changes may require re-recording VCR cassettes for the tests. See [VCR Cassette Management](#vcr-cassette-management) section below for details.

-4. **Commit your changes** with a descriptive commit message:
+4. **Commit your changes** using Conventional Commits format:

-For example:
+We use [Conventional Commits](https://www.conventionalcommits.org/) format for our commit messages, which enables automatic changelog generation and semantic versioning. Instead of using regular git commit, please use commitizen:

 ```bash
-git commit -m "Add feature: description of your changes"
+poetry run cz commit
+```
+
+This will guide you through an interactive prompt to create a properly formatted commit message with:
+- Type of change (feat, fix, docs, style, refactor, etc.)
+- Optional scope (e.g., api, cli, docs)
+- Short description
+- Optional longer description and breaking change notes
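To make the header format concrete, here is a minimal sketch of a check for the first line of a commit message, following the `<type>(<scope>)!: <description>` shape from the Conventional Commits specification. It is an illustration only, not how commitizen validates messages. `parse_header`, `HEADER_RE`, and `ALLOWED_TYPES` are hypothetical names, and the type list contains only the types named above, so it would need extending to match the project's commitizen configuration.

```python
import re

# Only the commit types named in the list above; the real set is longer
# ("etc."), so treat this tuple as an assumption to be adjusted.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor")

# <type>[(scope)][!]: <description>
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(header: str):
    """Return (type, scope, breaking, description), or None if malformed."""
    match = HEADER_RE.match(header)
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    return (
        match.group("type"),
        match.group("scope"),  # None when no scope was given
        match.group("breaking") is not None,
        match.group("description"),
    )


if __name__ == "__main__":
    for line in (
        "feat(api): add aspect extraction",
        "fix!: drop legacy flag",
        "Add feature: description of your changes",  # old style, rejected
    ):
        print(f"{line!r} -> {parse_header(line)}")
```

The old-style message removed in this change fails the check because it does not start with a lowercase type, which is the kind of mistake the interactive `cz commit` prompt avoids.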
README.md (+84 -8: 84 additions & 8 deletions)
@@ -28,9 +28,8 @@ ContextGem addresses this challenge by providing a flexible, intuitive framework
 Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.


-## 💡 What can you do with ContextGem?
+## 💡 With ContextGem, you can:

-With ContextGem, you can:
 - **Extract structured data** from documents (text, images) with minimal code
 - **Identify and analyze key aspects** (topics, themes, categories) within documents
 - **Extract specific concepts** (entities, facts, conclusions, assessments) from documents
@@ -173,15 +172,74 @@ pip install -U contextgem

 ## 🚀 Quick start

+### Aspect extraction
+
+Aspect is a defined area or topic within a document (or another aspect). Each aspect reflects a specific subject or theme.
+
+```python
+# Quick Start Example - Extracting payment terms from a document
+
+import os
+
+from contextgem import Aspect, Document, DocumentLLM
+
+# Sample document text (shortened for brevity)
+doc = Document(
+    raw_text=(
+        "SERVICE AGREEMENT\n"
+        "SERVICES. Provider agrees to provide the following services to Client: "
+        "Cloud-based data analytics platform access and maintenance...\n"
+        "PAYMENT. Client agrees to pay $5,000 per month for the services. "
+        "Payment is due on the 1st of each month. Late payments will incur a 2% fee per month...\n"
+        "CONFIDENTIALITY. Both parties agree to keep all proprietary information confidential "
+        "for a period of 5 years following termination of this Agreement..."
+    ),
+)
+
+# Define the aspects to extract
+doc.aspects = [
+    Aspect(
+        name="Payment Terms",
+        description="Payment terms and conditions in the contract",
+        # see the docs for more configuration options, e.g. sub-aspects, concepts, etc.
+    ),
+    # Add more aspects as needed
+]
+# Or use `doc.add_aspects([...])`
+
+# Define an LLM for extracting information from the document
+llm = DocumentLLM(
+    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
+    api_key=os.environ.get(
+        "CONTEXTGEM_OPENAI_API_KEY"
+    ),  # your API key for the LLM provider, e.g. OpenAI, Anthropic, etc.
+    # see the docs for more configuration options
+)
+
+# Extract information from the document
+doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`
+
+# Access extracted information in the document object
+for item in doc.aspects[0].extracted_items:
+    print(f"• {item.value}")
+# or `doc.get_aspect_by_name("Payment Terms").extracted_items`
+
+```
+[](https://colab.research.google.com/github/shcherbak-ai/contextgem/blob/main/dev/notebooks/readme/quickstart_aspect.ipynb)
+
+
+### Concept extraction
+
+Concept is a unit of information or an entity, derived from an aspect or the broader document context.
+
 ```python
 # Quick Start Example - Extracting anomalies from a document, with source references and justifications

 import os

 from contextgem import Document, DocumentLLM, StringConcept

-# Example document instance
-# Document content is shortened for brevity
+# Sample document text (shortened for brevity)
 doc = Document(
     raw_text=(
         "Consultancy Agreement\n"
@@ -203,13 +261,14 @@ doc.concepts = [
         reference_depth="sentences",
         add_justifications=True,
         justification_depth="brief",
+        # see the docs for more configuration options
     )
     # add more concepts to the document, if needed
     # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
 ]
-# Or use doc.add_concepts([...])
+# Or use `doc.add_concepts([...])`

-# Create an LLM for extracting data and insights from the document
+# Define an LLM for extracting information from the document
 llm = DocumentLLM(
     model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
     api_key=os.environ.get(
@@ -219,15 +278,18 @@ llm = DocumentLLM(
 )

 # Extract information from the document
-doc = llm.extract_all(doc)  # or use async version llm.extract_all_async(doc)
+doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

 # Access extracted information in the document object
 print(
     doc.concepts[0].extracted_items
 )  # extracted items with references & justifications
-# or doc.get_concept_by_name("Anomalies").extracted_items
+# or `doc.get_concept_by_name("Anomalies").extracted_items`

 ```
+[](https://colab.research.google.com/github/shcherbak-ai/contextgem/blob/main/dev/notebooks/readme/quickstart_concept.ipynb)
+
+---

 See more examples in the documentation:

@@ -305,6 +367,20 @@ This project is automatically scanned for security vulnerabilities using [CodeQL
 See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.


+## 🙏 Acknowledgements
+
+ContextGem relies on these excellent open-source packages:
+
+- [pydantic](https://github.com/pydantic/pydantic): The gold standard for data validation
+- [aiolimiter](https://github.com/mjpieters/aiolimiter): Powerful rate limiting for async operations
+
+
 ## 📄 License & Contact

 This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.