Implement schema constraints for OpenAI #61

GirinMan · 2025-08-05T11:08:39Z

Description

Implements schema constraints for OpenAI models, enabling structured outputs with JSON format without requiring output fencing.

Fixes #59

Feature

How Has This Been Tested?

Comprehensive testing has been performed through automated tests and code review:

Tests Added

1. Unit Tests (`tests/inference_test.py`)

test_openai_schema_constraints_json: Validates that schema constraints work correctly with JSON format
test_openai_schema_constraints_yaml_raises_error: Ensures YAML format with schema constraints raises appropriate error
test_openai_with_schema_constraints: Verifies correct API parameters are sent to OpenAI when using structured outputs

2. Schema Tests (`tests/schema_test.py`)

OpenAISchemaTest class with multiple test cases:
- Tests schema generation from empty extractions
- Tests schema generation with and without attributes
- Tests handling of list-type attributes
- Tests custom attribute suffix functionality

3. Integration Tests (`tests/openai_extract_test.py`)

test_extract_with_openai_schema_constraints: End-to-end test of extract() function with OpenAI schema constraints
test_extract_openai_yaml_with_schema_raises_error: Validates error handling for unsupported YAML format
test_extract_openai_fence_output_with_schema_raises_error: Validates error handling for fence_output=True
test_extract_openai_without_schema_constraints: Ensures backward compatibility

Test Coverage

Schema generation and validation
API parameter construction with response_format
Error handling for unsupported configurations
Backward compatibility for existing functionality

Running Tests

Command:

$ pytest tests/inference_test.py
$ pytest tests/schema_test.py
$ pytest tests/openai_extract_test.py

Note: The tests use mocked OpenAI API responses to avoid requiring actual API keys during testing.

Checklist:

I have read and acknowledged Google's Open Source Code of conduct.
I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
I have added tests, or I have ensured existing tests cover the changes
I have followed Google's Python Style Guide and ran pylint over the affected code.

- Switch from badge.fury.io to shields.io for working PyPI badge - Convert relative paths to absolute GitHub URLs for PyPI compatibility - Bump version to 0.1.3

- Add GitHub Actions workflow for automated PyPI publishing via OIDC - Configure trusted publishing environment for verified releases - Update project metadata with proper URLs and license format - Prepare for v1.0.0 stable release with production-ready automation

- Add pylibmagic>=0.5.0 dependency for bundled libraries - Add [full] install option and pre-import handling - Update README with troubleshooting and Docker sections - Bump version to 1.0.1 Fixes google#6

Deleted an inline comment referencing the output directory in the save_annotated_documents.

…ples.md docs: clarify output_dir behavior in medication_examples.md

Prevents confusion from default `test_output/...` by explicitly saving to current directory.

docs: add output_dir="." to all save_annotated_documents examples

feat: add code formatting and linting pipeline

Introduces a common base exception class that all library-specific exceptions inherit from, enabling users to catch all LangExtract errors with a single except clause.

Add LangExtractError base exception for centralized error handling

Fixes google#25 - Windows installation failure due to pylibmagic build requirements Breaking change: LangFunLanguageModel removed. Use GeminiLanguageModel or OllamaLanguageModel instead.

fix: Remove LangFun and pylibmagic dependencies to fix Windows installation and OpenAI SDK v1.x compatibility

- Modified save_annotated_documents to accept both pathlib.Path and string paths - Convert string paths to Path objects before calling mkdir() - This fixes the error when using output_dir='.' as shown in the README example

…-mkdir Fix save_annotated_documents to handle string paths

feat: Add OpenAI language model support

…s: (google#10) * docs: clarify output_dir behavior in medication_examples.md * Removed inline comment in medication example Deleted an inline comment referencing the output directory in the save_annotated_documents. * docs: add output_dir="." to all save_annotated_documents examples Prevents confusion from default `test_output/...` by explicitly saving to current directory. * build: add formatting & linting pipeline with pre-commit integration * style: apply pyink, isort, and pre-commit formatting * ci: enable format and lint checks in tox * Add LangExtractError base exception for centralized error handling Introduces a common base exception class that all library-specific exceptions inherit from, enabling users to catch all LangExtract errors with a single except clause. * fix(ui): prevent current highlight border from being obscured --------- Co-authored-by: Leena Kamran <62442533+kleeena@users.noreply.github.com> Co-authored-by: Akshay Goel <akshay.k.goel@gmail.com>

- Gemini & OpenAI test suites with retry on transient errors - CI: Separate job, Python 3.11 only, skips for forks - Validates char_interval for all extractions - Multilingual test xfail (issue google#13) TODO: Remove xfail from multilingual test after tokenizer fix

…oogle#57) Fixes google#27

- Add OpenAISchema class to generate JSON Schema compatible with OpenAI's structured outputs API - Update OpenAILanguageModel to accept and use openai_schema parameter - Configure response_format with json_schema when schema is provided - Add validation to ensure schema constraints are only used with JSON format - Update extract() function to generate OpenAI schemas when appropriate - Support LANGEXTRACT_OPENAI_API_KEY environment variable This enables use_schema_constraints=True with fence_output=False for OpenAI models when using FormatType.JSON. YAML format with schema constraints will raise a clear error.

- Add tests for OpenAILanguageModel with schema constraints - Add tests for OpenAISchema generation from examples - Add integration tests for extract() function with OpenAI - Test validation errors for YAML format and fence_output=True - Verify correct API parameters when using structured outputs

- Update OpenAI example from README - Document that schema constraints now work with JSON format - Add note about FormatType and fence_output requirements - Clarify supported models and limitations

* Add workflow_dispatch trigger to validation workflows - Enable manual triggering for check-linked-issue, check-pr-size, and validate_pr_template - Add conditional logic to ensure PR-specific steps only run on PR events - Allows maintainers to manually trigger workflows when needed * Add manual trigger to infrastructure protection workflow - Add workflow_dispatch trigger - Add conditional logic for PR-specific checks - Ensures consistency across all validation workflows

- Change from pull_request to pull_request_target in all validation workflows - This gives workflows proper permissions to add labels and comments on PRs from forks - Fixes 'Resource not accessible by integration' error (HTTP 403) - Safe because workflows only read PR metadata and don't execute PR code

Enables manual triggering of CI workflow including live API tests. This allows maintainers to run live API tests for PRs from forks where the tests would normally be skipped for security reasons.

Enables two ways to run live API tests: 1. workflow_dispatch: Manual trigger via Actions tab 2. Label trigger: Add 'ready-to-merge' label to any PR The label-based approach uses pull_request_target for security: - Runs in base repository context with access to secrets - Safely merges PR into main branch before testing - Only maintainers can trigger - Comments test results back to PR This provides a production-ready solution for testing PRs from forks while maintaining security, following patterns used by major projects.

* Add base_url to OpenAILanguageModel * Github action lint is outdated, so adapting * Adding base_url to parameterized test * Lint fixes to inference_test.py

Bug: Workflows triggered on pull_request_target but checked for pull_request, causing all validations to be skipped. Fixed: - Event condition checks now match trigger type - Add manual revalidation workflow - Enable workflow_dispatch with PR number input

github-actions · 2025-08-06T23:16:01Z

Manual validation results:

Size: 798 lines
Template: ✓
Linked issue: ✓

Run ID: 16790875474

- Creates visible PR checks (pass/fail status) - Shows validation errors in status description (up to 140 chars) - Links to workflow run for full details - Maintains backward compatibility with comment reporting

github-actions · 2025-08-06T23:36:58Z

Manual validation results:

Size: 798 lines
Template: ✓
Linked issue: ✓

Run ID: 16791196611

github-actions · 2025-08-06T23:40:33Z

Manual Validation Results

Status: ❌ Failed

Check	Status	Details
PR Size	✅	798 lines
Template	✅	Complete
Linked Issue	✅	Found

View workflow run

The workflow was comparing boolean true to string 'true', causing all validations to incorrectly show as failed even when all checks passed.

github-actions · 2025-08-07T00:54:05Z

Manual Validation Results

Status: ✅ Passed

Check	Status	Details
PR Size	✅	798 lines
Template	✅	Complete
Linked Issue	✅	Found

View workflow run

- revalidate-all-prs.sh: Triggers manual validation for all open PRs - add-size-labels.sh: Adds size labels (XS/S/M/L/XL) based on change count - add-new-checks.sh: Adds required status checks to branch protection These scripts require maintainer permissions and help manage PR workflows.

- Add type ignore comments for IPython imports - Fix return type annotation (remove unnecessary quotes) - Add _is_jupyter() to properly detect notebook environments - Replace lambda with def function for pylint compliance Fixes google#65

- Add format-check job that checks actual PR code, not merge commit - Validate formatting before expensive fork PR tests - Provide clear error messages when formatting fails Fixes false positives where incorrectly formatted PRs passed CI

Auto-updates PRs behind main, handles forks/conflicts gracefully, skips bot/draft PRs, monitors API limits

github-actions · 2025-08-07T05:29:39Z

⚠️ Branch Update Required

Your branch is 20 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

- Apply end-of-file and whitespace fixes to workflows

- Fix empty interval bug when newline falls at chunk boundary (issue google#71) - Add concise comment explaining the fix logic - Remove excessive/obvious comments from chunking tests - Improve test docstring to be more descriptive and professional

The exceptions.py file existed in both the root directory and langextract/ directory with identical content. This removes the duplicate from the root to avoid confusion and maintain proper package structure.

github-actions · 2025-08-08T08:02:58Z

❌ Infrastructure File Protection

This PR modifies protected infrastructure files:

.github/workflows/validate_pr_template.yaml (14 changes)

Only repository maintainers are allowed to modify infrastructure files (including .github/, build configuration, and repository documentation).

Note: If these are only formatting changes, please:

Revert changes to .github/ files
Use ./autoformat.sh to format only source code directories
Avoid running formatters on infrastructure files

If structural changes are necessary:

Open an issue describing the needed infrastructure changes
A maintainer will review and implement the changes if approved

For more information, see our Contributing Guidelines.

merge main

github-actions · 2025-08-14T09:24:35Z

⚠️ Branch Update Required

Your branch is 23 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

github-actions · 2025-08-22T02:32:08Z

⚠️ Branch Update Required

Your branch is 86 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

aksg87 and others added 29 commits July 22, 2025 01:39

docs(pypi): Improve README display and badge reliability

2ce2399

- Switch from badge.fury.io to shields.io for working PyPI badge - Convert relative paths to absolute GitHub URLs for PyPI compatibility - Bump version to 0.1.3

Fix: Resolve libmagic ImportError (google#6)

e696a48

- Add pylibmagic>=0.5.0 dependency for bundled libraries - Add [full] install option and pre-import handling - Update README with troubleshooting and Docker sections - Bump version to 1.0.1 Fixes google#6

docs: clarify output_dir behavior in medication_examples.md

5447637

Merge pull request google#11 from google/fix/libmagic-dependency-issue

9c47b34

Removed inline comment in medication example

175e075

Deleted an inline comment referencing the output directory in the save_annotated_documents.

Merge pull request google#15 from kleeena/docs/update-medication_exam…

9472099

…ples.md docs: clarify output_dir behavior in medication_examples.md

docs: add output_dir="." to all save_annotated_documents examples

e6c3dcd

Prevents confusion from default `test_output/...` by explicitly saving to current directory.

Merge pull request google#17 from google/fix/output-dir-consistency

1fb1f1d

docs: add output_dir="." to all save_annotated_documents examples

build: add formatting & linting pipeline with pre-commit integration

13fbd2c

style: apply pyink, isort, and pre-commit formatting

c8d2027

ci: enable format and lint checks in tox

146a095

Merge pull request google#24 from google/feat/code-formatting-pipeline

aa6da18

feat: add code formatting and linting pipeline

Add LangExtractError base exception for centralized error handling

ed65bca

Introduces a common base exception class that all library-specific exceptions inherit from, enabling users to catch all LangExtract errors with a single except clause.

Merge pull request google#26 from google/feat/exception-hierarchy

6c4508b

Add LangExtractError base exception for centralized error handling

fix: Remove LangFun and pylibmagic dependencies (v1.0.2)

8b85225

Fixes google#25 - Windows installation failure due to pylibmagic build requirements Breaking change: LangFunLanguageModel removed. Use GeminiLanguageModel or OllamaLanguageModel instead.

Merge pull request google#28 from google/fix/remove-breaking-dep-langfun

88520cc

fix: Remove LangFun and pylibmagic dependencies to fix Windows installation and OpenAI SDK v1.x compatibility

Fix save_annotated_documents to handle string paths

75a6f12

- Modified save_annotated_documents to accept both pathlib.Path and string paths - Convert string paths to Path objects before calling mkdir() - This fixes the error when using output_dir='.' as shown in the README example

Merge pull request google#29 from google/fix-save-annotated-documents…

a415b94

…-mkdir Fix save_annotated_documents to handle string paths

feat: Add OpenAI language model support

8289b3a

Merge pull request google#31 from google/feature/add-oai-inference

c8ef723

feat: Add OpenAI language model support

Add PR template validation workflow (google#45)

dc61372

fix: Change OllamaLanguageModel parameter from 'model' to 'model_id' (g…

da771e6

…oogle#57) Fixes google#27

feat: Add CITATION.cff file for proper software citation

e83d5cf

docs: Update README to document OpenAI schema constraints support

c8ecbbf

- Update OpenAI example from README - Document that schema constraints now work with JSON format - Add note about FormatType and fence_output requirements - Clarify supported models and limitations

GirinMan changed the title ~~Feature/schema constraints for OpenAI~~ Implement schema constraints for OpenAI Aug 5, 2025

aksg87 and others added 2 commits August 5, 2025 18:47

aksg87 changed the title ~~Implement schema constraints for OpenAI~~ Implement schema constraints for OpenAI Aug 6, 2025

aksg87 and others added 4 commits August 6, 2025 09:13

Add workflow_dispatch trigger to CI workflow

1290d63

Enables manual triggering of CI workflow including live API tests. This allows maintainers to run live API tests for PRs from forks where the tests would normally be skipped for security reasons.

Add base_url to OpenAILanguageModel (google#51)

234081e

* Add base_url to OpenAILanguageModel * Github action lint is outdated, so adapting * Adding base_url to parameterized test * Lint fixes to inference_test.py

Add commit status to revalidation workflow

6fb66cf

- Creates visible PR checks (pass/fail status) - Shows validation errors in status description (up to 140 chars) - Links to workflow run for full details - Maintains backward compatibility with comment reporting

Fix boolean comparison in revalidation workflow

47a251e

The workflow was comparing boolean true to string 'true', causing all validations to incorrectly show as failed even when all checks passed.

aksg87 added the size/L Pull request with 600-1000 lines changed label Aug 7, 2025

aksg87 and others added 4 commits August 6, 2025 21:13

Fix CI to validate PR branch formatting directly

e6dcc8e

- Add format-check job that checks actual PR code, not merge commit - Validate formatting before expensive fork PR tests - Provide clear error messages when formatting fails Fixes false positives where incorrectly formatted PRs passed CI

Add PR update automation workflows

1c3c1a2

Auto-updates PRs behind main, handles forks/conflicts gracefully, skips bot/draft PRs, monitors API limits

aksg87 and others added 4 commits August 7, 2025 01:34

Fix workflow formatting

b60f0b2

- Apply end-of-file and whitespace fixes to workflows

Bump version to 1.0.5

b3bff86

Remove duplicate exceptions.py from root directory (google#94)

f3c1553

The exceptions.py file existed in both the root directory and langextract/ directory with identical content. This removes the duplicate from the root to avoid confusion and maintain proper package structure.

GirinMan added 2 commits August 8, 2025 17:06

Merge branch 'feature/schema-constraints-for-openai' into main

3d7e934

Merge pull request #1 from GirinMan/main

21c674b

merge main

aksg87 force-pushed the main branch from e36e455 to 3dff0d3 Compare August 21, 2025 01:43

GirinMan closed this Aug 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement schema constraints for OpenAI #61

Implement schema constraints for OpenAI #61

Uh oh!

GirinMan commented Aug 5, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Aug 6, 2025

Uh oh!

github-actions bot commented Aug 6, 2025

Uh oh!

github-actions bot commented Aug 6, 2025

Uh oh!

github-actions bot commented Aug 7, 2025

Uh oh!

github-actions bot commented Aug 7, 2025

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

github-actions bot commented Aug 14, 2025

Uh oh!

github-actions bot commented Aug 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Implement schema constraints for OpenAI #61

Implement schema constraints for OpenAI #61

Uh oh!

Conversation

GirinMan commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Tests Added

1. Unit Tests (tests/inference_test.py)

2. Schema Tests (tests/schema_test.py)

3. Integration Tests (tests/openai_extract_test.py)

Test Coverage

Running Tests

Checklist:

Uh oh!

github-actions bot commented Aug 6, 2025

Uh oh!

github-actions bot commented Aug 6, 2025

Uh oh!

github-actions bot commented Aug 6, 2025

Manual Validation Results

Uh oh!

github-actions bot commented Aug 7, 2025

Manual Validation Results

Uh oh!

github-actions bot commented Aug 7, 2025

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

github-actions bot commented Aug 14, 2025

Uh oh!

github-actions bot commented Aug 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

GirinMan commented Aug 5, 2025 •

edited

Loading

1. Unit Tests (`tests/inference_test.py`)

2. Schema Tests (`tests/schema_test.py`)

3. Integration Tests (`tests/openai_extract_test.py`)