Skip to content

Conversation

GirinMan
Copy link

@GirinMan GirinMan commented Aug 5, 2025

Description

Implements schema constraints for OpenAI models, enabling structured outputs with JSON format without requiring output fencing.

Fixes #59

Feature

How Has This Been Tested?

Comprehensive testing has been performed through automated tests and code review:

Tests Added

1. Unit Tests (tests/inference_test.py)

  • test_openai_schema_constraints_json: Validates that schema constraints work correctly with JSON format
  • test_openai_schema_constraints_yaml_raises_error: Ensures YAML format with schema constraints raises appropriate error
  • test_openai_with_schema_constraints: Verifies correct API parameters are sent to OpenAI when using structured outputs

2. Schema Tests (tests/schema_test.py)

  • OpenAISchemaTest class with multiple test cases:
    • Tests schema generation from empty extractions
    • Tests schema generation with and without attributes
    • Tests handling of list-type attributes
    • Tests custom attribute suffix functionality

3. Integration Tests (tests/openai_extract_test.py)

  • test_extract_with_openai_schema_constraints: End-to-end test of extract() function with OpenAI schema constraints
  • test_extract_openai_yaml_with_schema_raises_error: Validates error handling for unsupported YAML format
  • test_extract_openai_fence_output_with_schema_raises_error: Validates error handling for fence_output=True
  • test_extract_openai_without_schema_constraints: Ensures backward compatibility

Test Coverage

  • Schema generation and validation
  • API parameter construction with response_format
  • Error handling for unsupported configurations
  • Backward compatibility for existing functionality

Running Tests

Command:

$ pytest tests/inference_test.py
$ pytest tests/schema_test.py
$ pytest tests/openai_extract_test.py

Note: The tests use mocked OpenAI API responses to avoid requiring actual API keys during testing.

Checklist:

  • I have read and acknowledged Google's Open Source Code of conduct.
  • I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
  • I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
  • I have added tests, or I have ensured existing tests cover the changes
  • I have followed Google's Python Style Guide and ran pylint over the affected code.

aksg87 and others added 29 commits July 22, 2025 01:39
- Switch from badge.fury.io to shields.io for working PyPI badge
- Convert relative paths to absolute GitHub URLs for PyPI compatibility
- Bump version to 0.1.3
- Add GitHub Actions workflow for automated PyPI publishing via OIDC
- Configure trusted publishing environment for verified releases
- Update project metadata with proper URLs and license format
- Prepare for v1.0.0 stable release with production-ready automation
- Add pylibmagic>=0.5.0 dependency for bundled libraries
- Add [full] install option and pre-import handling
- Update README with troubleshooting and Docker sections
- Bump version to 1.0.1

Fixes google#6
Deleted an inline comment referencing the  output directory in the save_annotated_documents.
…ples.md

docs: clarify output_dir behavior in medication_examples.md
Prevents confusion from default `test_output/...` by explicitly saving to current directory.
docs: add output_dir="." to all save_annotated_documents examples
feat: add code formatting and linting pipeline
Introduces a common base exception class that all library-specific exceptions inherit from, enabling users to catch all LangExtract errors with a single except clause.
Add LangExtractError base exception for centralized error handling
Fixes google#25 - Windows installation failure due to pylibmagic build requirements

Breaking change: LangFunLanguageModel removed. Use GeminiLanguageModel or OllamaLanguageModel instead.
fix: Remove LangFun and pylibmagic dependencies to fix Windows installation and OpenAI SDK v1.x compatibility
- Modified save_annotated_documents to accept both pathlib.Path and string paths
- Convert string paths to Path objects before calling mkdir()
- This fixes the error when using output_dir='.' as shown in the README example
…-mkdir

Fix save_annotated_documents to handle string paths
feat: Add OpenAI language model support
…s: (google#10)

* docs: clarify output_dir behavior in medication_examples.md

* Removed inline comment in medication example

Deleted an inline comment referencing the  output directory in the save_annotated_documents.

* docs: add output_dir="." to all save_annotated_documents examples

Prevents confusion from default `test_output/...` by explicitly saving to current directory.

* build: add formatting & linting pipeline with pre-commit integration

* style: apply pyink, isort, and pre-commit formatting

* ci: enable format and lint checks in tox

* Add LangExtractError base exception for centralized error handling

Introduces a common base exception class that all library-specific exceptions inherit from, enabling users to catch all LangExtract errors with a single except clause.

* fix(ui): prevent current highlight border from being obscured

---------

Co-authored-by: Leena Kamran <62442533+kleeena@users.noreply.github.com>
Co-authored-by: Akshay Goel <akshay.k.goel@gmail.com>
- Gemini & OpenAI test suites with retry on transient errors
- CI: Separate job, Python 3.11 only, skips for forks
- Validates char_interval for all extractions
- Multilingual test xfail (issue google#13)

TODO: Remove xfail from multilingual test after tokenizer fix
- Add OpenAISchema class to generate JSON Schema compatible with OpenAI's structured outputs API
- Update OpenAILanguageModel to accept and use openai_schema parameter
- Configure response_format with json_schema when schema is provided
- Add validation to ensure schema constraints are only used with JSON format
- Update extract() function to generate OpenAI schemas when appropriate
- Support LANGEXTRACT_OPENAI_API_KEY environment variable

This enables use_schema_constraints=True with fence_output=False for OpenAI models when using FormatType.JSON. YAML format with schema constraints will raise a clear error.
- Add tests for OpenAILanguageModel with schema constraints
- Add tests for OpenAISchema generation from examples
- Add integration tests for extract() function with OpenAI
- Test validation errors for YAML format and fence_output=True
- Verify correct API parameters when using structured outputs
- Update OpenAI example from README
- Document that schema constraints now work with JSON format
- Add note about FormatType and fence_output requirements
- Clarify supported models and limitations
@GirinMan GirinMan changed the title Feature/schema constraints for OpenAI Implement schema constraints for OpenAI Aug 5, 2025
aksg87 and others added 2 commits August 5, 2025 18:47
* Add workflow_dispatch trigger to validation workflows

- Enable manual triggering for check-linked-issue, check-pr-size, and validate_pr_template
- Add conditional logic to ensure PR-specific steps only run on PR events
- Allows maintainers to manually trigger workflows when needed

* Add manual trigger to infrastructure protection workflow

- Add workflow_dispatch trigger
- Add conditional logic for PR-specific checks
- Ensures consistency across all validation workflows
- Change from pull_request to pull_request_target in all validation workflows
- This gives workflows proper permissions to add labels and comments on PRs from forks
- Fixes 'Resource not accessible by integration' error (HTTP 403)
- Safe because workflows only read PR metadata and don't execute PR code
@aksg87 aksg87 changed the title Implement schema constraints for OpenAI Implement schema constraints for OpenAI Aug 6, 2025
aksg87 and others added 4 commits August 6, 2025 09:13
Enables manual triggering of CI workflow including live API tests.
This allows maintainers to run live API tests for PRs from forks
where the tests would normally be skipped for security reasons.
Enables two ways to run live API tests:
1. workflow_dispatch: Manual trigger via Actions tab
2. Label trigger: Add 'ready-to-merge' label to any PR

The label-based approach uses pull_request_target for security:
- Runs in base repository context with access to secrets
- Safely merges PR into main branch before testing
- Only maintainers can trigger
- Comments test results back to PR

This provides a production-ready solution for testing PRs from forks
while maintaining security, following patterns used by major projects.
* Add base_url to OpenAILanguageModel

* Github action lint is outdated, so adapting

* Adding base_url to parameterized test

* Lint fixes to inference_test.py
Bug: Workflows triggered on pull_request_target but checked for pull_request,
causing all validations to be skipped.

Fixed:
- Event condition checks now match trigger type
- Add manual revalidation workflow
- Enable workflow_dispatch with PR number input
Copy link

github-actions bot commented Aug 6, 2025

Manual validation results:

Size: 798 lines
Template: ✓
Linked issue: ✓

Run ID: 16790875474

- Creates visible PR checks (pass/fail status)
- Shows validation errors in status description (up to 140 chars)
- Links to workflow run for full details
- Maintains backward compatibility with comment reporting
Copy link

github-actions bot commented Aug 6, 2025

Manual validation results:

Size: 798 lines
Template: ✓
Linked issue: ✓

Run ID: 16791196611

Copy link

github-actions bot commented Aug 6, 2025

Manual Validation Results

Status: ❌ Failed

Check Status Details
PR Size 798 lines
Template Complete
Linked Issue Found

View workflow run

The workflow was comparing boolean true to string 'true', causing all validations to incorrectly show as failed even when all checks passed.
Copy link

github-actions bot commented Aug 7, 2025

Manual Validation Results

Status: ✅ Passed

Check Status Details
PR Size 798 lines
Template Complete
Linked Issue Found

View workflow run

@aksg87 aksg87 added the size/L Pull request with 600-1000 lines changed label Aug 7, 2025
aksg87 and others added 4 commits August 6, 2025 21:13
- revalidate-all-prs.sh: Triggers manual validation for all open PRs
- add-size-labels.sh: Adds size labels (XS/S/M/L/XL) based on change count
- add-new-checks.sh: Adds required status checks to branch protection

These scripts require maintainer permissions and help manage PR workflows.
- Add type ignore comments for IPython imports
- Fix return type annotation (remove unnecessary quotes)
- Add _is_jupyter() to properly detect notebook environments
- Replace lambda with def function for pylint compliance

Fixes google#65
- Add format-check job that checks actual PR code, not merge commit
- Validate formatting before expensive fork PR tests
- Provide clear error messages when formatting fails

Fixes false positives where incorrectly formatted PRs passed CI
Auto-updates PRs behind main, handles forks/conflicts gracefully,
skips bot/draft PRs, monitors API limits
Copy link

github-actions bot commented Aug 7, 2025

⚠️ Branch Update Required

Your branch is 20 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

aksg87 and others added 4 commits August 7, 2025 01:34
- Apply end-of-file and whitespace fixes to workflows
- Fix empty interval bug when newline falls at chunk boundary (issue google#71)
- Add concise comment explaining the fix logic
- Remove excessive/obvious comments from chunking tests
- Improve test docstring to be more descriptive and professional
The exceptions.py file existed in both the root directory and langextract/ directory with identical content. This removes the duplicate from the root to avoid confusion and maintain proper package structure.
Copy link

github-actions bot commented Aug 8, 2025

Infrastructure File Protection

This PR modifies protected infrastructure files:

  • .github/workflows/validate_pr_template.yaml (14 changes)

Only repository maintainers are allowed to modify infrastructure files (including .github/, build configuration, and repository documentation).

Note: If these are only formatting changes, please:

  1. Revert changes to .github/ files
  2. Use ./autoformat.sh to format only source code directories
  3. Avoid running formatters on infrastructure files

If structural changes are necessary:

  1. Open an issue describing the needed infrastructure changes
  2. A maintainer will review and implement the changes if approved

For more information, see our Contributing Guidelines.

Copy link

⚠️ Branch Update Required

Your branch is 23 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

Copy link

⚠️ Branch Update Required

Your branch is 86 commits behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

@GirinMan GirinMan closed this Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Pull request with 600-1000 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema constraints for OpenAI not supported

5 participants