PDFToTextConverter OCR failed schema validation #5468

Mjmorell · 2023-07-28T18:12:49Z

I'm trying to deploy in a container environment and have tried 1.19.0, latest, and 1.18.1. When I enable OCR in a pipeline, I get a validation error.

The chunk of the pipeline.yaml pertaining to the converter:

      - name: PDFFileConverter
        type: PDFToTextConverter
        params:
          remove_numeric_tables: false
          ocr: full
          ocr_language: eng

The error, with some of the output replaced by ... for readability:


Missing definition for node of type PDFToTextConverter. Looking into local classes...
Added definition for PDFToTextConverter
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/haystack/pipelines/config.py", line 282, in validate_schema
    Draft7Validator(schema).validate(instance=pipeline_config)
  File "/opt/venv/lib/python3.10/site-packages/jsonschema/validators.py", line 430, in validate
    raise error
jsonschema.exceptions.ValidationError: {'name': 'PDFFileConverter', 'type': 'PDFToTextConverter', 'params': {'remove_numeric_tables': False, 'ocr': 'full', 'ocr_language': 'eng'}} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['components']['items']:
{'anyOf': [{'$ref': '#/definitions/DeepsetCloudDocumentStoreComponent'},
{'$ref': '#/definitions/ElasticsearchDocumentStoreComponent'},
...
{'$ref': '#/definitions/PDFToTextConverterComponent'}]}

On instance['components'][6]:
{'name': 'PDFFileConverter',
'params': {'ocr': 'full',
'ocr_language': 'eng',
'remove_numeric_tables': False},
'type': 'PDFToTextConverter'}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
...
  File "/opt/venv/lib/python3.10/site-packages/haystack/pipelines/config.py", line 208, in validate_config
    validate_schema(pipeline_config=pipeline_config, strict_version_check=strict_version_check, extras=extras)
  File "/opt/venv/lib/python3.10/site-packages/haystack/pipelines/config.py", line 300, in validate_schema
    raise PipelineSchemaError(
haystack.errors.PipelineSchemaError: Node of type PDFToTextConverter found, but it failed validation. Possible causes:
- The node is missing some mandatory parameter
- Wrong indentation of some parameter in YAML
See the stacktrace for more information.

I've searched around, and my guess is that the failure comes from the haystack-pipeline-1.19.0.schema.json referenced in the code, which does not list ocr as a valid parameter for this component.

Here's the chunk from the schema:

"PDFToTextConverterComponent": {
      "type": "object",
      "properties": {
        "name": {
          "title": "Name",
          "description": "Custom name for the component. Helpful for visualization and debugging.",
          "type": "string"
        },
        "type": {
          "title": "Type",
          "description": "Haystack Class name for the component.",
          "type": "string",
          "const": "PDFToTextConverter"
        },
        "params": {
          "title": "Parameters",
          "type": "object",
          "properties": {
            "remove_numeric_tables": {
              "title": "Remove Numeric Tables",
              "default": false,
              "type": "boolean"
            },
            "valid_languages": {
              "title": "Valid Languages",
              "anyOf": [
                {
                  "type": "array",
                  "items": {
                    "type": "string"
                  }
                },
                {
                  "type": "null"
                }
              ]
            },
            "id_hash_keys": {
              "title": "Id Hash Keys",
              "anyOf": [
                {
                  "type": "array",
                  "items": {
                    "type": "string"
                  }
                },
                {
                  "type": "null"
                }
              ]
            },
            "encoding": {
              "title": "Encoding",
              "default": "UTF-8",
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ]
            },
            "keep_physical_layout": {
              "title": "Keep Physical Layout",
              "default": false,
              "type": "boolean"
            }
          },
          "additionalProperties": false,
          "description": "Each parameter can reference other components defined in the same YAML file."
        }
      },
      "required": [
        "type",
        "name"
      ],
      "additionalProperties": false
    },
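Because the params object in this fragment sets "additionalProperties": false, any key not listed under "properties" is rejected. A minimal sketch of that check, using only the standard library and a trimmed copy of the fragment above (unknown_params is a hypothetical helper written for illustration, not part of Haystack or jsonschema):

```python
import json

# Trimmed copy of the "params" part of the schema fragment quoted above.
# Only the property names and the additionalProperties flag matter here.
PARAMS_SCHEMA = json.loads("""
{
  "properties": {
    "remove_numeric_tables": {"type": "boolean"},
    "valid_languages": {},
    "id_hash_keys": {},
    "encoding": {},
    "keep_physical_layout": {"type": "boolean"}
  },
  "additionalProperties": false
}
""")


def unknown_params(params, params_schema):
    """Return the keys that "additionalProperties": false would reject."""
    # If additional properties are allowed, nothing is rejected on this basis.
    if params_schema.get("additionalProperties", True) is not False:
        return []
    allowed = set(params_schema.get("properties", {}))
    return sorted(key for key in params if key not in allowed)


# The params from the pipeline.yaml chunk at the top of this post.
pipeline_params = {"remove_numeric_tables": False, "ocr": "full", "ocr_language": "eng"}
rejected = unknown_params(pipeline_params, PARAMS_SCHEMA)
print(rejected)  # ['ocr', 'ocr_language']
```

Both ocr and ocr_language fall outside the listed properties, which is consistent with the "is not valid under any of the given schemas" message in the traceback.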

Some documentation from deepset.ai:

I can confirm that farm-haystack[ocr] is installed in the containers: running pip install again reports that everything is already up to date.

I may be misunderstanding something. I'm trying to extract text from scanned PDF documents. The pipeline I've been testing with parses other files fine when the ocr option is disabled; I only recently tried adding the OCR functionality. I'm also working from a container, using the pipeline.yaml file as the main configuration.

Thanks for any help.

Answered by vblagoje

vblagoje · 2023-07-31T10:43:48Z

Maintainer

Hey @Mjmorell, would you please try the following? Install farm-haystack[pdf] as well, and then check whether your pipeline loads correctly. Let us know.
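To see from inside the container which packages are actually present, a check along these lines may help. It is only a sketch: installed_version is a hypothetical helper, and the distribution name PyMuPDF is an assumption about what the [pdf] extra installs (this thread does not say), so adjust the names to your setup.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "farm-haystack" is the package named in this thread.
# "PyMuPDF" is an assumed dependency of the [pdf] extra, used for illustration.
for name in ("farm-haystack", "PyMuPDF"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run it with the same interpreter that starts the pipeline (here, the one under /opt/venv), since a different environment can report different packages.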
