I'm attempting to deploy in a container environment and have tried 1.19.0, latest, and 1.18.1. When I enable OCR in a pipeline, I get a validation error. The chunk of the pipeline.yaml pertaining to the converter:

- name: PDFFileConverter
  type: PDFToTextConverter
  params:
    remove_numeric_tables: false
    ocr: full
    ocr_language: eng

The error, with some bloat replaced with
I've searched around, and my guess is that the failure comes from the haystack-pipeline-1.19.0.schema.json referenced in the code, which does not have

Here's the chunk from the schema:

"PDFToTextConverterComponent": {
  "type": "object",
  "properties": {
    "name": {
      "title": "Name",
      "description": "Custom name for the component. Helpful for visualization and debugging.",
      "type": "string"
    },
    "type": {
      "title": "Type",
      "description": "Haystack Class name for the component.",
      "type": "string",
      "const": "PDFToTextConverter"
    },
    "params": {
      "title": "Parameters",
      "type": "object",
      "properties": {
        "remove_numeric_tables": {
          "title": "Remove Numeric Tables",
          "default": false,
          "type": "boolean"
        },
        "valid_languages": {
          "title": "Valid Languages",
          "anyOf": [
            {
              "type": "array",
              "items": {
                "type": "string"
              }
            },
            {
              "type": "null"
            }
          ]
        },
        "id_hash_keys": {
          "title": "Id Hash Keys",
          "anyOf": [
            {
              "type": "array",
              "items": {
                "type": "string"
              }
            },
            {
              "type": "null"
            }
          ]
        },
        "encoding": {
          "title": "Encoding",
          "default": "UTF-8",
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ]
        },
        "keep_physical_layout": {
          "title": "Keep Physical Layout",
          "default": false,
          "type": "boolean"
        }
      },
      "additionalProperties": false,
      "description": "Each parameter can reference other components defined in the same YAML file."
    }
  },
  "required": [
    "type",
    "name"
  ],
  "additionalProperties": false
},

Some documentation from deepset.ai:

I can verify farm-haystack[ocr] is installed on the containers: when I run pip install, it reports the package as already up to date. I don't know if I'm misunderstanding something; I'm trying to pull text from scanned PDF documents. The pipeline I've been testing with parses other files fine when the OCR option is disabled. I've only recently tried adding the OCR functionality. I'm also working from a container, using the pipeline.yaml file as the main configuration. Thanks for any help.
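To illustrate the guess about the schema, here is a minimal sketch of why a params block declared with "additionalProperties": false would reject keys it does not list. This is not Haystack's actual validation code; `PARAMS_SCHEMA` and `unexpected_params` are hypothetical names, and the schema is trimmed to the property names from the chunk quoted above.

```python
import json

# Trimmed copy of the "params" part of the schema chunk quoted in the post.
# Only the property names and the additionalProperties flag matter here.
PARAMS_SCHEMA = json.loads("""
{
  "properties": {
    "remove_numeric_tables": {"type": "boolean"},
    "valid_languages": {},
    "id_hash_keys": {},
    "encoding": {},
    "keep_physical_layout": {"type": "boolean"}
  },
  "additionalProperties": false
}
""")


def unexpected_params(params: dict, schema: dict) -> list:
    """Return parameter names that a schema with additionalProperties=false would reject."""
    if schema.get("additionalProperties", True) is not False:
        # Extra keys are allowed, so nothing is unexpected.
        return []
    allowed = set(schema.get("properties", {}))
    return sorted(name for name in params if name not in allowed)


# The params from the pipeline.yaml snippet in the post.
params = {"remove_numeric_tables": False, "ocr": "full", "ocr_language": "eng"}
print(unexpected_params(params, PARAMS_SCHEMA))  # prints ['ocr', 'ocr_language']
```

If the real schema file behaves the same way, any parameter missing from its "properties" list would fail validation regardless of which extras are installed in the container.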
Replies: 1 comment
Hey @Mjmorell, would you please try the following? Install farm-haystack[pdf] as well, then check whether your pipeline loads correctly. Let us know.
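Before reloading the pipeline, a quick check inside the container can confirm whether the optional dependencies are importable. This is only a rough sketch: the mapping from extras to module names (`fitz` for the pdf extra, `pytesseract` for the ocr extra) is an assumption and may not match your Haystack version, and `EXTRA_MODULES` and `missing_modules` are hypothetical names.

```python
import importlib.util

# Assumed mapping from farm-haystack extras to one module each is expected to provide.
# Adjust the module names if your Haystack version depends on different packages.
EXTRA_MODULES = {"pdf": "fitz", "ocr": "pytesseract"}


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for extra, module in EXTRA_MODULES.items():
        status = "missing" if missing_modules([module]) else "found"
        print(f"farm-haystack[{extra}] -> {module}: {status}")
```

Run it with the same Python interpreter that serves the pipeline, since a different interpreter in the container may see a different set of installed packages.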