♻️ update display & structure for invoice splitter v1 #302

Merged
merged 2 commits into from
Mar 11, 2025
4 changes: 1 addition & 3 deletions docs/extras/code_samples/invoice_splitter_v1_async.txt
@@ -1,6 +1,4 @@
-from mindee import Client, product
-from time import sleep
-from mindee.parsing.common import AsyncPredictResponse
+from mindee import Client, product, AsyncPredictResponse

 # Init a new client
 mindee_client = Client(api_key="my-api-key")
78 changes: 44 additions & 34 deletions docs/extras/guide/invoice_splitter_v1.md
@@ -1,21 +1,17 @@
 ---
-title: Invoice Splitter API Python
+title: Invoice Splitter OCR Python
 category: 622b805aaec68102ea7fcbc2
-slug: python-invoice-splitter-api
+slug: python-invoice-splitter-ocr
 parentDoc: 609808f773b0b90051d839de
 ---
 The Python OCR SDK supports the [Invoice Splitter API](https://platform.mindee.com/mindee/invoice_splitter).

-Using [this sample](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf), we are going to illustrate how to detect the pages of multiple invoices within the same document.
+Using the [sample below](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf), we are going to illustrate how to extract the data that we want using the OCR SDK.
+![Invoice Splitter sample](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf?raw=true)

 # Quick-Start
-
-> **⚠️ Important:** This API only works **asynchronously**, which means that documents have to be sent and retrieved in a specific way:
-
 ```py
-from mindee import Client, product
-from time import sleep
-from mindee.parsing.common import AsyncPredictResponse
+from mindee import Client, product, AsyncPredictResponse

 # Init a new client
 mindee_client = Client(api_key="my-api-key")
@@ -31,66 +27,80 @@ result: AsyncPredictResponse = mindee_client.enqueue_and_parse(

 # Print a brief summary of the parsed data
 print(result.document)
+
 ```

 **Output (RST):**
-
 ```rst
 ########
 Document
 ########
-:Mindee ID: 8c25cc63-212b-4537-9c9b-3fbd3bd0ee20
-:Filename: default_sample.jpg
+:Mindee ID: 15ad7a19-7b75-43d0-b0c6-9a641a12b49b
+:Filename: default_sample.pdf

 Inference
 #########
-:Product: mindee/carte_vitale v1.0
-:Rotation applied: Yes
+:Product: mindee/invoice_splitter v1.1
+:Rotation applied: No

 Prediction
 ==========
-:Given Name(s): NATHALIE
-:Surname: DURAND
-:Social Security Number: 269054958815780
-:Issuance Date: 2007-01-01
+:Invoice Page Groups:
+  :Page indexes: 0
+  :Page indexes: 1

 Page Predictions
 ================

 Page 0
 ------
-:Given Name(s): NATHALIE
-:Surname: DURAND
-:Social Security Number: 269054958815780
-:Issuance Date: 2007-01-01
+:Invoice Page Groups:
+
+Page 1
+------
+:Invoice Page Groups:
 ```

 # Field Types
+## Standard Fields
+These fields are generic and used in several products.

-## Specific Fields
+### BaseField
+Each prediction object contains a set of fields that inherit from the generic `BaseField` class.
+A typical `BaseField` object will have the following attributes:

-### Page Group
+* **value** (`Union[float, str]`): corresponds to the field value. Can be `None` if no value was extracted.
+* **confidence** (`float`): the confidence score of the field prediction.
+* **bounding_box** (`[Point, Point, Point, Point]`): contains exactly 4 relative vertices (points) coordinates of a right rectangle containing the field in the document.
+* **polygon** (`List[Point]`): contains the relative vertices coordinates (`Point`) of a polygon containing the field in the image.
+* **page_id** (`int`): the ID of the page, always `None` when at document-level.
+* **reconstructed** (`bool`): indicates whether an object was reconstructed (not extracted as the API gave it).

-List of page group indexes.
+> **Note:** A `Point` simply refers to a List of two numbers (`[float, float]`).

-An `InvoiceSplitterV1PageGroup` implements the following attributes:

-- **page_indexes** (`float`\[]): List of indexes of the pages of a single invoice.
-- **confidence** (`float`): The confidence of the prediction.
+Aside from the previous attributes, all basic fields have access to a custom `__str__` method that can be used to print their value as a string.

-# Attributes
+## Specific Fields
+Fields which are specific to this product; they are not used in any other product.

+### Invoice Page Groups Field
+List of page groups. Each group represents a single invoice within a multi-invoice document.
+
+A `InvoiceSplitterV1InvoicePageGroup` implements the following attributes:
+
+* **page_indexes** (`List[int]`): List of page indexes that belong to the same invoice (group).
+
+# Attributes
 The following fields are extracted for Invoice Splitter V1:

 ## Invoice Page Groups
-
-**invoice_page_groups** ([InvoiceSplitterV1PageGroup](#invoice-splitter-v1-page-group)\[]): List of page indexes that belong to the same invoice in the PDF.
+**invoice_page_groups** (List[[InvoiceSplitterV1InvoicePageGroup](#invoice-page-groups-field)]): List of page groups. Each group represents a single invoice within a multi-invoice document.

 ```py
-for invoice_page_groups_elem in page.prediction.invoice_page_groups):
-    print(invoice_page_groups_elem)
+for invoice_page_groups_elem in result.document.inference.prediction.invoice_page_groups:
+    print(invoice_page_groups_elem.value)
 ```

 # Questions?
-
 [Join our Slack](https://join.slack.com/t/mindee-community/shared_invite/zt-2d0ds7dtz-DPAF81ZqTy20chsYpQBW5g)
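
Reviewer note: the guide's loop prints `invoice_page_groups_elem.value`, while the `InvoiceSplitterV1InvoicePageGroup` class added in this PR defines `page_indexes` and no `value` attribute. The sketch below shows one way the page groups might be consumed through `page_indexes`. It does not call the Mindee API: `PageGroupStandIn` and `describe_groups` are hypothetical names used for illustration only, and the data mirrors the two single-page groups in the sample output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageGroupStandIn:
    """Hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup (not the SDK class)."""

    page_indexes: List[int]
    confidence: float = 0.0


def describe_groups(groups: List[PageGroupStandIn]) -> List[str]:
    """Build one human-readable line per detected invoice."""
    return [
        f"Invoice {number}: pages {', '.join(str(i) for i in group.page_indexes)}"
        for number, group in enumerate(groups, start=1)
    ]


# Assumed data: two single-page invoices, as in the sample output of the guide.
groups = [PageGroupStandIn([0]), PageGroupStandIn([1])]
for line in describe_groups(groups):
    print(line)
```

With a real response, the same loop would presumably iterate over `result.document.inference.prediction.invoice_page_groups`.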
4 changes: 4 additions & 0 deletions docs/product/invoice_splitter_v1.rst
@@ -13,3 +13,7 @@ Invoice Splitter V1
 .. autoclass:: mindee.product.invoice_splitter.invoice_splitter_v1_document.InvoiceSplitterV1Document
    :members:
    :inherited-members:
+
+.. autoclass:: mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group.InvoiceSplitterV1InvoicePageGroup
+   :members:
+   :inherited-members:
10 changes: 4 additions & 6 deletions mindee/extraction/pdf_extractor/pdf_extractor.py
@@ -8,9 +8,7 @@
 from mindee.error.mindee_error import MindeeError
 from mindee.extraction.pdf_extractor.extracted_pdf import ExtractedPdf
 from mindee.input.sources.local_input_source import LocalInputSource
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
-)
+from mindee.product.invoice_splitter import InvoiceSplitterV1InvoicePageGroup


 class PdfExtractor:
@@ -76,7 +74,7 @@ def extract_sub_documents(

     def extract_invoices(
         self,
-        page_indexes: List[Union[InvoiceSplitterV1PageGroup, List[int]]],
+        page_indexes: List[Union[InvoiceSplitterV1InvoicePageGroup, List[int]]],
         strict: bool = False,
     ) -> List[ExtractedPdf]:
         """
@@ -88,7 +86,7 @@ def extract_invoices(
         """
         if len(page_indexes) < 1:
             raise MindeeError("No indexes provided.")
-        if not isinstance(page_indexes[0], InvoiceSplitterV1PageGroup):
+        if not isinstance(page_indexes[0], InvoiceSplitterV1InvoicePageGroup):
             return self.extract_sub_documents(page_indexes)  # type: ignore
         if not strict:
             indexes_as_list = [page_index.page_indexes for page_index in page_indexes]  # type: ignore
@@ -97,7 +95,7 @@ def extract_invoices(
         current_list: List[int] = []
         previous_confidence: Optional[float] = None
         for i, page_index in enumerate(page_indexes):
-            assert isinstance(page_index, InvoiceSplitterV1PageGroup)
+            assert isinstance(page_index, InvoiceSplitterV1InvoicePageGroup)
             confidence = page_index.confidence
             page_list = page_index.page_indexes

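
Reviewer note: the visible hunks show `extract_invoices` accepting either page-group objects or plain lists of page indexes, and dispatching on the type of the first element. A minimal sketch of that non-strict normalization follows. `PageGroupStandIn` and `normalize_page_indexes` are hypothetical names, `ValueError` stands in for `MindeeError`, and the strict, confidence-based branch is not shown in this diff, so it is left out here as well.

```python
from typing import List, Union


class PageGroupStandIn:
    """Hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup (not the SDK class)."""

    def __init__(self, page_indexes: List[int], confidence: float = 1.0) -> None:
        self.page_indexes = page_indexes
        self.confidence = confidence


def normalize_page_indexes(
    page_indexes: List[Union[PageGroupStandIn, List[int]]],
) -> List[List[int]]:
    """Return plain lists of page indexes, whatever form the caller passed in."""
    if len(page_indexes) < 1:
        # The SDK raises MindeeError here; ValueError keeps the sketch self-contained.
        raise ValueError("No indexes provided.")
    # Same dispatch as the diff: only the first element's type is inspected.
    if not isinstance(page_indexes[0], PageGroupStandIn):
        return [list(item) for item in page_indexes]  # type: ignore[arg-type]
    return [list(group.page_indexes) for group in page_indexes]  # type: ignore[union-attr]


print(normalize_page_indexes([[0, 1], [2]]))
print(normalize_page_indexes([PageGroupStandIn([0, 1]), PageGroupStandIn([2])]))
```

Because only the first element is checked, a list mixing both forms would not be handled by this sketch.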
4 changes: 2 additions & 2 deletions mindee/product/invoice_splitter/__init__.py
@@ -2,6 +2,6 @@
 from mindee.product.invoice_splitter.invoice_splitter_v1_document import (
     InvoiceSplitterV1Document,
 )
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
+from mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group import (
+    InvoiceSplitterV1InvoicePageGroup,
 )
12 changes: 9 additions & 3 deletions mindee/product/invoice_splitter/invoice_splitter_v1.py
@@ -9,7 +9,7 @@


 class InvoiceSplitterV1(Inference):
-    """Inference prediction for Invoice Splitter, API version 1."""
+    """Invoice Splitter API version 1 inference prediction."""

     prediction: InvoiceSplitterV1Document
     """Document-level prediction."""
@@ -20,14 +20,20 @@ class InvoiceSplitterV1(Inference):
     endpoint_version = "1"
     """Version of the endpoint."""

-    def __init__(self, raw_prediction: StringDict) -> None:
+    def __init__(self, raw_prediction: StringDict):
         """
         Invoice Splitter v1 inference.

         :param raw_prediction: Raw prediction from the HTTP response.
         """
         super().__init__(raw_prediction)
+
         self.prediction = InvoiceSplitterV1Document(raw_prediction["prediction"])
         self.pages = []
         for page in raw_prediction["pages"]:
-            self.pages.append(Page(InvoiceSplitterV1Document, page))
+            try:
+                page_prediction = page["prediction"]
+            except KeyError:
+                continue
+            if page_prediction:
+                self.pages.append(Page(InvoiceSplitterV1Document, page))
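
Reviewer note: the new loop keeps a page only when it carries a non-empty `prediction` entry. The sketch below reproduces that filter on plain dictionaries so the three cases are easy to see. `pages_with_predictions` and the sample payload are hypothetical, and the real constructor wraps each kept page in a `Page` object rather than returning the raw dict.

```python
from typing import Any, Dict, List


def pages_with_predictions(raw_pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the pages that have a non-empty "prediction" entry."""
    kept: List[Dict[str, Any]] = []
    for page in raw_pages:
        try:
            page_prediction = page["prediction"]
        except KeyError:
            continue  # page has no prediction key at all
        if page_prediction:  # an empty dict is falsy and gets skipped
            kept.append(page)
    return kept


# Assumed payload shapes, for illustration only.
raw_pages = [
    {"id": 0, "prediction": {"invoice_page_groups": []}},  # kept: the dict is non-empty
    {"id": 1, "prediction": {}},  # skipped: empty prediction
    {"id": 2},  # skipped: missing key
]
print([page["id"] for page in pages_with_predictions(raw_pages)])  # [0]
```

A prediction whose only content is an empty `invoice_page_groups` list still counts as non-empty, which would be consistent with the sample output listing pages that have no groups.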
61 changes: 40 additions & 21 deletions mindee/product/invoice_splitter/invoice_splitter_v1_document.py
@@ -1,38 +1,57 @@
-from typing import List
+from typing import List, Optional

 from mindee.parsing.common.prediction import Prediction
 from mindee.parsing.common.string_dict import StringDict
 from mindee.parsing.common.summary_helper import clean_out_string
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
+from mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group import (
+    InvoiceSplitterV1InvoicePageGroup,
 )


 class InvoiceSplitterV1Document(Prediction):
-    """Document data for Invoice Splitter, API version 1."""
+    """Invoice Splitter API version 1.2 document data."""

-    invoice_page_groups: List[InvoiceSplitterV1PageGroup]
-    """Page groups linked to an invoice."""
+    invoice_page_groups: List[InvoiceSplitterV1InvoicePageGroup]
+    """List of page groups. Each group represents a single invoice within a multi-invoice document."""

-    def __init__(self, raw_prediction: StringDict) -> None:
+    def __init__(
+        self,
+        raw_prediction: StringDict,
+        page_id: Optional[int] = None,
+    ):
         """
         Invoice Splitter document.

         :param raw_prediction: Raw prediction from HTTP response
+        :param page_id: Page number for multi pages pdf input
         """
-        super().__init__(raw_prediction)
-
-        invoice_page_groups = []
-        if (
-            "invoice_page_groups" in raw_prediction
-            and len(raw_prediction["invoice_page_groups"]) > 0
-        ):
-            for prediction in raw_prediction["invoice_page_groups"]:
-                invoice_page_groups.append(InvoiceSplitterV1PageGroup(prediction))
-        self.invoice_page_groups = invoice_page_groups
+        super().__init__(raw_prediction, page_id)
+        self.invoice_page_groups = [
+            InvoiceSplitterV1InvoicePageGroup(prediction, page_id=page_id)
+            for prediction in raw_prediction["invoice_page_groups"]
+        ]
+
+    @staticmethod
+    def _invoice_page_groups_separator(char: str) -> str:
+        out_str = " "
+        out_str += f"+{char * 74}"
+        return out_str + "+"
+
+    def _invoice_page_groups_to_str(self) -> str:
+        if not self.invoice_page_groups:
+            return ""
+
+        lines = f"\n{self._invoice_page_groups_separator('-')}\n ".join(
+            [item.to_table_line() for item in self.invoice_page_groups]
+        )
+        out_str = ""
+        out_str += f"\n{self._invoice_page_groups_separator('-')}\n "
+        out_str += " | Page Indexes "
+        out_str += f" |\n{self._invoice_page_groups_separator('=')}"
+        out_str += f"\n {lines}"
+        out_str += f"\n{self._invoice_page_groups_separator('-')}"
+        return out_str

     def __str__(self) -> str:
-        page_group_str = ":Invoice Page Groups:"
-        for page_group in self.invoice_page_groups:
-            page_group_str += f"\n  {str(page_group)}"
-        return clean_out_string(page_group_str)
+        out_str: str = f":Invoice Page Groups: {self._invoice_page_groups_to_str()}\n"
+        return clean_out_string(out_str)
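
Reviewer note: `_invoice_page_groups_to_str` now renders the groups as an rST grid table with one 72-character column, as implied by the `74` in the separator and the `:<72` format spec in `to_table_line`. The sketch below illustrates that layout in isolation. The function names are hypothetical, the leading indentation and header padding used by the SDK are omitted because the whitespace inside the string literals did not survive in this view of the diff, and `clean_out_string` is not applied.

```python
from typing import List

COLUMN_WIDTH = 72  # assumption taken from the `:<72` format spec in to_table_line


def separator(char: str) -> str:
    """Horizontal rule; the extra 2 covers one padding space on each side of the cell."""
    return f"+{char * (COLUMN_WIDTH + 2)}+"


def table_line(page_indexes: List[int]) -> str:
    """One table row listing the page indexes of a single group."""
    cell = ", ".join(str(index) for index in page_indexes)
    return f"| {cell:<{COLUMN_WIDTH}} |"


def render_table(groups: List[List[int]]) -> str:
    """Render all groups as a single-column rST grid table; empty input gives ''."""
    if not groups:
        return ""
    header = f"| {'Page Indexes':<{COLUMN_WIDTH}} |"
    rows = f"\n{separator('-')}\n".join(table_line(group) for group in groups)
    return "\n".join([separator("-"), header, separator("="), rows, separator("-")])


print(render_table([[0], [1]]))
```

Every rendered line is 76 characters wide, and the header is set off from the rows by the `=` rule, which is how rST marks a header row in a grid table.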
56 changes: 56 additions & 0 deletions mindee/product/invoice_splitter/invoice_splitter_v1_invoice_page_group.py
@@ -0,0 +1,56 @@
+from typing import Dict, List, Optional
+
+from mindee.parsing.common.string_dict import StringDict
+from mindee.parsing.common.summary_helper import clean_out_string
+from mindee.parsing.standard.base import FieldConfidenceMixin, FieldPositionMixin
+
+
+class InvoiceSplitterV1InvoicePageGroup(FieldPositionMixin, FieldConfidenceMixin):
+    """List of page groups. Each group represents a single invoice within a multi-invoice document."""
+
+    page_indexes: List[int]
+    """List of page indexes that belong to the same invoice (group)."""
+    page_n: int
+    """The document page on which the information was found."""
+
+    def __init__(
+        self,
+        raw_prediction: StringDict,
+        page_id: Optional[int] = None,
+    ):
+        self._set_confidence(raw_prediction)
+        self._set_position(raw_prediction)
+
+        if page_id is None:
+            try:
+                self.page_n = raw_prediction["page_id"]
+            except KeyError:
+                pass
+        else:
+            self.page_n = page_id
+
+        self.page_indexes = raw_prediction["page_indexes"]
+
+    def _printable_values(self) -> Dict[str, str]:
+        """Return values for printing."""
+        out_dict: Dict[str, str] = {}
+        out_dict["page_indexes"] = ", ".join([str(elem) for elem in self.page_indexes])
+        return out_dict
+
+    def _table_printable_values(self) -> Dict[str, str]:
+        """Return values for printing inside an RST table."""
+        out_dict: Dict[str, str] = {}
+        out_dict["page_indexes"] = ", ".join([str(elem) for elem in self.page_indexes])
+        return out_dict
+
+    def to_table_line(self) -> str:
+        """Output in a format suitable for inclusion in an rST table."""
+        printable = self._table_printable_values()
+        out_str: str = f"| {printable['page_indexes']:<72} | "
+        return clean_out_string(out_str)
+
+    def __str__(self) -> str:
+        """Default string representation."""
+        printable = self._printable_values()
+        out_str: str = f"Page Indexes: {printable['page_indexes']}, \n"
+        return clean_out_string(out_str)
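
Reviewer note: the constructor resolves `page_n` from two sources. An explicit `page_id` argument wins, otherwise the `page_id` key of the raw prediction is used when present. A minimal sketch of that precedence follows. `resolve_page_n` is a hypothetical helper, and unlike the class above, which leaves `page_n` unset when neither source is available, it returns `None` in that case.

```python
from typing import Any, Dict, Optional


def resolve_page_n(
    raw_prediction: Dict[str, Any],
    page_id: Optional[int] = None,
) -> Optional[int]:
    """Pick the page number: explicit argument first, raw prediction second."""
    if page_id is not None:
        return page_id
    # Assumption: fall back to None instead of leaving the attribute unset.
    return raw_prediction.get("page_id")


print(resolve_page_n({"page_indexes": [0], "page_id": 3}))  # 3
print(resolve_page_n({"page_indexes": [0], "page_id": 3}, page_id=0))  # 0
print(resolve_page_n({"page_indexes": [0]}))  # None
```

The second call shows why the check is `is not None` rather than a truthiness test: a `page_id` of `0` is a valid page number and must still take precedence.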
22 changes: 0 additions & 22 deletions mindee/product/invoice_splitter/invoice_splitter_v1_page_group.py

This file was deleted.