Commit e834152

♻️ update display & structure for invoice splitter v1 (#302)
1 parent 931f496 commit e834152

11 files changed: +183 -117 lines changed

docs/extras/code_samples/invoice_splitter_v1_async.txt

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,4 @@
-from mindee import Client, product
-from time import sleep
-from mindee.parsing.common import AsyncPredictResponse
+from mindee import Client, product, AsyncPredictResponse
 
 # Init a new client
 mindee_client = Client(api_key="my-api-key")
Lines changed: 44 additions & 34 deletions
@@ -1,21 +1,17 @@
 ---
-title: Invoice Splitter API Python
+title: Invoice Splitter OCR Python
 category: 622b805aaec68102ea7fcbc2
-slug: python-invoice-splitter-api
+slug: python-invoice-splitter-ocr
 parentDoc: 609808f773b0b90051d839de
 ---
 The Python OCR SDK supports the [Invoice Splitter API](https://platform.mindee.com/mindee/invoice_splitter).
 
-Using [this sample](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf), we are going to illustrate how to detect the pages of multiple invoices within the same document.
+Using the [sample below](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf), we are going to illustrate how to extract the data that we want using the OCR SDK.
+![Invoice Splitter sample](https://github.com/mindee/client-lib-test-data/blob/main/products/invoice_splitter/default_sample.pdf?raw=true)
 
 # Quick-Start
-
-> **⚠️ Important:** This API only works **asynchronously**, which means that documents have to be sent and retrieved in a specific way:
-
 ```py
-from mindee import Client, product
-from time import sleep
-from mindee.parsing.common import AsyncPredictResponse
+from mindee import Client, product, AsyncPredictResponse
 
 # Init a new client
 mindee_client = Client(api_key="my-api-key")
@@ -31,66 +27,80 @@ result: AsyncPredictResponse = mindee_client.enqueue_and_parse(
 
 # Print a brief summary of the parsed data
 print(result.document)
+
 ```
 
 **Output (RST):**
-
 ```rst
 ########
 Document
 ########
-:Mindee ID: 8c25cc63-212b-4537-9c9b-3fbd3bd0ee20
-:Filename: default_sample.jpg
+:Mindee ID: 15ad7a19-7b75-43d0-b0c6-9a641a12b49b
+:Filename: default_sample.pdf
 
 Inference
 #########
-:Product: mindee/carte_vitale v1.0
-:Rotation applied: Yes
+:Product: mindee/invoice_splitter v1.1
+:Rotation applied: No
 
 Prediction
 ==========
-:Given Name(s): NATHALIE
-:Surname: DURAND
-:Social Security Number: 269054958815780
-:Issuance Date: 2007-01-01
+:Invoice Page Groups:
+:Page indexes: 0
+:Page indexes: 1
 
 Page Predictions
 ================
 
 Page 0
 ------
-:Given Name(s): NATHALIE
-:Surname: DURAND
-:Social Security Number: 269054958815780
-:Issuance Date: 2007-01-01
+:Invoice Page Groups:
+
+Page 1
+------
+:Invoice Page Groups:
 ```
 
 # Field Types
+## Standard Fields
+These fields are generic and used in several products.
 
-## Specific Fields
+### BaseField
+Each prediction object contains a set of fields that inherit from the generic `BaseField` class.
+A typical `BaseField` object will have the following attributes:
 
-### Page Group
+* **value** (`Union[float, str]`): corresponds to the field value. Can be `None` if no value was extracted.
+* **confidence** (`float`): the confidence score of the field prediction.
+* **bounding_box** (`[Point, Point, Point, Point]`): contains exactly 4 relative vertices (points) coordinates of a right rectangle containing the field in the document.
+* **polygon** (`List[Point]`): contains the relative vertices coordinates (`Point`) of a polygon containing the field in the image.
+* **page_id** (`int`): the ID of the page, always `None` when at document-level.
+* **reconstructed** (`bool`): indicates whether an object was reconstructed (not extracted as the API gave it).
 
-List of page group indexes.
+> **Note:** A `Point` simply refers to a List of two numbers (`[float, float]`).
 
-An `InvoiceSplitterV1PageGroup` implements the following attributes:
 
-- **page_indexes** (`float`\[]): List of indexes of the pages of a single invoice.
-- **confidence** (`float`): The confidence of the prediction.
+Aside from the previous attributes, all basic fields have access to a custom `__str__` method that can be used to print their value as a string.
 
-# Attributes
+## Specific Fields
+Fields which are specific to this product; they are not used in any other product.
+
+### Invoice Page Groups Field
+List of page groups. Each group represents a single invoice within a multi-invoice document.
+
+A `InvoiceSplitterV1InvoicePageGroup` implements the following attributes:
+
+* **page_indexes** (`List[int]`): List of page indexes that belong to the same invoice (group).
 
+# Attributes
 The following fields are extracted for Invoice Splitter V1:
 
 ## Invoice Page Groups
-
-**invoice_page_groups** ([InvoiceSplitterV1PageGroup](#invoice-splitter-v1-page-group)\[]): List of page indexes that belong to the same invoice in the PDF.
+**invoice_page_groups** (List[[InvoiceSplitterV1InvoicePageGroup](#invoice-page-groups-field)]): List of page groups. Each group represents a single invoice within a multi-invoice document.
 
 ```py
-for invoice_page_groups_elem in page.prediction.invoice_page_groups):
-    print(invoice_page_groups_elem)
+for invoice_page_groups_elem in result.document.inference.prediction.invoice_page_groups:
+    print(invoice_page_groups_elem.value)
 ```
 
 # Questions?
-
 [Join our Slack](https://join.slack.com/t/mindee-community/shared_invite/zt-2d0ds7dtz-DPAF81ZqTy20chsYpQBW5g)
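
Note on the documentation sample above: the page group class added later in this commit declares a `page_indexes` attribute, and no `value` attribute is visible in this diff. The following is a minimal sketch of iterating over page groups through `page_indexes`. `PageGroupStandIn` and `describe_groups` are hypothetical names used only for illustration; they are not part of the SDK, and the sketch does not call the real API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageGroupStandIn:
    """Hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup."""

    page_indexes: List[int]


def describe_groups(groups: List[PageGroupStandIn]) -> List[str]:
    """Return one human-readable line per invoice (page group)."""
    return [
        f"Invoice {n}: pages {', '.join(str(i) for i in group.page_indexes)}"
        for n, group in enumerate(groups)
    ]


# Two single-page invoices, mirroring the sample output in the docs.
groups = [PageGroupStandIn([0]), PageGroupStandIn([1])]
for line in describe_groups(groups):
    print(line)
```

With a real response, the same loop would presumably run over `result.document.inference.prediction.invoice_page_groups`.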

docs/product/invoice_splitter_v1.rst

Lines changed: 4 additions & 0 deletions
@@ -13,3 +13,7 @@ Invoice Splitter V1
 .. autoclass:: mindee.product.invoice_splitter.invoice_splitter_v1_document.InvoiceSplitterV1Document
    :members:
    :inherited-members:
+
+.. autoclass:: mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group.InvoiceSplitterV1InvoicePageGroup
+   :members:
+   :inherited-members:

mindee/extraction/pdf_extractor/pdf_extractor.py

Lines changed: 4 additions & 6 deletions
@@ -8,9 +8,7 @@
 from mindee.error.mindee_error import MindeeError
 from mindee.extraction.pdf_extractor.extracted_pdf import ExtractedPdf
 from mindee.input.sources.local_input_source import LocalInputSource
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
-)
+from mindee.product.invoice_splitter import InvoiceSplitterV1InvoicePageGroup
 
 
 class PdfExtractor:
@@ -76,7 +74,7 @@ def extract_sub_documents(
 
     def extract_invoices(
         self,
-        page_indexes: List[Union[InvoiceSplitterV1PageGroup, List[int]]],
+        page_indexes: List[Union[InvoiceSplitterV1InvoicePageGroup, List[int]]],
         strict: bool = False,
     ) -> List[ExtractedPdf]:
         """
@@ -88,7 +86,7 @@ def extract_invoices(
         """
         if len(page_indexes) < 1:
             raise MindeeError("No indexes provided.")
-        if not isinstance(page_indexes[0], InvoiceSplitterV1PageGroup):
+        if not isinstance(page_indexes[0], InvoiceSplitterV1InvoicePageGroup):
             return self.extract_sub_documents(page_indexes)  # type: ignore
         if not strict:
             indexes_as_list = [page_index.page_indexes for page_index in page_indexes]  # type: ignore
@@ -97,7 +95,7 @@ def extract_invoices(
         current_list: List[int] = []
         previous_confidence: Optional[float] = None
         for i, page_index in enumerate(page_indexes):
-            assert isinstance(page_index, InvoiceSplitterV1InvoicePageGroup)
+            assert isinstance(page_index, InvoiceSplitterV1InvoicePageGroup)
             confidence = page_index.confidence
             page_list = page_index.page_indexes
 
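
The input handling visible in `extract_invoices` (reject an empty list, pass plain index lists through, unwrap page groups when `strict` is off) can be sketched without the SDK. This is an illustration under assumptions: `PageGroupStandIn` and `to_index_lists` are hypothetical names, `ValueError` stands in for `MindeeError`, and the strict, confidence-based merging is not reproduced because the diff does not show it in full.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class PageGroupStandIn:
    """Hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup."""

    page_indexes: List[int]
    confidence: float = 1.0


def to_index_lists(
    page_indexes: List[Union[PageGroupStandIn, List[int]]],
) -> List[List[int]]:
    """Normalize either input shape to a list of page-index lists."""
    if len(page_indexes) < 1:
        # The original raises MindeeError here.
        raise ValueError("No indexes provided.")
    if not isinstance(page_indexes[0], PageGroupStandIn):
        # Plain lists of integers are used as given.
        return [list(indexes) for indexes in page_indexes]
    # Non-strict mode: take each group's page indexes as-is.
    return [list(group.page_indexes) for group in page_indexes]
```

As in the original, only the first element is inspected to decide which shape the caller passed, so mixed lists are not supported by this sketch.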

mindee/product/invoice_splitter/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -2,6 +2,6 @@
 from mindee.product.invoice_splitter.invoice_splitter_v1_document import (
     InvoiceSplitterV1Document,
 )
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
+from mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group import (
+    InvoiceSplitterV1InvoicePageGroup,
 )

mindee/product/invoice_splitter/invoice_splitter_v1.py

Lines changed: 9 additions & 3 deletions
@@ -9,7 +9,7 @@
 
 
 class InvoiceSplitterV1(Inference):
-    """Inference prediction for Invoice Splitter, API version 1."""
+    """Invoice Splitter API version 1 inference prediction."""
 
     prediction: InvoiceSplitterV1Document
     """Document-level prediction."""
@@ -20,14 +20,20 @@ class InvoiceSplitterV1(Inference):
     endpoint_version = "1"
     """Version of the endpoint."""
 
-    def __init__(self, raw_prediction: StringDict) -> None:
+    def __init__(self, raw_prediction: StringDict):
         """
         Invoice Splitter v1 inference.
 
         :param raw_prediction: Raw prediction from the HTTP response.
         """
         super().__init__(raw_prediction)
+
         self.prediction = InvoiceSplitterV1Document(raw_prediction["prediction"])
         self.pages = []
         for page in raw_prediction["pages"]:
-            self.pages.append(Page(InvoiceSplitterV1Document, page))
+            try:
+                page_prediction = page["prediction"]
+            except KeyError:
+                continue
+            if page_prediction:
+                self.pages.append(Page(InvoiceSplitterV1Document, page))
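
The new loop keeps a page only when it carries a non-empty `prediction`. A minimal standalone sketch of that filter follows; `pages_with_predictions` is a hypothetical helper that works on plain dictionaries and is not part of the SDK.

```python
from typing import Any, Dict, List


def pages_with_predictions(raw_pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only pages that have a non-empty "prediction" entry."""
    kept = []
    for page in raw_pages:
        try:
            page_prediction = page["prediction"]
        except KeyError:
            # Page without a "prediction" key: skipped.
            continue
        if page_prediction:
            # Empty dicts and None are falsy and therefore skipped as well.
            kept.append(page)
    return kept


raw_pages = [
    {"id": 0, "prediction": {"invoice_page_groups": []}},
    {"id": 1},
    {"id": 2, "prediction": {}},
]
print([page["id"] for page in pages_with_predictions(raw_pages)])
```

Note that a prediction such as `{"invoice_page_groups": []}` is a non-empty dictionary, so the page is kept even though its group list is empty.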
Lines changed: 40 additions & 21 deletions
@@ -1,38 +1,57 @@
-from typing import List
+from typing import List, Optional
 
 from mindee.parsing.common.prediction import Prediction
 from mindee.parsing.common.string_dict import StringDict
 from mindee.parsing.common.summary_helper import clean_out_string
-from mindee.product.invoice_splitter.invoice_splitter_v1_page_group import (
-    InvoiceSplitterV1PageGroup,
+from mindee.product.invoice_splitter.invoice_splitter_v1_invoice_page_group import (
+    InvoiceSplitterV1InvoicePageGroup,
 )
 
 
 class InvoiceSplitterV1Document(Prediction):
-    """Document data for Invoice Splitter, API version 1."""
+    """Invoice Splitter API version 1.2 document data."""
 
-    invoice_page_groups: List[InvoiceSplitterV1PageGroup]
-    """Page groups linked to an invoice."""
+    invoice_page_groups: List[InvoiceSplitterV1InvoicePageGroup]
+    """List of page groups. Each group represents a single invoice within a multi-invoice document."""
 
-    def __init__(self, raw_prediction: StringDict) -> None:
+    def __init__(
+        self,
+        raw_prediction: StringDict,
+        page_id: Optional[int] = None,
+    ):
         """
         Invoice Splitter document.
 
         :param raw_prediction: Raw prediction from HTTP response
+        :param page_id: Page number for multi pages pdf input
         """
-        super().__init__(raw_prediction)
-
-        invoice_page_groups = []
-        if (
-            "invoice_page_groups" in raw_prediction
-            and len(raw_prediction["invoice_page_groups"]) > 0
-        ):
-            for prediction in raw_prediction["invoice_page_groups"]:
-                invoice_page_groups.append(InvoiceSplitterV1PageGroup(prediction))
-        self.invoice_page_groups = invoice_page_groups
+        super().__init__(raw_prediction, page_id)
+        self.invoice_page_groups = [
+            InvoiceSplitterV1InvoicePageGroup(prediction, page_id=page_id)
+            for prediction in raw_prediction["invoice_page_groups"]
+        ]
+
+    @staticmethod
+    def _invoice_page_groups_separator(char: str) -> str:
+        out_str = " "
+        out_str += f"+{char * 74}"
+        return out_str + "+"
+
+    def _invoice_page_groups_to_str(self) -> str:
+        if not self.invoice_page_groups:
+            return ""
+
+        lines = f"\n{self._invoice_page_groups_separator('-')}\n ".join(
+            [item.to_table_line() for item in self.invoice_page_groups]
+        )
+        out_str = ""
+        out_str += f"\n{self._invoice_page_groups_separator('-')}\n "
+        out_str += " | Page Indexes "
+        out_str += f" |\n{self._invoice_page_groups_separator('=')}"
+        out_str += f"\n {lines}"
+        out_str += f"\n{self._invoice_page_groups_separator('-')}"
+        return out_str
 
     def __str__(self) -> str:
-        page_group_str = ":Invoice Page Groups:"
-        for page_group in self.invoice_page_groups:
-            page_group_str += f"\n {str(page_group)}"
-        return clean_out_string(page_group_str)
+        out_str: str = f":Invoice Page Groups: {self._invoice_page_groups_to_str()}\n"
+        return clean_out_string(out_str)
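
The table layout built by `_invoice_page_groups_separator` and `_invoice_page_groups_to_str` can be approximated by the standalone sketch below. The function names are hypothetical, the one-space left margin and the padded header cell are assumptions (whitespace inside the string literals above may have been collapsed when the page was captured), and the final `clean_out_string` pass is left out.

```python
from typing import List


def separator(char: str) -> str:
    """Build a horizontal rule: margin, '+', 74 fill characters, '+'."""
    return " " + f"+{char * 74}" + "+"


def table_line(page_indexes: List[int]) -> str:
    """Render one group as a table row with a 72-character-wide cell."""
    joined = ", ".join(str(index) for index in page_indexes)
    return f"| {joined:<72} |"


def groups_to_table(groups: List[List[int]]) -> str:
    """Render all groups as an rST-style grid table; empty input gives ''."""
    if not groups:
        return ""
    rows = f"\n{separator('-')}\n ".join(table_line(group) for group in groups)
    out_str = f"\n{separator('-')}\n "
    out_str += f"| {'Page Indexes':<72} |"
    out_str += f"\n{separator('=')}"
    out_str += f"\n {rows}"
    out_str += f"\n{separator('-')}"
    return out_str


print(groups_to_table([[0], [1]]))
```

Each rule is 77 characters wide, which matches a row (76 characters) plus its one-character margin, so the borders line up.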
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+from typing import Dict, List, Optional
+
+from mindee.parsing.common.string_dict import StringDict
+from mindee.parsing.common.summary_helper import clean_out_string
+from mindee.parsing.standard.base import FieldConfidenceMixin, FieldPositionMixin
+
+
+class InvoiceSplitterV1InvoicePageGroup(FieldPositionMixin, FieldConfidenceMixin):
+    """List of page groups. Each group represents a single invoice within a multi-invoice document."""
+
+    page_indexes: List[int]
+    """List of page indexes that belong to the same invoice (group)."""
+    page_n: int
+    """The document page on which the information was found."""
+
+    def __init__(
+        self,
+        raw_prediction: StringDict,
+        page_id: Optional[int] = None,
+    ):
+        self._set_confidence(raw_prediction)
+        self._set_position(raw_prediction)
+
+        if page_id is None:
+            try:
+                self.page_n = raw_prediction["page_id"]
+            except KeyError:
+                pass
+        else:
+            self.page_n = page_id
+
+        self.page_indexes = raw_prediction["page_indexes"]
+
+    def _printable_values(self) -> Dict[str, str]:
+        """Return values for printing."""
+        out_dict: Dict[str, str] = {}
+        out_dict["page_indexes"] = ", ".join([str(elem) for elem in self.page_indexes])
+        return out_dict
+
+    def _table_printable_values(self) -> Dict[str, str]:
+        """Return values for printing inside an RST table."""
+        out_dict: Dict[str, str] = {}
+        out_dict["page_indexes"] = ", ".join([str(elem) for elem in self.page_indexes])
+        return out_dict
+
+    def to_table_line(self) -> str:
+        """Output in a format suitable for inclusion in an rST table."""
+        printable = self._table_printable_values()
+        out_str: str = f"| {printable['page_indexes']:<72} | "
+        return clean_out_string(out_str)
+
+    def __str__(self) -> str:
+        """Default string representation."""
+        printable = self._printable_values()
+        out_str: str = f"Page Indexes: {printable['page_indexes']}, \n"
+        return clean_out_string(out_str)
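
The page-number fallback in the constructor above (an explicit `page_id` wins, otherwise the raw prediction's `"page_id"` is used when present) is sketched below on plain dictionaries. `PageGroupSketch` is a hypothetical class, not the SDK's. It differs from the original in two labeled ways: `page_n` is set to `None` when no page id is available (the original leaves the attribute unset), and the confidence and position mixins are omitted.

```python
from typing import Any, Dict, List, Optional


class PageGroupSketch:
    """Hypothetical, simplified version of the page group class."""

    def __init__(
        self,
        raw_prediction: Dict[str, Any],
        page_id: Optional[int] = None,
    ) -> None:
        if page_id is None:
            # Assumption: fall back to None instead of leaving page_n unset.
            self.page_n: Optional[int] = raw_prediction.get("page_id")
        else:
            self.page_n = page_id
        self.page_indexes: List[int] = list(raw_prediction["page_indexes"])

    def __str__(self) -> str:
        joined = ", ".join(str(index) for index in self.page_indexes)
        return f"Page Indexes: {joined}"


group = PageGroupSketch({"page_indexes": [0, 1], "page_id": 3})
print(group, group.page_n)
```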

mindee/product/invoice_splitter/invoice_splitter_v1_page_group.py

Lines changed: 0 additions & 22 deletions
This file was deleted.
