Commit b814ece

fix: properly handle the case when an element's text is None (#3995)
Some elements, such as `Image`, can have `None` as the value of their `text` attribute. In that case the current chunking logic fails, because it assumes the field always has a length and can always be split. The fix is to use `element.text or ""` when checking the length, and to add an early exit so that `split` is never called on `None`.
1 parent 604c4a7 commit b814ece
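
The two guards described in the commit message can be sketched in isolation. This is a minimal illustration, not the library's code: `FakeElement`, the free functions `will_fit` and `iter_text_segments`, and the `remaining_space` parameter are hypothetical stand-ins for the real `Element`, `PreChunkBuilder.will_fit()` and `PreChunk._iter_text_segments()`.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for an element whose text may be None."""

    text: Optional[str]


def will_fit(remaining_space: int, element: FakeElement) -> bool:
    # len(None) raises TypeError, so treat a missing text as the empty string.
    return not remaining_space < len(element.text or "")


def iter_text_segments(elements: List[FakeElement]) -> Iterator[str]:
    for e in elements:
        # Skip text-less elements before calling any str method on them.
        if e.text is None:
            continue
        # Collapse runs of whitespace into single spaces.
        text = " ".join(e.text.strip().split())
        if not text:
            continue
        yield text


elements = [FakeElement(None), FakeElement("  hello   world ")]
print(will_fit(0, elements[0]))            # True
print(list(iter_text_segments(elements)))  # ['hello world']
```

Using `element.text or ""` keeps the length check a single expression, while the explicit `is None` test in the loop skips the element outright instead of normalizing an empty string that would be discarded anyway.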

4 files changed: 18 additions and 4 deletions

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,4 @@
-## 0.17.6-dev2
+## 0.17.6

 ### Enhancements

@@ -10,6 +10,7 @@ Two executions of the same code, on the same file, produce different results. Th
 This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.
 - **Do not use NLP to determine element types for extracted elements with hi_res.** This avoids extraneous Title elements in hi_res outputs. This only applies to *extracted* elements, meaning text objects that are found outside of Object Detection objects which get mapped to *inferred* elements. (*extracted* and *inferred* elements get merged together to form the list of `Element`s returned by `pdf_partition()`)
 - Resolve open CVEs
+- Properly handle the case when an element's `text` attribute is None


 ## 0.17.5
@@ -48,7 +49,7 @@ This makes it impossible to write stable unit tests, for example, or to obtain r
 ### Features

 ### Fixes
-- **Fixes wrong detection of office files** certain office files wrongly identified as .ZIP when office(.docx,.xlsx and .pptx) files containing files other than word/document.xml, xl/workbook.xml and ppt/presentation.xml respectively will now be identified correctly by looking for word/document\*.xml, xl/workbook\*.xml and ppt/presentation\*.xml
+- **Fixes wrong detection of office files** certain office files wrongly identified as .ZIP when office(.docx,.xlsx and .pptx) files containing files other than word/document.xml, xl/workbook.xml and ppt/presentation.xml respectively will now be identified correctly by looking for word/document\*.xml, xl/workbook\*.xml and ppt/presentation\*.xml

 ## 0.17.2

test_unstructured/chunking/test_base.py

Lines changed: 11 additions & 0 deletions
@@ -31,6 +31,7 @@
     CompositeElement,
     Element,
     ElementMetadata,
+    Image,
     PageBreak,
     Table,
     TableChunk,
@@ -234,6 +235,10 @@ def it_accumulates_elements_added_to_it(self):
         assert builder._text_length == 112
         assert builder._remaining_space == 36

+    def it_will_fit_when_element_has_none_as_text(self):
+        builder = PreChunkBuilder(opts=ChunkingOptions())
+        assert builder.will_fit(Image(None))
+
     def it_will_fit_an_oversized_element_when_empty(self):
         builder = PreChunkBuilder(opts=ChunkingOptions())
         assert builder.will_fit(Text("abcd " * 200))
@@ -405,6 +410,12 @@ def and_it_knows_it_is_NOT_equal_to_an_object_that_is_not_a_PreChunk(self):
         pre_chunk = PreChunk([], overlap_prefix="", opts=ChunkingOptions())
         assert pre_chunk != 42

+    def it_can_handle_element_with_none_as_text(self):
+        pre_chunk = PreChunk(
+            [Image(None), Text("hello")], overlap_prefix="", opts=ChunkingOptions()
+        )
+        assert pre_chunk._text == "hello"
+
     @pytest.mark.parametrize(
         ("max_characters", "combine_text_under_n_chars", "expected_value"),
         [

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.17.6-dev2" # pragma: no cover
+__version__ = "0.17.6" # pragma: no cover

unstructured/chunking/base.py

Lines changed: 3 additions & 1 deletion
@@ -387,7 +387,7 @@ def will_fit(self, element: Element) -> bool:
         if self._text_length > self._opts.soft_max:
             return False
         # -- don't add an element if it would increase total size beyond the hard-max --
-        return not self._remaining_space < len(element.text)
+        return not self._remaining_space < len(element.text or "")

     @property
     def _remaining_space(self) -> int:
@@ -503,6 +503,8 @@ def _iter_text_segments(self) -> Iterator[str]:
         if self._overlap_prefix:
             yield self._overlap_prefix
         for e in self._elements:
+            if e.text is None:
+                continue
             text = " ".join(e.text.strip().split())
             if not text:
                 continue
