A group aware serialization for embeddings #1720

Fogapod · 2025-06-05T18:21:43Z


Hello, I'm using HierarchicalChunker + contextualize() to split a document into parts for embeddings. The problem is that some chunks are too granular: sometimes a chunk is a single number or word.

I've noticed these granular items are often inside groups, so I want to try using the whole group instead. The document's body.children contains the top-level objects without duplicates, so I've tried iterating over it and serializing these objects without traversing the entire tree.
The serializer doesn't have an overridable method like serialize_group, so I assume that is not the way to go.

This is the code I've got:

from collections.abc import Iterator

from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.base import SerializationResult
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.types.doc.document import (
    DoclingDocument,
    InlineGroup,
    NodeItem,
)

with open("d.json") as f:
    doc = DoclingDocument.model_validate_json(f.read())

ser = ChunkingSerializerProvider().get_serializer(doc)

items: list[SerializationResult] = []

for ref in doc.body.children:
    # annotate with NodeItem (imported above); DocItem was never imported
    item: NodeItem = ref.resolve(doc)
    # get_parts returns serialization results
    parts = ser.get_parts(item)

    if isinstance(item, InlineGroup):
        text = " ".join(part.text for part in parts)
    else:
        text = "\n".join(part.text for part in parts)

    res = create_ser_result(
        text=text,
        span_source=parts,
    )

    # some results are empty for some reason
    if not res.text:
        continue

    items.append(res)


for item in items:
    print(item.text)
    print("-----")

# print(doc.export_to_markdown())
# doc.print_element_tree()

This snippet seems to work correctly, joining all group items into a single SerializationResult, but now section headers are separate. I think the chunker's contextualize method is what adds them.
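To make the missing piece concrete, here is a minimal sketch of the heading bookkeeping that would be needed on top of the loop above. It does not use docling at all: `Block` and `with_heading_context` are hypothetical stand-ins, and the assumption is that each top-level item can be reduced to its serialized text plus an optional heading level (for docling documents that level would presumably come from the section header item, which is not verified here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Stand-in for one serialized top-level item (hypothetical, not a docling type)."""

    text: str
    heading_level: Optional[int] = None  # None for non-heading blocks


def with_heading_context(blocks: list[Block]) -> list[str]:
    """Prefix every non-heading block with the headings currently in scope."""
    headings: dict[int, str] = {}  # level -> most recent heading text
    out: list[str] = []
    for block in blocks:
        if block.heading_level is not None:
            # A new heading closes every heading at the same or a deeper level.
            for level in [lvl for lvl in headings if lvl >= block.heading_level]:
                del headings[level]
            headings[block.heading_level] = block.text
            continue
        # Outermost heading first, then the block's own text.
        context = [headings[lvl] for lvl in sorted(headings)]
        out.append("\n".join([*context, block.text]))
    return out


blocks = [
    Block("Installation", heading_level=1),
    Block("Run the installer."),
    Block("Linux", heading_level=2),
    Block("Use the package manager."),
    Block("Usage", heading_level=1),
    Block("Call the CLI."),
]
contextualized = with_heading_context(blocks)
```

Headings are consumed rather than emitted as their own entries, so each resulting string carries its section path the way a contextualized chunk would. Captions are not covered by this sketch.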

My questions are:

1. Can groups be nested?
2. How do I capture headings, captions, etc., similar to contextualize? Do I need to manually count groups/text nodes and try to match them, or does docling have utilities for this?
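Regarding the first question, a recursive join gives the same result whether or not groups turn out to nest, so the approach need not depend on the answer. The sketch below is plain Python with a hypothetical `Node` type standing in for document items; it is not docling's API, and the `inline` flag is only an assumed analogue of the InlineGroup check in the code above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in tree node (hypothetical): a leaf has text, a group has children."""

    text: str = ""
    children: list["Node"] = field(default_factory=list)
    inline: bool = False  # True for groups whose parts should read as one line


def flatten(node: Node) -> str:
    """Join a node and all of its descendants into one string."""
    if not node.children:
        return node.text
    # Recurse first, so nested groups are already collapsed when joined here.
    parts = [flatten(child) for child in node.children]
    parts = [part for part in parts if part]  # skip empty results
    separator = " " if node.inline else "\n"
    return separator.join(parts)


tree = Node(
    children=[
        Node(text="Total:"),
        Node(inline=True, children=[Node(text="42"), Node(text="units")]),
        Node(text=""),  # empty leaf, dropped by the filter above
    ]
)
flat = flatten(tree)
```

Filtering empty parts inside the recursion also removes the need for the separate empty-result check at the top level.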

Replies: 0 comments
