Question
Is there a way to improve the inference latency of Docling on a GPU by creating a batch of page images as the input to the different models (EasyOCR, Layout Detection, and TableFormer)?
I am using a single A10 GPU for inference, and it is significantly underutilized (~15%). It would be ideal if we could batch inputs to these models to make better use of it.
Looking into the Docling documentation, I have tried increasing num_threads, but that seems to affect only CPU execution, not the GPU (snippet of my setup below).
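For reference, this is roughly how I am configuring the converter. The accelerator options follow the documented Docling example; the specific values (num_threads=8, CUDA device) and the input file name are just placeholders from my setup:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Request CUDA and more threads; in my runs this only changes
# CPU-side behaviour and GPU utilization stays around 15%.
pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CUDA
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("sample.pdf")  # placeholder input file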
When I did a little digging into the code, I saw that Docling iterates over the pages in a page_batch and passes only a single page at a time as input to these models, like so:
def __call__(
    self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
    for page in page_batch:
        assert page._backend is not None
        if not page._backend.is_valid():
            yield page
        else:
            with TimeRecorder(conv_res, "layout"):
                assert page.size is not None
                page_image = page.get_image(scale=1.0)
                assert page_image is not None

                clusters = []
                for ix, pred_item in enumerate(
                    self.layout_predictor.predict(page_image)
                ):
                    label = DocItemLabel(
                        pred_item["label"]
                        .lower()
                        .replace(" ", "_")
                        .replace("-", "_")
                    )
                    ...
It would be great if we could batch the page images here and make full use of the GPU; a rough sketch of what I have in mind is below.
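To illustrate the idea, here is how the layout step could collect the page images and run one batched forward pass instead of one call per page. predict_batch is a hypothetical method (as far as I can tell, the current predictor only accepts a single image), so this is only meant to show the pattern, not a working patch:

def __call__(
    self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
    pages = list(page_batch)
    valid_pages = [
        p for p in pages if p._backend is not None and p._backend.is_valid()
    ]

    with TimeRecorder(conv_res, "layout"):
        # Render all page images up front...
        page_images = [p.get_image(scale=1.0) for p in valid_pages]

        # ...and run a single batched forward pass on the GPU.
        # predict_batch is hypothetical: one prediction list per image.
        all_predictions = self.layout_predictor.predict_batch(page_images)

    for page, predictions in zip(valid_pages, all_predictions):
        for ix, pred_item in enumerate(predictions):
            label = DocItemLabel(
                pred_item["label"].lower().replace(" ", "_").replace("-", "_")
            )
            ...  # same per-cluster post-processing as today

    # Pages with an invalid backend are passed through unchanged.
    yield from pages

Presumably the same pattern would also help for TableFormer and EasyOCR, since they likewise receive one page or crop at a time.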
Looking forward to hearing back!
Thank you!