PDF to Markdown - Images Not Extracted, Showing <--image--> Placeholder #1146

harskuma · 2025-03-12T07:42:22Z

I am trying to convert a PDF to Markdown using Docling, but instead of extracting the images, I am getting a <--image--> placeholder in the output.

I set pipeline_options.do_ocr = True to enable OCR.

jaddison · 2025-03-12T23:54:35Z

See the image_mode parameter to export_to_markdown. There are 3 options, one of which is REFERENCED.

from enum import Enum


class ImageRefMode(str, Enum):
    """ImageRefMode."""

    PLACEHOLDER = "placeholder"  # just a place-holder
    EMBEDDED = "embedded"  # embed the image as a base64
    REFERENCED = "referenced"  # reference the image via uri
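
As a rough sketch of the REFERENCED option, the helper below converts a PDF and saves Markdown whose image links point at files written to disk. The function name export_with_image_refs, the output directory, and the scale value are illustrative assumptions, not part of the thread. generate_picture_images and save_as_markdown are taken from Docling's figure-export example as I understand it, so check them against your installed version.

```python
from pathlib import Path


def export_with_image_refs(pdf_path: str, out_dir: str = "out") -> Path:
    """Convert a PDF and save Markdown that links to extracted image files."""
    # Imported inside the function so that defining this helper does not
    # itself require Docling to be importable.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling_core.types.doc import ImageRefMode

    options = PdfPipelineOptions()
    options.images_scale = 2.0  # assumed value; larger means bigger images
    options.generate_picture_images = True  # keep picture crops for export

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    doc = converter.convert(pdf_path).document

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    md_path = target / (Path(pdf_path).stem + ".md")
    # REFERENCED is meant to write image files and link them from the Markdown.
    doc.save_as_markdown(md_path, image_mode=ImageRefMode.REFERENCED)
    return md_path
```

Note that this produces image files and links, not the text contained in the images.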


KartikB4B · 2025-04-01T06:49:15Z

When exporting Markdown with doc.export_to_markdown(), you can specify the image_mode.
Example:

from docling_core.types.doc import ImageRefMode

full_markdown = doc.export_to_markdown(
    page_break_placeholder="",
    image_mode=ImageRefMode.EMBEDDED,  # embed images as base64
)
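
A side note on the image_mode value: the ImageRefMode enum quoted earlier in this thread subclasses str, so its members compare equal to their string values. That is presumably why a plain string such as "embedded" is accepted as well as the enum member. The standalone sketch below uses a local copy of the enum (no Docling needed); to_mode is a hypothetical helper for illustration only.

```python
from enum import Enum


class ImageRefMode(str, Enum):
    """Local stand-in mirroring the enum quoted earlier in this thread."""

    PLACEHOLDER = "placeholder"
    EMBEDDED = "embedded"
    REFERENCED = "referenced"


def to_mode(value):
    """Accept a string or an enum member and return the enum member."""
    # Enum lookup by value raises ValueError for unknown strings.
    return ImageRefMode(value)


print(to_mode("embedded") is ImageRefMode.EMBEDDED)
print(ImageRefMode.EMBEDDED == "embedded")
print([mode.value for mode in ImageRefMode])
```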

4 replies

harskuma Apr 1, 2025
Author

Hey! Thanks for the reference. I used this, but I'm getting the image as base64. I want the text from the images. Am I missing something?

KartikB4B Apr 2, 2025

No. I am also looking for that.

sreena-certaintiai Apr 7, 2025

Same issue here.

itsyaboyksi Apr 26, 2025

Any luck yet?

harskuma · 2025-04-29T16:01:39Z

Author

@itsyaboyksi @sreena-certaintiai
Hey! I was just looking around and found a way to access the image content.
Here's what worked for me (I used the sample PDF amt_handbook_sample.pdf from the Docling repo):

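from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TextItem

source = "amt_handbook_sample.pdf"

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2
pipeline_options.generate_page_images = True

doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = doc_converter.convert(source)
doc = result.document

# For each picture, print its caption followed by the text items nested inside it.
for picture in doc.pictures:
    print(picture.caption_text(doc), " contains these elements:")
    for item, level in doc.iterate_items(root=picture, traverse_pictures=True):
        if isinstance(item, TextItem):
            print(item.text)
    print("\n")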

This gives you access to the image blocks and any OCR-extracted text inside them.
For now, I'm replacing the placeholder in the Markdown with the output of this code using some custom logic.
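
The custom replacement logic is not shown in the thread. One possible sketch, using only the standard library, is below. It assumes the placeholder string is `<!-- image -->` (passed as a parameter, so adjust it to whatever your export actually contains) and that the pictures appear in the same order as the placeholders in the Markdown. fill_image_placeholders is a hypothetical name.

```python
def fill_image_placeholders(markdown, picture_texts, placeholder="<!-- image -->"):
    """Replace the i-th placeholder with the text collected for the i-th picture."""
    parts = markdown.split(placeholder)
    out = [parts[0]]
    for i, tail in enumerate(parts[1:]):
        if i < len(picture_texts) and picture_texts[i].strip():
            out.append(picture_texts[i].strip())
        else:
            # No text was collected for this picture: keep the placeholder.
            out.append(placeholder)
        out.append(tail)
    return "".join(out)


sample = "Intro\n\n<!-- image -->\n\nMore text\n\n<!-- image -->\n"
print(fill_image_placeholders(sample, ["Text found inside figure 1"]))
```

Here picture_texts would be built from the loop above, for example by joining the item.text values of each picture into one string per picture.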

