output base64 encoded images urls (strings) #183

Human-Hassan · 2024-11-11T03:00:03Z

Hello,
First, thanks a lot for the amazing work! I have one suggestion (I'm fairly new to this, so if I've missed something, please read this as a question about how to do it):

I'm using pymupdf4llm to extract text with images from PDF files. I would like to avoid saving the images to disk and then reading and encoding them again before adding them to my DB or sending them to the LMM (I'm using OpenAI GPT-4 via the AzureOpenAI API).
I'm aware of the 'embed_images' boolean, which embeds images in the extracted text as base64-encoded strings (URLs). I see that when 'save_image' is called, it returns data containing the URL (data = f"data:image/{IMG_EXTENSION};base64," + data), which is then added to the MD text.
However, this is problematic and expensive, because the URLs are usually very long; in my case, even a small image causes the max number of tokens to be hit.
Generally, I think that according to the MS AzureOpenAI service it is better and much less expensive to send the encoded image URLs using the type "image_url", alongside the MD text using the type "text".

Therefore, my suggestion is to offer an option to return the (encoded) image URLs alongside the extracted text as separate outputs (e.g. pymupdf4llm.to_markdown returns a list containing the MD text and the image URLs).

Thanks!

JorjMcKie · 2024-11-11T14:23:32Z

Maintainer

I really don't understand that:

You can either embed images as base64 strings in the MD text, or store them separately as image files in a folder of your choice.
In the latter case, a markdown-compatible URL is included in the text, pointing to this file.

All that could be offered is a callback function (to be provided by the programmer, i.e. you) to which the image (and associated metadata) is handed over.
We will certainly not store images to internet resources ourselves. All the entailing problems, error checking and whatnot are clearly out of scope.

To streamline your process, I would never recommend accessing the internet every time that hypothetical callback function is invoked.
For smooth and efficient processing, it seems advisable to store images somewhere and bulk-upload them from that storage later.
Based on this argument, you already have all you need today:

specify write_images=True, page_chunks=True, image_path="path"
at the end of the markdown run, upload the images contained in "path", enriched with information from the corresponding page chunk.
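
The second step above can be sketched roughly as follows. This is only an illustration, not part of pymupdf4llm: the helper name images_by_page is made up, and it assumes that each page chunk is a dict with a "text" key (the page's markdown, containing references like ![](path)) and a "metadata" dict with a "page" number. Check the chunk layout of your installed version before relying on it.

```python
import re
from pathlib import Path

# Matches markdown image references such as ![](images/doc.pdf-0-1.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")


def images_by_page(page_chunks):
    """Map each page number to the image files referenced in its markdown text."""
    result = {}
    for index, chunk in enumerate(page_chunks, start=1):
        # Assumption: chunks carry "metadata" -> "page"; fall back to the
        # 1-based position in the list when that key is missing.
        page_no = chunk.get("metadata", {}).get("page", index)
        refs = [
            Path(ref)
            for ref in IMAGE_REF.findall(chunk.get("text", ""))
            if not ref.startswith("data:")  # skip embedded base64 images
        ]
        if refs:
            result[page_no] = refs
    return result
```

With the mapping in hand, a later bulk-upload step can read each file and attach the page number (or any other chunk metadata) to it.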

1 reply

Human-Hassan Nov 11, 2024
Author

Sorry, I think I confused you by saying "image urls"; AzureOpenAI vision refers to these as "data_url" (see example: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest). I don't want to access or store images on a server or anywhere on the internet.

As you mentioned, with 'embed_images=True' the images are embedded as base64 strings in the MD text, which is what I'm saying is problematic. In short, instead of embedding them in the MD text, I want them in a separate list so I can send them along with the MD text but not as part of it.

Example:

import pymupdf4llm

# Get MD text and images in the document.
# Note: encoded_image_list is the option being proposed here; it does not exist in pymupdf4llm today.
md_text, encoded_images = pymupdf4llm.to_markdown(document_path, encoded_image_list=True)

system_message = {"role": "system", "content": "Summarize this document and tell me what you see in the images in this document."}
# Text and images go into one user message as separate content parts
content = [{"type": "text", "text": md_text}]
for encoded_image in encoded_images:
    content.append({"type": "image_url", "image_url": {"url": encoded_image}})

messages = [system_message, {"role": "user", "content": content}]

# Send to LMM (client and model are set up elsewhere)
response = client.chat.completions.create(messages=messages, model=model, temperature=0.001)

With how things currently work, to keep the images out of md_text (and so avoid hitting the max number of tokens allowed), I have to save the images and then encode them again before sending them to the LMM as in the example above. But I want to avoid saving the images, and also avoid the extra work of encoding them again, since pymupdf4llm can already do that.
An example of what I'm currently doing and want to avoid:

import glob
import os
import pymupdf4llm

image_path = "./temp/save_here"
# Get MD text and write the document's images to image_path
md_text = pymupdf4llm.to_markdown(document_path, write_images=True, image_path=image_path)

# Encode every saved image as a data URL
encoded_images = encode_image(sorted(glob.glob(os.path.join(image_path, "*"))))

system_message = {"role": "system", "content": "Summarize this document and tell me what you see in the images in this document."}
# Text and images go into one user message as separate content parts
content = [{"type": "text", "text": md_text}]
for encoded_image in encoded_images:
    content.append({"type": "image_url", "image_url": {"url": encoded_image}})

messages = [system_message, {"role": "user", "content": content}]

# Send to LMM (client and model are set up elsewhere)
response = client.chat.completions.create(messages=messages, model=model, temperature=0.001)

where the utility function encode_image is:

import base64
from mimetypes import guess_type

def encode_image(images_paths):
    """Return a base64 data URL for each image file in images_paths."""
    encoded_images = []
    for image_path in images_paths:
        # Guess the MIME type of the image based on the file extension
        mime_type, _ = guess_type(image_path)
        if mime_type is None:
            mime_type = 'application/octet-stream'  # Default MIME type if none is found
        # Read and encode the image file
        with open(image_path, "rb") as image_file:
            base64_encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
        # Construct the data URL
        image_url = f"data:{mime_type};base64,{base64_encoded_image}"
        encoded_images.append(image_url)
    return encoded_images

However, I'm working with a large number of documents and would really like to avoid creating a directory for each one, saving images, deleting them again, and so on, especially since the work is already done inside pymupdf4llm.to_markdown.

JorjMcKie · 2024-11-11T16:28:07Z

Maintainer

But you can use write_images=True instead and have the images stored in a separate folder.
Then upload from there.
What's the problem in this case?

16 replies

Human-Hassan Nov 11, 2024
Author

So you're saying that when I hand the encoded_image to the AzureOpenAI client, the images are processed and uploaded anyway, right?
That is okay since I have not saved anything.
So for me now, the way to get a list of encoded images (i.e. "data") is to get md_text without images and then invoke a callback to save_image in pymupdf_rag.py?

Could you elaborate on this and help me get there?

Thanks a lot for your help and support.

JorjMcKie Nov 11, 2024
Maintainer

We haven't yet considered adding a callback function for image output in any detail. It would require diligent thought and would have to be prioritized against loads of other work in our backlog. It would certainly not receive a first-in-queue position.
Any potential solution would also not be influenced by the requirements of any individual receiver backend (such as MS Azure, Google, Amazon or whatever).

So you might want to think about your own solution in the meantime. I still think that a local folder for images is a pragmatic idea. Image filenames are unique across documents, so images from multiple documents may be stored in the same folder.
Using suitable mechanisms like the subprocess module, images can be uploaded asynchronously to your online backend and deleted from local storage.
This would keep local storage requirements low, while at the same time keeping the MD text production fast and not impeded by the notoriously slow processing whenever internet access comes into play.

Human-Hassan Nov 11, 2024
Author

I see. Thank you.
Yes, for now it seems I will save the images with write_images=True in a temp directory, encode them outside pymupdf4llm, and then send the encoded images to the LMM.

Another option I have been thinking about (feel free to let me know what you think of it)
is to embed the images with embed_images=True, then scan md_text for the encoded images (beginning with data:image....), extract them, and remove them from md_text. But this seems like a lot of work too.
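
For reference, the scan-and-extract idea could look roughly like the sketch below. It is not something pymupdf4llm provides: split_embedded_images and the placeholder text are made-up names, and the pattern assumes the embedded images appear as markdown image references of the form ![](data:image/<ext>;base64,<payload>).

```python
import re

# Assumption: embedded images look like ![alt](data:image/<ext>;base64,<payload>)
DATA_URL_IMAGE = re.compile(
    r"!\[[^\]]*\]\((data:image/[A-Za-z0-9.+-]+;base64,[A-Za-z0-9+/=]+)\)"
)


def split_embedded_images(md_text, placeholder="[image {n}]"):
    """Return (md_text with embedded images replaced, list of data URLs)."""
    data_urls = []

    def _swap(match):
        # Collect the data URL and leave a short numbered marker in the text
        data_urls.append(match.group(1))
        return placeholder.format(n=len(data_urls))

    return DATA_URL_IMAGE.sub(_swap, md_text), data_urls
```

The numbered placeholders keep a hint in the text of where each image sat, so the data URLs can be sent as separate "image_url" parts without losing their position.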

If I may ask, what is the logic of embedding the encoded images as part of md_text, especially given that the string of an encoded image can be very, very long? At first I expected the encoded images to be in a separate list rather than embedded, because an encoded image can be sent to the LMM as a separate type and does not affect the max number of tokens allowed.

JorjMcKie Nov 11, 2024
Maintainer

Encoding an image is a real no-brainer, see the code. Just write a little script that walks through that folder, reads each image, encodes it and prepares a dictionary as required by your internet backend.
Send the dictionary to the backend, then delete the file from the image path.
Repeat until no more images in the folder.
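
A minimal sketch of such a script, using only the standard library. The function name drain_image_folder is made up, and the dictionary layout is the "image_url" content part used earlier in this thread; adapt it to whatever your backend expects.

```python
import base64
from mimetypes import guess_type
from pathlib import Path


def drain_image_folder(folder):
    """Yield one "image_url" dict per file in folder, deleting each file afterwards."""
    # sorted() takes a snapshot of the listing, so deleting files is safe
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        mime_type = guess_type(path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
        yield {
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
        }
        # Runs only when the caller requests the next item, i.e. after the
        # dict above has been handed over.
        path.unlink()
```

Because this is a generator, a file is removed only after the caller has taken its dictionary and asked for the next one, so an interrupted run leaves the unsent images on disk.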

Human-Hassan Nov 11, 2024
Author

Yes, I already did that (I made a copy of my code example doing that in my earlier response #183 (reply in thread)).
