Move `to_{html,xhtml,xml,text,json}` methods from `Page` to `TextPage` #143

ginnyTheCat · 2025-05-26T03:58:07Z

This moves to_{html,xhtml,xml,text,json} from Page to TextPage, which is more in line with how the C API itself works. It allows choosing the TextPageFlags based on the requested format when no explicit flags are passed, as PyMuPDF does. On top of that, it avoids constructing a fz_stext_page twice when calling, for example, both to_text and to_json on the same page (very visible in the tests/test_issues.rs file and in examples/).

To fix #69 the code in the issue can now be adapted to:

use std::fs;

use mupdf::{Document, TextPageFlags};

fn main() {
    let doc: Document = Document::open("input.epub").unwrap();

    let page = doc.load_page(7).unwrap();
    let text_page = page.to_text_page(TextPageFlags::PRESERVE_IMAGES).unwrap();
    let html = text_page.to_html(7).unwrap();
    fs::write("out.html", html).unwrap();
}

To fully match PyMuPDF's behaviour here, TextPageFlags::PRESERVE_LIGATURES | TextPageFlags::PRESERVE_WHITESPACE | TextPageFlags::CLIP | TextPageFlags::PRESERVE_IMAGES | TextPageFlags::USE_CID_FOR_UNKNOWN_UNICODE would need to be passed. It might therefore be worth adding either

a) A shortcut function like the one below. This would keep the Page::to_html function in the same place, albeit with slightly changed behaviour, and would, for example, fix #69 without a code change.

impl Page {
    fn to_html(&self) -> Result<String, Error> {
        self.to_text_page(flags_from_above)?.to_html(self.inner.number)
    }
}

or

b) An alias like TextPageFlags::DISPLAY = flags_from_above. This would keep people from doing the fz_page -> fz_stext_page conversion more often than they need to, which could easily happen if the conversion were hidden inside the Page::to_html function.

I'm unsure myself which one of these would be better, but that's an addition that could come in a future PR anyway.
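As a rough illustration of option (b), here is a minimal, self-contained sketch of what such an alias could look like. The `TextPageFlags` type below is a local stand-in, not the crate's real type, its bit values are placeholders, and the `DISPLAY` constant and `contains` helper are hypothetical names used only for this example.

```rust
use std::ops::BitOr;

/// Local stand-in for `mupdf::TextPageFlags` (assumption: NOT the real type).
/// The bit values are placeholders and do not claim to match MuPDF's.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TextPageFlags(u32);

impl TextPageFlags {
    const PRESERVE_LIGATURES: Self = Self(1 << 0);
    const PRESERVE_WHITESPACE: Self = Self(1 << 1);
    const PRESERVE_IMAGES: Self = Self(1 << 2);
    const CLIP: Self = Self(1 << 3);
    const USE_CID_FOR_UNKNOWN_UNICODE: Self = Self(1 << 4);

    /// Hypothetical alias from option (b): the flag set listed above for
    /// matching PyMuPDF's HTML output.
    const DISPLAY: Self = Self(
        Self::PRESERVE_LIGATURES.0
            | Self::PRESERVE_WHITESPACE.0
            | Self::CLIP.0
            | Self::PRESERVE_IMAGES.0
            | Self::USE_CID_FOR_UNKNOWN_UNICODE.0,
    );

    /// Returns true if every bit set in `other` is also set in `self`.
    const fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

impl BitOr for TextPageFlags {
    type Output = Self;

    /// Combines two flag sets, mirroring the `A | B` syntax used above.
    fn bitor(self, rhs: Self) -> Self {
        Self(self.0 | rhs.0)
    }
}
```

With an alias like this, a caller would still build the text page explicitly (for example `page.to_text_page(TextPageFlags::DISPLAY)`) and could reuse it for several output formats, which is the main argument for (b) over (a).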

mupdf-sys/wrapper.c

messense

LGTM, just one question.

ginnyTheCat added 2 commits May 26, 2025 05:28

Move to_ methods from Page to TextPage

40e55a4

Fix test failures

02ee9b2

messense reviewed May 26, 2025

messense approved these changes May 26, 2025

Allow specifying whether to add header and trailer

b9a3eef

messense merged commit 2e76d2f into messense:main May 27, 2025
14 checks passed

ginnyTheCat deleted the text_page_to branch May 27, 2025 11:04

ginnyTheCat mentioned this pull request Aug 21, 2025

Release version 0.6.0 #165

Open