Move to_{html,xhtml,xml,text,json} methods from Page to TextPage #143

Merged
merged 3 commits into from
May 27, 2025

ginnyTheCat
Collaborator

This moves to_{html,xhtml,xml,text,json} from Page to TextPage, which is more in line with how the C API itself works. It lets callers specify the TextPageFlags themselves, matching what PyMuPDF selects for the requested format when no explicit flags are passed. On top of that, it avoids constructing an fz_stext_page twice when calling, for example, both to_text and to_json (very visible in tests/test_issues.rs and in examples/).

To fix #69, the code from the issue can now be adapted to:

use std::fs;

use mupdf::{Document, TextPageFlags};

fn main() {
    let doc: Document = Document::open("input.epub").unwrap();

    let page = doc.load_page(7).unwrap();
    let text_page = page.to_text_page(TextPageFlags::PRESERVE_IMAGES).unwrap();
    let html = text_page.to_html(7).unwrap();
    fs::write("out.html", html).unwrap();
}

To fully match PyMuPDF's behaviour here, TextPageFlags::PRESERVE_LIGATURES | TextPageFlags::PRESERVE_WHITESPACE | TextPageFlags::CLIP | TextPageFlags::PRESERVE_IMAGES | TextPageFlags::USE_CID_FOR_UNKNOWN_UNICODE would need to be passed. It might therefore be worth adding either

a) A shortcut function like the one below. This would keep Page::to_html in the same place, even if with slightly changed behaviour, and would, for example, fix #69 without a code change.

impl Page {
    fn to_html(&self) -> Result<String, Error> {
        self.to_text_page(flags_from_above)?.to_html(self.inner.number)
    }
}

or

b) An alias like TextPageFlags::DISPLAY = flags_from_above. This would keep people from doing the fz_page -> fz_stext_page conversion more often than they need to, which can happen when the conversion is hidden inside Page::to_html and they don't see it.

I'm unsure which of these would be better, but either could be added in a future PR anyway.
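
To make option b) a bit more concrete, here is a rough sketch of what such an alias could look like. It uses a small stand-in type instead of the real mupdf::TextPageFlags so that it compiles on its own; the bit values are placeholders rather than the ones from the MuPDF headers, and DISPLAY is only the hypothetical name suggested above, not an existing constant.

```rust
use std::ops::BitOr;

// Stand-in for mupdf::TextPageFlags so the sketch compiles without the crate.
// The bit values below are placeholders, NOT the real MuPDF values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TextPageFlags(u32);

impl TextPageFlags {
    pub const PRESERVE_LIGATURES: Self = Self(1 << 0);
    pub const PRESERVE_WHITESPACE: Self = Self(1 << 1);
    pub const PRESERVE_IMAGES: Self = Self(1 << 2);
    pub const CLIP: Self = Self(1 << 3);
    pub const USE_CID_FOR_UNKNOWN_UNICODE: Self = Self(1 << 4);

    // Hypothetical alias from option b): the flag set listed above
    // as needed to match PyMuPDF's HTML output.
    pub const DISPLAY: Self = Self(
        Self::PRESERVE_LIGATURES.0
            | Self::PRESERVE_WHITESPACE.0
            | Self::CLIP.0
            | Self::PRESERVE_IMAGES.0
            | Self::USE_CID_FOR_UNKNOWN_UNICODE.0,
    );

    // True if every bit set in `other` is also set in `self`.
    pub const fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

impl BitOr for TextPageFlags {
    type Output = Self;

    // Combine two flag sets, mirroring how the flags are OR-ed together above.
    fn bitor(self, rhs: Self) -> Self {
        Self(self.0 | rhs.0)
    }
}
```

With something like this, the example for #69 would pass TextPageFlags::DISPLAY to to_text_page instead of spelling out all five flags, while the fz_page -> fz_stext_page conversion stays visible at the call site.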

Owner

@messense messense left a comment

LGTM, just one question.

@messense messense merged commit 2e76d2f into messense:main May 27, 2025
14 checks passed
@ginnyTheCat ginnyTheCat deleted the text_page_to branch May 27, 2025 11:04
Development

Successfully merging this pull request may close these issues.

Images missing in Page's to_html or to_xhtml output