Skip to content

make link copying more tolerant when adding page #1103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 25, 2025
Merged

Conversation

EliotJones
Copy link
Member

in #1082 and other issues relating to annotations we're running into constraints of the current model of building a pdf document. currently we skip all link type annotations, i think we can support copying of links where the link destination is outside the current document. however the more i look at this code the more i think we need a radical redesign of how document building is done because it has been pushed far beyond its current capabilities, i'll detail my thinking in the related pr in more detail

in #1082 and other issues relating to annotations we're running into
constraints of the current model of building a pdf document. currently
we skip all link type annotations, i think we can support copying of links
where the link destination is outside the current document. however the
more i look at this code the more i think we need a radical redesign of
how document building is done because it has been pushed far beyond
its current capabilities, i'll detail my thinking in the related pr in more
detail
@EliotJones
Copy link
Member Author

EliotJones commented Jul 24, 2025

Proposed redesign notes

PdfDocumentBuilder was never intended to support editing of existing documents but as the library has evolved we have 'hacked' on 3 different primary ways of editing existing documents:

  • PdfDocumentBuilder.AddPage(PdfDocument document, int pageNumber), takes the parsed document as input and copies the specified page to the builder as a new PdfPageBuilder.
  • PdfPageBuilder.CopyFrom(Page srcPage), copy the content from the page in the parsed document and write it to the current page builder in the parent builder.
  • PdfMerger and PdfTextRemover pass through to PdfDocumentBuilder.AddPage

This results in a large number of issues: #635 #1082 #1000 #704 #720 #1035 #947 #936 #912 #878 #1083 etc. This is obviously unsustainable and prevents the library evolving to support other requested features like editing acro forms etc in future.

I believe we should change how we think about adding existing documents to a builder as well as try to remove, as far as possible, the copy from method.

Track added documents

Each PdfDocument instance should have an internally tracked Guid, or maybe we just track by reference, each time a new document is added to the builder we find every indirect reference in that document.

  1. Reserve all free indirect reference numbers from the currently added document at the same numbers. Rather than re-assigning every indirect reference for a document we should avoid changing them if at-all possible.
  2. Log all updated indirect references. If the number in the source document is no-longer free in the builder the next free number should be assigned, or potentially we apply a shift, e.g. num + (1000 * offsetCount) to all references to keep the source-number mapping at some offset.
  3. Keep track of a 'set' state for all things like metadata, pages/catalog/trailer/info/etc. dictionary. The first document wins unless an 'override' flag is passed in for a subsequent document. If the state of these non-page members in the builder is empty/untouched then we copy every value from the first setting document. This is basically every non /Type /Page dictionary, since we, as far as possible, want to copy as much as possible of the source document to avoid regressions.
  4. We need to think intelligently about how we track resources like fonts, I'm wondering if we write these eagerly and only go back and prune unused members when we build the document? This would avoid having to do too much content parsing at add-time, so if subsequent pages from the same document are added that use these fonts they're already in place. Not sure in the current design how much pruning is possible...
  5. When copying a page to a page builder the entire existing content stream should be wrapped in a push/pop pair so that further edits can occur. This avoids issues like CopyFrom and AddPage both cause AddText to be written upside down #614 where global transforms cause misaligned additions.
  6. If we can parse the added fonts (Standard14, TrueType) they should be available as builder fonts.

Challenges

The main challenges here are the rewriting of indirect references on addition and final save. This needs to be approached carefully since every reference needs to be updated recursively.

We also need to somehow do this at the last possible moment since if you have the following code where pages are added with interleaving:

using var inDoc1 = PdfDocument.Open(@"C:\temp\input1.pdf");
using var inDoc2 = PdfDocument.Open(@"C:\temp\input2.pdf");

var builder = new PdfDocumentBuilder(1.9);
builder.DocumentInformation.Author = "Fred";

builder.AddPage(inDoc1, 1);
builder.AddPage(inDoc2, 1);
builder.AddPage(inDoc1, 2);
builder.AddPage(inDoc2, 2);

You don't want to apply remapping of indirect refs in inDoc2 when you add page 1 since page 2 may contain references to page 1 content via forms, annotations, etc and you want to catch any remapping.

How do we handle resource inheritance in the page tree? Ideally we defer this to the last possible moment, i.e. the written document does not support resource inheritance, each leaf page defines its own resource dictionary.

Also consider the name dictionary which can be used to point to pages. If the pointer is to a page that never gets added that will cause further problems. We need some way to track which pages are added then go back and prune them if they are missing. But this is also the case for e.g. annotations and forms.

AI notes

LLMs are very useful when it comes to spec stuff due to my inability to keep all 1,310 pages of the spec in my head. In terms of dictionary merge resolution Ched GPT highlighted the following areas to consider:

Conflict Resolution Strategies by Dictionary

Document Information Dictionary (trailer /Info)

  • First write wins is fine here.
  • Consider merging only selectively (e.g., copy /Producer but let users override /Author, /Title, etc.).
  • Allow overwrite by the existing API

Catalog Dictionary (/Catalog)

  • First write wins is fine for most fields.
  • Special attention needed for: /Names, /Outlines, /AcroForm, /Threads
  • The above can contain references to pages/objects not added.
  • Suggest pruning unused references at finalization.

Name Dictionary (/Names)

  • Needs conflict resolution per sub-tree (Dests, JavaScript, EmbeddedFiles, etc.)
  • Options:
  • First write wins per name tree type, i.e. only one /EmbeddedFiles tree is retained.
  • Allow a merge with name deduplication strategy (/MyDoc.File1, /MyDoc.File2, etc.).
  • Warn on collision or optionally throw with a flag.

Resource Dictionary (/Resources)

  • Each page should have its own, so conflicts are mostly localized.
  • If fonts/images from the same doc are reused across pages, avoid duplication by:
  • Assigning consistent logical names (F1, I2, etc.).
  • Tracking global font/image/XObject registry internally in the builder.
  • Consider de-duplication of identical resources:
  • Hash font dictionaries or content streams (byte-wise or object-wise).
  • Reuse same object reference if identical, saving file size.

Outlines (/Outlines)

  • Fragile if entries reference missing pages.
  • Option 1: Don’t include outlines unless all pages they point to are added.
  • Option 2: Include them but prune any invalid destinations (risky without deep parsing).

AcroForm (/AcroForm)

  • Very tricky: requires merging /Fields, /DR, /NeedAppearances, etc.

Threads, Metadata, MarkInfo, ViewerPreferences

  • First-write-wins should be fine.
  • Could allow override APIs if needed.

@EliotJones
Copy link
Member Author

@BobLd interested on your thoughts on the proposed redesign please since you've had a couple of years dealing with these issues too. Any thoughts on things to look out for?

@EliotJones EliotJones marked this pull request as ready for review July 25, 2025 00:02
@EliotJones EliotJones requested a review from BobLd July 25, 2025 02:49
@BobLd
Copy link
Collaborator

BobLd commented Jul 25, 2025

@EliotJones I agree with your proposed plan, and make a lot of sense to aim for 0.1.12 for the full refactoring. I've had limited exposure on the Creating/Editing part of the PdfPig API so far, so unfortunately I have limited opinion here

@BobLd BobLd merged commit 50f878b into master Jul 25, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants