make link copying more tolerant when adding page #1103

EliotJones · 2025-07-24T01:56:50Z

in #1082 and other issues relating to annotations we're running into constraints of the current model of building a pdf document. currently we skip all link type annotations, i think we can support copying of links where the link destination is outside the current document. however the more i look at this code the more i think we need a radical redesign of how document building is done because it has been pushed far beyond its current capabilities, i'll detail my thinking in the related pr in more detail

EliotJones · 2025-07-24T02:26:15Z

Proposed redesign notes

PdfDocumentBuilder was never intended to support editing of existing documents but as the library has evolved we have 'hacked' on 3 different primary ways of editing existing documents:

PdfDocumentBuilder.AddPage(PdfDocument document, int pageNumber), takes the parsed document as input and copies the specified page to the builder as a new PdfPageBuilder.
PdfPageBuilder.CopyFrom(Page srcPage), copy the content from the page in the parsed document and write it to the current page builder in the parent builder.
PdfMerger and PdfTextRemover pass through to PdfDocumentBuilder.AddPage

This results in a large number of issues: #635 #1082 #1000 #704 #720 #1035 #947 #936 #912 #878 #1083 etc. This is obviously unsustainable and prevents the library evolving to support other requested features like editing acro forms etc in future.

I believe we should change how we think about adding existing documents to a builder as well as try to remove, as far as possible, the copy from method.

Track added documents

Each PdfDocument instance should have an internally tracked Guid, or maybe we just track by reference, each time a new document is added to the builder we find every indirect reference in that document.

Reserve all free indirect reference numbers from the currently added document at the same numbers. Rather than re-assigning every indirect reference for a document we should avoid changing them if at-all possible.
Log all updated indirect references. If the number in the source document is no-longer free in the builder the next free number should be assigned, or potentially we apply a shift, e.g. num + (1000 * offsetCount) to all references to keep the source-number mapping at some offset.
Keep track of a 'set' state for all things like metadata, pages/catalog/trailer/info/etc. dictionary. The first document wins unless an 'override' flag is passed in for a subsequent document. If the state of these non-page members in the builder is empty/untouched then we copy every value from the first setting document. This is basically every non /Type /Page dictionary, since we, as far as possible, want to copy as much as possible of the source document to avoid regressions.
We need to think intelligently about how we track resources like fonts, I'm wondering if we write these eagerly and only go back and prune unused members when we build the document? This would avoid having to do too much content parsing at add-time, so if subsequent pages from the same document are added that use these fonts they're already in place. Not sure in the current design how much pruning is possible...
When copying a page to a page builder the entire existing content stream should be wrapped in a push/pop pair so that further edits can occur. This avoids issues like CopyFrom and AddPage both cause AddText to be written upside down #614 where global transforms cause misaligned additions.
If we can parse the added fonts (Standard14, TrueType) they should be available as builder fonts.

Challenges

The main challenges here are the rewriting of indirect references on addition and final save. This needs to be approached carefully since every reference needs to be updated recursively.

We also need to somehow do this at the last possible moment since if you have the following code where pages are added with interleaving:

using var inDoc1 = PdfDocument.Open(@"C:\temp\input1.pdf");
using var inDoc2 = PdfDocument.Open(@"C:\temp\input2.pdf");

var builder = new PdfDocumentBuilder(1.9);
builder.DocumentInformation.Author = "Fred";

builder.AddPage(inDoc1, 1);
builder.AddPage(inDoc2, 1);
builder.AddPage(inDoc1, 2);
builder.AddPage(inDoc2, 2);

You don't want to apply remapping of indirect refs in inDoc2 when you add page 1 since page 2 may contain references to page 1 content via forms, annotations, etc and you want to catch any remapping.

How do we handle resource inheritance in the page tree? Ideally we defer this to the last possible moment, i.e. the written document does not support resource inheritance, each leaf page defines its own resource dictionary.

Also consider the name dictionary which can be used to point to pages. If the pointer is to a page that never gets added that will cause further problems. We need some way to track which pages are added then go back and prune them if they are missing. But this is also the case for e.g. annotations and forms.

AI notes

LLMs are very useful when it comes to spec stuff due to my inability to keep all 1,310 pages of the spec in my head. In terms of dictionary merge resolution Ched GPT highlighted the following areas to consider:

Conflict Resolution Strategies by Dictionary

Document Information Dictionary (trailer /Info)

First write wins is fine here.
Consider merging only selectively (e.g., copy /Producer but let users override /Author, /Title, etc.).
Allow overwrite by the existing API

Catalog Dictionary (/Catalog)

First write wins is fine for most fields.
Special attention needed for: /Names, /Outlines, /AcroForm, /Threads
The above can contain references to pages/objects not added.
Suggest pruning unused references at finalization.

Name Dictionary (/Names)

Needs conflict resolution per sub-tree (Dests, JavaScript, EmbeddedFiles, etc.)
Options:
First write wins per name tree type, i.e. only one /EmbeddedFiles tree is retained.
Allow a merge with name deduplication strategy (/MyDoc.File1, /MyDoc.File2, etc.).
Warn on collision or optionally throw with a flag.

Resource Dictionary (/Resources)

Each page should have its own, so conflicts are mostly localized.
If fonts/images from the same doc are reused across pages, avoid duplication by:
Assigning consistent logical names (F1, I2, etc.).
Tracking global font/image/XObject registry internally in the builder.
Consider de-duplication of identical resources:
Hash font dictionaries or content streams (byte-wise or object-wise).
Reuse same object reference if identical, saving file size.

Outlines (/Outlines)

Fragile if entries reference missing pages.
Option 1: Don’t include outlines unless all pages they point to are added.
Option 2: Include them but prune any invalid destinations (risky without deep parsing).

AcroForm (/AcroForm)

Very tricky: requires merging /Fields, /DR, /NeedAppearances, etc.

Threads, Metadata, MarkInfo, ViewerPreferences

First-write-wins should be fine.
Could allow override APIs if needed.

EliotJones · 2025-07-24T02:39:45Z

@BobLd interested on your thoughts on the proposed redesign please since you've had a couple of years dealing with these issues too. Any thoughts on things to look out for?

BobLd · 2025-07-25T17:16:22Z

@EliotJones I agree with your proposed plan, and make a lot of sense to aim for 0.1.12 for the full refactoring. I've had limited exposure on the Creating/Editing part of the PdfPig API so far, so unfortunately I have limited opinion here

restore copy link func logic

f8af7da

EliotJones marked this pull request as ready for review July 25, 2025 00:02

EliotJones assigned BobLd Jul 25, 2025

EliotJones mentioned this pull request Jul 25, 2025

rework numeric tokenizer hot path #1104

Merged

EliotJones unassigned BobLd Jul 25, 2025

EliotJones requested a review from BobLd July 25, 2025 02:49

BobLd approved these changes Jul 25, 2025

View reviewed changes

BobLd merged commit 50f878b into master Jul 25, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

make link copying more tolerant when adding page #1103

make link copying more tolerant when adding page #1103

Uh oh!

EliotJones commented Jul 24, 2025

Uh oh!

EliotJones commented Jul 24, 2025 •

edited

Loading

Uh oh!

EliotJones commented Jul 24, 2025

Uh oh!

BobLd commented Jul 25, 2025

Uh oh!

Uh oh!

Uh oh!

make link copying more tolerant when adding page #1103

make link copying more tolerant when adding page #1103

Uh oh!

Conversation

EliotJones commented Jul 24, 2025

Uh oh!

EliotJones commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed redesign notes

Track added documents

Challenges

AI notes

Conflict Resolution Strategies by Dictionary

Document Information Dictionary (trailer /Info)

Catalog Dictionary (/Catalog)

Name Dictionary (/Names)

Resource Dictionary (/Resources)

Outlines (/Outlines)

AcroForm (/AcroForm)

Threads, Metadata, MarkInfo, ViewerPreferences

Uh oh!

EliotJones commented Jul 24, 2025

Uh oh!

BobLd commented Jul 25, 2025

Uh oh!

Uh oh!

Uh oh!

EliotJones commented Jul 24, 2025 •

edited

Loading