FAQ
This is most likely because some fonts are available on one OS but not on another.
On Linux, installing additional fonts should help:
sudo apt install ttf*
See issue
PdfPig does not support all image filters out of the box.
If you have already applied the solution below and images are still missing, it might be because they are contained in a Pattern color; see this issue for a possible solution.
Filters requiring external implementation are: DCT, JPX and JBIG2. You can either implement your own, or use the following NuGet packages:
- PdfPig.Filters.Dct.JpegLibrary
- PdfPig.Filters.Jbig2.PdfboxJbig2
- PdfPig.Filters.Jpx.OpenJpeg
Once the NuGet packages are added, use the following:
// Namespaces below are assumed to mirror the NuGet package names; adjust if your versions differ.
using System.Collections.Generic;
using UglyToad.PdfPig.Filters;
using UglyToad.PdfPig.Filters.Dct.JpegLibrary;
using UglyToad.PdfPig.Filters.Jbig2.PdfboxJbig2;
using UglyToad.PdfPig.Filters.Jpx.OpenJpeg;
using UglyToad.PdfPig.Tokens;

// Create your filter provider
public sealed class MyFilterProvider : BaseFilterProvider
{
    /// <summary>
    /// The single instance of this provider.
    /// </summary>
    public static readonly IFilterProvider Instance = new MyFilterProvider();

    /// <inheritdoc/>
    private MyFilterProvider() : base(GetDictionary())
    {
    }

    private static Dictionary<string, IFilter> GetDictionary()
    {
        // New filters (from the external NuGet packages)
        var jbig2 = new PdfboxJbig2DecodeFilter();
        var jpx = new OpenJpegJpxDecodeFilter();
        var dct = new JpegLibraryDctDecodeFilter();

        // Default filters (shipped with PdfPig)
        var ascii85 = new Ascii85Filter();
        var asciiHex = new AsciiHexDecodeFilter();
        var ccitt = new CcittFaxDecodeFilter();
        var flate = new FlateFilter();
        var runLength = new RunLengthFilter();
        var lzw = new LzwFilter();

        // Map each filter name (and its abbreviation, where one exists) to its implementation
        return new Dictionary<string, IFilter>
        {
            { NameToken.Ascii85Decode.Data, ascii85 },
            { NameToken.Ascii85DecodeAbbreviation.Data, ascii85 },
            { NameToken.AsciiHexDecode.Data, asciiHex },
            { NameToken.AsciiHexDecodeAbbreviation.Data, asciiHex },
            { NameToken.CcittfaxDecode.Data, ccitt },
            { NameToken.CcittfaxDecodeAbbreviation.Data, ccitt },
            { NameToken.DctDecode.Data, dct },
            { NameToken.DctDecodeAbbreviation.Data, dct },
            { NameToken.FlateDecode.Data, flate },
            { NameToken.FlateDecodeAbbreviation.Data, flate },
            { NameToken.Jbig2Decode.Data, jbig2 },
            { NameToken.JpxDecode.Data, jpx },
            { NameToken.RunLengthDecode.Data, runLength },
            { NameToken.RunLengthDecodeAbbreviation.Data, runLength },
            { NameToken.LzwDecode.Data, lzw },
            { NameToken.LzwDecodeAbbreviation.Data, lzw }
        };
    }
}
var parsingOption = new ParsingOptions()
{
    UseLenientParsing = true, // Optional
    SkipMissingFonts = true, // Optional
    FilterProvider = MyFilterProvider.Instance
};
using (var doc = PdfDocument.Open("my_document.pdf", parsingOption))
{
    int i = 0;
    foreach (var page in doc.GetPages())
    {
        foreach (var pdfImage in page.GetImages())
        {
            // Process your images, e.g.:
            if (pdfImage.TryGetPng(out byte[] bytes))
            {
                File.WriteAllBytes($"image_{i++}.png", bytes);
            }
        }
    }
}
Because PdfPig is not an image library, it has limitations when handling images (e.g. image resizing). For more robust image extraction, you can also use https://github.com/BobLd/PdfPig.Rendering.Skia, which relies on SkiaSharp.
First, install the PdfPig.Rendering.Skia NuGet package. Once done, it can be used as follows:
using UglyToad.PdfPig.Rendering.Skia.Helpers;
[...]
using (var document = PdfDocument.Open(_path, SkiaRenderingParsingOptions.Instance))
{
    for (int p = 1; p <= document.NumberOfPages; p++)
    {
        var page = document.GetPage(p);
        foreach (var pdfImage in page.GetImages())
        {
            var skImage = pdfImage.GetSKImage();
            // Use SKImage
        }
    }
}
This is often the case when the document was created using "fake bold" to bold letters. In that case, the document creator duplicates each letter that is supposed to be bold with a slight offset, creating a thicker appearance. When this method is used, the PdfPig letter object will not be flagged as bold.
In order to handle duplicate letters, post-process the page's letter collection with:
letters = DuplicateOverlappingTextProcessor.Get(letters);
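For context, here is a minimal sketch of where that call might sit in a text-extraction loop. It assumes that `DuplicateOverlappingTextProcessor` lives in the `UglyToad.PdfPig.DocumentLayoutAnalysis` namespace and that words are then rebuilt with `NearestNeighbourWordExtractor`. The file name is a placeholder, and the choice of word extractor is only an illustration.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "my_document.pdf" is a placeholder path.
using (var document = PdfDocument.Open("my_document.pdf"))
{
    foreach (var page in document.GetPages())
    {
        // Remove the duplicated, slightly offset letters produced by "fake bold".
        var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);

        // Build words from the de-duplicated letters; calling page.GetWords()
        // directly would start again from the original letter collection.
        var words = NearestNeighbourWordExtractor.Instance.GetWords(letters);

        foreach (var word in words)
        {
            Console.WriteLine(word.Text);
        }
    }
}
```

The de-duplication runs before word extraction, so the overlapping copies never reach the word extractor and cannot produce doubled characters in the output.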