
Commit 01575c9

Parse titles correctly from PDFs when importing (#622)
We used to try to extract the title of any and every document as if it were HTML. However, we import lots of things that aren't HTML (e.g. PDF, CSV, XLS), and trying to parse those as HTML can be slow, pointless, or memory-intensive (the importer recently crashed by running out of memory trying to parse a PDF this way). This change takes more care to only extract titles from HTML and from PDFs, and adds proper PDF parsing so we can actually extract PDF titles instead of failing and wasting huge amounts of memory every time.
1 parent: 534e114
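
For context, the idea behind the change is that the importer picks a title extractor based on the content's media type instead of force-parsing everything as HTML. A minimal sketch of what that dispatch might look like, assuming a hypothetical detect_title helper and media-type strings; only extract_title and extract_pdf_title come from web_monitoring/utils.py in this commit:

from web_monitoring.utils import extract_title, extract_pdf_title

def detect_title(media_type, content_bytes):
    # Illustrative helper, not part of this commit: choose an extractor
    # by media type; skip formats with no cheap, reliable title.
    if media_type in ('text/html', 'application/xhtml+xml'):
        return extract_title(content_bytes)
    elif media_type == 'application/pdf':
        return extract_pdf_title(content_bytes)
    else:
        # e.g. CSV, XLS: don't try to parse these as HTML at all.
        return None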

2 files changed: 8 additions, 0 deletions
requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ git+https://github.com/anastasia/htmldiffer@develop
 git+https://github.com/danielballan/htmltreediff@customize
 html5-parser ~=0.4.9 --no-binary lxml
 lxml ~=4.5.2
+PyPDF2 ~=1.26.0
 sentry-sdk ~=0.16.3
 requests ~=2.24.0
 toolz ~=0.10.0

web_monitoring/utils.py

Lines changed: 7 additions & 0 deletions

@@ -4,6 +4,7 @@
 import logging
 import lxml.html
 import os
+from PyPDF2 import PdfFileReader
 import queue
 import re
 import requests
@@ -35,6 +36,12 @@ def extract_title(content_bytes, encoding='utf-8'):
     return WHITESPACE_PATTERN.sub(' ', title.text.strip())
 
 
+def extract_pdf_title(content_bytes):
+    pdf = PdfFileReader(io.BytesIO(content_bytes))
+    info = pdf.getDocumentInfo()
+    return info.title
+
+
 def hash_content(content_bytes):
     "Create a version_hash for the content of a snapshot."
     return hashlib.sha256(content_bytes).hexdigest()
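
A quick usage sketch for the new helper, assuming a hypothetical sample file name; with PyPDF2 ~=1.26, getDocumentInfo().title returns the /Title entry from the document's info dictionary, or None when that entry is missing:

from web_monitoring.utils import extract_pdf_title

with open('example.pdf', 'rb') as f:  # hypothetical sample file
    title = extract_pdf_title(f.read())

# title is the PDF's /Title metadata string, or None if the PDF
# carries no title in its info dictionary.
print(title)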
