Skip to content

Commit bc1df2f

Browse files
authored
HTML diff: Maintain spacing in text/tags (#40)
It turns out that some of the display breakage we've had with `html_render_diff` is happening because we are nicely reformatting the source on pages that use `white-space: pre;` or similar CSS styling to treat the spacing in the source code as significant. For example, the following will display poorly, but it's what we were outputting: <!-- Before Version --> <div style="white-space: pre;">Some text</div> <!-- Before Version --> <div style="white-space: pre;"></div> <!-- Diff Output --> <div style="white-space: pre;"> <del> Some text </del> </div> Solve this by working harder to maintain the original spacing and not unintentionally reformatting the source, so we now output: <!-- Diff Output --> <div style="white-space: pre;"><del>Some text</del></div> Fixes #36.
1 parent 20cb995 commit bc1df2f

File tree

3 files changed

+60
-8
lines changed

3 files changed

+60
-8
lines changed

docs/source/release-history.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@ In Development
77

88
- Fixes an issue where the diffing server could reset the process pool that manages the actual diffs multiple times unnecessarily, leading to wasted memory and CPU. If you are tracking logs and errors, this will also make error messages about the diffing server clearer — you’ll see “BrokenProcessPool” instead of “'NoneType' object does not support item assignment.” (`#38 <https://github.com/edgi-govdata-archiving/web-monitoring-diff/issues/38>`_)
99

10+
- Fixes :func:`web_monitoring_diff.html_diff_render` to make sure the spacing of text and tags in the HTML source code of the diff matches the original. This resolves display issues on pages where CSS is used to treat spacing as significant. (`#36 <https://github.com/edgi-govdata-archiving/web-monitoring-diff/issues/36>`_)
11+
1012

1113
Version 0.1.0
1214
-------------

web_monitoring_diff/html_render_diff.py

Lines changed: 19 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -435,18 +435,18 @@ def html_diff_render(a_text, b_text, a_headers=None, b_headers=None,
435435
soup.head.append(change_styles)
436436

437437
soup.body.replace_with(diff_body)
438-
# The method we use above to append HTML strings (the diffs) to the soup
439-
# results in a non-navigable soup. So we serialize and re-parse :(
440-
# (Note we use no formatter for this because proper encoding escapes
441-
# the tags our differ generated.)
442-
soup = html5_parser.parse(soup.prettify(formatter=None),
443-
treebuilder='soup', return_root=False)
444438
runtime_scripts = soup.new_tag('script', id='wm-diff-script')
445439
runtime_scripts.string = UPDATE_CONTRAST_SCRIPT
446440
soup.body.append(runtime_scripts)
447441
if diff_type == 'combined':
448442
_deactivate_deleted_active_elements(soup)
449-
results[diff_type] = soup.prettify(formatter='minimal')
443+
# Convert to a string instead of prettifying. `prettify()` will always
444+
# add extra space around non-inline elements, even if `formatter` is
445+
# "minimal" or None. This is a problem because the page may use pre-
446+
# formatted text or use CSS to make block elements display as inline,
447+
# etc. There are lots of situations where the spacing really matters,
448+
# so we want to make sure not to alter it.
449+
results[diff_type] = str(soup)
450450

451451
return results
452452

@@ -506,7 +506,18 @@ def diff_elements(old, new, comparator, include='all'):
506506
def fill_element(element, diff):
507507
result_element = copy.copy(element)
508508
result_element.clear()
509-
result_element.append(diff)
509+
# At this point, `diff` is an HTML *string*, so we need to parse it
510+
# before we can safely insert it into a soup. (We used to insert it as
511+
# a string and do some funny tricks, but this led to other issues.)
512+
# TODO: _htmldiff() should return a tree of soup tags rather than a
513+
# list of strings, so we don't need to re-parse here.
514+
parsed_diff = html5_parser.parse(
515+
f'<!doctype html>\n<html><body>{diff}</body></html>',
516+
treebuilder='soup', return_root=False)
517+
# `.contents/.children` are *live*, so cache their output into a new
518+
# list before moving them to the new container -- otherwise we'll miss
519+
# some because the contents change in the middle of moving.
520+
result_element.extend(list(parsed_diff.body.children))
510521
return result_element
511522

512523
results = {}

web_monitoring_diff/tests/test_html_diff_validity.py

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,21 @@
1515
from web_monitoring_diff.html_render_diff import html_diff_render
1616

1717

18+
# Helpers
19+
HEAD_PATTERN = re.compile(r'<head>(.*?)</head>', re.DOTALL)
20+
BODY_PATTERN = re.compile(r'<body.*?>(.*)</body>', re.DOTALL)
21+
EXTRAS_PATTERN = re.compile(r'<(style|script) id="wm-diff-.*?</\1>', re.DOTALL)
22+
23+
def get_head(html):
24+
return HEAD_PATTERN.search(html).group(1)
25+
26+
def get_body(html):
27+
return BODY_PATTERN.search(html).group(1)
28+
29+
def remove_extras(html):
30+
return EXTRAS_PATTERN.sub('', html)
31+
32+
1833
# TODO: extend these to other html differs via parameterization, a la
1934
# `test_html_diff.py`. Most of these are written generically enough they could
2035
# feasibly work with any visual HTML diff routine.
@@ -26,6 +41,30 @@ def test_html_diff_render_works_on_pages_with_no_head():
2641
assert result
2742

2843

44+
def test_html_diff_render_maintains_existing_content_spacing():
45+
"""
46+
The diff should not add spaces to make the HTML source look nicely
47+
formatted, since this can change how the page is laid out and styled -- any
48+
space in the doc may be significant if an element has CSS like:
49+
white-space: pre;
50+
"""
51+
result = html_diff_render(
52+
'<html><body>Hello <div>Alice</div></body></html>',
53+
'<html><body>Hello <div>Sally</div></body></html>',
54+
include='all')
55+
56+
# Just look at the body (the head can have nice space) and remove extra
57+
# styling/scripting, which can have space inside it because it doesn't
58+
# contribute to the content.
59+
deletions = remove_extras(get_body(result['deletions']))
60+
insertions = remove_extras(get_body(result['insertions']))
61+
combined = remove_extras(get_body(result['combined']))
62+
63+
assert deletions == 'Hello <div><del class="wm-diff">Alice</del></div>'
64+
assert insertions == 'Hello <div><ins class="wm-diff">Sally</ins></div>'
65+
assert combined == 'Hello <div><del class="wm-diff">Alice</del><ins class="wm-diff">Sally</ins></div>'
66+
67+
2968
def test_html_diff_render_does_not_encode_embedded_content():
3069
html = '<script>console.log("uhoh");</script> ok ' \
3170
'<style>body {font-family: "arial";}</style>'

0 commit comments

Comments
 (0)