use list-based processing (inspired by AlextheYounga) #186

chrispy-snps · 2025-02-05T14:18:06Z

This pull request does the following:

Uses list-based processing to speed up conversion of parent elements that contain many child elements
Partially fixes #185 ("Newlines inside <pre> blocks are collapsed instead of preserved") by not collapsing some newlines inside <pre> elements

Consider the following child elements of a <div>:

<div>
  <p>line 1</p>
  <p>line 2</p>
</div>

As we process each child, we use a regex to split its conversion function result into (1) leading newlines, (2) content, and (3) trailing newlines:

["\n\n", "line 1", "\n\n"]

Before we accumulate the current child onto our running list, we collapse the trailing newlines of the previous child with the leading newlines of the current child, keeping the longer of the two runs (similar to how CSS margins work):

["\n\n", "line 1", "\n\n"] + ["\n\n", "line 2", "\n\n"]
#               max(^^^^   ,   ^^^^)

The child is then accumulated onto the running list with its collapsed newlines:

["\n\n", "line 1", "\n\n", "line 2", "\n\n"]
#                   ^^^^
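
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in this pull request: the names `SPLIT_NEWLINES`, `split_newlines()`, and `collapse_children()` are hypothetical, and the regex is just one pattern that produces the three-way split described above.

```python
import re

# Assumed pattern (not necessarily the one used in the PR): leading newlines,
# content that ends in a non-newline character, and trailing newlines.
SPLIT_NEWLINES = re.compile(r'^(\n*)((?:.*[^\n])?)(\n*)$', flags=re.DOTALL)


def split_newlines(text):
    """Split text into (leading newlines, content, trailing newlines)."""
    return SPLIT_NEWLINES.match(text).groups()


def collapse_children(child_texts):
    """Accumulate child results in one list, collapsing adjacent newline runs."""
    parts = []
    for text in child_texts:
        leading, content, trailing = split_newlines(text)
        if parts:
            # Replace the previous child's trailing run and this child's
            # leading run with whichever of the two is longer.
            prev_trailing = parts.pop()
            leading = max(prev_trailing, leading, key=len)
        parts.extend([leading, content, trailing])
    return parts


parts = collapse_children(["\n\nline 1\n\n", "\n\nline 2\n\n"])
markdown = "".join(parts)  # a single join at the end, not repeated concatenation
```

Joining the list once at the end avoids rebuilding an ever-growing string for every child, which is where the speedup for parents with many children would come from.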

Currently, inline whitespace collapsing is implemented ad-hoc across multiple element-specific conversion functions. In the future, perhaps this inline whitespace collapsing could be deferred to a similar code path in process_element(). This would solve a lot of corner cases with nested inline elements that accumulate uncollapsed spaces at their boundaries.

chrispy-snps · 2025-02-15T13:04:02Z

@AlextheYounga - what are your thoughts on this approach?

AlextheYounga · 2025-02-17T02:31:35Z

Looking at this now. Once again, sorry for the delay.

AlextheYounga · 2025-02-17T02:58:49Z

I am running a test right now just to say I did due diligence but this code looks excellent. There is now far more control over newlines, like you were saying in our previous conversation.

AlextheYounga · 2025-02-17T03:34:14Z

Just going to expand on some of the comments I made in #162

17 seconds for a 35MB file is crazy fast. I personally love it and would prefer prioritizing the speed over newline discrepancies, but that is my opinion. I also think these changes make the code far more readable and give us far more control over how we approach newlines.

But I should note that this seems to have failed our previous tests. Maybe these tests are outdated, but I just want to note them here in case there is something important here I don't fully understand. You know more than me here.

Here are the errors I get on the tests I was running. The test_newlines test was a test I wrote just to test the data you had provided me previously.

And this change does add substantially more newlines. On my 35MB test, I saw 25k more newlines than our previous test.

But it's also 35x faster than my change, and 80x faster than the original!

I say send it!

AlextheYounga · 2025-02-17T03:36:52Z

I can also see what you mean by the difference here on how nested certain items are. This particular document we are testing is kind of odd because the entire document is flat. I will see if I can find a very large nested document to test this with.

chrispy-snps · 2025-02-17T13:40:57Z

@AlextheYounga - the additional newlines are an unanticipated side effect of the previous pull request #184 ("make conversion non-destructive..."). The IRS document contains HTML comments intermixed with block elements:

<p>content</p>
<p>content</p>
<!--COMMENT-->
<p>content</p>
<!--COMMENT-->
<!--COMMENT-->

Because the code no longer deletes elements, the whitespace-rejection code in _can_ignore() sees comments instead of adjacent block elements and thus the whitespace is not ignored.
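
To illustrate the effect, here is a toy model. It is not the real `_can_ignore()` implementation: the function `ignorable_whitespace()`, the sibling tags, and the rule "whitespace is ignorable when it touches a block element on either side" are all simplifying assumptions. Each sibling is tagged as `"block"`, `"comment"`, or `"ws"` (whitespace-only text), in the same order as the HTML above.

```python
def ignorable_whitespace(kinds):
    """Return indexes of "ws" siblings that touch a block element on either side."""
    result = []
    for i, kind in enumerate(kinds):
        if kind != "ws":
            continue
        prev_kind = kinds[i - 1] if i > 0 else None
        next_kind = kinds[i + 1] if i + 1 < len(kinds) else None
        # Assumed rule: an adjacent block element makes the whitespace ignorable.
        if prev_kind == "block" or next_kind == "block":
            result.append(i)
    return result


# <p>, <p>, comment, <p>, comment, comment, with whitespace between each pair
kinds = ["block", "ws", "block", "ws", "comment", "ws",
         "block", "ws", "comment", "ws", "comment"]
ignored = ignorable_whitespace(kinds)
```

In this model the whitespace at index 9 sits between two comments, so it is not ignored and survives into the output.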

I will try to fix this in a follow-up pull request, but let's get some of the runtime improvements merged in first so we can properly assess any runtime change for this comment/whitespace fix.

chrispy-snps mentioned this pull request Feb 5, 2025

perf: improve parsing performance from o(n^2) to o(n) #162

Closed

use list-based processing (inspired by AlextheYounga)

5b79b92

Signed-off-by: Chris Papademetrious <chrispy@synopsys.com> Signed-off-by: chrispy <chrispy@synopsys.com>

chrispy-snps force-pushed the chrispy/use-list-processing branch from 1b4750c to 5b79b92 Compare February 15, 2025 14:08

use greedy regex instead of reluctant regex for speed

16f2d8e

Signed-off-by: chrispy <chrispy@synopsys.com>

AlexVonB approved these changes Feb 17, 2025

chrispy-snps merged commit a83577d into matthewwithanm:develop Feb 17, 2025
1 check passed

chrispy-snps deleted the chrispy/use-list-processing branch February 17, 2025 13:44

chrispy-snps added a commit that referenced this pull request Feb 17, 2025

use list-based processing (inspired by AlextheYounga) (#186)

c52ba47

chrispy-snps mentioned this pull request Feb 17, 2025

propagate parent tag context downward to improve runtime #191

Merged

Wuhall pushed a commit to Wuhall/python-markdownify that referenced this pull request May 21, 2025

use list-based processing (inspired by AlextheYounga) (matthewwithanm…

9b49570

…#186)
