
perf: improve parsing performance from o(n^2) to o(n) #162


Closed
wants to merge 3 commits

Conversation


@AlextheYounga AlextheYounga commented Dec 17, 2024

The function process_tag was previously concatenating strings inside a loop. Each + operation creates a new string, resulting in repeated copying of already accumulated data. This logic is of complexity O(n²).

We can take this from a quadratic function to a linear function by using a list, appending to it, and then returning a ''.join() at the end. This logic is of complexity O(n).
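
In minimal sketch form (using a hypothetical convert() helper rather than the library's actual code), the pattern change is:

    def convert(el):
        # stand-in for the real per-node conversion
        return str(el)

    children = ['a', 'b', 'c']

    # Quadratic: each += copies everything accumulated so far
    text = ''
    for el in children:
        text += convert(el)

    # Linear: collect chunks in a list and join once at the end
    text_parts = []
    for el in children:
        text_parts.append(convert(el))
    text = ''.join(text_parts)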

After this change, the time it took to convert a 20.8MB HTML file (US social security tax law document) went from 18.94 minutes to 4.11 minutes, a speed increase of 4.6x.

# Before change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 1136.8312180042267 seconds

# After change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 246.62110209465027 seconds

All tox tests are also passing.

I also added .python-version to the .gitignore file (I was using pyenv).

@AlextheYounga AlextheYounga changed the title from "feat: improve parsing performance from o(n^2) to o(n)" to "perf: improve parsing performance from o(n^2) to o(n)" on Dec 17, 2024
@AlextheYounga
Author

AlextheYounga commented Dec 18, 2024

I also added a few early checks when iterating over node children to prevent unnecessary function calls. I won't make any more changes to this branch in the interest of limiting complexity.

    for i, el in enumerate(children):
        # Quick type check first to avoid unnecessary function calls
        if not isinstance(el, NavigableString):
            continue

        # Check if the text is entirely whitespace first
        text = six.text_type(el)
        if text.strip():
            continue

        # Determine if we can extract based on position and adjacency
        can_extract = (
            (should_remove_inside and (i == 0 or i == len(children) - 1))
            or (i > 0 and should_remove_whitespace_outside(children[i - 1]))
            or (i < len(children) - 1
                and should_remove_whitespace_outside(children[i + 1]))
        )

        # Extract if conditions are met
        if can_extract:
            el.extract()

@AlextheYounga
Author

AlextheYounga commented Dec 20, 2024

@matthewwithanm I just want to make sure you saw this, I think you might like this. Let me know if you'd like me to make any changes.

Thanks for creating this library 🫡

@matthewwithanm
Owner

matthewwithanm commented Dec 21, 2024

Yes! I haven't looked at the code but it sounds great! Thank you!

@AlexVonB has been managing the repo for a while though. He'll get around to it eventually but it's a busy time of year 🙂

@AlexVonB
Collaborator

Hey all, I might have time during the holidays to check in on the issue. At first glance it looks good, but I would like to take the time to completely understand every change. I wish you quiet holidays :)

@AlexVonB AlexVonB self-assigned this Dec 23, 2024
@chrispy-snps
Collaborator

@AlextheYounga - thanks for this pull request! I also get an impressive runtime reduction (379 seconds to 86 seconds on one of our larger HTML documents).

When I diff the new output against the old output, I see extra newlines. It would be good for the output to diff identically. You can use the following HTML to reproduce the issue:

<article>
    <h1>Heading 1</h1>
    <div>
        <p>article body</p>
    </div>
    <article>
        <h2>Heading 2</h2>
        <div>
            <p>article body</p>
        </div>
    </article>
    <p>footnote</p>
</article>

@AlextheYounga
Author

@chrispy-snps Thanks for catching this. If it's adding newlines, that is no bueno. I will see if I can correct this and perhaps add an extra test to catch this.

for el in node.children:
    if isinstance(el, Comment) or isinstance(el, Doctype):
        continue
    elif isinstance(el, NavigableString):
-       text += self.process_text(el)
+       text_parts.append(self.process_text(el))
    else:
Author

@AlextheYounga AlextheYounga Jan 8, 2025


Just an update here that I am still working on this. I have added a test and improved the readability of the code so as to better identify the problem.

I tried to copy over as much as I could from the original code, but I think some of that code doesn't translate well to this new loop paradigm. I am positive that the value of newlines_left has changed.

In the original function we are grabbing the number of trailing line breaks from the full accumulated text and using this to compute newlines_left:
https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60

Original:

else:
    text_strip = text.rstrip('\n')
    newlines_left = len(text) - len(text_strip)

However, in this new structure we are trying to break away from concatenating a string on each loop iteration, the source of the quadratic behavior, but that comes with its own tradeoffs: it is no longer straightforward to access the full accumulated text. So now we are only counting the number of newlines in the last "chunk" in the array, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.

New:

else:
    text_strip = ''
    newlines_left = 0
    if text_parts:
        last_chunk = text_parts.pop()
        text_strip = last_chunk.rstrip('\n')
        newlines_left = len(last_chunk) - len(text_strip)
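
To make the mismatch concrete (the chunk contents below are invented for illustration): when the trailing newlines span more than one list item, counting only the last chunk undercounts them.

    text_parts = ['Heading 1\n\n', 'article body\n', '\n']

    # Old approach: count the trailing newlines of the full accumulated string
    full_text = ''.join(text_parts)
    print(len(full_text) - len(full_text.rstrip('\n')))    # 2

    # New approach: count the trailing newlines of the last chunk only
    last_chunk = text_parts[-1]
    print(len(last_chunk) - len(last_chunk.rstrip('\n')))  # 1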

Collaborator

@chrispy-snps chrispy-snps Jan 8, 2025


@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item are fixed and complete, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does convert_children_as_newline determine this?)

For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next item: trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
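
A rough sketch of that post-processing idea (not code from either branch; collapse_newlines_between is a hypothetical helper):

    def collapse_newlines_between(parts):
        # Cap the newline run formed where one chunk ends and the next begins at 2
        result = []
        for part in parts:
            if result:
                trailing = len(result[-1]) - len(result[-1].rstrip('\n'))
                leading = len(part) - len(part.lstrip('\n'))
                excess = (trailing + leading) - 2
                if excess > 0:
                    # trim the excess from the current chunk's leading newlines first
                    trim = min(excess, leading)
                    part = part[trim:]
                    excess -= trim
                    if excess > 0:
                        # then from the previous chunk's trailing newlines
                        result[-1] = result[-1][:len(result[-1]) - excess]
            result.append(part)
        return result

    print(repr(''.join(collapse_newlines_between(['Heading\n\n\n', '\nbody\n']))))
    # 'Heading\n\nbody\n' -- the run of newlines is capped at two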

Collaborator

@chrispy-snps chrispy-snps Jan 26, 2025


@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.

Edit: I merged in the upstream develop branch into your pull request branch and resolved the conflicts.

Author


Apologies, I have been very busy recently. Honestly another pair of eyes here would help, but I am looking into this again now.

@AlextheYounga AlextheYounga force-pushed the develop branch 2 times, most recently from 5dd8301 to be96173 on January 29, 2025 07:20
The function process_tag was previously concatenating strings
inside a loop. Each + operation creates a new string, resulting
in repeated copying of already accumulated data. By replacing
this with an array we append to, and then joining this array at
the end, we can take this from a quadratic function to a linear
function.
@AlextheYounga
Author

AlextheYounga commented Jan 29, 2025

I think I'm closer, but I don't think the behavior is identical yet.

Using the list-based approach, there may be times when the last chunk is composed entirely of newline characters, and this throws off the newlines calculation completely. I gave up trying to think of a clever way to use the code we already had and instead began writing a new function to deal with these newline chunks (and ChatGPT helped here).

This new function pop_trailing_newlines() resolves the issues we were seeing in the example HTML we were using and passes the newline test I created, but I ran both current develop and my branch on the US Internal Revenue Code (34MB) and there is a discrepancy of about 100 lines (although that's not even a drop in the bucket for the IRC). We're also sacrificing a little bit of our performance here: down from a 4.6x boost to a 2.2x boost after this change. :(

There may also be ways to refactor this function. Not a trivial problem, for me at least.

    def pop_trailing_newlines(self, text_parts):
        newlines_total = 0
        # 1) First pop off any chunks that are newlines
        newline_chunks = []
        while text_parts and text_parts[-1].rstrip('\n') == '':
            newline_chunks.append(text_parts.pop())
        if newline_chunks:
            # Example: if we had ["\n", "\n", "\n"] at the very end
            all_newlines = ''.join(reversed(newline_chunks))
            newlines_total += len(all_newlines)

        # 2) Now look at one more chunk which might have some real text + trailing newlines
        text_without_newline = ''
        if text_parts:
            last_chunk = text_parts.pop()
            stripped = last_chunk.rstrip('\n')
            newlines_total += (len(last_chunk) - len(stripped))
            text_without_newline = stripped

        return (text_without_newline, newlines_total)
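
For context, roughly how this helper would slot into the block-element branch of the loop in place of the old rstrip logic (a paraphrase with approximate argument names, not the exact branch code):

    text_strip, newlines_left = self.pop_trailing_newlines(text_parts)
    next_text = self.process_tag(el, convert_children_as_inline)
    next_text_strip = next_text.lstrip('\n')
    newlines_right = len(next_text) - len(next_text_strip)
    # keep the larger of the two newline runs between the chunks
    newlines = '\n' * max(newlines_left, newlines_right)
    text_parts.extend([text_strip, newlines, next_text_strip])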

@chrispy-snps
Collaborator

@AlextheYounga - where can I find the test HTML document you are using?

@chrispy-snps
Collaborator

@AlextheYounga - I have some code that retains the 4x performance benefit and also partially fixes #185. However, it's based on some of my other recent pull requests so I need to wait for those to land.

I don't want to open a separate pull request for my code (you deserve the credit for this approach!), so I suggest that (1) we wait for the pending open pull requests to be merged, (2) I open a pull request on your branch so we can review and sync up, (3) we merge it into your branch, (4) your branch is reviewed and merged into the project.

What do you think?

@AlextheYounga
Author

I would love to see that code, and I greatly appreciate your help and understanding here. Yes, let's sync up after the pending pull requests are merged, and I will keep thinking of other approaches.

The big document I was testing can be downloaded here:
https://uscode.house.gov/download/releasepoints/us/pl/118/250not159/htm_usc26@118-250not159.zip

(I was originally using this library to convert the US Federal Code into markdown).

@chrispy-snps
Collaborator

chrispy-snps commented Feb 5, 2025

@AlextheYounga - for now, I posted my implementation in pull request #186.

I'm not sure I understand these results, but here is what I get with the 35MB test document you linked:

develop - 4953 seconds
#162 (AlextheYounga/develop) - 1058 seconds
#186 (chrispy/use-list-processing) - 24 seconds

Note that my variant is not always faster. For HTML files that are hierarchically deep instead of broad, my <pre> fix using find_parent() in the main processing loop introduces some slowdown, because it is called for every tag in the document, and find_parent() searches the ancestry up to the root every time. But, I have an idea to improve this via context propagation in a future pull request.
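
The context-propagation idea could look roughly like this (a standalone sketch of the concept, not code from #186): pass an in_pre flag down the recursion instead of calling find_parent('pre') for every tag.

    from bs4 import BeautifulSoup, NavigableString

    def walk(node, in_pre=False):
        # Propagate "am I inside a <pre>?" down the recursion instead of
        # walking the ancestry with find_parent() for every tag.
        in_pre = in_pre or node.name == 'pre'
        for el in node.children:
            if isinstance(el, NavigableString):
                yield (str(el), in_pre)
            else:
                yield from walk(el, in_pre=in_pre)

    soup = BeautifulSoup('<div><pre>a  b</pre><p>c</p></div>', 'html.parser')
    for text, in_pre in walk(soup.div):
        print(repr(text), in_pre)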

Also, I still see extra newlines in my branch's output, where I was expecting that they would all be collapsed to no more than two newlines in a row (due to the min() I perform when collapsing newlines). I need to look into this.

@AlextheYounga
Author

AlextheYounga commented Feb 17, 2025

Holy SMOKES @chrispy-snps . I was just able to replicate your finding here, but 17 seconds for me! You've gone to plaid.

Now, I noticed that the test constraints changed here; I ran your updates against the tests I was using, and they failed test_basic.py as well as the newlines test I made based on the test data you provided. I also noticed a discrepancy of about 25k newlines on our big 35MB file.

But honestly, who cares. This speed boost is a massive game changer. It still looks like Markdown to me 😄

I say send it if you ask me. I think everyone would appreciate the massive speed boost rather than nitpick over newlines. I should say I am being slightly selfish here, because there are other projects I'm working on where a massive speed boost in this package would be VERY helpful.

@chrispy-snps
Collaborator

@AlextheYounga - I merged #186, so this issue can be closed. THANK YOU for noticing this problem and proposing this approach!

In #186, I don't get any test failures. If you can share your local newlines tests, I'd like to have a look at that too.

The extra newlines are a consequence of #184; see more details here.

Also, if you still want to add .python-version to the .gitignore, feel free to open a pull request!

@chrispy-snps
Collaborator

Implemented by #186.
