
perf: improve parsing performance from o(n^2) to o(n) #162


Closed
wants to merge 3 commits

Conversation


@AlextheYounga AlextheYounga commented Dec 17, 2024

The function process_tag was previously concatenating strings inside a loop. Each + operation creates a new string, resulting in repeated copying of already accumulated data. This logic is of complexity O(n²).

We can take this from a quadratic function to a linear function by using a list, appending to it, and then returning a ''.join() at the end. This logic is of complexity O(n).
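
In minimal sketch form (using a hypothetical convert() helper rather than the library's actual code), the pattern change is:

    def convert(el):
        # stand-in for the real per-node conversion
        return str(el)

    children = ['a', 'b', 'c']

    # Quadratic: each += copies everything accumulated so far
    text = ''
    for el in children:
        text += convert(el)

    # Linear: collect chunks in a list and join once at the end
    text_parts = []
    for el in children:
        text_parts.append(convert(el))
    text = ''.join(text_parts)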

After this change, the time it took to convert a 20.8MB HTML file (US social security tax law document) went from 18.94 minutes to 4.11 minutes, a speed increase of 4.6x.

# Before change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 1136.8312180042267 seconds

# After change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 246.62110209465027 seconds

All tox tests are also passing.

I also added .python-version to the .gitignore file (I was using pyenv).

@AlextheYounga AlextheYounga changed the title from "feat: improve parsing performance from o(n^2) to o(n)" to "perf: improve parsing performance from o(n^2) to o(n)" on Dec 17, 2024
@AlextheYounga
Author

AlextheYounga commented Dec 18, 2024

I also added a few early checks when iterating over node children to prevent unnecessary function calls. I won't make any more changes to this branch in the interest of limiting complexity.

    for i, el in enumerate(children):
        # Quick type check first to avoid unnecessary function calls
        if not isinstance(el, NavigableString):
            continue

        # Check if the text is entirely whitespace first
        text = six.text_type(el)
        if text.strip():
            continue

        # Determine if we can extract based on position and adjacency
        can_extract = (
            (should_remove_inside and (i == 0 or i == len(children) - 1))
            or (i > 0 and should_remove_whitespace_outside(children[i - 1]))
            or (i < len(children) - 1
                and should_remove_whitespace_outside(children[i + 1]))
        )

        # Extract if conditions are met
        if can_extract:
            el.extract()

@AlextheYounga
Author

AlextheYounga commented Dec 20, 2024

@matthewwithanm I just want to make sure you saw this, I think you might like this. Let me know if you'd like me to make any changes.

Thanks for creating this library 🫡

@matthewwithanm
Owner

matthewwithanm commented Dec 21, 2024

Yes! I haven't looked at the code but it sounds great! Thank you!

@AlexVonB has been managing the repo for a while though. He'll get around to it eventually but it's a busy time of year 🙂

@AlexVonB
Collaborator

Hey all, I might have time during the holidays to check in on the issue. At first glance it looks good, but I would like to take the time to completely understand every change. I wish you quiet holidays :)

@AlexVonB AlexVonB self-assigned this Dec 23, 2024
@chrispy-snps
Collaborator

@AlextheYounga - thanks for this pull request! I also get an impressive runtime reduction (379 seconds to 86 seconds on one of our larger HTML documents).

When I diff the new output against the old output, I see extra newlines. It would be good for the output to diff identically. You can use the following HTML to reproduce the issue:

<article>
    <h1>Heading 1</h1>
    <div>
        <p>article body</p>
    </div>
    <article>
        <h2>Heading 2</h2>
        <div>
            <p>article body</p>
        </div>
    </article>
    <p>footnote</p>
</article>

@AlextheYounga
Author

@chrispy-snps Thanks for catching this. If it's adding newlines, that is no bueno. I will see if I can correct this and perhaps add an extra test to catch this.

for el in node.children:
    if isinstance(el, Comment) or isinstance(el, Doctype):
        continue
    elif isinstance(el, NavigableString):
-       text += self.process_text(el)
+       text_parts.append(self.process_text(el))
    else:
Author

@AlextheYounga AlextheYounga Jan 8, 2025


Just an update here that I am still working on this. I have added a test and improved the readability of the code so as to better identify the problem.

I tried to copy over as much as I could from the original code, but I think some of that code doesn't translate well to this new loop paradigm. I am positive that the value of newlines_left has changed.

In the original function we are grabbing the number of trailing line breaks from the full accumulated text and using this to compute newlines_left:
https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60

Original:

else:
    text_strip = text.rstrip('\n')
    newlines_left = len(text) - len(text_strip)

However, in this new structure we are trying to break away from concatenating a string on each loop iteration, the source of the quadratic behavior, but that comes with its own tradeoffs: it is no longer straightforward to access the full accumulated text. So now we are only counting the number of newlines in the last "chunk" in the array, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.

New:

else:
    text_strip = ''
    newlines_left = 0
    if text_parts:
        last_chunk = text_parts.pop()
        text_strip = last_chunk.rstrip('\n')
        newlines_left = len(last_chunk) - len(text_strip)
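
To make the mismatch concrete (the chunk contents below are invented for illustration): when the trailing newlines span more than one list item, counting only the last chunk undercounts them.

    text_parts = ['Heading 1\n\n', 'article body\n', '\n']

    # Old approach: count the trailing newlines of the full accumulated string
    full_text = ''.join(text_parts)
    print(len(full_text) - len(full_text.rstrip('\n')))    # 2

    # New approach: count the trailing newlines of the last chunk only
    last_chunk = text_parts[-1]
    print(len(last_chunk) - len(last_chunk.rstrip('\n')))  # 1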

Collaborator

@chrispy-snps chrispy-snps Jan 8, 2025


@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item are fixed and complete, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does convert_children_as_newline determine this?)

For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next item: trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
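
A rough sketch of that post-processing idea (not code from either branch; collapse_newlines_between is a hypothetical helper):

    def collapse_newlines_between(parts):
        # Cap the newline run formed where one chunk ends and the next begins at 2
        result = []
        for part in parts:
            if result:
                trailing = len(result[-1]) - len(result[-1].rstrip('\n'))
                leading = len(part) - len(part.lstrip('\n'))
                excess = (trailing + leading) - 2
                if excess > 0:
                    # trim the excess from the current chunk's leading newlines first
                    trim = min(excess, leading)
                    part = part[trim:]
                    excess -= trim
                    if excess > 0:
                        # then from the previous chunk's trailing newlines
                        result[-1] = result[-1][:len(result[-1]) - excess]
            result.append(part)
        return result

    print(repr(''.join(collapse_newlines_between(['Heading\n\n\n', '\nbody\n']))))
    # 'Heading\n\nbody\n' -- the run of newlines is capped at two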

Collaborator

@chrispy-snps chrispy-snps Jan 26, 2025


@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.

Edit: I merged in the upstream develop branch into your pull request branch and resolved the conflicts.

Author


Apologies, I have been very busy recently. Honestly another pair of eyes here would help, but I am looking into this again now.

@AlextheYounga AlextheYounga force-pushed the develop branch 2 times, most recently from 5dd8301 to be96173 on January 29, 2025 07:20
The function process_tag was previously concatenating strings
inside a loop. Each + operation creates a new string, resulting
in repeated copying of already accumulated data. By replacing
this with an array we append to, and then joining this array at
the end, we can take this from a quadratic function to a linear
function.
@AlextheYounga
Author

AlextheYounga commented Jan 29, 2025

I think I'm closer, but I don't think the behavior is identical yet.

Using the list-based approach, there may be times when the last chunk is composed entirely of newline characters, and this throws off the newlines calculation completely. I gave up trying to think of a clever way to use the code we already had and instead began writing a new function to deal with these newline chunks (and ChatGPT helped here).

This new function pop_trailing_newlines() resolves the issues we were seeing in the example HTML we were using and passes the newline test I created, but I ran both current develop and my branch on the US Internal Revenue Code (34MB) and there is a discrepancy of about 100 lines (although that's not even a drop in the bucket for the IRC). We're also sacrificing a little bit of our performance here: down from a 4.6x boost to a 2.2x boost after this change. :(

There may also be ways to refactor this function. Not a trivial problem, for me at least.

    def pop_trailing_newlines(self, text_parts):
        newlines_total = 0
        # 1) First pop off any chunks that are newlines
        newline_chunks = []
        while text_parts and text_parts[-1].rstrip('\n') == '':
            newline_chunks.append(text_parts.pop())
        if newline_chunks:
            # Example: if we had ["\n", "\n", "\n"] at the very end
            all_newlines = ''.join(reversed(newline_chunks))
            newlines_total += len(all_newlines)

        # 2) Now look at one more chunk which might have some real text + trailing newlines
        text_without_newline = ''
        if text_parts:
            last_chunk = text_parts.pop()
            stripped = last_chunk.rstrip('\n')
            newlines_total += (len(last_chunk) - len(stripped))
            text_without_newline = stripped

        return (text_without_newline, newlines_total)
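
For context, roughly how this helper would slot into the block-element branch of the loop in place of the old rstrip logic (a paraphrase with approximate argument names, not the exact branch code):

    text_strip, newlines_left = self.pop_trailing_newlines(text_parts)
    next_text = self.process_tag(el, convert_children_as_inline)
    next_text_strip = next_text.lstrip('\n')
    newlines_right = len(next_text) - len(next_text_strip)
    # keep the larger of the two newline runs between the chunks
    newlines = '\n' * max(newlines_left, newlines_right)
    text_parts.extend([text_strip, newlines, next_text_strip])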

@chrispy-snps
Collaborator

@AlextheYounga - where can I find the test HTML document you are using?

@chrispy-snps
Collaborator

@AlextheYounga - I have some code that retains the 4x performance benefit and also partially fixes #185. However, it's based on some of my other recent pull requests so I need to wait for those to land.

I don't want to open a separate pull request for my code (you deserve the credit for this approach!), so I suggest that (1) we wait for the pending open pull requests to be merged, (2) I open a pull request on your branch so we can review and sync up, (3) we merge it into your branch, (4) your branch is reviewed and merged into the project.

What do you think?

@AlextheYounga
Author

I would love to see that code, and I greatly appreciate your help and understanding here. Yes, let's sync up after the pending pull requests are merged, and I will keep thinking of other approaches.

The big document I was testing can be downloaded here:
https://uscode.house.gov/download/releasepoints/us/pl/118/250not159/htm_usc26@118-250not159.zip

(I was originally using this library to convert the US Federal Code into markdown).

@chrispy-snps
Collaborator

chrispy-snps commented Feb 5, 2025

@AlextheYounga - for now, I posted my implementation in pull request #186.

I'm not sure I understand these results, but here is what I get with the 35MB test document you linked:

develop - 4953 seconds
#162 (AlextheYounga/develop) - 1058 seconds
#186 (chrispy/use-list-processing) - 24 seconds

Note that my variant is not always faster. For HTML files that are hierarchically deep instead of broad, my <pre> fix using find_parent() in the main processing loop introduces some slowdown, because it is called for every tag in the document, and find_parent() searches the ancestry up to the root every time. But, I have an idea to improve this via context propagation in a future pull request.
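
The context-propagation idea could look roughly like this (a standalone sketch of the concept, not code from #186): pass an in_pre flag down the recursion instead of calling find_parent('pre') for every tag.

    from bs4 import BeautifulSoup, NavigableString

    def walk(node, in_pre=False):
        # Propagate "am I inside a <pre>?" down the recursion instead of
        # walking the ancestry with find_parent() for every tag.
        in_pre = in_pre or node.name == 'pre'
        for el in node.children:
            if isinstance(el, NavigableString):
                yield (str(el), in_pre)
            else:
                yield from walk(el, in_pre=in_pre)

    soup = BeautifulSoup('<div><pre>a  b</pre><p>c</p></div>', 'html.parser')
    for text, in_pre in walk(soup.div):
        print(repr(text), in_pre)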

Also, I still see extra newlines in my branch's output, where I was expecting that they would all be collapsed to no more than two newlines in a row (due to the min() I perform when collapsing newlines). I need to look into this.

@AlextheYounga
Author

AlextheYounga commented Feb 17, 2025

Holy SMOKES @chrispy-snps . I was just able to replicate your finding here, but 17 seconds for me! You've gone to plaid.

Now, I noticed that the test constraints changed here; I ran your updates against the tests I was using, and they failed test_basic.py as well as the newlines test I made based on the test data you provided. I also noticed a discrepancy of about 25k newlines on our big 35MB file.

But honestly, who cares. This speed boost is a massive game changer. It still looks like Markdown to me 😄

I say send it if you ask me. I think everyone would appreciate the massive speed boost rather than nitpick over newlines. I should say I am being slightly selfish here, because there are other projects I'm working on where a massive speed boost in this package would be VERY helpful.

@chrispy-snps
Collaborator

@AlextheYounga - I merged #186, so this issue can be closed. THANK YOU for noticing this problem and proposing this approach!

In #186, I don't get any test failures. If you can share your local newlines tests, I'd like to have a look at that too.

The extra newlines are a consequence of #184; see more details here.

Also, if you still want to add .python-version to the .gitignore, feel free to open a pull request!

@chrispy-snps
Collaborator

Implemented by #186.
