perf: improve parsing performance from o(n^2) to o(n) #162
Closed
3 commits
@@ -10,3 +10,4 @@
build/
.vscode/settings.json
.tox/
.python-version
@@ -0,0 +1,39 @@
from markdownify import markdownify as md


html = """
<article>
<h1>Heading 1</h1>
<div>
<p>article body</p>
</div>
<article>
<h2>Heading 2</h2>
<div>
<p>article body</p>
</div>
</article>
<p>footnote</p>
</article>
"""

# Leaving this here just for reference.
correct_conversion = """Heading 1
=========

article body


Heading 2
---------

article body


footnote"""


def test_newlines():
    converted = md(html)
    newlines = converted.count('\n')
    assert newlines == 12
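The expected count of 12 in the test can be checked by hand against the `correct_conversion` reference string. The sketch below spells that string out one line at a time; the name `expected` is only illustrative, and the blank-line layout is an assumption based on how the diff above renders.

```python
# Reference output written one physical line per literal so that each
# newline character is visible. The layout is assumed from the diff above.
expected = (
    "Heading 1\n"
    "=========\n"
    "\n"
    "article body\n"
    "\n"
    "\n"
    "Heading 2\n"
    "---------\n"
    "\n"
    "article body\n"
    "\n"
    "\n"
    "footnote"
)

# Every line except the last ends in a newline character.
newline_count = expected.count('\n')
```

Under that assumption the string holds 13 lines, so counting newlines is a compact way to check block spacing without comparing the full text.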
Just an update here that I am still working on this. I have added a test and improved the readability of the code so as to better identify the problem.
I tried to copy over as much as I could from the original code, but I think some of that code doesn't translate well to this new loop paradigm. I am positive that the value in `newlines_left` has changed.
In the original function we grab the number of line breaks from the full text and use this to create `newlines_left`: https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60
Original:
However, in this new structure we are trying to break away from concatenating a string in each loop iteration, which is the source of the quadratic behavior, but that comes with its own tradeoffs: it is no longer straightforward to access the full string text. So now we are only counting the number of newlines in each "chunk" in the array, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.
New:
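The tradeoff described in this comment can be illustrated with a small, self-contained sketch. This is not the code from the pull request or from markdownify; the function names and the sample `chunks` are hypothetical, chosen only to show why a newline count taken from one chunk can differ from a count taken from the full text built so far.

```python
def join_by_concatenation(chunks):
    """Grow one string piece by piece; each += may copy everything so far."""
    text = ''
    for chunk in chunks:
        text += chunk
    return text


def join_by_list(chunks):
    """Collect the pieces in a list and join them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return ''.join(parts)


def trailing_newlines(s):
    """Count the newline characters at the very end of s."""
    return len(s) - len(s.rstrip('\n'))


# Hypothetical converted pieces, in document order.
chunks = ['Heading\n\n', 'body\n', '\n', 'footnote']

# Both strategies produce the same final text.
same_result = join_by_concatenation(chunks) == join_by_list(chunks)

# Counting on the accumulated text sees newlines that span several chunks...
text_so_far = ''.join(chunks[:3])
full_count = trailing_newlines(text_so_far)

# ...while counting on the latest chunk alone sees only its own newlines.
chunk_count = trailing_newlines(chunks[2])
```

Here `full_count` and `chunk_count` differ because the run of newlines at the end of the accumulated text crosses a chunk boundary, which is the situation the comment describes.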
@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item are fixed and complete, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does `convert_children_as_newline` determine this?)
For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next item. Trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
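One way to read the post-processing idea above is sketched below. It is only an illustration under stated assumptions: `split_newlines` and `join_chunks` are hypothetical names, not markdownify functions, and the rule applied is the one suggested in the comment (cap each combined trailing+leading newline run at 2), which may differ from what the library's original loop did.

```python
def split_newlines(chunk):
    """Split a chunk into (leading newline count, content, trailing newline count)."""
    content = chunk.strip('\n')
    if not content:
        # A chunk made only of newlines: count them all as leading.
        return len(chunk), '', 0
    leading = len(chunk) - len(chunk.lstrip('\n'))
    trailing = len(chunk) - len(chunk.rstrip('\n'))
    return leading, content, trailing


def join_chunks(chunks, max_newlines=2):
    """Join chunks, capping each newline run between contents at max_newlines."""
    parts = []
    pending = 0  # newlines seen since the last piece of real content
    for chunk in chunks:
        leading, content, trailing = split_newlines(chunk)
        pending += leading
        if content:
            parts.append('\n' * min(pending, max_newlines))
            parts.append(content)
            pending = trailing
    # Newlines left over after the final content are capped as well.
    parts.append('\n' * min(pending, max_newlines))
    return ''.join(parts)


# Hypothetical converted pieces with excess newlines at their boundaries.
sample = ['Heading 1\n=========\n\n', '\n\narticle body\n\n', '\n', 'footnote']
result = join_chunks(sample)
```

Each chunk is inspected once and the pieces are joined once at the end, so the work stays proportional to the total text length. Note that this sketch caps a run at the very start of the output too; whether that is desirable depends on how the caller treats leading whitespace.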
@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.
Edit: I merged the upstream `develop` branch into your pull request branch and resolved the conflicts.
Apologies, I have been very busy recently. Honestly, another pair of eyes here would help, but I am looking into this again now.