perf: improve parsing performance from O(n^2) to O(n) #162


Closed · wants to merge 3 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -10,3 +10,4 @@
 build/
 .vscode/settings.json
 .tox/
+.python-version
94 changes: 73 additions & 21 deletions markdownify/__init__.py
@@ -133,7 +133,7 @@ def convert_soup(self, soup):
         return self.process_tag(soup, convert_as_inline=False)
 
     def process_tag(self, node, convert_as_inline):
-        text = ''
+        text_parts = []
 
         # markdown headings or cells can't include
         # block elements (elements w/newlines)
@@ -147,43 +147,95 @@ def process_tag(self, node, convert_as_inline):
         # Remove whitespace-only textnodes just before, after or
         # inside block-level elements.
         should_remove_inside = should_remove_whitespace_inside(node)
-        for el in node.children:
-            # Only extract (remove) whitespace-only text node if any of the
-            # conditions is true:
-            # - el is the first element in its parent (block-level)
-            # - el is the last element in its parent (block-level)
-            # - el is adjacent to a block-level node
-            can_extract = (should_remove_inside and (not el.previous_sibling
-                                                     or not el.next_sibling)
-                           or should_remove_whitespace_outside(el.previous_sibling)
-                           or should_remove_whitespace_outside(el.next_sibling))
-            if (isinstance(el, NavigableString)
-                    and six.text_type(el).strip() == ''
-                    and can_extract):
+        children = list(node.children)
+        for i, el in enumerate(children):
+            # Quick type check first to avoid unnecessary function calls
+            if not isinstance(el, NavigableString):
+                continue
+
+            # Check if the text is entirely whitespace first
+            text = six.text_type(el)
+            if text.strip():
+                continue
+
+            # If first or last element
+            is_at_extreme_position = (
+                should_remove_inside and (i == 0 or i == len(children) - 1)
+            )
+            # True if there is a preceding sibling, and it should have whitespace removed.
+            has_removal_candidate_to_left = (
+                i > 0 and should_remove_whitespace_outside(children[i - 1])
+            )
+            # True if there is a following sibling, and it should have whitespace removed.
+            has_removal_candidate_to_right = (
+                i < len(children) - 1 and should_remove_whitespace_outside(children[i + 1])
+            )
+            # Determine if we can extract based on position and adjacency
+            can_extract = (
+                is_at_extreme_position
+                or has_removal_candidate_to_left
+                or has_removal_candidate_to_right
+            )
+
+            # Extract if conditions are met
+            if can_extract:
                 el.extract()
 
         # Convert the children first
         for el in node.children:
             if isinstance(el, Comment) or isinstance(el, Doctype):
                 continue
             elif isinstance(el, NavigableString):
-                text += self.process_text(el)
+                text_parts.append(self.process_text(el))
             else:
@AlextheYounga (Author) commented on Jan 8, 2025:

Just an update here that I am still working on this. I have added a test and improved the readability of the code to better identify the problem.

I tried to copy over as much as I could from the original code, but I think some of it doesn't translate well to this new loop paradigm. I am positive that the value of newlines_left has changed.

In the original function we count the trailing line breaks on the full accumulated text and use that count as newlines_left:
https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60

Original:

else:
    text_strip = text.rstrip('\n')
    newlines_left = len(text) - len(text_strip)

However, this new structure tries to break away from concatenating a string on each loop iteration, which was the source of the quadratic behavior. That comes with its own tradeoffs: it is no longer straightforward to access the full accumulated text. So now we only count the newlines in the last "chunk" of the list, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.

New:

else:
    text_strip = ''
    newlines_left = 0
    if text_parts:
        last_chunk = text_parts.pop()
        text_strip = last_chunk.rstrip('\n')
        newlines_left = len(last_chunk) - len(text_strip)
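The discrepancy described above can be seen in a small standalone sketch (the chunk values below are hypothetical, not taken from an actual conversion):

```python
# Hypothetical chunks: "body" followed by two newline-only chunks.
# The old approach counts trailing newlines on the full accumulated text:
text = 'body' + '\n' + '\n'
newlines_left = len(text) - len(text.rstrip('\n'))
assert newlines_left == 2

# The list-based approach pops only the last chunk, so newlines that are
# split across several chunks are undercounted:
text_parts = ['body', '\n', '\n']
last_chunk = text_parts.pop()
newlines_left = len(last_chunk) - len(last_chunk.rstrip('\n'))
assert newlines_left == 1  # one newline missed; it is still in text_parts
```

The pop_trailing_newlines helper later in the diff works around exactly this by first popping off newline-only chunks before inspecting the last mixed chunk.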

@chrispy-snps (Collaborator) commented on Jan 8, 2025:

@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item are fixed and complete, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does convert_children_as_inline determine this?)

For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next item: trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
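As a rough sketch of that idea (the function name and the 2-newline limit are assumptions for illustration, not markdownify API):

```python
import re

def normalize_boundary_newlines(parts, limit=2):
    """Join the converted chunks, then cap any run of newlines at `limit`.

    Sketch of the post-processing idea above. Doing the collapse once,
    after a single join, keeps the whole pass linear in the output size,
    even when a newline run spans several chunks.
    """
    text = ''.join(parts)
    return re.sub(r'\n{%d,}' % (limit + 1), '\n' * limit, text)

# A run of five newlines spanning a chunk boundary is limited to two:
result = normalize_boundary_newlines(['Heading\n\n\n', '\n\nbody'])
assert result == 'Heading\n\nbody'
```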

@chrispy-snps (Collaborator) commented on Jan 26, 2025:

@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.

Edit: I merged the upstream develop branch into your pull request branch and resolved the conflicts.

@AlextheYounga (Author) commented:

Apologies, I have been very busy recently. Honestly, another pair of eyes here would help; I am looking into this again now.

-                text_strip = text.rstrip('\n')
-                newlines_left = len(text) - len(text_strip)
+                # 1) Pop trailing newlines from whatever's in text_parts so far
+                text_strip, newlines_left = self.pop_trailing_newlines(text_parts)
+                # 2) Convert the next tag
                 next_text = self.process_tag(el, convert_children_as_inline)
+                # 3) Figure out how many leading newlines in next_text
                 next_text_strip = next_text.lstrip('\n')
                 newlines_right = len(next_text) - len(next_text_strip)
+                # 4) Calculate how many newlines to insert between text_strip and next_text_strip
                 newlines = '\n' * max(newlines_left, newlines_right)
-                text = text_strip + newlines + next_text_strip
+                text_parts.extend([text_strip, newlines, next_text_strip])
 
         # apply this tag's final conversion function
         convert_fn_name = "convert_%s" % re.sub(r"[\[\]:-]", "_", node.name)
         convert_fn = getattr(self, convert_fn_name, None)
         if convert_fn and self.should_convert_tag(node.name):
-            text = convert_fn(node, text, convert_as_inline)
-
-        return text
+            # Join the text parts before passing to convert_fn
+            text_parts_str = ''.join(text_parts)
+            text_parts = [convert_fn(node, text_parts_str, convert_as_inline)]
+
+        return ''.join(text_parts)
+
+    def pop_trailing_newlines(self, text_parts):
+        """
+        Pops trailing newline text from the end of `text_parts`. I suspect this
+        function will get refactored with time.
+        Returns: (combined_text_without_trailing_newlines, count_of_newlines_popped)
+        If there is no trailing text in `text_parts`, returns ("", 0).
+        """
+        newlines_total = 0
+        # 1) First pop off any chunks that consist only of newlines
+        newline_chunks = []
+        while text_parts and text_parts[-1].rstrip('\n') == '':
+            newline_chunks.append(text_parts.pop())
+        if newline_chunks:
+            # Example: if we had ["\n", "\n", "\n"] at the very end
+            all_newlines = ''.join(reversed(newline_chunks))
+            newlines_total += len(all_newlines)
+
+        # 2) Now look at one more chunk which might have some real text + trailing newlines
+        text_without_newline = ''
+        if text_parts:
+            last_chunk = text_parts.pop()
+            stripped = last_chunk.rstrip('\n')
+            newlines_total += (len(last_chunk) - len(stripped))
+            text_without_newline = stripped
+
+        return (text_without_newline, newlines_total)
 
     def convert__document_(self, el, text, convert_as_inline):
         """Final document-level formatting for BeautifulSoup object (node.name == "[document]")"""
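For reference, the pop_trailing_newlines logic can be exercised standalone; the sketch below is a module-level copy of the method in the diff, for illustration only:

```python
def pop_trailing_newlines(text_parts):
    """Module-level copy of the helper above; mutates text_parts in place."""
    newlines_total = 0
    # first pop off any chunks that consist only of newlines
    newline_chunks = []
    while text_parts and text_parts[-1].rstrip('\n') == '':
        newline_chunks.append(text_parts.pop())
    newlines_total += sum(len(c) for c in newline_chunks)
    # the next chunk may mix real text with trailing newlines
    text_without_newline = ''
    if text_parts:
        last_chunk = text_parts.pop()
        stripped = last_chunk.rstrip('\n')
        newlines_total += len(last_chunk) - len(stripped)
        text_without_newline = stripped
    return (text_without_newline, newlines_total)

# trailing newlines split across several chunks are all accounted for:
parts = ['body\n', '\n', '\n']
assert pop_trailing_newlines(parts) == ('body', 3)
assert parts == []
```

This resolves the undercounting concern raised in the first review comment: newline-only chunks are drained before the last mixed chunk is inspected.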
39 changes: 39 additions & 0 deletions tests/test_newlines.py
@@ -0,0 +1,39 @@
from markdownify import markdownify as md


html = """
<article>
<h1>Heading 1</h1>
<div>
<p>article body</p>
</div>
<article>
<h2>Heading 2</h2>
<div>
<p>article body</p>
</div>
</article>
<p>footnote</p>
</article>
"""

# Leaving this here just for reference.
correct_conversion = """Heading 1
=========

article body


Heading 2
---------

article body


footnote"""


def test_newlines():
converted = md(html)
newlines = converted.count('\n')
assert newlines == 12