-
Notifications
You must be signed in to change notification settings - Fork 167
use list-based processing (inspired by AlextheYounga) #186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use list-based processing (inspired by AlextheYounga) #186
Conversation
@AlextheYounga - what are your thoughts on this approach? |
Signed-off-by: Chris Papademetrious <chrispy@synopsys.com> Signed-off-by: chrispy <chrispy@synopsys.com>
1b4750c
to
5b79b92
Compare
Signed-off-by: chrispy <chrispy@synopsys.com>
Looking at this now. Once again, sorry for the delay |
I am running a test right now just to say I did due diligence but this code looks excellent. There is now far more control over newlines, like you were saying in our previous conversation. |
Just going to expand on some of the comments I made in #162 17 seconds for a 35MB file is crazy fast. I personally love it and would prefer prioritizing the speed over newlines discrepancies, but that is my opinion. I also think these changes make the code far more readable, as well as allows us far more control on how we approach newlines. But I should note that this seems to have failed our previous tests. Maybe these tests are outdated, but I just want to note them here in case there is something important here I don't fully understand. You know more than me here. Here are the errors I get on the tests I was running. The test_newlines test was a test I wrote just to test the data you had provided me previously. And this change does add substantially more newlines. On my 35MB test, I saw 25k more newlines than our previous test. But it's also 35x faster than my change, and 80x faster than the original! I say send it! |
I can also see what you mean by the difference here on how nested certain items are. This particular document we are testing is kind of odd because the entire document is flat. I will see if I can find a very large nested document to test this with. |
@AlextheYounga - the additional newlines are an unanticipated side effect to previous pull request #184 ("make conversion non-destructive..."). The IRS document contains HTML comments intermixed with block elements: <p>content</p>
<p>content</p>
<!--COMMENT-->
<p>content</p>
<!--COMMENT-->
<!--COMMENT--> Because the code no longer deletes elements, the whitespace-rejection code in I will try to fix this in a follow-up pull request, but let's get some of the runtime improvements merged in first so we can properly assess any runtime change for this comment/whitespace fix. |
This pull request does the following:
<pre>
blocks are collapsed instead of preserved #185 by not collapsing some newlines inside<pre>
elementsConsider the following children elements of a
<div>
:As we process each child, we use a regex to split its conversion function result into (1) leading newlines, (2) content, and (3) trailing newlines:
Before we accumulate the current child onto our running list, we collapse the trailing newlines of the last child with the leading newlines of the next child (similar to how CSS margins work):
The child is then accumulated onto the running list with its collapsed newlines:
Currently, inline whitespace collapsing is implemented ad-hoc across multiple element-specific conversion functions. In the future, perhaps this inline whitespace collapsing could be deferred to a similar code path in
process_element()
. This would solve a lot of corner cases with nested inline elements that accumulate uncollapsed spaces at their boundaries.