Align maximum HTML depth handling with browsers #2421
|
About the CI failure: should I keep |
|
Thanks for this. First pass responses:
Yes for now we should keep it in and mark it
Could you clarify that -- you do have it marked as "yes"? Which do you think is best, and is there an impact if that differs from Chrome, as the predominant browser? We are trying to align to that in lieu of a followed spec, right? And I don't follow the point on requiring an unlimited stack to implement that -- aren't we just walking up the limited set? |
Chrome/Firefox do not actually have a stack size limit, and that's how they are able to match opening/closing elements at any depth. They have a limit on the DOM tree depth. The can to see 10'000 alternating open

I personally think that the Chrome/Firefox behavior is better than Safari's, but Safari has 15%-20% market share, so it's probably reasonable to expect that pages work there, and thus that their HTML behaves the same across Chrome and Safari.

Having an unlimited stack size in jsoup was not desirable for performance reasons, because jsoup iterates through that stack quadratically (#955). If you want, I can try to have an unlimited stack size while still avoiding the quadratic behavior, but it comes at an extra memory cost. We'd need to have, next to the stack, a map of (element name)->(stack of indexes at which that element appears in the stack), so that we could query "where does this element appear in the stack" in O(1) time. |
EDIT: This is much more complex than just having that map, because of all the searches we need to do in the stack of the form "check if there is any element of type x/y/z but stop when you find a/b/c". |
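To make the map idea above concrete, here is a minimal sketch of a stack that also keeps, per element name, the indexes at which that name is currently open. The class `IndexedElementStack` and all of its methods are hypothetical names made up for illustration; this is not jsoup's TreeBuilder stack, and elements are reduced to plain name strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of jsoup.
final class IndexedElementStack {
    // Open element names, bottom of the stack first.
    private final List<String> stack = new ArrayList<>();
    // Element name -> indexes in `stack` where it is open, most recent on top.
    private final Map<String, Deque<Integer>> positions = new HashMap<>();

    void push(String name) {
        positions.computeIfAbsent(name, k -> new ArrayDeque<>()).push(stack.size());
        stack.add(name);
    }

    String pop() {
        String name = stack.remove(stack.size() - 1);
        Deque<Integer> indexes = positions.get(name);
        indexes.pop();
        if (indexes.isEmpty()) positions.remove(name); // keep the map free of empty entries
        return name;
    }

    // Index of the topmost open element with this name, or -1 if none is open.
    // A hash lookup plus a peek, so expected O(1) regardless of stack depth.
    int lastIndexOf(String name) {
        Deque<Integer> indexes = positions.get(name);
        return indexes == null ? -1 : indexes.peek();
    }

    // Pops every element at or above `index`.
    void popTo(int index) {
        while (stack.size() > index) pop();
    }

    int size() {
        return stack.size();
    }

    public static void main(String[] args) {
        IndexedElementStack s = new IndexedElementStack();
        s.push("html"); s.push("a"); s.push("b"); s.push("a");
        System.out.println(s.lastIndexOf("a")); // prints 3
        s.popTo(2);
        System.out.println(s.lastIndexOf("a")); // prints 1
    }
}
```

The extra memory cost is one boxed index per open element. As the EDIT notes, this only answers "where is the nearest open `x`"; searches that must stop at a boundary element would still need additional bookkeeping that this sketch does not attempt.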
|
You can see for example Firefox's implementation, which is very similar to Chrome's but it's written in Java so it's easy to compare:
|
|
I'm confused as to those HTML Validator source links. That's not the Firefox parser and I don't see that we can use it as a reference for anything. |
|
That C++ is generated by transpiling the Java code: https://searchfox.org/firefox-main/source/parser/html/java/README.txt |
Also, made it work for the XML parser
|
But the TreeBuilder hasn't changed in seven years? How can that be? |
So that various xml builder constructors get the unlimited setting.
|
I have added some changes so that
|
For some reason I linked to the commit where the changes to the depth handling were first introduced, but the last commit is from two months ago. |
|
Ah sorry, I completely missed that I wasn't viewing on head there. Thanks again for the report detail and the PR. |
This patch aligns the way jsoup handles overly deep DOM trees with what browsers do. Note that each browser behaves differently, so I picked what seemed to be most reasonable/implementable.
whatwg/html#3732 (comment)
<table> allow going past the max depth
whatwg/html#3732 (comment)
that is auto-closed due to the max depth will close the previous matching element on the stack
whatwg/html#3732 (comment)
Ref #2416 (let's keep that issue open to track HTML spec changes related to this?)
I actually think that the last line would be more intuitive as "Yes", but that would cause the same perf issues that were fixed by MaxScopeSearchDepth (since we'd need an unlimited stack size).
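The trade-off described above can be sketched as a scope search that walks down from the top of the open-element stack but gives up after a fixed number of entries. The class `BoundedScopeSearch`, the method `inScope`, and the cap of 100 are all assumptions made for illustration; jsoup's actual MaxScopeSearchDepth logic and value may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not jsoup's implementation.
final class BoundedScopeSearch {
    // Assumed cap, chosen for the example.
    static final int MAX_SEARCH_DEPTH = 100;

    // Returns true if `target` is open within the top MAX_SEARCH_DEPTH entries
    // of `stack` (bottom first) and no boundary element sits above it.
    static boolean inScope(List<String> stack, String target, Set<String> boundaries) {
        int top = stack.size() - 1;
        int lowest = Math.max(0, top - MAX_SEARCH_DEPTH + 1);
        for (int i = top; i >= lowest; i--) {
            String name = stack.get(i);
            if (name.equals(target)) return true;        // found the matching open element
            if (boundaries.contains(name)) return false; // scope boundary reached first
        }
        return false; // not found within the cap: treated as not in scope
    }

    public static void main(String[] args) {
        Set<String> boundaries = new HashSet<>(Arrays.asList("table"));
        System.out.println(inScope(Arrays.asList("html", "a", "b"), "a", boundaries)); // prints true
    }
}
```

Because each search inspects at most `MAX_SEARCH_DEPTH` entries, the total work stays linear in the number of tags, at the price of not matching an opening element that sits deeper than the cap.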