Skip to content

Conversation

@nicolo-ribaudo
Copy link
Contributor

@nicolo-ribaudo nicolo-ribaudo commented Oct 21, 2025

This patch aligns the way jsoup handles too deep DOM trees with what browsers do. Note that all browsers have different behaviors, so I picked what seemed to be most reasonable/implementable.

jsoup Chrome Firefox Safari
Maximum depth
whatwg/html#3732 (comment)
512 513 513 512
Implicit elements inside <table>
allow going past the max depth
whatwg/html#3732 (comment)
No Yes No Yes
A closing tag after an opening tag
that is auto-closed due to the max
depth will close the previous matching
element on the stack
whatwg/html#3732 (comment)
Yes No No Yes

Ref #2416 (let's keep that issue open to track HTML spec changes related to this?)

I actually think that the last line would be more intuitive as "Yes", but that would cause the same perf issues that were fixed by MaxScopeSearchDepth (since we'd need an unlimited stack size).

@nicolo-ribaudo
Copy link
Contributor Author

About the CI failure: should I keep MaxScopeSearchDepth, even if it's now unused?

@jhy jhy linked an issue Oct 22, 2025 that may be closed by this pull request
@jhy
Copy link
Owner

jhy commented Oct 22, 2025

Thanks for this. First pass responses:

About the CI failure: should I keep MaxScopeSearchDepth, even if it's now unused?

Yes for now we should keep it in and mark it @Deprecated so that we can remove. I doubt it's used by anyone but that gives us a path.

I actually think that the last line would be more intuitive as "Yes", but that would cause the same perf issues that were fixed by MaxScopeSearchDepth (since we'd need an unlimited stack size).

Could you clarify that -- you do have it marked as "yes" ? Which do you think is best, and is there an impact if that is different from Chrome as predominant browser? We are trying to align to that in lieu of a followed spec, right? And I don't follow the point on requiring an unlimited stack to implement that -- aren't we just walking up the limited set?

@nicolo-ribaudo
Copy link
Contributor Author

Could you clarify that -- you do have it marked as "yes" ? Which do you think is best, and is there an impact if that is different from Chrome as predominant browser? We are trying to align to that in lieu of a followed spec, right? And I don't follow the point on requiring an unlimited stack to implement that -- aren't we just walking up the limited set?

Chrome/Firefox do not actually have a stack size limit, and that's how they are able to match opening/closing element at any depth. The have a limit on the dom tree depth. The can to see 10'000 alternating open <div>s and <span>s, 9'999 closing ones, and be able to tell that only the outermost one was left open.

I personally think that the Chrome/Firefox behavior is better than Safari, but Safari has 15%-20% market share so it's probably reasonable to expect that pages work there, and thus their HTML is something that behaves the same across Chrome and Safari.

Having an unlimited stack size in jsoup was not desirable for perfomance reasons, because it quadratically iterates through that stack (#955).

If you want, I can try to have an unlimited stack size while still avoiding the quadratic behavior, however it comes at an extra memory cost. We'd need to have, next to the stack, a map of (element name)->(stack of indexes that that element has in the stack), so that we could query "where does this element appear in the stack" in O(1) time.

int d = 0;
while ((el = el.parent()) != null) {
d++;
} while (el != null);

Check warning

Code scanning / CodeQL

Constant loop condition Warning test

Loop
might not terminate, as this loop condition is constant within the loop.
@nicolo-ribaudo
Copy link
Contributor Author

If you want, I can try to have an unlimited stack size while still avoiding the quadratic behavior, however it comes at an extra memory cost. We'd need to have, next to the stack, a map of (element name)->(stack of indexes that that element has in the stack), so that we could query "where does this element appear in the stack" in O(1) time.

EDIT: This is much more complex than just having that map, because of all the searches we need to do in the stack "check if there is any element of type x/y/z but stop when you find a/b/c".

@nicolo-ribaudo
Copy link
Contributor Author

You can see for example Firefox's implementation, which is very similar to Chrome's but it's written in Java so it's easy to compare:

@jhy
Copy link
Owner

jhy commented Nov 13, 2025

I'm confused as to those HTML Validator source links. That's not the Firefox parser and I don't see that we can use it as a reference for anything.

Firefox code is in this repo, per this doc.

@nicolo-ribaudo
Copy link
Contributor Author

That C++ is generated by transpiling the Java code: https://searchfox.org/firefox-main/source/parser/html/java/README.txt

Also, made it work for the XML parser
@jhy
Copy link
Owner

jhy commented Nov 13, 2025

But the TreeBuilder hasn't changed in seven years? How can that be?

jhy added 2 commits November 13, 2025 16:18
So that various xml builder constructors get the unlimited setting.
@jhy jhy added this to the 1.22.1 milestone Nov 13, 2025
@jhy jhy added improvement An improvement / new feature idea fixed An {bug|improvement} that has been {fixed|implemented} labels Nov 13, 2025
@jhy
Copy link
Owner

jhy commented Nov 13, 2025

I have added some changes so that

  • the XML parser also can apply a limit. By default it's unlimited / max integer
  • when we pop to limit, we also clean up the other tree builder state (form, head elements, template, format els)

@jhy jhy merged commit 2cc74b6 into jhy:master Nov 13, 2025
12 checks passed
@nicolo-ribaudo
Copy link
Contributor Author

But the TreeBuilder hasn't changed in seven years? How can that be?

For some reason I linked to the commit where the changes to the depth handling were first introduced, but the last commit is from two months ago.

@jhy
Copy link
Owner

jhy commented Nov 13, 2025

Ah sorry, I completely missed that I wasn't viewing on head there. Thanks again for the report detail and the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fixed An {bug|improvement} that has been {fixed|implemented} improvement An improvement / new feature idea

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Align handling of too deep documents with browsers

2 participants