Align maximum HTML depth handling with browsers #2421

nicolo-ribaudo · 2025-10-21T17:14:08Z

This patch aligns the way jsoup handles too deep DOM trees with what browsers do. Note that all browsers have different behaviors, so I picked what seemed to be most reasonable/implementable.

| | jsoup | Chrome | Firefox | Safari |
|---|---|---|---|---|
| Maximum depth (whatwg/html#3732 (comment)) | 512 | 513 | 513 | 512 |
| Implicit elements inside `<table>` allow going past the max depth (whatwg/html#3732 (comment)) | No | Yes | No | Yes |
| A closing tag after an opening tag that is auto-closed due to the max depth will close the previous matching element on the stack (whatwg/html#3732 (comment)) | Yes | No | No | Yes |
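
To make the last row of the comparison concrete, here is a minimal sketch of the "Yes" behavior, assuming a toy parser whose open-element stack is itself capped. The class and method names are hypothetical and this is not jsoup's actual tree builder; it only tracks tag names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not jsoup code: the stack of open elements is capped at maxDepth.
final class BoundedStackSketch {
    private final int maxDepth;
    private final List<String> stack = new ArrayList<>();

    BoundedStackSketch(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** Opens an element. At the depth limit the element is treated as auto-closed and is not kept on the stack. */
    void open(String name) {
        if (stack.size() < maxDepth) {
            stack.add(name);
        }
        // else: the element would be inserted and closed immediately; nothing is remembered about it
    }

    /** Closes the nearest open element with this name, popping everything above it. */
    void close(String name) {
        int i = stack.lastIndexOf(name);
        if (i >= 0) {
            stack.subList(i, stack.size()).clear();
        }
    }

    int depth() {
        return stack.size();
    }

    public static void main(String[] args) {
        BoundedStackSketch parser = new BoundedStackSketch(3);
        for (int i = 0; i < 4; i++) parser.open("div"); // the fourth <div> is auto-closed
        parser.close("div");                            // closes the third <div> instead
        System.out.println(parser.depth());
    }
}
```

With a limit of 3, four `<div>`s followed by one `</div>` leave two elements open, because the closing tag pairs with the last `<div>` that is still on the stack. A parser that keeps every opened element on its stack would pair it with the fourth `<div>` and leave three open.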

Ref #2416 (let's keep that issue open to track HTML spec changes related to this?)

I actually think that the last line would be more intuitive as "Yes", but that would cause the same perf issues that were fixed by MaxScopeSearchDepth (since we'd need an unlimited stack size).

nicolo-ribaudo · 2025-10-21T17:20:55Z

About the CI failure: should I keep MaxScopeSearchDepth, even if it's now unused?

jhy · 2025-10-22T06:31:09Z

Thanks for this. First pass responses:

> About the CI failure: should I keep MaxScopeSearchDepth, even if it's now unused?

Yes, for now we should keep it in and mark it @Deprecated so that we can remove it later. I doubt it's used by anyone, but that gives us a path.
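
A minimal sketch of what keeping the constant but deprecating it could look like. The holder class, the Javadoc wording, and the value are placeholders for illustration, not the actual jsoup source.

```java
import java.lang.reflect.Field;

final class DeprecatedConstantSketch {
    /** Hypothetical stand-in for the tree builder class that owns the constant. */
    static final class TreeBuilderSketch {
        /**
         * @deprecated no longer consulted by the parser; kept so existing callers still compile,
         * and intended for removal in a later release.
         */
        @Deprecated
        public static final int MaxScopeSearchDepth = 100; // placeholder value

        private TreeBuilderSketch() {}
    }

    /** Returns true if the named field on the stand-in class carries @Deprecated. */
    static boolean isDeprecated(String fieldName) {
        try {
            Field field = TreeBuilderSketch.class.getDeclaredField(fieldName);
            return field.isAnnotationPresent(Deprecated.class);
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isDeprecated("MaxScopeSearchDepth"));
    }
}
```

Callers that still reference the constant keep compiling and get a deprecation warning, which is the removal path described above.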

> I actually think that the last line would be more intuitive as "Yes", but that would cause the same perf issues that were fixed by MaxScopeSearchDepth (since we'd need an unlimited stack size).

Could you clarify that -- you do have it marked as "yes" ? Which do you think is best, and is there an impact if that is different from Chrome as predominant browser? We are trying to align to that in lieu of a followed spec, right? And I don't follow the point on requiring an unlimited stack to implement that -- aren't we just walking up the limited set?

nicolo-ribaudo · 2025-10-27T11:22:32Z

> Could you clarify that -- you do have it marked as "yes" ? Which do you think is best, and is there an impact if that is different from Chrome as predominant browser? We are trying to align to that in lieu of a followed spec, right? And I don't follow the point on requiring an unlimited stack to implement that -- aren't we just walking up the limited set?

Chrome/Firefox do not actually have a stack size limit, and that's how they are able to match opening/closing elements at any depth. They have a limit on the DOM tree depth. They can see 10,000 alternating open `<div>`s and `<span>`s, then 9,999 closing ones, and still tell that only the outermost one was left open.

I personally think that the Chrome/Firefox behavior is better than Safari's, but Safari has 15%-20% market share, so it's probably reasonable to expect that pages are made to work there, and thus that their HTML behaves the same across Chrome and Safari.

Having an unlimited stack size in jsoup was not desirable for performance reasons, because it quadratically iterates through that stack (#955).

If you want, I can try to have an unlimited stack size while still avoiding the quadratic behavior, however it comes at an extra memory cost. We'd need to have, next to the stack, a map of (element name)->(stack of indexes that that element has in the stack), so that we could query "where does this element appear in the stack" in O(1) time.
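
A minimal sketch of the side-map idea, assuming the stack only needs to answer "where is the topmost element with this name". All names are hypothetical; this is not jsoup code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a stack of element names plus a name -> stack-of-indexes map.
final class IndexedStackSketch {
    private final List<String> stack = new ArrayList<>();
    private final Map<String, Deque<Integer>> positions = new HashMap<>();

    /** Pushes a name and records its index in the per-name index stack. */
    void push(String name) {
        positions.computeIfAbsent(name, k -> new ArrayDeque<>()).push(stack.size());
        stack.add(name);
    }

    /** Pops the top name and drops its recorded index. Assumes the stack is not empty. */
    String pop() {
        String name = stack.remove(stack.size() - 1);
        Deque<Integer> indexes = positions.get(name);
        indexes.pop();
        if (indexes.isEmpty()) {
            positions.remove(name);
        }
        return name;
    }

    /** Index of the topmost element with this name, or -1 if none is open. Expected O(1). */
    int lastIndexOf(String name) {
        Deque<Integer> indexes = positions.get(name);
        return indexes == null ? -1 : indexes.peek();
    }

    int size() {
        return stack.size();
    }

    public static void main(String[] args) {
        IndexedStackSketch s = new IndexedStackSketch();
        s.push("div");
        s.push("span");
        s.push("div");
        System.out.println(s.lastIndexOf("div") + " " + s.lastIndexOf("span"));
    }
}
```

The extra memory is one boxed index per open element. The sketch only covers single-name lookups; searches that must stop at boundary elements are not handled by a map like this.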

src/test/java/org/jsoup/parser/HtmlParserTest.java

+            int d = 0;
+            while ((el = el.parent()) != null) {
+                d++;
+            } while (el != null);


nicolo-ribaudo · 2025-10-27T14:11:34Z

> If you want, I can try to have an unlimited stack size while still avoiding the quadratic behavior, however it comes at an extra memory cost. We'd need to have, next to the stack, a map of (element name)->(stack of indexes that that element has in the stack), so that we could query "where does this element appear in the stack" in O(1) time.

EDIT: This is much more complex than just having that map, because of all the searches we need to do in the stack of the form "check if there is any element of type x/y/z but stop when you find a/b/c".

src/main/java/org/jsoup/parser/HtmlTreeBuilder.java

nicolo-ribaudo · 2025-10-27T15:10:58Z

You can see for example Firefox's implementation, which is very similar to Chrome's but it's written in Java so it's easy to compare:

- they have an upper limit when getting the "last" element of the stack to know to which parent insert the new element (https://github.com/validator/htmlparser/blob/b19d4088f1d138715a981e960f0720678e7946f7/src/nu/validator/htmlparser/impl/TreeBuilder.java#L6349)
- they do not have any limit when iterating the stack (see https://github.com/validator/htmlparser/blob/b19d4088f1d138715a981e960f0720678e7946f7/src/nu/validator/htmlparser/impl/TreeBuilder.java#L4168 and similar methods)
- they do not have any limit when pushing elements to the stack (https://github.com/validator/htmlparser/blob/b19d4088f1d138715a981e960f0720678e7946f7/src/nu/validator/htmlparser/impl/TreeBuilder.java#L4556)
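
The three points above can be sketched as a toy model: pushes and scans are unlimited, and only the depth at which a new node is inserted is capped. This is a simplified illustration with hypothetical names, not the validator/htmlparser code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: unbounded stack of open elements, capped DOM insertion depth.
final class CappedInsertionSketch {
    private final int maxDomDepth;
    private final List<String> stack = new ArrayList<>();
    private int deepestInsert = 0;

    CappedInsertionSketch(int maxDomDepth) {
        this.maxDomDepth = maxDomDepth;
    }

    /** Opens an element: the insertion depth is clamped, but the push itself has no limit. */
    void open(String name) {
        int nodeDepth = Math.min(stack.size() + 1, maxDomDepth);
        deepestInsert = Math.max(deepestInsert, nodeDepth);
        stack.add(name);
    }

    /** Closes the nearest open element with this name; the scan covers the whole stack. */
    void close(String name) {
        int i = stack.lastIndexOf(name);
        if (i >= 0) {
            stack.subList(i, stack.size()).clear();
        }
    }

    int openElements() {
        return stack.size();
    }

    int deepestInsert() {
        return deepestInsert;
    }

    public static void main(String[] args) {
        CappedInsertionSketch parser = new CappedInsertionSketch(512);
        for (int i = 0; i < 10_000; i++) parser.open(i % 2 == 0 ? "div" : "span");
        for (int i = 9_999; i >= 1; i--) parser.close(i % 2 == 0 ? "div" : "span");
        System.out.println(parser.openElements() + " open, deepest insert " + parser.deepestInsert());
    }
}
```

Because every opened element stays on the stack, 10,000 alternating opens followed by 9,999 matching closes leave exactly one element open, while no node is ever inserted deeper than the cap.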

jhy · 2025-11-13T04:17:10Z

I'm confused as to those HTML Validator source links. That's not the Firefox parser and I don't see that we can use it as a reference for anything.

Firefox code is in this repo, per this doc.

nicolo-ribaudo · 2025-11-13T04:27:24Z

That C++ is generated by transpiling the Java code: https://searchfox.org/firefox-main/source/parser/html/java/README.txt

jhy · 2025-11-13T05:11:51Z

But the TreeBuilder hasn't changed in seven years? How can that be?

jhy · 2025-11-13T05:21:50Z

I have added some changes so that:

- the XML parser also can apply a limit. By default it's unlimited / max integer
- when we pop to limit, we also clean up the other tree builder state (form, head elements, template, format els)

nicolo-ribaudo · 2025-11-13T06:51:39Z

> But the TreeBuilder hasn't changed in seven years? How can that be?

For some reason I linked to the commit where the changes to the depth handling were first introduced, but the last commit is from two months ago.

jhy · 2025-11-13T09:18:38Z

Ah sorry, I completely missed that I wasn't viewing on head there. Thanks again for the report detail and the PR.

nicolo-ribaudo added 2 commits October 21, 2025 13:48

Remove MaxScopeSearchDepth

cfd381d

Implement HTML depth limit (512) similar to browsers

09c206f

nicolo-ribaudo force-pushed the fix-deep-tree branch from 4b277c2 to 09c206f October 21, 2025 17:16

jhy linked an issue Oct 22, 2025 that may be closed by this pull request

Align handling of too deep documents with browsers #2416

Closed

Add MaxScopeSearchDepth back

5ce7335

github-advanced-security bot found potential problems Oct 27, 2025

src/test/java/org/jsoup/parser/HtmlParserTest.java

            int d = 0;
            while ((el = el.parent()) != null) {
                d++;
            } while (el != null);

Check warning: Code scanning / CodeQL

Constant loop condition (Warning, test)

Loop might not terminate, as this loop condition is constant within the loop.
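
The flagged condition is the trailing `while (el != null);`: it has an empty body, so `el` never changes inside it. A minimal sketch of the same ancestor count without it, using a hypothetical stand-in `Node` class rather than jsoup's `Element` so that it compiles on its own:

```java
final class DepthCountSketch {
    /** Minimal stand-in for a DOM node; only the parent link matters here. */
    static final class Node {
        private final Node parent;

        Node(Node parent) {
            this.parent = parent;
        }

        Node parent() {
            return parent;
        }
    }

    /** Counts the ancestors of el by walking parent links until the root is passed. */
    static int depth(Node el) {
        int d = 0;
        while ((el = el.parent()) != null) {
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        Node child = new Node(root);
        Node grandchild = new Node(child);
        System.out.println(depth(grandchild));
    }
}
```

The single `while` loop already ends with `el == null`, so nothing is lost by dropping the second condition.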

Update expectation

b209d4b

nicolo-ribaudo commented Oct 27, 2025

src/main/java/org/jsoup/parser/HtmlTreeBuilder.java

Apply suggestion from @nicolo-ribaudo

576fe38

Perform required cleanup when popping in HTML

8a5d557

Also, made it work for the XML parser

jhy added 2 commits November 13, 2025 16:18

Set the default max depth via overridable

4bb06bf

So that various xml builder constructors get the unlimited setting.

Merge branch 'master' into fix-deep-tree

8d753d3

jhy added this to the 1.22.1 milestone Nov 13, 2025

jhy added the improvement and fixed labels Nov 13, 2025

jhy merged commit 2cc74b6 into jhy:master Nov 13, 2025
12 checks passed
