Address the remaining issues in #45 and fix #46 #47

TimG1964 · 2025-08-15T14:49:51Z

This PR is motivated by two observations on PR #45:

It seems to get exponentially slower for very large XML documents (noted in Performance Regression in v0.3.6 with LazyNode Usage #46)
It doesn't work for all cases of nesting

I believe I've resolved both of these issues and I've added a slew of extra tests to cover the second of these.

Some points to note:

Handling xml:space attributes correctly is slower than not handling them at all (i.e. v0.3.5). The difference isn't huge, and certainly isn't the cause of the regression reported in Performance Regression in v0.3.6 with LazyNode Usage #46. However, to mitigate this, and because use of xml:space isn't particularly common, I've split next and prev into two separate pathways. When a Raw entity is first created, I test for the presence of "xml:space" anywhere in the data and store the result in a flag (raw.has_xml_space). When next or prev is invoked, it checks this flag and takes the path that handles the attribute correctly only if the flag is true. If it is false, the path taken is identical to the function from v0.3.5.
Handling xml:space in prev is challenging because it is necessary to know the status of the attribute which may be inherited from anywhere "above" the current text node in the xml structure but, because we are moving backwards, the xml tree hasn't yet been processed. With a little help from ChatGPT v5, I was able to find a way to use next to determine the correct attribute inheritance reliably. This approach has the advantage of keeping next and prev reliably consistent, too.
I've added a number of tests, some of which were suggested by ChatGPT v5. I may have overdone it, but I'd rather try to be safe (this time!).
I have not made any changes to XML.write and, as a result, it does not properly respect xml:space="preserve". Instead, it continues to add indentation and line feeds for pretty printing and this means a node containing xml:space="preserve" cannot do a roundtrip through write -> parse.
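
As a rough illustration of the flag-and-dispatch idea in the first point (not the PR's actual code), the sketch below uses a hypothetical ToyRaw type and placeholder functions. Only the field name has_xml_space and the substring test for "xml:space" come from the description above; everything else is an assumption made for the example.

```julia
# Hypothetical stand-in for the PR's Raw type; only the flag idea is taken
# from the description above.
struct ToyRaw
    data::Vector{UInt8}
    has_xml_space::Bool
end

# Compute the flag once, at construction time, with a plain substring search.
ToyRaw(s::String) = ToyRaw(collect(codeunits(s)), occursin("xml:space", s))

# Placeholders for the two code paths (assumed names, no real parsing here).
next_fast(raw::ToyRaw) = :fast_path            # stands in for the v0.3.5-style code
next_xml_space(raw::ToyRaw) = :xml_space_path  # stands in for the attribute-aware code

# The public entry point only pays for xml:space handling when the flag is set.
toy_next(raw::ToyRaw) = raw.has_xml_space ? next_xml_space(raw) : next_fast(raw)

plain = ToyRaw("<a><b/></a>")
spaced = ToyRaw("<a xml:space=\"preserve\"> x </a>")
```

With a test this coarse, a document that merely mentions "xml:space" in text or a comment would also take the slower path. Assuming the slower path is correct for every document, that costs time rather than correctness.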

I've also made one other decision in this implementation which is (perhaps) arbitrary but is essentially trivial to reverse:

Where xml:space = "preserve" is specified, RawText nodes are created between sibling nodes. I've chosen to keep these (three commented lines would readily suppress these).

To illustrate this last point, consider this example:

<root xml:space="preserve">
     <child>  normalized despite parent  </child>
     <child2>  normalized despite parent  </child2>
</root>

How many children does the <root> node have? Because space is preserved, the first child is a Text node of <root> itself - LazyNode (depth=2) Text "\n     ". This is the line feed and indentation that the xml:space attribute requires to be preserved before <child> is reached. There are similar text nodes between </child> and <child2> and between </child2> and </root>. Thus <root> has 5 children, not 2:

julia> XML.children(doc[1]) # doc defined by parsing the above xml
5-element Vector{Node}:
 Node Text "\n     "
 Node Element <child> (1 child)
 Node Text "\n     "
 Node Element <child2> (1 child)
 Node Text "\n"

I've compared the above behaviour with EzXML.jl, which does the following:

julia> doc="""<root xml:space="preserve">
            <child>  normalized despite parent  </child>
            <child2>  normalized despite parent  </child2>
       </root>"""
"<root xml:space=\"preserve\">\n     <child>  normalized despite parent  </child>\n     <child2>  normalized despite parent  </child2>\n</root>"

julia> d=EzXML.parsexml(doc)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00000220c5ace530>))

julia> r=d.root
EzXML.Node(<ELEMENT_NODE[root]@0x00000220c65b4570>)

julia> c=EzXML.elements(r)
2-element Vector{EzXML.Node}:
 EzXML.Node(<ELEMENT_NODE[child]@0x00000220c65b56f0>)
 EzXML.Node(<ELEMENT_NODE[child2]@0x00000220c65b46f0>)

julia> findall("text()", r)
3-element Vector{EzXML.Node}:
 EzXML.Node(<TEXT_NODE@0x00000220c65b4170>)
 EzXML.Node(<TEXT_NODE@0x00000220c65b5b70>)
 EzXML.Node(<TEXT_NODE@0x00000220c65b54f0>)

julia> findall("text()", r)[1].content
"\n     "

julia> findall("text()", r)[2].content
"\n     "

julia> findall("text()", r)[3].content
"\n"

julia>

So EzXML.jl retains the orphaned Text nodes and finds the same 5 child nodes (2 elements and 3 text nodes) as this PR.

I would expect this particular combination of features to be rare and my approach seems OK to me in the context of XML.jl.
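
For callers who only want the element children, the whitespace-only text nodes can be filtered out after the fact. The sketch below is a minimal illustration on toy data: the (kind, payload) tuples and the helper is_blank_text are hypothetical stand-ins for the five children listed above, not XML.jl types or functions.

```julia
# Toy stand-ins for the five children shown above: (kind, payload) pairs.
kids = [
    (:Text, "\n     "),
    (:Element, "child"),
    (:Text, "\n     "),
    (:Element, "child2"),
    (:Text, "\n"),
]

# Hypothetical helper: true for a text entry whose content is only whitespace.
is_blank_text(kid) = kid[1] === :Text && all(isspace, kid[2])

# Keep everything except the whitespace-only text entries.
elements_only = filter(!is_blank_text, kids)
```

Here elements_only holds just the two element entries, while kids keeps all five, which mirrors the difference between EzXML.elements(r) and the full child list in the session above.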


TimG1964 · 2025-08-25T14:35:09Z

Hi Josh,

Here is a revised version of XML.write() which I think respects xml:space.

function write(io::IO, x, ctx::Vector{Bool}=[false]; indentsize::Int=2, depth::Int=1)
    indent = ' '^indentsize
    nodetype = XML.nodetype(x)
    tag = XML.tag(x)
    value = XML.value(x)
    children = XML.children(x)

    padding = indent^max(0, depth - 1)
    !ctx[end] && print(io, padding)
    if nodetype === Text
        print(io, value)

    elseif nodetype === Element
        push!(ctx, ctx[end])
        update_ctx!(ctx, x)
        print(io, '<', tag)
        _print_attrs(io, x)
        print(io, isempty(children) ? '/' : "", '>')
        if !isempty(children)
            if length(children) == 1 && XML.nodetype(only(children)) === Text
                write(io, only(children), ctx; indentsize=0)
                print(io, "</", tag, '>')
            else
                !ctx[end] && println(io)
                foreach(children) do child
                    write(io, child, ctx; indentsize, depth=depth + 1)
                    !ctx[end] && println(io)
                end
                print(io, !ctx[end] ? padding : "", "</", tag, '>')
            end
        end
        pop!(ctx)

    elseif nodetype === DTD
        print(io, "<!DOCTYPE ", value, '>')

    elseif nodetype === Declaration
        print(io, "<?xml")
        _print_attrs(io, x)
        print(io, "?>")

    elseif nodetype === ProcessingInstruction
        print(io, "<?", tag)
        _print_attrs(io, x)
        print(io, "?>")

    elseif nodetype === Comment
        print(io, "<!--", value, "-->")

    elseif nodetype === CData
        print(io, "<![CDATA[", value, "]]>")

    elseif nodetype === Document
        foreach(children) do child
            write(io, child, ctx; indentsize)
            !ctx[end] && println(io)
        end

    else
        error("Unreachable case reached during XML.write")
    end

end

It relies on a function (update_ctx!()) from PR #47 to maintain the status of xml:space during a traversal of the XML tree. In addition, several of the tests in #47 need updating to reflect this more correct behaviour.

These two examples illustrate:

julia> lzxml = """<root>\n   <text>    </text>\n   <text2>  hello  </text2><text3 xml:space="preserve">  hello  <text3b>  preserve  </text3b></text3>\n   <text4 xml:space="preserve"></text4><text5/></root>"""

julia> lz = XML.parse(XML.LazyNode, lzxml)
LazyNode (depth=0) Document

julia> println(XML.write(lz)) # respecting xml:space="preserve"
<root>
  <text/>
  <text2>hello</text2>
  <text3 xml:space="preserve">  hello  <text3b>  preserve  </text3b></text3>
  <text4 xml:space="preserve"/>
  <text5/>
</root>

julia>  n2xml = """<root>\n   <text>    </text>\n   <text2>  hello  </text2><text3 xml:space="default">  hello  <text3b>  preserve  </text3b></text3>\n   <text4 xml:space="default"></text4><text5/></root>"""

julia>  n2 = XML.parse(XML.LazyNode, n2xml)
LazyNode (depth=0) Document

julia> println(XML.write(n2)) # as v0.3.5 would have printed
<root>
  <text/>
  <text2>hello</text2>
  <text3 xml:space="default">
    hello
    <text3b>preserve</text3b>
  </text3>
  <text4 xml:space="default"/>
  <text5/>
</root>
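
The push!/update_ctx!/pop! pattern in the write function above amounts to keeping a stack of inherited xml:space states. The following is a minimal, self-contained sketch of that bookkeeping. ToyElement, toy_update_ctx! and collect_modes! are hypothetical names, and the body of toy_update_ctx! is only an assumption about what update_ctx!() from #47 does, since that function is not shown here.

```julia
# Toy element type for illustration only; not XML.jl's Node.
struct ToyElement
    tag::String
    attributes::Dict{String,String}
    children::Vector{ToyElement}
end
ToyElement(tag; attrs=Dict{String,String}(), children=ToyElement[]) =
    ToyElement(tag, attrs, children)

# Assumed behaviour: overwrite the innermost entry when the element sets xml:space.
function toy_update_ctx!(ctx::Vector{Bool}, el::ToyElement)
    space = get(el.attributes, "xml:space", nothing)
    if space == "preserve"
        ctx[end] = true
    elseif space == "default"
        ctx[end] = false
    end
    return ctx
end

# Record tag => preserve? for every element in document order.
function collect_modes!(out, el::ToyElement, ctx::Vector{Bool}=[false])
    push!(ctx, ctx[end])        # inherit the parent's mode
    toy_update_ctx!(ctx, el)    # then apply this element's own attribute, if any
    push!(out, el.tag => ctx[end])
    for child in el.children
        collect_modes!(out, child, ctx)
    end
    pop!(ctx)                   # restore the parent's mode on the way out
    return out
end

tree = ToyElement("root"; children=[
    ToyElement("a"; attrs=Dict("xml:space" => "preserve"), children=[
        ToyElement("b"),
        ToyElement("c"; attrs=Dict("xml:space" => "default"))
    ]),
    ToyElement("d")
])
modes = collect_modes!(Pair{String,Bool}[], tree)
```

In this toy tree, b inherits preserve from a, c switches back to default, and d is unaffected because the entry pushed for a has already been popped.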

Not sure how to add this as a PR. I can wait until you decide whether to merge #47 and then make a new PR with this if you do merge. Alternatively I could just update my github fork now and it will become part of #47. Reluctant to do the latter since I've been advised correct etiquette is not to bundle up multiple changes in a single PR. I'll wait to hear from you...

Thanks,

Tim

mkitti · 2025-08-29T07:37:46Z

oof, this performance regression is quite significant. Thank you for catching this.

joshday · 2025-09-02T09:50:44Z

Sorry for the delay!

My gut instinct is that this and the previous PR add(ed) too much complexity for handling preserved spaces, but I also recognize that preserved spaces are a really annoying thing to get right. I'll merge this and create a new release, but I'd really like a simpler implementation that's easier to navigate and contribute to. I do have a draft of a redesign going, but it's nowhere near ready.

TimG1964 · 2025-09-02T13:37:03Z

Thanks, Josh. I really appreciate it - especially after my previous attempt!

Am I OK now to make a PR for the changes to XML.write that I described above (and to include the suggested fix for #48)?

joshday · 2025-09-02T14:49:00Z

Yes, please do make a PR for the write method!

TimG1964 added 3 commits August 14, 2025 21:38

Address speed regression in JuliaComputing#46 and fix reamining issues in JuliaComputing#45.

55c9ea5

Use @views more often for slices

c4a24bb

undo skip orphan text nodes (cf. EzXML)

237e5e3

joshday merged commit 5466022 into JuliaComputing:main Sep 2, 2025
16 checks passed
