TimG1964
Contributor

@TimG1964 TimG1964 commented Aug 15, 2025

This PR is motivated by two observations on PR #45:

I believe I've resolved both of these issues and I've added a slew of extra tests to cover the second of these.

Some points to note:

  • Handling xml:space attributes correctly is slower than not handling them at all (i.e. v0.3.5). The difference isn't huge, and certainly isn't the cause of the regression reported in #46 ("Performance Regression in v0.3.6 with LazyNode Usage"). However, to mitigate this, and because use of xml:space isn't particularly common, I've split next and prev into two separate pathways. When a Raw entity is first created, I test for the presence of "xml:space" anywhere in the data and store the result in a flag (raw.has_xml_space). When next or prev is invoked, it checks this flag and takes the path that handles the attribute correctly only if the flag is true. If it is false, the path taken is identical to the function from v0.3.5.
  • Handling xml:space in prev is challenging. The status of the attribute may be inherited from anywhere "above" the current text node in the xml structure but, because we are moving backwards, that part of the xml tree hasn't yet been processed. With a little help from ChatGPT v5, I was able to find a way to use next to determine the correct attribute inheritance reliably. This approach has the advantage of keeping next and prev reliably consistent, too.
  • I've added a number of tests, some of which were suggested by ChatGPT v5. I may have overdone it, but I'd rather try to be safe (this time!).
  • I have not made any changes to XML.write and, as a result, it does not properly respect xml:space="preserve". Instead, it continues to add indentation and line feeds for pretty printing and this means a node containing xml:space="preserve" cannot do a roundtrip through write -> parse.
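The flag-based split described in the first point can be sketched roughly as follows. This is a minimal illustration only: `RawSketch`, `next_fast`, `next_preserving` and `next_sketch` are hypothetical names, not the types or functions in XML.jl, and the two path functions are placeholders that just report which path was taken.

```julia
# Hypothetical stand-in for a Raw entity; only the flag idea comes from the PR text.
struct RawSketch
    data::Vector{UInt8}
    has_xml_space::Bool
end

# Scan the data once, at construction time, for the literal "xml:space".
# String(...) takes ownership of its argument, so work on a copy.
RawSketch(data::Vector{UInt8}) =
    RawSketch(data, occursin("xml:space", String(copy(data))))
RawSketch(s::AbstractString) = RawSketch(collect(codeunits(s)))

# Placeholder paths: the real ones would return the next token.
next_fast(raw::RawSketch) = :fast_path              # assumed: v0.3.5 behaviour
next_preserving(raw::RawSketch) = :xml_space_path   # assumed: xml:space-aware behaviour

# Dispatch on the cached flag, so documents without xml:space pay no extra cost.
next_sketch(raw::RawSketch) =
    raw.has_xml_space ? next_preserving(raw) : next_fast(raw)

plain = RawSketch("<root><a> x </a></root>")
spaced = RawSketch("<root xml:space=\"preserve\"><a> x </a></root>")
```

The scan is a plain substring search, so it can give a false positive (for example when "xml:space" appears inside a comment). In this sketch that only costs speed, since the slower path is assumed to be correct for every document.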

I've also made one other decision in this implementation which is (perhaps) arbitrary but is essentially trivial to reverse:

  • Where xml:space = "preserve" is specified, RawText nodes are created between sibling nodes. I've chosen to keep these (three commented lines would readily suppress these).

To illustrate this last point, consider this example:

<root xml:space="preserve">
     <child>  normalized despite parent  </child>
     <child2>  normalized despite parent  </child2>
</root>

How many children does the <root> node have? Because space is preserved, the first child of <root> is a Text node: LazyNode (depth=2) Text "\n     ". This is the line feed and indentation which the xml:space attribute requires to be preserved before <child> is reached. There are similar text nodes between </child> and <child2> and between </child2> and </root>. Thus <root> has 5 children and not 2:

julia> XML.children(doc[1]) # doc defined by parsing the above xml
5-element Vector{Node}:
 Node Text "\n     "
 Node Element <child> (1 child)
 Node Text "\n     "
 Node Element <child2> (1 child)
 Node Text "\n"
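
The counting rule behind this result can be illustrated without the parser. The following is a minimal sketch, assuming a hand-written list of the pieces found between <root> and </root> and two hypothetical helpers, `keep_child` and `count_children`, neither of which is part of XML.jl:

```julia
# Pieces between <root ...> and </root>, in document order (hand-written for illustration).
pieces = [
    (:text, "\n     "),
    (:element, "child"),
    (:text, "\n     "),
    (:element, "child2"),
    (:text, "\n"),
]

# Assumed rule: elements are always kept; whitespace-only text is kept
# only while xml:space="preserve" is in effect.
keep_child(kind::Symbol, content::AbstractString, preserve::Bool) =
    kind === :element || preserve || !all(isspace, content)

count_children(pieces, preserve::Bool) =
    count(p -> keep_child(p[1], p[2], preserve), pieces)

with_preserve = count_children(pieces, true)     # 5: two elements plus three text nodes
without_preserve = count_children(pieces, false) # 2: the elements only
```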

I've compared the above behaviour with EzXML.jl, which does the following:

julia> doc="""<root xml:space="preserve">
            <child>  normalized despite parent  </child>
            <child2>  normalized despite parent  </child2>
       </root>"""
"<root xml:space=\"preserve\">\n     <child>  normalized despite parent  </child>\n     <child2>  normalized despite parent  </child2>\n</root>"

julia> d=EzXML.parsexml(doc)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00000220c5ace530>))

julia> r=d.root
EzXML.Node(<ELEMENT_NODE[root]@0x00000220c65b4570>)

julia> c=EzXML.elements(r)
2-element Vector{EzXML.Node}:
 EzXML.Node(<ELEMENT_NODE[child]@0x00000220c65b56f0>)
 EzXML.Node(<ELEMENT_NODE[child2]@0x00000220c65b46f0>)

julia> findall("text()", r)
3-element Vector{EzXML.Node}:
 EzXML.Node(<TEXT_NODE@0x00000220c65b4170>)
 EzXML.Node(<TEXT_NODE@0x00000220c65b5b70>)
 EzXML.Node(<TEXT_NODE@0x00000220c65b54f0>)

julia> findall("text()", r)[1].content
"\n     "

julia> findall("text()", r)[2].content
"\n     "

julia> findall("text()", r)[3].content
"\n"

julia>

So EzXML.jl retains the orphaned Text nodes and finds the same 5 child nodes (2 elements and 3 text nodes) as this PR.

I would expect this particular combination of features to be rare and my approach seems OK to me in the context of XML.jl.

@TimG1964
Contributor Author

Hi Josh,

Here is a revised version of XML.write() which I think respects xml:space.

function write(io::IO, x, ctx::Vector{Bool}=[false]; indentsize::Int=2, depth::Int=1)
    indent = ' '^indentsize
    nodetype = XML.nodetype(x)
    tag = XML.tag(x)
    value = XML.value(x)
    children = XML.children(x)

    # ctx[end] is true while xml:space="preserve" is in effect; indent only when it is false
    padding = indent^max(0, depth - 1)
    !ctx[end] && print(io, padding)
    if nodetype === Text
        print(io, value)

    elseif nodetype === Element
        # Inherit the parent's xml:space state, then let this element's own attribute override it
        push!(ctx, ctx[end])
        update_ctx!(ctx, x)
        print(io, '<', tag)
        _print_attrs(io, x)
        print(io, isempty(children) ? '/' : "", '>')
        if !isempty(children)
            if length(children) == 1 && XML.nodetype(only(children)) === Text
                write(io, only(children), ctx; indentsize=0)
                print(io, "</", tag, '>')
            else
                !ctx[end] && println(io)
                foreach(children) do child
                    write(io, child, ctx; indentsize, depth=depth + 1)
                    !ctx[end] && println(io)
                end
                print(io, !ctx[end] ? padding : "", "</", tag, '>')
            end
        end
        pop!(ctx)

    elseif nodetype === DTD
        print(io, "<!DOCTYPE ", value, '>')

    elseif nodetype === Declaration
        print(io, "<?xml")
        _print_attrs(io, x)
        print(io, "?>")

    elseif nodetype === ProcessingInstruction
        print(io, "<?", tag)
        _print_attrs(io, x)
        print(io, "?>")

    elseif nodetype === Comment
        print(io, "<!--", value, "-->")

    elseif nodetype === CData
        # The XML CDATA section marker is case-sensitive and must be upper case
        print(io, "<![CDATA[", value, "]]>")

    elseif nodetype === Document
        foreach(children) do child
            write(io, child, ctx; indentsize)
            !ctx[end] && println(io)
        end

    else
        error("Unreachable case reached during XML.write")
    end

end

It relies on a function (update_ctx!()) from PR #47 to maintain the status of xml:space during a traverse of the xml tree. In addition, several of the tests in #47 need updating to reflect this more correct functionality.

These two examples illustrate:

julia> lzxml = """<root>\n   <text>    </text>\n   <text2>  hello  </text2><text3 xml:space="preserve">  hello  <text3b>  preserve  </text3b></text3>\n   <text4 xml:space="preserve"></text4><text5/></root>"""

julia> lz = XML.parse(XML.LazyNode, lzxml)
LazyNode (depth=0) Document

julia> println(XML.write(lz)) # respecting xml:space="preserve"
<root>
  <text/>
  <text2>hello</text2>
  <text3 xml:space="preserve">  hello  <text3b>  preserve  </text3b></text3>
  <text4 xml:space="preserve"/>
  <text5/>
</root>

julia>  n2xml = """<root>\n   <text>    </text>\n   <text2>  hello  </text2><text3 xml:space="default">  hello  <text3b>  preserve  </text3b></text3>\n   <text4 xml:space="default"></text4><text5/></root>"""

julia>  n2 = XML.parse(XML.LazyNode, n2xml)
LazyNode (depth=0) Document

julia> println(XML.write(n2)) # as v0.3.5 would have printed
<root>
  <text/>
  <text2>hello</text2>
  <text3 xml:space="default">
    hello
    <text3b>preserve</text3b>
  </text3>
  <text4 xml:space="default"/>
  <text5/>
</root>
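
The xml:space bookkeeping that the write method relies on can be sketched in isolation. The real update_ctx!() lives in PR #47 and is not shown in this thread, so the version below is an assumption about its behaviour. `ElemSketch`, `update_ctx_sketch!`, `collect_ctx!` and `leaf` are hypothetical names that use a plain Dict for attributes rather than XML.jl's node types.

```julia
# Hypothetical element type; XML.jl's own node types are not used here.
struct ElemSketch
    tag::String
    attributes::Dict{String,String}
    children::Vector{ElemSketch}
end

# Assumed behaviour: overwrite the top of the stack only when the element
# carries an explicit xml:space attribute; otherwise the inherited value stays.
function update_ctx_sketch!(ctx::Vector{Bool}, el::ElemSketch)
    v = get(el.attributes, "xml:space", nothing)
    v === nothing || (ctx[end] = (v == "preserve"))
    return ctx
end

# Walk the tree in the same push / update / recurse / pop pattern as the
# write method above, recording the state in effect for each element.
function collect_ctx!(out, el::ElemSketch, ctx::Vector{Bool}=[false])
    push!(ctx, ctx[end])
    update_ctx_sketch!(ctx, el)
    push!(out, el.tag => ctx[end])
    foreach(c -> collect_ctx!(out, c, ctx), el.children)
    pop!(ctx)
    return out
end

leaf(tag, attrs=Dict{String,String}()) = ElemSketch(tag, attrs, ElemSketch[])

tree = ElemSketch("root", Dict{String,String}(), [
    leaf("text2"),
    ElemSketch("text3", Dict("xml:space" => "preserve"), [leaf("text3b")]),
    leaf("text5"),
])

result = collect_ctx!(Pair{String,Bool}[], tree)
```

In this sketch text3b inherits the preserve state from text3, while text5 falls back to the default once text3's entry has been popped. That matches the first example output above, where only the content of text3 is written without added indentation.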

Not sure how to add this as a PR. I can wait until you decide whether to merge #47 and then make a new PR with this if you do merge. Alternatively I could just update my github fork now and it will become part of #47. Reluctant to do the latter since I've been advised correct etiquette is not to bundle up multiple changes in a single PR. I'll wait to hear from you...

Thanks,

Tim

@mkitti

mkitti commented Aug 29, 2025

oof, this performance regression is quite significant. Thank you for catching this.

@joshday
Member

joshday commented Sep 2, 2025

Sorry for the delay!

My gut instinct is that this and the previous PR add(ed) too much complexity for handling preserved spaces, but I also recognize that preserved spaces are a really annoying thing to get right. I'll merge this and create a new release, but I'd really like a simpler implementation that's easier to navigate and contribute to. I do have a draft of a redesign going, but it's nowhere near ready.

@joshday joshday merged commit 5466022 into JuliaComputing:main Sep 2, 2025
16 checks passed
@TimG1964
Contributor Author

TimG1964 commented Sep 2, 2025

Thanks, Josh. I really appreciate it - especially after my previous attempt!

Am I OK now to make a PR for the changes to XML.write that I described above (and to include the suggested fix for #48)?

@joshday
Member

joshday commented Sep 2, 2025

Yes, please do make a PR for the write method!
