paulbijnens

Until now, it was documented that an internal DTD inside the doctype declaration confuses HTML::Parser.
Depending on the content of the internal DTD, the parser would return a text token instead of a declaration token, or sometimes a declaration token that also swallowed many of the elements and text following the syntactically correct declaration.
The old implementation did accept an empty internal DTD, such as:
<!DOCTYPE abc SYSTEM "abc.dtd" [] >

This patch allows non-empty internal DTDs inside the square brackets of the doctype declaration and returns the whole internal DTD as a single token in the token list, just as the old implementation returned a token containing only "[]". For example, it now correctly parses:

<!DOCTYPE abc SYSTEM "abc.dtd" [
<!-- even a simple comment here would confuse it -->
<!-- or quoted strings with special chars like ]> -->
<!ENTITY confuse "]>">
] >
<abc>Hello world</abc>
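
To illustrate why comments and quoted strings matter here, the following is a minimal sketch in Python. It is not the code in this patch and not HTML::Parser's API; `split_doctype` is a hypothetical helper written only for this example. It assumes the input starts with `<!DOCTYPE` and that comments inside the internal DTD use the `<!-- ... -->` form. The point is that a `]` or `>` only counts when it is outside a quoted string and outside a comment.

```python
def split_doctype(text):
    """Split text into (doctype declaration, remainder).

    Hypothetical helper for illustration only. Quoted strings and
    <!-- --> comments are skipped as whole units, so a ']' or '>'
    inside them cannot end the internal DTD or the declaration.
    """
    if not text.upper().startswith("<!DOCTYPE"):
        raise ValueError("text does not start with a doctype declaration")

    i = len("<!DOCTYPE")
    in_subset = False  # True while inside the [...] internal DTD
    while i < len(text):
        ch = text[i]
        if ch in "\"'":
            # Skip a quoted string up to its matching quote character.
            end = text.find(ch, i + 1)
            if end == -1:
                break
            i = end + 1
        elif in_subset and text.startswith("<!--", i):
            # Skip a comment inside the internal DTD.
            end = text.find("-->", i + 4)
            if end == -1:
                break
            i = end + 3
        elif ch == "[" and not in_subset:
            in_subset = True
            i += 1
        elif ch == "]" and in_subset:
            in_subset = False
            i += 1
        elif ch == ">" and not in_subset:
            # First '>' outside the internal DTD ends the declaration.
            return text[: i + 1], text[i + 1 :]
        else:
            i += 1
    raise ValueError("unterminated doctype declaration")
```

Called on the example above, this sketch returns the complete declaration (ending in `] >`) as the first value and the `<abc>Hello world</abc>` element as the remainder, which mirrors the token boundary the patch is meant to produce.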

Paul
(Ten years after my previous small patch, I am still using this very nice Perl module, one of the few that allow sane parsing of SGML-like files that contain errors.)

@paulbijnens
Author

Wait a moment, there is still a bug.

@paulbijnens
Author

OK. It now correctly parses all the possible ways comments can appear inside the internal DTD.
Could you have a look now? Feedback is welcome.
