Description
Hi, and thank you for HTML5::DOM, which has served me superbly quite a few times.
Alas, it failed me when I tried to parse the contents of a webpage that declares its encoding as "charset=windows-1253" (via this: <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">). The result is that parse() returns nodes whose text, when printed on a Linux console, appears as gibberish (the typical horror of Perl's screen-of-unicode-death §Ξ΅Ξ—ΣΤΛΩΛΩ).
My eventual solution was to zap the evil windows-1253 from the HTML content and replace it with UTF-8.
How do I solve this properly (though I don't mind the zapping)?
Secondly, I tried calling parse(..., {utf8=>0}) to tell HTML5::DOM not to concern itself with Unicode at all and to return un-encoded text, which I would then encode myself. Either I made a mistake or this is not possible, because I ended up with even more gibberish. On second thought, why use utf8=>0 when the encoding is not UTF-8?
Below is a self-contained example demonstrating the problem.
Many thanks,
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML5::DOM;
use Encode;
my $ua = LWP::UserAgent->new();
my $response = $ua->request(
    HTTP::Request->new(
        'GET' => 'https://www.areiospagos.gr/proedros.htm',
        [
            'Connection' => 'keep-alive',
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, deflate',
            'Accept-Language' => 'en-GB,en;q=0.5',
            'Referer' => 'http://www.polignosi.com/cgibin/hweb',
            'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
            'Upgrade-Insecure-Requests' => '1'
        ],
    )
);
die "request failed: ".$response->status_line."\n" unless $response->is_success;
my $html = $response->decoded_content;
print "encoding using detect(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detect($html))."\n";
print "encoding using detectUnicode(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectUnicode($html))."\n";
print "encoding using detectByPrescanStream(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectByPrescanStream($html))."\n";
# The above html contains
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">
# replacing the crappy windows-1253 with UTF-8 solves my problem
#$html =~ s/charset=windows-1253/charset=UTF-8/g;
my $parser = HTML5::DOM->new();
my $tree = $parser->parse($html, {scripts => 0});
my $is_utf8_enabled = $tree->utf8;
# report whether the utf8 flag is set on the tree
print "is_utf8_enabled=".($is_utf8_enabled ? "true" : "false")."\n";
my $text = $tree->find('body table#table1 tbody tr td table#table2 tbody tr td p span')->[0]->text();
# it prints gibberish (doubly-encoded)
print $text;
# it is solved by replacing the windows-1253 charset in $html, see above
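
For what it's worth, the gibberish looks like what one gets when text that is already UTF-8 is decoded a second time as windows-1253. The following is a minimal sketch using only the core Encode module. It does not touch HTML5::DOM or the network, and whether this is what actually happens inside parse() is purely my assumption. It only shows that a single Greek letter, run through that mismatch, turns into the first two characters of the gibberish quoted above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

# One Greek capital rho (U+03A1) as a Perl character string.
my $text = "\x{3A1}";

# Its UTF-8 representation is the two bytes 0xCE 0xA1.
my $utf8_bytes = encode('UTF-8', $text);

# Decoding those bytes as windows-1253 (called cp1253 in Encode) yields
# two characters, U+039E and U+0385, instead of the original one.
my $mojibake = decode('cp1253', $utf8_bytes);

# Print the original, the mangled result, and the mangled code points.
printf "%s -> %s (%s)\n", $text, $mojibake,
    join ' ', map { sprintf 'U+%04X', ord } split //, $mojibake;
```

If that guess is right, it would also explain why replacing the charset declaration with UTF-8 makes the problem go away: the second decoding step then has nothing to get wrong.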