Problems when html's charset is windows-1253 #10

@hadjiprocopis

Hi, and thank you for HTML5::DOM, which has served me superbly quite a few times.

Alas, it failed me when I tried to parse the contents of a webpage that declares its encoding as "charset=windows-1253" (via this: <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">). The result is that parse() returns nodes whose text, when printed on a Linux console, appears as gibberish (the typical horror of Perl's screen-of-unicode-death §Ξ΅Ξ—ΣΤΛΩΛΩ).
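To illustrate what I suspect is happening, here is a minimal sketch that uses only the core Encode module (no HTML5::DOM, no network). It assumes that decoded_content has already turned the windows-1253 bytes into Perl characters, and that something downstream then interprets the UTF-8 form of those characters as windows-1253 a second time. The sample word and all variable names are made up for illustration; I have not verified that this is what HTML5::DOM does internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample Greek word written with escapes, so the source file encoding does not matter.
my $original = "\x{03A0}\x{03C1}\x{03CC}\x{03B5}\x{03B4}\x{03C1}\x{03BF}\x{03C2}";

# Bytes as a windows-1253 server would send them.
my $wire = encode('cp1253', $original);

# First (correct) decode: bytes to Perl characters.
my $decoded = decode('cp1253', $wire);

# Assumed failure mode: the characters are serialized as UTF-8 and those
# bytes are then decoded as windows-1253 once more.
my $utf8_bytes = encode('UTF-8', $decoded);
my $mojibake   = decode('cp1253', $utf8_bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "decoded once:  $decoded\n";
print "decoded twice: $mojibake\n";
```

The twice-decoded string is full of capital Greek letters such as Ξ and Ο, much like the gibberish I see on my console, which is why I suspect a double decode.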

My eventual solution was to zap the evil windows-1253 from the html content and replace it with UTF-8.
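A slightly more general form of that zapping, as a sketch: the helper below (force_utf8_meta is a hypothetical name of mine, not part of HTML5::DOM) rewrites whatever charset a <meta> tag declares to UTF-8. It assumes the HTML is already a decoded Perl character string, and it is a plain regex, not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the charset declared inside any <meta ...> tag
# with UTF-8. Only sensible when $html already holds decoded characters.
sub force_utf8_meta {
    my ($html) = @_;
    # Capture everything from "<meta" up to the charset value (including an
    # optional opening quote), then swap the charset name itself.
    $html =~ s{(<meta\b[^>]*?\bcharset\s*=\s*["']?)[A-Za-z0-9._:-]+}{${1}UTF-8}gi;
    return $html;
}

my $in  = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1253">';
my $out = force_utf8_meta($in);
print "$out\n";
```

Text outside <meta> tags that happens to mention "charset=" is left alone, which the blanket s///g in my script below does not guarantee.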

How to solve this properly (though I don't mind the zapping)?

Secondly, I tried to tell HTML5::DOM not to be concerned with Unicode at all and to give me back un-encoded text, so that I could encode it myself, using parse(..., {utf8=>0}). Either I made a mistake or this is not possible, because I ended up with even more gibberish. On second thought, why use utf8=>0 when the encoding is not UTF-8?

Below is a self-contained example demonstrating the problem.

Many thanks,

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML5::DOM;
use Encode;

my $ua = LWP::UserAgent->new();
my $response = $ua->request(
  HTTP::Request->new(
	'GET' => 'https://www.areiospagos.gr/proedros.htm',
	[
		'Connection' => 'keep-alive',
		'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
		'Accept-Encoding' => 'gzip, deflate',
		'Accept-Language' => 'en-GB,en;q=0.5',
		'Referer' => 'http://www.polignosi.com/cgibin/hweb',
		'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
		'Upgrade-Insecure-Requests' => '1'
	],
  )
);
die unless $response && $response->is_success;

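# decoded_content undoes the Content-Encoding (gzip/deflate) and also decodes
# the charset, so $html holds Perl characters, not raw windows-1253 bytes
my $html = $response->decoded_content;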

print "encoding using detect(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detect($html))."\n";
print "encoding using detectUnicode(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectUnicode($html))."\n";
print "encoding using detectByPrescanStream(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectByPrescanStream($html))."\n";

# The above html contains
#  <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">
# replacing the crappy windows-1253 with UTF-8 solves my problem
#$html =~ s/charset=windows-1253/charset=UTF-8/g;

my $parser = HTML5::DOM->new();

my $tree = $parser->parse($html, {scripts => 0});

my $is_utf8_enabled = $tree->utf8;
# test $is_utf8_enabled here, not $tree: a tree object is always true
print "is_utf8_enabled=".($is_utf8_enabled ? "true" : "false")."\n";
my $text = $tree->find('body table#table1 tbody tr td table#table2 tbody tr td p span')->[0]->text();
# it prints gibberish (doubly-encoded)
print $text;
# it is solved by replacing the windows-1253 charset in $html, see above
