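ISO-8859-1 (Latin-1) matches the first 256 code points of Unicode, but it is not byte-compatible with UTF-8: only the ASCII range (00 through 7F) is encoded identically in both. UTF-8 is an encoding of Unicode, which supports many languages.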
windows-1252 is like ISO-8859-1, except it also uses positions 80 through 9F (which ISO-8859-1 reserves for control codes) for "special" Windows characters -- see http://en.wikipedia.org/wiki/Windows_1252
These special characters (the most common being "smart" or "curly" quotes and the em dash, often typed as a double dash) most commonly find their way into HTML when Microsoft Word content is pasted into email, web content, CMSs, feeds, and blogs.
Almost all modern browsers use a "loose" interpretation of ISO-8859-1 (in practice they treat it as windows-1252), so the Windows characters appear as intended (one that does not always appear or print right is the euro sign, €).
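To see the difference between the strict and the loose interpretation, here is a minimal sketch using Perl's core Encode module. The sample bytes are made up for illustration; the point is only that the same byte decodes to a different character under each encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample: the word "hi" wrapped in windows-1252 curly quotes.
my $bytes = "\x93hi\x94";

my $as_cp1252 = decode('cp1252',     $bytes);
my $as_latin1 = decode('iso-8859-1', $bytes);

# Show the code point of the first character under each interpretation.
printf "cp1252:     U+%04X\n", ord(substr($as_cp1252, 0, 1));
printf "iso-8859-1: U+%04X\n", ord(substr($as_latin1, 0, 1));
# prints:
# cp1252:     U+201C
# iso-8859-1: U+0093
```

U+201C is the left curly quote; U+0093 is an invisible control code, which is why a strict reader shows nothing useful.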
The problem:
Pages interpreted as strict ISO-8859-1 or UTF-8, including all properly served XML content (such as RSS feeds), will not reproduce the windows-1252 characters. The characters will not be replaced automatically, and the feeds will not validate properly.
To make matters worse, the XML <title> and most channel fields disallow the use of character entities, so substituting the decimal, hex, Unicode, or HTML equivalents (like &#8364; or &euro;) will invalidate the feed. Furthermore, direct replacements like this only work if the server encoding is UTF-8 (irrespective of your feed declaration line). So the only place where direct replacement with the UTF-8 equivalents works is in the <description> fields.
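For reference, the different "equivalents" of a single character can be listed with a short sketch like the one below. It assumes the core Encode module and uses the euro sign (byte 80 in windows-1252) as the example; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The euro sign, starting from its windows-1252 byte.
my $euro = decode('cp1252', "\x80");

# Raw UTF-8 byte sequence, shown as hex.
my $utf8_bytes = join ' ',
    map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $euro);

# Decimal and hex numeric character references.
my $decimal_ref = sprintf '&#%d;',  ord($euro);
my $hex_ref     = sprintf '&#x%X;', ord($euro);

print "UTF-8 bytes: $utf8_bytes\n";
print "decimal:     $decimal_ref\n";
print "hex:         $hex_ref\n";
# prints:
# UTF-8 bytes: E2 82 AC
# decimal:     &#8364;
# hex:         &#x20AC;
```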
Towards a solution:
I have written two subs which will be incorporated into my RSS Feed Style after a little more testing.
Sub replace1252 is for ISO-8859-1 content, and also where html substitutions are not allowed, such as xml <title> fields. It makes reasonable replacements of windows-1252 characters with generic equivalents, based on their most common usage. It is also the safest route for compatibility and printability.
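# Added to fix undeclared characters (windows-1252)
# Expects a byte string; replaces bytes 0x80-0x9F with plain ASCII stand-ins.
sub replace1252 {
    my $string = shift;
    # One-to-one: low quotes, florin, dagger, circumflex, S/Z caron, angle quote
    $string =~ tr/\x82\x83\x84\x86\x88\x8A\x8B\x8E/'f"\*^S'Z/;
    # One-to-one: curly quotes, bullet, en dash, tilde, s/z caron, angle quote, Y diaeresis
    $string =~ tr/\x91\x92\x93\x94\x95\x96\x98\x9A\x9B\x9E\x9F/''""\*\-~s'zY/;
    # One-to-many replacements
    $string =~ s/\x80/EUR/g;      # euro sign
    $string =~ s/\x85/.../g;      # ellipsis
    $string =~ s/\x87/**/g;       # double dagger
    $string =~ s/\x89/0\/00/g;    # per mille sign
    $string =~ s/\x8C/OE/g;       # OE ligature
    $string =~ s/\x97/--/g;       # em dash
    $string =~ s/\x99/(TM)/g;     # trademark sign
    $string =~ s/\x9C/oe/g;       # oe ligature
    # Delete whatever is left in the range (positions windows-1252 leaves undefined)
    $string =~ tr/\x80-\x9F//d;
    return $string;
}
# Example usage
# $Subject = replace1252($Subject);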
Sub preserve1252 is a little riskier, because it must be used only with server-encoded UTF-8 content (irrespective of your xml encoding declaration), and only where html content is allowed, such as xml <description> fields, and on web pages, etc.
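# Use in <description> fields only
# Only works when feed is served UTF-8
# Expects a byte string; replaces bytes 0x80-0x9F with decimal character references.
sub preserve1252 {
    my $string = shift;
    $string =~ s/\x80/&#8364;/g;   # euro sign
    $string =~ s/\x82/&#8218;/g;   # single low quote
    $string =~ s/\x83/&#402;/g;    # florin
    $string =~ s/\x84/&#8222;/g;   # double low quote
    $string =~ s/\x85/&#8230;/g;   # ellipsis
    $string =~ s/\x86/&#8224;/g;   # dagger
    $string =~ s/\x87/&#8225;/g;   # double dagger
    $string =~ s/\x88/&#710;/g;    # circumflex
    $string =~ s/\x89/&#8240;/g;   # per mille sign
    $string =~ s/\x8A/&#352;/g;    # S caron
    $string =~ s/\x8B/&#8249;/g;   # left angle quote
    $string =~ s/\x8C/&#338;/g;    # OE ligature
    $string =~ s/\x8E/&#381;/g;    # Z caron
    $string =~ s/\x91/&#8216;/g;   # left single quote
    $string =~ s/\x92/&#8217;/g;   # right single quote
    $string =~ s/\x93/&#8220;/g;   # left double quote
    $string =~ s/\x94/&#8221;/g;   # right double quote
    $string =~ s/\x95/&#8226;/g;   # bullet
    $string =~ s/\x96/&#8211;/g;   # en dash
    $string =~ s/\x97/&#8212;/g;   # em dash
    $string =~ s/\x98/&#732;/g;    # small tilde
    $string =~ s/\x99/&#8482;/g;   # trademark sign
    $string =~ s/\x9A/&#353;/g;    # s caron
    $string =~ s/\x9B/&#8250;/g;   # right angle quote
    $string =~ s/\x9C/&#339;/g;    # oe ligature
    $string =~ s/\x9E/&#382;/g;    # z caron
    $string =~ s/\x9F/&#376;/g;    # Y diaeresis
    # Delete whatever is left in the range (positions windows-1252 leaves undefined)
    $string =~ tr/\x80-\x9F//d;
    return $string;
}
# Example usage
# $Text = preserve1252($Text);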
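As a possible alternative to a hand-written table, the same kind of conversion can be sketched with Perl's core Encode module. This is not part of the two subs above, only an illustration: the helper names cp1252_to_utf8 and cp1252_to_entities are hypothetical, and the input is assumed to be a windows-1252 byte string. Note that cp1252_to_entities also converts ordinary Latin-1 characters (such as accented letters) to character references, which the preserve1252 sub leaves untouched.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: treat the input as windows-1252 bytes and
# return the same text as UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('cp1252', $bytes));
}

# Hypothetical helper: treat the input as windows-1252 bytes and
# return pure ASCII, writing every non-ASCII character as a
# decimal numeric character reference.
sub cp1252_to_entities {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Made-up sample: curly quotes, an em dash, and a euro sign.
print cp1252_to_entities("\x93Hi\x94 \x97 5\x80"), "\n";
# prints: &#8220;Hi&#8221; &#8212; 5&#8364;
```

Because the second helper returns plain ASCII, its output does not depend on how the feed is served, but it is still only suitable where character references are accepted.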