windows-1252 ISO-8859-1 UTF-8 Encoding Issues & Solution

Moderator: Coranto Moderator Team

Post by Musicvid » Thu Aug 16, 2007 1:20 am

Background:
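ISO-8859-1 covers the first 256 Unicode code points, but only its ASCII half (positions 00 through 7F) is byte-for-byte identical in UTF-8, which supports many languages.
windows-1252 is like ISO-8859-1, except that it uses positions 80 through 9F, which ISO-8859-1 reserves for control codes, for "special" windows characters -- see http://en.wikipedia.org/wiki/Windows_1252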

These special characters (the most common being "smart" or "curly" quotes and the "emdash", or double dash) usually find their way into html by way of Microsoft Word content being pasted into email, web content, CMS, feeds, and blogs.

Almost all modern browsers use a "loose" interpretation of 8859-1 and the windows characters appear as intended (one that does not always appear or print right is the euro € sign).
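
To make the background concrete, here is a minimal Perl sketch (not part of the original post) that decodes one byte both ways. It assumes the core Encode module is available and uses byte 93, the left curly quote, purely as an example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One raw byte: 0x93 is the left curly quote in windows-1252,
# but an invisible C1 control character in strict ISO-8859-1.
my $byte = "\x93";

my $as_1252   = decode('cp1252',     $byte);
my $as_latin1 = decode('iso-8859-1', $byte);

printf "windows-1252: U+%04X\n", ord($as_1252);     # U+201C
printf "ISO-8859-1:   U+%04X\n", ord($as_latin1);   # U+0093

# The same character encoded as UTF-8 takes three bytes, not one.
my $utf8_hex = join ' ',
    map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $as_1252);
print "UTF-8 bytes:  $utf8_hex\n";                  # E2 80 9C
```

This is why a page that is read strictly as ISO-8859-1 or UTF-8 cannot show the raw byte as a quote mark: the byte only means "curly quote" under windows-1252.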

The problem:
Pages interpreted as strict ISO-8859-1 or UTF-8, including all properly-served xml content (such as RSS feeds) will not reproduce the windows-1252 characters, nor will they be automatically replaced, nor will the feeds validate properly.

To make matters worse, xml <title> and most channel fields disallow the use of character entities, so substitution with their decimal, hex, unicode, or html equivalents (like &#8364; or &euro;) will invalidate the feed. Furthermore, direct replacements like this only work if the server encoding is UTF-8 (irrespective of your feed declaration line). So the only place where direct replacement with UTF-8 equivalents works is in the <description> fields.
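
Before choosing a fix, it may help to check whether an item contains any of these bytes at all. The following is only a rough sketch: find_1252_bytes is a made-up helper name, and it assumes the content is still raw single-byte text. In data that is already UTF-8, the same byte values occur legitimately as continuation bytes, so this check would give false alarms there.

```perl
use strict;
use warnings;

# Hypothetical helper: list the distinct bytes in the 0x80-0x9F range
# found in a raw (undecoded, single-byte) string, as two-digit hex.
sub find_1252_bytes {
    my ($bytes) = @_;
    my %seen;
    $seen{ sprintf '%02X', ord($1) } = 1 while $bytes =~ /([\x80-\x9F])/g;
    return sort keys %seen;
}

my @found = find_1252_bytes("Word says \x93hello\x94 \x97 again \x93");
print "Suspect bytes: @found\n";   # Suspect bytes: 93 94 97
```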

Towards a solution:
I have written two subs which will be incorporated into my RSS Feed Style after a little more testing.

Sub replace1252 is for ISO-8859-1 content, and also where html substitutions are not allowed, such as xml <title> fields. It makes reasonable replacements of windows-1252 characters with generic equivalents, based on their most common usage. It is also the safest route for compatibility and printability.
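Code:
# Added to fix undeclared characters (windows-1252)
sub replace1252 {
    my $string = shift;
    # One-to-one swaps: low quotes, florin, dagger, circumflex, S/Z caron, angle quote
    $string =~ tr/\x82\x83\x84\x86\x88\x8A\x8B\x8E/'f"*^S'Z/;
    # Curly quotes, bullet, en dash, small tilde, s/z caron, angle quote, Y diaeresis
    $string =~ tr/\x91\x92\x93\x94\x95\x96\x98\x9A\x9B\x9E\x9F/''""*\-~s'zY/;
    # Characters that need a multi-character stand-in
    $string =~ s/\x80/EUR/g;      # euro sign
    $string =~ s/\x85/.../g;      # ellipsis
    $string =~ s/\x87/**/g;       # double dagger
    $string =~ s/\x89/0\/00/g;    # per mille sign
    $string =~ s/\x8C/OE/g;       # OE ligature
    $string =~ s/\x97/--/g;       # em dash
    $string =~ s/\x99/(TM)/g;     # trademark sign
    $string =~ s/\x9C/oe/g;       # oe ligature
    # Drop anything left in the range (unassigned positions)
    $string =~ tr/\x80-\x9F//d;
    return $string;
}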
# Example usage
# my $Subject = replace1252($Subject);

Sub preserve1252 is a little riskier, because it must be used only with server-encoded UTF-8 content (irrespective of your xml encoding declaration), and only where html content is allowed, such as xml <description> fields, and on web pages, etc.
Code:
# Use in <description> fields only
# Only works when feed is served UTF-8
sub preserve1252 {
    my $string = shift;
    $string =~ s/\x80/&#8364;/sg;
    $string =~ s/\x82/&#8218;/sg;
    $string =~ s/\x83/&#402;/sg;
    $string =~ s/\x84/&#8222;/sg;
    $string =~ s/\x85/&#8230;/sg;
    $string =~ s/\x86/&#8224;/sg;
    $string =~ s/\x87/&#8225;/sg;
    $string =~ s/\x88/&#710;/sg;
    $string =~ s/\x89/&#8240;/sg;
    $string =~ s/\x8A/&#352;/sg;
    $string =~ s/\x8B/&#8249;/sg;
    $string =~ s/\x8C/&#338;/sg;
    $string =~ s/\x8E/&#381;/sg;
    $string =~ s/\x91/&#8216;/sg;
    $string =~ s/\x92/&#8217;/sg;
    $string =~ s/\x93/&#8220;/sg;
    $string =~ s/\x94/&#8221;/sg;
    $string =~ s/\x95/&#8226;/sg;
    $string =~ s/\x96/&#8211;/sg;
    $string =~ s/\x97/&#8212;/sg;
    $string =~ s/\x98/&#732;/sg;
    $string =~ s/\x99/&#8482;/sg;
    $string =~ s/\x9A/&#353;/sg;
    $string =~ s/\x9B/&#8250;/sg;
    $string =~ s/\x9C/&#339;/sg;
    $string =~ s/\x9E/&#382;/sg;
    $string =~ s/\x9F/&#376;/sg;
    $string =~ tr/\x80-\x9F//d;
    return($string);
}
# Example usage
# my $Text = preserve1252($Text);
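
As a possible alternative to maintaining the table by hand, the numeric references can be looked up at run time. This is a sketch only, not the author's method: preserve1252_lookup is a hypothetical name, and it assumes the core Encode module and raw single-byte input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical alternative to a hand-written table: find each byte's
# Unicode code point via cp1252 and emit a decimal character reference.
sub preserve1252_lookup {
    my ($string) = @_;
    $string =~ s{([\x80-\x9F])}{
        my $byte = $1;                        # work on a copy of the match
        my $cp   = ord(decode('cp1252', $byte));
        # Unassigned positions (81, 8D, 8F, 90, 9D) come back as U+FFFD
        # or as a C1 control, depending on the mapping; drop them either way.
        ($cp == 0xFFFD || $cp < 0x100) ? '' : "&#$cp;";
    }ge;
    return $string;
}

print preserve1252_lookup("\x93Price\x94 \x80 5 \x97 sale\x81"), "\n";
# prints: &#8220;Price&#8221; &#8364; 5 &#8212; sale
```

The output matches what the explicit substitutions produce for the same bytes, so the same restrictions on where it may be used still apply.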

For those of you who have run into these issues with x80 through x9F characters in your own feeds and want to test the replacements, here is a complete set of the "problem" characters to paste into your test content (don't worry about control characters, they get ignored anyway):
€ ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
Last edited by Musicvid on Fri Jun 06, 2008 12:11 am, edited 2 times in total.
Musicvid
 
Posts: 138
Joined: Wed Jan 17, 2007 1:05 am
Location: Western America

Post by YushDon » Wed Jan 23, 2008 9:45 am

Hello Musicvid. Thanks for your lucid explanation. I am most affected by use of the apostrophe (') and the ampersand (&) sign in titles, description and any part of a document in Coranto.

I am not sure how to implement your solution though. Any kind of additional info for the novice user would be super cool!
YushDon
 
Posts: 88
Joined: Mon Jan 22, 2007 11:33 pm
Location: London

Post by Musicvid » Sat Jan 26, 2008 2:17 am

Yushdon,
Thanks for the compliment.
The issue you are facing is a little different than the windows-1252 character issue.

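In XML, the characters
' " & < >
have special meaning in markup: & and < must always be escaped in element text, while ' " and > need escaping only in certain positions (for example, quotes inside attribute values). The backslash \ has no special meaning in XML itself. The easiest workaround for <title> and <description> fields is to enclose the text in a <![CDATA[ ]]> block, like this: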
Code:
<description><![CDATA[M'y "SILLY" <code> & \stuff]]></description>
Other methods are to escape them with their HTML entities in RSS <description> fields, and replace or eliminate them in <title> fields, where HTML is not allowed. You can use Perl regular expressions to accomplish any of those tasks.
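
For the novice user, both approaches might look roughly like this in Perl. This is a sketch under assumptions: xml_escape and cdata_wrap are hypothetical helper names, not part of Coranto or the RSS Feed Style, and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the five XML-special characters.
# The ampersand goes first so the entities added later are not re-escaped.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&#39;/g;    # numeric form, also understood by HTML parsers
    return $text;
}

# Hypothetical helper: wrap text in CDATA, splitting any "]]>" inside it
# so the section cannot be closed early.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print '<title>', xml_escape(q{Tom & Jerry's "best" <hits>}), "</title>\n";
print '<description>', cdata_wrap(q{M'y "SILLY" <code> & \stuff}), "</description>\n";
```

Whether to escape or to wrap in CDATA is mostly a matter of taste; the one thing to avoid is doing both to the same text, since the entities inside a CDATA block would then show up literally.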

See my RSS Feed Style for practical applications of both solutions.

Hope this helps.
Musicvid
 
Posts: 138
Joined: Wed Jan 17, 2007 1:05 am
Location: Western America

