Wednesday, February 01, 2012

Fun in HTML5 and UTF-8 land

Spent the last couple of days looking at converting http://nrich.maths.org to HTML5 and UTF-8.
For a bit of background: the site stores content internally as XML in a MySQL DB, which then gets put through PHP's XML parser and converted to XHTML. Editing is done via CKEditor -> PHP Tidy -> XML parser -> DB.


So far the process looks to be fairly straightforward:

  1. Change the XHTML header to an HTML5 header - there seem to be no major compatibility issues between XHTML and HTML5 (see http://coding.smashingmagazine.com/2009/07/29/misunderstanding-markup-xhtml-2-comic-strip/ !)
  2. Make sure all the XML parsing done internally is using the right character set.
  3. Remove all the bits of code designed to stop UTF-8 characters getting into the DB!
  4. Change the PHP Tidy config to output UTF-8.
  5. Dump the DB out of MySQL and change all the DEFAULT CHARSET=latin1 -> DEFAULT CHARSET=utf8
  6. Run "ALTER DATABASE nrichdb charset = 'utf8';" on the main database
  7. Reimport the DB
  8. Pull all the content out of the DB, put it through utf8_encode(), add the UTF-8 encoding to the XML declaration (the <?xml ... ?> line), and then resave.
  9. ...
  10. $PROFIT$ well OK maybe not...
So far so good...
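
For anyone wanting to try steps 5 and 8 outside the site's PHP code, here is a rough Python sketch of the two text transformations involved. It is not the code used on the site: the function names are made up for illustration, it assumes the stored content really is latin1 (already-UTF-8 data would get double-encoded, just as with utf8_encode()), and it assumes XML 1.0 with the declaration, if any, at the very start of the document.

```python
import re


def convert_dump_charset(sql: str) -> str:
    """Step 5: switch table defaults in a SQL dump from latin1 to utf8."""
    return sql.replace("DEFAULT CHARSET=latin1", "DEFAULT CHARSET=utf8")


# Matches an existing XML declaration at the very start of a document,
# plus any whitespace that follows it.
XML_DECL = re.compile(r"^<\?xml[^>]*\?>\s*")


def latin1_xml_to_utf8(raw: bytes) -> bytes:
    """Step 8: re-encode latin1 XML as UTF-8 and declare the new encoding."""
    # Decoding as latin-1 then encoding as UTF-8 mirrors what PHP's
    # utf8_encode() does (ISO-8859-1 -> UTF-8).
    text = raw.decode("latin-1")
    # Drop any old declaration so the encoding is not declared twice.
    body = XML_DECL.sub("", text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + body).encode("utf-8")
```

Usage would be something like running convert_dump_charset() over the dump text before the reimport, then looping over each stored document with latin1_xml_to_utf8() and saving the result back.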

UTF-8
We're doing this because pretty much everything is UTF-8 based now, and using latin1/ISO-8859-1 is just a hangover from the website having been in production since the late 1990s.

HTML5
Well, all the cool kids are doing it, right? ;) Certainly all the major dev work is going that way.