Wednesday, February 01, 2012

Fun in HTML5 and UTF-8 land

Spent the last couple of days looking at converting http://nrich.maths.org to HTML5 and UTF-8.
For a bit of background: the site stores content internally as XML in a MySQL DB, which then gets put through PHP's XML parser and converted to XHTML. Editing is done via CKEditor -> PHP Tidy -> XML parser -> DB.


So far the process looks to be fairly straightforward:

  1. Change the XHTML header to an HTML5 header - there seem to be no major compatibility issues between XHTML and HTML5 (see http://coding.smashingmagazine.com/2009/07/29/misunderstanding-markup-xhtml-2-comic-strip/ !)
  2. Make sure all the XML parsing done internally is using the right character set.
  3. Remove all the bits of code designed to stop UTF-8 characters getting into the DB!
  4. Change the PHP Tidy config to output UTF-8.
  5. Dump the DB out of MySQL and change all the DEFAULT CHARSET=latin1 -> DEFAULT CHARSET=utf8
  6. Do an "ALTER DATABASE nrichdb charset = 'utf8';" on the main database
  7. Reimport the DB
  8. Pull all the content out of the DB, put it through utf8_encode(), add the UTF-8 encoding to the XML declaration (the <?xml ... ?> line) and then resave.
  9. ...
  10. $PROFIT$ well OK maybe not...
So far so good...
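
Step 8 could be sketched roughly as below. This is only an illustration, not the code used on the site: the function name nrich_xml_to_utf8() is made up, iconv() stands in for utf8_encode() (for ISO-8859-1 input the result should be the same), and it assumes each stored document either starts with an XML declaration or has none at all. Anything else in the declaration, such as standalone, is dropped.

```php
<?php
// Hypothetical helper: re-encode one stored XML document from latin1 to UTF-8
// and make its XML declaration say so.
function nrich_xml_to_utf8($xml)
{
    // Valid UTF-8 (which includes plain ASCII) is left alone, so running the
    // conversion twice does not double-encode content.
    if (preg_match('//u', $xml) !== 1) {
        // Stand-in for utf8_encode(): treat the bytes as ISO-8859-1.
        $xml = iconv('ISO-8859-1', 'UTF-8', $xml);
    }

    // Matches a leading declaration and captures the version number.
    $decl = '/^<\?xml\s+version=(["\'])([^"\']*)\1[^?]*\?>/';
    if (preg_match($decl, $xml)) {
        // Rewrite the existing declaration with encoding="UTF-8".
        return preg_replace($decl, '<?xml version="$2" encoding="UTF-8"?>', $xml, 1);
    }

    // No declaration at all: add one.
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $xml;
}
```

The "already valid UTF-8" check is there because a half-finished migration is easy to re-run by accident, and double-encoded text is much harder to repair than text that was simply skipped.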

UTF-8
We're doing this because pretty much everything is UTF-8 based now, and using latin1/ISO-8859 is just a hangover from the website having been in production since the late 1990s.

HTML5
Well, all the cool kids are doing it, right? ;) Certainly all the major dev work is going that way.






Friday, August 20, 2010

Importing old html content into Drupal 6

Importing old content into Drupal is a fairly painless process once you know:

  • How to create Nodes
  • How to pull the old pages into php

After spending far too much of my own time trying to get to grips with the node creation process, I found http://acquia.com/blog/migrating-drupal-way-part-i-creating-node which answers most of the basic node creation questions.

So I created my function to create the nodes from my old pages.

It's probably worth describing the format of the old pages. I had 200 HTML 'index' pages in a database, each of which linked to somewhere between 0 and 20 HTML, image or other documents, grouped as Talk, Project or Additional Material files. That makes a total of just under 1000 HTML pages.

I wanted to create the index files with links to the files, and backlinks from the files to the index.

<?php
// Bootstrap Drupal 6 so the node API is available to this standalone script.
require 'includes/bootstrap.inc';
require 'modules/node/node.pages.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// $tlink, $plink and $alink are arrays of link title => url;
// $unixdate is the creation timestamp and defaults to now.
function mcreate($title, $body, $path, $tagmap=false, $tlink=false, $plink=false, $alink=false, $type="resources", $unixdate=null) {
    $node = new stdClass();
    $node->title = $title;
    $node->body = $body;
    $node->type = $type;
    $node->path = $path;
    $node->changed = time();
    $node->status = 1;
    $node->promote = 1;
    $node->sticky = 0;
    $node->format = 3; // unfiltered HTML
    $node->uid = 1; // UID of content owner
    node_object_prepare($node);
    // topic links
    if ($tlink) {
        $i = 0;
        foreach ($tlink as $linktitle => $url) {
            $node->field_tlink[$i]['title'] = $linktitle;
            $node->field_tlink[$i]['url'] = $url;
            $node->field_tlink[$i]['attributes'] = array();
            $i++;
        }
    }
    // project links
    if ($plink) {
        $i = 0;
        foreach ($plink as $linktitle => $url) {
            $node->field_plink[$i]['title'] = $linktitle;
            $node->field_plink[$i]['url'] = $url;
            $node->field_plink[$i]['attributes'] = array();
            $i++;
        }
    }
    // additional material links
    if ($alink) {
        $i = 0;
        foreach ($alink as $linktitle => $url) {
            $node->field_alink[$i]['title'] = $linktitle;
            $node->field_alink[$i]['url'] = $url;
            $node->field_alink[$i]['attributes'] = array();
            $i++;
        }
    }
    // taxonomy term IDs, e.g. as returned by tagmap()
    if ($tagmap) {
        $node->taxonomy = $tagmap;
    }
    $node->created = ($unixdate === null) ? time() : $unixdate;
    node_save($node);
    return $node;
}

The field_tlink, field_plink and field_alink fields are Link CCK fields (http://drupal.org/project/link) which link to the talk/project/additional material files.

Taxonomy gotcha - unless you add the tid and a parent value to the term_hierarchy database table, the taxonomy terms will not show up.

mysql> insert into term_hierarchy select tid,0 from term_data;

did the job for me.

The main body of the script that processes the files is fairly normal PHP stuff; I'll publish it here for reference, and the include with the functions here.

Here is how the script basically works:

  • Get the index files, links and taxonomy from the DB
    • Get the HTML using curl
    • Clean up and change the img links
  • Create new nodes for each link (if it's HTML)
  • Create the index node with links to the files
  • Edit the linked files' 'linked_from' with the nid of the index node, and repeat...
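
As an illustration of the "change the img links" step, here is a minimal sketch. It is not the code from the actual script: the function name rewrite_img_links() and the idea of prefixing relative paths with a new base directory are assumptions, and a regular expression is only good enough for reasonably tidy HTML.

```php
<?php
// Hypothetical helper: point relative <img src="..."> links at a new base path.
function rewrite_img_links($html, $newbase)
{
    return preg_replace_callback(
        // 1: everything up to the opening quote, 2: the quote, 3: the src value
        '/(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2/i',
        function ($m) use ($newbase) {
            $src = $m[3];
            // Leave absolute URLs (with a scheme, or protocol-relative) alone.
            if (preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $src)) {
                return $m[0];
            }
            // Prefix everything else with the new base directory.
            return $m[1] . $m[2] . rtrim($newbase, '/') . '/' . ltrim($src, '/') . $m[2];
        },
        $html
    );
}
```

Called as rewrite_img_links($page, '/sites/default/files/plus'), with that path purely as an example, it would leave http:// images untouched and move everything else under the new directory.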

The functions explained:

function setlinkfrom($subnode,$linkfrom)
Send it an array of nodes (the linked-to pages) and the nid of the index node; it updates the nodes with the backlink and saves them.

function fetchpluspage($domain,$url)
Uses PHP curl to fetch http:// pages.

function cleanuppage($page)
Uses PHP Tidy to clean up dodgy HTML.

function tagmap($tags)
Send this function an array of taxonomy terms and it will insert them into taxonomy vocab 1 if they aren't there already, and return their IDs for insertion into the node.

function getguts($data)
Horrible function to pull the relevant bits out of HTML pages.

Thursday, April 29, 2010

Last Post

I've now moved to blogger (not that I really wanted to)

Thursday, March 11, 2010

finding additional line returns at the end of php files

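Extra line returns at the end of PHP files can cause problems: they break HTTP headers and XML output, and interfere with output compression.
The Unix command

grep -rP "\?>\n\n+$" *

run in the same directory as the PHP should locate the offending files. If it finds nothing because your grep matches strictly one line at a time, GNU grep's -z option treats each file as a single record:

grep -rlPz '\?>\n\n+\z' *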
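
The same check can be sketched in PHP itself, which avoids depending on how a particular grep handles newlines. The function names here are made up for illustration, and the pattern assumes Unix line endings, as the grep pattern does.

```php
<?php
// Hypothetical helper: true if the source ends with a closing PHP tag followed
// by two or more newlines (PHP swallows only the first one).
function has_trailing_blank_lines($source)
{
    return preg_match('/\?>\n\n+\z/', $source) === 1;
}

// Hypothetical helper: list the .php files under $dir that fail the check.
function find_offending_files($dir)
{
    $hits = array();
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile() && $file->getExtension() === 'php'
            && has_trailing_blank_lines(file_get_contents($file->getPathname()))) {
            $hits[] = $file->getPathname();
        }
    }
    sort($hits);
    return $hits;
}
```

Something like print_r(find_offending_files('.')) run from the top of the code tree would then list the candidates.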

Wednesday, November 25, 2009

Slow X11 startup on new Snow Leopard install

After installing XEmacs from macosx ports I found it was taking several minutes to start up. Looking closer, I realised that X11 was taking the time, not XEmacs itself.
In /var/log/system.log there were entries like:

Nov 25 14:01:29 lapc-br1-2 org.x.startx[366]: xauth: timeout in locking authority file /Users/ogs22/.serverauth.366
Nov 25 14:01:49 lapc-br1-2 org.x.startx[366]: xauth: timeout in locking authority file /Users/ogs22/.Xauthority

Checking the ownership and permissions on my home directory showed:

drwxr-xr-x 210 temp 502 7140 25 Nov 14:01 ogs22

temp being the user I created to transfer my files/settings using Migration Assistant, so a quick

sudo chown ogs22:ogs22 ogs22

and X11 starts in about 2 seconds now.

Thursday, November 12, 2009

Shakespeare Letter Frequency

17.8851% : space
9.0881% : e
6.7498% : t
6.3808% : o
5.9213% : a
5.1092% : i
5.0632% : s
4.9656% : n
4.8777% : h
4.7921% : r
3.4463% : l
3.0274% : d
2.6422% : u
2.2643% : m
1.9180% : y
1.8387% : w
1.7362% : c
1.7181% : ,
1.6475% : f
1.6063% : .
1.3916% : g
1.2293% : b
1.1645% : p
0.7681% : v
0.7281% : k
0.6476% : '
0.3584% : ;
0.2184% : ?
0.1840% : !
0.1635% : -
0.1042% : x
0.0945% : j
0.0746% : q
0.0432% : [
0.0430% : ]
0.0377% : :
0.0339% : z
0.0094% : "
0.0050% : 1
0.0034% : )
0.0034% : (
0.0026% : 2
0.0021% : 3
0.0018% : 4
0.0016% : 5
0.0014% : _
0.0012% : 6
0.0011% : 9
0.0009% : 0
0.0008% : 7
0.0007% : |
0.0007% : 8
0.0006% : <
0.0004% : &
0.0000% : }
0.0000% : `
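
The post doesn't say how the table was produced, so the following is only a guess at one way to build something like it. The function name is made up, and two things are assumptions read off the table: that case was folded (only lowercase letters appear) and that line breaks were not counted.

```php
<?php
// Hypothetical helper: percentage of each character in $text, most frequent first.
function char_frequency($text)
{
    $text = strtolower($text);                          // assumption: case is folded
    $text = str_replace(array("\r", "\n"), '', $text);  // assumption: line breaks are ignored

    $total = strlen($text);
    $freq = array();
    if ($total === 0) {
        return $freq;
    }
    // count_chars() mode 1 returns byte value => number of occurrences.
    foreach (count_chars($text, 1) as $byte => $count) {
        $freq[chr($byte)] = 100 * $count / $total;
    }
    arsort($freq);
    return $freq;
}
```

Each entry could then be printed with something like printf("%.4f%% : %s\n", $pct, $char) to match the layout above.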

Friday, June 26, 2009

Entities, ext/xml and libxml 2.7 - CDATA Zone

This just cost me an afternoon - "portupgrade libxml2 ; portupgrade -f php5" saved me though :)

So if you find ampersands and other entities disappearing in your PHP scripts, it's one to look out for.
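
A quick way to see whether an install is affected might be a self-check like this one. It is a sketch rather than anything from the original post: the function name is made up, and it simply feeds a small document through ext/xml and checks that the decoded entities reach the character data handler.

```php
<?php
// Hypothetical self-check: does ext/xml pass decoded entities to the handler?
function xml_entities_survive()
{
    $seen = '';
    $parser = xml_parser_create('UTF-8');
    // The handler may be called several times for one text node, so append.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$seen) {
        $seen .= $data;
    });
    $ok = xml_parse($parser, '<doc>fish &amp; chips &lt;1&gt;</doc>', true);

    // On a healthy install the handler sees the decoded characters.
    return $ok === 1 && $seen === 'fish & chips <1>';
}
```

If this returns false and the ampersand or angle brackets are missing, the libxml2 and PHP combination is worth a look.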