Friday, August 20, 2010

Importing old HTML content into Drupal 6

Importing old content into Drupal is a fairly painless process once you know:

  • How to create Nodes
  • How to pull the old pages into PHP

After spending far too much of my own time trying to get to grips with the node creation process, I found http://acquia.com/blog/migrating-drupal-way-part-i-creating-node, which answers most of the basic node creation questions.

So I wrote my own function to create the nodes from my old pages.

It's probably worth describing the format of the old pages. I had 200 HTML 'index' pages in a database, each linking to somewhere between 0 and 20 HTML, image or other documents, grouped as Talk, Project or Additional Material files: a total of just under 1000 HTML pages.

I wanted to create the index pages with links to the files, plus backlinks from the files to the index.

<?php
require 'includes/bootstrap.inc';
require 'modules/node/node.pages.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Create and save one node. $tlink, $plink and $alink are title => URL arrays.
// $unixdate (creation timestamp) was used but never defined in the original
// listing; here it is an optional last argument that falls back to time().
function mcreate($title, $body, $path, $tagmap = false, $tlink = false, $plink = false, $alink = false, $type = "resources", $unixdate = false) {
  $node = new stdClass();
  $node->title = $title;
  $node->body = $body;
  $node->type = $type;
  $node->path = $path;
  $node->changed = time();
  $node->status = 1;
  $node->promote = 1;
  $node->sticky = 0;
  $node->format = 3; // unfiltered HTML
  $node->uid = 1; // UID of content owner
  node_object_prepare($node);

  // topic links
  if ($tlink) {
    $i = 0;
    foreach ($tlink as $link_title => $url) {
      $node->field_tlink[$i]['title'] = $link_title;
      $node->field_tlink[$i]['url'] = $url;
      $node->field_tlink[$i]['attributes'] = array();
      $i++;
    }
  }

  // project links
  if ($plink) {
    $i = 0;
    foreach ($plink as $link_title => $url) {
      $node->field_plink[$i]['title'] = $link_title;
      $node->field_plink[$i]['url'] = $url;
      $node->field_plink[$i]['attributes'] = array();
      $i++;
    }
  }

  // additional material links
  if ($alink) {
    $i = 0;
    foreach ($alink as $link_title => $url) {
      $node->field_alink[$i]['title'] = $link_title;
      $node->field_alink[$i]['url'] = $url;
      $node->field_alink[$i]['attributes'] = array();
      $i++;
    }
  }

  if ($tagmap) {
    $node->taxonomy = $tagmap;
  }
  $node->created = $unixdate ? $unixdate : time();
  node_save($node);
  return $node;
}

field_tlink, field_plink and field_alink are Link CCK fields (http://drupal.org/project/link) that point to the talk, project and additional material files.
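
The three link fields are all filled the same way, so the loop can be pulled out into one helper. This is only a sketch: build_link_items() is a made-up name that is not part of the original script, and the example titles and paths are invented.

```php
<?php
// Hypothetical helper (not in the original script): turn a title => URL map
// into the list of items that a Link CCK field holds on the node object.
function build_link_items($links) {
  $items = array();
  if (!$links) {
    return $items; // false or an empty array means "no links"
  }
  foreach ($links as $link_title => $url) {
    $items[] = array(
      'title' => $link_title,
      'url' => $url,
      'attributes' => array(),
    );
  }
  return $items;
}

// Example data; the titles and paths are invented for illustration.
$talks = array(
  'Slides' => 'talks/slides.html',
  'Notes' => 'talks/notes.html',
);
$items = build_link_items($talks);
```

Inside mcreate() each loop could then shrink to a single assignment such as $node->field_tlink = build_link_items($tlink);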

Taxonomy gotcha: unless you add each term's tid and a parent value (0 for a top-level term) to the term_hierarchy database table, the taxonomy terms will not show up.

mysql> insert into term_hierarchy select tid,0 from term_data;

did the job for me

The main body of the script that processes the files is fairly ordinary PHP. I'll publish it here for reference, and the include with the functions here.

Here is how the script basically works:

  • Get the index files' links and taxonomy from the DB
    • Fetch the HTML using curl
    • Clean up and change the img links

  • Create new nodes for each link (if it's HTML)
  • Create the index node with links to the files
  • Update each linked file's 'linked_from' with the nid of the index node, and repeat.
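
The "change the img links" step might look roughly like the sketch below. The function name rewrite_img_links() and the /files/old base path are assumptions for illustration, not taken from the original script, and the anonymous function needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: prefix relative <img src="..."> values with a new base
// path so the images still resolve once the page body lives in a Drupal node.
function rewrite_img_links($html, $new_base) {
  return preg_replace_callback(
    '/(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2/i',
    function ($m) use ($new_base) {
      $src = $m[3];
      // Leave absolute and protocol-relative URLs untouched.
      if (preg_match('#^(?:[a-z][a-z0-9+.-]*:)?//#i', $src)) {
        return $m[0];
      }
      // Note: root-relative paths ("/images/x.png") are prefixed as well.
      return $m[1] . $m[2] . rtrim($new_base, '/') . '/' . ltrim($src, '/') . $m[2];
    },
    $html
  );
}

// Example with an invented page fragment and base path.
$fixed = rewrite_img_links('<p><img src="images/pic.png" alt="x"></p>', '/files/old');
```

A regular expression is good enough for pages that tidy has already normalised; for messier markup a DOM parser would be the safer choice.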

The functions explained:

function setlinkfrom($subnode,$linkfrom)
Takes an array of nodes (the linked-to pages) and the nid of the index node, then updates each node with the backlink and saves it.

function fetchpluspage($domain,$url)
Uses PHP's curl extension to fetch http:// pages.
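
The real fetchpluspage() is not shown in this post, so the following is only a guess at its general shape. The names build_page_url() and fetch_page_sketch(), the 30 second timeout and the "only accept HTTP 200" rule are all assumptions.

```php
<?php
// Hypothetical helper: join a domain and a path into one http:// URL.
function build_page_url($domain, $url) {
  return 'http://' . trim($domain, '/') . '/' . ltrim($url, '/');
}

// Sketch of a curl-based fetch; returns the page body, or false on failure.
function fetch_page_sketch($domain, $url) {
  $ch = curl_init(build_page_url($domain, $url));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
  $page = curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);
  // Treat transport errors and non-200 responses as "no page".
  return ($page !== false && $status == 200) ? $page : false;
}
```

Returning false for anything other than a clean 200 lets the calling loop skip dead links instead of creating empty nodes.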

function cleanuppage($page)
Uses PHP tidy to clean up dodgy HTML.
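
A minimal sketch of such a clean-up step, assuming the tidy extension is available. The name cleanup_page_sketch() and the choice of tidy options are mine, not the original script's; when tidy is missing the markup is passed through unchanged.

```php
<?php
// Sketch: repair broken markup with tidy, keeping only the body content.
function cleanup_page_sketch($page) {
  if (!function_exists('tidy_repair_string')) {
    return $page; // tidy extension not loaded: return the markup untouched
  }
  $config = array(
    'output-xhtml' => true,   // emit well-formed XHTML
    'show-body-only' => true, // drop the doctype, <html> and <head> wrappers
    'wrap' => 0,              // do not re-wrap long lines
  );
  $clean = tidy_repair_string($page, $config, 'utf8');
  return $clean === false ? $page : $clean;
}

// Example: an unclosed paragraph and bold tag.
$clean = cleanup_page_sketch('<p>unclosed <b>bold');
```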

function tagmap($tags)
Send this function an array of taxonomy terms and it will insert them into taxonomy vocabulary 1 if they aren't there already, and return their IDs for insertion into the node.
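
The lookup-or-insert logic can be shown without a database. In this sketch the $known array stands in for vocabulary 1 (term name => tid), and new tids are simply the next free number; in Drupal 6 the real function would presumably go through something like taxonomy_get_term_by_name() and taxonomy_save_term() instead. tagmap_sketch() is a made-up name.

```php
<?php
// Sketch: map term names to tids, adding unknown names to the vocabulary.
// $known is an in-memory stand-in for the vocabulary: term name => tid.
function tagmap_sketch($tags, &$known) {
  $tids = array();
  foreach ($tags as $tag) {
    $name = trim($tag);
    if ($name === '') {
      continue; // ignore empty tags
    }
    if (!isset($known[$name])) {
      // "Insert" the term: take the next unused tid.
      $known[$name] = $known ? max($known) + 1 : 1;
    }
    $tids[$known[$name]] = $known[$name]; // keyed by tid, so duplicates collapse
  }
  return array_values($tids);
}

// Example with invented terms.
$known = array('physics' => 1, 'maths' => 2);
$tids = tagmap_sketch(array('maths', 'biology', 'maths'), $known);
```

Note that the match is case-sensitive here, so 'Maths' and 'maths' would become two separate terms.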

function getguts($data)
Horrible function to pull the relevant bits out of HTML pages.
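
Which "relevant bits" the real getguts() pulls out is not shown here. Since mcreate() needs a title and a body, one plausible reading is sketched below; getguts_sketch() and its return format are assumptions.

```php
<?php
// Sketch: extract the <title> text and the contents of <body> from a page.
// Falls back to an empty title and the whole input when a tag is missing.
function getguts_sketch($data) {
  $guts = array('title' => '', 'body' => $data);
  if (preg_match('#<title[^>]*>(.*?)</title>#is', $data, $m)) {
    $guts['title'] = trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
  }
  if (preg_match('#<body[^>]*>(.*)</body>#is', $data, $m)) {
    $guts['body'] = trim($m[1]);
  }
  return $guts;
}

// Example with an invented page.
$guts = getguts_sketch("<html><head><title> Old &amp; New </title></head><body>\n<p>Hi</p>\n</body></html>");
```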