Use HTML Tidy on a string

2004-06-02 00:00

This function makes HTML contained in a string standards-compliant. You will need a copy of HTML Tidy installed on your server. Your PHP configuration must also allow you to execute external programs, and you will need a directory in which PHP can create temporary files.
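Before relying on the function, it may help to verify those requirements up front. The following is a minimal sketch, not part of the original code: checkTidyPrerequisites() and its messages are hypothetical names, and you pass in the name of whichever PHP function you use to launch Tidy.

```php
<?php

/**
 * Reports which prerequisites for shelling out to Tidy appear to be missing.
 * Hypothetical helper; the function name and messages are illustrative only.
 *
 * @param string $temp_dir Directory intended for temporary files
 * @param string $launcher Name of the PHP function used to run Tidy, e.g. 'passthru'
 * @return array List of problems found (empty when nothing looks wrong)
 */
function checkTidyPrerequisites($temp_dir, $launcher)
{
    $problems = array();

    // disable_functions is a comma-separated list set in php.ini.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (!function_exists($launcher) || in_array($launcher, $disabled)) {
        $problems[] = $launcher . '() is not available';
    }

    if (!is_dir($temp_dir)) {
        $problems[] = 'temporary directory does not exist';
    } elseif (!is_writable($temp_dir)) {
        $problems[] = 'temporary directory is not writable';
    }

    return $problems;
}
```

An empty array means both checks passed. It does not prove that the Tidy executable itself is installed or runnable.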

I wrote this function for a project that involved importing HTML content into one monstrous CMS.

An example HTML Tidy configuration file that outputs indented XHTML is below the source code.

<?php

// Global variables. Feel free to rename them both here and in the function
// to better suit your framework.
global $TEMP_DIR, $TIDY_CMD;


/**
 * Directory with write access
 * @var string $TEMP_DIR
 */
$TEMP_DIR = 'c:/temp/';


/**
 * Path to HTML Tidy executable.
 * Note that the "default.tidy" config file should be in the same directory
 * as the executable, or be given with a full path.
 * You can also write the configuration options directly into the command.
 *
 * @var string $TIDY_CMD
 */
$TIDY_CMD = 'c:/tools/tidy/tidy.exe -config default.tidy';

/**
 * Uses HTML Tidy on input from string
 *
 * @function tidyString
 * @return string Tidied-up HTML
 * @param string $string Original HTML
 * @param boolean $strip_body (optional) Should the function return
 *  just the body part (TRUE, default), or full HTML (FALSE)
 * @uses $TEMP_DIR
 * @uses $TIDY_CMD
 */
function tidyString($string, $strip_body = TRUE)
{
    global $TEMP_DIR, $TIDY_CMD;
    
    $tmp_file_name = $TEMP_DIR.md5(uniqid(rand(), true));

    // Save string to file
    $fp = fopen($tmp_file_name.'.in', 'wb');
    fwrite($fp, $string);
    fclose($fp);
    
    // Execute Tidy on file
    passthru($TIDY_CMD.' -o '.escapeshellarg($tmp_file_name.'.out').' '.escapeshellarg($tmp_file_name.'.in'));
    
    // Read Tidy's output file back into the string
    $string = file_get_contents($tmp_file_name.'.out');
    
    // Delete temp files
    unlink($tmp_file_name.'.in');
    unlink($tmp_file_name.'.out');
    
    // If needed strip body tags:
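    if ($strip_body && preg_match('/<body[^>]*>(.*)<\/body>/s', $string, $temp))
    {
        // The pattern tolerates attributes on the opening <body> tag;
        // if no body is found, the full output is returned unchanged.
        $string = $temp[1];
    }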

    // Return clean output:
    return $string;
}

?>
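If you would rather avoid temporary files, one alternative (not what the function above does) is to pipe the string through the command's standard input and read the result from standard output. Tidy reads from standard input when no input file is given and writes to standard output when no -o option is given. The following is a minimal sketch using proc_open(); pipeThroughCommand() is a hypothetical name, and the sketch assumes the command reads all of its input before it starts writing output, as Tidy does.

```php
<?php

/**
 * Pipes a string through an external command and returns the command's
 * standard output. Hypothetical helper, not part of the original function.
 *
 * @param string $command Full command line to execute
 * @param string $input   Data to send to the command's standard input
 * @return string|false   Captured output, or FALSE if the process failed to start
 */
function pipeThroughCommand($command, $input)
{
    $descriptors = array(
        0 => array('pipe', 'r'), // the child reads its stdin from us
        1 => array('pipe', 'w'), // the child writes its stdout to us
    );

    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        return false;
    }

    // Send the whole input, then close stdin so the child sees end-of-file.
    fwrite($pipes[0], $input);
    fclose($pipes[0]);

    // Read everything the child printed, then clean up.
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);

    return $output;
}
```

With the globals defined above, a call might look like `$html = pipeThroughCommand($TIDY_CMD, $string);`. Standard error is not captured here, so Tidy's warnings go wherever PHP's own error output goes.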

Example default.tidy config file I used to force indented XHTML output (you can find more configuration options in the HTML Tidy Quick Reference):

indent-spaces: 1
wrap: 700
doctype: transitional
indent: yes
output-xml: yes
output-xhtml: yes
numeric-entities: yes
wrap-sections: no
wrap-asp: no
wrap-jste: no
wrap-php: no
assume-xml-procins: yes
tidy-mark: no
force-output: yes
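As mentioned in the comment on $TIDY_CMD, the options can also be written directly into the command instead of a config file. Tidy accepts any configuration option on the command line in the form `--name value`. The following is a minimal sketch that builds such a command from an array; buildTidyCommand() is a hypothetical helper, and the option names are assumed to come from trusted code because only the values are escaped.

```php
<?php

/**
 * Builds a Tidy command line from an array of configuration options.
 * Hypothetical helper, not part of the original code.
 *
 * @param string $tidy_path Path to the Tidy executable
 * @param array  $options   Option name => value, both as strings
 * @return string Command line suitable for $TIDY_CMD
 */
function buildTidyCommand($tidy_path, array $options)
{
    $parts = array(escapeshellarg($tidy_path));

    foreach ($options as $name => $value) {
        // Each config-file option "name: value" becomes "--name value".
        $parts[] = '--' . $name . ' ' . escapeshellarg($value);
    }

    return implode(' ', $parts);
}

// A subset of the options from the config file above.
$TIDY_CMD = buildTidyCommand('c:/tools/tidy/tidy.exe', array(
    'indent'        => 'yes',
    'indent-spaces' => '1',
    'wrap'          => '700',
    'output-xhtml'  => 'yes',
    'tidy-mark'     => 'no',
    'force-output'  => 'yes',
));
```

The quoting that escapeshellarg() adds differs between Windows and Unix-like systems, so the exact string depends on the platform.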