Releasing an HTML5 Parser and Writer for PHP

Parsing and writing HTML5 with PHP doesn't just work. The tools built into PHP are designed for HTML 4 and XML. HTML5 is different. For example, take the markup fragment:

<audio src="foo.ogg">
    <track kind="captions" src="foo.en.vtt" srclang="en" label="English">
    <track kind="captions" src="" srclang="sv" label="Svenska">
</audio>

PHP will see this as:

<audio src="foo.ogg">
    <track kind="captions" src="foo.en.vtt" srclang="en" label="English">
        <track kind="captions" src="" srclang="sv" label="Svenska">
        </track>
    </track>
</audio>

In addition, you'll get PHP warnings telling you the audio and track tags are invalid. These elements are new in HTML5, so the HTML 4 parser doesn't know about them and doesn't handle them properly: it does not know that track is a void element, so it nests the second track inside the first.
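
To see this behavior for yourself, here is a minimal sketch that uses only PHP's built-in DOM extension. It collects libxml's messages with libxml_use_internal_errors() instead of letting them surface as PHP warnings, then reports which element ended up as the parent of the second track. The exact messages and the resulting nesting depend on the libxml2 version your PHP was built against, so no output is shown here.

```php
<?php
// Collect libxml's parse messages instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The same fragment as above, written on one line.
$markup = '<audio src="foo.ogg">'
    . '<track kind="captions" src="foo.en.vtt" srclang="en" label="English">'
    . '<track kind="captions" src="" srclang="sv" label="Svenska">'
    . '</audio>';

// Parse with the native (libxml-backed) HTML parser.
$dom = new DOMDocument();
$dom->loadHTML($markup);

// Print whatever the parser complained about, if anything.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

// Report where the second <track> landed in the tree.
// A parent of "track" means it was nested; "audio" means it was a sibling.
$tracks = $dom->getElementsByTagName('track');
echo 'Parent of second track: ', $tracks->item(1)->parentNode->nodeName, "\n";
```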

What's a developer to do? How about writing a parser and serializer (writer) that works for HTML5? Matt Butcher and I spent half of the past year creating html5-php, and today we are ready with the first stable release.

Why Write a Library Instead of Fixing PHP?

The HTML 4 and XML parser and serializer in PHP are supplied by libxml. While libxml has many strengths and features, it has not added HTML5 support. Adding that support to libxml wasn't ideal because we'd have to write it in C (which didn't feel desirable), and the code wouldn't be accessible to the vast majority of PHP developers. So we opted to create it as a PHP library.

Installing html5-php

The easiest way to include the library is via Composer. Add a dependency to your project's composer.json file like:

    "require": {
        "masterminds/html5": "1.*"
    }

Then use composer install or composer update to install and update the codebase.

Because this is a PSR-0 compliant library, any PSR-0 autoloader will work. You can also download the library and include it in your codebase directly.
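
If you go the download route without Composer, a hand-rolled PSR-0 autoloader might look roughly like the sketch below. The function name psr0RelativePath() and the lib/html5-php/src directory are made up for illustration, so adjust the path to wherever you unpacked the library. The mapping follows the PSR-0 rules: namespace separators become directory separators, and underscores in the class name itself also become directory separators.

```php
<?php
// Hypothetical helper: turn a class name into a PSR-0 relative file path.
function psr0RelativePath($class)
{
    $class = ltrim($class, '\\');
    $dir = '';

    // Namespace part: backslashes map to directory separators.
    $pos = strrpos($class, '\\');
    if ($pos !== false) {
        $namespace = substr($class, 0, $pos);
        $dir = str_replace('\\', DIRECTORY_SEPARATOR, $namespace) . DIRECTORY_SEPARATOR;
        $class = substr($class, $pos + 1);
    }

    // Class part: underscores map to directory separators.
    return $dir . str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php';
}

// Assumed location of the downloaded source; change this to match your setup.
$libraryRoot = __DIR__ . '/lib/html5-php/src';

spl_autoload_register(function ($class) use ($libraryRoot) {
    $file = $libraryRoot . DIRECTORY_SEPARATOR . psr0RelativePath($class);
    // Only load files that exist so other autoloaders still get a chance.
    if (is_file($file)) {
        require $file;
    }
});
```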

Since we aim to follow semantic versioning, compatibility with future releases should be handled cleanly.

Using html5-php

Let's start with a simple example.

// Assuming you installed from Composer:
require "vendor/autoload.php";

// An example HTML document:
$html = <<< 'HERE'
    <body id='foo'>
        <h1>Hello World</h1>
        <p>This is a test of the HTML5 parser.</p>
    </body>
HERE;

// Parse the document. $dom is a DOMDocument.
$dom = HTML5::loadHTML($html);

// Render it as HTML5:
print HTML5::saveHTML($dom);

// Or save it to a file:
HTML5::save($dom, 'out.html');

The library has many more features, such as fragment parsing and rendering, and the ability to assemble your own parser from our parts if you want to do something different. You can learn how these work from the API documentation and general user documentation.

Compatibility With Existing Tools

There are a lot of existing tools for working with HTML in the PHP community. The parser returns DOMDocument and DOMDocumentFragment objects, which are the same classes as those returned by the native PHP parser. The serializer accepts both of these objects as well.

If existing tools already work with these native PHP objects, they should continue to work with those returned from this library.
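
As a rough illustration, here is a small helper built on PHP's native DOMXPath that works on any DOMDocument, whatever produced it. The function name headingTexts() is made up for this example, and the demo builds its document by hand so that it runs without the library installed. In practice you would pass in the document returned by the parser. The query uses local-name() so that it matches whether or not the elements carry a namespace.

```php
<?php
// Hypothetical helper: collect the text of every h1 in a DOMDocument.
function headingTexts(DOMDocument $dom)
{
    $xpath = new DOMXPath($dom);
    $texts = array();

    // local-name() ignores any namespace the elements may be in.
    foreach ($xpath->query('//*[local-name()="h1"]') as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

// Demo document built by hand with the native DOM API.
$dom = new DOMDocument();
$body = $dom->appendChild($dom->createElement('body'));
$body->appendChild($dom->createElement('h1', 'Hello World'));
$body->appendChild($dom->createElement('p', 'This is a test of the HTML5 parser.'));

print_r(headingTexts($dom));
```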