DOMDocument and UTF-8 Problem

By  on  

A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS.  DOMDocument made a difficult problem seem incredibly easy ... but with one side-effect that it took me a while to spot:  UTF-8 characters were being mutated into another set of characters.  I was seeing a bunch of odd characters like "ãç³" and"»ã®é" all over each blog post.

I knew the problem was happening during the DOMDocument parsing and that I need to find a fix quickly.  The solution was just a tiny bit of code:

// Create a DOMDocument instance 
$doc = new DOMDocument();

// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

After setting the character set with mb_convert_encoding, the odd characters vanished and the desired characters were back in place.  Phew!

Recent Features

  • By
    Introducing MooTools Templated

    One major problem with creating UI components with the MooTools JavaScript framework is that there isn't a great way of allowing customization of template and ease of node creation. As of today, there are two ways of creating: new Element Madness The first way to create UI-driven...

  • By
    5 More HTML5 APIs You Didn’t Know Existed

    The HTML5 revolution has provided us some awesome JavaScript and HTML APIs.  Some are APIs we knew we've needed for years, others are cutting edge mobile and desktop helpers.  Regardless of API strength or purpose, anything to help us better do our job is a...

Incredible Demos

  • By
    Google Extension Effect with CSS or jQuery or MooTools JavaScript

    Both of the two great browser vendors, Google and Mozilla, have Extensions pages that utilize simple but classy animation effects to enhance the page. One of the extensions used by Google is a basic margin-top animation to switch between two panes: a graphic pane...

  • By
    Page Visibility API

    One event that's always been lacking within the document is a signal for when the user is looking at a given tab, or another tab. When does the user switch off our site to look at something else? When do they come back?

Discussion

  1. Markus

    instead of converting the input string from your encoding into UTF8 you can also tell the DOMDocument with the 2nd arg in which encoding your string is.

    http://php.net/manual/en/domdocument.construct.php

    This should save you some cpu-cycles and reduce memory consumption.

  2. Whelping

    Actually I was already doing

    $html = new DOMDocument( null, 'UTF-8' );

    and still getting weird characters like  – adding your fix did the trick for me, thanks David!

    • Whelping

      It turns out that specifying the encoding argument when you instantiate a DOMDocument doesn’t encode the contents, just sets the document’s header – see this comment http://php.net/manual/en/domdocument.construct.php#78027. It probably wasn’t necessary to specify UTF-8 anyway, as that’s the default. So something like David’s fix is needed to change any unwanted encodings in the content.

  3. Raja Amer Khan

    Thanks for the tutorial. It really saved time.

Wrap your code in <pre class="{language}"></pre> tags, link to a GitHub gist, JSFiddle fiddle, or CodePen pen to embed!