DOMDocument and UTF-8 Problem

A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy ... but with one side effect that took me a while to spot: UTF-8 characters were being mutated into another set of characters. I was seeing a bunch of odd characters like "ãç³" and "»ã®é" all over each blog post.

I knew the problem was happening during the DOMDocument parsing and that I needed to find a fix quickly.  The solution was just a tiny bit of code:

// Create a DOMDocument instance
$doc = new DOMDocument();

// The fix: convert non-ASCII UTF-8 characters to HTML entities before parsing
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

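loadHTML() treats markup without a declared charset as ISO-8859-1, so multi-byte UTF-8 sequences get misread.  After converting the non-ASCII characters to HTML entities with mb_convert_encoding, the odd characters vanished and the desired characters were back in place.  Phew!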
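
To see the difference side by side, here is a minimal sketch that parses the same fragment with and without the conversion. The sample string and variable names are assumptions for illustration, not taken from the original script, and the exact garbled output of the unconverted load can vary with the PHP and libxml2 build. On PHP 8.2 and later the 'HTML-ENTITIES' argument triggers a deprecation notice, though the call still works.

```php
<?php
// Hypothetical sample input: a UTF-8 HTML fragment containing non-ASCII text.
$content = '<p>Café 日本語</p>';

// Without the conversion: no charset is declared, so the parser may
// misread the multi-byte UTF-8 sequences.
$plain = new DOMDocument();
$plain->loadHTML($content);
$plainText = $plain->getElementsByTagName('p')->item(0)->textContent;

// With the conversion: non-ASCII characters become entities such as
// &eacute; and &#26085;, which are plain ASCII and parse unambiguously.
$fixed = new DOMDocument();
$fixed->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
$fixedText = $fixed->getElementsByTagName('p')->item(0)->textContent;

// Compare each result with the text we expected to get back.
var_dump($plainText === 'Café 日本語');
var_dump($fixedText === 'Café 日本語');
```

Comparing textContent rather than the saveHTML() output keeps the check independent of how the document is serialized afterwards.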

Discussion

  1. Markus

    Instead of converting the input string from your encoding into UTF-8, you can also tell the DOMDocument, via the 2nd argument, which encoding your string is in.

    http://php.net/manual/en/domdocument.construct.php

    This should save you some CPU cycles and reduce memory consumption.

  2. Whelping

    Actually I was already doing

    $html = new DOMDocument( null, 'UTF-8' );

    and still getting weird characters like  – adding your fix did the trick for me, thanks David!

    • Whelping

      It turns out that specifying the encoding argument when you instantiate a DOMDocument doesn’t encode the contents, just sets the document’s header – see this comment http://php.net/manual/en/domdocument.construct.php#78027. It probably wasn’t necessary to specify UTF-8 anyway, as that’s the default. So something like David’s fix is needed to change any unwanted encodings in the content.

  3. Raja Amer Khan

    Thanks for the tutorial. It really saved time.

  4. Dario

    Thank you very much you life saver

  5. Sam

    Thank you very much for this :D
    Caught this bug too, even after instantiating it properly

  6. Rafael

    Awesome, character encoding has always been a pain in the a.

  7. Thank you very much David.
    This helped me a lot, because I had bad characters in WordPress the_content.
    This method removed special characters pasted from a Word document.
    :D

  8. Thanks a lot, David.
    I was about to pull my hair out :D

  9. Robert Andrews

    Whilst this sorts out the worst of the character nasties, I am still seeing instances where an apostrophe appears as a question mark.

    The character on the input end is ’ … it appears as ?

    Sometimes I think what I may be experiencing is the crime of taking articles which originated in Microsoft Word, or maybe even Google Docs – i.e. curly apostrophes. But I’m not sure – there are pieces of text that I’ve written out directly, which I’m not sure could explain this problem.

    It’s frustrating that I can’t iron this out. They appear just fine if I disable my code which uses DOMDocument.

    No code that I can find online (not MS Word character-conversion functions https://stackoverflow.com/questions/1262038/how-to-replace-microsoft-encoded-quotes-in-php and not DOMDocument modifiers) seems to work.

  10. In PHP 8.2, using mb_convert_encoding with 'HTML-ENTITIES' is deprecated. Instead of

    $document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    

    I am now doing this:

    $document->loadHTML(htmlspecialchars_decode(htmlentities($html)));
    

    Has been working fine in my tests
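
For the PHP 8.2 deprecation raised in the last comment, another option is mb_encode_numericentity(), which turns every non-ASCII code point into a numeric entity without using the 'HTML-ENTITIES' pseudo-encoding. The sketch below is an assumption-laden illustration rather than a fix tested on the original site: the helper name loadUtf8Html and the sample string are hypothetical. Unlike htmlentities(), which only rewrites characters that have a named entity, this approach also covers characters such as CJK text and curly quotes.

```php
<?php
// Hypothetical helper (not from the original post): load a UTF-8 HTML
// string into DOMDocument without the deprecated 'HTML-ENTITIES' encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Covers U+0080 through U+10FFFF, so ASCII markup is left untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    $doc = new DOMDocument();
    // Each non-ASCII character becomes a decimal entity such as &#8217;.
    $doc->loadHTML(mb_encode_numericentity($html, $map, 'UTF-8'));

    return $doc;
}

// Sample input with a curly apostrophe, an accented letter, and CJK text.
$doc = loadUtf8Html('<p>It’s a café in 東京</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Check that the parsed text matches the original characters.
var_dump($text === 'It’s a café in 東京');
```

Because the input reaches the parser as pure ASCII, the result does not depend on which charset loadHTML() assumes by default.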
