DOMDocument and UTF-8 Problem

By David Walsh on November 17, 2015

A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy ... but with one side effect that took me a while to spot: UTF-8 characters were being mutated into another set of characters. I was seeing a bunch of odd characters like "ãç³" and "»ã®é" all over each blog post.

I knew the problem was happening during the DOMDocument parsing and that I needed to find a fix quickly. The solution was just a tiny bit of code:

// Create a DOMDocument instance 
$doc = new DOMDocument();

// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

After converting the content's non-ASCII characters to HTML entities with mb_convert_encoding, the odd characters vanished and the desired characters were back in place. Phew!
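
To show the fix in context, here is a minimal sketch that combines it with the HTTP-to-HTTPS image rewrite mentioned above. The sample markup, the example.com URL, and the variable names are assumptions for illustration, not the code from the earlier post. Note that the 'HTML-ENTITIES' target used by the fix is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical sample content; the real input is not shown in the post.
$content = '<p>日本語のテキスト <img src="http://example.com/a.png"></p>';

$doc = new DOMDocument();

// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);

// The fix from the post: non-ASCII characters become HTML entities before parsing.
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
libxml_clear_errors();

// Rewrite every image URL that starts with http:// to https://.
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if (strpos($src, 'http://') === 0) {
        $img->setAttribute('src', 'https://' . substr($src, 7));
    }
}

// Read the paragraph text and the image URL back out of the DOM.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
$src  = $doc->getElementsByTagName('img')->item(0)->getAttribute('src');
```

If the conversion step is doing its job, `$text` should contain the original Japanese characters rather than the mutated ones.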

Discussion

Markus
Instead of converting the input string from your encoding into UTF-8, you can also use the 2nd arg to tell the DOMDocument which encoding your string is in.

http://php.net/manual/en/domdocument.construct.php

This should save you some cpu-cycles and reduce memory consumption.

David Walsh
Awesome, thank you for pointing this out!

Whelping
Actually I was already doing
```
$html = new DOMDocument( null, 'UTF-8' );
```
and still getting weird characters like Â – adding your fix did the trick for me, thanks David!

Whelping
It turns out that specifying the encoding argument when you instantiate a DOMDocument doesn’t encode the contents, just sets the document’s header – see this comment http://php.net/manual/en/domdocument.construct.php#78027. It probably wasn’t necessary to specify UTF-8 anyway, as that’s the default. So something like David’s fix is needed to change any unwanted encodings in the content.
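
Since the constructor argument does not affect parsing, a different route (not the fix from the post) is to declare the charset inside the markup itself, where the HTML parser does read it. A minimal sketch, assuming `$content` is a UTF-8 fragment with no charset declaration of its own; the sample string is made up:

```php
<?php
// Made-up UTF-8 fragment standing in for a blog post body.
$content = '<p>Crème brûlée, 日本語</p>';

// Prepend a charset declaration so the parser treats the bytes as UTF-8.
$declared = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep markup warnings out of the output
$doc->loadHTML($declared);
libxml_clear_errors();

// DOM strings are returned as UTF-8, so this should match the input text.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

The injected meta element ends up in the parsed document, so it will also show up in saveHTML() output unless it is removed first.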

Raja Amer Khan
Thanks for the tutorial. It really saved time.
Dario
Thank you very much you life saver
Sam
Thank you very much for this :D
Caught this bug too, even after instantiating it properly
Rafael
Awesome, character encoding has always been a pain in the a.
Jignesh
Thank you very much David.
This helped me a lot, because I had bad characters in WordPress the_content.
This method removed special characters pasted from word document.
:D
Shaban
Thanks a lot, David.
I was about to pull my hair out :D
Robert Andrews
Whilst this sorts out the worst of the character nasties, I am still seeing instances where an apostrophe appears as a question mark.

The character on the input end is ’ … it appears as ?

Sometimes I think what I may be experiencing is the crime of taking articles which originated in Microsoft Word, or maybe even Google Docs (i.e. curly apostrophes). But I’m not sure, because there are also pieces of text that I’ve written out directly, and that wouldn’t explain the problem there.

It’s frustrating that I can’t iron this out. They appear just fine if I disable my code which uses DOMDocument.

No code that I can find online (not MS Word character-conversion functions https://stackoverflow.com/questions/1262038/how-to-replace-microsoft-encoded-quotes-in-php and not DOMDocument modifiers) seems to work.

Rasso

In PHP 8.2, using mb_convert_encoding with the 'HTML-ENTITIES' encoding is deprecated. Instead of

$document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

I am now doing this:

$document->loadHTML(htmlspecialchars_decode(htmlentities($html)));

Has been working fine in my tests
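
One caveat with the htmlentities() approach: as far as I can tell it only converts characters that have named HTML entities, so text such as Japanese would still reach the parser as raw UTF-8 bytes. A possible alternative for PHP 8.2 and later is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity. The sketch below is an assumption-laden example, not code from the post; `loadUtf8Html` is a hypothetical helper name and the sample string is made up.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML fragment without relying on
// the deprecated 'HTML-ENTITIES' pseudo-encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 to U+10FFFF to a numeric entity
    // such as &#233;, leaving the ASCII markup untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings quietly
    // and restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Made-up sample mixing Japanese, an accented letter, and curly quotes.
$doc = loadUtf8Html('<p>日本語 café ’quoted’</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Because the parser only ever sees ASCII, the result should not depend on how it guesses the input encoding.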
