DOMDocument and UTF-8 Problem
A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy ... but with one side effect that took me a while to spot: UTF-8 characters were being mutated into another set of characters. I was seeing a bunch of odd characters like "ãç³" and "»ã®é" all over each blog post.
I knew the problem was happening during the DOMDocument parsing and that I needed to find a fix quickly. The solution was just a tiny bit of code:
// Create a DOMDocument instance
$doc = new DOMDocument();
// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
After setting the character set with mb_convert_encoding, the odd characters vanished and the desired characters were back in place. Phew!
Instead of converting the input string from your encoding into UTF-8, you can also tell DOMDocument, via the second constructor argument, which encoding your string is in.
http://php.net/manual/en/domdocument.construct.php
This should save you some CPU cycles and reduce memory consumption.
Awesome, thank you for pointing this out!
Actually I was already doing
and still getting weird characters like  – adding your fix did the trick for me, thanks David!
It turns out that specifying the encoding argument when you instantiate a DOMDocument doesn’t encode the contents, just sets the document’s header – see this comment http://php.net/manual/en/domdocument.construct.php#78027. It probably wasn’t necessary to specify UTF-8 anyway, as that’s the default. So something like David’s fix is needed to change any unwanted encodings in the content.
Thanks for the tutorial. It really saved time.
Thank you very much you life saver
Thank you very much for this :D
Caught this bug too, even after instantiating it properly
Awesome, character encoding has always been a pain in the a.
Thank you very much David.
This helped me a lot, because I had bad characters in WordPress the_content.
This method removed special characters pasted from a Word document.
:D
Thanks a lot, David.
I was about to pull my hair out :D
Whilst this sorts out the worst of the character nasties, I am still seeing instances where an apostrophe appears as a question mark.
The character on the input end is ’ … it appears as ?
Sometimes I think what I may be experiencing is the crime of taking articles which originated in Microsoft Word, or maybe even Google Docs – i.e. curly apostrophes. But I’m not sure – there are pieces of text that I’ve written out directly, so I’m not sure that explains this problem.
It’s frustrating that I can’t iron this out. They appear just fine if I disable my code which uses DOMDocument.
No code that I can find online (not MS Word character-conversion functions https://stackoverflow.com/questions/1262038/how-to-replace-microsoft-encoded-quotes-in-php and not DOMDocument modifiers) seems to work.
In PHP 8.2, using mb_convert_encoding with the 'HTML-ENTITIES' encoding is deprecated (the function itself is not). Instead of
I am now doing this:
Has been working fine in my tests
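For anyone landing here on PHP 8.2 or later: one commonly used replacement for the 'HTML-ENTITIES' conversion is mb_encode_numericentity. This is only a sketch under that assumption, not necessarily what the commenter above used. The sample string in $content is hypothetical, and the idea is to turn every non-ASCII character into a numeric entity before DOMDocument parses the markup.

```php
<?php
// Hypothetical sample content; any UTF-8 HTML fragment would do.
$content = '<p>Café ’ 日本語</p>';

// Turn every character above ASCII into a numeric entity (é becomes &#233;),
// so the parser's guess about the input encoding no longer matters.
// Map format: [first code point, last code point, offset, mask].
$encoded = mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Create a DOMDocument instance and load the entity-encoded markup
$doc = new DOMDocument();
$doc->loadHTML($encoded);

// Text read back from the DOM is UTF-8 again
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

The curly apostrophe in the sample is there on purpose, since that is the character an earlier comment reported seeing as a question mark. If it still breaks for you after this step, the damage is probably happening somewhere other than loadHTML, for example when the output is saved or stored.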