DOMDocument and UTF-8 Problem
A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy ... but with one side effect that took me a while to spot: UTF-8 characters were being mutated into another set of characters. I was seeing a bunch of odd characters like "ãç³" and "»ã®é" all over each blog post.
I knew the problem was happening during the DOMDocument parsing and that I needed to find a fix quickly. The solution was just a tiny bit of code:
// Create a DOMDocument instance
$doc = new DOMDocument();

// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
After converting the content to HTML entities with mb_convert_encoding, the odd characters vanished and the desired characters were back in place. Phew!
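Some context on why this works: when the markup declares no charset, loadHTML() falls back to a single-byte encoding (historically ISO-8859-1), so each multi-byte UTF-8 sequence is read as several separate characters. Turning every non-ASCII character into a numeric entity beforehand avoids that guess entirely. One caveat: as of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated. Below is a minimal sketch of an alternative using mb_encode_numericentity, assuming the mbstring and dom extensions are loaded. The helper name loadUtf8Html and the sample markup are hypothetical, made up for illustration.

```php
<?php
/**
 * Hypothetical helper: parse a UTF-8 HTML string with DOMDocument
 * without mangling non-ASCII characters.
 */
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 to U+10FFFF to a numeric entity.
    // ASCII (including <, > and &) is left untouched, so markup survives.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($encoded);

    return $doc;
}

// Sample input (an assumption for this sketch): UTF-8 text, no charset declared.
$doc = loadUtf8Html('<p>日本語のテキスト</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Here $text should hold the original Japanese string rather than a run of accented Latin characters. The original one-liner still works on current PHP versions; this variant just avoids the deprecation notice.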