
Using DOMDocument to Modify HTML with PHP

By David Walsh on November 2, 2015

One of the first things you learn when you want to implement a service worker on a website is that the site must be served over SSL (an https address). Ever since I saw the blinding speed service workers can bring to a website, I've been obsessed with readying my site for SSL. Enforcing SSL with .htaccess was easy; the hard part is updating asset links in blog content. At first regular expressions feel like the quick cure, but anyone with regular expression experience knows that matching URLs is a nightmare and that regex is probably the wrong decision. The right decision is DOMDocument, a native PHP class that lets you work with HTML in a logical, pleasant fashion. You load the HTML into a DOMDocument instance and then use its predictable methods to make things happen.

// Formats post content for SSL
function format_post_content($content = '') {
  $document = new DOMDocument();
  // Ensure UTF-8 is respected by using 'mb_convert_encoding'
  $document->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
  
  $tags = $document->getElementsByTagName('img');
  foreach ($tags as $tag) {
    $tag->setAttribute('src', 
      str_replace('http://davidwalsh.name', 
                  'https://davidwalsh.name', 
                  $tag->getAttribute('src')
      )
    );
  }
  return $document->saveHTML();
}

In my example above, I find all img elements and switch any src that points at http://davidwalsh.name over to https://. I will end up doing the same with iframe src, a href, and a few other rarely used tags. When my modifications are done, I call saveHTML to get the new string. Don't fall into the trap of trying to use regular expressions with HTML: you're in for a future of failure. DOMDocument ships with PHP and will make your code far more maintainable.
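
The same loop extends to the other tags mentioned above. What follows is a minimal sketch, not the code running on this site: the function name upgrade_asset_links, its parameters, and the default tag list are assumptions made for illustration. It also swaps in mb_encode_numericentity for the 'HTML-ENTITIES' conversion, which PHP 8.2 deprecated, and it collects libxml's parse warnings instead of printing them.

```php
<?php
// Hypothetical helper (name, parameters, and default tag list are
// illustrative assumptions): rewrites http:// links for one host to
// https:// across several tag/attribute pairs.
function upgrade_asset_links(
    string $content,
    string $host,
    array $targets = ['img' => 'src', 'iframe' => 'src', 'a' => 'href']
): string {
    // loadHTML rejects empty input, so return early.
    if (trim($content) === '') {
        return $content;
    }

    $document = new DOMDocument();

    // Turn every non-ASCII character into a numeric entity so that
    // UTF-8 text survives the parser's default encoding handling.
    $encoded = mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect libxml parse warnings (e.g. for HTML5 tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $insecure = 'http://' . $host;
    $secure = 'https://' . $host;

    foreach ($targets as $tagName => $attribute) {
        foreach ($document->getElementsByTagName($tagName) as $element) {
            $value = $element->getAttribute($attribute);
            // Only touch values that start with the insecure origin; this is
            // slightly stricter than the str_replace used in the function above.
            if (strpos($value, $insecure) === 0) {
                $element->setAttribute($attribute, $secure . substr($value, strlen($insecure)));
            }
        }
    }

    return $document->saveHTML();
}
```

Called as upgrade_asset_links($content, 'davidwalsh.name'), it returns the serialized document. Keep in mind that loadHTML wraps a fragment in a doctype, html, and body, so what saveHTML returns is a full document rather than the bare fragment.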

Discussion

Manny Fleurmond
So do you know if there is a performance hit with creating an element using this vs creating a string of html?
zakius
The right decision is to skip the domain entirely if the asset isn’t hosted on some subdomain (/path/to/asset), and to skip the protocol if it is (//example.com/path/to/asset).
Jonathan Hollin
David, rather than using str_replace to turn all your (internal) http:// strings into https://, you should replace them with //. That way your links become protocol-agnostic, a more future-proof solution.
abogomolov
Why don’t you use the search-replace function in WP-CLI?
Silvestre
Why not remove the protocol completely?

//davidwalsh.name/ would default to whatever protocol is used in the address bar.
David Walsh
I agree that // would be better but some RSS feed readers use http, others https. I’m asserting complete control.