Using DOMDocument to Modify HTML with PHP


One of the first things you learn when implementing a service worker on a website is that the site requires SSL (an https address).  Ever since I saw the blinding speed service workers can provide a website, I've been obsessed with readying my site for SSL.  Enforcing SSL with .htaccess was easy -- the hard part is updating asset links in blog content.  You start out feeling as though regular expressions will be the quick cure, but anyone who has experience with regular expressions knows that working with URLs is a nightmare and regex is probably the wrong decision.

The right decision is DOMDocument, a native PHP class which allows you to work with HTML in a logical, pleasant fashion.  You start by loading the HTML into a DOMDocument instance and then use its predictable methods to make things happen.

// Formats post content for SSL
function format_post_content($content = '') {
  $document = new DOMDocument();
  // Ensure UTF-8 is respected by using 'mb_convert_encoding'
  // (the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
  $document->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
  
  $tags = $document->getElementsByTagName('img');
  foreach ($tags as $tag) {
    $tag->setAttribute('src', 
      str_replace('http://davidwalsh.name', 
                  'https://davidwalsh.name', 
                  $tag->getAttribute('src')
      )
    );
  }
  return $document->saveHTML();
}

In my example above, I find all img elements and replace their protocol with https://.  I will end up doing the same with iframe src, a href, and a few other rarely used tags.  When my modifications are done, I call saveHTML to get the new string.
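As a rough sketch of how that broader pass might look, here's one way to loop over several tag/attribute pairs in a single function.  The function name upgrade_asset_links, the $host parameter, and the tag map are all hypothetical and not code from this site.  It also assumes two things beyond the example above: that you want only the fragment back (loadHTML wraps fragments in html and body elements, and saveHTML with no argument serializes the whole document), and that mb_encode_numericentity is an acceptable stand-in for the deprecated 'HTML-ENTITIES' conversion.

```php
<?php
// Hypothetical helper: upgrades http:// links for one host to https://
// across several tag/attribute pairs. Names and the map are assumptions.
function upgrade_asset_links(string $content, string $host = 'davidwalsh.name'): string {
  // loadHTML() rejects empty input, so return early
  if (trim($content) === '') {
    return $content;
  }

  // Assumed tag => attribute map; extend it to match your own content
  $targets = ['img' => 'src', 'iframe' => 'src', 'a' => 'href'];

  $document = new DOMDocument();
  // Collect libxml warnings (e.g. unknown HTML5 tags) instead of emitting them
  $previous = libxml_use_internal_errors(true);
  // Encode non-ASCII characters as numeric entities so UTF-8 text survives parsing
  $document->loadHTML(
    mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8')
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $insecure = 'http://' . $host;
  $secure = 'https://' . $host;

  foreach ($targets as $tagName => $attribute) {
    foreach ($document->getElementsByTagName($tagName) as $tag) {
      $value = $tag->getAttribute($attribute);
      // Rewrite only values that point at exactly this host
      if ($value === $insecure || strpos($value, $insecure . '/') === 0) {
        $tag->setAttribute($attribute, $secure . substr($value, strlen($insecure)));
      }
    }
  }

  // Serialize only the children of <body> so no doctype or wrapper is returned
  $body = $document->getElementsByTagName('body')->item(0);
  if ($body === null) {
    return $content;
  }
  $html = '';
  foreach ($body->childNodes as $node) {
    $html .= $document->saveHTML($node);
  }
  return $html;
}
```

Calling upgrade_asset_links($content) with the default host should leave links to other domains alone, since the comparison checks the full origin instead of doing a blind string replace.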

Don't fall into the trap of trying to use regular expressions with HTML -- you're in for a future of failure.  DOMDocument is lightweight and will make your code infinitely more maintainable.

Discussion

  1. Manny Fleurmond

    So do you know if there is a performance hit with creating an element using this vs creating a string of html?

  2. zakius

    The right decision is skipping the domain entirely if the asset isn’t hosted on some subdomain (/path/to/asset), and skipping the protocol if it is (//example.com/path/to/asset).

  3. David, rather than str_replace all your (internal) http:// strings with https:// you should replace them with // – that way your links become protocol-agnostic — a more future-proof solution.

  4. Why don’t you use the search-replace function in WP-CLI?

  5. Silvestre

    Why not remove the protocol completely?

    //davidwalsh.name/ would default to whatever protocol is used in the address bar.

  6. I agree that // would be better but some RSS feed readers use http, others https. I’m asserting complete control.
