Pygments on PHP & WordPress

By Welling Guzman

I've been on a long journey trying to find a great code highlighter. I've used so many of them that I can't even remember them all. These are the ones I can remember right now:

  • SyntaxHighlighter
  • Google Code Prettify
  • highlight.js
  • GeSHi

Right now I'm using highlight.js, but it isn't exactly what I want. What I want is to be able to highlight most "words" or reserved words, such as built-in functions, objects, etc., which this highlighter and most of the others miss. I know it's not an important thing, but it was stuck in my head, until now.

Finally, I've found Pygments, the one that matches what I've been looking for, and it's the same one used by GitHub. The only obstacle right now is that it's a Python-based syntax highlighter and I'm using WordPress, and WordPress is built on PHP.

Installation

But hey, we can get over it. There is a solution: first, we need to get Python installed on our server so we can use Pygments.

We aren't going to go too deep into installation, because there are so many OS flavors out there and it could be slightly different on each one of them.

Python

First of all, check whether you already have Python installed by typing python on your command line.

If it's not installed, take a look at the Python Downloads page and download the installer for your OS.

PIP Installer

According to the pip site, there are two ways to install it:

The first and recommended way is to download get-pip.py and run it on your command line:

python get-pip.py

The second way is to use a package manager, by running one of these two commands. As mentioned before, which one depends on your server OS.

sudo apt-get install python-pip

Or:

sudo yum install python-pip

NOTE: you can use any package manager you prefer, such as easy_install. I used pip for the sake of example and because it's the one used on the Pygments site.

Pygments

To install Pygments, run this command:

pip install Pygments

If you are on a server where your user doesn't have root access, you won't be able to install it with the previous command. In that case, run it with the --user flag to install the module in the user directory.

pip install --user Pygments

Everything is installed now, so what we've got to do is work with PHP and some Python code.

PHP + Python

The way it's going to work is by executing a Python script from PHP using exec(), sending it the language name and the filename of the file containing the code to be highlighted.

Python

The first thing we are going to do is create the Python script that converts plain code into highlighted code using Pygments.

So let's go step by step through how to create the Python script.

First we import all the required modules:

import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter

The sys module provides the argv list, which contains all the arguments passed to the Python script.

highlight from pygments is the main function; along with a lexer, it generates the highlighted code. You will read a bit more about lexers below.

HtmlFormatter defines how we want the generated code to be formatted, and we are going to use the HTML format. Here is a list of available formatters in case you're wondering.

# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2] 
f = open(filename, 'rb')
code = f.read()
f.close()

This block of code takes the second argument (sys.argv[1]) and transforms it to lowercase, just to make sure it's always lowercase, because "php" !== "PHP". The third argument, sys.argv[2], is the path of the file containing the code, so we open it, read its contents and close it. The first argument (sys.argv[0]) is the Python script's name.
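To make the argument positions concrete, here is a minimal sketch. parse_args is a hypothetical helper used only for illustration; it is not part of build.py, and the path is made up.

```python
def parse_args(argv):
    """Return (language, filename) from an argv-style list."""
    # argv[0] is the script name, so two real arguments means len(argv) == 3.
    if len(argv) != 3:
        raise ValueError("usage: build.py LANGUAGE FILENAME")
    # Lowercase the language so "PHP" and "php" are treated the same.
    return argv[1].lower(), argv[2]


# Equivalent to running: python build.py PHP /tmp/code.txt
print(parse_args(["build.py", "PHP", "/tmp/code.txt"]))
```

This prints ('php', '/tmp/code.txt').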

# Importing Lexers
# PHP
if language == 'php':
  from pygments.lexers import PhpLexer
  lexer = PhpLexer(startinline=True)

# GUESS
elif language == 'guess':
  from pygments.lexers import guess_lexer
  lexer = guess_lexer( code )

# GET BY NAME
else:
  from pygments.lexers import get_lexer_by_name
  lexer = get_lexer_by_name( language )

So it's time to import the lexer. This block of code creates a lexer depending on the language we need to analyze. A lexer analyzes our code and picks out each reserved word, symbol, built-in function, and so forth.

In this case, after the lexer analyzes all the code, the formatter turns it into HTML, wrapping all the "words" in an HTML element with a class. By the way, the class names are not descriptive at all, so a function doesn't get the class "function", but anyway this is not something to be worried about right now.
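If the idea of "lexer plus formatter" feels abstract, here is a toy sketch of it using only Python's standard library. It is not how Pygments is implemented: the three rules and the toy_highlight function are made up for illustration, and a real lexer knows far more token types. The short class names (k for keywords, s for strings, nv for variables) are the kind of terse names Pygments emits.

```python
import html
import re

# Toy token rules as (CSS class, regex). Hypothetical and far smaller than a real lexer.
TOKEN_RULES = [
    ("s", r'"[^"]*"'),                      # double-quoted string
    ("k", r"\b(?:echo|return|if|else)\b"),  # a few reserved words
    ("nv", r"\$\w+"),                       # PHP-style variable
]
PATTERN = re.compile("|".join("(?P<%s>%s)" % rule for rule in TOKEN_RULES))


def toy_highlight(code):
    """Wrap every recognized token in a <span> carrying its class."""
    out = []
    pos = 0
    for m in PATTERN.finditer(code):
        out.append(html.escape(code[pos:m.start()]))  # text between tokens
        out.append('<span class="%s">%s</span>' % (m.lastgroup, html.escape(m.group())))
        pos = m.end()
    out.append(html.escape(code[pos:]))  # whatever is left after the last token
    return "".join(out)


print(toy_highlight("echo $name;"))
```

This prints <span class="k">echo</span> <span class="nv">$name</span>; which is the same shape of markup that the CSS theme later styles.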

The variable language contains the name of the language of the code we want to convert. We use lexer = get_lexer_by_name( language ) to get any lexer by its name; the function is self-explanatory. But why do we check for php and guess first, you may ask? Well, we check for php because if we use get_lexer_by_name('php') and the PHP code does not have the required opening tag <?php, it's not going to highlight the code well or as we expected. So we need to create the specific PHP lexer like this: lexer = PhpLexer(startinline=True), passing startinline=True as a parameter so the opening PHP tag is not required anymore. guess is a string we pass from PHP to let Pygments know that we don't know which language it is, or that the language was not provided, and we need it to be guessed.

There is a list of available lexers on their site.

The final step in Python is creating the HTML formatter, performing the highlighting and outputting the HTML containing the highlighted code.

formatter = HtmlFormatter(linenos=False, encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
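# With encoding='utf-8' the result is bytes, so write it to stdout as-is (Python 3)
sys.stdout.buffer.write(highlighted)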

The formatter is passed linenos=False so it doesn't generate line numbers, and nowrap=True so it doesn't wrap the generated code in a div. This is a personal decision; the code will be wrapped using PHP.

Next, highlight is passed code, containing the actual code; lexer, containing the language lexer; and the formatter we just created in the line above, which tells highlight how we want our code formatted.

Finally, the code is output.

That's about it for Python; that's the script that is going to build the highlighted code.

Here is the complete file: build.py

import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter


# If there aren't exactly 2 args, something weird is going on
expecting = 2
if len(sys.argv) != expecting + 1:
  sys.exit(128)

# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2] 
f = open(filename, 'rb')
code = f.read()
f.close()


# PHP
if language == 'php':
  from pygments.lexers import PhpLexer
  lexer = PhpLexer(startinline=True)

# GUESS
elif language == 'guess':
  from pygments.lexers import guess_lexer
  lexer = guess_lexer( code )

# GET BY NAME
else:
  from pygments.lexers import get_lexer_by_name
  lexer = get_lexer_by_name( language )
  

# OUTPUT
formatter = HtmlFormatter(linenos=False, encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
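# With encoding='utf-8' the result is bytes, so write it to stdout as-is (Python 3)
sys.stdout.buffer.write(highlighted)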

PHP - WordPress

Let's jump to WordPress and create a basic plugin to handle the code that needs to be highlighted.

It doesn't matter if you have never created a plugin for WordPress in your entire life. This plugin is just a file with PHP functions in it, so you will be just fine without WordPress plugin development knowledge, but you do need some knowledge of WordPress development.

Create a folder inside wp-content/plugins named wp-pygments (it can be whatever you want), copy build.py, the Python script we just created, into it, and create a new PHP file named wp-pygments.php (maybe the same name as the directory).

The code below just lets WordPress know the plugin's name and other information. This code goes at the top of wp-pygments.php.

<?php
/*
 * Plugin Name: WP Pygments
 * Plugin URI: http://wellingguzman.com/wp-pygments
 * Description: A brief description of the Plugin.
 * Version: 0.1
 * Author: Welling Guzman
 * Author URI: http://wellingguzman.com
 * License: MEH
*/
?>

Add a filter on the_content to look for <pre> tags. The expected code is:

<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>

NOTE: HTML tags need to be encoded; for example, < needs to be &lt; so the parser doesn't get confused and get it all wrong.

The class is the language of the code inside the pre tags. If there is no class, or it is empty, guess is passed to build.py.

add_filter( 'the_content', 'mb_pygments_content_filter' );
function mb_pygments_content_filter( $content )
{
  $content = preg_replace_callback('/<pre(\s?class\="(.*?)")?[^>]?.*?>.*?<code>(.*?)<\/code>.*?<\/pre>/sim', 'mb_pygments_convert_code', $content);
  
  return $content;
}

The preg_replace_callback function executes the mb_pygments_convert_code callback every time there's a match in the content for the regex pattern provided: /<pre(\s?class\="(.*?)")?[^>]?.*?>.*?<code>(.*?)<\/code>.*?<\/pre>/sim. It should match any <pre><code> in a post/page content.

What about sim? These are three pattern modifier flags. From php.net:

  • s: If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines.
  • i: If this modifier is set, letters in the pattern match both upper and lower case letters.
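  • m: By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). When this modifier is set, ^ and $ also match at the start and end of each line within the string. The pattern above has no ^ or $ anchors, so m has no effect here.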
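Python's re module has the same flags under different names (re.S for s, re.I for i, re.M for m), so the effect is easy to try out. This is just a sketch with a made-up snippet and a shortened pattern, not the plugin's regex:

```python
import re

html_text = "<code>\nline one\nline two\n</code>"

# Without DOTALL the dot stops at newlines, so a multi-line block is not matched.
without_s = re.search(r"<code>(.*?)</code>", html_text)

# With DOTALL (PHP's "s") the dot also matches newlines.
with_s = re.search(r"<code>(.*?)</code>", html_text, re.S)

# IGNORECASE (PHP's "i") lets an upper-case <CODE> tag match too.
upper = re.search(r"<code>(.*?)</code>", html_text.upper(), re.S | re.I)

print(without_s)            # no match object
print(repr(with_s.group(1)))
print(upper is not None)
```

The first search finds nothing, which is why the s modifier matters for code blocks that span several lines.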

This can be done with DOMDocument as well. Replace the preg_replace_callback() call with this:

// Keep libxml from throwing errors on imperfect HTML
libxml_use_internal_errors(true);

// Get all pre from post content
$dom = new DOMDocument();
$dom->loadHTML($content);
$pres = $dom->getElementsByTagName('pre');

foreach ($pres as $pre) {
  // getAttribute() returns an empty string when there is no class attribute
  $class = $pre->getAttribute('class');
  $code = $pre->nodeValue;
  
  $args = array(
    2 => $class, // Element at position [2] is the class
    3 => $code // And element at position [3] is the code
  );
  
  // convert the code
  $new_code = mb_pygments_convert_code($args);
  
  // Replace the actual pre with the new one.
  $new_pre = $dom->createDocumentFragment();
  $new_pre->appendXML($new_code);
  $pre->parentNode->replaceChild($new_pre, $pre);
}
// Save the HTML of the new code.
$content = $dom->saveHTML();

The code below is the mb_pygments_convert_code function.

define( 'MB_WPP_BASE', dirname(__FILE__) );
function mb_pygments_convert_code( $matches )
{
  $pygments_build = MB_WPP_BASE . '/build.py';
  $source_code    = isset($matches[3])?$matches[3]:'';
  $class_name     = isset($matches[2])?$matches[2]:'';
  
  // Creates a temporary filename
  $temp_file      = tempnam(sys_get_temp_dir(), 'MB_Pygments_');
  
  // Populate temporary file
  $filehandle = fopen($temp_file, "wb");
  fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8') );
  fclose($filehandle);
  
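  // Creates pygments command; arguments are shell-escaped because the class comes from post content
  $language   = $class_name?$class_name:'guess';
  $command    = sprintf('python %s %s %s', escapeshellarg($pygments_build), escapeshellarg($language), escapeshellarg($temp_file));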

  // Executes the command
  $retVal = -1;
  exec( $command, $output, $retVal );
  unlink($temp_file);
  
  // Returns Source Code
  $format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';
  
  if ( $retVal == 0 )
    $source_code = implode("\n", $output);
    
  $highlighted_code = sprintf($format, $language, $source_code);
  
  return $highlighted_code;
}

Reviewing the code above:

define( 'MB_WPP_BASE', dirname(__FILE__) );

It defines a constant with the absolute path of the plugin's directory.

$pygments_build = MB_WPP_BASE . '/build.py';
$source_code    = isset($matches[3])?$matches[3]:'';
$class_name     = isset($matches[2])?$matches[2]:'';

$pygments_build is the full path where the Python script is located. Every time there is a match, an array called $matches containing 4 elements is passed to the callback. Take this as an example of matched code from a post/page content:

<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>

  • The element at position [0] is the whole <pre> match, and its value is:

    <pre class="php">
    <code>
    $name = "World";
    echo "Hello, " . $name;
    </code>
    </pre>
    
  • The element at position [1] is the class attribute name with its value, and its value is:

    class="php"
    
  • The element at position [2] is the class attribute value without its name, and its value is:

    php
    
  • The element at position [3] is the code itself without its pre tags, and its value is:

    $name = "World";
    echo "Hello, " . $name;
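To see these groups for yourself without touching WordPress, here is a small sketch that runs the same pattern through Python's re module (re.S, re.I and re.M stand in for the sim modifiers). The sample content is made up. Note that group 1 actually carries the leading space and group 3 the surrounding newlines, and that Python returns None for an unmatched group where PHP gives an empty string.

```python
import re

PATTERN = re.compile(
    r'<pre(\s?class="(.*?)")?[^>]?.*?>.*?<code>(.*?)</code>.*?</pre>',
    re.S | re.I | re.M,
)

content = '''<p>Intro</p>
<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>'''

match = PATTERN.search(content)
print(repr(match.group(1)))     # the class attribute with its value
print(match.group(2))           # the class value only
print(match.group(3).strip())   # the code itself

# A <pre> without a class leaves group 2 unset, which the plugin turns into "guess".
bare = PATTERN.search("<pre><code>x = 1</code></pre>")
language = bare.group(2) or "guess"
print(language)
```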
    
// Creates a temporary filename
$temp_file = tempnam(sys_get_temp_dir(), 'MB_Pygments_');

It creates a temporary file that will hold the code to be passed to the Python script. This is a better way to hand the code over; passing the whole thing as a parameter would be a mess.

// Populate temporary file
$filehandle = fopen($temp_file, "wb");
fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8') );
fclose($filehandle);

It writes the code to the file, but first we decode all the HTML entities so Pygments can convert them properly.
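Here is a quick illustration of why the decoding step matters, using Python's html.unescape as a rough counterpart of PHP's html_entity_decode (one difference: with ENT_COMPAT, PHP leaves single-quote entities alone, while html.unescape decodes everything). The stored string is a made-up example of what a post might contain.

```python
import html

# What the post content holds: the code with its HTML entities encoded.
stored = "if ($a &lt; $b &amp;&amp; $b &gt; 0) { echo &quot;ok&quot;; }"

# What the highlighter should receive: the real source code.
decoded = html.unescape(stored)
print(decoded)
```

This prints if ($a < $b && $b > 0) { echo "ok"; } and that is the text a lexer can actually tokenize; fed the encoded version, it would treat &lt; as an ampersand followed by a name.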

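// Creates pygments command
$language = $class_name?$class_name:'guess';
$command  = sprintf('python %s %s %s', escapeshellarg($pygments_build), escapeshellarg($language), escapeshellarg($temp_file));

It creates the Python command to be used. Each argument goes through escapeshellarg(), because the class name comes from the post content and must not be interpreted by the shell. It outputs:

python '/path/to/build.py' 'php' '/path/to/temp.file'

// Executes the command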
$retVal = -1;
exec( $command, $output, $retVal );
unlink($temp_file);
  
// Returns Source Code
$format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';
  
if ( $retVal == 0 )
  $source_code = implode("\n", $output);
    
$highlighted_code = sprintf($format, $language, $source_code);

It executes the command just created; if it returns 0, everything worked fine in the Python script. exec() fills $output with an array of the lines output by the Python script, so we join that array into one string to be the source code. If not, we stick with the code without highlighting.
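The same run-and-fall-back logic can be sketched in Python. This is only an illustration: run_or_fallback is a hypothetical helper, and the two commands are stand-ins for build.py that simply succeed or exit with 128. Passing the arguments as a list to subprocess.run skips the shell entirely, so no quoting is needed there.

```python
import subprocess
import sys


def run_or_fallback(argv, fallback):
    """Return the command's stdout if it exits with 0, else the fallback text."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout.rstrip("\n")
    return fallback


# Stand-ins for "python build.py php /path/to/temp.file" (assumed for the demo).
ok = [sys.executable, "-c", "print('<span>hi</span>')"]
bad = [sys.executable, "-c", "import sys; sys.exit(128)"]

print(run_or_fallback(ok, "plain code"))   # highlighted output
print(run_or_fallback(bad, "plain code"))  # falls back to the original text
```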

Improving it by Caching

So by now it works fine, but we have to save time and processing. Imagine 100 <pre> tags in a post: it would create 100 files and call the Python script 100 times. So let's cache this baby.

Transient API

WordPress provides the ability to store data in the database temporarily with the Transient API.

First, let's add an action to the save_post hook, so every time the post is saved we convert the code and cache it.

add_action( 'save_post', 'mb_pygments_save_post' );
function mb_pygments_save_post( $post_id )
{
  if ( wp_is_post_revision( $post_id ) )
    return;
    
  $content = get_post_field( 'post_content', $post_id );
  
  mb_pygments_content_filter( $content );
}

If it's a revision we don't do anything; otherwise we get the post content and call the Pygments content filter function.

Let's create some functions to handle the cache.

// Cache Functions
// Expiration time (1 month), let's clear cache every month.
define('MB_WPP_EXPIRATION', 60 * 60 * 24 * 30);

// Returns the name of a post cache.
function get_post_cache_transient()
{
  global $post;
  
  $post_id = $post->ID;
  $transient = 'post_' . $post_id . '_content';
  
  return $transient;
}

// This creates a post cache for a month,
// containing the new content with pygments
// and last time the post was updated.
function save_post_cache($content)
{
  global $post;
    
  $expiration = MB_WPP_EXPIRATION;
  $value = array( 'content'=>$content, 'updated'=>$post->post_modified );
  set_transient( get_post_cache_transient(), $value, $expiration );
}

// This returns a post cache
function get_post_cache()
{
  $cached_post = get_transient( get_post_cache_transient() );
  
  return $cached_post;
}

// Check if a post needs to be updated.
function post_cache_needs_update()
{
  global $post;
  
  $cached_post = get_post_cache();
  if ( strtotime($post->post_modified) > strtotime($cached_post['updated']) )
    return TRUE;
      
  return FALSE;
}

// Delete a post cache.
function clear_post_cache()
{ 
  delete_transient( get_post_cache_transient() );
}
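The rule these functions implement is: use the cached HTML only if it exists, has not expired, and is not older than the post's last modification. Here is that rule as a small Python sketch, with a plain dict standing in for the WordPress transient storage. All names are hypothetical and the timestamps are plain numbers to keep it short.

```python
EXPIRATION = 60 * 60 * 24 * 30  # one month, in seconds

cache = {}  # stands in for the transient storage


def cache_key(post_id):
    return "post_%d_content" % post_id


def save_post_cache(post_id, content, modified, now):
    cache[cache_key(post_id)] = {
        "content": content,
        "updated": modified,
        "expires": now + EXPIRATION,
    }


def get_fresh_content(post_id, modified, now):
    """Return cached content, or None if missing, expired, or older than the post."""
    entry = cache.get(cache_key(post_id))
    if entry is None or now >= entry["expires"]:
        return None
    if modified > entry["updated"]:
        return None
    return entry["content"]


save_post_cache(7, "<div>highlighted</div>", modified=1000, now=1000)
print(get_fresh_content(7, modified=1000, now=2000))               # cached copy is reused
print(get_fresh_content(7, modified=3000, now=3500))               # post edited after caching
print(get_fresh_content(7, modified=1000, now=1000 + EXPIRATION))  # cache expired
```

Only the first call returns the cached HTML; the other two return None, which is the cue to run the highlighter again.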

At the beginning of mb_pygments_content_filter(), add some lines to check whether there is a cached version of the post.

function mb_pygments_content_filter( $content )
{
  if ( FALSE !== ( $cached_post = get_post_cache() ) && !post_cache_needs_update() )
    return $cached_post['content'];

  clear_post_cache();

And at the end of mb_pygments_content_filter(), just before the return, add a line to save the post cache.

save_post_cache( $content );

Finally, when the plugin is uninstalled we need to remove all the cache we created. This is a bit tricky, so we use the $wpdb object to delete everything with a query.

register_uninstall_hook(__FILE__, 'mb_wp_pygments_uninstall');
function mb_wp_pygments_uninstall() {
  global $wpdb;
  
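  // $wpdb->options honors custom table prefixes; the second pattern removes the expiration rows too
  $wpdb->query( "DELETE FROM {$wpdb->options} WHERE option_name LIKE '_transient_post_%_content' OR option_name LIKE '_transient_timeout_post_%_content'" );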
}
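To check which option names a LIKE pattern such as '_transient_post_%_content' selects, here is a rough sketch using Python's sqlite3 module as a stand-in for MySQL. That is an assumption made only so the example is self-contained, and the table and rows are made up. In LIKE, % matches any run of characters and _ matches any single character. WordPress stores a transient that has an expiration as two rows: _transient_{name} and _transient_timeout_{name}.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE options (option_name TEXT)")
names = [
    "_transient_post_12_content",
    "_transient_timeout_post_12_content",
    "_transient_feed_abc",
    "siteurl",
]
conn.executemany("INSERT INTO options VALUES (?)", [(n,) for n in names])


def matching(pattern):
    """Return the option names selected by a LIKE pattern."""
    rows = conn.execute(
        "SELECT option_name FROM options WHERE option_name LIKE ? ORDER BY option_name",
        (pattern,),
    )
    return [row[0] for row in rows]


print(matching("_transient_post_%_content"))
print(matching("_transient_timeout_post_%_content"))
```

Each pattern selects exactly one of the two rows belonging to the cached post, so both patterns are needed to clean up completely.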

About Welling Guzman

Welling Guzman is a freelance web developer and consultant from Dominican Republic, who loves to code, mainly for the web.

Discussion

  1. There is a plugin in wordpress WPygments (http://wordpress.org/plugins/wpygments) powered by PHPygments (https://github.com/capy/PHPygments) ;)

  2. thanks welling, it’s well written article that helps understand powerful syntax highlighting feature from Python + PHP. Enjoyed the article a lot.

  3. What happened with my comment? I just wanted point another solution, and was not approved?

  4. Wow. That is a lot of work for something as simple as a syntax highlighter. I think you managed to solve Chess in there somewhere. ;)

    Perhaps the greater takeaway from the article is: Always do your homework to see if someone has done your job for you before reinventing the wheel.

  5. Filippo Fadda

    You can also use Pygmentize, a simple wrapper to Pygments, that at least is able to handle errors (not like the previous ones), and it doesn’t use any external components. You can simply add it using Composer. You can generate the documentation using Doxygen, the doxy file is included. It’s just a class with a static method and it uses proc_open. See https://github.com/dedalozzo/Pygmentize.
