Pygments on PHP & WordPress
I've been on a long journey to find a great code highlighter, and I've used more of them than I can remember. These are the ones that come to mind right now:
- SyntaxHighlighter
- Google Prettifier
- highlighter.js
- Geshi
Right now I'm using highlighter.js, but it isn't exactly what I want. I want to highlight most "words" or reserved words, such as built-in functions, objects, and so on, which this highlighter and most of the others miss. I know it's not an important thing, but it was stuck in my head until now.
Finally, I've found Pygments, the one that matches what I've been looking for, and it's the same one used by GitHub. The only obstacle is that it's a Python-based syntax highlighter and I'm using WordPress, which is built on PHP.
Installation
But hey, we can get over it. There is a solution: first, we need to get Python installed on our server so we can use Pygments.
We aren't going to go too deep into installation, because there are so many OS flavors out there and the steps could be slightly different on each of them.
Python
First of all, check whether you already have Python installed by typing python on your command line.
If it is not installed, take a look at the Python Downloads page and download the installer for your OS.
PIP Installer
According to its site, there are two ways to install pip:
The first and recommended way is to download get-pip.py and run it on your command line:
python get-pip.py
The second way is to use a package manager by running one of these two commands. As mentioned before, which one applies depends on your server OS.
sudo apt-get install python-pip
Or:
sudo yum install python-pip
NOTE: you can use any package manager you prefer, such as easy_install. I used pip for the sake of example and because it is the one used on the Pygments site.
Pygments
To install pygments you need to run this command:
pip install Pygments
If you are on a server where your user doesn't have root access, you won't be able to install it with the previous command. In that case, run it with the --user flag to install the module in the user directory.
pip install --user Pygments
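One quick way to confirm the install worked is to ask Python whether it can find the module. This is a minimal sketch using only the standard library; the helper name is_module_installed is made up for illustration.

```python
import importlib.util


def is_module_installed(name):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "pygments" is the import name of the Pygments package.
    print("Pygments installed:", is_module_installed("pygments"))
```

If this prints False after a --user install, the Python that ran the check is probably not the one pip installed into.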
Everything is installed now, so what's left is to work with PHP and some Python code.
PHP + Python
The way it's going to work is by executing a Python script from PHP using exec(), sending the language name and the filename of the file containing the code to be highlighted.
Python
The first thing we are going to do is create the python script that is going to convert plain code into highlighted code using Pygments.
So let's go step by step on how to create the python script.
First we import all the required modules:
import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter
The sys module provides the argv list, which contains all the arguments passed to the Python script.
highlight from pygments is the main function; together with a lexer, it generates the highlighted code. You will read a bit more about lexers below.
HtmlFormatter defines how we want the generated code to be formatted, and we are going to use the HTML format. Here is a list of available formatters in case you are wondering.
# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2]
f = open(filename, 'rb')
code = f.read()
f.close()
This block takes the second argument (sys.argv[1]) and transforms it to lowercase, just to make sure it is always lowercase, because "php" !== "PHP". The third argument, sys.argv[2], is the path of the file containing the code, so we open it, read its contents and close it. The first argument, sys.argv[0], is the Python script's name.
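The argument handling described above can be sketched as a small function. The name read_request is hypothetical, and unlike the original snippet it uses a with block so the file is closed even if reading fails.

```python
def read_request(argv):
    """Split an argv-style list into (language, code_bytes).

    argv[0] is the script name, argv[1] the language, argv[2] the file path.
    """
    if len(argv) != 3:
        raise ValueError("expected: script.py <language> <filename>")
    language = argv[1].lower()  # "PHP" and "php" should be treated the same
    with open(argv[2], "rb") as handle:
        code = handle.read()
    return language, code
```

In a script it would be called as read_request(sys.argv).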
# Importing Lexers
# PHP
if language == 'php':
    from pygments.lexers import PhpLexer
    lexer = PhpLexer(startinline=True)
# GUESS
elif language == 'guess':
    from pygments.lexers import guess_lexer
    lexer = guess_lexer( code )
# GET BY NAME
else:
    from pygments.lexers import get_lexer_by_name
    lexer = get_lexer_by_name( language )
Now it's time to import the lexer. This block creates a lexer depending on the language we need to analyze. A lexer analyzes our code and picks out each reserved word, symbol, built-in function, and so forth.
In this case, after the lexer analyzes the code, the result is formatted as HTML, wrapping each "word" in an HTML element with a class. By the way, the class names are not descriptive at all, so a function does not get the class "function", but this is not something to worry about right now.
The variable language contains the name of the language of the code we want to convert. We use lexer = get_lexer_by_name( language ) to get any lexer by its name; the function is self-explanatory. But why do we check for php and guess first, you may ask? We check for php because if we use get_lexer_by_name('php') and the PHP code does not have the opening <?php tag, the code is not highlighted well or as expected. So we create the specific PHP lexer like this: lexer = PhpLexer(startinline=True), passing startinline=True as a parameter so the opening PHP tag is no longer required. guess is a string we pass from PHP to let Pygments know that we don't know which language it is, or that the language was not provided, and we need it to be guessed.
There is a list of available lexers on their site.
The final step in Python is creating the HTML formatter, performing the highlighting and outputting the HTML containing the highlighted code.
formatter = HtmlFormatter(linenos=False, encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
print highlighted
The formatter is given linenos=False so it does not generate line numbers, and nowrap=True so it does not wrap the generated code in a div. This is a personal decision; the code will be wrapped using PHP.
Next, highlight receives code, containing the actual code, lexer, containing the language lexer, and the formatter we just created in the line above, which tells highlight how we want our code formatted.
Finally, the highlighted code is output.
That's about it for Python; that is the script that is going to build the highlighted code.
Here is the complete file: build.py
import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter

# If there isn't only 2 args something weird is going on
expecting = 2
if ( len(sys.argv) != expecting + 1 ):
    exit(128)

# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2]
f = open(filename, 'rb')
code = f.read()
f.close()

# PHP
if language == 'php':
    from pygments.lexers import PhpLexer
    lexer = PhpLexer(startinline=True)
# GUESS
elif language == 'guess':
    from pygments.lexers import guess_lexer
    lexer = guess_lexer( code )
# GET BY NAME
else:
    from pygments.lexers import get_lexer_by_name
    lexer = get_lexer_by_name( language )

# OUTPUT
formatter = HtmlFormatter(linenos=False, encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
print highlighted
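The script above uses the Python 2 print statement. On Python 3, a sketch along these lines should behave similarly; it is an adaptation, not the author's file. The function names pick_lexer, render and main are invented here, the fallback to TextLexer for unknown languages is an added assumption, and the formatter's encoding option is dropped so that highlight() returns text rather than bytes.

```python
import sys

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PhpLexer, TextLexer, get_lexer_by_name, guess_lexer
from pygments.util import ClassNotFound


def pick_lexer(language, code):
    """Choose a lexer the same way build.py does, with a plain-text fallback."""
    if language == "php":
        # startinline=True means the opening <?php tag is not required
        return PhpLexer(startinline=True)
    try:
        if language == "guess":
            return guess_lexer(code)
        return get_lexer_by_name(language)
    except ClassNotFound:
        # Assumption: unknown languages are rendered as plain text
        return TextLexer()


def render(code, language):
    """Return the highlighted HTML fragment (no wrapping div or pre)."""
    formatter = HtmlFormatter(linenos=False, nowrap=True)
    return highlight(code, pick_lexer(language.lower(), code), formatter)


def main(argv):
    """Command-line entry point; returns the process exit code."""
    if len(argv) != 3:
        return 128
    with open(argv[2], encoding="utf-8") as handle:
        sys.stdout.write(render(handle.read(), argv[1]))
    return 0
```

To use it as a drop-in script, end the file with sys.exit(main(sys.argv)).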
PHP - WordPress
Let's jump to WordPress and create a basic plugin to handle the code that needs to be highlighted.
It doesn't matter if you have never created a WordPress plugin in your entire life. This plugin is just a file with PHP functions in it, so you will be fine without plugin development knowledge, though you do need some knowledge of WordPress development.
Create a folder inside wp-content/plugins named wp-pygments (it can be whatever you want). Inside it, copy build.py, the Python script we just created, and create a new PHP file named wp-pygments.php (preferably the same name as the directory).
The code below just lets WordPress know the plugin's name and other information. It goes at the top of wp-pygments.php.
<?php
/*
 * Plugin Name: WP Pygments
 * Plugin URI: http://wellingguzman.com/wp-pygments
 * Description: A brief description of the Plugin.
 * Version: 0.1
 * Author: Welling Guzman
 * Author URI: http://wellingguzman.com
 * License: MEH
 */
?>
Add a filter on the_content to look for <pre> tags. The expected markup is:
<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>
NOTE: HTML tags inside the code need to be encoded; for example, < needs to be &lt; so the parser doesn't get confused and get it all wrong.
Here class is the language of the code inside the pre tags. If there is no class, or it is empty, guess is passed to build.py.
add_filter( 'the_content', 'mb_pygments_content_filter' );

function mb_pygments_content_filter( $content )
{
  $content = preg_replace_callback('/<pre(\s?class\="(.*?)")?[^>]?.*?>.*?<code>(.*?)<\/code>.*?<\/pre>/sim', 'mb_pygments_convert_code', $content);

  return $content;
}
The preg_replace_callback function executes the mb_pygments_convert_code callback every time there is a match in the content for the provided regex pattern: /<pre(\s?class\="(.*?)")?[^>]?.*?>.*?<code>(.*?)<\/code>.*?<\/pre>/sim. It should match any <pre><code> in a post or page's content. What about sim? These are three pattern modifier flags. From php.net:
- s: If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines.
- i: If this modifier is set, letters in the pattern match both upper and lower case letters.
- m: By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). When this modifier is set, ^ and $ also match at the start and end of each line.
This can be done with DOMDocument as well. Replace the preg_replace_callback() call, the one using the pattern /<pre(\s?class\="(.*?)")?[^>]?.*?>.*?<code>(.*?)<\/code>.*?<\/pre>/sim, with this:
// Keep libxml from emitting warnings on imperfect HTML
libxml_use_internal_errors(true);

// Get all pre from post content
$dom = new DOMDocument();
$dom->loadHTML($content);
// Copy the live node list to an array so replacing nodes does not skip any
$pres = iterator_to_array($dom->getElementsByTagName('pre'));

foreach ($pres as $pre) {
  // getAttribute() returns an empty string when there is no class
  $class = $pre->getAttribute('class');
  $code = $pre->nodeValue;

  $args = array(
    2 => $class, // Element at position [2] is the class
    3 => $code   // And element at position [3] is the code
  );

  // convert the code
  $new_code = mb_pygments_convert_code($args);

  // Replace the actual pre with the new one.
  $new_pre = $dom->createDocumentFragment();
  $new_pre->appendXML($new_code);
  $pre->parentNode->replaceChild($new_pre, $pre);
}

// Save the HTML of the new code.
$content = $dom->saveHTML();
The code below is the mb_pygments_convert_code function.
define( 'MB_WPP_BASE', dirname(__FILE__) );

function mb_pygments_convert_code( $matches )
{
  $pygments_build = MB_WPP_BASE . '/build.py';
  $source_code = isset($matches[3])?$matches[3]:'';
  $class_name = isset($matches[2])?$matches[2]:'';

  // Creates a temporary filename
  $temp_file = tempnam(sys_get_temp_dir(), 'MB_Pygments_');

  // Populate temporary file
  $filehandle = fopen($temp_file, "w");
  fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8') );
  fclose($filehandle);

  // Creates pygments command
  $language = $class_name?$class_name:'guess';
  $command = sprintf('python %s %s %s', $pygments_build, $language, $temp_file);

  // Executes the command
  $retVal = -1;
  exec( $command, $output, $retVal );
  unlink($temp_file);

  // Returns Source Code
  $format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';

  if ( $retVal == 0 )
    $source_code = implode("\n", $output);

  $highlighted_code = sprintf($format, $language, $source_code);

  return $highlighted_code;
}
Reviewing the code above:
define( 'MB_WPP_BASE', dirname(__FILE__) );
This defines a constant holding the absolute path of the plugin's directory.
$pygments_build = MB_WPP_BASE . '/build.py';
$source_code = isset($matches[3])?$matches[3]:'';
$class_name = isset($matches[2])?$matches[2]:'';
$pygments_build is the full path where the Python script is located. Every time there is a match, an array called $matches containing 4 elements is passed to the callback. Take this as an example of a matched block from a post or page's content:
<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>
- The element at position [0] is the whole <pre> match, and its value is:
<pre class="php">
<code>
$name = "World";
echo "Hello, " . $name;
</code>
</pre>
- The element at position [1] is the class attribute name with its value, and its value is:
class="php"
- The element at position [2] is the class attribute value without its name, and its value is:
php
- The element at position [3] is the code itself without its pre tags, and its value is:
$name = "World";
echo "Hello, " . $name;
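To see those four elements concretely, here is the same pattern tried out in Python's re module, where the s, i and m modifiers correspond to re.S, re.I and re.M. This is only an illustration of the pattern's groups, not part of the plugin. Note that group 1 also captures the whitespace in front of class.

```python
import re

# The article's pattern, minus the PHP delimiters and escaped slashes
PRE_CODE = re.compile(
    r'<pre(\s?class="(.*?)")?[^>]?.*?>.*?<code>(.*?)</code>.*?</pre>',
    re.S | re.I | re.M,
)

sample = '<pre class="php">\n<code>\n$name = "World";\necho "Hello, " . $name;\n</code>\n</pre>'

match = PRE_CODE.search(sample)
whole_match = match.group(0)       # the entire <pre>...</pre> block
class_attribute = match.group(1)   # the class attribute with its value
language = match.group(2)          # the class value only
source = match.group(3)            # the code between the <code> tags
```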
// Creates a temporary filename
$temp_file = tempnam(sys_get_temp_dir(), 'MB_Pygments_');
It creates a temporary file that will hold the code passed to the Python script. This is a better way to hand over the code; passing the whole thing as a command-line parameter would be a mess.
// Populate temporary file
$filehandle = fopen($temp_file, "wb");
fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8') );
fclose($filehandle);
It writes the code to the file, but first we decode all the HTML entities so Pygments can convert them properly.
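The same two steps, decoding entities and writing the result to a temporary file, look like this in Python. It is a rough equivalent for illustration only; write_temp_source is a hypothetical helper, and html.unescape plays the role of PHP's html_entity_decode.

```python
import html
import os
import tempfile


def write_temp_source(encoded_source, prefix="MB_Pygments_"):
    """Decode HTML entities and store the result in a new temporary file.

    Returns the path; the caller is responsible for deleting the file.
    """
    decoded = html.unescape(encoded_source)
    fd, path = tempfile.mkstemp(prefix=prefix)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(decoded)
    return path
```

Without the decoding step the highlighter would see &lt; as literal text instead of the < the author typed.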
// Creates pygments command
$language = $class_name?$class_name:'guess';
$command = sprintf('python %s %s %s', $pygments_build, $language, $temp_file);
It creates the Python command to be used, which looks like this:
python /path/to/build.py php /path/to/temp.file
// Executes the command
$retVal = -1;
exec( $command, $output, $retVal );
unlink($temp_file);

// Returns Source Code
$format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';

if ( $retVal == 0 )
  $source_code = implode("\n", $output);

$highlighted_code = sprintf($format, $language, $source_code);
It executes the command just created; if it returns 0, everything worked fine in the Python script. exec() fills $output with an array of the lines output by the Python script, so we join them into one string to be the source code. If not, we stick with the code without highlighting.
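The run-and-fall-back logic can be sketched in Python with subprocess. Everything here is illustrative: run_highlighter is an invented name, and passing the command as a list (rather than one formatted string) is a swapped-in technique that sidesteps shell quoting of the language and file name.

```python
import subprocess
import sys


def run_highlighter(command, original_source):
    """Run `command` (a list of arguments) and return its stdout on success.

    If the process exits with a non-zero status, return the original,
    unhighlighted source instead.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout
    return original_source


# Hypothetical call; the paths are placeholders.
# html = run_highlighter(
#     [sys.executable, "/path/to/build.py", "php", "/path/to/temp.file"],
#     original_source,
# )
```

In the PHP version the language comes straight from the class attribute of the post, so quoting each argument (for example with escapeshellarg()) is worth considering before it reaches exec().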
Improving it by Caching
So far it works fine, but we should save time and processing. Imagine 100 <pre> tags in a post: it would create 100 files and call the Python script 100 times, so let's cache this baby.
Transient API
WordPress provides the ability to store data in the database temporarily with the Transient API.
First, let's add an action to the save_post hook, so every time the post is saved we convert the code and cache it.
add_action( 'save_post', 'mb_pygments_save_post' );

function mb_pygments_save_post( $post_id )
{
  if ( wp_is_post_revision( $post_id ) )
    return;

  $content = get_post_field( 'post_content', $post_id );

  mb_pygments_content_filter( $content );
}
If it is a revision we don't do anything; otherwise we get the post content and call the Pygments content filter function.
Let's create some functions to handle the cache.
// Cache Functions
// Expiration time (1 month), let's clear cache every month.
define('MB_WPP_EXPIRATION', 60 * 60 * 24 * 30);

// This function it returns the name of a post cache.
function get_post_cache_transient()
{
  global $post;

  $post_id = $post->ID;
  $transient = 'post_' . $post_id . '_content';

  return $transient;
}

// This creates a post cache for a month,
// containing the new content with pygments
// and last time the post was updated.
function save_post_cache($content)
{
  global $post;

  $expiration = MB_WPP_EXPIRATION;
  $value = array( 'content'=>$content, 'updated'=>$post->post_modified );
  set_transient( get_post_cache_transient(), $value, $expiration );
}

// This returns a post cache
function get_post_cache()
{
  $cached_post = get_transient( get_post_cache_transient() );

  return $cached_post;
}

// Check if a post needs to be updated.
function post_cache_needs_update()
{
  global $post;

  $cached_post = get_post_cache();
  if ( strtotime($post->post_modified) > strtotime($cached_post['updated']) )
    return TRUE;

  return FALSE;
}

// Delete a post cache.
function clear_post_cache()
{
  delete_transient( get_post_cache_transient() );
}
At the beginning of mb_pygments_content_filter(), add some lines to check whether there is a cache for the post.
function mb_pygments_content_filter( $content )
{
  if ( FALSE !== ( $cached_post = get_post_cache() ) && !post_cache_needs_update() )
    return $cached_post['content'];

  clear_post_cache();
And at the end of mb_pygments_content_filter()
add a line to save the post cache.
save_post_cache( $content );
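Stripped of the WordPress specifics, the caching rule is: reuse the stored HTML unless the entry has expired or the post was modified after it was cached. The sketch below models that with a plain dictionary standing in for the Transient API; the class and method names are made up for illustration.

```python
import time

ONE_MONTH = 60 * 60 * 24 * 30  # same expiration as MB_WPP_EXPIRATION


class PostCache:
    """In-memory stand-in for the transient-based post cache."""

    def __init__(self, now=time.time):
        self._entries = {}
        self._now = now  # injectable clock, handy for testing

    def save(self, post_id, content, modified):
        """Store the highlighted content with the post's modification time."""
        self._entries[post_id] = {
            "content": content,
            "updated": modified,
            "expires": self._now() + ONE_MONTH,
        }

    def get(self, post_id, modified):
        """Return cached content, or None if missing, expired or stale."""
        entry = self._entries.get(post_id)
        if entry is None or self._now() >= entry["expires"]:
            return None
        if modified > entry["updated"]:
            return None  # the post changed after it was cached
        return entry["content"]
```

A None result corresponds to the path where the PHP filter clears the cache, runs the highlighter again and saves the new content.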
Finally, when the plugin is uninstalled we need to remove all the caches we created. This is a bit tricky, so we use the $wpdb object to delete them all with a query.
register_uninstall_hook(__FILE__, 'mb_wp_pygments_uninstall');

function mb_wp_pygments_uninstall() {
  global $wpdb;

  // $wpdb->options resolves to the options table with the site's own prefix
  $wpdb->query( "DELETE FROM {$wpdb->options} WHERE option_name LIKE '_transient_post_%_content' " );
}
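One thing worth checking is what that LIKE pattern actually matches. The sketch below replays the query against an in-memory SQLite table (SQLite's LIKE treats % and _ as wildcards, as MySQL's does). It assumes WordPress's usual layout, where each transient with an expiration also has a companion _transient_timeout_ row; under that assumption the pattern leaves the timeout rows behind. The table and the sample option names are made up for the demonstration.

```python
import sqlite3


def remaining_after_uninstall(option_names):
    """Apply the article's DELETE pattern and return the option names left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE options (option_name TEXT)")
    conn.executemany(
        "INSERT INTO options (option_name) VALUES (?)",
        [(name,) for name in option_names],
    )
    conn.execute(
        "DELETE FROM options WHERE option_name LIKE '_transient_post_%_content'"
    )
    rows = conn.execute("SELECT option_name FROM options ORDER BY option_name")
    left = [row[0] for row in rows]
    conn.close()
    return left


left = remaining_after_uninstall([
    "_transient_post_12_content",
    "_transient_timeout_post_12_content",
    "siteurl",
])
```

If that assumption holds on your install, a second DELETE with the pattern '_transient_timeout_post_%_content' would cover the leftover rows.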
About Welling Guzman
Welling Guzman is a freelance web developer and consultant from Dominican Republic, who loves to code, mainly for the web.
There is a plugin in wordpress WPygments (http://wordpress.org/plugins/wpygments) powered by PHPygments (https://github.com/capy/PHPygments) ;)
thanks welling, it’s well written article that helps understand powerful syntax highlighting feature from Python + PHP. Enjoyed the article a lot.
What happened with my comment? I just wanted point another solution, and was not approved?
Wow. That is a lot of work for something as simple as a syntax highlighter. I think you managed to solve Chess in there somewhere. ;)
Perhaps the greater takeaway from the article is: Always do your homework to see if someone has done your job for you before reinventing the wheel.
You can also use Pygmentize, a simple wrapper to Pygments, that at least is able to handle errors (not like the previous ones), and it doesn’t use any external components. You can simply add it using Composer. You can generate the documentation using Doxygen, the doxy file is included. It’s just a class with a static method and it uses proc_open. See https://github.com/dedalozzo/Pygmentize.