Match Special Letters with PHP Regular Expressions

By  on  

Regular expressions come with all sorts of peculiarities, one of which I recently ran into when creating a regex within PHP and preg_match.  I was trying to parse strings with the format "Real Name (:username)" when I ran into a problem I would see a lot at Mozilla:  my regular expression wasn't properly catching "special" or "international" letters, like à, é, ü, and the dozens of others.

My regular expression was using A-z in the real name matching piece of the regex, which I assumed would match special letters, but it did not:

preg_match(
  "/([A-Za-z -]+)?\s?\[?\(?:([A-Za-z0-9\-\_]+)\)?\]?/", 
  "Yep Nopé [:ynope]", $matches);

// 0 => '[:ynope]', 1 => 'Yep Nopé', 2 => 'ynope'

To match international letters, I needed to update my regular expression in two ways:

  • Change A-z to \pL within the matching piece
  • Add the u modifier makes the string treated as UTF-8

The updated regex would be:

preg_match(
  "/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u", 
  "Yep Nopé [:ynope]", $matches);

// 0 => 'Yep Nopé [:ynope]', 1 => 'Yep Nopé', 2 => 'ynope'

You can see my simple test bed here. If you're afraid that other characters might seep in, or don't trust \pL, you could list every special letter manually (i.e. [A-zàáâä....])

One of the nice parts of working at a truly global organization like Mozilla is that I'm exposed to many edge cases; in this case, a few special letters!

Recent Features

  • By
    Camera and Video Control with HTML5

    Client-side APIs on mobile and desktop devices are quickly providing the same APIs.  Of course our mobile devices got access to some of these APIs first, but those APIs are slowly making their way to the desktop.  One of those APIs is the getUserMedia API...

  • By
    5 Awesome New Mozilla Technologies You’ve Never Heard Of

    My trip to Mozilla Summit 2013 was incredible.  I've spent so much time focusing on my project that I had lost sight of all of the great work Mozillians were putting out.  MozSummit provided the perfect reminder of how brilliant my colleagues are and how much...

Incredible Demos

  • By
    TextboxList for MooTools and jQuery by Guillermo Rauch

    I'll be honest with you: I still haven't figured out if I like my MooTools teammate Guillermo Rauch. He's got a lot stacked up against him. He's from Argentina so I get IM'ed about 10 times a day about how great Lionel...

  • By
    Modal-Style Text Selection with Fokus

    Every once in a while I find a tiny JavaScript library that does something very specific, very well.  My latest find, Fokus, is a utility that listens for text selection within the page, and when such an event occurs, shows a beautiful modal dialog in...

Discussion

  1. [A-z] doesn’t do what you seem to quite what you think it does. That character range includes the characters in the ASCII table between Z and a: [\]^_. It looks like you should be using [A-Za-z].

Wrap your code in <pre class="{language}"></pre> tags, link to a GitHub gist, JSFiddle fiddle, or CodePen pen to embed!