Match Special Letters with PHP Regular Expressions

Regular expressions come with all sorts of peculiarities, one of which I recently ran into when writing a regex for PHP's preg_match. I was trying to parse strings in the format "Real Name (:username)" when I hit a problem I would see a lot at Mozilla: my regular expression wasn't properly catching "special" or "international" letters like à, é, ü, and dozens of others.

My regular expression was using A-Za-z in the real name matching piece of the regex, which I assumed would match special letters, but it did not:

preg_match(
  "/([A-Za-z -]+)?\s?\[?\(?:([A-Za-z0-9\-\_]+)\)?\]?/", 
  "Yep Nopé [:ynope]", $matches);

// 0 => ' [:ynope]', 1 => ' ', 2 => 'ynope'
// The real name is lost: without the u modifier, "é" is two bytes that A-Za-z cannot match

To match international letters, I needed to update my regular expression in two ways:

  • Change A-Za-z to \pL (the Unicode "letter" property) within the matching piece
  • Add the u modifier, which makes PCRE treat the pattern and the subject string as UTF-8

The updated regex would be:

preg_match(
  "/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u", 
  "Yep Nopé [:ynope]", $matches);

// 0 => 'Yep Nopé [:ynope]', 1 => 'Yep Nopé ' (trailing space included), 2 => 'ynope'
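
To make the updated pattern easier to reuse, here is a minimal sketch that wraps it in a function. The name parseNameAndUsername and the shape of the returned array are my own assumptions for illustration, not something from the original code; the pattern itself is the one shown above, written in a single-quoted string. Because the name class also matches spaces, the captured name can carry a trailing space, so the sketch trims it.

```php
<?php
// Hypothetical helper (the name and return shape are assumptions):
// splits "Real Name [:username]" or "Real Name (:username)" into two parts
// using the \pL + u-modifier pattern from the post.
function parseNameAndUsername(string $input): ?array
{
    $pattern = '/([\pL -]+)?\s?\[?\(?:([\pL0-9\-_]+)\)?\]?/u';

    // preg_match returns 0 on no match and false on error (e.g. invalid UTF-8).
    if (preg_match($pattern, $input, $matches) !== 1) {
        return null;
    }

    return [
        // The greedy name class also matches spaces, so trim the capture.
        'name'     => trim($matches[1]),
        'username' => $matches[2],
    ];
}

var_dump(parseNameAndUsername('Yep Nopé [:ynope]'));
// ['name' => 'Yep Nopé', 'username' => 'ynope']

var_dump(parseNameAndUsername('Jürgen Müller (:jmuller)'));
// ['name' => 'Jürgen Müller', 'username' => 'jmuller']
```

The source file and the input strings both need to be UTF-8 for this to behave as described.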

You can see my simple test bed here. If you're afraid that other characters might seep in, or you don't trust \pL, you could list every special letter manually (e.g. [A-Za-zàáâä....]).
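
As a rough sketch of the manual approach, the snippet below checks a name against an explicit list of letters. The function name isAllowedName and the particular set of accented letters are assumptions chosen for illustration; you would extend the list with the letters you actually expect. Note that the u modifier is still used here, so that each accented letter in the class is treated as one character rather than as separate bytes.

```php
<?php
// Hypothetical whitelist check (function name and letter list are assumptions).
// The file must be saved as UTF-8 so the literals below are valid UTF-8.
function isAllowedName(string $name): bool
{
    // Only ASCII letters, a short list of accented letters, space, and hyphen.
    $pattern = '/^[A-Za-zàáâäçèéêëñöü -]+$/u';

    return preg_match($pattern, $name) === 1;
}

var_dump(isAllowedName('Yep Nopé')); // bool(true): é is in the list
var_dump(isAllowedName('Yep Nopø')); // bool(false): ø is not in the list
```

The trade-off is that every letter you forget to list is rejected, whereas \pL accepts any Unicode letter.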

One of the nice parts of working at a truly global organization like Mozilla is that I'm exposed to many edge cases; in this case, a few special letters!

Discussion

  1. [A-z] doesn’t do quite what you seem to think it does. That character range also includes the characters that sit between Z and a in the ASCII table: [, \, ], ^, _, and `. It looks like you should be using [A-Za-z].
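
     A quick way to check this is the small sketch below. The variable names are only illustrative; the test string holds the six ASCII characters between Z (0x5A) and a (0x61).

```php
<?php
// The six ASCII characters between "Z" and "a": [ \ ] ^ _ `
$between = '[\\]^_`';

// [A-z] spans 0x41 to 0x7A, so it accepts all six punctuation characters.
$loose = preg_match('/^[A-z]+$/', $between);

// [A-Za-z] covers only the two letter ranges, so it rejects them.
$strict = preg_match('/^[A-Za-z]+$/', $between);

var_dump($loose);  // int(1)
var_dump($strict); // int(0)
```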