Match Special Letters with PHP Regular Expressions
Regular expressions come with all sorts of peculiarities, one of which I recently ran into while writing a regex for PHP's preg_match. I was trying to parse strings with the format "Real Name (:username)" when I ran into a problem I would see a lot at Mozilla: my regular expression wasn't properly catching "special" or "international" letters like à, é, ü, and dozens of others.
My regular expression was using A-Za-z in the real name matching piece of the regex, which I assumed would match special letters, but it did not:
preg_match(
  "/([A-Za-z -]+)?\s?\[?\(?:([A-Za-z0-9\-\_]+)\)?\]?/",
  "Yep Nopé [:ynope]",
  $matches
);
// The real name is lost; the match only starts after the "é":
// 0 => ' [:ynope]', 1 => ' ', 2 => 'ynope'
To match international letters, I needed to update my regular expression in two ways:
- Change A-Za-z to \pL within the matching piece
- Add the u modifier, which makes the pattern and subject be treated as UTF-8
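To see the difference on a single character, here is a small sketch. The variable names are illustrative and not part of the original code, and it assumes the file is saved as UTF-8, so that "é" is stored as two bytes.

```php
<?php
// Illustrative sketch: an ASCII range versus \pL with the u modifier.
// Assumes this file is saved as UTF-8, so "é" is the two bytes 0xC3 0xA9.

$letter = "é";

// Without u, PCRE walks the subject byte by byte; neither byte of "é"
// is an ASCII letter, so the ASCII range finds nothing.
$asciiRange = preg_match('/^[A-Za-z]$/', $letter);

// With u, the subject is decoded as UTF-8 and \pL matches any Unicode letter.
$unicodeLetter = preg_match('/^\pL$/u', $letter);

// \pL still rejects digits and punctuation.
$digit = preg_match('/^\pL$/u', "7");

var_dump($asciiRange, $unicodeLetter, $digit); // int(0), int(1), int(0)
```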
The updated regex would be:
preg_match(
  "/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u",
  "Yep Nopé [:ynope]",
  $matches
);
// 0 => 'Yep Nopé [:ynope]', 1 => 'Yep Nopé ' (with a trailing space), 2 => 'ynope'
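If you need this in more than one place, the pattern could be wrapped in a small helper. The function name parseContributor and the returned array keys are hypothetical, chosen only for this sketch. Because the name class contains the space character, the first capture can carry a trailing space, so the sketch trims it.

```php
<?php
// Hypothetical helper (not from the original post): split "Real Name [:username]"
// or "Real Name (:username)" into its two parts using the updated pattern.
function parseContributor(string $input): ?array
{
    $pattern = '/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u';

    // preg_match returns 0 when nothing matches and false on invalid UTF-8.
    if (preg_match($pattern, $input, $matches) !== 1) {
        return null;
    }

    return [
        // The name class includes the space, so the capture can end with one.
        'name'     => trim($matches[1]),
        'username' => $matches[2],
    ];
}

var_dump(parseContributor("Yep Nopé [:ynope]"));
```

Returning null for both "no match" and "invalid UTF-8" keeps the caller simple; if you need to tell the two apart, preg_last_error() is one way to check.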
You can see my simple test bed here. If you're afraid that other characters might seep in, or don't trust \pL, you could list every special letter manually (e.g. [A-Za-zàáâä....]).
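One thing to keep in mind with the manual route: the u modifier is still needed, because each accented letter is more than one byte in UTF-8 and the character class would otherwise be built from individual bytes. A rough sketch follows; the letter list is a short made-up sample rather than a complete set, and $explicitLetters is just an illustrative name.

```php
<?php
// Sketch only: the accented letters below are a small sample, not a full list.
// Assumes this file is saved as UTF-8.
$explicitLetters = '/^[A-Za-zàáâäé -]+$/u';

$listed   = preg_match($explicitLetters, "Yep Nopé"); // every character is in the class
$unlisted = preg_match($explicitLetters, "Yep Nopø"); // "ø" was never listed

var_dump($listed, $unlisted); // int(1), int(0)
```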
One of the nice parts of working at a truly global organization like Mozilla is that I'm exposed to many edge cases; in this case, a few special letters!
[A-z] doesn't do quite what you seem to think it does. That character range also includes the characters in the ASCII table between Z and a: [\]^_`. It looks like you should be using [A-Za-z].
Good point! Updated!
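The range described in the comment above is easy to check for yourself. This is a quick sketch (the variable names are mine) that loops over the ASCII table and collects every character that [A-z] accepts but [A-Za-z] rejects:

```php
<?php
// Collect every ASCII character that [A-z] accepts but [A-Za-z] rejects.
$extras = '';
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    $inWideRange  = preg_match('/^[A-z]$/', $char) === 1;
    $inLetterOnly = preg_match('/^[A-Za-z]$/', $char) === 1;
    if ($inWideRange && !$inLetterOnly) {
        $extras .= $char;
    }
}

echo $extras, PHP_EOL; // [\]^_`
```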