Regular Expressions for the Rest of Us
Sooner or later you'll run across a regular expression. With their cryptic syntax, confusing documentation and massive learning curve, most developers settle for copying and pasting them from StackOverflow and hoping they work. But what if you could decode regular expressions and harness their power? In this article, I'll show you why you should take a second look at regular expressions, and how you can use them in the real world.
Why Regular Expressions?
Why bother with regular expressions at all? Why should you care?
- Matching: Regular expressions are great at determining if a string matches some format, such as a phone number, email or credit card number.
- Replacement: Regular expressions make it easy to find and replace patterns in a string. For example, text.replace(/\s+/g, " ") replaces all chunks of whitespace in text, such as " \n\t ", with a single space.
- Extraction: It's easy to extract pieces of information from a pattern with regular expressions. For example, name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i)[1] extracts a person's title from a string, such as "Mr" from "Mr. Schropp".
- Portability: Almost every major language has a regular expression library. The syntax is mostly standardized, so you don't have to worry about relearning regexes when you switch languages.
- Coding: When writing code, you can use regular expressions to search through files with tools such as find and replace in Atom or ack in the command line.
- Clear and Concise: If you're comfortable with regular expressions, you can perform some pretty tricky operations with a very small amount of code.
- Fame and Glory: Regular expressions will give you superpowers.
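To make the first three bullets concrete, here's a minimal sketch. The function names (looksLikeZipCode, collapseWhitespace, extractTitle) are made up for illustration, and the ZIP code pattern is an assumption of mine rather than something covered in this article.

```javascript
// Matching: does the whole string look like a five-digit ZIP code?
// (Hypothetical example; the pattern is an illustrative assumption.)
function looksLikeZipCode(string) {
  return /^\d{5}$/.test(string);
}

// Replacement: collapse every run of whitespace into a single space.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ");
}

// Extraction: pull the title out of a name, or return null when there is none.
// String.prototype.match returns null on failure, so check before indexing.
function extractTitle(name) {
  const match = name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i);
  return match ? match[1] : null;
}

console.log(looksLikeZipCode("64108"));      // true
console.log(collapseWhitespace("a \n\t b")); // "a b"
console.log(extractTitle("Mr. Schropp"));    // "Mr"
console.log(extractTitle("Landon Schropp")); // null
```

The null check in extractTitle matters: indexing [1] directly throws when the name has no title.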
How to Write Regular Expressions
The best way to learn regular expressions is by using an example. Let's say you're building a web page with a phone number input. Because you're a rockstar developer, you decide to display a checkmark when the phone number is valid and an X when it's invalid.
<input id="phone-number" type="text">
<label class="valid" for="phone-number"><img src="check.svg"></label>
<label class="invalid" for="phone-number"><img src="x.svg"></label>

input:not([data-validation="valid"]) ~ label.valid,
input:not([data-validation="invalid"]) ~ label.invalid {
  display: none;
}

$("input").on("input blur", function(event) {
  if (isPhoneNumber($(this).val())) {
    $(this).attr({ "data-validation": "valid" });
    return;
  }

  if (event.type == "blur") {
    $(this).attr({ "data-validation": "invalid" });
  } else {
    $(this).removeAttr("data-validation");
  }
});
With the above code, whenever a person types or pastes a valid number into the input, the check image is displayed. When the user blurs the input and the value is invalid, the error X is displayed.
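If the three outcomes are hard to see inside the event handler, here's the same decision pulled out into a plain function. This is only a sketch: validationState and the stand-in validator are hypothetical names that don't appear in the demo.

```javascript
// Decide what the data-validation attribute should be for a given event.
// Returns "valid", "invalid", or null (meaning: remove the attribute).
// `isValid` is whatever validator you plug in (hypothetical parameter).
function validationState(value, eventType, isValid) {
  if (isValid(value)) {
    return "valid";
  }
  // Only complain once the user has left the field.
  return eventType === "blur" ? "invalid" : null;
}

// A stand-in validator for demonstration only.
const hasTenCharacters = (value) => value.length === 10;

console.log(validationState("5558675309", "input", hasTenCharacters)); // "valid"
console.log(validationState("555", "input", hasTenCharacters));        // null
console.log(validationState("555", "blur", hasTenCharacters));         // "invalid"
```

Holding back the "invalid" state until blur keeps the X from flashing while someone is still typing.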
Since you know that phone numbers are made up of ten digits, your first pass at isPhoneNumber looks like this:
function isPhoneNumber(string) { return /\d\d\d\d\d\d\d\d\d\d/.test(string); }
This function contains a regular expression between the / characters with ten \d's, or digit characters. The test method returns true if the regex matches the string and false if it doesn't. If you run isPhoneNumber("5558675309"), it returns true! Woohoo!
However, writing ten \d's is a little redundant. Luckily, you can use curly braces to accomplish the same thing.
function isPhoneNumber(string) { return /\d{10}/.test(string); }
Sometimes, when people type in phone numbers, they start with a leading 1. Wouldn't it be nice if your regex could handle those cases? You can with the ? character!
function isPhoneNumber(string) { return /1?\d{10}/.test(string); }
The ? symbol means zero or one of whatever comes right before it, so now isPhoneNumber returns true for both "5558675309" and "15558675309"!
So far, isPhoneNumber is pretty good, but you're missing one key thing: regexes are more than happy to match parts of a string. As it stands, isPhoneNumber("555555555555555555") returns true because that string contains ten digits in a row. You can fix this problem by using the ^ and $ anchors.
function isPhoneNumber(string) { return /^1?\d{10}$/.test(string); }
Roughly, ^ matches the beginning of the string and $ matches the end, so now your regex has to match the whole phone number.
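To see what each step bought you, here's a small sketch that runs the three versions of the regex against the same inputs. The versions object and the sample strings are just scaffolding for the comparison.

```javascript
// The three regexes from this section, in the order they were introduced.
const versions = {
  tenDigits: /\d{10}/,
  optionalOne: /1?\d{10}/,
  anchored: /^1?\d{10}$/,
};

const samples = ["5558675309", "15558675309", "555555555555555555", "555867"];

// Print one line per version showing which samples it accepts.
for (const [name, regex] of Object.entries(versions)) {
  const results = samples.map((sample) => `${sample}: ${regex.test(sample)}`);
  console.log(`${name} -> ${results.join(", ")}`);
}
```

All three versions accept the first two samples and reject the too-short one. Only the anchored version rejects the long run of fives, because the other two are happy to match ten digits somewhere inside it.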
Getting Serious
You released your page, and it's a smashing success, but there's one major problem. In the U.S., there are many common ways to write a phone number:
(234) 567-8901
234-567-8901
234.567.8901
234/567-8901
234 567 8901
+1 (234) 567-8901
1-234-567-8901
While your users could leave out the punctuation, it's much easier for them to type out a formatted number.
While you could write a regular expression to handle all of those formats, it's probably a bad idea. Even if you nail every format in this list, it's very easy to miss one. Besides, you really only care about the data, not how it's formatted. So, instead of worrying about punctuation, why not strip it out?
function isPhoneNumber(string) { return /^1?\d{10}$/.test(string.replace(/\D/g, "")); }
The replace function replaces every \D, which matches any non-digit character, with an empty string. The g, or global, flag tells the function to replace all matches of the regular expression instead of just the first.
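Here's a quick sketch that runs the stripping step on its own against the formats listed earlier. The helper name stripNonDigits is made up for this example.

```javascript
// Remove everything that is not a digit (hypothetical helper name).
function stripNonDigits(string) {
  return string.replace(/\D/g, "");
}

const formats = [
  "(234) 567-8901",
  "234-567-8901",
  "234.567.8901",
  "234/567-8901",
  "234 567 8901",
  "+1 (234) 567-8901",
  "1-234-567-8901",
];

for (const format of formats) {
  console.log(`${format} -> ${stripNonDigits(format)}`);
}

// Without the g flag, only the first non-digit is removed.
console.log("(234) 567-8901".replace(/\D/, "")); // "234) 567-8901"
```

The first five formats all come out as 2345678901 and the last two as 12345678901, which is exactly what the anchored regex expects.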
Getting Even More Serious
Everybody loves your phone number page, and you're the king of the water cooler at work. However, being the pro that you are, you want to take things one step further.
The North American Numbering Plan is the phone number standard used in the U.S., Canada, and twenty-three other countries. This system has a few simple rules:
- A phone number ((234) 567-8901) is broken up into three pieces: the area code (234), the exchange code (567) and the subscriber number (8901).
- For the area code and exchange code, the first digit can be 2 through 9 and the second and third digits can be 0 through 9.
- The exchange code cannot have 1 as the third digit if 1 is also the second digit.
Your regex already works for the first rule, but it breaks the second and third. For now, let's only worry about the second rule. The new regular expression needs to look something like the following:
/^1?<AREA CODE><EXCHANGE CODE><SUBSCRIBER NUMBER>$/
The subscriber number is easy; it's four digits.
/^1?<AREA CODE><EXCHANGE CODE>\d{4}$/
The area code is a little trickier. You need a digit between 2 and 9, followed by two more digits. To accomplish that, you can use a character set! A character set lets you specify a set of characters to choose from.
/^1?[23456789]\d\d<EXCHANGE CODE>\d{4}$/
That's great, but it's annoying to type out all the characters between 2 and 9. Clean it up with a character range.
/^1?[2-9]\d\d<EXCHANGE CODE>\d{4}$/
That's better! Since the exchange code follows the same pattern as the area code, you could duplicate that part of your regex to finish off the number.
/^1?[2-9]\d\d[2-9]\d\d\d{4}$/
But wouldn't it be nice if you didn't have to copy and paste the area code section of your regex? You can simplify it by using a group! Groups are formed by wrapping characters in parentheses.
/^1?([2-9]\d\d){2}\d{4}$/
Now, [2-9]\d\d is contained in a group and {2} specifies that the group should occur twice.
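If you want to convince yourself that the grouped version behaves like the copy-and-paste version, a rough check like the one below does the trick. The sample numbers are arbitrary picks of mine, so treat it as a spot check rather than a proof.

```javascript
// The duplicated and grouped forms of the same pattern.
const duplicated = /^1?[2-9]\d\d[2-9]\d\d\d{4}$/;
const grouped = /^1?([2-9]\d\d){2}\d{4}$/;

const samples = [
  "2345678901",  // valid area code and exchange code
  "12345678901", // same number with a leading 1
  "1345678901",  // area code would start with 1
  "2341678901",  // exchange code starts with 1
  "234567890",   // one digit short
];

for (const sample of samples) {
  console.log(sample, duplicated.test(sample), grouped.test(sample));
}
```

Both regexes accept the first two samples and reject the other three.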
That's it! Here's what the final isPhoneNumber function looks like:
function isPhoneNumber(string) { return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, "")); }
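The third rule was left out of the function above. If you're curious, one possible way to cover it is with a negative lookahead, a feature this article doesn't otherwise introduce. The sketch below is my own assumption about how to express the rule, not the only way to do it, and the name isStrictPhoneNumber is made up.

```javascript
// Hypothetical stricter variant that also enforces the third rule.
function isStrictPhoneNumber(string) {
  const digits = string.replace(/\D/g, "");
  // (?!\d11) is a negative lookahead: it rejects the match when the next
  // three digits (the exchange code) end in "11", without consuming them.
  // The group is written out twice so the lookahead applies only to the
  // exchange code.
  return /^1?[2-9]\d\d(?!\d11)[2-9]\d\d\d{4}$/.test(digits);
}

console.log(isStrictPhoneNumber("(234) 567-8901")); // true
console.log(isStrictPhoneNumber("(234) 515-8901")); // true
console.log(isStrictPhoneNumber("(234) 911-8901")); // false
console.log(isStrictPhoneNumber("(234) 511-8901")); // false
```

Notice that the group with {2} had to go: once the two halves follow different rules, you can't repeat one pattern for both.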
When to Avoid Regular Expressions
Regular expressions are great, but there are some problems you just shouldn't tackle with them.
- Don't be too strict. There's little value in being too strict with regular expressions. For phone numbers, even if we did match all of the rules in NANP, there's still no way to know if a phone number is real. If I rattled off the number (555) 555-5555, it matches the pattern but it's not a real phone number.
- Don't write an HTML parser. While it's fine to use regexes to parse simple things, they're not useful for parsing entire languages. Without getting too technical, you're not going to have a good time parsing non-regular languages with regular expressions.
- Don't use them for really complicated strings. The full regex for emails is 6,318 characters long. A simple, imperfect one looks like this: /^[^@]+@[^@]+\.[^@\.]+$/. As a general rule of thumb, if your regular expression is longer than a line of code, it might be time to look for another solution.
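The simple email regex looks worse than it is. Here's a piece-by-piece breakdown, followed by a few made-up addresses to show what it accepts and what it lets slip through. The variable name and the sample addresses are mine.

```javascript
// ^        start of the string
// [^@]+    one or more characters that are not "@"  (the part before the @)
// @        a literal "@"
// [^@]+    one or more characters that are not "@"  (the domain)
// \.       a literal "."
// [^@\.]+  one or more characters that are neither "@" nor "."  (the ending)
// $        end of the string
const simpleEmail = /^[^@]+@[^@]+\.[^@\.]+$/;

console.log(simpleEmail.test("landon@example.com"));   // true
console.log(simpleEmail.test("landon@example.co.uk")); // true
console.log(simpleEmail.test("landon@example"));       // false, no dot after the @
console.log(simpleEmail.test("landon@@example.com"));  // false, two @ signs
console.log(simpleEmail.test("a b@c.d"));              // true, which is the "imperfect" part
```

The second [^@]+ is allowed to contain dots, so multi-part domains still match. The regex only insists on exactly one @ and at least one dot somewhere after it.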
Wrapping Up
In this article, you've learned when to use regular expressions and when to avoid them, and you've experienced the process of writing one. Hopefully regular expressions seem a bit less ominous, and maybe even intriguing. If you use a regex to solve a tricky problem, let me know in the comments!
If you'd like to read more about regular expressions, check out the excellent MDN Regular Expressions Guide.
About Landon Schropp
Landon is a developer, designer and entrepreneur based in Kansas City. He's the author of Unraveling Flexbox. He's passionate about building simple apps people love to use.
And then try to use the same regular expression on international phone numbers, postal codes etc. ;-)
Ha ha, that’s where it starts to get a little bit more complicated.
A (x) symbol is not a good choice as an indicator for invalid input, as that exact symbol is already widely used for clearing input fields. You even put it at the same spot (at the right edge inside the field).
Yep, I agree. I was just trying to keep the example light.
I remember, as a programmer, I wrote long pieces of code to get my search scripts right, more were the parameters, more was the complexity and lower was the speed. Learning regular expressions was the best decision in my life, it made the search scripts simpler and faster. You gave a very nice teaser of sorts for the regular expressions.
Thanks!
If you want a nice regexp debugger: http://debuggex.com
That’s always helpful to get a graphical feedback from a regexp
Are Regular Expressions the same in all programming languages? For example, is Regular Expression syntax in PHP the same with the JAVASCRIPT or any other programming language?
No, regular expressions are not the same across languages. The basic syntax is mostly the same, but feature support and advanced syntax vary widely.
Michael’s right. Different languages support different features, and sometimes the syntax is different. However, the core concepts are usually the same and should translate.
@Landon you really made regx very easy. I bet, after going through this article programmers are gonna use it no matter how slow or inefficient(some cases) is this. I believe it’s a very good piece of programming art for particularly beginner or intermediate level. Because the way you squeezed repetition in a concise code with a perfect presentation is just a fine example of code re-usability, need for a new function(in case of code anomaly) and simplicity.
Thanks Hussain, I appreciate that. :)
Took me a few tries (and a few more when I discovered that the JS regex engine doesn’t support (?ifthen|else)), but I think I have an expression that covers all three of the NANP rules. It’s not too terribly lengthy, character-count-wise, but it’s a *fantastic* example of how quickly regex can go from clear and concise to hieroglyphics that get copy/pasted with fingers crossed. Here’s the expression:
Ohh my God! I have just found the thing that i was looking for a long time.. I can teach myself anything except this RegEx..
Thanks a ton :) I will practice thousands time
I realise the above is just a quick & imperfect example for emails, but it would exclude domains like “.co.uk” which are very common in some countries, so maybe worth amending to be “too lax” rather than “too strict” to make it safe to use? The below should allow the two dot domains, though perhaps it’s simpler just to leave out “not .” in the group after the first dot match.
Good call!
At least I could understand the basics, thanks:)
Awesome tutorial but… That email regex… To quote Princess Bride: MY GOD WHAT IS THAT THING. Could someone please explain it?