David Walsh Blog

Regular Expressions for the Rest of Us

Sooner or later you’ll run across a regular expression. With their cryptic syntax, confusing documentation and massive learning curve, most developers settle for copying and pasting them from StackOverflow and hoping they work. But what if you could decode regular expressions and harness their power? In this article, I’ll show you why you should take a second look at regular expressions, and how you can use them in the real world.

Why Regular Expressions?

Why bother with regular expressions at all? Why should you care?

How to Write Regular Expressions

The best way to learn regular expressions is by using an example. Let’s say you’re building a web page with a phone number input. Because you’re a rockstar developer, you decide to display a checkmark when the phone number is valid and an X when it’s invalid.

See the Pen Regular Expression Demo by Landon Schropp (@LandonSchropp) on CodePen.

<!-- HTML: the input plus one label per validation state -->
<input id="phone-number" type="text">
<label class="valid" for="phone-number"><img src="check.svg"></label>
<label class="invalid" for="phone-number"><img src="x.svg"></label>

/* CSS: hide each label unless the input is in the matching state */
input:not([data-validation="valid"]) ~ label.valid,
input:not([data-validation="invalid"]) ~ label.invalid {
  display: none;
}

// JavaScript (jQuery): update the validation state on typing and on blur
$("input").on("input blur", function(event) {
  if (isPhoneNumber($(this).val())) {
    $(this).attr({ "data-validation": "valid" });
    return;
  }

  // Only show the error once the user has left the field.
  if (event.type === "blur") {
    $(this).attr({ "data-validation": "invalid" });
  }
  else {
    $(this).removeAttr("data-validation");
  }
});

With the above code, whenever a person types or pastes a valid number into the input, the check image is displayed. When the user blurs the input and the value is invalid, the error X is displayed.
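The branching in that handler can be easier to follow when it is pulled out into a pure function. The sketch below assumes a hypothetical helper, validationState, which is not part of the original demo. It takes a validity flag and the event type, and returns the value the data-validation attribute should have, or null when the attribute should be removed.

```javascript
// Hypothetical helper (not in the original demo): decides what the
// data-validation attribute should be for a given event.
function validationState(isValid, eventType) {
  if (isValid) {
    return "valid";
  }
  // Only flag the value as invalid once the user leaves the field.
  return eventType === "blur" ? "invalid" : null;
}

console.log(validationState(true, "input"));  // "valid"
console.log(validationState(false, "input")); // null
console.log(validationState(false, "blur"));  // "invalid"
```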

Since you know that phone numbers are made up of ten digits, your first pass at isPhoneNumber looks like this:

function isPhoneNumber(string) {
  return /\d\d\d\d\d\d\d\d\d\d/.test(string);
}

This function contains a regular expression between the / characters with ten \d’s, each of which matches a single digit. The test method returns true if the regex matches the string and false if it doesn’t. If you run isPhoneNumber("5558675309"), it returns true! Woohoo!

However, writing ten \d’s is a little redundant. Luckily, you can use curly braces to accomplish the same thing: \d{10} means exactly ten digits in a row.

function isPhoneNumber(string) {
  return /\d{10}/.test(string);
}
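As a quick sanity check, the sketch below runs the long and short forms of the pattern against a few strings and prints both results side by side. The sample values are made up for illustration.

```javascript
const longForm = /\d\d\d\d\d\d\d\d\d\d/;
const shortForm = /\d{10}/;

// Made-up sample inputs: ten digits, nine digits, digits inside text, no digits.
const samples = ["5558675309", "555867530", "call 5558675309 now", "abc"];

// Both patterns should give the same answer for every sample.
const results = samples.map((s) => [s, longForm.test(s), shortForm.test(s)]);
for (const [sample, longResult, shortResult] of results) {
  console.log(sample, longResult, shortResult);
}
```

Note that the third sample matches even though it has extra text around the number, since neither pattern says anything about what surrounds the ten digits.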

Sometimes, when people type in phone numbers, they start with a leading 1. Wouldn’t it be nice if your regex could handle those cases? You can with the ? character!

function isPhoneNumber(string) {
  return /1?\d{10}/.test(string);
}

The ? symbol means zero or one of whatever comes right before it, so now isPhoneNumber returns true for both "5558675309" and "15558675309"!
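To see what the optional 1 actually consumes, it can help to look at the matched text instead of just true or false. This is a small sketch using the standard match method; the variable names are only for illustration.

```javascript
const optionalOne = /1?\d{10}/;

// With a leading 1, the 1? part consumes it and \d{10} takes the rest.
const withOne = "15558675309".match(optionalOne)[0];
// Without a leading 1, the 1? part matches nothing and \d{10} does all the work.
const withoutOne = "5558675309".match(optionalOne)[0];

console.log(withOne);    // "15558675309"
console.log(withoutOne); // "5558675309"
console.log(optionalOne.test("555867530")); // false, only nine digits
```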

So far, isPhoneNumber is pretty good, but you’re missing one key thing: regexes are more than happy to match just part of a string. As it stands, isPhoneNumber("555555555555555555") returns true because that string contains a run of ten digits. You can fix this problem by using the ^ and $ anchors.

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string);
}

Roughly, ^ matches the beginning of the string and $ matches the end, so now your regex will match the whole phone number.
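Here is a short side-by-side sketch of the unanchored and anchored patterns. The test strings are made up, and the helper name compare is just for this example.

```javascript
const unanchored = /1?\d{10}/;
const anchored = /^1?\d{10}$/;

// Hypothetical helper: returns both results for one input.
function compare(input) {
  return { unanchored: unanchored.test(input), anchored: anchored.test(input) };
}

console.log(compare("5558675309"));         // { unanchored: true, anchored: true }
console.log(compare("555555555555555555")); // { unanchored: true, anchored: false }
console.log(compare("25558675309"));        // { unanchored: true, anchored: false }
```

The last string has eleven digits but does not start with 1, so the anchored pattern has no way to account for the extra digit and rejects it.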

Getting Serious

You released your page, and it’s a smashing success, but there’s one major problem. In the U.S., there are many common ways to write a phone number:

While your users could leave out the punctuation, it’s much easier for them to type out a formatted number.

While you could write a regular expression to handle all of those formats, it’s probably a bad idea. Even if you nail every format in this list, it’s very easy to miss one. Besides, you really only care about the data, not how it’s formatted. So, instead of worrying about punctuation, why not strip it out?

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string.replace(/\D/g, ""));
}

The replace function swaps every match of \D, a character class that matches any single non-digit character, for an empty string. The g, or global flag, tells the function to replace all matches of the regular expression instead of just the first.
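The effect of the g flag is easy to see in isolation. In the sketch below, stripNonDigits is a hypothetical helper name and the formatted numbers are made-up examples, not an exhaustive list of formats.

```javascript
// Hypothetical helper: removes everything that is not a digit.
function stripNonDigits(string) {
  return string.replace(/\D/g, "");
}

console.log(stripNonDigits("(555) 867-5309")); // "5558675309"
console.log(stripNonDigits("555.867.5309"));   // "5558675309"
console.log(stripNonDigits("1-555-867-5309")); // "15558675309"

// Without the g flag, only the first non-digit (the opening parenthesis) is removed.
const firstOnly = "(555) 867-5309".replace(/\D/, "");
console.log(firstOnly); // "555) 867-5309"
```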

Getting Even More Serious

Everybody loves your phone number page, and you’re the king of the water cooler at work. However, being the pro that you are, you want to take things one step further.

The North American Numbering Plan is the phone number standard used in the U.S., Canada, and twenty-three other countries. This system has a few simple rules:

  1. A phone number ((234) 567-8901) is broken up into three pieces: The area code (234), the exchange code (567) and the subscriber number (8901).
  2. For the area code and exchange code, the first digit can be 2 through 9 and the second and third digits can be 0 through 9.
  3. The exchange code cannot have 1 as the third digit if 1 is also the second digit.
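Before turning these rules into a regex, it can be useful to write them out longhand so there is something to compare against. The sketch below is a hypothetical reference check, followsNanpRules, written without any regex. It assumes the input has already been reduced to a string of exactly ten digits, and it only covers the three rules as stated above.

```javascript
// Hypothetical reference check, written without regex. Assumes the input
// is already a string of exactly ten digits.
function followsNanpRules(digits) {
  // Rule 1: split off the area code and the exchange code.
  const areaCode = digits.slice(0, 3);
  const exchangeCode = digits.slice(3, 6);

  // Rule 2: the first digit of both codes must be 2 through 9.
  const startsWith2To9 = (code) => code[0] >= "2" && code[0] <= "9";
  if (!startsWith2To9(areaCode) || !startsWith2To9(exchangeCode)) {
    return false;
  }

  // Rule 3: the exchange code cannot have 1 as both its second and third digit.
  return !exchangeCode.endsWith("11");
}

console.log(followsNanpRules("2345678901")); // true
console.log(followsNanpRules("1345678901")); // false, area code starts with 1
console.log(followsNanpRules("2345118901")); // false, exchange code ends in 11
```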

Your regex already works for the first rule, but it breaks the second and third. For now, let’s only worry about the second rule. The new regular expression needs to look something like the following:

/^1?<AREA CODE><EXCHANGE CODE><SUBSCRIBER NUMBER>$/

The subscriber number is easy; it’s four digits.

/^1?<AREA CODE><EXCHANGE CODE>\d{4}$/

The area code is a little trickier. You need a number between 2 and 9, followed by two digits. To accomplish that, you can use a character set! A character set lets you specify a group of characters to choose from.

/^1?[23456789]\d\d<EXCHANGE CODE>\d{4}$/

That’s great, but it’s annoying to type out all the characters between 2 and 9. Clean it up with a character range.

/^1?[2-9]\d\d<EXCHANGE CODE>\d{4}$/
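If you want to convince yourself that the set and the range accept the same characters, one option is to filter the ten digits through each of them. This is only a sketch; the variable names are for illustration.

```javascript
const digits = [..."0123456789"];

// Keep only the digits that each pattern accepts as a single character.
const fromSet = digits.filter((d) => /^[23456789]$/.test(d)).join("");
const fromRange = digits.filter((d) => /^[2-9]$/.test(d)).join("");

console.log(fromSet);   // "23456789"
console.log(fromRange); // "23456789"
```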

That’s better! Since the exchange code is the same as the area code, you could duplicate your regex to finish off the number.

/^1?[2-9]\d\d[2-9]\d\d\d{4}$/

But wouldn’t it be nice if you didn’t have to copy and paste the area code section of your regex? You can simplify it by using a group! Groups are formed by wrapping characters in parentheses.

/^1?([2-9]\d\d){2}\d{4}$/

Now, [2-9]\d\d is contained in a group, and {2} specifies that the whole group should occur twice.
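The parentheses matter because a quantifier like {2} only applies to the single item right before it. The sketch below contrasts the grouped pattern with a version that leaves the parentheses out; both patterns are trimmed down to the area code and exchange code for illustration.

```javascript
// Without a group, {2} repeats only the last \d, so this matches four digits.
const ungrouped = /^[2-9]\d\d{2}$/;
// With a group, {2} repeats the whole three-digit chunk, so this matches six.
const grouped = /^([2-9]\d\d){2}$/;

console.log(ungrouped.test("2345"));   // true
console.log(ungrouped.test("234567")); // false
console.log(grouped.test("234567"));   // true
console.log(grouped.test("234167"));   // false, second chunk starts with 1
```

Since test only needs a yes or no answer, a non-capturing group, written (?:...), should behave the same way here.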

That’s it! Here’s what the final isPhoneNumber function looks like:

function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}
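The function above still ignores the third rule. One possible way to cover it, offered as a sketch and not as part of the original walkthrough, is a negative lookahead: (?!11) succeeds only when the next two characters are not 11, and it consumes nothing. Because the rule applies only to the exchange code, the group is written out as two separate chunks again. The name isNanpPhoneNumber is hypothetical.

```javascript
// Hypothetical variant that also enforces the third rule. The (?!11)
// lookahead rejects exchange codes whose second and third digits are both 1.
function isNanpPhoneNumber(string) {
  const digits = string.replace(/\D/g, "");
  return /^1?[2-9]\d\d[2-9](?!11)\d\d\d{4}$/.test(digits);
}

console.log(isNanpPhoneNumber("(234) 567-8901"));   // true
console.log(isNanpPhoneNumber("1-234-567-8901"));   // true
console.log(isNanpPhoneNumber("(234) 511-8901"));   // false, exchange code ends in 11
console.log(isNanpPhoneNumber("(234) 517-8901"));   // true, only one of the two digits is 1
```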

When to Avoid Regular Expressions

Regular expressions are great, but there are some problems you just shouldn’t tackle with them.

Wrapping Up

In this article, you’ve learned when to use regular expressions and when to avoid them, and you’ve experienced the process of writing one. Hopefully regular expressions seem a bit less ominous, and maybe even intriguing. If you use a regex to solve a tricky problem, let me know in the comments!

If you’d like to read more about regular expressions, check out the excellent MDN Regular Expressions Guide.