Regular Expressions for the Rest of Us


Sooner or later you'll run across a regular expression. With their cryptic syntax, confusing documentation and massive learning curve, most developers settle for copying and pasting them from StackOverflow and hoping they work. But what if you could decode regular expressions and harness their power? In this article, I'll show you why you should take a second look at regular expressions, and how you can use them in the real world.

Why Regular Expressions?

Why bother with regular expressions at all? Why should you care?

  • Matching: Regular expressions are great at determining if a string matches some format, such as a phone number, email or credit card number.
  • Replacement: Regular expressions make it easy to find and replace patterns in a string. For example, text.replace(/\s+/g, " ") replaces all chunks of whitespace in text, such as " \n\t ", with a single space.
  • Extraction: It's easy to extract pieces of information from a pattern with regular expressions. For example, name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i)[1] extracts a person's title from a string, such as "Mr" from "Mr. Schropp".
  • Portability: Almost every major language has a regular expression library. The syntax is mostly standardized, so you don't have to worry about relearning regexes when you switch languages.
  • Coding: When writing code, you can use regular expressions to search through files with tools such as find and replace in Atom or ack in the command line.
  • Clear and Concise: If you're comfortable with regular expressions, you can perform some pretty tricky operations with a very small amount of code.
  • Fame and Glory: Regular expressions will give you superpowers.
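As a rough illustration of the matching, replacement and extraction bullets above, here is a minimal sketch. The helper name extractTitle and the sample strings are made up for this example.

```javascript
// Matching: test() returns true or false.
const looksLikeTenDigits = /^\d{10}$/.test("5558675309");

// Replacement: collapse every run of whitespace into a single space.
const collapsed = "a \n\t b".replace(/\s+/g, " ");

// Extraction: match() returns an array whose index 1 holds the first group,
// or null when nothing matches, so guard before indexing.
// (extractTitle is a hypothetical helper, not part of any library.)
function extractTitle(name) {
  const match = name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i);
  return match ? match[1] : null;
}

console.log(looksLikeTenDigits);             // true
console.log(collapsed);                      // "a b"
console.log(extractTitle("Mr. Schropp"));    // "Mr"
console.log(extractTitle("Landon Schropp")); // null
```

The null check matters: indexing the result of match() directly throws a TypeError when the string has no title.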

How to Write Regular Expressions

The best way to learn regular expressions is by using an example. Let's say you're building a web page with a phone number input. Because you're a rockstar developer, you decide to display a checkmark when the phone number is valid and an X when it's invalid.

See the Pen Regular Expression Demo by Landon Schropp (@LandonSchropp) on CodePen.

<input id="phone-number" type="text">
<label class="valid" for="phone-number"><img src="check.svg"></label>
<label class="invalid" for="phone-number"><img src="x.svg"></label>

input:not([data-validation="valid"]) ~ label.valid,
input:not([data-validation="invalid"]) ~ label.invalid {
  display: none;
}

$("input").on("input blur", function(event) {
  if (isPhoneNumber($(this).val())) {
    $(this).attr({ "data-validation": "valid" });
    return;
  }

  if (event.type == "blur") {
    $(this).attr({ "data-validation": "invalid" });
  }
  else {
    $(this).removeAttr("data-validation");
  }
});

With the above code, whenever a person types or pastes a valid number into the input, the check image is displayed. When the user blurs the input and the value is invalid, the error X is displayed.

Since you know that phone numbers are made up of ten digits, your first pass at isPhoneNumber looks like this:

function isPhoneNumber(string) {
  return /\d\d\d\d\d\d\d\d\d\d/.test(string);
}

This function contains a regular expression between the / characters with ten \d's, or digit characters. The test method returns true if the regex matches the string and false if it doesn't. If you run isPhoneNumber("5558675309"), it returns true! Woohoo!

However, writing ten \d's is a little redundant. Luckily, you can use curly braces to accomplish the same thing: \d{10} means exactly ten digits.

function isPhoneNumber(string) {
  return /\d{10}/.test(string);
}

Sometimes, when people type in phone numbers, they start with a leading 1. Wouldn't it be nice if your regex could handle those cases? You can with the ? character!

function isPhoneNumber(string) {
  return /1?\d{10}/.test(string);
}

The ? symbol means zero or one, so now isPhoneNumber returns true for both "5558675309" and "15558675309"!

So far, isPhoneNumber is pretty good, but you're missing one key thing: regexes are more than happy to match parts of a string. As it stands, isPhoneNumber("555555555555555555") returns true because that string contains ten consecutive digits. You can fix this problem by using the ^ and $ anchors.

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string);
}

Roughly, ^ matches the beginning of the string and $ matches the end, so now your regex will match the whole phone number.
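To see the difference the anchors make, this small sketch runs the unanchored and anchored patterns side by side. The variable names and the sample strings are only for illustration.

```javascript
// The same pattern with and without anchors.
const unanchored = /1?\d{10}/;
const anchored = /^1?\d{10}$/;

// Sample inputs chosen for this sketch.
const anchorSamples = [
  "5558675309",          // ten digits
  "15558675309",         // leading 1 plus ten digits
  "555555555555555555",  // eighteen digits
  "call 5558675309 now"  // ten digits buried in other text
];

for (const sample of anchorSamples) {
  // Columns: input, unanchored result, anchored result.
  console.log(sample, unanchored.test(sample), anchored.test(sample));
}
```

Both patterns accept the first two strings, but only the unanchored one accepts the last two, since it is satisfied by any ten-digit run inside a longer string.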

Getting Serious

You released your page, and it's a smashing success, but there's one major problem. In the U.S., there are many common ways to write a phone number:

  • (234) 567-8901
  • 234-567-8901
  • 234.567.8901
  • 234/567-8901
  • 234 567 8901
  • +1 (234) 567-8901
  • 1-234-567-8901

While your users could leave out the punctuation, it's much easier for them to type out a formatted number.

While you could write a regular expression to handle all of those formats, it's probably a bad idea. Even if you nail every format in this list, it's very easy to miss one. Besides, you really only care about the data, not how it's formatted. So, instead of worrying about punctuation, why not strip it out?

function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string.replace(/\D/g, ""));
}

The replace function swaps every match of \D, a character class that matches any single non-digit character, for an empty string. The g, or global, flag tells the function to replace all matches of the regular expression instead of just the first.
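As a quick check, the sketch below runs the stripping approach over the formats listed earlier. It repeats isPhoneNumber so it stands alone; the phoneFormats array is just that list written out.

```javascript
function isPhoneNumber(string) {
  // Drop everything that is not a digit, then validate what is left.
  return /^1?\d{10}$/.test(string.replace(/\D/g, ""));
}

// The formats from the list above.
const phoneFormats = [
  "(234) 567-8901",
  "234-567-8901",
  "234.567.8901",
  "234/567-8901",
  "234 567 8901",
  "+1 (234) 567-8901",
  "1-234-567-8901"
];

for (const format of phoneFormats) {
  // Each line shows the input, the digits that survive, and the verdict.
  console.log(format, "->", format.replace(/\D/g, ""), isPhoneNumber(format));
}
```

Every format in the list reduces to either 2345678901 or 12345678901, so each one passes. The trade-off is that letters are discarded too, which means a string like "abc2345678901xyz" is also accepted.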

Getting Even More Serious

Everybody loves your phone number page, and you're the king of the water cooler at work. However, being the pro that you are, you want to take things one step further.

The North American Numbering Plan is the phone number standard used in the U.S., Canada, and twenty-three other countries. This system has a few simple rules:

  1. A phone number ((234) 567-8901) is broken up into three pieces: The area code (234), the exchange code (567) and the subscriber number (8901).
  2. For the area code and exchange code, the first digit can be 2 through 9 and the second and third digits can be 0 through 9.
  3. The exchange code cannot have 1 as the third digit if 1 is also the second digit.

Your regex already works for the first rule, but it breaks the second and third. For now, let's only worry about the second rule. The new regular expression needs to look something like the following:

/^1?<AREA CODE><EXCHANGE CODE><SUBSCRIBER NUMBER>$/

The subscriber number is easy; it's four digits.

/^1?<AREA CODE><EXCHANGE CODE>\d{4}$/

The area code is a little trickier. You need a number between 2 and 9, followed by two digits. To accomplish that, you can use a character set! A character set lets you specify a group of characters to choose from.

/^1?[23456789]\d\d<EXCHANGE CODE>\d{4}$/

That's great, but it's annoying to type out all the characters between 2 and 9. Clean it up with a character range.

/^1?[2-9]\d\d<EXCHANGE CODE>\d{4}$/

That's better! Since the exchange code is the same as the area code, you could duplicate your regex to finish off the number.

/^1?[2-9]\d\d[2-9]\d\d\d{4}$/

But wouldn't it be nice if you didn't have to copy and paste the area code section of your regex? You can simplify it by using a group! Groups are formed by wrapping characters in parentheses.

/^1?([2-9]\d\d){2}\d{4}$/

Now, [2-9]\d\d is contained in a group and {2} specifies that that group should occur twice.
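If you want to convince yourself that the grouped form behaves like the duplicated one, a small comparison along these lines should do. The sample inputs are arbitrary choices for this sketch.

```javascript
// The copy-and-paste version and the grouped version of the same pattern.
const duplicated = /^1?[2-9]\d\d[2-9]\d\d\d{4}$/;
const grouped = /^1?([2-9]\d\d){2}\d{4}$/;

// Arbitrary digit strings picked for this comparison.
const groupSamples = [
  "2345678901",   // valid ten digits
  "12345678901",  // valid with leading 1
  "1345678901",   // only nine digits after the leading 1
  "2341678901",   // exchange code starts with 1
  "234567890"     // too short
];

for (const sample of groupSamples) {
  // Columns: input, duplicated result, grouped result.
  console.log(sample, duplicated.test(sample), grouped.test(sample));
}
```

The two patterns agree on every sample: the first two strings pass and the other three fail.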

That's it! Here's what the final isPhoneNumber function looks like:

function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}
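The function above leaves the third rule unhandled. One possible way to cover it, which is not part of the walkthrough above, is a negative lookahead: (?!\d11) succeeds only when the next three characters are not a digit followed by 11. The function name isNanpNumber is hypothetical, and this is a sketch rather than a complete NANP validator.

```javascript
// Hypothetical variant of isPhoneNumber that also applies rule 3.
function isNanpNumber(string) {
  const digits = string.replace(/\D/g, "");

  // Optional 1, area code, then an exchange code that must not be
  // of the form N11 (checked by the lookahead), then four digits.
  return /^1?[2-9]\d\d(?!\d11)[2-9]\d\d\d{4}$/.test(digits);
}

console.log(isNanpNumber("(234) 567-8901"));    // true
console.log(isNanpNumber("1 (234) 567-8901"));  // true
console.log(isNanpNumber("(234) 911-8901"));    // false, exchange code is 911
console.log(isNanpNumber("(234) 561-8901"));    // true, only the third digit is 1
```

Because the lookahead no longer lets the area code and exchange code share one repeated group, the pattern is written out in full again, which hints at how quickly extra rules erode readability.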

When to Avoid Regular Expressions

Regular expressions are great, but there are some problems you just shouldn't tackle with them.

  • Don't be too strict. There's little value in being too strict with regular expressions. For phone numbers, even if we did match all of the rules in NANP, there's still no way to know if a phone number is real. If I rattled off the number (555) 555-5555, it matches the pattern but it's not a real phone number.
  • Don't write an HTML parser. While it's fine to use regexes to parse simple things, they're not useful for parsing entire languages. Without getting too technical, you're not going to have a good time parsing non-regular languages with regular expressions.
  • Don't use them for really complicated strings. The full regex for emails is 6,318 characters long. A simple, imperfect one looks like this: /^[^@]+@[^@]+\.[^@\.]+$/. As a general rule of thumb, if your regular expression is longer than a line of code, it might be time to look for another solution.
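To get a feel for what "simple, imperfect" means for the short email pattern, here is a small sketch that runs it over a few made-up addresses. The variable names and sample strings are assumptions for illustration only.

```javascript
// The short email pattern quoted in the list above.
const simpleEmail = /^[^@]+@[^@]+\.[^@\.]+$/;

// Made-up sample addresses.
const emailSamples = [
  "user@example.com",
  "user@example.co.uk",
  "user@localhost",
  "a@b@example.com",
  "not an email@example.com"
];

for (const sample of emailSamples) {
  // Columns: input, whether the pattern accepts it.
  console.log(sample, simpleEmail.test(sample));
}
```

The pattern accepts the first two addresses and rejects the next two, since one has no dot after the @ and the other has two @ signs. It also accepts the last string, spaces and all, which is the kind of gap a short pattern leaves open.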

Wrapping Up

In this article, you've learned when to use regular expressions and when to avoid them, and you've experienced the process of writing one. Hopefully regular expressions seem a bit less ominous, and maybe even intriguing. If you use a regex to solve a tricky problem, let me know in the comments!

If you'd like to read more about regular expressions, check out the excellent MDN Regular Expressions Guide.

Landon Schropp

About Landon Schropp

Landon is a developer and entrepreneur based in Seattle. He's the author of the Free Flexbox Starter Course and Unraveling Flexbox, a book on how to create modern, responsive layouts in CSS. He's passionate about building simple apps people love to use.

Discussion

  1. Fredrik

    And then try to use the same regular expression on international phone numbers, postal codes etc. ;-)

    • Ha ha, that’s where it starts to get a little bit more complicated.

  2. A (x) symbol is not a good choice as an indicator for invalid input, as that exact symbol is already widely used for clearing input fields. You even put it at the same spot (at the right edge inside the field).

    • Yep, I agree. I was just trying to keep the example light.

  3. I remember, as a programmer, I wrote long pieces of code to get my search scripts right, more were the parameters, more was the complexity and lower was the speed. Learning regular expressions was the best decision in my life, it made the search scripts simpler and faster. You gave a very nice teaser of sorts for the regular expressions.

  4. If you want a nice regexp debugger: http://debuggex.com
    That’s always helpful to get a graphical feedback from a regexp

  5. Justice

    Are Regular Expressions the same in all programming languages? For example, is Regular Expression syntax in PHP the same with the JAVASCRIPT or any other programming language?

    • Michael Ash

      No, regular expressions are not the same across languages. The basic syntax is mostly the same, but feature support and advanced syntax vary widely.

    • Michael’s right. Different languages support different features, and sometimes the syntax is different. However, the core concepts are usually the same and should translate.

  6. @Landon you really made regx very easy. I bet, after going through this article programmers are gonna use it no matter how slow or inefficient(some cases) is this. I believe it’s a very good piece of programming art for particularly beginner or intermediate level. Because the way you squeezed repetition in a concise code with a perfect presentation is just a fine example of code re-usability, need for a new function(in case of code anomaly) and simplicity.

  7. Tony

    Took me a few tries (and a few more when I discovered that the JS regex engine doesn’t support (?ifthen|else)), but I think I have an expression that covers all three of the NANP rules. It’s not too terribly lengthy, character-count-wise, but it’s a *fantastic* example of how quickly regex can go from clear and concise to hieroglyphics that get copy/pasted with fingers crossed. Here’s the expression:

  8. Rituparna

    Ohh my God! I have just found the thing that i was looking for a long time.. I can teach myself anything except this RegEx..

    Thanks a ton :) I will practice thousands time

  9. /^[^@]+@[^@]+\.[^@\.]+$/

    I realise the above is just a quick & imperfect example for emails, but it would exclude domains like “.co.uk” which are very common in some countries, so maybe worth amending to be “too lax” rather than “too strict” to make it safe to use? The below should allow the two dot domains, though perhaps it’s simpler just to leave out “not .” in the group after the first dot match.

    /^[^@]+@[^@]+\.[^@\.]+\.?[^@\.]*$/
