Regex is strange and powerful

by Jesse Perry on Wednesday, March 10, 2021

Like the Continuum Transfunctioner, regex is a mysterious device, and its mystery is only exceeded by its power. I am gonna try to demystify it just a wee bit in this post. I will give a few examples of common patterns to match, the regex that will match each one, and why the particular choices were made. So, here we go!

Use a tool to help you debug your regex

Here are a few tools that help you debug your regex. Each of them will let you enter some text (or it may come pre-populated with some), and then see your pattern match as you type. I think the first one has the best interface.

Match an email address like nunya@bidness.corp

Let’s match my favorite email address when I am forced to fill out forms, nunya@bidness.corp. Here is a basic regex that explicitly matches this string, nothing before or after. If we break down the elements, here is what we have.

Explicitly matching

First, let’s cheat. Valid regex for the email address above would include the line below. But wait, ‘that’s just literally the entire email address’ you say? That’s right, regex can happily match exactly what you type, if it is exactly right. But where is the fun in that? Let’s be bold and match more!

nunya@bidness.corp
 matches
nunya@bidness.corp
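
One caveat is worth a quick sketch: in a regex the . is a wildcard, so the literal pattern above also matches strings where the period is replaced by some other character. The Python snippet below (Python and its re module are my choice for illustration, nothing here depends on them) shows that, and uses re.escape to treat every character of the address as literal text.

```python
import re

address = "nunya@bidness.corp"

# Used as-is, the "." in the pattern matches any single character.
loose = re.compile(address)

# re.escape backslash-escapes special characters so the whole string is literal.
strict = re.compile(re.escape(address))

print(bool(loose.fullmatch("nunya@bidness.corp")))   # True
print(bool(loose.fullmatch("nunya@bidnessXcorp")))   # True, "." matched "X"
print(bool(strict.fullmatch("nunya@bidness.corp")))  # True
print(bool(strict.fullmatch("nunya@bidnessXcorp")))  # False
```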

Pattern matching

Now we will use some real regex power and look for patterns in a more reusable way. We will define the type of characters we want to match and the number of those characters, then use ^ and $ to anchor the match to the entire line.

Here is the finished regex and the pattern it matches.

^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$
 matches
nunya@bidness.corp
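
To try this pattern outside of a debugging tool, here is a minimal Python sketch. It assumes the digit range is written as 0-9, and looks_like_email is a hypothetical helper name made up for this example. It is a shape check only, not a full email validator.

```python
import re

# Local part: letters, digits, ".", "_"; domain: letters and digits;
# then a literal "." and a 2 to 4 letter ending.
EMAIL = re.compile(r"^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$")

def looks_like_email(text: str) -> bool:
    """Hypothetical helper: True when the whole string fits the pattern."""
    return EMAIL.search(text) is not None

samples = [
    "nunya@bidness.corp",     # fits the pattern
    "nunya@bidness",          # no "." and ending
    "nunya@bidness.c",        # ending shorter than 2 letters
    "nunya@bidness.company",  # ending longer than 4 letters
]
for sample in samples:
    print(sample, looks_like_email(sample))
```

Note that the {2,4} limit rejects longer endings such as .company, so loosen it if you need to accept those.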

Refactor to accommodate non-english characters and numbers

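What if I wanted to allow characters in other languages, or even numbers in other languages? How would I do that? I would use \w for word characters and \d for digits as shown below. In a Unicode-aware engine, \w covers letters, digits, and the underscore in any script, so the \d and _ inside the brackets are redundant but harmless. Some engines restrict \w to ASCII, so check yours.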

The finished regex below is a lot cleaner, isn’t it?

^[\w\d._]+@[\w\d]+\.[\w]{2,4}$
 matches
nunya@bidness.corp
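
How \w behaves depends on the engine, so here is a small Python 3 sketch of the difference. In Python 3, \w and \d are Unicode-aware by default for text patterns, and the re.ASCII flag restricts them to ASCII. The accented address is made up for the example.

```python
import re

PATTERN = r"^[\w\d._]+@[\w\d]+\.[\w]{2,4}$"

unicode_aware = re.compile(PATTERN)         # Python 3 default for str patterns
ascii_only = re.compile(PATTERN, re.ASCII)  # \w and \d limited to ASCII

plain = "nunya@bidness.corp"
accented = "jos\u00e9@b\u00fcro.info"  # "josé@büro.info", a made-up address

print(bool(unicode_aware.search(plain)))     # True
print(bool(unicode_aware.search(accented)))  # True
print(bool(ascii_only.search(plain)))        # True
print(bool(ascii_only.search(accented)))     # False
```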

Match a phone number

Let’s take what we used above to match a phone number. Specifically, let’s use the phone number for Homer Simpson’s snow plow business, Mr. Plow 555-555-3226.

What is our pattern?

We want 3 sets of digits, separated by a delimiter, -. So how do we do that? We know that we could use [0-9], but \d would work as well.

[0-9]{3}-[0-9]{3}-[0-9]{4}
 matches
555-555-3226

More than just - for a delimiter

Let’s now accept either - or . as a delimiter. We do that by putting both in a character class like [-.]. Inside the brackets the . is a literal period, not a wildcard.

[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}
 matches
555-555-3226
555.555.3226
 nomatch
5555553226
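
Here is a quick Python sketch of the same check, using re.fullmatch so the whole string has to fit. One thing it makes visible: each [-.] is chosen independently, so a number with mixed delimiters also matches.

```python
import re

PHONE = re.compile(r"[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}")

numbers = [
    "555-555-3226",  # match
    "555.555.3226",  # match
    "555-555.3226",  # match, the two delimiters do not have to agree
    "5555553226",    # no match, a delimiter is required
]
for number in numbers:
    print(number, PHONE.fullmatch(number) is not None)
```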

Allow for no delimiters

But what if we want zero delimiters? How do we do that? We can’t use + because that matches 1 or more. We can’t use * either, because that means zero or more: we get the zero, but it also allows more than 1. So, we use the ? character. Think of it as ‘optional’, because it will match either zero or 1.

[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}
 matches
555-555-3226
555.555.3226
5555553226
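
To see why ? is the right quantifier here, this Python sketch runs the same three numbers through the ?, *, and + variants. The variable names are just labels for this example.

```python
import re

optional = re.compile(r"[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}")  # zero or 1
star = re.compile(r"[0-9]{3}[-.]*[0-9]{3}[-.]*[0-9]{4}")      # zero or more
plus = re.compile(r"[0-9]{3}[-.]+[0-9]{3}[-.]+[0-9]{4}")      # 1 or more

for number in ["555-555-3226", "5555553226", "555--555-3226"]:
    print(
        number,
        optional.fullmatch(number) is not None,
        star.fullmatch(number) is not None,
        plus.fullmatch(number) is not None,
    )
# 555-555-3226  -> True True True
# 5555553226    -> True True False   (+ needs at least one delimiter)
# 555--555-3226 -> False True True   (? allows at most one delimiter)
```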

Now let’s use our ninja regex skills to find an IP address

First, what is our pattern? An IPv4 address has 4 octets, each from 0 to 255, separated by periods. For example 1.1.1.1 is valid, as is 200.196.7.30. That makes for a little more excitement… let’s see how.

It isn’t as easy as just 4 sets of up to 3 numbers

Let’s begin with a simple regex to get the 4 groups of numbers with the period delimiter. The period is escaped as \. because a bare . matches any character. Note that we are matching the valid sets, but we are also getting octets far beyond 255.

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
 matches
1.1.1.1
200.196.7.30
999.999.999.999
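
A short Python sketch of the two weaknesses in this simple approach: a bare . accepts any character as the delimiter, and even with the period escaped as \. the pattern only counts digits, so out-of-range octets pass. The variable names are made up for this example.

```python
import re

# A bare "." matches any character; "\." matches only a literal period.
bare_dot = re.compile(r"[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}")
escaped_dot = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")

print(bool(bare_dot.fullmatch("1a1b1c1")))     # True, letters pass as delimiters
print(bool(escaped_dot.fullmatch("1a1b1c1")))  # False

# Digit counting alone cannot enforce the 0 to 255 range.
print(bool(escaped_dot.fullmatch("200.196.7.30")))     # True
print(bool(escaped_dot.fullmatch("999.999.999.999")))  # True, but not a valid address
```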

We need some skills here to make sure each octet is 255 or less

Here is an example that uses groups of numbers in parentheses along with the ‘or’ character |. Let’s take a single octet and see how we can regex it. We will break down (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) piece by piece.

So, what does that whole mess mean? That means 250-255 OR 200-249 OR 100-199 OR 0-99 (with the optional [1-9] in front you get 10-99, without it you get 0-9). That was fun, let’s do that 3 more times with a \. between each one for the period. I will wrap the line for you, but in practice you might not be able to wrap the line if you are following along in dubbex.

(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
 matches
1.1.1.1
200.196.7.30
 nomatch
999.999.999.999
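
In code you do not have to type the octet four times. The Python sketch below builds the same pattern by joining four copies of the octet with \. and checks it with re.fullmatch. OCTET and IPV4 are names I picked for this example. Using fullmatch matters: the pattern has no ^ and $ anchors, so a plain search can still find a valid-looking address inside an invalid one.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets separated by literal periods.
IPV4 = re.compile(r"\.".join([OCTET] * 4))

for candidate in ["1.1.1.1", "200.196.7.30", "999.999.999.999", "256.1.1.1"]:
    print(candidate, IPV4.fullmatch(candidate) is not None)
# 1.1.1.1 True
# 200.196.7.30 True
# 999.999.999.999 False
# 256.1.1.1 False

# Without anchoring, search finds a match starting at the second character.
print(IPV4.search("256.1.1.1").group())  # 56.1.1.1
```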

There, we matched the valid IP addresses but not the invalid one. Here is a great reference for regular expressions for IP addresses with more varieties and a more in-depth explanation.

Cheatsheet for regex

Here is a cheatsheet that covers the basics of regex. It is by no means complete, but when I can’t get it done with this cheatsheet, I am sure I am going to google it.

Common characters and modifiers

Special matches

Hopefully regex is a little less mysterious!