Regex is strange and powerful

Like the Continuum Transfunctioner, regex is a mysterious device, and its mystery is only exceeded by its power. I am gonna try to demystify it just a wee bit in this post. I will give a few examples of some common patterns to match, the regex that will match them, and why the particular choices were made. So, here we go!

Use a tool to help you debug your regex

Here are a few tools that help you debug your regex. Each of them will let you enter some text (or it might come with some pre-populated), and then see your pattern match as you type. I think the first one has the best interface.

Match an email address like `nunya@bidness.corp`

Let’s match my favorite email address when I am forced to fill out forms, nunya@bidness.corp. Here is a basic regex that explicitly matches this string, nothing before or after. If we break down the elements, here is what we have.

Explicitly matching

First, let’s cheat. Valid regex for the email address above would include the line below. But wait, ‘that’s just literally the entire email address’ you say? That’s right, regex can happily match exactly what you type, if it is exactly right. But where is the fun in that? Let’s be bold and match more!

nunya@bidness.corp
 matches
nunya@bidness.corp
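
One caveat is worth a quick sketch. Even in a "literal" pattern like this one, the . is still the regex symbol for any character, so the pattern matches slightly more than the exact address. The Python snippet below shows the difference and uses `re.escape` to force a truly literal match. Python's `re` module is my choice for illustration here, not something this post depends on, and the variable names are made up.

```python
import re

address = "nunya@bidness.corp"

# Typed as-is, the "." in the pattern still means "any character".
loose = re.compile(address)
# re.escape() backslash-escapes special characters so each one is literal.
strict = re.compile(re.escape(address))

loose_exact = loose.fullmatch("nunya@bidness.corp") is not None
loose_typo = loose.fullmatch("nunya@bidnessXcorp") is not None
strict_exact = strict.fullmatch("nunya@bidness.corp") is not None
strict_typo = strict.fullmatch("nunya@bidnessXcorp") is not None

print(loose_exact, loose_typo)    # True True
print(strict_exact, strict_typo)  # True False
```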

Pattern matching

Now we will use some real regex power and look for patterns in a more reusable way. We will define the type of characters we want to match, the number of those characters, and anchor the match to the entire line.

^ means the beginning of the line.
[a-zA-Z0-9._] looks for any upper/lowercase letters or numbers, plus the . and _ characters. Inside the brackets the . is just a literal period.
After the [...] pattern, we want to match any number of that pattern, hence the + which means one or more of the preceding letter, or pattern in our case.
The @ is literally matching the @ symbol.
[a-zA-Z0-9]+ is the same idea as before, minus the . and _ characters, again followed by + for one or more.
The \. is escaping the . because we want the literal period character. In regex an unescaped . means ‘any character’. Elsewhere we often use the * as a wildcard, but in regex the * means zero or more of the preceding character or pattern.
[a-zA-Z] is only letters.
Now we only want 2 to 4 letters, so we use the {2,4} to ensure we get no less than 2 but no more than 4 of any letters.
We wrap up the pattern match with the $ which means the end of the line.

Here is the finished regex and the pattern it matches.

^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$
 matches
nunya@bidness.corp
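
Here is a minimal sketch of trying this pattern out in Python, assuming the digit ranges are written as 0-9. The helper name `looks_like_email` and the non-matching sample strings are made up for illustration, and this is a pattern check, not real email validation.

```python
import re

# Assumption: digits are written as the range 0-9 in both character classes.
EMAIL = re.compile(r"^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$")


def looks_like_email(text: str) -> bool:
    """Return True when the string fits the pattern from start to end."""
    return EMAIL.match(text) is not None


samples = [
    "nunya@bidness.corp",   # fits the pattern
    "nunya@bidness",        # no period and letters after the domain
    "nunya@bidness.c",      # only 1 letter at the end, we need 2 to 4
    "nun ya@bidness.corp",  # a space is not in the first character class
]
results = {s: looks_like_email(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

Only the first sample comes back True. The ^ and $ anchors are what keep the last one from sneaking through on a partial match.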

Refactor to accommodate non-english characters and numbers

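What if I wanted to allow characters in other languages, or even numbers in other languages? How would I do that? I would use \w for letters and \d for numbers as shown below. Keep in mind that whether these match non-English characters depends on your regex engine and its Unicode settings.

\w means any word character: letters, digits, and the underscore.
\d means any digit. Since \w already covers digits, the \d is there mostly for readability.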

The finished regex below is a lot cleaner, isn’t it?

^[\w\d._]+@[\w\d]+\.[\w]{2,4}$
 matches
nunya@bidness.corp
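
How \w and \d treat non-English characters varies by engine, so here is a hedged sketch of what Python's `re` module does. In Python 3, string patterns are Unicode-aware by default and the `re.ASCII` flag switches that off. The accented address and the odd-looking ending are made-up test strings.

```python
import re

PATTERN = r"^[\w\d._]+@[\w\d]+\.[\w]{2,4}$"

unicode_email = re.compile(PATTERN)          # str patterns are Unicode-aware by default
ascii_email = re.compile(PATTERN, re.ASCII)  # limit \w and \d to ASCII characters

accented = "jos\u00e9@bidness.corp"  # "josé@bidness.corp", a made-up address

unicode_hit = unicode_email.match(accented) is not None
ascii_hit = ascii_email.match(accented) is not None

# \w also accepts digits and "_", so this odd ending gets through too.
odd_ending_hit = unicode_email.match("nunya@bidness.c_9") is not None

print(unicode_hit, ascii_hit, odd_ending_hit)  # True False True
```

The last check shows the trade-off: the shorter pattern is more permissive than the letters-only [a-zA-Z]{2,4} version.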

Match a phone number

Let’s take what we used above to match a phone number. Specifically, let’s use the phone number for Homer Simpson’s snow plow business, Mr. Plow 555-555-3226.

What is our pattern?

We want 3 sets of digits, separated by a delimiter, -. So how do we do that? We know that we could use [0-9], but \d would work as well.

[0-9]{3}-[0-9]{3}-[0-9]{4}
 matches
555-555-3226
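
As a quick sanity check that [0-9] and \d are interchangeable here, this small Python sketch pulls the number out of a longer string with both spellings. The variable names are just for illustration.

```python
import re

text = "Mr. Plow 555-555-3226"

# Same pattern twice: once with explicit ranges, once with the \d shorthand.
with_ranges = re.search(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", text)
with_shorthand = re.search(r"\d{3}-\d{3}-\d{4}", text)

print(with_ranges.group())     # 555-555-3226
print(with_shorthand.group())  # 555-555-3226
```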

More than just `-` for a delimiter

Let’s now accept both the - and the . as a delimiter. We do that by putting them in a character class like [-.].

[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}
 matches
555-555-3226
555.555.3226
 nomatch
5555553226

Allow for no delimiters

But what if we want zero delimiters? How do we do that? We can’t use the + because that matches 1 or more. We can’t use the * either, because that means zero or more, so we would get the zero but we would also allow more than 1. So, we use the ? character. Think of it as ‘optional’, because it will match either zero or 1.

[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}
 matches
555-555-3226
555.555.3226
5555553226
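
To see the three phone patterns side by side, here is a small Python sketch that runs each one against each sample number. The dictionary labels are my own, and `re.fullmatch` is used so the whole string has to fit the pattern.

```python
import re

# The three phone patterns from this section, strictest to most lenient.
patterns = {
    "dash only": r"[0-9]{3}-[0-9]{3}-[0-9]{4}",
    "dash or dot": r"[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}",
    "optional delimiter": r"[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}",
}
numbers = ["555-555-3226", "555.555.3226", "5555553226"]

# For each pattern, record whether each number matches in full.
table = {
    name: [re.fullmatch(pattern, number) is not None for number in numbers]
    for name, pattern in patterns.items()
}

for name, row in table.items():
    print(f"{name:20} {row}")
```

Each step down the list accepts one more of the sample numbers than the one before it.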

Now let’s use our ninja regex skills to find an IP address

First, what is our pattern? An IP address has 4 octets up to 255 separated by periods, for example 1.1.1.1 is valid, as is 200.196.7.30. That makes for a little more excitement… let’s see how.

It isn’t as easy as just 4 sets of up to 3 numbers

Let’s begin with a simple regex to get the 4 groups of numbers with the period delimiter. Note that we are matching the valid sets, but we are also getting octets far beyond 255.

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
 matches
1.1.1.1
200.196.7.30
999.999.999.999
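
One detail to watch in any version of this pattern is that the periods need a backslash. A bare . means any character, so it will happily accept letters as delimiters. This Python sketch compares the two spellings. The string "1a1b1c1" is a made-up example that is clearly not an IP address.

```python
import re

unescaped = r"[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}"
escaped = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

candidates = ["1.1.1.1", "200.196.7.30", "999.999.999.999", "1a1b1c1"]

# Keep only the candidates that each pattern matches in full.
unescaped_hits = [c for c in candidates if re.fullmatch(unescaped, c)]
escaped_hits = [c for c in candidates if re.fullmatch(escaped, c)]

print(unescaped_hits)  # all four, including "1a1b1c1"
print(escaped_hits)    # the three dotted strings, 999s included
```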

We need some skills here to make sure each octet is 255 or less

Here is an example that uses groups of numbers in parentheses and the ‘or’ character |. Let’s take a single octet and see how we can regex it. We will break down (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) piece by piece.

25[0-5] is simple, a 25 followed by any number between 0 and 5, so 250-255.
2[0-4][0-9] is a 2 followed by any 0 through 4 and finally a 0 through 9, basically anything between 200-249.
1[0-9][0-9] is a 1 followed by 0-9 and a 0-9, basically 100-199.
[1-9]?[0-9] is an optional 1-9 followed by a 0-9, essentially 0-99.
Finally, each of those alternatives is separated by the |, which is the ‘or’ operator.

So, what does that whole mess mean? That means 250-255 OR 200-249 OR 100-199 OR 0-99. That was fun, let’s do that 3 more times with a \. between each one for the period. I will wrap the line for you, but in practice you won’t be able to wrap the line if you are following along in dubbex.

(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
 matches
1.1.1.1
200.196.7.30
 nomatch
999.999.999.999

There, we matched the valid IP addresses but not the invalid one. Here is a great reference for regular expressions for IP Addresses with more varieties and a more in-depth explanation.
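
If you would rather not type the octet group four times, here is a sketch in Python that builds the full pattern from one copy and checks it. The names `OCTET`, `IPV4`, and `is_ipv4` are made up for this example, and the last two test addresses are my own additions.

```python
import re

OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Join four copies of the octet with an escaped period between them.
IPV4 = re.compile(r"\.".join([OCTET] * 4))


def is_ipv4(text: str) -> bool:
    """Return True when the whole string is four valid octets."""
    return IPV4.fullmatch(text) is not None


# Which numbers from 0 to 999 does a single octet accept?
accepted = [n for n in range(1000) if re.fullmatch(OCTET, str(n))]
print(accepted[0], accepted[-1], len(accepted))  # 0 255 256

tests = ["1.1.1.1", "200.196.7.30", "999.999.999.999", "256.1.1.1", "01.1.1.1"]
checks = {ip: is_ipv4(ip) for ip in tests}
print(checks)
```

Using `fullmatch` matters here. With a plain search, a string like 1.1.1.999 would still produce a partial match on 1.1.1.99, so anchor the pattern (or use ^ and $) when you want to validate a whole value.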

Cheatsheet for regex

Here is a cheatsheet that covers the basics of regex. It is by no means complete, but when I can’t get it done with this cheatsheet I know I am going to google it.

Common characters and modifiers

? makes the preceding character optional (zero or 1)
+ matches one or more of the preceding token and is greedy - it won’t just get the next character, it will keep on going and grab as much as it can.
+? is not greedy (lazy) - it grabs as little as it can and stops at the first place the rest of the pattern fits.
.+ matches all characters to the end of the line (combines . any character with + one or more)
[ ] matches any one of the characters inside the brackets
{3} matches exactly 3 occurrences of the preceding token
{3,4} matches 3 to 4 occurrences of the preceding token
^ matches only if it’s at the beginning of a string
$ matches only if it’s the end of a string
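
Greedy versus lazy is easier to see than to describe, so here is a tiny Python sketch. The sample string with the `<b>` tags is made up for the demonstration.

```python
import re

text = "<b>strange</b> and <b>powerful</b>"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST </b>.
greedy = re.search(r"<b>.+</b>", text).group()

# Lazy: .+? grabs as little as it can, so the match stops at the FIRST </b>.
lazy = re.search(r"<b>.+?</b>", text).group()

print(greedy)  # <b>strange</b> and <b>powerful</b>
print(lazy)    # <b>strange</b>
```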

Special matches

\w matches any word character (letters, digits, and the underscore)
\r carriage return
\n newline
\s whitespace characters (spaces, tabs, newlines)
\d digits 0-9
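
Here is a short Python sketch of what these shorthands pick out of one line of text. The sample line, with a tab and a Windows-style line ending, is made up for illustration.

```python
import re

line = "Mr. Plow\t555-555-3226\r\n"

words = re.findall(r"\w+", line)   # runs of word characters
digits = re.findall(r"\d+", line)  # runs of digits
spaces = re.findall(r"\s", line)   # each whitespace character, one at a time

print(words)   # ['Mr', 'Plow', '555', '555', '3226']
print(digits)  # ['555', '555', '3226']
print(spaces)  # [' ', '\t', '\r', '\n']
```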

Hopefully regex is a little less mysterious!

Regex is strange and powerful

by Jesse Perry on Wednesday, March 10, 2021
