Regex is strange and powerful
Like the Continuum Transfunctioner, regex is a mysterious device, and its mystery is only exceeded by its power. I am gonna try to demystify it just a wee bit in this post. I will give a few examples of common patterns to match, the regex that will match each one, and why the particular choices were made. So, here we go!
Use a tool to help you debug your regex
Here are a few tools that help you debug your regex. Each of them will let you enter some text (or it might come with some pre-populated), and then see your pattern match as you type. I think the first one has the best interface.
Match an email address like nunya@bidness.corp
Let’s match my favorite email address when I am forced to fill out forms, nunya@bidness.corp. Here is a basic regex that explicitly matches this string, nothing before or after. If we break down the elements, here is what we have.
Explicitly matching
First, let’s cheat. Valid regex for the email address above would include the line below. But wait, ‘that’s just literally the entire email address’ you say? That’s right, regex can happily match exactly what you type, if it is exactly right. But where is the fun in that? Let’s be bold and match more!
nunya@bidness.corp
matches
nunya@bidness.corp
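One catch with the cheat is worth seeing in code. This is a minimal sketch using Python's built-in re module (my choice of language, the post itself is language-agnostic): typed as a pattern, the address does match itself, but the unescaped . still means 'any character'. The lookalike string is made up for illustration.

```python
import re

address = "nunya@bidness.corp"

# Used as a pattern, the address matches itself...
literal_pattern = re.compile(address)
matches_itself = literal_pattern.fullmatch(address) is not None

# ...but the unescaped "." also accepts any other single character there.
lookalike = "nunya@bidnessXcorp"  # made-up string for illustration
matches_lookalike = literal_pattern.fullmatch(lookalike) is not None

# re.escape() backslash-escapes the special characters for a strictly literal match.
strict_pattern = re.compile(re.escape(address))
strict_matches_lookalike = strict_pattern.fullmatch(lookalike) is not None

print(matches_itself, matches_lookalike, strict_matches_lookalike)
# prints: True True False
```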
Pattern matching
Now we will use some real regex power: we are going to look for patterns in a more reusable way. We will define the type of characters we want to match, the number of those characters, and refine the scope to the entire line.
- ^ means the beginning of the line.
- [a-zA-Z0-9._] looks for any upper/lowercase letters or numbers, including the . and _ characters.
- After the [...] pattern, we want to match any number of that pattern, hence the + which means one or more of the preceding letter, or pattern in our case.
- The @ is literally matching the @ symbol.
- [a-zA-Z0-9] is the same as before, minus the . and _ characters.
- The \. is escaping the . because we want the literal period character. In regex the . means ‘any character’. Elsewhere we often use the * for a wildcard, but the * in regex means zero or more of the preceding character or pattern.
- [a-zA-Z] is only letters.
- Now we only want 2 to 4 letters, so we use the {2,4} to ensure we get no less than 2 but no more than 4 of any letters.
- We wrap up the pattern match with the $ which means the end of the line.
Here is the finished regex and the pattern it matches.
^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$
matches
nunya@bidness.corp
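If you want to try the finished pattern outside of a debugging tool, here is a rough sketch in Python using the built-in re module. The digit range is spelled 0-9 here, the function name is_simple_email is just something I made up, and every sample other than nunya@bidness.corp is invented for illustration. This is nowhere near a full email validator, it only checks the simple shape described above.

```python
import re

# The finished pattern from above, with the digit range written as 0-9.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$")


def is_simple_email(text):
    """Return True if text has the simple name@domain.tld shape (hypothetical helper)."""
    return EMAIL_PATTERN.match(text) is not None


samples = [
    "nunya@bidness.corp",       # matches
    "nunya@bidness",            # no period and ending, so no match
    "nunya@bidness.c",          # ending shorter than 2 letters, so no match
    "nunya@bidness.corporate",  # ending longer than 4 letters, so no match
]

for sample in samples:
    print(sample, is_simple_email(sample))
```

The {2,4} limit is the part most likely to bite you in real life, since plenty of real endings are longer than 4 letters.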
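Refactor to accommodate non-English characters and numbers
What if I wanted to allow characters in other languages, or even numbers in other languages? How would I do that? I would use \w for letters and \d for numbers as shown below.
- \w means any word character: letters, digits, and the underscore. In Unicode-aware regex engines that includes letters from other languages.
- \d means any digit. Since \w already covers digits and _, listing \d and _ alongside it is redundant but harmless.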
The finished regex below is a lot cleaner, isn’t it?
^[\w\d._]+@[\w\d]+\.[\w]{2,4}$
matches
nunya@bidness.corp
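Here is a small sketch of the difference, assuming Python 3, where \w and \d are Unicode-aware by default for text patterns (other engines may need a flag, or may not support this at all). The address with umlauts is made up for illustration.

```python
import re

ASCII_EMAIL = re.compile(r"^[a-zA-Z0-9._]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$")
WORD_EMAIL = re.compile(r"^[\w\d._]+@[\w\d]+\.[\w]{2,4}$")

# Made-up address: \u00fc is the letter u with an umlaut.
address = "m\u00fcller@b\u00fccher.shop"

ascii_result = ASCII_EMAIL.match(address) is not None  # False: the umlaut is outside a-z
word_result = WORD_EMAIL.match(address) is not None    # True: \w accepts it

# \d also accepts digits from other scripts, here Arabic-Indic three and four.
other_digits = re.fullmatch(r"\d+", "\u0663\u0664") is not None  # True

# \w on its own already covers digits and the underscore.
word_covers_digits = re.fullmatch(r"\w+", "abc_123") is not None  # True

print(ascii_result, word_result, other_digits, word_covers_digits)
# prints: False True True True
```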
Match a phone number
Let’s take what we used above to match a phone number. Specifically, let’s use the phone number for Homer Simpson’s snow plow business, Mr. Plow 555-555-3226.
What is our pattern?
We want 3 sets of digits, separated by a delimiter, -. So how do we do that? We know that we could use [0-9], but \d would work as well.
[0-9]{3}-[0-9]{3}-[0-9]{4}
matches
555-555-3226
More than just - for a delimiter
Let’s now accept both the - and the . as a delimiter. We do that by putting them in a character class like [-.]. Inside the brackets the . is a literal period, so it does not need to be escaped.
[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}
matches
555-555-3226
555.555.3226
nomatch
5555553226
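To check this outside of a regex tool, here is a quick sketch in Python with the built-in re module. I am using fullmatch so the whole string has to fit the pattern. The mixed-delimiter sample is my own addition, and it shows that the character class does not force both delimiters to be the same.

```python
import re

PHONE_PATTERN = re.compile(r"[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}")

samples = ["555-555-3226", "555.555.3226", "5555553226", "555-555.3226"]

# Map each sample to True/False depending on whether the whole string matches.
results = {sample: PHONE_PATTERN.fullmatch(sample) is not None for sample in samples}

for sample, matched in results.items():
    print(sample, matched)
# 555-555-3226 True
# 555.555.3226 True
# 5555553226 False
# 555-555.3226 True
```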
Allow for no delimiters
But what if we want zero delimiters? How do we do that? We can’t use the + because that matches 1 or more. We can’t use the * either, because that means zero or more: we get the zero, but it would also allow more than 1. So, we use the ? character. Think of it as ‘optional’, because it will match either zero or 1.
[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}
matches
555-555-3226
555.555.3226
5555553226
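Once the delimiters are optional, a natural next step is to boil every accepted format down to the same ten digits. This is only a sketch in Python, and normalize_phone is a name I made up for it, not something from a library.

```python
import re

PHONE_PATTERN = re.compile(r"[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}")


def normalize_phone(text):
    """Return just the digits of a matching phone number, or None if it does not match."""
    if PHONE_PATTERN.fullmatch(text) is None:
        return None
    # Strip the optional - and . delimiters, leaving only the digits.
    return re.sub(r"[-.]", "", text)


print(normalize_phone("555-555-3226"))   # 5555553226
print(normalize_phone("555.555.3226"))   # 5555553226
print(normalize_phone("5555553226"))     # 5555553226
print(normalize_phone("555--555-3226"))  # None, the ? allows at most one delimiter
```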
Now let’s use our ninja regex skills to find an IP address
First, what is our pattern? An IP address has 4 octets, each from 0 to 255, separated by periods. For example 1.1.1.1 is valid, as is 200.196.7.30. That makes for a little more excitement… let’s see how.
It isn’t as easy as just 4 sets of up to 3 numbers
Let’s begin with a simple regex to get the 4 groups of numbers with the period delimiter. Note that we are matching the valid sets, but we are also getting octets far beyond 255.
[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
matches
1.1.1.1
200.196.7.30
999.999.999.999
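A quick Python sketch shows both weaknesses of the simple approach. The first pattern has the periods escaped as \. and still accepts octets beyond 255. The second leaves the periods unescaped, and since a bare . means 'any character', it accepts things that are not IP addresses at all. The string 1a1b1c1 is made up to show that.

```python
import re

# Periods escaped: only a literal "." works as the delimiter.
SIMPLE_IP = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")

# Periods NOT escaped: "." matches any character.
SLOPPY_IP = re.compile(r"[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}")

too_big = SIMPLE_IP.fullmatch("999.999.999.999") is not None  # True, octets are not range checked
not_an_ip = "1a1b1c1"  # made-up string for illustration
sloppy_accepts = SLOPPY_IP.fullmatch(not_an_ip) is not None   # True
simple_accepts = SIMPLE_IP.fullmatch(not_an_ip) is not None   # False

print(too_big, sloppy_accepts, simple_accepts)
# prints: True True False
```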
We need some skills here to make sure each octet is 255 or less
Here is an example that uses groups of numbers in parentheses and the ‘or’ character |. Let’s take a single octet and see how we can regex it. We will break down (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) in each step.
- 25[0-5] is simple, a 25 followed by any number between 0 and 5, so 250-255.
- 2[0-4][0-9] is a 2 followed by any 0 through 4 and finally a 0 through 9, basically anything between 200-249.
- 1[0-9][0-9] is a 1 followed by 0-9 and a 0-9, basically 100-199.
- [1-9]?[0-9] is an optional 1-9 followed by a 0-9, essentially 0-99.
- Finally, each of those groups is separated by the | which is the ‘or’ operator.
So, what does that whole mess mean? That means 250-255 OR 200-249 OR 100-199 OR 0-99. That was fun, let’s do that 3 more times with a \. between each one for the period. I will wrap the line for you, but in practice you might not be able to wrap the line if you are following along in dubbex.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
matches
1.1.1.1
200.196.7.30
nomatch
999.999.999.999
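In code you do not have to type the octet four times. Here is a sketch in Python that builds the full pattern from a single octet and joins the copies with \. between them. The names OCTET, IPV4_PATTERN, and is_ipv4 are my own, and I am using fullmatch so that nothing extra can sneak in before or after the address.

```python
import re

# One octet: 250-255 OR 200-249 OR 100-199 OR 0-99.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets with an escaped period between each pair.
IPV4_PATTERN = re.compile(r"\.".join([OCTET] * 4))


def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address (hypothetical helper)."""
    return IPV4_PATTERN.fullmatch(text) is not None


for sample in ["1.1.1.1", "200.196.7.30", "999.999.999.999", "256.1.1.1", "1.1.1"]:
    print(sample, is_ipv4(sample))
# 1.1.1.1 True
# 200.196.7.30 True
# 999.999.999.999 False
# 256.1.1.1 False
# 1.1.1 False
```

Without fullmatch (or ^ and $ anchors) a search could still find a valid-looking piece inside a longer invalid string, for example the 1.1.1.99 inside 1.1.1.999.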
There, we matched the valid IP addresses but not the invalid one. Here is a great reference for regular expressions for IP addresses with more varieties and a more in-depth explanation.
Cheatsheet for regex
Here is a cheatsheet that covers the basics of regex. This is by no means complete; there is more, but when I can’t get it done with this cheatsheet I am sure I am going to google it.
Common characters and modifiers
- ? makes the preceding character optional (zero or 1)
- + matches one or more of the preceding character and is greedy: it won’t just get the next character, it will keep on searching until it finds the last one.
- +? is not greedy: it will just search to the next instance of a character.
- .+ matches all characters to the end of the line (combines . any character with + any number)
- [ ] says any of the characters inside the brackets
- {3} would match 3 occurrences of the preceding token
- {3,4} would match 3 to 4 occurrences of the preceding token
- ^ matches only if it’s at the beginning of a string
- $ matches only if it’s the end of a string
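Greedy versus not greedy is easier to see than to describe. This little Python sketch runs both against a made-up snippet of HTML-ish text, and also shows the {3,4} range in action.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # made-up sample text

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
greedy = re.search(r"<.+>", text).group()

# Not greedy: .+? stops at the FIRST ">" it can.
lazy = re.search(r"<.+?>", text).group()

# {3,4} accepts 3 or 4 of the preceding token, no fewer and no more.
three_or_four = [s for s in ["aa", "aaa", "aaaa", "aaaaa"] if re.fullmatch(r"a{3,4}", s)]

print(greedy)         # <b>bold</b> and <i>italic</i>
print(lazy)           # <b>
print(three_or_four)  # ['aaa', 'aaaa']
```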
Special matches
- \w matches any word character (letters, digits, and _)
- \r carriage return
- \n new line
- \s whitespace characters
- \d digits 0-9
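Here is a short Python sketch of the special matches at work on one made-up string that contains a tab and a new line. It uses re.findall to collect every match and re.split to cut the string on whitespace.

```python
import re

sentence = "Mr. Plow\t555-555-3226\nsnow_plow 4 hire"  # made-up sample with a tab and a new line

# \d+ collects each run of digits.
digit_runs = re.findall(r"\d+", sentence)

# \w+ collects each run of word characters (letters, digits, and _).
words = re.findall(r"\w+", sentence)

# \s+ splits on any run of whitespace: spaces, tabs, and new lines.
pieces = re.split(r"\s+", sentence)

print(digit_runs)  # ['555', '555', '3226', '4']
print(words)       # ['Mr', 'Plow', '555', '555', '3226', 'snow_plow', '4', 'hire']
print(pieces)      # ['Mr.', 'Plow', '555-555-3226', 'snow_plow', '4', 'hire']
```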
Hopefully regex is a little less mysterious!