Regular Expression Tutorial 1: special characters

A regular expression is a pattern language for finding particular strings in text. It can also do flexible matching under complex conditions. I use it quite often but forget some of the details, so this post should help people refresh their knowledge of regular expressions.

Certain characters have special meanings in a regular expression. In other words, these characters are not read literally, the way you would see them in Word or Notepad. If one of them appears in a regular expression, the program interprets it differently.

     Name              Function
\    back slash        escape character
[ ]  square brackets   single character match
{ }  curly braces      repeats
( )  parentheses       reference or subexpression
^    caret (hat)       beginning of a line or string (depends on engine and mode)
$    dollar            end of a line or string (depends on engine and mode)
|    pipe              alternation [OR]
*    asterisk          zero or more repeats
+    plus sign         one or more repeats
?    question mark     zero or one occurrence
.    dot               any single character
!    exclamation       negation [NOT], only inside (?! ) and (?<! ) assertions
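
Whether ^ and $ anchor to a line or to the whole string depends on the engine and its flags. As a minimal sketch, assuming Python (the post is not tied to any language) and its standard `re` module, where they anchor to the string unless `re.MULTILINE` is set:

```python
import re

text = "first line\nsecond line"

# Without flags, ^ only matches at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Likewise, $ matches before each newline only in multiline mode.
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
print(line_ends)       # ['line', 'line']
```

Other engines choose different defaults, so it is worth checking this once in the tool you actually use.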

The backslash “\” is the first one in the list. It is used as an “escape”, meaning that when the program sees this character, it does different things depending on which character comes next. The list below shows how it specifies non-printable characters.

\t   tab
\n   new line
\r   carriage return
\f   form feed character (end of page character)

A backslash can also be used to specify certain character classes and positions:

\s     a white space
\S     non white space
\d     a digit [0-9]
\D     non digit [^0-9]
\w     word character [a-zA-Z_0-9]
\W     non word character
\b     word boundary; matches a position, not a character (read below)
\A     beginning of a string
\z     end of a string
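
To see the character classes in action, here is a small sketch in Python (my choice for illustration; the sample string is made up) using the standard `re` module:

```python
import re

# A made-up sample containing a tab, a space and a newline.
sample = "ID_42\tscore: 97\n"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of word characters
spaces = re.findall(r"\s", sample)       # each whitespace character
only_digits = re.sub(r"\D", "", sample)  # delete every non-digit

print(digits)       # ['42', '97']
print(words)        # ['ID_42', 'score', '97']
print(spaces)       # ['\t', ' ', '\n']
print(only_digits)  # 4297
```

Note that the underscore counts as a word character, so “ID_42” stays in one piece.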

\W and \b are similar in the sense that both can find the place where a word starts after a non-word character. The difference is that \b matches only the position between a word character and a non-word character and consumes nothing, while \W matches the non-word character itself. For example, in the sentence “It is very hot today.”, \bho matches only “ho”, while \Who includes the space character and “ho”. \b defines a word boundary (the start and end of the string also count), so if you want to search for the word “all”, you can use \ball\b. It will not match “tallest”, “ball” or “alleged”.
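
The difference can be checked directly. A minimal sketch in Python (an assumption on my part; any engine with \b and \W should behave the same way here):

```python
import re

sentence = "It is very hot today."

# \b matches a position, so only "ho" ends up in the match.
with_boundary = re.search(r"\bho", sentence).group()

# \W consumes the non-word character, so the space is included.
with_nonword = re.search(r"\Who", sentence).group()

# \ball\b only accepts "all" as a whole word.
words = ["all", "tallest", "ball", "alleged"]
whole_word_hits = [w for w in words if re.search(r"\ball\b", w)]

# \b also matches at the very start of the string; \W cannot,
# because there is no character there to consume.
starts_with_b = bool(re.search(r"\bIt", sentence))
starts_with_W = bool(re.search(r"\WIt", sentence))

print(repr(with_boundary))  # 'ho'
print(repr(with_nonword))   # ' ho'
print(whole_word_hits)      # ['all']
print(starts_with_b, starts_with_W)  # True False
```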

Another use of \ is to make a special character behave as a normal character. For example, \^ searches for a literal ^ character. A double backslash \\ searches for a literal \ character.
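
A short sketch of escaping in Python (the sample text is made up). Raw strings (r"...") keep Python itself from interpreting the backslashes, and `re.escape` from the standard library builds an escaped pattern for you:

```python
import re

# Made-up sample text containing a literal ^ and a literal backslash.
text = r"2^10 lives in C:\temp"

carets = re.findall(r"\^", text)       # \^ finds a literal ^
backslashes = re.findall(r"\\", text)  # \\ finds a literal backslash

# re.escape puts a backslash in front of special characters.
escaped = re.escape("a.b")
literal_dot_hits = re.findall(escaped, "a.b axb")

print(carets)            # ['^']
print(len(backslashes))  # 1
print(escaped)           # a\.b
print(literal_dot_hits)  # ['a.b']
```

Without the escape, the pattern a.b would also match “axb”, because the dot stands for any single character.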

Brackets [ ], braces { } and parentheses ( ) in a regular expression each treat the characters inside them differently. Combining them allows more complicated string matching.

[ ] Square brackets: Match one character if that character is listed inside the square brackets. [A-Z] matches one character from A to Z. [0-9] matches a digit (same as \d). [a-zA-Z_0-9] matches any word character. K[ab2] matches Ka, Kb and K2.
^ as the first character inside square brackets, [^ ], matches any character NOT listed in the brackets. For example, [^Ab] matches any single character except “A” and “b”.
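
The bracket examples above can be tried out as follows. This is a sketch in Python with made-up input strings:

```python
import re

# K followed by exactly one of a, b or 2.
k_hits = re.findall(r"K[ab2]", "Ka Kb K2 Kc K9")

# Any single character except "A" and "b".
not_A_or_b = re.findall(r"[^Ab]", "AbcAbd")

# One capital letter at a time.
capitals = re.findall(r"[A-Z]", "Regular Expression")

print(k_hits)      # ['Ka', 'Kb', 'K2']
print(not_A_or_b)  # ['c', 'd']
print(capitals)    # ['R', 'E']
```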

{ } Curly braces: Usually you have one or two numbers in curly braces, such as {3}, {4,5} and {2,}. The braces take the preceding character (or bracket expression or group) and look for the number of repeats specified. For example, b{3} matches bbb. [1-3]{2} matches 11, 12, 13, 21, 22, 23, 31, 32 and 33. If two numbers are specified, the number of repeats has to be in that range: a{2,4} matches aa, aaa and aaaa. If there is a comma but no second number, the first number is the minimum number of repeats.
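
A sketch of the repeat counts in Python. I use `re.fullmatch` here so that the whole candidate string has to fit the pattern; the candidate lists are made up:

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]

# {2,4}: between two and four repeats.
two_to_four = [s for s in candidates if re.fullmatch(r"a{2,4}", s)]

# {2,}: at least two repeats, no upper limit.
two_or_more = [s for s in candidates if re.fullmatch(r"a{2,}", s)]

# {3}: exactly three repeats.
exactly_three = [s for s in ["bb", "bbb", "bbbb"] if re.fullmatch(r"b{3}", s)]

# Braces apply to the whole bracket expression in front of them.
pairs = re.findall(r"\b[1-3]{2}\b", "11 23 34 4 123")

print(two_to_four)    # ['aa', 'aaa', 'aaaa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa', 'aaaaa']
print(exactly_three)  # ['bbb']
print(pairs)          # ['11', '23']
```

When searching inside longer text instead of using `re.fullmatch`, a{2,4} will happily match the first four letters of “aaaaa”, which is why the word boundaries are added in the last pattern.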

( ) Parentheses: Put characters inside parentheses if you want to refer to the matched string again. To refer to it, use a backslash and a number which indicates the position of the parentheses. For example, ([ac]b)\1 matches either abab or cbcb, because \1 refers to the string matched in the parentheses, which is either ab or cb. I explained a little more about the usage of parentheses in another post. Parentheses are also used together with the pipe |. The pipe is alternation; in logic it is equivalent to OR. So if you want to match a string abc OR acc, you can write (abc|acc).
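
Back-references and alternation can be sketched like this in Python (candidate strings are made up for the illustration):

```python
import re

backref = r"([ac]b)\1"
candidates = ["abab", "cbcb", "abcb", "cbab"]

# \1 must repeat exactly what the first group matched,
# so "abcb" and "cbab" are rejected.
repeated = [s for s in candidates if re.fullmatch(backref, s)]

# group(1) returns the text captured by the first parentheses.
captured = re.fullmatch(backref, "cbcb").group(1)

# Alternation: either abc or acc.
alternation_hits = [m.group() for m in re.finditer(r"(abc|acc)", "abc acc adc")]

print(repeated)          # ['abab', 'cbcb']
print(captured)          # cb
print(alternation_hits)  # ['abc', 'acc']
```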

( ) Parentheses with ?: When ? comes right after the opening parenthesis, it behaves quite differently depending on what comes after it (e.g. =, !, <=, <!, >). It is used to capture a substring of an input string only under a condition. Why do you need this kind of function? Let’s say you want to capture “gold” in the sentence “I like the gold coin.”, but only when it is followed by “coin”. In cases like this, when you want to capture something under a condition but you don’t want to capture the condition itself, parentheses with ? inside are used.

For example, with an equal sign (=), it looks for the substring preceding the parentheses only where the condition inside the parentheses is met, but it captures only the substring outside the parentheses.

gold(?=\scoin)              I really like the gold coin.

Here “gold” is what is captured, and “ coin” is the condition met in the parentheses. To make a negative assertion, use an exclamation mark ! instead of the equal sign =. For example, if you want to capture “gold” not followed by “ coin”:

gold(?!\scoin\b)           I really like gold medals but not gold coins.

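It captures the first “gold” because it is not followed by “ coin”. Note what \b does here: it requires “coin” to end at a word boundary, and “coins” does not, so with \b the condition is not met for the second “gold” either and both “gold”s are captured. If you remove the \b, i.e. gold(?!\scoin), the second “gold” is NOT captured, because it is followed by “ coin” (the beginning of “ coins”). If your condition comes before the string you want to match, put < between ? and = (or !), as in (?<=gold\s)coin.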
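
The behaviour of these assertions is easy to check directly. A minimal sketch in Python (my choice of language; note that Python only accepts fixed-width conditions after ?<=):

```python
import re

s1 = "I really like the gold coin."
s2 = "I really like gold medals but not gold coins."

# Positive lookahead: " coin" is required but not part of the match.
lookahead = re.search(r"gold(?=\scoin)", s1).group()

# Negative lookahead with \b: "coins" is not the whole word "coin",
# so neither "gold" is rejected.
with_b = [m.start() for m in re.finditer(r"gold(?!\scoin\b)", s2)]

# Negative lookahead without \b: the "gold" before " coins" is rejected.
without_b = [m.start() for m in re.finditer(r"gold(?!\scoin)", s2)]

# Lookbehind: match "coin" only when "gold " comes right before it.
lookbehind = re.findall(r"(?<=gold\s)coin", s1)

print(lookahead)       # gold
print(len(with_b))     # 2
print(len(without_b))  # 1
print(lookbehind)      # ['coin']
```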

There are more ways to capture a string with conditions using different special characters, and they are also quite useful. I am not going to discuss these; please refer here.

If you want to quickly test how your regular expression behaves, you can use a regular expression test web site (1 & 2) or a free text editor such as EditPad Pro.

You can also use a lot of random words to check what actually matches your regular expression.

Other useful links

Common regular expressions (e.g. URL, email and etc)

Regular expression library

More regular expression examples

About bioinfomagician

Bioinformatic Scientist @ UCLA
