Regular Expression and Its Implementation in Python and dotNet

Regex Basics Cheat Sheet

I find this helpful for a quick reference.

I use Regex101 to test my regex patterns.

Some Confusing Topics

Lazy and Greedy Matching

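Most quantifiers come in a greedy and a lazy variant. A greedy quantifier (the default, e.g. `*` or `+`) matches as much text as it can while still letting the overall pattern match. A lazy quantifier (written by appending `?`, e.g. `*?` or `+?`) matches as little text as it can.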

Greedy Matching:

| Regex | String | Result |
| --- | --- | --- |
| `5+` | `2555556662` | `55555` |
| `.*at` | `The fat cat sat on the mat` | `The fat cat sat on the mat` |

Lazy Matching:

| Regex | String | Result |
| --- | --- | --- |
| `5+?` | `2555556662` | `5` |
| `.*?at` | `The fat cat sat on the mat` | `The fat` |
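A minimal sketch of the difference, using Python's `re` module. The digit example uses `5+` rather than `5*` on purpose: a `*` quantifier is also allowed to match nothing, so its lazy form `5*?` returns an empty string here. The variable names are only for illustration.

```python
import re

text = "The fat cat sat on the mat"

# Greedy: .* takes as much as possible, then gives back only what is needed
# for the trailing "at" to match.
greedy = re.search(r".*at", text).group()

# Lazy: .*? takes as little as possible, growing only until "at" can match.
lazy = re.search(r".*?at", text).group()

print(greedy)  # The fat cat sat on the mat
print(lazy)    # The fat

digits = "2555556662"
greedy_fives = re.search(r"5+", digits).group()
lazy_fives = re.search(r"5+?", digits).group()
lazy_star = re.search(r"5*?", digits).group()

print(greedy_fives)     # 55555
print(lazy_fives)       # 5
print(repr(lazy_star))  # '' (zero repetitions already satisfy *?)
```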

Capturing and Non-capturing Group

Parentheses () in a regex do more than delimit the part of the pattern that a quantifier acts on: they also create a capturing group, whose matched text is returned with the match as a “sub-match”.

| Regex | String | Result |
| --- | --- | --- |
| `([a-z0-9_-]+)@([a-z]+)\.com` | `123abc@test.com` | `123abc@test.com`; group 1: `123abc`; group 2: `test` |

You have access to the whole match (in C#, via Match.Value) and to each of its groups (via the Match.Groups property, a GroupCollection in which index 0 is the whole match and the capturing groups are numbered from 1), because you used () to capture them.

Capturing is not free: the engine has to record every sub-match, which costs some time and memory. If you do not need the grouping information, use a non-capturing group, written (?:...):

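| Regex | String | Result |
| --- | --- | --- |
| `(?:[a-z0-9_-]+)@(?:[a-z]+)\.com` | `123abc@test.com` | `123abc@test.com`; no capturing groups (`Match.Groups.Count == 1`, because group 0 always holds the whole match) |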
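The same contrast can be sketched in Python. Note one difference from C#: `Match.groups()` in Python lists only the capturing groups and leaves out group 0, so it comes back empty for the non-capturing pattern. The input string and variable names are illustrative.

```python
import re

address = "123abc@test.com"

# Capturing groups: each (...) is numbered from 1, left to right.
capturing = re.search(r"([a-z0-9_-]+)@([a-z]+)\.com", address)
print(capturing.group(0))  # 123abc@test.com (group 0 is the whole match)
print(capturing.groups())  # ('123abc', 'test')

# Non-capturing groups: (?:...) still groups for quantifiers,
# but records no sub-match.
non_capturing = re.search(r"(?:[a-z0-9_-]+)@(?:[a-z]+)\.com", address)
print(non_capturing.group(0))  # 123abc@test.com
print(non_capturing.groups())  # ()
```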

Lookarounds

Lookarounds are zero-width assertions, written like special kinds of non-capturing groups. Sometimes you want to match a pattern only where it is preceded or followed by another pattern, without including that second pattern in the returned match. In such cases, use a lookaround: it excludes the second pattern not only from the match’s groups but also from the match itself.

Take positive lookahead, written (?=...), as an example:

| Regex | String | Result |
| --- | --- | --- |
| `[0-9]+(?=%)` | `It improves by 24%.` | `24` |

The digits 24 match only because a % follows them, but the % character itself sits inside the positive lookahead and is therefore not included in the match result.

The other kinds of lookarounds function similarly: negative lookahead (?!...), positive lookbehind (?<=...), and negative lookbehind (?<!...).
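A short sketch of all four lookarounds in Python. The sample sentence is made up for illustration. The `\b` anchors in the negative cases are an addition: they stop the engine from backtracking to a shorter run of digits that would slip past the assertion.

```python
import re

sentence = "It improves by 24%, from $15 to $18."

# Positive lookahead: digits that are followed by "%".
before_percent = re.findall(r"[0-9]+(?=%)", sentence)

# Negative lookahead: whole numbers that are NOT followed by "%".
not_before_percent = re.findall(r"\b[0-9]+\b(?!%)", sentence)

# Positive lookbehind: digits that are preceded by "$".
after_dollar = re.findall(r"(?<=\$)[0-9]+", sentence)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
not_after_dollar = re.findall(r"(?<!\$)\b[0-9]+", sentence)

print(before_percent)      # ['24']
print(not_before_percent)  # ['15', '18']
print(after_dollar)        # ['15', '18']
print(not_after_dollar)    # ['24']
```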

Character Escape

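The characters . $ ^ { [ ( | ) * + ? \ have special meanings. To match one of them literally, escape it with a backslash or place it in a character set []. Inside a character set, \ must still be escaped, and ^ must not come first, where it negates the set.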
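A minimal Python sketch of both escaping routes, plus `re.escape`, which builds a pattern that matches arbitrary text literally. The sample strings are made up for illustration.

```python
import re

# Unescaped, "." matches any character (except a newline by default).
unescaped_hit = re.search(r"3.50", "3x50")

# Escaped with a backslash, it matches only a literal dot.
escaped_miss = re.search(r"3\.50", "3x50")
escaped_hit = re.search(r"3\.50", "Total: 3.50")

print(unescaped_hit is not None)  # True
print(escaped_miss is None)       # True
print(escaped_hit.group())        # 3.50

# Inside a character set, these metacharacters are taken literally.
specials = re.findall(r"[.$*+?()]", "a.b$c*(d)")
print(specials)  # ['.', '$', '*', '(', ')']

# re.escape escapes the special characters for you.
literal = re.escape("1+1=2?")
found = re.search(literal, "Is 1+1=2? Yes.")
print(found.group())  # 1+1=2?
```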

Regex Flags

Regex flags control options for regex operations (e.g., whether matching is case-sensitive or case-insensitive). There are two ways to apply flags in C#: pass RegexOptions values to the Regex constructor or method, or write them inline in the pattern. Both are discussed below.
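Python's `re` module offers the same two routes: a flags argument, or an inline group at the start of the pattern. A minimal sketch, with made-up sample text:

```python
import re

text = "Regex REGEX regex"

# Way 1: pass the flag as an argument.
by_argument = re.findall(r"regex", text, re.IGNORECASE)

# Way 2: write the flag inline, at the start of the pattern.
by_inline = re.findall(r"(?i)regex", text)

print(by_argument)  # ['Regex', 'REGEX', 'regex']
print(by_inline)    # ['Regex', 'REGEX', 'regex']

# Flags are bit flags, so several can be combined with |.
lines = "first line\nsecond line"
combined = re.findall(r"^[a-z]+ LINE$", lines, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['first line', 'second line']
```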

Regex in Python

import re
re.search(pattern, string)    # returns a match object representing the first occurrence of pattern within string
re.sub(pattern, repl, string) # substitutes all matches of pattern within string with repl
re.fullmatch(pattern, string) # returns a match object, requiring that pattern matches the entirety of string
re.match(pattern, string)     # returns a match object, requiring that string starts with a substring that matches pattern
re.findall(pattern, string)   # returns a list of strings representing all matches of pattern within string, from left to right
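The five functions above can be compared on one small input. This is only a sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat bat cat"

# search: first occurrence anywhere in the string.
first = re.search(r"[bc]at", text)
print(first.group(), first.span())  # cat (0, 3)

# sub: replace every match.
replaced = re.sub(r"cat", "dog", text)
print(replaced)  # dog bat dog

# fullmatch: the pattern must cover the entire string.
whole_fail = re.fullmatch(r"cat", text)
whole_ok = re.fullmatch(r"(?:[bc]at ?)+", text)
print(whole_fail, whole_ok is not None)  # None True

# match: the pattern must match at the start of the string.
prefix_fail = re.match(r"bat", text)
prefix_ok = re.match(r"cat", text)
print(prefix_fail, prefix_ok.group())  # None cat

# findall: all matches, left to right.
all_hits = re.findall(r"[bc]at", text)
print(all_hits)  # ['cat', 'bat', 'cat']

# If the pattern has a capturing group, findall returns the group instead.
group_hits = re.findall(r"([bc])at", text)
print(group_hits)  # ['c', 'b', 'c']
```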

Regex in C#

!!! note Dependencies
    Namespace: `System.Text.RegularExpressions`

    Assembly: `System.Text.RegularExpressions.dll`

Important Classes

Classes:

Enums:

Use Regex Class

!!! warning Untested Code
    This snippet has not been tested and is for illustration of concepts only.

using System;
using System.Text.RegularExpressions;

//Test string
string str1 = "123_abc|ABC!789@a2K";

//A Regex object stores a regular expression pattern and regex options.
//If options are not specified, default options are used.
//Remember, RegexOptions is a flags enum, so options can be combined with |.
Regex reg = new Regex(@"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Match rlt = reg.Match(str1);
Console.WriteLine(rlt.Value); //abc

//Regex options can also be specified inline, using (?i) rather than the non-capturing group (?:i):
Regex reg2 = new Regex(@"(?i)[a-z]+"); //case-insensitive match
Regex reg3 = new Regex(@"(?-i)[a-z]+"); //case-sensitive match; the "-" turns the option off
Regex reg4 = new Regex(@"(?-i)[a-z]+(?i)[k-n]+"); //case-sensitive, then case-insensitive match (switch off and on)
Regex reg5 = new Regex(@"(?is-m:expression)"); //set multiple options in one go: i and s on, m off, for this group only

//You can also specify the character position in the input string at which to start the search.
Match rlt2 = reg.Match(str1, 7);
Console.WriteLine(rlt2.Value); //ABC

//Get the next match
Match rlt3 = rlt2.NextMatch();
Console.WriteLine(rlt3.Value); //a

//Use Matches to get and print all matches
MatchCollection allRlt = reg.Matches(str1);
foreach (Match m in allRlt){ //"m" avoids clashing with the rlt variable declared above
  Console.WriteLine(m.Value); //prints abc, ABC, a, K
}

//Split the string, using the pattern as the delimiter
Regex splitReg = new Regex(@"[_|!@]", RegexOptions.ExplicitCapture);
string[] strs = splitReg.Split(str1);
foreach (string part in strs){
  Console.WriteLine(part); //prints 123, abc, ABC, 789, a2K
}

//Replace every run of digits with the letter Z
Regex digitReg = new Regex(@"[0-9]+");
string newStr = digitReg.Replace(str1, "Z");
Console.WriteLine(newStr); //Z_abc|ABC!Z@aZK

The Regex class also offers static versions of these methods. In that case, pass the regular expression pattern and the regex options as parameters to the static method, such as Regex.Match().

Match rlt_alt = Regex.Match(str1, @"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Console.WriteLine(rlt_alt.Value); //abc

Using Groups

!!! warning Not My Code
    This snippet is not my work. It is modified from Microsoft’s official documentation.

string pattern = @"(\b(\w+?)[,:;]?\s?)+[?.!]";
string input = "This is one sentence. This is a second sentence.";

Match match = Regex.Match(input, pattern);
if(!match.Success) return;
Console.WriteLine("Match: " + match.Value); //Match: This is one sentence.
int groupCtr = 0;
foreach (Group group in match.Groups)
{
  groupCtr++;
  Console.WriteLine("Group {0}: '{1}'", groupCtr, group.Value);
  int captureCtr = 0;
  foreach (Capture capture in group.Captures)
  {
    captureCtr++;
    Console.WriteLine("   Capture {0}: '{1}'", captureCtr, capture.Value);
  }
}
//Prints (groupCtr is 1-based, so "Group 1" is match.Groups[0], the whole match):
//Group 1: 'This is one sentence.'
//   Capture 1: 'This is one sentence.'
//Group 2: 'sentence'
//   Capture 1: 'This '
//   Capture 2: 'is '
//   Capture 3: 'one '
//   Capture 4: 'sentence'
//Group 3: 'sentence'
//   Capture 1: 'This'
//   Capture 2: 'is'
//   Capture 3: 'one'
//   Capture 4: 'sentence'
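For comparison, Python's built-in `re` module records only the last capture of a repeated group, so it has no counterpart to Group.Captures. A minimal sketch that runs the same pattern; the follow-up `findall` is just one possible workaround and is not part of the snippet above.

```python
import re

pattern = r"(\b(\w+?)[,:;]?\s?)+[?.!]"
text = "This is one sentence. This is a second sentence."

match = re.search(pattern, text)
print(match.group(0))  # This is one sentence.

# Each group keeps only its LAST capture; earlier iterations are discarded.
print(match.group(1))  # sentence
print(match.group(2))  # sentence

# Workaround (an assumption, not from the C# example): run a second,
# simpler pattern over the matched text to recover every word.
words = re.findall(r"\w+", match.group(0))
print(words)  # ['This', 'is', 'one', 'sentence']
```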

When to Use Which?

-- Yu Long
Published on Aug 10, 2021, PDT
Updated on Aug 10, 2021, PDT