Patterns of a Different Kind

Much has been made of .NET's language-neutral features and how programmers can choose from a wide variety of languages ranging from C# and VB.NET to niche languages such as Python and Eiffel. But there is one dialect that all language variants speak - and that is the language of regular expressions.

Regular expressions grew out of work done by Hartford, Connecticut-born mathematician Stephen Kleene on what he termed "the algebra of regular sets." His theoretical ideas later formed the basis for early text-manipulation tools on the Unix platform and were further popularized by tight integration of a regular expression engine into the Perl programming language.

Most programmers who have been coding for any length of time have had occasion to use regular expressions in some fashion. Their terse, somewhat cryptic syntax can be intimidating at first, but with a basic understanding of the most common pattern constructs, regular expressions can become an indispensable part of the .NET programmer's toolkit. The richly expressive nature of regular expressions allows programmers to work with patterns and matches within text in a powerful and flexible way. They are ideal for validation of data entry, extracting and replacing substrings, and generating reports for text-based nonrelational data.

This article begins with a general overview of regular expressions. If you're already comfortable with them, feel free to skip over the first section and delve into the .NET implementation details!

Regular Expressions: An Overview
Character Literals
The most elementary pattern is a sequence of character literals. By default, patterns are case sensitive and match any exact occurrence in the target text. For example, the expression the matches twice in The purple lathe was in the tree: once inside lathe and once as the standalone word the. The initial The does not match because of the case sensitivity of the expression. Character literals on their own offer little in the way of power and flexibility, but when they're used in conjunction with the metacharacter constructs described below, the possibilities are extraordinary.

Wildcards
A period designates a wildcard character in regular expression notation. It matches any single character except the new-line character (this behavior is configurable - see Table 1). If the example above were changed to match on wildcard characters, the result would be:

.he.. The purple lathe was in the tree
(matches: "The p", "the w", and "the t")

Notice how the expression now matches on the uppercase T in The, as well as the space character between some of the words. If the intent were to actually match a period in the target, the period in question would need to be escaped by prefixing it with a backslash. For example, .he\.. requires a literal period in the fourth position. This escape mechanism applies to all metacharacters for which a literal match is the intent.

Positional Characters
There are three common positional metacharacters in regular expressions. They are sometimes called anchors and include beginning of line (^), end of line ($), and word boundary (\b). For example:

^.he.. The purple lathe was in the tree

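The caret metacharacter anchors the pattern to the beginning of the target text, so only the leading "The p" matches; the later occurrences that matched .he.. are excluded. The $ and \b metacharacters work in a similar manner, anchoring the pattern to the end of the line and to a word boundary, respectively.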

Character Classes
Character classes include a set of character literals enclosed in square brackets. They can include a list of individual literals such as [abcdef] or a range [a-f]. These two expressions are functionally equivalent and would match any lowercase letter between a and f. If we modify our first example to use a character class, it might look something like:

[Tt]he The purple lathe was in the tree

Notice how The matched this time because the [Tt] character class includes both an upper- and lowercase T. Additionally, the caret (^) symbol negates the meaning of a character class when it is the first character inside the brackets. For instance, [^Tt]he matches any single character other than an upper- or lowercase t, followed by he. Most of the commonly used character classes have an escaped shortcut notation to facilitate their use. For instance, \w matches any word character and, for ASCII text, is equivalent to [a-zA-Z_0-9]. For more details on character class shorthand notations, see Table 2.
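The patterns discussed so far can be tried out directly against the sample sentence. The following is a minimal sketch rather than code from the original article; the Find helper is a name invented here for illustration, and it relies on the static Regex.Matches method covered later in the article.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string target = "The purple lathe was in the tree";

// Illustrative helper (not part of the framework): every matched substring for a pattern.
string[] Find(string pattern) =>
    Regex.Matches(target, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] literal   = Find("the");       // inside "lathe" and the standalone word "the"
string[] wildcard  = Find(".he..");     // "The p", "the w", "the t"
string[] anchored  = Find("^.he..");    // only "The p"
string[] charClass = Find("[Tt]he");    // "The", "the" (in lathe), "the"
string[] wholeWord = Find(@"\bthe\b");  // only the standalone word "the"

Console.WriteLine("the     : " + literal.Length + " matches");
Console.WriteLine(".he..   : " + string.Join(" | ", wildcard));
Console.WriteLine("^.he..  : " + string.Join(" | ", anchored));
Console.WriteLine("[Tt]he  : " + charClass.Length + " matches");
Console.WriteLine("\\bthe\\b : " + wholeWord.Length + " match");
```

The word-boundary pattern shows why anchors matter: without \b the literal the also matches inside lathe.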

Grouping and Alternation
Literals and metacharacters can be grouped together by using parentheses to surround a subexpression. These groups can then be used as the basis for more sophisticated expressions. The grouping construct can help improve the readability of regular expressions and also provides a mechanism to capture "submatches." A simple example would be:

(purp)\w The purple lathe purports to turn
(matches: "purpl" and "purpo")

Alternation allows for a choice between two or more combinations of literals and metacharacters and is implemented by using the pipe (|) symbol. The pipe equates to a logical or and is applied as follows:

(purp|lat)\w The purple lathe purports to turn
(matches: "purpl", "lath", and "purpo")

Quantifiers
Quantifiers allow the programmer to specify how many times a given element or subexpression occurs in the pattern. This is one of the more powerful ways that patterns can be honed to match or exclude certain text. Quantifiers are always placed to the immediate right of the literal, metacharacter, or group they qualify. The most basic of these quantifiers is the asterisk (*), which means zero or more occurrences. The plus (+) sign means one or more, and the question mark (?) means zero or one. The following are some examples:

(purp).* The purple lathe purports to turn
(matches: "purple lathe purports to turn")
(purp)\w+ The purple lathe purports to turn
(matches: "purple" and "purports")
(purp)\w? The purple lathe purports to turn
(matches: "purpl" and "purpo")

The *, +, and ? quantifiers are termed abstract quantifiers. There is another class of quantifiers called numeric quantifiers. These use curly braces to specify more exacting patterns. They come in three flavors: {n}, {n,}, and {n,x}, with n and x being integers. The first of these specifies that the element or subexpression occurs exactly n times. The second means at least n times, and the third means at least n times but no more than x times.

(purp).{5} The purple lathe purports to turn
(matches: "purple la" and "purports ")
(purp).{5,7} The purple lathe purports to turn
(matches: "purple lath" and "purports to")
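The grouping, alternation, and quantifier examples above can be checked with a few lines of C#. This is a minimal sketch, not code from the original article; the Find helper is a name invented here for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string target = "The purple lathe purports to turn";

// Illustrative helper (not part of the framework): every matched substring for a pattern.
string[] Find(string pattern) =>
    Regex.Matches(target, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] grouped     = Find(@"(purp)\w");      // "purpl", "purpo"
string[] alternation = Find(@"(purp|lat)\w");  // "purpl", "lath", "purpo"
string[] zeroOrMore  = Find(@"(purp).*");      // greedy: runs to the end of the line
string[] oneOrMore   = Find(@"(purp)\w+");     // "purple", "purports"
string[] zeroOrOne   = Find(@"(purp)\w?");     // "purpl", "purpo"
string[] exactlyFive = Find(@"(purp).{5}");    // "purple la", "purports "
string[] fiveToSeven = Find(@"(purp).{5,7}");  // "purple lath", "purports to"

// Quote each match so that trailing spaces are visible in the output.
foreach (string[] result in new[] { grouped, alternation, zeroOrMore, oneOrMore,
                                    zeroOrOne, exactlyFive, fiveToSeven })
{
    Console.WriteLine(string.Join(", ", result.Select(s => "\"" + s + "\"")));
}
```

Note that the quantifiers are greedy by default: .* consumes as much text as it can, which is why (purp).* produces a single long match.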

Regular Expressions in the .NET Framework
The .NET Framework provides a robust easy-to-use object model for working with and manipulating regular expression matches. The .NET implementation is based on a traditional NFA (Nondeterministic Finite Automaton) engine and is compatible with Perl 5 regular expressions. All the regex-related classes live in the System.Text.RegularExpressions namespace.

Certain behaviors and characteristics of the .NET regular expression engine can be tweaked by specifying regular expression options at object construction time or at the time of method invocation. These options are defined in the RegexOptions enumeration, and several values can be combined with the bitwise OR operator. A list of these options is shown in Table 1.
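As a rough illustration of how options change matching behavior, the sketch below combines two RegexOptions values, IgnoreCase and Multiline, both at construction time and at method-invocation time. The sample text is made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string text = "The first line\nthe second line\nand the third";

// Default behavior: case sensitive, and ^ anchors only at the very start of the input.
int plain = Regex.Matches(text, "^the").Count;

// IgnoreCase and Multiline combined with the bitwise OR operator at construction time.
Regex re = new Regex("^the", RegexOptions.IgnoreCase | RegexOptions.Multiline);
int withOptions = re.Matches(text).Count;

// The same options supplied at method-invocation time through a static overload.
int viaStatic = Regex.Matches(text, "^the",
    RegexOptions.IgnoreCase | RegexOptions.Multiline).Count;

Console.WriteLine("No options: " + plain);
Console.WriteLine("IgnoreCase | Multiline (instance): " + withOptions);
Console.WriteLine("IgnoreCase | Multiline (static): " + viaStatic);
```

With Multiline set, the caret matches at the start of every line rather than only at the start of the text, and IgnoreCase lets the leading The qualify as well.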

The reference documentation lists ten classes in the System.Text.RegularExpressions namespace, but for our purposes we will focus on the three core objects: Regex, Match, and Group. The noncore classes are either collection-based classes of core objects (as is the case with MatchCollection and GroupCollection), reserved for .NET Framework internals code, or beyond the scope of this article. All code samples will be illustrated using the C# language and assume that the System.Text.RegularExpressions namespace has been imported via the using statement to allow for shorthand notation of the regex types. That said, let's take a look at the core .NET regular expression objects!

Regex Object
The Regex class is used to define an immutable regular expression. To use the Regex object, simply create an instance as follows:

Regex re = new Regex("[Tt]ruth");

This statement will initialize and compile the regular expression, which can then be used to test for the existence of a match, capture the matched text, replace substrings within the target string, or split the text into a string array based on a regular expression delimiter.

Simple Matching
Using the Regex object created above we can test for the existence of a match as follows:

if (re.IsMatch("We hold these Truths to be self-evident..."))
    System.Console.WriteLine("It's a match...");
else
    System.Console.WriteLine("Try again...");

The Regex object also contains a handful of static or convenience methods for accomplishing many regex-related tasks without explicitly instantiating a Regex object. The statement above can be rewritten as follows:

if (Regex.IsMatch("We hold these Truths to be self-evident...", "[Tt]ruth"))
    System.Console.WriteLine("It's a match...");
else
    System.Console.WriteLine("Try again...");

This code is somewhat more succinct. Note that the static methods take the target text as the first argument and the pattern as the second. For a complete list of IsMatch() overloaded methods, refer to the .NET Framework documentation.

It is worth noting as we move along that many of the overloaded instance methods of the core .NET regex objects take Int32 arguments. For the matching methods (IsMatch, Match, and Matches), a single Int32 argument represents an offset into the target string that determines where the search begins. The default is the beginning of the target string, or position zero. For example, new Regex("ak").IsMatch("Take me out to the ballgame", 10) would return false, since the only text that would be searched begins at position 10: t to the ballgame.
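The effect of the starting offset can be seen in a short sketch. It uses the instance overloads IsMatch(string, int) and Match(string, int); the variable names are chosen here purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string song = "Take me out to the ballgame";
Regex re = new Regex("ak");

bool fromStart = re.IsMatch(song);      // searches the whole string
bool fromTen   = re.IsMatch(song, 10);  // searches from position 10 onward
string searched = song.Substring(10);   // the text the second call can see

// Match(string, int) follows the same rule: the Int32 is the starting offset.
// The "me" at position 5 is skipped; the "me" at the end of "ballgame" is found.
Match m = new Regex("me").Match(song, 10);

Console.WriteLine("From start: " + fromStart);
Console.WriteLine("From position 10: " + fromTen + " (searched \"" + searched + "\")");
Console.WriteLine("\"me\" found at index " + m.Index);
```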

Capturing Matched Text
The previous examples work great if you just want to test for the existence of a pattern in a string, but what if we actually need to extract that match and do something more meaningful with it? Without getting into too much detail on the Match object, the following technique can be used to assign the match to a string variable:

Regex re = new Regex(@"\d{2}[-/]\d{2}[-/]\d{2,4}");
string dt = re.Match("Meet me on 02-23-2003 at your office").Value;
System.Console.WriteLine("Meeting date: " + dt);

This excerpt of code would result in Meeting date: 02-23-2003 being displayed in the console window. The Match method of the Regex object returns a Match object that in turn exposes a Value property of type string. The regex used here is worth taking a closer look at. Using a combination of character class shorthand expressions (\d), numeric quantifiers ({2}, {2,4}), and custom character classes ([-/]), this expression would have matched on dates in the form of: 02-23-2003, 02/23/2003, 02-23-03, and 02/23/03. This is where the real power of regular expressions becomes evident - when dealing with subtle differences in patterns that are for our purposes semantically the same.

One other item of note is the use of the C# verbatim string literal prefix @. In a regular C# string literal the backslash introduces an escape sequence, so \d would cause a compile error and would have to be written as \\d. The @ prefix turns off escape processing, which lets us write regex escapes such as \d or \s exactly as they appear in the pattern. VB.NET does not require the @, because its string literals do not treat the backslash as an escape character. The @ literal will be used throughout many of the examples in this article.
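A small sketch makes the difference concrete: the two literals below produce the same pattern string, but the verbatim form is easier to read. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string escaped  = "\\d{2}[-/]\\d{2}";  // regular literal: each backslash must be doubled
string verbatim = @"\d{2}[-/]\d{2}";   // verbatim literal: backslashes are taken as-is

bool samePattern = escaped == verbatim;
bool matches = Regex.IsMatch("02-23", verbatim);

Console.WriteLine("Same pattern text: " + samePattern);
Console.WriteLine("Matches 02-23: " + matches);
```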

String Replacement
A common task in many applications is substring replacement. The Replace method of the Regex object (static or instance) can easily accomplish this for us. If, for instance, we wanted to round down account balances so that they were expressed in whole dollars rather than fractional amounts, we could use the following code to do that:

string oldstring = "Beginning balance : $12.34, Ending Balance $6.78";
string newstring = Regex.Replace(oldstring, @"\.\d\d", ".00");
System.Console.WriteLine(newstring);

This would result in Beginning balance : $12.00, Ending Balance $6.00 being written to the console window. The ability to use a regular expression here allows us to abstract our replacement in two very important ways: first, we are only interested in side-by-side numeric characters, and second, only those that follow a decimal point. This task could have been accomplished via manual parsing and procedural techniques but certainly not with one line of code! For a complete list of Replace() overloaded methods, refer to the .NET Framework documentation.

There is one overload of the Replace() method that requires two Int32 arguments. The second of these is the offset parameter discussed earlier. The first Int32 argument in this instance represents a count. That is, how many times should the replacement occur. For example, re.Replace(x, y, 5, 10) would replace matches in x, with string y a maximum of 5 times starting at position 10. Similar overload method signatures exist for most core .NET regex objects.
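To make the count and offset arguments concrete, here is a minimal sketch using a made-up input string. It calls the instance overloads Replace(string, string), Replace(string, string, int), and Replace(string, string, int, int).

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
Regex digit = new Regex(@"\d");
string input = "a1b2c3d4e5";

string all          = digit.Replace(input, "#");        // every digit replaced
string firstTwo     = digit.Replace(input, "#", 2);     // at most 2 replacements
string twoFromThree = digit.Replace(input, "#", 2, 3);  // at most 2, starting at position 3

Console.WriteLine(all);           // a#b#c#d#e#
Console.WriteLine(firstTwo);      // a#b#c3d4e5
Console.WriteLine(twoFromThree);  // a1b#c#d4e5
```

In the last call the digit at position 1 lies before the starting offset, so it is left untouched.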

String Splitting
An occasional programming task is the parsing of a string or document based on some delimiter(s). In many cases the parsed output is then saved to some relational data store or displayed on a screen for further manipulation or reporting purposes. The Split() method of the Regex object can do this for us. For instance, if we needed to parse a phone number into its constituent parts, we could do so as follows:

string[] parts = Regex.Split("860 555-4321", @"\.|\s+|-");
for (int x = 0; x < parts.Length; x++)
{
    System.Console.WriteLine("Phone Part " + (x + 1) + ": " + parts[x]);
}

Console Results:

Phone Part 1: 860
Phone Part 2: 555
Phone Part 3: 4321

The key here is to once again point out how the ability to use a regular expression to match on the delimiter allows us to do so in a much more flexible and succinct way. The expression \.|\s+|- will parse the following phone numbers in the exact same manner:

860.555.4321
860-555-4321
860 555.4321
860.555-4321
(Any other permutation that is delimited
by spaces, hyphens, or periods)

Once again, for a complete list of Split() overloaded methods, refer to the .NET Framework documentation.
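The claim that differently delimited numbers split identically is easy to verify. The following sketch runs the same pattern over several formats; the array and variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string pattern = @"\.|\s+|-";
string[] numbers =
{
    "860 555-4321", "860.555.4321", "860-555-4321", "860 555.4321", "860.555-4321"
};

// Split each number and re-join its parts with commas so the results can be compared.
string[] joined = numbers
    .Select(n => string.Join(",", Regex.Split(n, pattern)))
    .ToArray();

for (int i = 0; i < numbers.Length; i++)
{
    Console.WriteLine(numbers[i] + " -> " + joined[i]);
}
```

Each input yields the same three parts: 860, 555, and 4321.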

Match Object
The Match object represents a single match for a given regular expression. What this really means becomes clearer if we take a look at a previous example:

(purp)\w The purple lathe purports to turn

Applying the regular expression (purp)\w to the target text results in two distinct matches: purpl and purpo. In .NET, each one of these is defined by a Match object. Match objects are immutable and have no public constructors. There are several means of getting a reference to an individual Match object: the Regex.Match() static method, the reObj.Match() instance method, the matchObj.NextMatch() instance method, or by traversing the MatchCollection returned by the Regex.Matches() static or reObj.Matches() instance methods.

The Match object exposes some very useful properties that allow the programmer to go beyond simple data validation techniques or string replacements. Table 3 shows the possibilities:

Listing 1 shows an example of the use of the Match object. We simply use the static Matches method of the Regex object to return a MatchCollection. Using the foreach construct, we then loop through the individual Match objects and display their properties. .NET's ability to capture each match and expose it as an easy-to-use object opens up many possibilities for custom text processing. There is one important property of the Match object that we have not yet discussed: the Groups property, which is covered in the next section.
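A loop of this kind might look like the sketch below. It is not the article's Listing 1, only a minimal illustration that reuses the earlier (purp)\w example and reads the Value, Index, and Length properties; it also shows NextMatch() as an alternative way to walk the matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string target = "The purple lathe purports to turn";
var lines = new List<string>();

// Traverse the MatchCollection returned by the static Matches method.
foreach (Match m in Regex.Matches(target, @"(purp)\w"))
{
    lines.Add(m.Value + " at index " + m.Index + ", length " + m.Length);
}
foreach (string line in lines)
{
    Console.WriteLine(line);
}

// NextMatch() walks the same matches one at a time.
Match first = Regex.Match(target, @"(purp)\w");
Match second = first.NextMatch();
Match third = second.NextMatch();  // no third match, so Success is false
Console.WriteLine("Third match found: " + third.Success);
```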

Group Object
A Group object contains match information about a single capturing group. Essentially, groups are submatches within matches. What does all of this actually mean, you may ask? If we look back at the regex overview earlier, there was a discussion about grouping literals and metacharacters together by enclosing them within parentheses. In the world of regular expressions, what this logically does is create what is called a capturing group. Writing expressions in this manner allows us to work with matches in a more granular way. For instance, take a look at the following example:

(\d\d?)-(\d\d?)-(\d{2,4})

Kaitlyn's birthday is 11-14-1998 and Ryan's is 09-11-2001.

As we would expect, the expression matches on both dates. Each of these is represented by a discrete Match object. Each of these Match objects also exposes a Groups property, which is a collection of Group objects. The Group object exposes the same properties as the Match object: Success, Value, Length, and Index (in fact, the Match object is derived from the Group object!). In this case, the regex (\d\d?)-(\d\d?)-(\d{2,4}) defines three capturing groups (parenthetical expressions). The first Match object would have a Groups collection that looks like:

match1.Groups[0].Value equals 11-14-1998
match1.Groups[1].Value equals 11
match1.Groups[2].Value equals 14
match1.Groups[3].Value equals 1998

The first thing to note is Groups[0]. Groups[0] is the matching text in its entirety and is logically equivalent to match.Value. It is Groups[1] through Groups[n] that are of the most interest to us. These allow us to gain access to submatches within our matches simply by indexing into the Groups collection using the ordinal position of the capturing group in the regex. In this example the use of capturing groups not only allows us to extract dates from a string but also to easily parse the month, day, and year! What could be easier? Well, actually, there is a technique that can be employed to make programming and working with groups somewhat easier and less brittle than using captured group ordinal positions. We can use something called named groups.

Named groups are implemented by decorating the capturing groups with a name! The most common way to do this is with the following construct: ?<name>. So, if we modify our previous regex to use named groups, it would look like:

(?<month>\d\d?)-(?<day>\d\d?)-(?<year>\d{2,4})

What this allows us to do is to index into the Groups collection using the group name rather than ordinal position, as follows:

match1.Groups["month"].Value equals 11
match1.Groups["day"].Value equals 14
match1.Groups["year"].Value equals 1998

match2.Groups["month"].Value equals 09
match2.Groups["day"].Value equals 11
match2.Groups["year"].Value equals 2001
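Putting the pieces together, the named-group lookups above could be exercised as follows. This is a minimal sketch; the year/month/day output format and the variable names are choices made here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
string text = "Kaitlyn's birthday is 11-14-1998 and Ryan's is 09-11-2001.";
Regex re = new Regex(@"(?<month>\d\d?)-(?<day>\d\d?)-(?<year>\d{2,4})");

// Rebuild each date as year/month/day by indexing the Groups collection by name.
var dates = new List<string>();
foreach (Match m in re.Matches(text))
{
    dates.Add(m.Groups["year"].Value + "/" +
              m.Groups["month"].Value + "/" +
              m.Groups["day"].Value);
}
foreach (string d in dates)
{
    Console.WriteLine(d);
}

// Named groups can still be reached by ordinal position.
Match firstDate = re.Match(text);
string whole = firstDate.Groups[0].Value;     // the entire match
string byNumber = firstDate.Groups[1].Value;  // same group as Groups["month"]
Console.WriteLine(whole + " -> month " + byNumber);
```

Using names means the lookups keep working if another capturing group is later added to the front of the expression.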

Advanced .NET: The MatchEvaluator Delegate
Earlier in this article we discussed the Replace() method of the Regex object. This provides an extremely powerful mechanism for finding and replacing strings in a block of text. This works fine when the replacement string is something that can be expressed in terms of a regular expression. As flexible as they are, regular expressions do have semantic and syntactical limitations. What if we could tap into a full-featured language such as C# or VB.NET to perform calculations on or massage our matches before replacing text? Well, we can, by supplying our own MatchEvaluator delegate to perform custom processing. A .NET delegate is fundamentally a type-safe function pointer.

Say, for instance, we have a document that has fourth-quarter sales results for our company. All figures are in U.S. dollars, but we need to have our Moscow office look them over before presenting to shareholders. In order to facilitate this, we will convert U.S. dollars to Russian rubles in all of our documents. We can do this using the code shown in Listing 2.

Executing this code would result in newdoc being set equal to ... 438,522 rubles for services and 566,574 rubles in product .... The only requirement placed on our custom method is that its signature match the MatchEvaluator delegate definition. This means that it must take a Match object as an argument and return a string. Beyond this, we can implement the delegate however we choose.
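The general shape of such a delegate can be sketched as follows. This is not the article's Listing 2: the sample sentence, the ToRubles method name, and the exchange rate of 30 rubles per dollar are all assumptions made up for illustration, so the figures differ from those quoted above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.

// Hypothetical exchange rate, chosen only to keep the arithmetic easy to follow.
const decimal RublesPerDollar = 30m;

// Matches the MatchEvaluator signature: takes a Match and returns the replacement string.
string ToRubles(Match m)
{
    decimal dollars = decimal.Parse(m.Groups["amount"].Value, CultureInfo.InvariantCulture);
    decimal rubles = dollars * RublesPerDollar;
    return rubles.ToString("N0", CultureInfo.InvariantCulture) + " rubles";
}

string doc = "We booked $100 for services and $250 in product.";

// Each dollar amount found by the pattern is handed to ToRubles for conversion.
string newdoc = Regex.Replace(doc, @"\$(?<amount>\d+)", new MatchEvaluator(ToRubles));

Console.WriteLine(newdoc);
```

Because the evaluator is ordinary C# code, it could just as easily look up a live rate, round differently, or leave some matches unchanged by returning m.Value.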

MatchEvaluators certainly open up a whole host of possibilities when it comes to replacing strings within strings. The Match object gives us access to the Groups collection, as illustrated in the example. So, go now and Replace() with abandon!

Conclusion
The regular expression types in the .NET Framework make working with patterns and matches within text a breeze. The abstraction of common regular expression concepts into robust, easy-to-understand classes such as Regex, Match, and Group affords all .NET programmers the ability to slice, dice, and dissect text in ways they only dreamed about a year ago.

The types in the System.Text.RegularExpressions namespace are full-fledged members of the .NET Framework class library - not adjuncts or an afterthought. Hopefully, .NET regular expressions will help to make the arcane text parsing tasks possible and the mundane ones more fun!

More Stories By Mike Morris

Mike Morris is cofounder of Developer Box LLC, a software development and consulting company located in Middletown, CT. He has been developing enterprise solutions for over 12 years on a variety of platforms.
