Demystifying Regular Expressions

Developing regular expressions in .NET

The ability to perform pattern-matching operations on text is a skill that is highly useful to any programmer. Whether you are creating a routine to validate data entered into a form, performing parsing and mining on data sets, or searching for sequence similarities in the human genome, chances are that the ability to construct a regular expression will be of great value to you. A single regular expression can often be used to create the same pattern-matching functionality that would otherwise require a lengthy subroutine.

Yet despite these apparent benefits, many .NET developers find regular expressions daunting because they have a syntax based on Perl 5 regular expressions and are thus somewhat different from the typical .NET language constructs. In this article I will discuss the basic syntax of regular expressions and how they operate in order to demystify them and allow you to begin to tap into their potential.

Basic Regular Expression Syntax
In their most basic form, regular expressions find character-by-character matches, if any, within a string. In other words, a regular expression consisting of "123" would match the "123" portion of the string "abc123", but would not yield a match for the string "abcdef". It is also important to remember that regular expression matching is case sensitive by default, and as a result "abc" will not match "ABC". While such literal matching can be beneficial under certain circumstances, regular expressions are capable of expressing much more sophisticated patterns for string comparison. For example, if we want to match either "123" or "abc", we can easily accomplish this using alternation, which is invoked by placing a "|" between the alternative patterns. Thus the regular expression "123|abc" will produce a match for either "123" or "abc", depending on which it encounters first.
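As a minimal sketch of these three ideas, the console program below uses the static Regex.IsMatch and Regex.Match methods from System.Text.RegularExpressions. The module name and the sample strings are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LiteralAndAlternationDemo
    Sub Main()
        ' A literal pattern matches the same characters anywhere in the input.
        Console.WriteLine(Regex.IsMatch("abc123", "123"))   ' True
        Console.WriteLine(Regex.IsMatch("abcdef", "123"))   ' False

        ' Matching is case sensitive unless RegexOptions.IgnoreCase is supplied.
        Console.WriteLine(Regex.IsMatch("ABC", "abc"))                           ' False
        Console.WriteLine(Regex.IsMatch("ABC", "abc", RegexOptions.IgnoreCase))  ' True

        ' Alternation: the result is whichever alternative occurs first in the input.
        Console.WriteLine(Regex.Match("xx abc 123", "123|abc").Value)            ' abc
    End Sub
End Module
```

Note that VB.NET string literals do not treat the backslash as an escape character, so patterns can be typed exactly as they appear in the text.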

It is also possible to declare a list of possible matches for a given character position by placing the acceptable characters between square brackets. Accordingly, a regular expression consisting of "[aA]bc" could match either "abc" or "Abc". More complex patterns can be made by taking advantage of subpatterns, which are indicated by placement between parentheses. Consider the regular expression pattern "ab(cd|CD)". The presence of the subpattern "cd|CD" enables our expression to match either "abcd" or "abCD". Parentheses also capture the substrings that match their enclosed subpatterns. This functionality will be illustrated in a later section.
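The two patterns from this paragraph can be tried against a few candidate strings. This is only a sketch; the module name and the candidates are chosen for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ClassAndSubpatternDemo
    Sub Main()
        ' [aA] accepts either character in the first position only.
        For Each candidate As String In {"abc", "Abc", "ABC"}
            Console.WriteLine("{0}: {1}", candidate, Regex.IsMatch(candidate, "[aA]bc"))
        Next

        ' The subpattern (cd|CD) confines the alternation to the last two characters.
        For Each candidate As String In {"abcd", "abCD", "abcD"}
            Console.WriteLine("{0}: {1}", candidate, Regex.IsMatch(candidate, "ab(cd|CD)"))
        Next
    End Sub
End Module
```

In each loop the first two candidates match and the third does not, because neither pattern permits mixed case in the remaining positions.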

Quantifiers
At this point we have surveyed the syntactic structures that supply the base functionality of regular expression patterns, but there are several other structures that are beneficial to understand. The first of these is the quantifier. Quantifiers are useful when we want to match repetitions of the same character or string of characters. For example, if we sought to match 100 consecutive instances of "abc", we could use the expression "(abc){100}". It is important to note that a quantifier applies only to the unit that directly precedes it. Thus, if we did not use parentheses to specify "abc" as a unit, the expression "abc{100}" would instead match "ab" followed by 100 "c"s. The different types of quantifiers available are summarized in Table 1.
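To see how the quantified unit changes the result, the sketch below uses a count of 3 instead of 100 so the test strings stay readable. The module name and inputs are illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    Sub Main()
        ' With parentheses, {3} repeats the whole group "abc".
        Console.WriteLine(Regex.IsMatch("abcabcabc", "(abc){3}"))  ' True
        Console.WriteLine(Regex.IsMatch("abccc", "(abc){3}"))      ' False

        ' Without parentheses, {3} repeats only the "c" that precedes it.
        Console.WriteLine(Regex.IsMatch("abcabcabc", "abc{3}"))    ' False
        Console.WriteLine(Regex.IsMatch("abccc", "abc{3}"))        ' True
    End Sub
End Module
```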

Predefined Subpatterns
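Predefined subpatterns are useful because they allow us to signify a grouping of possible characters with a single specifier. For example, "\d" represents any single digit, and "\w" represents any word character (a letter, a digit, or an underscore). A summary of all possible predefined subpatterns can be found in Table 2. These subpatterns are especially useful when combined with quantifiers. For example, the regular expression "\d+\.\d+|\d+" could be used to match any unsigned number, either integer or real. The decimal point is escaped with a backslash because an unescaped "." matches any character, and the longer alternative is listed first because alternatives are tried from left to right, so listing "\d+" first would stop at the integer part of a real number.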
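The following sketch pulls the numbers out of a sentence with a pattern that escapes the decimal point and tries the real-number alternative before the integer one. The pattern, the sample sentence, and the module name are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PredefinedSubpatternDemo
    Sub Main()
        Dim sample As String = "Order 42 weighs 3.75 kg"

        ' \. is a literal decimal point; the real-number form is tried before the integer form.
        For Each m As Match In Regex.Matches(sample, "\d+\.\d+|\d+")
            Console.WriteLine(m.Value)   ' prints 42, then 3.75
        Next

        ' \w covers letters, digits, and the underscore, but not punctuation such as "-".
        Console.WriteLine(Regex.IsMatch("_", "\w"))   ' True
        Console.WriteLine(Regex.IsMatch("-", "\w"))   ' False
    End Sub
End Module
```

If the alternatives are reversed to "\d+|\d+\.\d+", the same input yields 42, 3, and 75, because the integer alternative succeeds before the real-number alternative is ever tried.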

Regular Expressions in the .NET Framework
Within the .NET Framework, regular expression functionality lives in the System.Text.RegularExpressions namespace. For purposes of this article - and for most common matching routines - the main classes to be concerned with are the Match and MatchCollection classes. The Match class is used to store the results of a single regular expression match. If multiple matches are made by iterating through a string in a left-to-right manner, then these matches are stored in a MatchCollection object.

In essence, a MatchCollection contains a list of Match objects that can be individually referenced by an index value. The Match class has a variety of properties that store useful information about a matching operation, such as whether the match was a success or not, what portion of the string matched, and the length of the matching substring. These properties are summarized in Table 3.
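As a brief sketch of how these two classes fit together, the program below collects every run of digits in a string and inspects the first result. Success, Value, Index, and Length are properties of the Match class; the input string and module name are illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchPropertiesDemo
    Sub Main()
        ' Matches scans left to right and returns every non-overlapping match.
        Dim found As MatchCollection = Regex.Matches("abc123def456", "\d+")
        Console.WriteLine(found.Count)      ' 2

        ' Individual Match objects are referenced by index.
        Dim first As Match = found(0)
        Console.WriteLine(first.Success)    ' True
        Console.WriteLine(first.Value)      ' 123
        Console.WriteLine(first.Index)      ' 3 (zero-based position in the input)
        Console.WriteLine(first.Length)     ' 3

        Console.WriteLine(found(1).Value)   ' 456
    End Sub
End Module
```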

The Regex class in this namespace provides several methods that are used to perform matching operations as well as string-replacement operations. These methods are summarized in Table 4. In the following sections of this article we will code a simple VB.NET application that demonstrates the syntax of string matching, string replacement, and substring capturing. To begin this application, let's take five text boxes and a button and drop them on our form, as shown in Figure 1.

Matching and Capturing Operations
The form shown in Figure 1 is typical of a simple form that might be used for the entry of client data into an application. The form contains text boxes designed to accept first and last names, a phone number, and a client ID number. We are interested in verifying whether or not the phone number entered is valid. We are going to accomplish this by first reading the phone number into a variable called PhNum. Next we also declare a variable of Match type named ValidNum, as shown in the code found in Listing 1.

We then invoke the Match method of the Regex class to compare the string PhNum to the regular expression pattern provided. The results of this matching operation are stored in the variable ValidNum. If we had instead needed to fill a MatchCollection with multiple phone numbers, we would have used the Matches method. This pattern should identify most common ways of writing U.S.-based phone numbers that include area codes. The regular expression allows for optional parentheses around the area code and also allows either a hyphen or a space to separate the area code from the rest of the phone number.

It is important to note that a backslash must be placed before each parenthesis in order for the parenthesis to be considered part of the pattern. When a backslash is placed before a character with special meaning, e.g., (, ), [, ], ?, *, +, or ., it tells the regular expression engine to ignore the special meaning and treat it as a normal character. The expression also checks that the area code contains three digits and that the remainder of the number is in the form of three digits followed by a "-" and four more digits. If you look closely at the regular expression you will notice that there is also a set of parentheses that distinguishes the area code as a subpattern and another set that distinguishes the rest of the number as a subpattern. The purpose of these extra sets of parentheses is to perform substring captures, as will be demonstrated below.
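The difference between an escaped and an unescaped parenthesis can be sketched with a three-digit area code. The sample strings and module name are illustrative and are not taken from Listing 1.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapingDemo
    Sub Main()
        ' Escaped: the parentheses must literally appear in the input.
        Console.WriteLine(Regex.IsMatch("(212)", "\(212\)"))     ' True
        Console.WriteLine(Regex.IsMatch("212", "\(212\)"))       ' False

        ' Unescaped: the parentheses only group, so just the digits are matched.
        Console.WriteLine(Regex.Match("(212)", "(212)").Value)   ' 212
    End Sub
End Module
```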

After we perform our matching operation we analyze the result with a conditional statement that determines whether the provided phone number matches the acceptable phone number pattern. If the match was not successful, our program outputs that the phone number was invalid. If the match was successful, our code reads the ValidNum variable's Value property and prints out that the number is valid (see Figure 2).

In the same output string we also call upon the Groups property of the match. This property stores a listing of all of the substrings matched by our capturing parentheses in order of appearance from left to right. An index value is used to reference the individual substring matches. Index 0 always contains the overall match for the entire regular expression, index 1 contains the match for the first set of capturing parentheses, index 2 the second set, and so on. In this way we are able to use our regular expression to isolate both the area code and the remainder of the phone number and print them out through each group's Value property.
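Listing 1 is not reproduced here, so the following is a sketch under stated assumptions and not the original code. The pattern is a hypothetical one written to fit the description above, the "^" and "$" anchors (start and end of input) are an addition, and a console function stands in for the form. The names PhNum and ValidNum follow the text; Describe and PhonePattern are made up.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PhoneNumberDemo
    ' Hypothetical pattern (an assumption, not the one from Listing 1):
    ' optional parentheses, a captured 3-digit area code, a hyphen or space,
    ' then a captured "ddd-dddd" remainder.
    Const PhonePattern As String = "^\(?(\d{3})\)?[- ](\d{3}-\d{4})$"

    Function Describe(ByVal PhNum As String) As String
        Dim ValidNum As Match = Regex.Match(PhNum, PhonePattern)
        If Not ValidNum.Success Then
            Return PhNum & " is not a valid phone number"
        End If
        ' Groups(0) is the whole match; Groups(1) and Groups(2) are the captures.
        Return ValidNum.Value & " is valid: area code " & ValidNum.Groups(1).Value &
               ", number " & ValidNum.Groups(2).Value
    End Function

    Sub Main()
        Console.WriteLine(Describe("(212) 555-0142"))
        Console.WriteLine(Describe("212-555-0142"))
        Console.WriteLine(Describe("555-0142"))
    End Sub
End Module
```

Because each parenthesis is optional on its own, this sketch would also accept an unbalanced input such as "(212-555-0142"; a stricter pattern would need separate alternatives for the two forms.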

Substitution
In the final section of this article I will illustrate how to use regular expressions to perform string substitutions. Within the .NET Framework this is accomplished through the Replace method. This method accepts a starting string, a regular expression pattern, and a replacement string. The method searches the starting string for portions that match the provided pattern. Each matching portion is removed from the starting string and the replacement string is substituted in its place. If no match is found, the starting string is returned unchanged.

In the following example we assume that the client ID number consists of several letters followed by a series of five digits. In order to maintain client privacy we are going to replace the five digits of the ID number with XXXXX. The snippet below demonstrates how we would accomplish this. The client ID number is stored in the IDNum variable, which is passed into our replacement operation. The regular expression provided will match the sequence of five consecutive digits in the ID number. The last string in the method call, "XXXXX", is the replacement string that will be inserted in place of the five digits in our form's display.

TextBox5.Text = Regex.Replace(IDNum, "\d{5}", "XXXXX")
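Outside of the form, the same call can be tried in a small console program. This is only a sketch; the ID values and the module name are invented for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    Sub Main()
        Dim IDNum As String = "ABC12345"   ' hypothetical client ID

        ' The five digits are masked; the letters are left untouched.
        Console.WriteLine(Regex.Replace(IDNum, "\d{5}", "XXXXX"))              ' ABCXXXXX

        ' Every non-overlapping match is replaced, not just the first one.
        Console.WriteLine(Regex.Replace("AB12345-CD67890", "\d{5}", "XXXXX"))  ' ABXXXXX-CDXXXXX

        ' With no match, the input comes back unchanged.
        Console.WriteLine(Regex.Replace("ABC123", "\d{5}", "XXXXX"))           ' ABC123
    End Sub
End Module
```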

Conclusion
This article sought to provide the basics behind regular expression usage and to alert readers to the useful roles regular expressions may play in their programming tasks. Of course, the power and uses of regular expressions go far beyond what can be expressed in a single article. However, this article should provide a good starting point from which further skills can be developed.

More Stories By Christopher Frenz

Christopher Frenz is the author of "Visual Basic and Visual Basic .NET for Scientists and Engineers" (Apress) and "Pro Perl Parsing" (Apress). He is a faculty member in the Department of Computer Engineering at the New York City College of Technology (CUNY), where he performs computational biology and machine learning research.
