Chapter 1

Pattern Matching Library

This chapter describes the Pattern Matching library, which implements the general notion of searching for and matching patterns in a string. The library can be used for a wide range of applications, from simple pattern matching to the search-and-replace facilities needed in editors and text-processing systems.

1. Overview

The Pattern Matching library consists of classes for specialized and optimized matchers or searchers, each of which has its own strengths. It is worthwhile to get an overview of the classes provided in order to find the matcher that is right for your specific task.

Figure 15-1: Pattern Matching Class Hierarchy


2. Overall Structure of the Library

The general notions and structure of the Pattern Matching library are explored here.

2.1. Searching for Patterns

    The short interfaces of the classes discussed in this chapter appear in Chapter 2.

The most abstract description of an object capable of searching for a specific pattern is encapsulated in class SEARCHER.

A searcher is an object containing some pattern. There are no restrictions placed on the pattern, except that it can be stored in a string. A searcher can be made case-insensitive in searching for the pattern, but by default, it will be case-sensitive. There are also features that toggle case-sensitivity. A searcher is capable of searching in different input strings for the pattern it contains. It is bound to this pattern until a subsequent call to one of the make routines, which links it to a new pattern. It is not necessary to make a new searcher object if you want to bind an existing searcher to a new pattern.

In linking the searcher to a specific pattern, a copy of that pattern is made and stored in the searcher. The pattern is then compiled, translating it into an internal form that is more suitable for fast searching. If compiling is successful, the attribute compiled will indicate this. If the pattern cannot be compiled (e.g., a regular expression pattern containing a syntax error), attribute error_code will indicate the kind of error.

If a searcher is marked as compiled, it is ready to search for the pattern in text stored in a string. There are two different searching routines, one for searching forwards and one for searching backwards in the input string. Indices can be used to delimit the part of the input string that is relevant for the search. Being able to specify the relevant part of the string means that you do not need to create a substring of your input for a substring search, which would be very inefficient for large inputs.

The search_backward routine is especially useful in editing tools, where you not only want to look for the next occurrence of a pattern but often also for a previous one.

Both search routines set a Boolean attribute found, which indicates if a search was successful. If it was, two attributes (start_position and end_position) are set to values specifying the position of the pattern in the input string.
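The idea of searching a delimited part of the input without copying it can be sketched as follows. This is a Python illustration only, not the library's Eiffel interface: the function names, the tuple result standing in for found, start_position and end_position, and the 0-based, end-exclusive positions are all assumptions made for the example.

```python
def search_forward(text, pattern, lo, hi):
    """Search text[lo:hi] for pattern without creating the slice.

    Returns (found, start, end) with 0-based, end-exclusive positions.
    """
    pos = text.find(pattern, lo, hi)
    if pos < 0:
        return (False, -1, -1)
    return (True, pos, pos + len(pattern))


def search_backward(text, pattern, lo, hi):
    """Like search_forward, but reports the last occurrence in text[lo:hi]."""
    pos = text.rfind(pattern, lo, hi)
    if pos < 0:
        return (False, -1, -1)
    return (True, pos, pos + len(pattern))


text = "one two one two"
forward = search_forward(text, "two", 0, len(text))
backward = search_backward(text, "two", 0, len(text))
restricted = search_forward(text, "two", 5, len(text))
```

In an editor, the forward variant corresponds to "find next" from the cursor position and the backward variant to "find previous".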

Most of the features of class SEARCHER are deferred. They build a framework for subsequent classes which implement true searchers.

2.2. Matching a Pattern

Class MATCHER inherits all features from class SEARCHER and adds features of its own that are specific to pattern matching.

One might ask what the difference is between searching and matching. While searching is predominantly for those areas where you look for some pattern in a more or less large amount of text (like text processors, editors, etc.), you will also often encounter situations where some input as a whole must conform to a specific pattern; think, for example, of input validation for a database form field.

To support the different needs for pattern matching, features are added that match the input string either as a whole or in part. The latter are further divided into features that match the beginning, the end, or any part of the input string.
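The four kinds of matching can be illustrated with Python's re module. This is an analogy under stated assumptions, not the MATCHER interface: the function names are hypothetical, and Python regular expression syntax is used in place of the library's pattern syntax.

```python
import re


def matches_whole(pattern, text):
    # The entire input must conform to the pattern.
    return re.fullmatch(pattern, text) is not None


def matches_beginning(pattern, text):
    # Some prefix of the input must conform to the pattern.
    return re.match(pattern, text) is not None


def matches_end(pattern, text):
    # Some suffix of the input must conform to the pattern.
    return re.search(rf"(?:{pattern})\Z", text) is not None


def matches_part(pattern, text):
    # Any part of the input may conform to the pattern.
    return re.search(pattern, text) is not None
```

For input validation of a form field, the whole-string variant is the one to use; the other three are closer to searching.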

All these matching routines expect an input string as argument. Note that, unlike the searching routines, they do not accept indices into the string, since here we are interested in matching the string as a whole (or an already specified part of it).

Each routine is defined twice: once as a procedure performing a match and once as a function that simply returns a value indicating whether or not the match against the input string was successful. This is only for convenience; both have the same semantics (in the sense that if one matches, so does the other). The only difference is that the procedures have side effects, while the functions do not. The side effects are exactly the same as with the searching routines: the attribute found is set to indicate whether the match was successful and, if it was, the features start_position and end_position are set as well. Thus, if you are only interested in determining whether some input matches the pattern, use the function; if you also need the position of the pattern in the input, you will have to use the procedure and check the attribute found.

Class MATCHER does not implement any of the features. Instead it builds a framework for subsequent classes that implement various kinds of matchers.

2.3. Getting the Matched Part

You will probably have noticed that neither of the classes SEARCHER or MATCHER has a feature that yields the found or matched substring; only its position is provided.

The reason for this is that we do not want to maintain a reference to the input string from inside the searcher or matcher. This would mean that garbage collection could not be performed on the input string as long as the searcher or matcher object exists (or until the next search or match operation is performed on another input string), at least while there is no mechanism to release that string. The effects could be disastrous for very large input strings (e.g., several Mbytes). To avoid waste of memory and free the user from the annoying task of always having to release the input string, regardless of whether he/she is interested in the matched substring, this functionality has been moved to a different class, the EXTRACTOR.

This class, which inherits from MATCHER, redefines each searching or matching routine such that it maintains a reference to the input string. After a successful match, the feature matched_string delivers a copy of the matched part of the input string. A call to release releases the reference to the input string. This feature should be called if you want to keep your matcher for later use but no longer need the input string. Of course, once you have released your input string, you can no longer use the feature matched_string, which has a corresponding precondition.

Class EXTRACTOR, like SEARCHER and MATCHER, is deferred. It redefines the searching or matching routines on the basis of the still undefined routines inherited from class MATCHER. This class is used in later (concrete) implementations of matcher classes to provide them with the described functionality.

3. Searching for Substrings

Sometimes the need arises to search for a substring in a large amount of data or to look very often for a specific string. This is the right task for a searcher object of type SUBSTRING.

Class SUBSTRING, which inherits directly from SEARCHER, is an implementation of a searcher that is suitable for searching inside very large strings or for a large number of searches. Its implementation is based on an algorithm first published by Daniel M. Sunday and is extremely fast in searching for an exact match with a substring pattern. In general, the longer the pattern, the faster the search. (To be more exact: in the best-case scenario it is n times faster than simple substring searching, where n is the length of the pattern; in the worst case it is no worse than straightforward searching; the average depends on the distribution of the characters.)
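The following is a minimal sketch of Sunday's Quick Search algorithm, which is presumably the kind of algorithm meant here; the source does not show the actual implementation of SUBSTRING, so the function names and details are assumptions. The shift table plays the role of the compiled pattern: after a mismatch, the character just past the current window decides how far the window can jump.

```python
def build_shift_table(pattern):
    """Map each pattern character to the distance from its last
    occurrence to one position past the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the last one wins.
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text, pattern, start=0):
    """Return the 0-based index of the first occurrence of pattern in
    text at or after start, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return start if start <= n else -1
    shift = build_shift_table(pattern)
    pos = start
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:
            break
        # A character absent from the pattern lets the window skip
        # past it entirely, i.e. by m + 1 positions.
        pos += shift.get(text[pos + m], m + 1)
    return -1
```

The jump of up to m + 1 positions per mismatch is what makes long patterns search faster; building the table is the up-front cost that only pays off on larger inputs or repeated searches.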

The class implements all deferred features; no additional features are supplied. Since its compilation takes some time, the total amount of data to be searched through should be at least 0.5 - 1 Kbytes. Note that this does not mean that each input string should be that large. If the search string is used a number of times, the sum of all input lengths counts.

There are no reserved characters. Any characters (even NUL) can be part of the pattern as well as used in the input string.

A searcher object of type SUBSTRING can be safely stored and retrieved if it is reachable from some storable object (inheriting from class STORABLE).

4. Regular Expression Matcher

The regular expression matcher is a matcher that uses regular expressions as patterns. The regular expression patterns are in a style known from many UNIX utilities; the most prominent one is egrep. They are based on a freely redistributable reimplementation of the Version 8 regexp routines originally written by Henry Spencer. The term "based" is used here, since the original C code was heavily modified and extended to serve the needs of a tool that is more powerful and flexible than the original V8 regular expressions. Nevertheless, the original work was a good starting point and saved a lot of effort.

The implementation uses non-deterministic automata, with all of the well-known advantages and disadvantages. The advantages are small automaton size and fast compilation of regular expressions. The disadvantages include longer execution times for non-deterministic automata, especially when performing unoptimized, parallel searches. This implementation is a compromise between fast executing (but slow compiling) deterministic automata and the general form of non-deterministic automata that are rather slow at execution time. In fact we do some optimizing to make the simpler cases (which make up over 95% of all usage) run faster. The price is a somewhat more complicated description (on complex patterns) of which match will be taken. In most cases it is the longest possible (as in general non-deterministic automata), with a few exceptions for complicated patterns, especially when they are ambiguous. For a detailed description see below.

This automaton is useful for everyday matching purposes (e.g., input validation) and interactive search (and replace) routines in text editors. If, however, you are looking for a matcher that can process vast amounts of data in a very short time (like egrep) or can execute very complicated patterns very fast (e.g., lexical analysis for a language parser), you are probably better served with a deterministic automaton as provided by the Lex library.

4.1. Syntax for Regular Expression Patterns

A regular expression is zero or more alternatives that are separated by '|'. It matches anything that matches one of the alternatives.

An alternative consists of zero or more concatenated elements. An alternative matches if all of its elements match.

An element is an atom, possibly followed by '*', '+' or '?'. An atom followed by one of the following signs matches when the given condition is satisfied:

    *
    if the atom matches 0 or more times
    +
    if the atom matches 1 or more times
    ?
    if the atom matches 0 or 1 times

An atom is either a regular expression in parentheses (matching if that regular expression matches) or one of the following:

    c
    any character without a special meaning; matches this character exactly
    \c
    any character with a special meaning (see "Special Characters" below)
    .
    dot; matches any single character
    ^
    caret; matches empty string at the beginning of an input string
    $
    dollar; matches empty string at the end of an input string
    <
    left angle bracket; matches an empty string at the beginning of a word
    >
    right angle bracket; matches an empty string at the end of a word
    [..]
    character class; matches any character inside '[ ]' (see below)
    [^..]
    negated character class; matches any character not inside '[ ]'
    {x}
    predefined character class; matches any character of class x

4.2. Special Characters

The following are characters with a special meaning: ^$()[]{}.+*?|\<>.

If any of these characters is to appear as a literal in a pattern, it has to be escaped by '\' (e.g., '\\' stands for '\' itself). Any escaped character, besides the ones with special meaning above, stands for itself, with the following exceptions:

    \0
    stands for the NUL character
    \t
    stands for a tab
    \n
    stands for a newline
    \r
    stands for a carriage return
    \b
    stands for a backspace
    \f
    stands for a form feed

The special characters '<', '>', '^', and '$' do not match any characters but can be viewed as places between characters. The leading word boundary '<' (trailing word boundary '>') can be viewed as the place right before (after) a word, having a word character on the right (left) and a non-word character on the left (right). If the place is before the first or after the last character, the non-existent character is treated as if it were a non-word character. A word character is any alpha-numeric character or '_'.
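The word-boundary rule can be written down directly. The sketch below is a Python illustration with hypothetical function names; it lists the places (0-based positions between characters) where '<' and '>' would match according to the description above, treating the missing neighbour at either end of the string as a non-word character.

```python
def is_word_char(ch):
    # A word character is any ASCII alphanumeric character or '_'.
    return ch.isascii() and (ch.isalnum() or ch == "_")


def word_starts(text):
    """Places where '<' would match: a word character on the right and
    a non-word character (or nothing) on the left."""
    return [i for i in range(len(text))
            if is_word_char(text[i])
            and (i == 0 or not is_word_char(text[i - 1]))]


def word_ends(text):
    """Places where '>' would match: a word character on the left and
    a non-word character (or nothing) on the right."""
    return [i for i in range(1, len(text) + 1)
            if is_word_char(text[i - 1])
            and (i == len(text) or not is_word_char(text[i]))]
```

For the input "ab, c_d" the leading boundaries are before 'a' and before 'c', and the trailing boundaries are after 'b' and after the final 'd'.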

There are no reserved character values; in fact you can use any characters in your pattern or in your input string. This is even valid for the NUL character.

4.3. Character Classes

A character class is a sequence of characters enclosed in '[ ]'. It normally matches any single character from the sequence. If, however, the sequence begins with a '^', it matches any single character not in the sequence.

Two characters in the sequence separated by '-' are shorthand for all ASCII characters between the two characters, inclusive (e.g., '[0-9]' matches any digit). A range defined in this way is only valid if the character in front of '-' comes before the character after '-' (in the ASCII table). If your matching is case-insensitive, remember that the range still pertains to a section of the ASCII table; case sensitivity does not influence the pattern itself. Thus, the character class '[K-y]' will match the character 'Z' even when matching is case-insensitive; the pattern is not translated to '[k-y]' or '[K-Y]'.
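The '[K-y]' example rests purely on ASCII ordering, which can be checked with a few lines of Python. The helper name is hypothetical, and the sketch deliberately models only the range itself, not the library's case-insensitive comparison.

```python
def in_ascii_range(ch, low, high):
    """True if ch lies between low and high (inclusive) in the ASCII table."""
    return ord(low) <= ord(ch) <= ord(high)


# 'K' is 75 and 'y' is 121, so the range spans the upper-case letters
# K..Z, the punctuation between 'Z' and 'a', and the lower-case a..y.
covers_upper_z = in_ascii_range("Z", "K", "y")
covers_lower_z = in_ascii_range("z", "K", "y")
covers_underscore = in_ascii_range("_", "K", "y")
covers_upper_a = in_ascii_range("A", "K", "y")
```

Note that such a range also covers the punctuation characters between 'Z' and 'a', which is rarely what is intended; separate ranges such as '[K-Zk-y]' state the intent more precisely.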

If you want to include any special characters in a character class, escape them with '\', as described above. (This is also useful for '-' itself, since it is not a special character outside of a character class.) If you prefer old egrep style, you can also include the literal ']' in a character class by making it the first character (following a possible '^'). To include a literal '-', you can also make it the first or last character.

4.4. Predefined Character Classes

Besides the normally defined character classes, there are also some predefined classes. A predefined character class is denoted by the name of the class enclosed in '{ }'. The following classes are already defined:

    {d}
    any digit
    {c}
    any capital letter
    {s}
    any small letter
    {l}
    any letter (capital or small)
    {a}
    any alpha numeric (letter or digit)
    {w}
    any word character (alpha numeric or _)
    {b}
    any blank character (e.g., tab, blank, newline etc.)
    {p}
    any printable character (characters between '\040' and '\177')

Predefined character classes can also be used in normal classes, but not in a range specification (e.g., '[^{p}\0]' is the class of characters which are not printable and not NUL).

4.5. Single-line Versus Multi-line Mode

A regular expression matcher can be in two modes. By default, it is in single-line mode and has the already defined behavior. It is sometimes useful, however, to view an input string as a multi-line buffer, where you want to match against each line. You can do so by putting the matcher into multi-line mode.

In multi-line mode, some of the special characters have slightly different meanings. While in single-line mode embedded newlines will not be matched by '^' or '$', in multi-line mode a '^' will match after any newline within the string, and '$' will match before any newline.

To facilitate multi-line substitutions, the '.' character never matches a newline, and a negated character class does not match a newline either. However, this doesn't mean that you cannot match a newline in your pattern in multi-line mode. If you want to match a newline, you have to do so explicitly; for example, use '(.|\n)' to match any character including a newline.
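Python's re module has a comparable distinction, which can serve as a rough analogy. The sketch uses Python syntax and the re.MULTILINE flag, not the library's own mode-switching features, and the two differ in detail: in Python a negated character class still matches a newline, for instance.

```python
import re

text = "one\ntwo\nthree"

# Without the flag, '^' and '$' do not match at the embedded newlines,
# so no single line of the buffer is matched on its own.
single = re.findall(r"^\w+$", text)

# With re.MULTILINE, '^' matches after each newline and '$' before it.
multi = re.findall(r"^\w+$", text, re.MULTILINE)

# '.' does not match a newline; the newline has to be named explicitly.
dot_only = re.fullmatch(r".*", "a\nb")
dot_or_newline = re.fullmatch(r"(.|\n)*", "a\nb")
```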

4.6. What Part of the Input Will Match

If a pattern is anchored to the beginning and end of an input string, there is clearly not much room for different matches. The only possibilities are through alternative branches of the pattern that all lead to a valid match.

If the pattern is not anchored, or anchored only to one side of the string, there are normally many more alternatives to consider. In general, if a regular expression can match two different parts of the input string, the selected match will be the one which begins earliest. If both begin at the same place but match different lengths, or match the same length in different ways, it is more complicated.

The alternatives in a pattern are processed from left to right. When there are several possible matches for a construct in the search pattern containing '*', '+' or '?', the longest match is considered first. In nested constructs, the outermost construct is processed first, and in concatenated constructs the leftmost are considered first for a forward search and the rightmost for a backward search. Once a match is found for a given construct contained in the search pattern, it is maintained until a match is found for the entire search pattern or all the following constructs fail, at which time it will fail, too. If there are any alternatives for this construct, the next one is selected, as described above, and an attempt is made to find a match for the following constructs.

For example, the pattern '(ab|a)b*c' could match the string 'abbc' in two different ways. The first choice in a forward match is between 'ab' and 'a', which 'ab' wins, since it is earlier and leads to a successful overall match. Since one 'b' is already consumed, the 'b*' can only match the remaining single 'b'; it is subject to the earlier choice. In the end, 'c' is consumed and the overall match succeeds.

If the string is matched backwards, another way is chosen. A backward match would first consume the 'c'; then the 'b*' would take as many b's as possible (of which there are two here); thus, the first of the alternatives 'ab' and 'a' can no longer match. Therefore, the second one is chosen, which leads to the second possible way of matching 'abbc'. Note that if we had chosen the pattern '(a|ab)b*c' instead, forward and backward search would have led to the same match in both cases.
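The forward case can be reproduced with Python's re module, which also tries alternatives from left to right and prefers the longest repetition first. This is an analogy only: Python has no backward search, so the backward case is not shown, and the extra parentheses around 'b*' are added here just to make its share of the match visible.

```python
import re

# Left alternative 'ab' is tried first and leads to an overall match,
# leaving only one 'b' for the repetition.
first = re.fullmatch(r"(ab|a)(b*)c", "abbc").groups()

# With the alternatives swapped, 'a' is tried first and the repetition
# takes both b's.
swapped = re.fullmatch(r"(a|ab)(b*)c", "abbc").groups()
```

The overall match is 'abbc' in both cases; only the division of the input among the constructs differs, which matters as soon as subexpressions are extracted.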

In the particular case where '|' is not present and there is only one '*', '+' or '?', the longest possible match will be chosen. Thus, 'ab*' matched against 'xabbbbby' will always yield 'abbbbb' regardless of whether the match is forward or backward. Note, however, that if 'ab*' is tried against 'xabyabbbz', it will match 'ab' right after the 'x' in a forward search, while 'abbb' right in front of 'z' is matched in a backward search. This is due to the begins-earliest rule. In effect, the decision on where to start the match is the first choice to be made. Hence, subsequent choices must respect this even if it leads to a less-preferred alternative.
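These cases can be approximated in Python. The forward search maps directly onto re.search; the backward search has no Python counterpart, so the sketch stands in for it by taking the last of all non-overlapping forward matches, which happens to give the same result for this input but is an assumption of the example, not a description of search_backward.

```python
import re

# Only one repetition and no alternatives: the longest match is taken.
longest = re.search(r"ab*", "xabbbbby").group()

# Forward search: the match that begins earliest wins, even though a
# longer match exists further to the right.
forward = re.search(r"ab*", "xabyabbbz")

# Stand-in for a backward search: the last forward match in the input.
backward = list(re.finditer(r"ab*", "xabyabbbz"))[-1]
```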

Because evaluation of a branch is always from left to right, you should always put the longer alternatives to the left of the shorter ones if you want the longest possible match.

4.7. The Class Interface of the Regular Expression Matcher

A regular expression matcher is an instance of class REGEXP, which inherits from deferred class MATCHER and implements all the deferred features.

After a call to one of the make routines, the argument pattern is translated into the corresponding NFA. If the compilation succeeds, compiled is set to true. If it fails, probably because of a syntax error, error_code contains one of the error codes from class REGEXP_ERRORS, which indicates the kind of error.

If the pattern is compiled, features search_forward and search_backward can be used to search some (or part of an) input string for the next occurrence of the regular expression pattern. Note that if the pattern is anchored to the beginning or end of the input string (using '^' or '$'), the search succeeds in single-line mode only if the beginning or end of the string is contained in the relevant part denoted by the from and to indices.

The matching routines can be used to determine if the input string, as a whole or in part, matches the regular expression. If match or matches is used, the string as a whole must match the regular expression pattern, regardless of whether the pattern was anchored to the beginning and end of the string. The same is true, accordingly, for the other matching routines. Note that the matching routines are the only way to anchor a pattern in multi-line mode to the beginning and end of the input string, since '^' and '$' also match at newlines inside the input string.

Besides the features inherited from the ancestor classes, REGEXP introduces some special features. First, there are features to switch between the line modes and to check which mode the matcher is in. Second, there are special features that deal with subexpressions of the regular expression pattern.

As described by the syntax rules above, subexpressions can be grouped by using parentheses '( )'. Besides its syntactic relevance, grouping has some semantic side effects. At compile time, every subexpression enclosed in '( )' is numbered from left to right, starting with one. The feature number_of_subexpr returns what its name suggests: the number of subexpressions encountered in a compiled regular expression pattern. If the pattern is then matched against an input string (by use of match or search) and the match succeeds, all submatches of the subexpressions are recorded and can be requested from the matcher object. The features start_of_subexpr and end_of_subexpr are provided for this. They expect the number of the subexpression as parameter and return the corresponding positions in the input string.

This feature is very convenient if you not only want to match an input against a pattern but also split the input into parts. For example, if you want to check if an input is a valid date, match it against '({d}{d}?)/({d}{d}?)/({d}{d})'; if the input really is a date, this will split it into month, day, and year parts, which can be easily retrieved and checked for consistent values.
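The date example translates almost literally into Python's re module, with '\d' standing in for the predefined class '{d}'. The sketch is an analogy: m.re.groups, m.start(i) and m.end(i) play roughly the roles of number_of_subexpr, start_of_subexpr and end_of_subexpr, but positions are 0-based and end-exclusive here, and the function name is hypothetical.

```python
import re

DATE = re.compile(r"(\d\d?)/(\d\d?)/(\d\d)")


def split_date(text):
    """Return (month, day, year) as integers, or None if text is not
    of the form m/d/yy. Range checks are left to the caller."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    # Subexpressions are numbered from 1, left to right.
    return tuple(int(text[m.start(i):m.end(i)])
                 for i in range(1, m.re.groups + 1))
```

A successful split still has to be checked for consistent values; the pattern alone accepts inputs such as "99/99/99".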

You may ask why there is no feature that directly returns the substring of the input that matches the i-th subexpression. Here the same comments are valid as mentioned earlier in the discussion of class EXTRACTOR. The actual splitting is moved to a separate class, called REGEXP_EXTRACTOR, which has exactly the desired feature.

Class REGEXP_EXTRACTOR inherits all features from REGEXP and EXTRACTOR and adds a feature matched_substring, which delivers the substring of the input string that matched the i-th subexpression of the regular expression. Note that the result of matched_substring may be void; in that case the i-th subexpression was not part of the match, which can happen because subexpressions are numbered at compile time. Note also that the result may be the empty string if the subexpression is defined to match the empty string. If you use matchers of type REGEXP_EXTRACTOR, do not forget to release the input string when it is no longer needed.
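The difference between a void result and an empty one has a direct counterpart in Python's re module, where a group that took no part in the match yields None while a group that matched nothing yields the empty string. The sketch is an analogy, not the matched_substring feature itself.

```python
import re

# Both subexpressions are numbered at compile time, but only the second
# alternative takes part in this match.
alt = re.fullmatch(r"(a)|(b)", "b")
unused = alt.group(1)
used = alt.group(2)

# Here the subexpression does take part, but matches the empty string.
empty = re.fullmatch(r"(x*)y", "y").group(1)
```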

Both REGEXP and REGEXP_EXTRACTOR objects can safely be made persistent in the usual way (using STORABLE).

5. Wildcard Expressions

While regular expressions are powerful but complex, wildcards offer much simpler use, but the approach to pattern matching is not as powerful. This implementation of a wildcard pattern matcher is based on the regular expression matcher described above; the same comments on compile and execution speed are valid here, too. The interface is easy to use, similar to the file name generation mechanism used in general UNIX shells, with the addition of alternatives, which makes it more flexible and powerful. Because of its simplicity, the tool is also suited to end users.

5.1. Syntax of Wildcard Patterns

A wildcard pattern is a string of characters, some of which have special meanings (like in regular expressions). A wildcard pattern matches if the concatenation of its elements matches.

Each element consists of one of the following and matches when the associated condition is satisfied:

    *
    any sequence of characters (zero or more)
    ?
    any single character
    c
    the character itself, if it's not a special character
    [..]
    any of the enclosed characters, where characters separated by '-' denote a range
    [^..]
    any character not in the list of enclosed characters (following '^')
    {..}
    any character sequence, matched by any of the comma-separated list of wildcard patterns inside the braces (see below)

The following are special characters: *?[]^{},\. To match any special character itself, the character must be escaped with '\'. An escaped character that is not one of the special characters above stands for itself, with the following exceptions:

    \0
    stands for the NUL character
    \t
    stands for a tab
    \n
    stands for a newline
    \r
    stands for a carriage return
    \b
    stands for a backspace
    \f
    stands for a form feed

Character classes ([..] and [^..]) are defined the same way as in regular expressions; for a complete description see the explanations there. Remember, however, that there are no predefined character classes.

Brace expressions are useful to express alternatives (which are more powerful than simple character classes) in a wildcard pattern. A brace expression consists of a comma-separated list of wildcard patterns enclosed in '{ }'. A brace expression matches whenever one of its alternatives matches. For example, the pattern "hello {nice ,funny ,}world" matches the strings "hello nice world", "hello funny world", and "hello world" (because of the solitary comma at the end of the brace expression). Since a brace expression is itself (part of) a wildcard pattern, brace expressions can be arbitrarily nested.
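One way to picture a wildcard matcher built on a regular expression matcher is as a translation of the pattern. The sketch below translates a subset of the wildcard syntax into a Python regular expression; it is not the library's implementation. Character classes are left out, a comma outside a brace expression is treated as a literal, and all names are assumptions made for the example.

```python
import re

# Escape sequences that do not simply stand for the escaped character.
ESCAPES = {"0": "\0", "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f"}


def wildcard_to_regex(pattern):
    """Translate '*', '?', '{a,b}' and '\\c' into Python regex syntax."""
    out = []
    depth = 0  # nesting depth of brace expressions
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            i += 1
            out.append(re.escape(ESCAPES.get(pattern[i], pattern[i])))
        elif ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "{":
            depth += 1
            out.append("(?:")
        elif ch == "," and depth > 0:
            out.append("|")
        elif ch == "}" and depth > 0:
            depth -= 1
            out.append(")")
        else:
            out.append(re.escape(ch))
        i += 1
    if depth != 0:
        raise ValueError("unbalanced brace expression")
    return "".join(out)


def wildcard_matches(pattern, text):
    # re.DOTALL lets '*' and '?' match newlines, as in single-line mode.
    return re.fullmatch(wildcard_to_regex(pattern), text, re.DOTALL) is not None
```

With this translation, the empty alternative after the last comma becomes an empty branch of the group, which is why "hello world" is accepted by the example pattern.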

Like in regular expressions, a matcher can be in single-line or in multi-line mode. In single-line mode, which is the default, a pattern matches exactly as described above. In multi-line mode, the input string will be treated as a multi-line text buffer. Therefore, the following holds:

    ?
    will not match a newline,
    *
    matches only characters which are not newlines, and
    [^..]
    never matches a newline, regardless of what characters are inside the brackets.

5.2. The Class Interface of the Wildcard Expression Matcher

Wildcard expression matchers are described by class WILDCARD, which inherits from class MATCHER and provides a straightforward implementation of the deferred features by using a regular expression matcher.

Unlike REGEXP, WILDCARD does not add any features beyond the ones it inherits. To get the matched substring out of the input string, there is also a class WILDCARD_EXTRACTOR, which inherits from both WILDCARD and EXTRACTOR and provides the desired result.