The short
interfaces of the classes discussed in this chapter appear in
Chapter 2.
The most abstract description of an object capable of searching for a specific
pattern is encapsulated in class
SEARCHER.
A
searcher is an object containing some pattern. There are no restrictions placed
on the pattern, except that it can be stored in a string. A searcher can be
made case-insensitive in searching for the pattern, but by default, it will be
case-sensitive. There are also features that toggle case-sensitivity. A
searcher is capable of searching in different input strings for the pattern it
contains. It is bound to this pattern until a subsequent call to one of the
make routines, which links it to a new pattern. It is not necessary to make a
new searcher object if you want to bind an existing searcher to a new pattern.
In linking the searcher to a specific pattern, a copy of that pattern is made
and stored in the searcher. The pattern is then compiled, translating it into
an internal form that is more suitable for fast searching. If compiling is
successful, the attribute compiled
indicates this. If the pattern cannot be compiled (e.g., a regular
expression pattern containing a syntax error), attribute error_code indicates the kind of error.
If a searcher is marked as compiled, it is ready to search for the pattern in
text stored in a string. There are two different searching routines, one for
searching forwards and one for searching backwards in the input string. Indices
can be used to delimit the part of the input string that is relevant for searching.
Being able to specify the relevant part of the string means that you do not
need to create a substring of your input for a substring search, which would be
vastly inefficient for large inputs.
The search_backward routine is
especially useful in editing tools, where you not only want to look for the
next occurrence of a pattern but often also for a previous one.
Both
search routines set a Boolean attribute found, which indicates if a search was successful.
If it was, two attributes (start_position and end_position) are set to values specifying the
position of the pattern in the input string.
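The contract described so far can be modelled in a few lines. The following Python sketch is only an illustration, not the Eiffel interface of the library: the class name ToySearcher, the argument names lo and hi, and the 0-based, half-open index convention are assumptions. Only make, compiled, found, start_position, end_position, search_forward and search_backward are names taken from the text.

```python
class ToySearcher:
    """Toy model of the searcher contract (plain substring patterns only)."""

    def __init__(self, pattern):
        self.make(pattern)

    def make(self, pattern):
        # Rebinds this object to a new pattern. Python strings are
        # immutable, so no explicit copy of the pattern is needed.
        self.pattern = pattern
        self.compiled = True      # a plain substring always "compiles"
        self.found = False
        self.start_position = None
        self.end_position = None

    def search_forward(self, text, lo, hi):
        # str.find honours the bounds itself, so no substring is created.
        self._record(text.find(self.pattern, lo, hi))

    def search_backward(self, text, lo, hi):
        self._record(text.rfind(self.pattern, lo, hi))

    def _record(self, index):
        self.found = index >= 0
        if self.found:
            self.start_position = index
            self.end_position = index + len(self.pattern) - 1


searcher = ToySearcher("ab")
searcher.search_forward("xxabxxab", 0, 8)
print(searcher.found, searcher.start_position, searcher.end_position)
searcher.search_backward("xxabxxab", 0, 8)
print(searcher.found, searcher.start_position, searcher.end_position)
```

The same object can be rebound with make and reused, which mirrors the remark that no new searcher is needed for a new pattern.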
Most of the features of class
SEARCHER
are
deferred.
They build a framework for subsequent classes which implement true
searchers.
Class
MATCHER
inherits all features from class
SEARCHER
and adds features of its own that are specific to pattern matching.
One might ask what the difference is between searching and matching. While
searching is predominantly for those areas where you look for some pattern in a
more or less large amount of text (like text processors, editors, etc.), you
will also often encounter situations where some input as a whole must conform
to a specific pattern; think, for example, of input validation for a database
form field.
To support the different needs for pattern matching, features are added, which
match the input string as a whole or part. The latter are further divided into
features that match the beginning, the end, or any part of the input string.
All
these matching routines expect an input string as argument. Note that, unlike
the searching routines, they do not accept indices into the string, since here
we are interested in matching the string as a whole (or an already specified
part of it).
Each
routine is defined twice: once as a procedure performing a match and once as a
function that simply returns a value indicating whether or not the search in
the input string was successful. This is only for convenience; both have the
same semantics (in the sense that if one matches, so does the other). The only
difference is that the procedures have side effects, while the functions do
not. The side effects are exactly the same as with the searching routines: If
the match is successful, the attribute found is set accordingly, as are the features start_position and end_position. Thus, if you are only interested in
determining whether some input matches the pattern, this can easily be
determined with the function; if you also need the position of the pattern in
the input, you will have to use the procedure and check the attribute found, which indicates whether the match
was successful.
Class
MATCHER
does not implement any of the features. Instead it builds a framework for
subsequent classes that implement various kinds of matchers.
You will probably have noticed that neither
SEARCHER
nor
MATCHER
has a feature that yields the found or matched substring; only its position is
provided.
The reason for this is that we do not want to maintain a reference to the input
string from inside the searcher or matcher. This would mean that
garbage
collection
could not be performed on the input string as long as the searcher or matcher
object exists (or until the next search or match operation is performed on
another input string), at least while there is no mechanism to release that
string. The effects could be disastrous for very large input strings (e.g.,
several Mbytes). To avoid waste of memory and free the user from the annoying
task of always having to release the input string, regardless of whether he/she
is interested in the matched substring, this functionality has been moved to a
different class, the
EXTRACTOR.
This
class, which inherits from
MATCHER,
redefines each searching or matching routine such that it maintains a reference
to the input string. After a successful match, the feature matched_string delivers a copy of the matched part
of the input string. A call to release releases the reference to the input string.
This feature should be called if you want to keep your matcher for later use
but do not need it for the input string any more. Of course, once you have
released your input string, you can no longer use the feature matched_string, which has a corresponding
precondition.
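The trade-off behind this design can be pictured with a small model. The Python sketch below is an assumption-laden illustration (the class name ToyExtractor and the use of an exception in place of an Eiffel precondition are ours, and only plain substring patterns are handled); it shows how holding a reference to the input enables matched_string and how release gives the input back.

```python
class ToyExtractor:
    """Toy model of the EXTRACTOR idea for plain substring patterns."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.found = False
        self.start_position = None
        self.end_position = None
        self._input = None            # reference to the last input string

    def search_forward(self, text):
        self._input = text            # keeps the input alive until release
        index = text.find(self.pattern)
        self.found = index >= 0
        if self.found:
            self.start_position = index
            self.end_position = index + len(self.pattern) - 1

    @property
    def matched_string(self):
        # Stands in for the precondition mentioned in the text.
        if self._input is None or not self.found:
            raise RuntimeError("no input string held or no successful match")
        return self._input[self.start_position:self.end_position + 1]

    def release(self):
        self._input = None            # the input can now be garbage-collected


extractor = ToyExtractor("needle")
extractor.search_forward("hay needle hay")
print(extractor.matched_string)
extractor.release()
```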
Class
EXTRACTOR,
like
SEARCHER
and
MATCHER,
is
deferred.
It redefines the searching or matching routines on the basis of the still
undefined routines inherited from class
MATCHER.
This class is used in later (concrete) implementations of matcher classes to
provide them with the described functionality.
Sometimes
the need arises to search for a substring in a large amount of data or to look
very often for a specific string. This is the right task for a searcher object
of type
SUBSTRING.
Class
SUBSTRING,
which inherits directly from
SEARCHER,
is an implementation of a searcher that is suitable for searching inside very
large strings or for a large number of searches. Its implementation is based on
an algorithm first published by Daniel M. Sunday and is extremely fast in
searching for an exact match with a substring pattern. In general, the longer
the pattern, the faster the search. (To be more exact: in the best-case scenario
it is n times faster than simple substring searching, where n is
the length of the pattern; in the worst case it is no worse than
straightforward searching; the average depends on the distribution of the
characters.)
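The text does not say which of Sunday's algorithms the class uses. As a sketch only, assuming it is the variant commonly known as Quick Search, the idea looks as follows in Python; the function names are hypothetical and this is not the library's code. The "compilation" step corresponds to building the shift table, and the best case arises when the character just past the current window does not occur in the pattern, so the window jumps by the pattern length plus one.

```python
def build_shift_table(pattern):
    # For each character, the distance from its last occurrence in the
    # pattern to the position just past the end of the pattern.
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text, pattern, start=0):
    """Return the index of the first occurrence at or after start, or -1."""
    m, n = len(pattern), len(text)
    shift = build_shift_table(pattern)
    pos = start
    while pos + m <= n:
        if text.startswith(pattern, pos):
            return pos
        if pos + m >= n:
            break                     # no character follows the window
        # Characters absent from the pattern allow the maximal jump, m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return -1


print(quick_search("xabyabbbz", "abb"))   # 4
print(quick_search("a\0b", "\0b"))        # 1 (NUL is an ordinary character)
```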
The class implements all
deferred
features; no additional features are supplied. Since its compilation takes some
time, the total amount of data to be searched through should be at least 0.5 -
1 Kbytes. Note that this does not mean that each input string should be that
large. If the search string is used a number of times, the sum of all input
lengths counts.
There are no reserved characters. Any characters (even NUL) can be part of the
pattern as well as used in the input string.
A searcher object of type
SUBSTRING
can be safely stored and retrieved if it is reachable from some storable object
(inheriting from class
STORABLE).
The
regular expression matcher is a matcher that uses regular expressions as
patterns. The regular expression patterns are in a style known from many
UNIX utilities; the most prominent one is egrep. They are based
on a freely redistributable reimplementation of the Version 8 regexp routines originally written by
Henry Spencer. The term "based" is used here, since the original C code was
heavily modified and extended to serve the needs of a tool that is more
powerful and flexible than the original V8 regular expressions.
Nevertheless, the original work was a good starting point and saved a lot of
effort.
The implementation uses non-deterministic automata, with all of the well-known
advantages and disadvantages. The advantages are small automaton size and fast
compilation of regular expressions. The disadvantages include longer execution
times for non-deterministic automata, especially when performing unoptimized,
parallel searches. This implementation is a compromise between fast executing
(but slow compiling) deterministic automata and the general form of
non-deterministic automata that are rather slow at execution time. In fact we
do some optimizing to make the simpler cases (which make up over 95% of all
usage) run faster. The price is a somewhat more complicated description (on
complex patterns) of which match will be taken. In most cases it is the longest
possible (as in general non-deterministic automata), with a few
exceptions
for complicated patterns, especially when they are ambiguous. For a detailed
description see below.
This
automaton is useful for every-day matching purposes (e.g., input validation)
and interactive search (and replace) routines in text editors. If, however, you
are looking for a matcher that can process vast amounts of data in a very short
time (like egrep) or can execute very complicated patterns very fast
(e.g., lexical analysis for a language parser), you are probably better served
with a deterministic automaton as provided by the Lex library.
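The rest of the discussion on the regular expression matcher covers the
pattern syntax, special characters, character classes, line modes, matching
rules, and the classes REGEXP and REGEXP_EXTRACTOR.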
A
regular expression is zero or more alternatives that are separated by
'|'. It matches anything that matches one of the alternatives.
An alternative consists of zero or more concatenated elements. An alternative
matches if all of its elements match.
An element is an atom, possibly followed by '*', '+' or
'?'. An atom followed by one of the following signs matches when the
given condition is satisfied:
* | if the atom matches 0 or more times
+ | if the atom matches 1 or more times
? | if the atom matches exactly 0 or 1 time
An atom is either a regular expression in parentheses (matching if that regular
expression matches) or one of the following:
c | any character without a special meaning; matches this character exactly
\c | any character with a special meaning (see "Special Characters" below)
. | dot; matches any single character
^ | caret; matches an empty string at the beginning of an input string
$ | dollar; matches an empty string at the end of an input string
< | left angle bracket; matches an empty string at the beginning of a word
> | right angle bracket; matches an empty string at the end of a word
[..] | character class; matches any character inside '[ ]' (see below)
[^..] | negated character class; matches any character not inside '[ ]'
{x} | predefined character class; matches any character of class x
The following are characters with a special meaning:
^$()[]{}.+*?|\<>.
If
any of these characters is to appear as a literal in a pattern, it has to be
escaped by '\' (e.g., '\\' stands for '\' itself). Any
escaped character, besides the ones with special meaning above, stands for
itself, with the following
exceptions:
\0 | stands for the NUL character
\t | stands for a tab
\n | stands for a newline
\r | stands for a carriage return
\b | stands for a backspace
\f | stands for a form feed
The special characters '<', '>', '^', and '$'
do not match any characters but can be viewed as places between characters. The
leading word boundary '<' (trailing word boundary '>') can
be viewed as the place right before (after) a word, having a word character on
the right (left) and a non-word character on the left (right). If the place is
before the first or after the last character, the non-existent character is
treated as if it were a non-word character. A word character is any
alpha-numeric character or '_'.
There are no reserved character values; in fact you can use any characters in
your pattern or in your input string. This is even valid for the NUL character.
A
character class is a sequence of characters enclosed in '[ ]'. It
normally matches any single character from the sequence. If, however, the
sequence begins with a '^', it matches any single character not in the
sequence.
Two characters in the sequence separated by '-' are shorthand
for all ASCII characters from the first to the second, inclusive (e.g.,
'[0-9]' matches any digit). A range defined in this way is only valid if
the character in front of '-' comes before the character after
'-' (in the ASCII table). If your matching is case-insensitive,
remember that the range still pertains to a section of the ASCII table; case
sensitivity does not influence the pattern itself. Thus, the character class
'[K-y]' will match the character 'Z' even when matching is
case-insensitive; the pattern is not translated to '[k-y]' or
'[K-Y]'.
If you want to include any special characters in a character class, escape them
with '\', as described above. (This is also useful for '-'
itself, since it is not a special character outside of a character class.) If
you prefer old egrep style, you can also include the literal ']'
in a character class by making it the first character (following a possible
'^'). To include a literal '-', you can also make it the first or
last character.
Besides
the normally defined character classes, there are also some predefined classes.
A predefined character class is denoted by the name of the class enclosed in
'{ }'. The following classes are already defined:
{d} | any digit
{c} | any capital letter
{s} | any small letter
{l} | any letter (capital or small)
{a} | any alphanumeric character (letter or digit)
{w} | any word character (alphanumeric or _)
{b} | any blank character (e.g., tab, blank, newline etc.)
{p} | any printable character (characters between '\040' and '\177')
Predefined character classes can also be used in normal classes, but not in a
range specification (e.g., '[^{p}\0]' is the class of characters which
are not printable and not NUL).
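To experiment with this notation outside the library, the patterns can be translated into another regular expression dialect. The following Python sketch does this for Python's re module. It is an approximation under stated assumptions, not the library's compiler: the function name to_python_regex is hypothetical, the exact members of {b} are guessed from the examples given, and Python's own rules decide which match is selected.

```python
import re

# Character-set bodies for the predefined classes. The exact members of
# {b} are an assumption; the text only gives examples.
PREDEFINED = {
    "d": "0-9", "c": "A-Z", "s": "a-z", "l": "A-Za-z",
    "a": "A-Za-z0-9", "w": "A-Za-z0-9_",
    "b": r" \t\n\r\f\v", "p": r"\x20-\x7f",
}
ESCAPES = {"0": "\0", "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f"}


def to_python_regex(pattern):
    """Translate the notation described above into Python `re` syntax."""
    out, in_class, i, n = [], False, 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character: a control character or a plain literal.
            literal = ESCAPES.get(pattern[i + 1], pattern[i + 1])
            out.append(re.escape(literal))
            i += 2
        elif ch == "{":
            if (i + 2 >= n or pattern[i + 2] != "}"
                    or pattern[i + 1] not in PREDEFINED):
                raise ValueError("unknown predefined class at index %d" % i)
            body = PREDEFINED[pattern[i + 1]]
            out.append(body if in_class else "[" + body + "]")
            i += 3
        elif in_class:
            in_class = ch != "]"                 # ']' closes the class
            out.append(ch)
            i += 1
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            if i < n and pattern[i] == "^":      # negation marker
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":      # egrep-style literal ']'
                out.append(r"\]")
                i += 1
        elif ch in "<>":
            # Word boundaries: a word character must follow '<' or precede '>'.
            out.append(r"\b(?=\w)" if ch == "<" else r"\b(?<=\w)")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


date = re.compile(to_python_regex("({d}{d}?)/({d}{d}?)/({d}{d})"), re.ASCII)
print(date.fullmatch("12/5/97").groups())
```

The re.ASCII flag keeps Python's notion of a word character close to the one given above (alphanumeric or '_').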
A
regular expression matcher can be in two modes. By default, it is in
single-line mode and has the already defined behavior. It is sometimes useful,
however, to view an input string as a multi-line buffer, where you want to
match against each line. You can do so by putting the matcher into multi-line
mode.
In multi-line mode, some of the special characters have slightly different
meanings. While in single-line mode embedded newlines will not be matched by
'^' or '$', in multi-line mode a '^' will match after any
newline within the string, and '$' will match before any newline.
To facilitate multi-line substitutions, the '.' character never matches
a newline, and a negated character class does not match a newline either.
However, this doesn't mean that you cannot match any newline in your pattern in
multi-line mode. If you want to match a newline, you have to do so explicitly;
for example, use '(.|\n)' to match any character, including a newline.
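Python's re module has a comparable switch, which makes the effect of the two modes easy to try out. The sketch below uses Python's own flags as a stand-in and is not the library's interface; note one difference: in Python a negated character class still matches a newline, whereas the multi-line mode described above excludes it.

```python
import re

text = "one\ntwo\nthree"

# Without re.MULTILINE, '^' and '$' do not match at the embedded newlines.
print(re.findall(r"^\w+$", text))                 # []
# With re.MULTILINE, '^' matches after each newline, '$' before each one.
print(re.findall(r"^\w+$", text, re.MULTILINE))   # ['one', 'two', 'three']

# '.' does not match a newline here, so it has to be matched explicitly.
print(re.fullmatch(r"one.two", "one\ntwo"))                   # None
print(re.fullmatch(r"one(.|\n)two", "one\ntwo") is not None)  # True
```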
If a pattern is anchored to the beginning and end of an input string, there is
clearly not much room for different matches. The only possibilities are through
alternative branches of the pattern that all lead to a valid match.
If
the pattern is not anchored, or anchored only to one side of the string, there
are normally many more alternatives to consider. In general, if a regular
expression can match two different parts of the input string, the selected
match will be the one which begins earliest. If both begin at the same place
but match different lengths, or match the same length in different ways, it is
more complicated.
The alternatives in a pattern are processed from left to right. When there are
several possible matches for a construct in the search pattern containing
'*', '+' or '?', the longest match is considered first. In
nested constructs, the outermost construct is processed first, and in
concatenated constructs the leftmost are considered first for a forward search
and the rightmost for a backward search. Once a match is found for a given
construct contained in the search pattern, it is maintained until a match is
found for the entire search pattern or all the following constructs fail, at
which time it will fail, too. If there are any alternatives for this construct,
the next one is selected, as described above, and an attempt is made to find a
match for the following constructs.
For example, the pattern '(ab|a)b*c' could match the string
'abbc' in two different ways. The first choice in a forward match is
between 'ab' and 'a', which 'ab' wins, since it is earlier and
leads to a successful overall match. Since one 'b' is already consumed,
the 'b*' can only match the remaining single 'b'; it is subject
to the earlier choice. In the end, 'c' is consumed and the overall match
succeeds.
If the string is matched backwards, another way is chosen. A backward match
would first consume the 'c'; then the 'b*' would take as many
b's as possible (of which there are two here); thus, the first of the
alternatives 'ab' and 'a' can no longer match. Therefore, the
second one is chosen, which leads to the second way of matching
'abbc'. Note that if we had chosen the pattern '(a|ab)b*c'
instead, forward and backward search would have led to the same match.
In the particular case where '|' is not present and there is only one
'*', '+' or '?', the longest possible match will be
chosen. Thus, 'ab*' matched against 'xabbbbby' will always yield
'abbbbb' regardless of whether the match is forward or backward. Note,
however, that if 'ab*' is tried against 'xabyabbbz', it will
match 'ab' right after the 'x' in a forward search, while
'abbb' right in front of 'z' is matched in a backward search.
This is due to the begins-earliest rule. In effect, the decision on where to
start the match is the first choice to be made. Hence, subsequent choices must
respect this even if it leads to a less-preferred alternative.
Because evaluation of a branch is always from left to right, you should always
put the longer alternatives to the left of the shorter ones if you want the
longest possible match.
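The examples in this discussion can be replayed with Python's re module, whose backtracking matcher follows comparable rules for forward matching (alternatives left to right, repetitions longest first). This is an analogy, not the library itself. Python has no backward search, so the backward cases are pictured here, as an assumption, by running a hand-mirrored pattern over the mirrored input. The extra group around 'b*' is added only to show what it consumed.

```python
import re

# Forward matching of the two pattern variants against 'abbc'.
print(re.match(r"(ab|a)(b*)c", "abbc").groups())     # ('ab', 'b')
print(re.match(r"(a|ab)(b*)c", "abbc").groups())     # ('a', 'bb')

# Begins-earliest rule: the leftmost starting point wins over a longer match.
print(re.search(r"ab*", "xabyabbbz").group())        # ab

# Backward matching, pictured as the mirrored pattern on the mirrored input.
print(re.match(r"c(b*)(ba|a)", "abbc"[::-1]).groups())       # ('bb', 'a')
print(re.search(r"b*a", "xabyabbbz"[::-1]).group()[::-1])    # abbb
```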
A regular expression matcher is an instance of class
REGEXP,
which inherits from
deferred
class
MATCHER
and implements all the deferred features.
After a call to one of the make
routines, the argument pattern is translated into the corresponding NFA. If the
compilation succeeds, compiled is
set to true. If it fails, probably because of a syntax error, error_code contains one of the error codes from class
REGEXP_ERRORS,
which indicates the kind of error.
If the pattern is compiled, features search_forward and search_backward can be used to search an input string (or part
of one) for the next occurrence of the regular expression pattern.
Note that if the pattern is anchored to the beginning or end of the input
string (using '^' or '$'), the search succeeds in single-line
mode only if the beginning or end of the string is contained in the relevant
part denoted by the from and to indices.
The matching routines can be used to determine if the input string, as a whole
or part, matches with the regular expression. If match or matches is used, the string as a whole must match
against the regular expression pattern, regardless of whether it was anchored
to the beginning and end of the string. Accordingly, the same is true for the
other matching routines. Note that the matching routines are the only way to
anchor a pattern in multi-line mode to the beginning and end of the input
string, since '^' and '$' also match linefeeds inside the input
string.
Besides
the features inherited from the
ancestor
classes,
REGEXP
introduces some special features. First, there are features to switch between
the line modes and to check which mode the matcher is in. Second, there are
special features that deal with subexpressions of the regular expression
pattern.
As described by the syntax rules above, subexpressions can be grouped by using
parentheses '( )'. Besides its syntactic relevance, grouping has some
semantic side effects. At compile time, every subexpression enclosed in
'( )' is numbered from left to right, starting with one. The
feature number_of_subexpr returns
what its name suggests: the number of subexpressions encountered in a compiled
regular expression pattern. Now if the pattern is matched against an input
string (by use of match or search) and the match succeeds, all submatches of
the subexpressions are recorded and can be requested from the matcher object.
The features start_of_subexpr and
end_of_subexpr are provided for
this. They expect the number of the subexpression as parameter and return the
corresponding positions in the input string.
This feature is very convenient if you not only want to match an input against
a pattern but also split the input into parts. For example, if you want to
check if an input is a valid date, match it against
'({d}{d}?)/({d}{d}?)/({d}{d})'; if the input really is a date, this will
split it into month, day, and year parts, which can be easily retrieved and
checked for consistent values.
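The date example can be sketched with Python's re module, whose numbered groups behave much like the subexpressions described here. The pattern below is a hand translation of the one in the text, and split_date is a hypothetical helper; Python's groups, start and end play roughly the roles of number_of_subexpr, start_of_subexpr and end_of_subexpr, except that Python's end index is exclusive.

```python
import re

# Hand translation of '({d}{d}?)/({d}{d}?)/({d}{d})' into Python syntax.
DATE = re.compile(r"([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9])")


def split_date(text):
    """Return (month, day, year) as integers, or None if text is no date."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    # m.start(i) and m.end(i) delimit the i-th subexpression in the input.
    parts = [int(text[m.start(i):m.end(i)])
             for i in range(1, DATE.groups + 1)]
    month, day, year = parts
    # A simple consistency check on the extracted values.
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return month, day, year


print(split_date("12/25/97"))
print(split_date("13/01/99"))
```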
You may ask why there is no feature that directly returns the substring of the
input that matches the i-th subexpression. Here the same comments are valid as
mentioned earlier in the discussion of class
EXTRACTOR.
The actual splitting is moved to a separate class, called
REGEXP_EXTRACTOR,
which has exactly the desired feature.
Class
REGEXP_EXTRACTOR
inherits all features from
REGEXP
and
EXTRACTOR
and adds a feature matched_substring, which delivers the substring of
the input string that matched the i-th subexpression of the regular expression.
Note that the result of matched_substring may be void; then the i-th
subexpression was not part of the match, which can be the case, since
subexpressions are numbered at compile time. Note also that the result may be
the empty string in case the subexpression is defined to match the empty
string. If you use matchers of type
REGEXP_EXTRACTOR,
do not forget to release the input string when it is no longer needed.
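The difference between a void result and an empty string can be seen in any matcher with numbered groups. The following lines use Python's re module as a stand-in (an analogy only; None plays the role of void here):

```python
import re

# Subexpression 1 did not take part in the match, so there is no substring.
m = re.fullmatch(r"(a)|(b)", "b")
print(m.group(1), m.group(2))      # None b

# Subexpression 1 took part in the match but matched the empty string.
m = re.fullmatch(r"(x*)y", "y")
print(repr(m.group(1)))            # ''
```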
Both
REGEXP
and
REGEXP_EXTRACTOR
objects can safely be made persistent in the usual way (using
STORABLE).
While
regular expressions are powerful but complex, wildcards are much simpler to use,
although less powerful as an approach to pattern matching. This implementation of
a wildcard pattern matcher is based on the regular expression matcher described
above; the same comments on compile and execution speed are valid here, too.
The interface is easy to use, similar to the file name generation mechanism
used in general UNIX shells, with the addition of alternatives, which
makes it more flexible and powerful. Because of its simplicity, the tool is
also suited to end users.
The discussion on wildcard expressions covers the pattern syntax, special characters, character classes, brace expressions, line modes, and class WILDCARD.
A wildcard pattern is a string of characters, some of which have special
meanings (as in regular expressions). A wildcard pattern matches if the
concatenation of its elements matches.
Each element consists of one of the following and matches when the associated
condition is satisfied:
* | any sequence of characters (zero or more)
? | any single character
c | the character itself, if it's not a special character
[..] | any of the enclosed characters, where characters separated by '-' denote a range
[^..] | any character not in the list of enclosed characters (following '^')
{..} | any character sequence, matched by any of the comma-separated list of wildcard patterns inside the braces (see below)
The following are special characters: *?[]^{},\. To match any special
character itself, the character must be escaped with '\'. An escaped
character that is not one of the special characters above stands for itself,
with the following
exceptions:
\0 | stands for the NUL character
\t | stands for a tab
\n | stands for a newline
\r | stands for a carriage return
\b | stands for a backspace
\f | stands for a form feed
Character classes ([..] and [^..]) are defined the
same way as in regular expressions; for a complete description see the
explanations there. Remember, however, that there are no predefined character
classes.
Brace expressions are useful to express alternatives (which are more powerful
than simple character classes) in a wildcard pattern. A brace expression
consists of a comma-separated list of wildcard patterns enclosed in '{
}'. A brace expression matches whenever one of its alternatives matches.
For example, the pattern "hello {nice ,funny ,}world"
matches the strings: "hello nice world", "hello funny world", and
"hello world" (because of the solitary comma at the end of the brace
expression). Since a brace expression is itself (part of) a wildcard pattern,
brace expressions can be arbitrarily nested.
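Since the wildcard matcher is said to be built on the regular expression matcher, the translation can be pictured as a small rewriting step. The Python sketch below targets Python's re module instead of the library's own regular expressions and covers single-line mode only; the function name wildcard_to_regex is hypothetical, and character classes are copied verbatim, which works for the documented escapes but is otherwise an assumption.

```python
import re

ESCAPES = {"0": "\0", "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f"}


def wildcard_to_regex(pattern):
    """Compile a wildcard pattern (single-line mode) to a Python regex."""
    out, depth, i, n = [], 0, 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character: a control character or a plain literal.
            out.append(re.escape(ESCAPES.get(pattern[i + 1], pattern[i + 1])))
            i += 2
            continue
        if ch == "[":
            # Copy the character class verbatim up to its closing bracket.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":      # leading literal ']'
                j += 1
            while j < n and pattern[j] != "]":
                j += 2 if pattern[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated character class")
            out.append(pattern[i:j + 1])
            i = j + 1
            continue
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "{":
            out.append("(?:")                    # brace expression opens
            depth += 1
        elif ch == "}" and depth > 0:
            out.append(")")
            depth -= 1
        elif ch == "," and depth > 0:
            out.append("|")                      # next alternative
        else:
            out.append(re.escape(ch))
        i += 1
    if depth != 0:
        raise ValueError("unbalanced braces")
    # DOTALL lets '*' and '?' match newlines, as in single-line mode.
    return re.compile("".join(out), re.DOTALL)


greeting = wildcard_to_regex("hello {nice ,funny ,}world")
for candidate in ("hello nice world", "hello funny world", "hello world"):
    print(candidate, bool(greeting.fullmatch(candidate)))
```

Nested brace expressions such as '{a{b,c},d}' fall out of the same loop, because each opening brace simply starts another group.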
As
in regular expressions, a matcher can be in single-line or in multi-line mode.
In single-line mode, which is the default, a pattern matches exactly as
described above. In multi-line mode, the input string will be treated as a
multi-line text buffer. Therefore, the following holds:
? | will not match a newline,
* | matches only characters which are not newlines, and
[^..] | never matches a newline, regardless of what characters are inside the brackets.
Wildcard expression matchers are described by class
WILDCARD,
which inherits from class
MATCHER
and provides a straightforward implementation of the
deferred
features by using a regular expression matcher.
Unlike
REGEXP,
WILDCARD
does not add any features beyond the ones already inherited. To get the matched substring out of the
input string, there is also a class
WILDCARD_EXTRACTOR,
which inherits both from
WILDCARD
and
EXTRACTOR,
providing the desired result.