diff options
Diffstat (limited to 'man/regex.texinfo')
-rw-r--r-- | man/regex.texinfo | 588 |
1 files changed, 588 insertions, 0 deletions
diff --git a/man/regex.texinfo b/man/regex.texinfo new file mode 100644 index 00000000000..87529a5ee28 --- /dev/null +++ b/man/regex.texinfo @@ -0,0 +1,588 @@ +\input texinfo +@comment -*- Mode: texinfo -*- +@comment This documents the GNU regex library +@setfilename regex + +@comment >> @code{"foo"} for literal strings vs @b{"foo"} vs @code{foo} +@comment >> (this file is presently using the last --- it looks ok in +@comment >> info; wait to see what it looks like under botex) + + +@comment >> superior of (dir) is temporary +@node top, syntax, , (dir) +@comment node-name, next, previous, up +@chapter @dfn{regex} regular expression matching library. + +@section Overview + +Regular expression matching allows you to test whether a string fits +into a specific syntactic shape. You can also search a string for a +substring that fits a pattern. + +A regular expression describes a set of strings. The simplest case is +one that describes a particular string; for example, the string @samp{foo} +when regarded as a regular expression matches @samp{foo} and nothing else. +Nontrivial regular expressions use certain special constructs so that +they can match more than one string. For example, the regular expression +@samp{foo\|bar} matches either the string @samp{foo} or the string @samp{bar}; the +regular expression @samp{c[ad]*r} matches any of the strings @samp{cr}, @samp{car}, +@samp{cdr}, @samp{caar}, @samp{cadddar} and all other such strings with any number of +@samp{a}'s and @samp{d}'s. + +The first step in matching a regular expression is to compile it. +You must supply the pattern string and also a pattern buffer to hold +the compiled result. That result contains the pattern in an internal +format that is easier to use in matching. + +Having compiled a pattern, you can match it against strings. You can +match the compiled pattern any number of times against different +strings. + +@menu +* syntax:: Syntax of regular expressions +* directives:: Meaning of characters as regex string directives. +* emacs:: Additional character directives available + only for use within Emacs. +* programming:: Using the regex library from C programs +* unix:: Unix-compatible entry-points to regex library +@end menu + +@node syntax, directives, top, top +@comment node-name, next, previous, up +@section Syntax of Regular Expressions + +Regular expressions have a syntax in which a few characters are special +constructs and the rest are @dfn{ordinary}. An ordinary character is a +simple regular expression which matches that character and nothing else. +The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*}, +@samp{+}, @samp{?}, @samp{[}, @samp{]} and @samp{\}. Any other character +appearing in a regular expression is ordinary, unless a @samp{\} precedes +it.@refill + +For example, @samp{f} is not a special character, so it is ordinary, +and therefore @samp{f} is a regular expression that matches the string @samp{f} +and no other string. (It does @emph{not} match the string @samp{ff}.) Likewise, +@samp{o} is a regular expression that matches only @samp{o}. + +Any two regular expressions @var{a} and @var{b} can be concatenated. +The result is a regular expression which matches a string if @var{a} +matches some amount of the beginning of that string and @var{b} +matches the rest of the string. + +As a simple example, we can concatenate the regular expressions +@samp{f} and @samp{o} to get the regular expression @samp{fo}, +which matches only the string @samp{fo}. Still trivial. + +Note: for Unix compatibility, special characters are treated as +ordinary ones if they are in contexts where their special meanings +make no sense. For example, @samp{*foo} treats @samp{*} as ordinary since +there is no preceding expression on which the @samp{*} can act. +It is poor practice to depend on this behavior; better to quote +the special character anyway, regardless of where is appears. + + +@node directives, emacs , syntax, top +@comment node-name, next, previous, up + +@ifinfo +The following are the characters and character sequences which have +special meaning within regular expressions. +Any character not mentioned here is not special; it stands for exactly +itself for the purposes of searching and matching. @xref{syntax}. +@end ifinfo + +@table @samp +@item . +is a special character that matches anything except a newline. +Using concatenation, we can make regular expressions like @samp{a.b} which +matches any three-character string which begins with @samp{a} and ends with @samp{b}.@refill + +@item * +is not a construct by itself; it is a suffix, which means the preceding +regular expression is to be repeated as many times as possible. In @samp{fo*}, +the @samp{*} applies to the @samp{o}, so @samp{fo*} matches @samp{f} followed by any number of @samp{o}'s.@refill + +The case of zero @samp{o}'s is allowed: @samp{fo*} does match @samp{f}.@refill + +@samp{*} always applies to the @emph{smallest} possible preceding expression. +Thus, @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}.@refill + +The matcher processes a @samp{*} construct by matching, immediately, as many +repetitions as can be found. Then it continues with the rest of the +pattern. If that fails, backtracking occurs, discarding some of +the matches of the @samp{*}'d construct in case that makes it possible +to match the rest of the pattern. For example, matching @samp{c[ad]*ar} +against the string @samp{caddaar}, the @samp{[ad]*} first matches @samp{addaa}, +but this does not allow the next @samp{a} in the pattern to match. +So the last of the matches of @samp{[ad]} is undone and the following +@samp{a} is tried again. Now it succeeds.@refill + +@item + +@samp{+} is like @samp{*} except that at least one match for the preceding +pattern is required for @samp{+}. Thus, @samp{c[ad]+r} does not match +@samp{cr} but does match anything else that @samp{c[ad]*r} would match. + +@item ? +@samp{?} is like @samp{*} except that it allows either zero or one match +for the preceding pattern. Thus, @samp{c[ad]?r} matches @samp{cr} or +@samp{car} or @samp{cdr}, and nothing else. + +@item [ @dots{} ] +@samp{[} begins a @dfn{character set}, which is terminated by a @samp{]}. +In the simplest case, the characters between the two form the set. +Thus, @samp{[ad]} matches either @samp{a} or @samp{d}, +and @samp{[ad]*} matches any string of @samp{a}'s and @samp{d}'s +(including the empty string), from which it follows that +@samp{c[ad]*r} matches @samp{car}, etc.@refill + +Character ranges can also be included in a character set, by writing two +characters with a @samp{-} between them. Thus, @samp{[a-z]} matches +any lower-case letter. Ranges may be intermixed freely with +individual characters, as in @samp{[a-z$%.]}, which matches any +lower case letter or @samp{$}, @samp{%} or period.@refill + +Note that the usual special characters are not special any more inside a +character set. A completely different set of special characters exists +inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill + +To include a @samp{]} in a character set, you must make it +the first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. +To include a @samp{-}, you must use it in a context where it cannot possibly +indicate a range: that is, as the first character, or immediately +after a range.@refill + +@item [^ @dots{} ] +@samp{[^} begins a @dfn{complement character set}, which matches any +character except the ones specified. Thus, @samp{[^a-z0-9A-Z]} +matches all characters @emph{except} letters and digits.@refill + +@samp{^} is not special in a character set unless it is the first character. +The character following the @samp{^} is treated as if it were first +(it may be a @samp{-} or a @samp{]}).@refill + +@item ^ +is a special character that matches the empty string -- but only +if at the beginning of a line in the text being matched. Otherwise +it fails to match anything. Thus, @samp{^foo} matches a @samp{foo} +which occurs at the beginning of a line.@refill + +@item $ +is similar to @samp{^} but matches only at the end of a line. +Thus, @samp{xx*$} matches a string of one or more @samp{x}'s +at the end of a line.@refill + +@item \ +has two functions: it quotes the above special characters +(including @samp{\}), and it introduces additional special constructs.@refill + +Because @samp{\} quotes special characters, @samp{\$} is a regular +expression which matches only @samp{$}, and @samp{\[} is a regular +expression which matches only @samp{[}, and so on.@refill + +For the most part, @samp{\} followed by any character matches only that +character. However, there are several exceptions: characters which, when +preceded by @samp{\}, are special constructs. Such characters are always +ordinary when encountered on their own.@refill + +No new special characters will ever be defined. All extensions to +the regular expression syntax are made by defining new two-character +constructs that begin with @samp{\}.@refill + +@item \| +specifies an alternative. +Two regular expressions @var{a} and @var{b} with @samp{\|} in +between form an expression that matches anything that either @var{a} or +@var{b} will match.@refill + +Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar} +but no other string.@refill + +@samp{\|} applies to the largest possible surrounding expressions. Only a +surrounding @samp{\( @dots{} \)} grouping can limit the grouping power of +@samp{\|}.@refill + +Full backtracking capability exists when multiple @samp{\|}'s are used.@refill + +@item \( @dots{} \) +is a grouping construct that serves three purposes: + +@enumerate +@item +To enclose a set of @samp{\|} alternatives for other operations. +Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}. + +@item +To enclose a complicated expression for the postfix @samp{*} to operate on. +Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any (zero or +more) number of @samp{na}'s.@refill + +@item +To mark a matched substring for future reference. + +@end enumerate + +This last application is not a consequence of the idea of a parenthetical +grouping; it is a separate feature which happens to be assigned as a +second meaning to the same @samp{\( @dots{} \)} construct because there is no +conflict in practice between the two meanings. Here is an explanation +of this feature:@refill + +@item \@var{digit} +After the end of a @samp{\( @dots{} \)} construct, the matcher remembers the +beginning and end of the text matched by that construct. Then, later on +in the regular expression, you can use @samp{\} followed by @var{digit} +to mean ``match the same text matched the @var{digit}'th time by the +@samp{\( @dots{} \)} construct.'' The @samp{\( @dots{} \)} constructs +are numbered in order of commencement in the regexp.@refill + +The strings matching the first nine @samp{\( @dots{} \)} constructs appearing +in a regular expression are assigned numbers 1 through 9 in order of their +beginnings. +@samp{\1} through @samp{\9} may be used to refer to the text matched by +the corresponding @samp{\( @dots{} \)} construct.@refill + +For example, @samp{\(.*\)\1} matches any string that is composed of two +identical halves. The @samp{\(.*\)} matches the first half, which may be +anything, but the @samp{\1} that follows must match the same exact text.@refill + +@item \b +matches the empty string, but only if it is at the beginning or +end of a word. Thus, @samp{\bfoo\b} matches any occurrence of +@samp{foo} as a separate word. @samp{\bball\(s\|\)\b} matches +@samp{ball} or @samp{balls} as a separate word.@refill + +@item \B +matches the empty string, provided it is @emph{not} at the beginning or +end of a word.@refill + +@item \< +matches the empty string, but only if it is at the beginning +of a word. + +@item \> +matches the empty string, but only if it is at the end of a word. + +@item \w +matches any word-constituent character. + +@item \W +matches any character that is not a word-constituent. +@end table + +There are a number of additional @samp{\} regexp directives available for use +within Emacs only. +@ifinfo +(@pxref{emacs}). +@comment no need to make a tex xref to something one line down! +@end ifinfo + +@node emacs, programming, directives, top +@comment node-name, next, previous, up +@subsection Constructs Available in Emacs Only + +@table @samp +@item \` +matches the empty string, but only if it is at the beginning +of the buffer.@refill + +@item \' +matches the empty string, but only if it is at the end of +the buffer.@refill + +@item \s@var{code} +matches any character whose syntax is @var{code}. +@var{code} is a letter which represents a syntax code: +thus, @samp{w} for word constituent, @samp{-} for +whitespace, @samp{(} for open-parenthesis, etc. +See the documentation for the Emacs function @samp{modify-syntax-entry} +for further details.@refill + +Thus, @samp{\s(} matches any character with open-parenthesis syntax. + +@item \S@var{code} +matches any character whose syntax is not @var{code}. +@end table + +@node programming, compiling, emacs, top +@comment node-name, next, previous, up +@section Programming using the @code{regex} library + +@ifinfo +The subnodes accessible from this menu give information on entry +points and data structures which C programs need to interface to the +@code{regex} library. +@end ifinfo + +@menu +* compiling:: How to compile regular expressions +* matching:: Matching compiled regular expressions +* searching:: Searching for compiled regular expressions +* translation:: Translating characters into other characters + (for both compilation and matching) +* registers:: determining what was matched +* split:: matching data which is split into two pieces +* unix:: Unix-compatible entry-points to regex library +@end menu + +@node compiling, matching, programming , programming +@comment node-name, next, previous, up +@subsection Compiling a Regular Expression + +To compile a regular expression, you must supply a pattern buffer. +This is a structure defined, in the include file @file{regex.h}, as follows: + +@example +struct re_pattern_buffer + @{ + char *buffer /* Space holding the compiled pattern commands. */ + int allocated /* Size of space that buffer points to */ + int used /* Length of portion of buffer actually occupied */ + char *fastmap; /* Pointer to fastmap, if any, or zero if none. */ + /* re_search uses the fastmap, if there is one, + to skip quickly over totally implausible + characters */ + char *translate; + /* Translate table to apply to characters before + comparing, or zero for no translation. + The translation is applied to a pattern when + it is compiled and to data when it is matched. */ + char fastmap_accurate; + /* Set to zero when a new pattern is stored, + set to one when the fastmap is updated from it. */ + @}; +@end example + +Before compiling a pattern, you must initialize the @code{buffer} field to +point to a block of memory obtained with @code{malloc}, +and the @code{allocated} field to the size of that block, in bytes. +The pattern compiler will replace this block with a larger one if necessary. + +You must also initialize the @code{translate} field to point to the translate +table that you will use when you match the compiled pattern, or to zero +if you will use no translate table when you match. @xref{translation}. + +Then call @code{re_compile_pattern} to compile a regular expression +into the buffer: +@example +re_compile_pattern (@var{regex}, @var{regex_size}, @var{buf}) +@end example + +@var{regex} is the address of the regular expression (@code{char *}), +@var{regex_size} is its length (@code{int}), +@var{buf} is the address of the buffer (@code{struct re_pattern_buffer *}). + +@code{re_compile_pattern} returns zero if it succeeds in compiling the regular +expression. In that case, @code{*buf} now contains the results. +Otherwise, @code{re_compile_pattern} returns a string which serves as +an error message. + +After compiling, if you wish to search for the pattern, you must +initialize the @code{fastmap} component of the pattern buffer. +@xref{searching}. + +@node matching, searching, compiling, programming +@comment node-name, next, previous, up +@subsection Matching a Compiled Pattern + +Once a regular expression has been compiled into a pattern buffer, +you can match the pattern buffer against a string with @code{re_match}. + +@example +re_match (@var{buf}, @var{string}, @var{size}, @var{pos}, @var{regs}) +@end example + +@var{buf} is, once again, the address of the buffer (@code{struct re_pattern_buffer *}). +@var{string} is the string to be matched (@code{char *}). +@var{size} is the length of that string (@code{int}). +@var{pos} is the position within the string at which to begin matching (@code{int}). +The beginning of the string is position 0. +@var{regs} is described below. Normally it is zero. @xref{registers}. + +@code{re_match} returns @code{-1} if the pattern does not match; otherwise, +it returns the length of the portion of @code{string} which was matched. + +For example, suppose that @var{buf} points to a buffer containing the result +of compiling @code{x*}, @var{string} points to @code{xxxxxy}, and @var{size} is @code{6}. +Suppose that @var{pos} is @code{2}. Then the last three @code{x}'s will be matched, +so @code{re_match} will return @code{3}. +If @var{pos} is zero, the value will be @code{5}. +If @var{pos} is @code{5} or @code{6}, the value will be zero, meaning that the null string +was successfully matched. +Note that since @code{x*} matches the empty string, it will never entirely fail. + +It is up to the caller to avoid passing a value of @var{pos} that results in +matching outside the specified string. @var{pos} must not be negative and +must not be greater than @var{size}. + +@node searching, translation, matching, programming +@comment node-name, next, previous, up +@subsection Searching for a Match + +Searching means trying successive starting positions for a match until a +match is found. To search, you supply a compiled pattern buffer. Before +searching you must initialize the @code{fastmap} field of the pattern +buffer (see below). + +@example +re_search (@var{buf}, @var{string}, @var{size}, @var{startpos}, @var{range}, @var{regs}) +@end example + +@noindent +is called like @code{re_match} except that the @var{pos} argument is +replaced by two arguments @var{startpos} and @var{range}. @code{re_search} +tests for a match starting at index @var{startpos}, then at +@code{@var{startpos} + 1}, and so on. It tries @var{range} consecutive +positions before giving up and returning @code{-1}. If a match is found, +@code{re_search} returns the index at which the match was found.@refill + +If @var{range} is negative, @var{re_search} tries starting positions +@var{startpos}, @code{@var{startpos} - 1}, @dots{} in that order. +@code{|@var{range}|} is the number of tries made.@refill + +It is up to the caller to avoid passing value of @var{startpos} and +@var{range} that result in matching outside the specified string. +@var{startpos} must be between zero and @var{size}, inclusive, and so must +@code{@var{startpos} + @var{range} - 1} (if @var{range} is positive) or +@code{@var{startpos} + @var{range} + 1} (if @var{range} is negative).@refill + +If you may be searching over a long distance (that is, trying many +different match starting points) with a compiled pattern, you should use a +@dfn{fastmap} in it. This is a block of 256 bytes, whose address is +placed in the @code{fastmap} component of the pattern buffer. The first +time you search for a particular compiled pattern, the fastmap is set so +that @code{@var{fastmap}[@var{ch}]} is nonzero if the character @var{ch} +might possibly start a match for this pattern. @code{re_search} checks +each character against the fastmap so that it can skip more quickly over +non-matches. + +If you do not want a fastmap, store zero in the @code{fastmap} component of the +pattern buffer before calling @code{re_search}. + +In either case, you must initialize this component in a pattern buffer +before you can use that buffer in a search; but you can choose as an +initial value either zero or the address of a suitable block of memory. + +If you compile a new pattern in an existing pattern buffer, it is not +necessary to reinitialize the @code{fastmap} component (unless you +wish to override your previous choice). + +@node translation, registers, searching, programming +@comment node-name, next, previous, up +@subsection Translate Tables + +With a translate table, you can apply a transformation to all characters +before they are compared. For example, a table that maps lower case letters +into upper case (or vice versa) causes differences in case to be ignored +by matching. + +A translate table is a block of 256 bytes. Each character of raw data is +used as an index in the translate table. The value found there is used +instead of the original character. Each character in a regular +expression, except for the syntactic constructs, is translated when the +expression is compiled. Each character of a string being matched is +translated whenever it is compared or tested. + +A suitable translate table to ignore differences in case maps all +characters into themselves, except for lower case letters, which are +mapped into the corresponding upper case letters. +It could be initialized by: + +@example +for (i = 0; i < 0400; i++) + table[i] = i; +for (i = 'a'; i <= 'z'; i++) + table[i] = i - 040; +@end example + +You specify the use of a translate table by putting its address in the +@var{translate} component of the compiled pattern buffer. If this component +is zero, no translation is done. Since both compilation and matching use +the translate table, you must use the same table contents for both +operations or confusing things will happen. + +@node registers, split, translation, programming +@comment node-name, next, previous, up +@subsection Registers: or ``What Did the @samp{\( @dots{} \)} Groupings Actually Match?'' + +If you want to find out, after the match, what each of the first nine +@samp{\( @dots{} \)} groupings actually matched, you can pass the @var{regs} argument +to the match or search function. Pass the address of a structure of this type: + +@example +struct re_registers + @{ + int start[RE_NREGS]; + int end[RE_NREGS]; + @}; +@end example + + @code{re_match} and @code{re_search} will store into this structure the +data you want. @code{@var{regs}->start[@var{reg}]} will be the index in +@var{string} of the beginning of the data matched by the @var{reg}'th +@samp{\( @dots{} \)} grouping, and @code{@var{regs}->end[@var{reg}]} will +be the index of the end of that data (the index of the first character +beyond those matched). The values in the start and end arrays at +indexes greater than the number of @samp{\( @dots{} \)} groupings +present in the regular expression will be set to the value -1. Register +numbers start at 1 and run to @code{RE_NREGS - 1} (normally @code{9}). +@code{@var{regs}->start[0]} and @code{@var{regs}->end[0]} are similar but +describe the extent of the substring matched by the entire pattern.@refill + + Both @code{struct re_registers} and @code{RE_NREGS} are defined in @file{regex.h}. + +@node split, unix, registers, programming +@comment node-name, next, previous, up +@subsection Matching against Split Data + +The functions @code{re_match_2} and @code{re_search_2} allow one to match in or search +data which is divided into two strings. + +@code{re_match_2} works like @code{re_match} except that two data strings and +sizes must be given. + +@example +re_match_2 (@var{buf}, @var{string1}, @var{size1}, @var{string2}, @var{size2}, @var{pos}, @var{regs}) +@end example + +The matcher regards the contents of @var{string1} as effectively followed by +the contents of @var{string2}, and matches the combined string against the +pattern in @var{buf}. + +@code{re_search_2} is likewise similar to @code{re_search}: + +@example +re_search_2 (@var{buf}, @var{string1}, @var{size1}, @var{string2}, @var{size2}, @var{startpos}, @var{range}, @var{regs}) +@end example + +The value returned by @var{re_search_2} is an index into the combined data +made up of @var{string1} and @var{string2}. It never exceeds @code{@var{size1} + @var{size2}}. +The values returned in the @var{regs} structure (if there is one) are likewise +indices in the combined data. + +@node unix, , split, programming +@comment node-name, next, previous, up +@subsection Unix-Compatible Entry Points + +The standard Berkeley Unix way to compile a regular expression is to call +@code{re_comp}. This function takes a single argument, the address of the +regular expression, which is assumed to be terminated by a null character. + +@code{re_comp} does not ask you to specify a pattern buffer because it has its +own pattern buffer --- just one. Using @code{re_comp}, one may match only the +most recently compiled regular expression. + +The value of @code{re_comp} is zero for success or else an error message string, +as for @code{re_compile_pattern}. + +Calling @code{re_comp} with the null string as argument it has no effect; +the contents of the buffer remain unchanged. + +The standard Berkeley Unix way to match the last regular expression compiled +is to call @code{re_exec}. This takes a single argument, the address of +the string to be matched. This string is assumed to be terminated by +a null character. Matching is tried starting at each position in the +string. @code{re_exec} returns @code{1} for success or @code{0} for failure. +One cannot find out how long a substring was matched, nor what the +@samp{\( @dots{} \)} groupings matched. + +@bye |