summaryrefslogtreecommitdiff
path: root/man/regex.texinfo
diff options
context:
space:
mode:
Diffstat (limited to 'man/regex.texinfo')
-rw-r--r--man/regex.texinfo588
1 files changed, 588 insertions, 0 deletions
diff --git a/man/regex.texinfo b/man/regex.texinfo
new file mode 100644
index 00000000000..87529a5ee28
--- /dev/null
+++ b/man/regex.texinfo
@@ -0,0 +1,588 @@
+\input texinfo
+@comment -*- Mode: texinfo -*-
+@comment This documents the GNU regex library
+@setfilename regex
+
+@comment >> @code{"foo"} for literal strings vs @b{"foo"} vs @code{foo}
+@comment >> (this file is presently using the last --- it looks ok in
+@comment >> info; wait to see what it looks like under botex)
+
+
+@comment >> superior of (dir) is temporary
+@node top, syntax, , (dir)
+@comment node-name, next, previous, up
+@chapter @dfn{regex} regular expression matching library.
+
+@section Overview
+
+Regular expression matching allows you to test whether a string fits
+into a specific syntactic shape. You can also search a string for a
+substring that fits a pattern.
+
+A regular expression describes a set of strings. The simplest case is
+one that describes a particular string; for example, the string @samp{foo}
+when regarded as a regular expression matches @samp{foo} and nothing else.
+Nontrivial regular expressions use certain special constructs so that
+they can match more than one string. For example, the regular expression
+@samp{foo\|bar} matches either the string @samp{foo} or the string @samp{bar}; the
+regular expression @samp{c[ad]*r} matches any of the strings @samp{cr}, @samp{car},
+@samp{cdr}, @samp{caar}, @samp{cadddar} and all other such strings with any number of
+@samp{a}'s and @samp{d}'s.
+
+The first step in matching a regular expression is to compile it.
+You must supply the pattern string and also a pattern buffer to hold
+the compiled result. That result contains the pattern in an internal
+format that is easier to use in matching.
+
+Having compiled a pattern, you can match it against strings. You can
+match the compiled pattern any number of times against different
+strings.
+
+@menu
+* syntax:: Syntax of regular expressions
+* directives:: Meaning of characters as regex string directives.
+* emacs:: Additional character directives available
+ only for use within Emacs.
+* programming:: Using the regex library from C programs
+* unix:: Unix-compatible entry-points to regex library
+@end menu
+
+@node syntax, directives, top, top
+@comment node-name, next, previous, up
+@section Syntax of Regular Expressions
+
+Regular expressions have a syntax in which a few characters are special
+constructs and the rest are @dfn{ordinary}. An ordinary character is a
+simple regular expression which matches that character and nothing else.
+The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*},
+@samp{+}, @samp{?}, @samp{[}, @samp{]} and @samp{\}. Any other character
+appearing in a regular expression is ordinary, unless a @samp{\} precedes
+it.@refill
+
+For example, @samp{f} is not a special character, so it is ordinary,
+and therefore @samp{f} is a regular expression that matches the string @samp{f}
+and no other string. (It does @emph{not} match the string @samp{ff}.) Likewise,
+@samp{o} is a regular expression that matches only @samp{o}.
+
+Any two regular expressions @var{a} and @var{b} can be concatenated.
+The result is a regular expression which matches a string if @var{a}
+matches some amount of the beginning of that string and @var{b}
+matches the rest of the string.
+
+As a simple example, we can concatenate the regular expressions
+@samp{f} and @samp{o} to get the regular expression @samp{fo},
+which matches only the string @samp{fo}. Still trivial.
+
+Note: for Unix compatibility, special characters are treated as
+ordinary ones if they are in contexts where their special meanings
+make no sense. For example, @samp{*foo} treats @samp{*} as ordinary since
+there is no preceding expression on which the @samp{*} can act.
+It is poor practice to depend on this behavior; better to quote
+the special character anyway, regardless of where is appears.
+
+
+@node directives, emacs , syntax, top
+@comment node-name, next, previous, up
+
+@ifinfo
+The following are the characters and character sequences which have
+special meaning within regular expressions.
+Any character not mentioned here is not special; it stands for exactly
+itself for the purposes of searching and matching. @xref{syntax}.
+@end ifinfo
+
+@table @samp
+@item .
+is a special character that matches anything except a newline.
+Using concatenation, we can make regular expressions like @samp{a.b} which
+matches any three-character string which begins with @samp{a} and ends with @samp{b}.@refill
+
+@item *
+is not a construct by itself; it is a suffix, which means the preceding
+regular expression is to be repeated as many times as possible. In @samp{fo*},
+the @samp{*} applies to the @samp{o}, so @samp{fo*} matches @samp{f} followed by any number of @samp{o}'s.@refill
+
+The case of zero @samp{o}'s is allowed: @samp{fo*} does match @samp{f}.@refill
+
+@samp{*} always applies to the @emph{smallest} possible preceding expression.
+Thus, @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}.@refill
+
+The matcher processes a @samp{*} construct by matching, immediately, as many
+repetitions as can be found. Then it continues with the rest of the
+pattern. If that fails, backtracking occurs, discarding some of
+the matches of the @samp{*}'d construct in case that makes it possible
+to match the rest of the pattern. For example, matching @samp{c[ad]*ar}
+against the string @samp{caddaar}, the @samp{[ad]*} first matches @samp{addaa},
+but this does not allow the next @samp{a} in the pattern to match.
+So the last of the matches of @samp{[ad]} is undone and the following
+@samp{a} is tried again. Now it succeeds.@refill
+
+@item +
+@samp{+} is like @samp{*} except that at least one match for the preceding
+pattern is required for @samp{+}. Thus, @samp{c[ad]+r} does not match
+@samp{cr} but does match anything else that @samp{c[ad]*r} would match.
+
+@item ?
+@samp{?} is like @samp{*} except that it allows either zero or one match
+for the preceding pattern. Thus, @samp{c[ad]?r} matches @samp{cr} or
+@samp{car} or @samp{cdr}, and nothing else.
+
+@item [ @dots{} ]
+@samp{[} begins a @dfn{character set}, which is terminated by a @samp{]}.
+In the simplest case, the characters between the two form the set.
+Thus, @samp{[ad]} matches either @samp{a} or @samp{d},
+and @samp{[ad]*} matches any string of @samp{a}'s and @samp{d}'s
+(including the empty string), from which it follows that
+@samp{c[ad]*r} matches @samp{car}, etc.@refill
+
+Character ranges can also be included in a character set, by writing two
+characters with a @samp{-} between them. Thus, @samp{[a-z]} matches
+any lower-case letter. Ranges may be intermixed freely with
+individual characters, as in @samp{[a-z$%.]}, which matches any
+lower case letter or @samp{$}, @samp{%} or period.@refill
+
+Note that the usual special characters are not special any more inside a
+character set. A completely different set of special characters exists
+inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill
+
+To include a @samp{]} in a character set, you must make it
+the first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}.
+To include a @samp{-}, you must use it in a context where it cannot possibly
+indicate a range: that is, as the first character, or immediately
+after a range.@refill
+
+@item [^ @dots{} ]
+@samp{[^} begins a @dfn{complement character set}, which matches any
+character except the ones specified. Thus, @samp{[^a-z0-9A-Z]}
+matches all characters @emph{except} letters and digits.@refill
+
+@samp{^} is not special in a character set unless it is the first character.
+The character following the @samp{^} is treated as if it were first
+(it may be a @samp{-} or a @samp{]}).@refill
+
+@item ^
+is a special character that matches the empty string -- but only
+if at the beginning of a line in the text being matched. Otherwise
+it fails to match anything. Thus, @samp{^foo} matches a @samp{foo}
+which occurs at the beginning of a line.@refill
+
+@item $
+is similar to @samp{^} but matches only at the end of a line.
+Thus, @samp{xx*$} matches a string of one or more @samp{x}'s
+at the end of a line.@refill
+
+@item \
+has two functions: it quotes the above special characters
+(including @samp{\}), and it introduces additional special constructs.@refill
+
+Because @samp{\} quotes special characters, @samp{\$} is a regular
+expression which matches only @samp{$}, and @samp{\[} is a regular
+expression which matches only @samp{[}, and so on.@refill
+
+For the most part, @samp{\} followed by any character matches only that
+character. However, there are several exceptions: characters which, when
+preceded by @samp{\}, are special constructs. Such characters are always
+ordinary when encountered on their own.@refill
+
+No new special characters will ever be defined. All extensions to
+the regular expression syntax are made by defining new two-character
+constructs that begin with @samp{\}.@refill
+
+@item \|
+specifies an alternative.
+Two regular expressions @var{a} and @var{b} with @samp{\|} in
+between form an expression that matches anything that either @var{a} or
+@var{b} will match.@refill
+
+Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar}
+but no other string.@refill
+
+@samp{\|} applies to the largest possible surrounding expressions. Only a
+surrounding @samp{\( @dots{} \)} grouping can limit the grouping power of
+@samp{\|}.@refill
+
+Full backtracking capability exists when multiple @samp{\|}'s are used.@refill
+
+@item \( @dots{} \)
+is a grouping construct that serves three purposes:
+
+@enumerate
+@item
+To enclose a set of @samp{\|} alternatives for other operations.
+Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}.
+
+@item
+To enclose a complicated expression for the postfix @samp{*} to operate on.
+Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any (zero or
+more) number of @samp{na}'s.@refill
+
+@item
+To mark a matched substring for future reference.
+
+@end enumerate
+
+This last application is not a consequence of the idea of a parenthetical
+grouping; it is a separate feature which happens to be assigned as a
+second meaning to the same @samp{\( @dots{} \)} construct because there is no
+conflict in practice between the two meanings. Here is an explanation
+of this feature:@refill
+
+@item \@var{digit}
+After the end of a @samp{\( @dots{} \)} construct, the matcher remembers the
+beginning and end of the text matched by that construct. Then, later on
+in the regular expression, you can use @samp{\} followed by @var{digit}
+to mean ``match the same text matched the @var{digit}'th time by the
+@samp{\( @dots{} \)} construct.'' The @samp{\( @dots{} \)} constructs
+are numbered in order of commencement in the regexp.@refill
+
+The strings matching the first nine @samp{\( @dots{} \)} constructs appearing
+in a regular expression are assigned numbers 1 through 9 in order of their
+beginnings.
+@samp{\1} through @samp{\9} may be used to refer to the text matched by
+the corresponding @samp{\( @dots{} \)} construct.@refill
+
+For example, @samp{\(.*\)\1} matches any string that is composed of two
+identical halves. The @samp{\(.*\)} matches the first half, which may be
+anything, but the @samp{\1} that follows must match the same exact text.@refill
+
+@item \b
+matches the empty string, but only if it is at the beginning or
+end of a word. Thus, @samp{\bfoo\b} matches any occurrence of
+@samp{foo} as a separate word. @samp{\bball\(s\|\)\b} matches
+@samp{ball} or @samp{balls} as a separate word.@refill
+
+@item \B
+matches the empty string, provided it is @emph{not} at the beginning or
+end of a word.@refill
+
+@item \<
+matches the empty string, but only if it is at the beginning
+of a word.
+
+@item \>
+matches the empty string, but only if it is at the end of a word.
+
+@item \w
+matches any word-constituent character.
+
+@item \W
+matches any character that is not a word-constituent.
+@end table
+
+There are a number of additional @samp{\} regexp directives available for use
+within Emacs only.
+@ifinfo
+(@pxref{emacs}).
+@comment no need to make a tex xref to something one line down!
+@end ifinfo
+
+@node emacs, programming, directives, top
+@comment node-name, next, previous, up
+@subsection Constructs Available in Emacs Only
+
+@table @samp
+@item \`
+matches the empty string, but only if it is at the beginning
+of the buffer.@refill
+
+@item \'
+matches the empty string, but only if it is at the end of
+the buffer.@refill
+
+@item \s@var{code}
+matches any character whose syntax is @var{code}.
+@var{code} is a letter which represents a syntax code:
+thus, @samp{w} for word constituent, @samp{-} for
+whitespace, @samp{(} for open-parenthesis, etc.
+See the documentation for the Emacs function @samp{modify-syntax-entry}
+for further details.@refill
+
+Thus, @samp{\s(} matches any character with open-parenthesis syntax.
+
+@item \S@var{code}
+matches any character whose syntax is not @var{code}.
+@end table
+
+@node programming, compiling, emacs, top
+@comment node-name, next, previous, up
+@section Programming using the @code{regex} library
+
+@ifinfo
+The subnodes accessible from this menu give information on entry
+points and data structures which C programs need to interface to the
+@code{regex} library.
+@end ifinfo
+
+@menu
+* compiling:: How to compile regular expressions
+* matching:: Matching compiled regular expressions
+* searching:: Searching for compiled regular expressions
+* translation:: Translating characters into other characters
+ (for both compilation and matching)
+* registers:: determining what was matched
+* split:: matching data which is split into two pieces
+* unix:: Unix-compatible entry-points to regex library
+@end menu
+
+@node compiling, matching, programming , programming
+@comment node-name, next, previous, up
+@subsection Compiling a Regular Expression
+
+To compile a regular expression, you must supply a pattern buffer.
+This is a structure defined, in the include file @file{regex.h}, as follows:
+
+@example
+struct re_pattern_buffer
+ @{
+ char *buffer /* Space holding the compiled pattern commands. */
+ int allocated /* Size of space that buffer points to */
+ int used /* Length of portion of buffer actually occupied */
+ char *fastmap; /* Pointer to fastmap, if any, or zero if none. */
+ /* re_search uses the fastmap, if there is one,
+ to skip quickly over totally implausible
+ characters */
+ char *translate;
+ /* Translate table to apply to characters before
+ comparing, or zero for no translation.
+ The translation is applied to a pattern when
+ it is compiled and to data when it is matched. */
+ char fastmap_accurate;
+ /* Set to zero when a new pattern is stored,
+ set to one when the fastmap is updated from it. */
+ @};
+@end example
+
+Before compiling a pattern, you must initialize the @code{buffer} field to
+point to a block of memory obtained with @code{malloc},
+and the @code{allocated} field to the size of that block, in bytes.
+The pattern compiler will replace this block with a larger one if necessary.
+
+You must also initialize the @code{translate} field to point to the translate
+table that you will use when you match the compiled pattern, or to zero
+if you will use no translate table when you match. @xref{translation}.
+
+Then call @code{re_compile_pattern} to compile a regular expression
+into the buffer:
+@example
+re_compile_pattern (@var{regex}, @var{regex_size}, @var{buf})
+@end example
+
+@var{regex} is the address of the regular expression (@code{char *}),
+@var{regex_size} is its length (@code{int}),
+@var{buf} is the address of the buffer (@code{struct re_pattern_buffer *}).
+
+@code{re_compile_pattern} returns zero if it succeeds in compiling the regular
+expression. In that case, @code{*buf} now contains the results.
+Otherwise, @code{re_compile_pattern} returns a string which serves as
+an error message.
+
+After compiling, if you wish to search for the pattern, you must
+initialize the @code{fastmap} component of the pattern buffer.
+@xref{searching}.
+
+@node matching, searching, compiling, programming
+@comment node-name, next, previous, up
+@subsection Matching a Compiled Pattern
+
+Once a regular expression has been compiled into a pattern buffer,
+you can match the pattern buffer against a string with @code{re_match}.
+
+@example
+re_match (@var{buf}, @var{string}, @var{size}, @var{pos}, @var{regs})
+@end example
+
+@var{buf} is, once again, the address of the buffer (@code{struct re_pattern_buffer *}).
+@var{string} is the string to be matched (@code{char *}).
+@var{size} is the length of that string (@code{int}).
+@var{pos} is the position within the string at which to begin matching (@code{int}).
+The beginning of the string is position 0.
+@var{regs} is described below. Normally it is zero. @xref{registers}.
+
+@code{re_match} returns @code{-1} if the pattern does not match; otherwise,
+it returns the length of the portion of @code{string} which was matched.
+
+For example, suppose that @var{buf} points to a buffer containing the result
+of compiling @code{x*}, @var{string} points to @code{xxxxxy}, and @var{size} is @code{6}.
+Suppose that @var{pos} is @code{2}. Then the last three @code{x}'s will be matched,
+so @code{re_match} will return @code{3}.
+If @var{pos} is zero, the value will be @code{5}.
+If @var{pos} is @code{5} or @code{6}, the value will be zero, meaning that the null string
+was successfully matched.
+Note that since @code{x*} matches the empty string, it will never entirely fail.
+
+It is up to the caller to avoid passing a value of @var{pos} that results in
+matching outside the specified string. @var{pos} must not be negative and
+must not be greater than @var{size}.
+
+@node searching, translation, matching, programming
+@comment node-name, next, previous, up
+@subsection Searching for a Match
+
+Searching means trying successive starting positions for a match until a
+match is found. To search, you supply a compiled pattern buffer. Before
+searching you must initialize the @code{fastmap} field of the pattern
+buffer (see below).
+
+@example
+re_search (@var{buf}, @var{string}, @var{size}, @var{startpos}, @var{range}, @var{regs})
+@end example
+
+@noindent
+is called like @code{re_match} except that the @var{pos} argument is
+replaced by two arguments @var{startpos} and @var{range}. @code{re_search}
+tests for a match starting at index @var{startpos}, then at
+@code{@var{startpos} + 1}, and so on. It tries @var{range} consecutive
+positions before giving up and returning @code{-1}. If a match is found,
+@code{re_search} returns the index at which the match was found.@refill
+
+If @var{range} is negative, @var{re_search} tries starting positions
+@var{startpos}, @code{@var{startpos} - 1}, @dots{} in that order.
+@code{|@var{range}|} is the number of tries made.@refill
+
+It is up to the caller to avoid passing value of @var{startpos} and
+@var{range} that result in matching outside the specified string.
+@var{startpos} must be between zero and @var{size}, inclusive, and so must
+@code{@var{startpos} + @var{range} - 1} (if @var{range} is positive) or
+@code{@var{startpos} + @var{range} + 1} (if @var{range} is negative).@refill
+
+If you may be searching over a long distance (that is, trying many
+different match starting points) with a compiled pattern, you should use a
+@dfn{fastmap} in it. This is a block of 256 bytes, whose address is
+placed in the @code{fastmap} component of the pattern buffer. The first
+time you search for a particular compiled pattern, the fastmap is set so
+that @code{@var{fastmap}[@var{ch}]} is nonzero if the character @var{ch}
+might possibly start a match for this pattern. @code{re_search} checks
+each character against the fastmap so that it can skip more quickly over
+non-matches.
+
+If you do not want a fastmap, store zero in the @code{fastmap} component of the
+pattern buffer before calling @code{re_search}.
+
+In either case, you must initialize this component in a pattern buffer
+before you can use that buffer in a search; but you can choose as an
+initial value either zero or the address of a suitable block of memory.
+
+If you compile a new pattern in an existing pattern buffer, it is not
+necessary to reinitialize the @code{fastmap} component (unless you
+wish to override your previous choice).
+
+@node translation, registers, searching, programming
+@comment node-name, next, previous, up
+@subsection Translate Tables
+
+With a translate table, you can apply a transformation to all characters
+before they are compared. For example, a table that maps lower case letters
+into upper case (or vice versa) causes differences in case to be ignored
+by matching.
+
+A translate table is a block of 256 bytes. Each character of raw data is
+used as an index in the translate table. The value found there is used
+instead of the original character. Each character in a regular
+expression, except for the syntactic constructs, is translated when the
+expression is compiled. Each character of a string being matched is
+translated whenever it is compared or tested.
+
+A suitable translate table to ignore differences in case maps all
+characters into themselves, except for lower case letters, which are
+mapped into the corresponding upper case letters.
+It could be initialized by:
+
+@example
+for (i = 0; i < 0400; i++)
+ table[i] = i;
+for (i = 'a'; i <= 'z'; i++)
+ table[i] = i - 040;
+@end example
+
+You specify the use of a translate table by putting its address in the
+@var{translate} component of the compiled pattern buffer. If this component
+is zero, no translation is done. Since both compilation and matching use
+the translate table, you must use the same table contents for both
+operations or confusing things will happen.
+
+@node registers, split, translation, programming
+@comment node-name, next, previous, up
+@subsection Registers: or ``What Did the @samp{\( @dots{} \)} Groupings Actually Match?''
+
+If you want to find out, after the match, what each of the first nine
+@samp{\( @dots{} \)} groupings actually matched, you can pass the @var{regs} argument
+to the match or search function. Pass the address of a structure of this type:
+
+@example
+struct re_registers
+ @{
+ int start[RE_NREGS];
+ int end[RE_NREGS];
+ @};
+@end example
+
+ @code{re_match} and @code{re_search} will store into this structure the
+data you want. @code{@var{regs}->start[@var{reg}]} will be the index in
+@var{string} of the beginning of the data matched by the @var{reg}'th
+@samp{\( @dots{} \)} grouping, and @code{@var{regs}->end[@var{reg}]} will
+be the index of the end of that data (the index of the first character
+beyond those matched). The values in the start and end arrays at
+indexes greater than the number of @samp{\( @dots{} \)} groupings
+present in the regular expression will be set to the value -1. Register
+numbers start at 1 and run to @code{RE_NREGS - 1} (normally @code{9}).
+@code{@var{regs}->start[0]} and @code{@var{regs}->end[0]} are similar but
+describe the extent of the substring matched by the entire pattern.@refill
+
+ Both @code{struct re_registers} and @code{RE_NREGS} are defined in @file{regex.h}.
+
+@node split, unix, registers, programming
+@comment node-name, next, previous, up
+@subsection Matching against Split Data
+
+The functions @code{re_match_2} and @code{re_search_2} allow one to match in or search
+data which is divided into two strings.
+
+@code{re_match_2} works like @code{re_match} except that two data strings and
+sizes must be given.
+
+@example
+re_match_2 (@var{buf}, @var{string1}, @var{size1}, @var{string2}, @var{size2}, @var{pos}, @var{regs})
+@end example
+
+The matcher regards the contents of @var{string1} as effectively followed by
+the contents of @var{string2}, and matches the combined string against the
+pattern in @var{buf}.
+
+@code{re_search_2} is likewise similar to @code{re_search}:
+
+@example
+re_search_2 (@var{buf}, @var{string1}, @var{size1}, @var{string2}, @var{size2}, @var{startpos}, @var{range}, @var{regs})
+@end example
+
+The value returned by @var{re_search_2} is an index into the combined data
+made up of @var{string1} and @var{string2}. It never exceeds @code{@var{size1} + @var{size2}}.
+The values returned in the @var{regs} structure (if there is one) are likewise
+indices in the combined data.
+
+@node unix, , split, programming
+@comment node-name, next, previous, up
+@subsection Unix-Compatible Entry Points
+
+The standard Berkeley Unix way to compile a regular expression is to call
+@code{re_comp}. This function takes a single argument, the address of the
+regular expression, which is assumed to be terminated by a null character.
+
+@code{re_comp} does not ask you to specify a pattern buffer because it has its
+own pattern buffer --- just one. Using @code{re_comp}, one may match only the
+most recently compiled regular expression.
+
+The value of @code{re_comp} is zero for success or else an error message string,
+as for @code{re_compile_pattern}.
+
+Calling @code{re_comp} with the null string as argument it has no effect;
+the contents of the buffer remain unchanged.
+
+The standard Berkeley Unix way to match the last regular expression compiled
+is to call @code{re_exec}. This takes a single argument, the address of
+the string to be matched. This string is assumed to be terminated by
+a null character. Matching is tried starting at each position in the
+string. @code{re_exec} returns @code{1} for success or @code{0} for failure.
+One cannot find out how long a substring was matched, nor what the
+@samp{\( @dots{} \)} groupings matched.
+
+@bye