Diffstat (limited to 'doc/lispref/searching.texi')
-rw-r--r-- | doc/lispref/searching.texi | 235 |
1 files changed, 179 insertions, 56 deletions
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index c697c929b6a..2fa7ebc903d 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -18,11 +18,12 @@ portions of it. * Searching and Case:: Case-independent or case-significant searching. * Regular Expressions:: Describing classes of strings. * Regexp Search:: Searching for a match for a regexp. -* POSIX Regexps:: Searching POSIX-style for the longest match. +* Longest Match:: Searching for the longest match. * Match Data:: Finding out which part of the text matched, after a string or regexp search. * Search and Replace:: Commands that loop, searching and replacing. * Standard Regexps:: Useful regexps for finding sentences, pages,... +* POSIX Regexps:: Emacs regexps vs POSIX regexps. @end menu The @samp{skip-chars@dots{}} functions also perform a kind of searching. @@ -277,10 +278,10 @@ character is a simple regular expression that matches that character and nothing else. The special characters are @samp{.}, @samp{*}, @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new special characters will be defined in the future. The character -@samp{]} is special if it ends a character alternative (see later). -The character @samp{-} is special inside a character alternative. A +@samp{]} is special if it ends a bracket expression (see later). +The character @samp{-} is special inside a bracket expression. A @samp{[:} and balancing @samp{:]} enclose a character class inside a -character alternative. Any other character appearing in a regular +bracket expression. Any other character appearing in a regular expression is ordinary, unless a @samp{\} precedes it. For example, @samp{f} is not a special character, so it is ordinary, and @@ -373,19 +374,21 @@ expression @samp{c[ad]*?a}, applied to that same string, matches just permits the whole expression to match is @samp{d}.) 
@item @samp{[ @dots{} ]} +@cindex bracket expression (in regexp) @cindex character alternative (in regexp) @cindex @samp{[} in regexp @cindex @samp{]} in regexp -is a @dfn{character alternative}, which begins with @samp{[} and is -terminated by @samp{]}. In the simplest case, the characters between -the two brackets are what this character alternative can match. +is a @dfn{bracket expression} (a.k.a.@: @dfn{character alternative}), +which begins with @samp{[} and is terminated by @samp{]}. In the +simplest case, the characters between the two brackets are what this +bracket expression can match. Thus, @samp{[ad]} matches either one @samp{a} or one @samp{d}, and @samp{[ad]*} matches any string composed of just @samp{a}s and @samp{d}s (including the empty string). It follows that @samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr}, @samp{caddaar}, etc. -You can also include character ranges in a character alternative, by +You can also include character ranges in a bracket expression, by writing the starting and ending characters with a @samp{-} between them. Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter. Ranges may be intermixed freely with individual characters, as in @@ -394,7 +397,7 @@ or @samp{$}, @samp{%} or period. However, the ending character of one range should not be the starting point of another one; for example, @samp{[a-m-z]} should be avoided. -A character alternative can also specify named character classes +A bracket expression can also specify named character classes (@pxref{Char Classes}). For example, @samp{[[:ascii:]]} matches any @acronym{ASCII} character. Using a character class is equivalent to mentioning each of the characters in that class; but the latter is not @@ -403,9 +406,9 @@ different characters. A character class should not appear as the lower or upper bound of a range. The usual regexp special characters are not special inside a -character alternative. 
A completely different set of characters is +bracket expression. A completely different set of characters is special: @samp{]}, @samp{-} and @samp{^}. -To include @samp{]} in a character alternative, put it at the +To include @samp{]} in a bracket expression, put it at the beginning. To include @samp{^}, put it anywhere but at the beginning. To include @samp{-}, put it at the end. Thus, @samp{[]^-]} matches all three of these special characters. You cannot use @samp{\} to @@ -443,7 +446,7 @@ characters and raw 8-bit bytes, but not non-ASCII characters. This feature is intended for searching text in unibyte buffers and strings. @end enumerate -Some kinds of character alternatives are not the best style even +Some kinds of bracket expressions are not the best style even though they have a well-defined meaning in Emacs. They include: @enumerate @@ -457,7 +460,7 @@ Unicode character escapes can help here; for example, for most programmers @samp{[ก-ฺ฿-๛]} is less clear than @samp{[\u0E01-\u0E3A\u0E3F-\u0E5B]}. @item -Although a character alternative can include duplicates, it is better +Although a bracket expression can include duplicates, it is better style to avoid them. For example, @samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}. @@ -468,30 +471,30 @@ is simpler to list the characters. For example, than @samp{[ij]}, and @samp{[i-k]} is less clear than @samp{[ijk]}. @item -Although a @samp{-} can appear at the beginning of a character -alternative or as the upper bound of a range, it is better style to -put @samp{-} by itself at the end of a character alternative. For +Although a @samp{-} can appear at the beginning of a bracket +expression or as the upper bound of a range, it is better style to +put @samp{-} by itself at the end of a bracket expression. For example, although @samp{[-a-z]} is valid, @samp{[a-z-]} is better style; and although @samp{[*--]} is valid, @samp{[*+,-]} is clearer. 
@end enumerate @item @samp{[^ @dots{} ]} @cindex @samp{^} in regexp -@samp{[^} begins a @dfn{complemented character alternative}. This -matches any character except the ones specified. Thus, -@samp{[^a-z0-9A-Z]} matches all characters @emph{except} ASCII letters and -digits. +@samp{[^} begins a @dfn{complemented bracket expression}, or +@dfn{complemented character alternative}. This matches any character +except the ones specified. Thus, @samp{[^a-z0-9A-Z]} matches all +characters @emph{except} ASCII letters and digits. -@samp{^} is not special in a character alternative unless it is the first +@samp{^} is not special in a bracket expression unless it is the first character. The character following the @samp{^} is treated as if it were first (in other words, @samp{-} and @samp{]} are not special there). -A complemented character alternative can match a newline, unless newline is +A complemented bracket expression can match a newline, unless newline is mentioned as one of the characters not to match. This is in contrast to the handling of regexps in programs such as @code{grep}. -You can specify named character classes, just like in character -alternatives. For instance, @samp{[^[:ascii:]]} matches any +You can specify named character classes, just like in bracket +expressions. For instance, @samp{[^[:ascii:]]} matches any non-@acronym{ASCII} character. @xref{Char Classes}. @item @samp{^} @@ -505,9 +508,10 @@ beginning of a line. When matching a string instead of a buffer, @samp{^} matches at the beginning of the string or after a newline character. -For historical compatibility reasons, @samp{^} can be used only at the -beginning of the regular expression, or after @samp{\(}, @samp{\(?:} -or @samp{\|}. +For historical compatibility, @samp{^} is special only at the beginning +of the regular expression, or after @samp{\(}, @samp{\(?:} or @samp{\|}. +Although @samp{^} is an ordinary character in other contexts, +it is good practice to use @samp{\^} even then. 
@item @samp{$} @cindex @samp{$} in regexp @@ -519,8 +523,10 @@ matches a string of one @samp{x} or more at the end of a line. When matching a string instead of a buffer, @samp{$} matches at the end of the string or before a newline character. -For historical compatibility reasons, @samp{$} can be used only at the +For historical compatibility, @samp{$} is special only at the end of the regular expression, or before @samp{\)} or @samp{\|}. +Although @samp{$} is an ordinary character in other contexts, +it is good practice to use @samp{\$} even then. @item @samp{\} @cindex @samp{\} in regexp @@ -540,14 +546,15 @@ example, the regular expression that matches the @samp{\} character is @samp{\} is @code{"\\\\"}. @end table -@strong{Please note:} For historical compatibility, special characters -are treated as ordinary ones if they are in contexts where their special -meanings make no sense. For example, @samp{*foo} treats @samp{*} as -ordinary since there is no preceding expression on which the @samp{*} -can act. It is poor practice to depend on this behavior; quote the -special character anyway, regardless of where it appears. +For historical compatibility, a repetition operator is treated as ordinary +if it appears at the start of a regular expression +or after @samp{^}, @samp{\`}, @samp{\(}, @samp{\(?:} or @samp{\|}. +For example, @samp{*foo} is treated as @samp{\*foo}, and +@samp{two\|^\@{2\@}} is treated as @samp{two\|^@{2@}}. +It is poor practice to depend on this behavior; use proper backslash +escaping anyway, regardless of where the repetition operator appears. -As a @samp{\} is not special inside a character alternative, it can +As a @samp{\} is not special inside a bracket expression, it can never remove the special meaning of @samp{-}, @samp{^} or @samp{]}. You should not quote these characters when they have no special meaning. 
This would not clarify anything, since backslashes @@ -556,23 +563,23 @@ special meaning, as in @samp{[^\]} (@code{"[^\\]"} for Lisp string syntax), which matches any single character except a backslash. In practice, most @samp{]} that occur in regular expressions close a -character alternative and hence are special. However, occasionally a +bracket expression and hence are special. However, occasionally a regular expression may try to match a complex pattern of literal @samp{[} and @samp{]}. In such situations, it sometimes may be necessary to carefully parse the regexp from the start to determine -which square brackets enclose a character alternative. For example, -@samp{[^][]]} consists of the complemented character alternative +which square brackets enclose a bracket expression. For example, +@samp{[^][]]} consists of the complemented bracket expression @samp{[^][]} (which matches any single character that is not a square bracket), followed by a literal @samp{]}. The exact rules are that at the beginning of a regexp, @samp{[} is special and @samp{]} not. This lasts until the first unquoted -@samp{[}, after which we are in a character alternative; @samp{[} is +@samp{[}, after which we are in a bracket expression; @samp{[} is no longer special (except when it starts a character class) but @samp{]} is special, unless it immediately follows the special @samp{[} or that @samp{[} followed by a @samp{^}. This lasts until the next special -@samp{]} that does not end a character class. This ends the character -alternative and restores the ordinary syntax of regular expressions; +@samp{]} that does not end a character class. This ends the bracket +expression and restores the ordinary syntax of regular expressions; an unquoted @samp{[} is special again and a @samp{]} not. @node Char Classes @@ -583,13 +590,13 @@ an unquoted @samp{[} is special again and a @samp{]} not. 
@cindex alpha character class, regexp @cindex xdigit character class, regexp - Below is a table of the classes you can use in a character -alternative, and what they mean. Note that the @samp{[} and @samp{]} -characters that enclose the class name are part of the name, so a -regular expression using these classes needs one more pair of -brackets. For example, a regular expression matching a sequence of -one or more letters and digits would be @samp{[[:alnum:]]+}, not -@samp{[:alnum:]+}. + Below is a table of the classes you can use in a bracket expression +(@pxref{Regexp Special, bracket expression}), and what they mean. +Note that the @samp{[} and @samp{]} characters that enclose the class +name are part of the name, so a regular expression using these classes +needs one more pair of brackets. For example, a regular expression +matching a sequence of one or more letters and digits would be +@samp{[[:alnum:]]+}, not @samp{[:alnum:]+}. @table @samp @item [:ascii:] @@ -662,6 +669,10 @@ This matches the hexadecimal digits: @samp{0} through @samp{9}, @samp{a} through @samp{f} and @samp{A} through @samp{F}. @end table +The classes @samp{[:space:]}, @samp{[:word:]} and @samp{[:punct:]} use +the syntax-table of the current buffer but not any overriding syntax +text properties (@pxref{Syntax Properties}). + @node Regexp Backslash @subsubsection Backslash Constructs in Regular Expressions @cindex backslash in regular expressions @@ -911,7 +922,7 @@ with a symbol-constituent character. @kindex invalid-regexp Not every string is a valid regular expression. For example, a string -that ends inside a character alternative without a terminating @samp{]} +that ends inside a bracket expression without a terminating @samp{]} is invalid, and so is a string that ends with a single @samp{\}. If an invalid regular expression is passed to any of the search functions, an @code{invalid-regexp} error is signaled. @@ -948,7 +959,7 @@ deciphered as follows: @table @code @item [.?!] 
-The first part of the pattern is a character alternative that matches +The first part of the pattern is a bracket expression that matches any one of three characters: period, question mark, and exclamation mark. The match must begin with one of these three characters. (This is one point where the new default regexp used by Emacs differs from @@ -960,7 +971,7 @@ The second part of the pattern matches any closing braces and quotation marks, zero or more of them, that may follow the period, question mark or exclamation mark. The @code{\"} is Lisp syntax for a double-quote in a string. The @samp{*} at the end indicates that the immediately -preceding regular expression (a character alternative, in this case) may be +preceding regular expression (a bracket expression, in this case) may be repeated zero or more times. @item \\($\\|@ $\\|\t\\|@ @ \\) @@ -1334,6 +1345,9 @@ Match any @acronym{ASCII} character (codes 0--127). Match any non-@acronym{ASCII} character (but not raw bytes). @end table +The classes @code{space}, @code{word} and @code{punct} use the +syntax-table of the current buffer but not any overriding syntax text +properties (@pxref{Syntax Properties}).@* Corresponding string regexp: @samp{[[:@var{class}:]]} @item @code{(syntax @var{syntax})} @@ -1911,9 +1925,10 @@ attempts. Other zero-width assertions may also bring benefits by causing a match to fail early. @item -Avoid or-patterns in favor of character alternatives: write +Avoid or-patterns in favor of bracket expressions: write @samp{[ab]} instead of @samp{a\|b}. Recall that @samp{\s-} and @samp{\sw} -are equivalent to @samp{[[:space:]]} and @samp{[[:word:]]}, respectively. +are equivalent to @samp{[[:space:]]} and @samp{[[:word:]]}, respectively, +most of the time. 
@item Since the last branch of an or-pattern does not add a backtrack point @@ -1957,6 +1972,17 @@ advice, don't be afraid of performing the matching in multiple function calls, each using a simpler regexp where backtracking can more easily be contained. +@defun re--describe-compiled regexp &optional raw +To help diagnose problems in your regexps or in the regexp engine +itself, this function returns a string describing the compiled +form of @var{regexp}. To make sense of it, it can be necessary +to read at least the description of the @code{re_opcode_t} type in the +@code{src/regex-emacs.c} file in Emacs' source code. + +It is currently able to give a meaningful description only if Emacs +was compiled with @code{--enable-checking}. +@end defun + @node Regexp Search @section Regular Expression Searching @cindex regular expression searching @@ -2193,8 +2219,8 @@ constructs, you should bind it temporarily for as small as possible a part of the code. @end defvar -@node POSIX Regexps -@section POSIX Regular Expression Searching +@node Longest Match +@section Longest-match searching for regular expression matches @cindex backtracking and POSIX regular expressions The usual regular expression functions do backtracking when necessary @@ -2209,7 +2235,9 @@ possibilities and found all matches, so they can report the longest match, as required by POSIX@. This is much slower, so use these functions only when you really need the longest match. - The POSIX search and match functions do not properly support the + Despite their names, the POSIX search and match functions +use Emacs regular expressions, not POSIX regular expressions. +@xref{POSIX Regexps}. Also, they do not properly support the non-greedy repetition operators (@pxref{Regexp Special, non-greedy}). This is because POSIX backtracking conflicts with the semantics of non-greedy repetition. 
@@ -2957,3 +2985,98 @@ values of the variables @code{sentence-end-double-space}
 @code{sentence-end-without-period}, and
 @code{sentence-end-without-space}.
 @end defun
+
+@node POSIX Regexps
+@section Emacs versus POSIX Regular Expressions
+@cindex POSIX regular expressions
+
+Regular expression syntax varies significantly among computer programs.
+When writing Elisp code that generates regular expressions for use by other
+programs, it is helpful to know how syntax variants differ.
+To give a feel for the variation, this section discusses how
+Emacs regular expressions differ from two syntax variants standarded by POSIX:
+basic regular expressions (BREs) and extended regular expressions (EREs).
+Plain @command{grep} uses BREs, and @samp{grep -E} uses EREs.
+
+Emacs regular expressions have a syntax closer to EREs than to BREs,
+with some extensions. Here is a summary of how POSIX BREs and EREs
+differ from Emacs regular expressions.
+
+@itemize @bullet
+@item
+In POSIX BREs @samp{+} and @samp{?} are not special.
+The only backslash escape sequences are @samp{\(@dots{}\)},
+@samp{\@{@dots{}\@}}, @samp{\1} through @samp{\9}, along with the
+escaped special characters @samp{\$}, @samp{\*}, @samp{\.}, @samp{\[},
+@samp{\\}, and @samp{\^}.
+Therefore @samp{\(?:} acts like @samp{\([?]:}.
+POSIX does not define how other BRE escapes behave;
+for example, GNU @command{grep} treats @samp{\|} like Emacs does,
+but does not support all the Emacs escapes.
+
+@item
+In POSIX BREs, it is an implementation option whether @samp{^} is special
+after @samp{\(}; GNU @command{grep} treats it like Emacs does.
+In POSIX EREs, @samp{^} is always special outside of bracket expressions,
+which means the ERE @samp{x^} never matches.
+In Emacs regular expressions, @samp{^} is special only at the
+beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
+or @samp{\|}.
+
+@item
+In POSIX BREs, it is an implementation option whether @samp{$} is
+special before @samp{\)}; GNU @command{grep} treats it like Emacs
+does. In POSIX EREs, @samp{$} is always special outside of bracket
+expressions (@pxref{Regexp Special, bracket expressions}), which means
+the ERE @samp{$x} never matches. In Emacs regular expressions,
+@samp{$} is special only at the end of the regular expression, or
+before @samp{\)} or @samp{\|}.
+
+@item
+In POSIX EREs @samp{@{}, @samp{(} and @samp{|} are special,
+and @samp{)} is special when matched with a preceding @samp{(}.
+These special characters do not use preceding backslashes;
+@samp{(?} produces undefined results.
+The only backslash escape sequences are the escaped special characters
+@samp{\$}, @samp{\(}, @samp{\)}, @samp{\*}, @samp{\+}, @samp{\.},
+@samp{\?}, @samp{\[}, @samp{\\}, @samp{\^}, @samp{\@{} and @samp{\|}.
+POSIX does not define how other ERE escapes behave;
+for example, GNU @samp{grep -E} treats @samp{\1} like Emacs does,
+but does not support all the Emacs escapes.
+
+@item
+In POSIX BREs and EREs, undefined results are produced by repetition
+operators at the start of a regular expression or subexpression
+(possibly preceded by @samp{^}), except that the repetition operator
+@samp{*} has the same behavior in BREs as in Emacs.
+In Emacs, these operators are treated as ordinary.
+
+@item
+In BREs and EREs, undefined results are produced by two repetition
+operators in sequence. In Emacs, these have well-defined behavior,
+e.g., @samp{a**} is equivalent to @samp{a*}.
+
+@item
+In BREs and EREs, undefined results are produced by empty regular
+expressions or subexpressions. In Emacs these have well-defined
+behavior, e.g., @samp{\(\)*} matches the empty string,
+
+@item
+In BREs and EREs, undefined results are produced for the named
+character classes @samp{[:ascii:]}, @samp{[:multibyte:]},
+@samp{[:nonascii:]}, @samp{[:unibyte:]}, and @samp{[:word:]}.
+
+@item
+BREs and EREs can contain collating symbols and equivalence
+class expressions within bracket expressions, e.g., @samp{[[.ch.]d[=a=]]}.
+Emacs regular expressions do not support this.
+
+@item
+BREs, EREs, and the strings they match cannot contain encoding errors
+or NUL bytes. In Emacs these constructs simply match themselves.
+
+@item
+BRE and ERE searching always finds the longest match.
+Emacs searching by default does not necessarily do so.
+@xref{Longest Match}.
+@end itemize
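
The new POSIX Regexps node states that plain grep uses BREs, where `+` and `(` are ordinary characters, while `grep -E` uses EREs, where both are special. A minimal sketch of that difference follows, assuming a POSIX-conforming `grep` is on `PATH`. The helper names `bre_count` and `ere_count` are hypothetical and are not part of this change; they only print how many input lines match.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch): print the number of
# lines of TEXT ($2) that match PATTERN ($1).
# "|| true" is needed because grep exits with status 1 on no match.
bre_count() { printf '%s\n' "$2" | grep -c -e "$1" || true; }     # plain grep: BRE
ere_count() { printf '%s\n' "$2" | grep -E -c -e "$1" || true; }  # grep -E: ERE

# "+" is an ordinary character in a BRE but a repetition operator in an ERE.
bre_plus_literal=$(bre_count 'a+b' 'a+b')   # BRE matches the literal text "a+b"
bre_plus_repeat=$(bre_count 'a+b' 'aab')    # BRE does not treat "+" as "one or more"
ere_plus_literal=$(ere_count 'a+b' 'a+b')   # ERE needs "b" right after the run of "a"
ere_plus_repeat=$(ere_count 'a+b' 'aab')    # ERE: one or more "a", then "b"

# "(" is ordinary in a BRE (grouping is "\(") but special in an ERE.
bre_paren=$(bre_count '(a)' 'a')            # BRE looks for literal parentheses
ere_paren=$(ere_count '(a)' 'a')            # ERE treats "(a)" as a group

printf 'BRE a+b: literal=%s repeat=%s\n' "$bre_plus_literal" "$bre_plus_repeat"
printf 'ERE a+b: literal=%s repeat=%s\n' "$ere_plus_literal" "$ere_plus_repeat"
printf '(a) on "a": BRE=%s ERE=%s\n' "$bre_paren" "$ere_paren"
```

Elisp code that generates a pattern for one of these tools would need to escape, or leave unescaped, `+` and `(` according to the target syntax. The sketch only exercises behavior that POSIX defines and avoids the cases the node lists as undefined.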