Regex: Update PCRE to v8.35.

I was über lazy at first, so I took the libs from SM. But it's actually quite easy to compile, so let's update to the latest version \o/.
2014-07-05 13:53:30 +02:00
parent d1153b8049
commit d4de0e6f1e
241 changed files with 51074 additions and 15011 deletions
--- a/tools/pcre/doc/pcretest.1
+++ b/tools/pcre/doc/pcretest.1
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "10 September 2012" "PCRE 8.32"
+.TH PCRETEST 1 "09 February 2014" "PCRE 8.35"
 .SH NAME
 pcretest - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -40,23 +40,34 @@ PCRE, and are unlikely to be of use otherwise. They are all documented here,
 but without much justification.
 .
 .
+.SH "INPUT DATA FORMAT"
+.rs
+.sp
+Input to \fBpcretest\fP is processed line by line, either by calling the C
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
+below). In Unix-like environments, \fBfgets()\fP treats any bytes other than
+newline as data characters. However, in some Windows environments character 26
+(hex 1A) causes an immediate end of file, and no further data is read. For
+maximum portability, therefore, it is safest to use only ASCII characters in
+\fBpcretest\fP input files.
+.
+.
 .SH "PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
 .rs
 .sp
 From release 8.30, two separate PCRE libraries can be built. The original one
 supports 8-bit character strings, whereas the newer 16-bit library supports
-character strings encoded in 16-bit units. From release 8.32, a third
-library can be built, supporting character strings encoded in 32-bit units.
-The \fBpcretest\fP program can be
-used to test all three libraries. However, it is itself still an 8-bit program,
-reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit
-library, the patterns and data strings are converted to 16- or 32-bit format
-before being passed to the PCRE library functions. Results are converted to
-8-bit for output.
+character strings encoded in 16-bit units. From release 8.32, a third library
+can be built, supporting character strings encoded in 32-bit units. The
+\fBpcretest\fP program can be used to test all three libraries. However, it is
+itself still an 8-bit program, reading 8-bit input and writing 8-bit output.
+When testing the 16-bit or 32-bit library, the patterns and data strings are
+converted to 16- or 32-bit format before being passed to the PCRE library
+functions. Results are converted to 8-bit for output.
 .P
 References to functions and structures of the form \fBpcre[16|32]_xx\fP below
-mean "\fBpcre_xx\fP when using the 8-bit library or \fBpcre16_xx\fP when using
-the 16-bit library".
+mean "\fBpcre_xx\fP when using the 8-bit library, \fBpcre16_xx\fP when using
+the 16-bit library, or \fBpcre32_xx\fP when using the 32-bit library".
 .
 .
 .SH "COMMAND LINE OPTIONS"
@@ -85,22 +96,29 @@ internal form is output after compilation.
 .TP 10
 \fB-C\fP
 Output the version number of the PCRE library, and all available information
-about the optional features that are included, and then exit. All other options
-are ignored.
+about the optional features that are included, and then exit with zero exit
+code. All other options are ignored.
 .TP 10
 \fB-C\fP \fIoption\fP
 Output information about a specific build-time option, then exit. This
 functionality is intended for use in scripts such as \fBRunTest\fP. The
-following options output the value indicated:
+following options output the value and set the exit code as indicated:
 .sp
  ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
               0x15 or 0x25
               0 if used in an ASCII environment
-  linksize   the internal link size (2, 3, or 4)
+               exit code is always 0
+  linksize   the configured internal link size (2, 3, or 4)
+               exit code is set to the link size
  newline    the default newline setting:
               CR, LF, CRLF, ANYCRLF, or ANY
+               exit code is always 0
+  bsr        the default setting for what \eR matches:
+               ANYCRLF or ANY
+               exit code is always 0
 .sp
-The following options output 1 for true or zero for false:
+The following options output 1 for true or 0 for false, and set the exit code
+to the same value:
 .sp
  ebcdic     compiled for an EBCDIC environment
  jit        just-in-time support is available
@@ -108,7 +126,10 @@ The following options output 1 for true or zero for false:
  pcre32     the 32-bit library was built
  pcre8      the 8-bit library was built
  ucp        Unicode property support is available
-  utf        UTF-8 and/or UTF-16 and/or UTF-32 support is available
+  utf        UTF-8 and/or UTF-16 and/or UTF-32 support
+               is available
+.sp
+If an unknown option is given, an error message is output; the exit code is 0.
 .TP 10
 \fB-d\fP
 Behave as if each pattern has the \fB/D\fP (debug) modifier; the internal
@@ -137,6 +158,10 @@ Output the size of each compiled pattern after it has been compiled. This is
 equivalent to adding \fB/M\fP to each regular expression. The size is given in
 bytes for both libraries.
 .TP 10
+\fB-O\fP
+Behave as if each pattern has the \fB/O\fP modifier, that is, disable
+auto-possessification for all patterns.
+.TP 10
 \fB-o\fP \fIosize\fP
 Set the number of elements in the output vector that is used when calling
 \fBpcre[16|32]_exec()\fP or \fBpcre[16|32]_dfa_exec()\fP to be \fIosize\fP. The
@@ -198,17 +223,21 @@ contains (*MARK) items there may also be differences, for the same reason. The
 should never be studied (see the \fB/S\fP pattern modifier below).
 .TP 10
 \fB-t\fP
-Run each compile, study, and match many times with a timer, and output
-resulting time per compile or match (in milliseconds). Do not set \fB-m\fP with
-\fB-t\fP, because you will then get the size output a zillion times, and the
-timing will be distorted. You can control the number of iterations that are
-used for timing by following \fB-t\fP with a number (as a separate item on the
-command line). For example, "-t 1000" would iterate 1000 times. The default is
-to iterate 500000 times.
+Run each compile, study, and match many times with a timer, and output the
+resulting times per compile, study, or match (in milliseconds). Do not set
+\fB-m\fP with \fB-t\fP, because you will then get the size output a zillion
+times, and the timing will be distorted. You can control the number of
+iterations that are used for timing by following \fB-t\fP with a number (as a
+separate item on the command line). For example, "-t 1000" iterates 1000 times.
+The default is to iterate 500000 times.
 .TP 10
 \fB-tm\fP
 This is like \fB-t\fP except that it times only the matching phase, not the
 compile or study phases.
+.TP 10
+\fB-T\fP \fB-TM\fP
+These behave like \fB-t\fP and \fB-tm\fP, but in addition, at the end of a run,
+the total times for all compiles, studies, and matches are output.
 .
 .
 .SH DESCRIPTION
@@ -228,7 +257,7 @@ option states whether or not \fBreadline()\fP will be used.
 .P
 The program handles any number of sets of input on a single input file. Each
 set starts with a regular expression, and continues with any number of data
-lines to be matched against the pattern.
+lines to be matched against that pattern.
 .P
 Each data line is matched separately and independently. If you want to do
 multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
@@ -280,6 +309,7 @@ fall into several groups that are described in detail in the following
 sections.
 .sp
  \fB/8\fP              set UTF mode
+  \fB/9\fP              set PCRE_NEVER_UTF (locks out UTF mode)
  \fB/?\fP              disable UTF validity check
  \fB/+\fP              show remainder of subject after match
  \fB/=\fP              show all captures (not just those that are set)
@@ -301,7 +331,9 @@ sections.
  \fB/M\fP              show compiled memory size
  \fB/m\fP              set PCRE_MULTILINE
  \fB/N\fP              set PCRE_NO_AUTO_CAPTURE
+  \fB/O\fP              set PCRE_NO_AUTO_POSSESS
  \fB/P\fP              use the POSIX wrapper
+  \fB/Q\fP              test external stack check function
  \fB/S\fP              study the pattern after compilation
  \fB/s\fP              set PCRE_DOTALL
  \fB/T\fP              select character tables
@@ -350,12 +382,14 @@ options that do not correspond to anything in Perl:
  \fB/8\fP              PCRE_UTF32          ) when using the 32-bit
  \fB/?\fP              PCRE_NO_UTF32_CHECK )   library
 .sp
+  \fB/9\fP              PCRE_NEVER_UTF
  \fB/A\fP              PCRE_ANCHORED
  \fB/C\fP              PCRE_AUTO_CALLOUT
  \fB/E\fP              PCRE_DOLLAR_ENDONLY
  \fB/f\fP              PCRE_FIRSTLINE
  \fB/J\fP              PCRE_DUPNAMES
  \fB/N\fP              PCRE_NO_AUTO_CAPTURE
+  \fB/O\fP              PCRE_NO_AUTO_POSSESS
  \fB/U\fP              PCRE_UNGREEDY
  \fB/W\fP              PCRE_UCP
  \fB/X\fP              PCRE_EXTRA
@@ -453,7 +487,10 @@ below.
 The \fB/I\fP modifier requests that \fBpcretest\fP output information about the
 compiled pattern (whether it is anchored, has a fixed first character, and
 so on). It does this by calling \fBpcre[16|32]_fullinfo()\fP after compiling a
-pattern. If the pattern is studied, the results of that are also output.
+pattern. If the pattern is studied, the results of that are also output. In
+this output, the word "char" means a non-UTF character, that is, the value of a
+single data item (8-bit, 16-bit, or 32-bit, depending on the library that is
+being tested).
 .P
 The \fB/K\fP modifier requests \fBpcretest\fP to show names from backtracking
 control verbs that are returned from calls to \fBpcre[16|32]_exec()\fP. It causes
@@ -483,13 +520,22 @@ the compiled pattern to be output. This does not include the size of the
 successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
 JIT compiled code is also output.
 .P
+The \fB/Q\fP modifier is used to test the use of \fBpcre_stack_guard\fP. It
+must be followed by '0' or '1', specifying the return code to be given from an
+external function that is passed to PCRE and used for stack checking during
+compilation (see the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation for details).
+.P
 The \fB/S\fP modifier causes \fBpcre[16|32]_study()\fP to be called after the
 expression has been compiled, and the results used when the expression is
 matched. There are a number of qualifying characters that may follow \fB/S\fP.
 They may appear in any order.
 .P
-If \fBS\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is called
-with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
+If \fB/S\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is
+called with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
 \fBpcre_extra\fP block, even when studying discovers no useful information.
 .P
 If \fB/S\fP is followed by a second S character, it suppresses studying, even
@@ -565,6 +611,37 @@ The \fB/+\fP modifier works as described above. All other modifiers are
 ignored.
 .
 .
+.SS "Locking out certain modifiers"
+.rs
+.sp
+PCRE can be compiled with or without support for certain features such as
+UTF-8/16/32 or Unicode properties. Accordingly, the standard tests are split up
+into a number of different files that are selected for running depending on
+which features are available. When updating the tests, it is all too easy to
+put a new test into the wrong file by mistake; for example, to put a test that
+requires UTF support into a file that is used when it is not available. To help
+detect such mistakes as early as possible, there is a facility for locking out
+specific modifiers. If an input line for \fBpcretest\fP starts with the string
+"< forbid " the following sequence of characters is taken as a list of
+forbidden modifiers. For example, in the test files that must not use UTF or
+Unicode property support, this line appears:
+.sp
+  < forbid 8W
+.sp
+This locks out the /8 and /W modifiers. An immediate error is given if they are
+subsequently encountered. If the character string contains < but not >, all the
+multi-character modifiers that begin with < are locked out. Otherwise, such
+modifiers must be explicitly listed, for example:
+.sp
+  < forbid <JS><cr>
+.sp
+There must be a single space between < and "forbid" for this feature to be
+recognised. If there is not, the line is interpreted either as a request to
+re-load a pre-compiled pattern (see "SAVING AND RELOADING COMPILED PATTERNS"
+below) or, if there is another < character, as a pattern that uses < as its
+delimiter.
+.
+.
 .SH "DATA LINES"
 .rs
 .sp
@@ -588,6 +665,7 @@ recognized:
  \ev         vertical tab (\ex0b)
  \ennn       octal character (up to 3 octal digits); always
               a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
+  \eo{dd...}  octal character (any number of octal digits)
  \exhh       hexadecimal byte (up to 2 hex digits)
  \ex{hh...}  hexadecimal character (any number of hex digits)
 .\" JOIN
@@ -1011,10 +1089,9 @@ exact copy of the compiled pattern. If there is additional study data, this
 writing the file, \fBpcretest\fP expects to read a new pattern.
 .P
 A saved pattern can be reloaded into \fBpcretest\fP by specifying < and a file
-name instead of a pattern. The name of the file must not contain a < character,
-as otherwise \fBpcretest\fP will interpret the line as a pattern delimited by <
-characters.
-For example:
+name instead of a pattern. There must be no space between < and the file name,
+which must not contain a < character, as otherwise \fBpcretest\fP will
+interpret the line as a pattern delimited by < characters. For example:
 .sp
   re> </some/file
  Compiled pattern loaded from /some/file
@@ -1074,6 +1151,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 10 September 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 09 February 2014
+Copyright (c) 1997-2014 University of Cambridge.
 .fi
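
The exit-code rules that the updated man page gives for "pcretest -C option" can be summarised in a small model. This is only a sketch of the documented behaviour, not pcretest's actual code: the function name, the `build` dictionary, and the wording of the error message are hypothetical, and the sketch assumes the build settings are already known rather than querying a real binary.

```python
# Model of the documented "pcretest -C <option>" behaviour.
# Returns (text that would be output, exit code).
# `build` is a hypothetical dict standing in for the build-time settings.

# Options that output a value and always exit with 0.
VALUE_OPTIONS_ZERO_EXIT = ("ebcdic-nl", "newline", "bsr")

# Options that output 1 or 0 and use the same value as the exit code.
BOOLEAN_OPTIONS = ("ebcdic", "jit", "pcre16", "pcre32", "pcre8", "ucp", "utf")


def describe_c_option(option, build):
    if option in BOOLEAN_OPTIONS:
        value = 1 if build.get(option) else 0
        return str(value), value
    if option == "linksize":
        # The exit code is set to the configured link size (2, 3, or 4).
        size = build["linksize"]
        return str(size), size
    if option in VALUE_OPTIONS_ZERO_EXIT:
        return str(build[option]), 0
    # Unknown option: an error message is output, the exit code is still 0.
    # The message text here is made up for illustration.
    return "unknown option: " + option, 0


if __name__ == "__main__":
    build = {"linksize": 2, "newline": "LF", "bsr": "ANY", "jit": True}
    for opt in ("linksize", "newline", "jit", "utf", "bogus"):
        print(opt, describe_c_option(opt, build))
```

One consequence worth noting for scripts such as RunTest: a zero exit code from a boolean option means the feature is absent, not that the command succeeded in the usual shell sense, and an unknown option is indistinguishable from "false" by exit code alone.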