Regex: Update PCRE to v8.35.

I was über lazy at first, so took libs from SM.
But actually it's quite easy to compile, so let's update to latest version \o/.
Arkshine
2014-07-05 13:53:30 +02:00
parent d1153b8049
commit d4de0e6f1e
241 changed files with 51074 additions and 15011 deletions

@@ -1,4 +1,4 @@
.TH PCRETEST 1 "10 September 2012" "PCRE 8.32"
.TH PCRETEST 1 "09 February 2014" "PCRE 8.35"
.SH NAME
pcretest - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -40,23 +40,34 @@ PCRE, and are unlikely to be of use otherwise. They are all documented here,
but without much justification.
.
.
.SH "INPUT DATA FORMAT"
.rs
.sp
Input to \fBpcretest\fP is processed line by line, either by calling the C
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
below). In Unix-like environments, \fBfgets()\fP treats any bytes other than
newline as data characters. However, in some Windows environments character 26
(hex 1A) causes an immediate end of file, and no further data is read. For
maximum portability, therefore, it is safest to use only ASCII characters in
\fBpcretest\fP input files.
.
.
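The portability advice above can be checked mechanically. The following is a minimal sketch, not part of PCRE: the helper name `find_unsafe_bytes` is hypothetical, and it assumes that "unsafe" means either a non-ASCII byte or character 26 (hex 1A), the one value the text singles out.

```python
def find_unsafe_bytes(data):
    """Return (offset, byte) pairs that may hurt portability of pcretest input.

    Assumption: a byte is flagged if it is non-ASCII (> 0x7F) or is
    character 26 (0x1A), which can end input early on some Windows systems.
    """
    return [(i, b) for i, b in enumerate(data) if b > 0x7F or b == 0x1A]


if __name__ == "__main__":
    sample = b"/abc/\n    abc\x1a\n"
    for offset, byte in find_unsafe_bytes(sample):
        print(f"offset {offset}: byte 0x{byte:02X}")
```

Reading a test file in binary mode and passing its contents to the helper would list every offset worth reviewing before the file is used on another platform.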
.SH "PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
.rs
.sp
From release 8.30, two separate PCRE libraries can be built. The original one
supports 8-bit character strings, whereas the newer 16-bit library supports
character strings encoded in 16-bit units. From release 8.32, a third
library can be built, supporting character strings encoded in 32-bit units.
The \fBpcretest\fP program can be
used to test all three libraries. However, it is itself still an 8-bit program,
reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit
library, the patterns and data strings are converted to 16- or 32-bit format
before being passed to the PCRE library functions. Results are converted to
8-bit for output.
character strings encoded in 16-bit units. From release 8.32, a third library
can be built, supporting character strings encoded in 32-bit units. The
\fBpcretest\fP program can be used to test all three libraries. However, it is
itself still an 8-bit program, reading 8-bit input and writing 8-bit output.
When testing the 16-bit or 32-bit library, the patterns and data strings are
converted to 16- or 32-bit format before being passed to the PCRE library
functions. Results are converted to 8-bit for output.
.P
References to functions and structures of the form \fBpcre[16|32]_xx\fP below
mean "\fBpcre_xx\fP when using the 8-bit library or \fBpcre16_xx\fP when using
the 16-bit library".
mean "\fBpcre_xx\fP when using the 8-bit library, \fBpcre16_xx\fP when using
the 16-bit library, or \fBpcre32_xx\fP when using the 32-bit library".
.
.
.SH "COMMAND LINE OPTIONS"
@@ -85,22 +96,29 @@ internal form is output after compilation.
.TP 10
\fB-C\fP
Output the version number of the PCRE library, and all available information
about the optional features that are included, and then exit. All other options
are ignored.
about the optional features that are included, and then exit with zero exit
code. All other options are ignored.
.TP 10
\fB-C\fP \fIoption\fP
Output information about a specific build-time option, then exit. This
functionality is intended for use in scripts such as \fBRunTest\fP. The
following options output the value indicated:
following options output the value and set the exit code as indicated:
.sp
ebcdic-nl the code for LF (= NL) in an EBCDIC environment:
0x15 or 0x25
0 if used in an ASCII environment
linksize the internal link size (2, 3, or 4)
exit code is always 0
linksize the configured internal link size (2, 3, or 4)
exit code is set to the link size
newline the default newline setting:
CR, LF, CRLF, ANYCRLF, or ANY
exit code is always 0
bsr the default setting for what \eR matches:
ANYCRLF or ANY
exit code is always 0
.sp
The following options output 1 for true or zero for false:
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
.sp
ebcdic compiled for an EBCDIC environment
jit just-in-time support is available
@@ -108,7 +126,10 @@ The following options output 1 for true or zero for false:
pcre32 the 32-bit library was built
pcre8 the 8-bit library was built
ucp Unicode property support is available
utf UTF-8 and/or UTF-16 and/or UTF-32 support is available
utf UTF-8 and/or UTF-16 and/or UTF-32 support
is available
.sp
If an unknown option is given, an error message is output; the exit code is 0.
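Since the exit code now mirrors the printed value, a script can test for a feature without parsing output. The following is a rough sketch under stated assumptions: the helper names `has_feature` and `link_size` are hypothetical, a `pcretest` binary is assumed to be on the PATH, and the `run` parameter exists only so the call can be replaced in tests.

```python
import subprocess


def has_feature(option, pcretest="pcretest", run=subprocess.run):
    """Return True if 'pcretest -C <option>' exits with code 1.

    Per the text above, boolean options set the exit code to 1 for true
    and 0 for false; an unknown option also exits with 0, so a misspelled
    option name is indistinguishable from "not available" here.
    """
    result = run([pcretest, "-C", option], capture_output=True, text=True)
    return result.returncode == 1


def link_size(pcretest="pcretest", run=subprocess.run):
    """Return the configured internal link size (2, 3, or 4).

    Relies on 'pcretest -C linksize' setting its exit code to the link size.
    """
    result = run([pcretest, "-C", "linksize"], capture_output=True, text=True)
    return result.returncode


if __name__ == "__main__":
    # Requires a pcretest binary from PCRE 8.35 or later on the PATH.
    print("UTF support:", has_feature("utf"))
    print("link size:", link_size())
```

Because an unknown option also yields exit code 0, a caller that needs to tell "absent" from "misspelled" would have to inspect the error message as well.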
.TP 10
\fB-d\fP
Behave as if each pattern has the \fB/D\fP (debug) modifier; the internal
@@ -137,6 +158,10 @@ Output the size of each compiled pattern after it has been compiled. This is
equivalent to adding \fB/M\fP to each regular expression. The size is given in
bytes for both libraries.
.TP 10
\fB-O\fP
Behave as if each pattern has the \fB/O\fP modifier, that is, disable
auto-possessification for all patterns.
.TP 10
\fB-o\fP \fIosize\fP
Set the number of elements in the output vector that is used when calling
\fBpcre[16|32]_exec()\fP or \fBpcre[16|32]_dfa_exec()\fP to be \fIosize\fP. The
@@ -198,17 +223,21 @@ contains (*MARK) items there may also be differences, for the same reason. The
should never be studied (see the \fB/S\fP pattern modifier below).
.TP 10
\fB-t\fP
Run each compile, study, and match many times with a timer, and output
resulting time per compile or match (in milliseconds). Do not set \fB-m\fP with
\fB-t\fP, because you will then get the size output a zillion times, and the
timing will be distorted. You can control the number of iterations that are
used for timing by following \fB-t\fP with a number (as a separate item on the
command line). For example, "-t 1000" would iterate 1000 times. The default is
to iterate 500000 times.
Run each compile, study, and match many times with a timer, and output the
resulting times per compile, study, or match (in milliseconds). Do not set
\fB-m\fP with \fB-t\fP, because you will then get the size output a zillion
times, and the timing will be distorted. You can control the number of
iterations that are used for timing by following \fB-t\fP with a number (as a
separate item on the command line). For example, "-t 1000" iterates 1000 times.
The default is to iterate 500000 times.
.TP 10
\fB-tm\fP
This is like \fB-t\fP except that it times only the matching phase, not the
compile or study phases.
.TP 10
\fB-T\fP \fB-TM\fP
These behave like \fB-t\fP and \fB-tm\fP, but in addition, at the end of a run,
the total times for all compiles, studies, and matches are output.
.
.
.SH DESCRIPTION
@@ -228,7 +257,7 @@ option states whether or not \fBreadline()\fP will be used.
.P
The program handles any number of sets of input on a single input file. Each
set starts with a regular expression, and continues with any number of data
lines to be matched against the pattern.
lines to be matched against that pattern.
.P
Each data line is matched separately and independently. If you want to do
multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
@@ -280,6 +309,7 @@ fall into several groups that are described in detail in the following
sections.
.sp
\fB/8\fP set UTF mode
\fB/9\fP set PCRE_NEVER_UTF (locks out UTF mode)
\fB/?\fP disable UTF validity check
\fB/+\fP show remainder of subject after match
\fB/=\fP show all captures (not just those that are set)
@@ -301,7 +331,9 @@ sections.
\fB/M\fP show compiled memory size
\fB/m\fP set PCRE_MULTILINE
\fB/N\fP set PCRE_NO_AUTO_CAPTURE
\fB/O\fP set PCRE_NO_AUTO_POSSESS
\fB/P\fP use the POSIX wrapper
\fB/Q\fP test external stack check function
\fB/S\fP study the pattern after compilation
\fB/s\fP set PCRE_DOTALL
\fB/T\fP select character tables
@@ -350,12 +382,14 @@ options that do not correspond to anything in Perl:
\fB/8\fP PCRE_UTF32 ) when using the 32-bit
\fB/?\fP PCRE_NO_UTF32_CHECK ) library
.sp
\fB/9\fP PCRE_NEVER_UTF
\fB/A\fP PCRE_ANCHORED
\fB/C\fP PCRE_AUTO_CALLOUT
\fB/E\fP PCRE_DOLLAR_ENDONLY
\fB/f\fP PCRE_FIRSTLINE
\fB/J\fP PCRE_DUPNAMES
\fB/N\fP PCRE_NO_AUTO_CAPTURE
\fB/O\fP PCRE_NO_AUTO_POSSESS
\fB/U\fP PCRE_UNGREEDY
\fB/W\fP PCRE_UCP
\fB/X\fP PCRE_EXTRA
@@ -453,7 +487,10 @@ below.
The \fB/I\fP modifier requests that \fBpcretest\fP output information about the
compiled pattern (whether it is anchored, has a fixed first character, and
so on). It does this by calling \fBpcre[16|32]_fullinfo()\fP after compiling a
pattern. If the pattern is studied, the results of that are also output.
pattern. If the pattern is studied, the results of that are also output. In
this output, the word "char" means a non-UTF character, that is, the value of a
single data item (8-bit, 16-bit, or 32-bit, depending on the library that is
being tested).
.P
The \fB/K\fP modifier requests \fBpcretest\fP to show names from backtracking
control verbs that are returned from calls to \fBpcre[16|32]_exec()\fP. It causes
@@ -483,13 +520,22 @@ the compiled pattern to be output. This does not include the size of the
successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
JIT compiled code is also output.
.P
The \fB/Q\fP modifier is used to test the use of \fBpcre_stack_guard\fP. It
must be followed by '0' or '1', specifying the return code to be given from an
external function that is passed to PCRE and used for stack checking during
compilation (see the
.\" HREF
\fBpcreapi\fP
.\"
documentation for details).
.P
The \fB/S\fP modifier causes \fBpcre[16|32]_study()\fP to be called after the
expression has been compiled, and the results used when the expression is
matched. There are a number of qualifying characters that may follow \fB/S\fP.
They may appear in any order.
.P
If \fBS\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is called
with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
If \fB/S\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is
called with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
\fBpcre_extra\fP block, even when studying discovers no useful information.
.P
If \fB/S\fP is followed by a second S character, it suppresses studying, even
@@ -565,6 +611,37 @@ The \fB/+\fP modifier works as described above. All other modifiers are
ignored.
.
.
.SS "Locking out certain modifiers"
.rs
.sp
PCRE can be compiled with or without support for certain features such as
UTF-8/16/32 or Unicode properties. Accordingly, the standard tests are split up
into a number of different files that are selected for running depending on
which features are available. When updating the tests, it is all too easy to
put a new test into the wrong file by mistake; for example, to put a test that
requires UTF support into a file that is used when it is not available. To help
detect such mistakes as early as possible, there is a facility for locking out
specific modifiers. If an input line for \fBpcretest\fP starts with the string
"< forbid " the following sequence of characters is taken as a list of
forbidden modifiers. For example, in the test files that must not use UTF or
Unicode property support, this line appears:
.sp
< forbid 8W
.sp
This locks out the /8 and /W modifiers. An immediate error is given if they are
subsequently encountered. If the character string contains < but not >, all the
multi-character modifiers that begin with < are locked out. Otherwise, such
modifiers must be explicitly listed, for example:
.sp
< forbid <JS><cr>
.sp
There must be a single space between < and "forbid" for this feature to be
recognised. If there is not, the line is interpreted either as a request to
re-load a pre-compiled pattern (see "SAVING AND RELOADING COMPILED PATTERNS"
below) or, if there is another < character, as a pattern that uses < as its
delimiter.
.
.
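The lock-out rules described above can be sketched as a small parser. This is an illustration only, not the code in pcretest: the function name `parse_forbid_line` and its return shape are hypothetical, and treating every character outside a `<...>` group as a single-character modifier is an assumption.

```python
import re

FORBID_PREFIX = "< forbid "  # exactly one space between "<" and "forbid"


def parse_forbid_line(line):
    """Parse a '< forbid ' line into its forbidden modifiers.

    Returns None if the line is not a forbid line, otherwise a tuple
    (singles, multis, lock_all_multi):
      singles         set of forbidden single-character modifiers
      multis          set of explicitly listed '<...>' modifiers
      lock_all_multi  True if the list contains '<' but no '>', which
                      locks out every multi-character modifier
    """
    if not line.startswith(FORBID_PREFIX):
        return None
    spec = line[len(FORBID_PREFIX):].rstrip("\n")
    if "<" in spec and ">" not in spec:
        return set(spec.replace("<", "")), set(), True
    multis = set(re.findall(r"<[^<>]*>", spec))
    singles = set(re.sub(r"<[^<>]*>", "", spec))
    return singles, multis, False


if __name__ == "__main__":
    print(parse_forbid_line("< forbid 8W"))
    print(parse_forbid_line("< forbid <JS><cr>"))
    print(parse_forbid_line("<forbid 8W"))  # no space after "<": not a forbid line
```

A line without the single space after `<` returns None here, mirroring the rule that such a line is interpreted as a reload request or a pattern instead.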
.SH "DATA LINES"
.rs
.sp
@@ -588,6 +665,7 @@ recognized:
\ev vertical tab (\ex0b)
\ennn octal character (up to 3 octal digits); always
a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
  \eo{dd...} octal character (any number of octal digits)
\exhh hexadecimal byte (up to 2 hex digits)
\ex{hh...} hexadecimal character (any number of hex digits)
.\" JOIN
@@ -1011,10 +1089,9 @@ exact copy of the compiled pattern. If there is additional study data, this
writing the file, \fBpcretest\fP expects to read a new pattern.
.P
A saved pattern can be reloaded into \fBpcretest\fP by specifying < and a file
name instead of a pattern. The name of the file must not contain a < character,
as otherwise \fBpcretest\fP will interpret the line as a pattern delimited by <
characters.
For example:
name instead of a pattern. There must be no space between < and the file name,
which must not contain a < character, as otherwise \fBpcretest\fP will
interpret the line as a pattern delimited by < characters. For example:
.sp
re> </some/file
Compiled pattern loaded from /some/file
@@ -1074,6 +1151,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
Last updated: 10 September 2012
Copyright (c) 1997-2012 University of Cambridge.
Last updated: 09 February 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi