Regex: Update PCRE to v8.35.
I was über lazy at first, so took libs from SM. But actually it's quite easy to compile, so let's update to latest version \o/.
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "10 September 2012" "PCRE 8.32"
+.TH PCRETEST 1 "09 February 2014" "PCRE 8.35"
 .SH NAME
 pcretest - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -40,23 +40,34 @@ PCRE, and are unlikely to be of use otherwise. They are all documented here,
 but without much justification.
 .
 .
+.SH "INPUT DATA FORMAT"
+.rs
+.sp
+Input to \fBpcretest\fP is processed line by line, either by calling the C
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
+below). In Unix-like environments, \fBfgets()\fP treats any bytes other than
+newline as data characters. However, in some Windows environments character 26
+(hex 1A) causes an immediate end of file, and no further data is read. For
+maximum portability, therefore, it is safest to use only ASCII characters in
+\fBpcretest\fP input files.
+.
+.
 .SH "PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
 .rs
 .sp
 From release 8.30, two separate PCRE libraries can be built. The original one
 supports 8-bit character strings, whereas the newer 16-bit library supports
-character strings encoded in 16-bit units. From release 8.32, a third
-library can be built, supporting character strings encoded in 32-bit units.
-The \fBpcretest\fP program can be
-used to test all three libraries. However, it is itself still an 8-bit program,
-reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit
-library, the patterns and data strings are converted to 16- or 32-bit format
-before being passed to the PCRE library functions. Results are converted to
-8-bit for output.
+character strings encoded in 16-bit units. From release 8.32, a third library
+can be built, supporting character strings encoded in 32-bit units. The
+\fBpcretest\fP program can be used to test all three libraries. However, it is
+itself still an 8-bit program, reading 8-bit input and writing 8-bit output.
+When testing the 16-bit or 32-bit library, the patterns and data strings are
+converted to 16- or 32-bit format before being passed to the PCRE library
+functions. Results are converted to 8-bit for output.
 .P
 References to functions and structures of the form \fBpcre[16|32]_xx\fP below
-mean "\fBpcre_xx\fP when using the 8-bit library or \fBpcre16_xx\fP when using
-the 16-bit library".
+mean "\fBpcre_xx\fP when using the 8-bit library, \fBpcre16_xx\fP when using
+the 16-bit library, or \fBpcre32_xx\fP when using the 32-bit library".
 .
 .
 .SH "COMMAND LINE OPTIONS"
@@ -85,22 +96,29 @@ internal form is output after compilation.
 .TP 10
 \fB-C\fP
 Output the version number of the PCRE library, and all available information
-about the optional features that are included, and then exit. All other options
-are ignored.
+about the optional features that are included, and then exit with zero exit
+code. All other options are ignored.
 .TP 10
 \fB-C\fP \fIoption\fP
 Output information about a specific build-time option, then exit. This
 functionality is intended for use in scripts such as \fBRunTest\fP. The
-following options output the value indicated:
+following options output the value and set the exit code as indicated:
 .sp
   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                0x15 or 0x25
                0 if used in an ASCII environment
-  linksize   the internal link size (2, 3, or 4)
+               exit code is always 0
+  linksize   the configured internal link size (2, 3, or 4)
+               exit code is set to the link size
   newline    the default newline setting:
                CR, LF, CRLF, ANYCRLF, or ANY
+               exit code is always 0
+  bsr        the default setting for what \eR matches:
+               ANYCRLF or ANY
+               exit code is always 0
 .sp
-The following options output 1 for true or zero for false:
+The following options output 1 for true or 0 for false, and set the exit code
+to the same value:
 .sp
   ebcdic     compiled for an EBCDIC environment
   jit        just-in-time support is available
@@ -108,7 +126,10 @@ The following options output 1 for true or zero for false:
   pcre32     the 32-bit library was built
   pcre8      the 8-bit library was built
   ucp        Unicode property support is available
-  utf        UTF-8 and/or UTF-16 and/or UTF-32 support is available
+  utf        UTF-8 and/or UTF-16 and/or UTF-32 support
+               is available
+.sp
+If an unknown option is given, an error message is output; the exit code is 0.
 .TP 10
 \fB-d\fP
 Behave as if each pattern has the \fB/D\fP (debug) modifier; the internal
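Note on the -C changes above: because each "-C option" query now documents an exit code, a script can branch on the exit status instead of parsing the printed value. The following is a minimal Python sketch of that mapping, not part of this commit. The names expected_exit_code and has_feature are hypothetical, and pcre16 is assumed by analogy with pcre8 and pcre32 because its line falls outside the hunks shown.

```python
import subprocess

# Options that print a value; the exit code is documented as always 0.
VALUE_OPTIONS_EXIT_ZERO = {"ebcdic-nl", "newline", "bsr"}

# Options that print 1 or 0 and set the exit code to the same value.
# "pcre16" is an assumption: it is not visible in the hunks above.
BOOLEAN_OPTIONS = {"ebcdic", "jit", "pcre16", "pcre32", "pcre8", "ucp", "utf"}


def expected_exit_code(option, printed):
    """Return the exit code the updated text documents for `pcretest -C option`.

    `printed` is the value pcretest writes to standard output.
    """
    if option in VALUE_OPTIONS_EXIT_ZERO:
        return 0
    if option == "linksize":
        return int(printed.strip())  # exit code is set to the link size
    if option in BOOLEAN_OPTIONS:
        return int(printed.strip())  # 1 for true, 0 for false
    return 0  # unknown option: an error message is output, exit code 0


def has_feature(name, pcretest="pcretest"):
    """Run `pcretest -C name` and report whether the feature was built in.

    Assumes a pcretest binary (8.35 behaviour) is reachable as `pcretest`.
    """
    if name not in BOOLEAN_OPTIONS:
        raise ValueError("not a true/false option: %r" % name)
    result = subprocess.run([pcretest, "-C", name], capture_output=True)
    return result.returncode == 1
```

One consequence worth noting: an exit status of 0 from a true/false query means "feature absent", and an unknown option also exits with 0, so a caller cannot tell the two apart from the exit status alone.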
@@ -137,6 +158,10 @@ Output the size of each compiled pattern after it has been compiled. This is
 equivalent to adding \fB/M\fP to each regular expression. The size is given in
 bytes for both libraries.
 .TP 10
+\fB-O\fP
+Behave as if each pattern has the \fB/O\fP modifier, that is disable
+auto-possessification for all patterns.
+.TP 10
 \fB-o\fP \fIosize\fP
 Set the number of elements in the output vector that is used when calling
 \fBpcre[16|32]_exec()\fP or \fBpcre[16|32]_dfa_exec()\fP to be \fIosize\fP. The
@@ -198,17 +223,21 @@ contains (*MARK) items there may also be differences, for the same reason. The
 should never be studied (see the \fB/S\fP pattern modifier below).
 .TP 10
 \fB-t\fP
-Run each compile, study, and match many times with a timer, and output
-resulting time per compile or match (in milliseconds). Do not set \fB-m\fP with
-\fB-t\fP, because you will then get the size output a zillion times, and the
-timing will be distorted. You can control the number of iterations that are
-used for timing by following \fB-t\fP with a number (as a separate item on the
-command line). For example, "-t 1000" would iterate 1000 times. The default is
-to iterate 500000 times.
+Run each compile, study, and match many times with a timer, and output the
+resulting times per compile, study, or match (in milliseconds). Do not set
+\fB-m\fP with \fB-t\fP, because you will then get the size output a zillion
+times, and the timing will be distorted. You can control the number of
+iterations that are used for timing by following \fB-t\fP with a number (as a
+separate item on the command line). For example, "-t 1000" iterates 1000 times.
+The default is to iterate 500000 times.
 .TP 10
 \fB-tm\fP
 This is like \fB-t\fP except that it times only the matching phase, not the
 compile or study phases.
+.TP 10
+\fB-T\fP \fB-TM\fP
+These behave like \fB-t\fP and \fB-tm\fP, but in addition, at the end of a run,
+the total times for all compiles, studies, and matches are output.
 .
 .
 .SH DESCRIPTION
@@ -228,7 +257,7 @@ option states whether or not \fBreadline()\fP will be used.
 .P
 The program handles any number of sets of input on a single input file. Each
 set starts with a regular expression, and continues with any number of data
-lines to be matched against the pattern.
+lines to be matched against that pattern.
 .P
 Each data line is matched separately and independently. If you want to do
 multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
@@ -280,6 +309,7 @@ fall into several groups that are described in detail in the following
 sections.
 .sp
   \fB/8\fP              set UTF mode
+  \fB/9\fP              set PCRE_NEVER_UTF (locks out UTF mode)
   \fB/?\fP              disable UTF validity check
   \fB/+\fP              show remainder of subject after match
   \fB/=\fP              show all captures (not just those that are set)
@@ -301,7 +331,9 @@ sections.
   \fB/M\fP              show compiled memory size
   \fB/m\fP              set PCRE_MULTILINE
   \fB/N\fP              set PCRE_NO_AUTO_CAPTURE
+  \fB/O\fP              set PCRE_NO_AUTO_POSSESS
   \fB/P\fP              use the POSIX wrapper
+  \fB/Q\fP              test external stack check function
   \fB/S\fP              study the pattern after compilation
   \fB/s\fP              set PCRE_DOTALL
   \fB/T\fP              select character tables
@@ -350,12 +382,14 @@ options that do not correspond to anything in Perl:
   \fB/8\fP              PCRE_UTF32           ) when using the 32-bit
   \fB/?\fP              PCRE_NO_UTF32_CHECK  )   library
 .sp
+  \fB/9\fP              PCRE_NEVER_UTF
   \fB/A\fP              PCRE_ANCHORED
   \fB/C\fP              PCRE_AUTO_CALLOUT
   \fB/E\fP              PCRE_DOLLAR_ENDONLY
   \fB/f\fP              PCRE_FIRSTLINE
   \fB/J\fP              PCRE_DUPNAMES
   \fB/N\fP              PCRE_NO_AUTO_CAPTURE
+  \fB/O\fP              PCRE_NO_AUTO_POSSESS
   \fB/U\fP              PCRE_UNGREEDY
   \fB/W\fP              PCRE_UCP
   \fB/X\fP              PCRE_EXTRA
@@ -453,7 +487,10 @@ below.
 The \fB/I\fP modifier requests that \fBpcretest\fP output information about the
 compiled pattern (whether it is anchored, has a fixed first character, and
 so on). It does this by calling \fBpcre[16|32]_fullinfo()\fP after compiling a
-pattern. If the pattern is studied, the results of that are also output.
+pattern. If the pattern is studied, the results of that are also output. In
+this output, the word "char" means a non-UTF character, that is, the value of a
+single data item (8-bit, 16-bit, or 32-bit, depending on the library that is
+being tested).
 .P
 The \fB/K\fP modifier requests \fBpcretest\fP to show names from backtracking
 control verbs that are returned from calls to \fBpcre[16|32]_exec()\fP. It causes
@@ -483,13 +520,22 @@ the compiled pattern to be output. This does not include the size of the
 successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
 JIT compiled code is also output.
 .P
+The \fB/Q\fP modifier is used to test the use of \fBpcre_stack_guard\fP. It
+must be followed by '0' or '1', specifying the return code to be given from an
+external function that is passed to PCRE and used for stack checking during
+compilation (see the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation for details).
+.P
 The \fB/S\fP modifier causes \fBpcre[16|32]_study()\fP to be called after the
 expression has been compiled, and the results used when the expression is
 matched. There are a number of qualifying characters that may follow \fB/S\fP.
 They may appear in any order.
 .P
-If \fBS\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is called
-with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
+If \fB/S\fP is followed by an exclamation mark, \fBpcre[16|32]_study()\fP is
+called with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
 \fBpcre_extra\fP block, even when studying discovers no useful information.
 .P
 If \fB/S\fP is followed by a second S character, it suppresses studying, even
@@ -565,6 +611,37 @@ The \fB/+\fP modifier works as described above. All other modifiers are
 ignored.
 .
 .
+.SS "Locking out certain modifiers"
+.rs
+.sp
+PCRE can be compiled with or without support for certain features such as
+UTF-8/16/32 or Unicode properties. Accordingly, the standard tests are split up
+into a number of different files that are selected for running depending on
+which features are available. When updating the tests, it is all too easy to
+put a new test into the wrong file by mistake; for example, to put a test that
+requires UTF support into a file that is used when it is not available. To help
+detect such mistakes as early as possible, there is a facility for locking out
+specific modifiers. If an input line for \fBpcretest\fP starts with the string
+"< forbid " the following sequence of characters is taken as a list of
+forbidden modifiers. For example, in the test files that must not use UTF or
+Unicode property support, this line appears:
+.sp
+  < forbid 8W
+.sp
+This locks out the /8 and /W modifiers. An immediate error is given if they are
+subsequently encountered. If the character string contains < but not >, all the
+multi-character modifiers that begin with < are locked out. Otherwise, such
+modifiers must be explicitly listed, for example:
+.sp
+  < forbid <JS><cr>
+.sp
+There must be a single space between < and "forbid" for this feature to be
+recognised. If there is not, the line is interpreted either as a request to
+re-load a pre-compiled pattern (see "SAVING AND RELOADING COMPILED PATTERNS"
+below) or, if there is another < character, as a pattern that uses < as its
+delimiter.
+.
+.
 .SH "DATA LINES"
 .rs
 .sp
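Note on the "< forbid " rules added above: they can be read as a small line classifier plus a parser for the forbidden-modifier list. The Python sketch below illustrates that reading only; it is not pcretest's implementation. The function names are hypothetical, and stripping surrounding whitespace from the modifier list is an assumption the text does not state.

```python
import re

FORBID_PREFIX = "< forbid "  # exactly one space between "<" and "forbid"


def classify_line(line):
    """Classify an input line that may begin with "<", following the text above."""
    if not line.startswith("<"):
        return "other"
    if line.startswith(FORBID_PREFIX):
        return "forbid"
    if "<" in line[1:]:
        return "pattern"  # another "<" present: "<" is the pattern delimiter
    return "reload"       # "<" followed by a file name


def parse_forbid(line):
    """Return (single, multi, all_multi) for a "< forbid " line, else None.

    single    -- set of single-character modifiers, e.g. {"8", "W"}
    multi     -- set of explicitly listed multi-character modifiers, e.g. {"<JS>"}
    all_multi -- True if the list contains "<" but no ">", which locks out
                 every multi-character modifier that begins with "<"
    """
    if not line.startswith(FORBID_PREFIX):
        return None
    spec = line[len(FORBID_PREFIX):].strip()  # assumption: ignore outer whitespace
    multi = set(re.findall(r"<[^<>]*>", spec))
    rest = re.sub(r"<[^<>]*>", "", spec)
    all_multi = "<" in spec and ">" not in spec
    single = {ch for ch in rest if ch not in "<> \t"}
    return single, multi, all_multi
```

Under this reading, "<forbid 8W" (no space after <) is not a forbid line; with no second < on the line it is classified as a reload request.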
@@ -588,6 +665,7 @@ recognized:
   \ev         vertical tab (\ex0b)
   \ennn       octal character (up to 3 octal digits); always
                 a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
+  \eo{dd...}  octal character (any number of octal digits)
   \exhh       hexadecimal byte (up to 2 hex digits)
   \ex{hh...}  hexadecimal character (any number of hex digits)
 .\" JOIN
@@ -1011,10 +1089,9 @@ exact copy of the compiled pattern. If there is additional study data, this
 writing the file, \fBpcretest\fP expects to read a new pattern.
 .P
 A saved pattern can be reloaded into \fBpcretest\fP by specifying < and a file
-name instead of a pattern. The name of the file must not contain a < character,
-as otherwise \fBpcretest\fP will interpret the line as a pattern delimited by <
-characters.
-For example:
+name instead of a pattern. There must be no space between < and the file name,
+which must not contain a < character, as otherwise \fBpcretest\fP will
+interpret the line as a pattern delimited by < characters. For example:
 .sp
    re> </some/file
   Compiled pattern loaded from /some/file
@@ -1074,6 +1151,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 10 September 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 09 February 2014
+Copyright (c) 1997-2014 University of Cambridge.
 .fi