Regex: Add PCRE 8.32 in tools directory.

Arkshine
2014-07-05 00:28:24 +02:00
parent fe8e32155d
commit 7a6e793813
354 changed files with 256402 additions and 0 deletions


@ -0,0 +1,180 @@
<html>
<!-- This is a manually maintained file that is the root of the HTML version of
the PCRE documentation. When the HTML documents are built from the man
page versions, the entire doc/html directory is emptied, this file is then
copied into doc/html/index.html, and the remaining files therein are
created by the 132html script.
-->
<head>
<title>PCRE specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>Perl-compatible Regular Expressions (PCRE)</h1>
<p>
The HTML documentation for PCRE comprises the following pages:
</p>
<table>
<tr><td><a href="pcre.html">pcre</a></td>
<td>&nbsp;&nbsp;Introductory page</td></tr>
<tr><td><a href="pcre16.html">pcre16</a></td>
<td>&nbsp;&nbsp;Discussion of the 16-bit PCRE library</td></tr>
<tr><td><a href="pcre32.html">pcre32</a></td>
<td>&nbsp;&nbsp;Discussion of the 32-bit PCRE library</td></tr>
<tr><td><a href="pcre-config.html">pcre-config</a></td>
<td>&nbsp;&nbsp;Information about the installation configuration</td></tr>
<tr><td><a href="pcreapi.html">pcreapi</a></td>
<td>&nbsp;&nbsp;PCRE's native API</td></tr>
<tr><td><a href="pcrebuild.html">pcrebuild</a></td>
<td>&nbsp;&nbsp;Options for building PCRE</td></tr>
<tr><td><a href="pcrecallout.html">pcrecallout</a></td>
<td>&nbsp;&nbsp;The <i>callout</i> facility</td></tr>
<tr><td><a href="pcrecompat.html">pcrecompat</a></td>
<td>&nbsp;&nbsp;Compatibility with Perl</td></tr>
<tr><td><a href="pcrecpp.html">pcrecpp</a></td>
<td>&nbsp;&nbsp;The C++ wrapper for the PCRE library</td></tr>
<tr><td><a href="pcredemo.html">pcredemo</a></td>
<td>&nbsp;&nbsp;A demonstration C program that uses the PCRE library</td></tr>
<tr><td><a href="pcregrep.html">pcregrep</a></td>
<td>&nbsp;&nbsp;The <b>pcregrep</b> command</td></tr>
<tr><td><a href="pcrejit.html">pcrejit</a></td>
<td>&nbsp;&nbsp;Discussion of the just-in-time optimization support</td></tr>
<tr><td><a href="pcrelimits.html">pcrelimits</a></td>
<td>&nbsp;&nbsp;Details of size and other limits</td></tr>
<tr><td><a href="pcrematching.html">pcrematching</a></td>
<td>&nbsp;&nbsp;Discussion of the two matching algorithms</td></tr>
<tr><td><a href="pcrepartial.html">pcrepartial</a></td>
<td>&nbsp;&nbsp;Using PCRE for partial matching</td></tr>
<tr><td><a href="pcrepattern.html">pcrepattern</a></td>
<td>&nbsp;&nbsp;Specification of the regular expressions supported by PCRE</td></tr>
<tr><td><a href="pcreperform.html">pcreperform</a></td>
<td>&nbsp;&nbsp;Some comments on performance</td></tr>
<tr><td><a href="pcreposix.html">pcreposix</a></td>
<td>&nbsp;&nbsp;The POSIX API to the PCRE library</td></tr>
<tr><td><a href="pcreprecompile.html">pcreprecompile</a></td>
<td>&nbsp;&nbsp;How to save and re-use compiled patterns</td></tr>
<tr><td><a href="pcresample.html">pcresample</a></td>
<td>&nbsp;&nbsp;Discussion of the pcredemo program</td></tr>
<tr><td><a href="pcrestack.html">pcrestack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE's stack usage</td></tr>
<tr><td><a href="pcresyntax.html">pcresyntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>
<tr><td><a href="pcretest.html">pcretest</a></td>
<td>&nbsp;&nbsp;The <b>pcretest</b> command for testing PCRE</td></tr>
<tr><td><a href="pcreunicode.html">pcreunicode</a></td>
<td>&nbsp;&nbsp;Discussion of Unicode and UTF-8/UTF-16/UTF-32 support</td></tr>
</table>
<p>
There are also individual pages that summarize the interface for each function
in the library. There is a single page for each triple of 8-bit/16-bit/32-bit
functions.
</p>
<table>
<tr><td><a href="pcre_assign_jit_stack.html">pcre_assign_jit_stack</a></td>
<td>&nbsp;&nbsp;Assign stack for JIT matching</td></tr>
<tr><td><a href="pcre_compile.html">pcre_compile</a></td>
<td>&nbsp;&nbsp;Compile a regular expression</td></tr>
<tr><td><a href="pcre_compile2.html">pcre_compile2</a></td>
<td>&nbsp;&nbsp;Compile a regular expression (alternate interface)</td></tr>
<tr><td><a href="pcre_config.html">pcre_config</a></td>
<td>&nbsp;&nbsp;Show build-time configuration options</td></tr>
<tr><td><a href="pcre_copy_named_substring.html">pcre_copy_named_substring</a></td>
<td>&nbsp;&nbsp;Extract named substring into given buffer</td></tr>
<tr><td><a href="pcre_copy_substring.html">pcre_copy_substring</a></td>
<td>&nbsp;&nbsp;Extract numbered substring into given buffer</td></tr>
<tr><td><a href="pcre_dfa_exec.html">pcre_dfa_exec</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(DFA algorithm; <i>not</i> Perl compatible)</td></tr>
<tr><td><a href="pcre_free_study.html">pcre_free_study</a></td>
<td>&nbsp;&nbsp;Free study data</td></tr>
<tr><td><a href="pcre_exec.html">pcre_exec</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(Perl compatible)</td></tr>
<tr><td><a href="pcre_free_substring.html">pcre_free_substring</a></td>
<td>&nbsp;&nbsp;Free extracted substring</td></tr>
<tr><td><a href="pcre_free_substring_list.html">pcre_free_substring_list</a></td>
<td>&nbsp;&nbsp;Free list of extracted substrings</td></tr>
<tr><td><a href="pcre_fullinfo.html">pcre_fullinfo</a></td>
<td>&nbsp;&nbsp;Extract information about a pattern</td></tr>
<tr><td><a href="pcre_get_named_substring.html">pcre_get_named_substring</a></td>
<td>&nbsp;&nbsp;Extract named substring into new memory</td></tr>
<tr><td><a href="pcre_get_stringnumber.html">pcre_get_stringnumber</a></td>
<td>&nbsp;&nbsp;Convert captured string name to number</td></tr>
<tr><td><a href="pcre_get_substring.html">pcre_get_substring</a></td>
<td>&nbsp;&nbsp;Extract numbered substring into new memory</td></tr>
<tr><td><a href="pcre_get_substring_list.html">pcre_get_substring_list</a></td>
<td>&nbsp;&nbsp;Extract all substrings into new memory</td></tr>
<tr><td><a href="pcre_info.html">pcre_info</a></td>
<td>&nbsp;&nbsp;Obsolete information extraction function</td></tr>
<tr><td><a href="pcre_jit_stack_alloc.html">pcre_jit_stack_alloc</a></td>
<td>&nbsp;&nbsp;Create a stack for JIT matching</td></tr>
<tr><td><a href="pcre_jit_stack_free.html">pcre_jit_stack_free</a></td>
<td>&nbsp;&nbsp;Free a JIT matching stack</td></tr>
<tr><td><a href="pcre_maketables.html">pcre_maketables</a></td>
<td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
<tr><td><a href="pcre_pattern_to_host_byte_order.html">pcre_pattern_to_host_byte_order</a></td>
<td>&nbsp;&nbsp;Convert compiled pattern to host byte order if necessary</td></tr>
<tr><td><a href="pcre_refcount.html">pcre_refcount</a></td>
<td>&nbsp;&nbsp;Maintain reference count in compiled pattern</td></tr>
<tr><td><a href="pcre_study.html">pcre_study</a></td>
<td>&nbsp;&nbsp;Study a compiled pattern</td></tr>
<tr><td><a href="pcre_utf16_to_host_byte_order.html">pcre_utf16_to_host_byte_order</a></td>
<td>&nbsp;&nbsp;Convert UTF-16 string to host byte order if necessary</td></tr>
<tr><td><a href="pcre_utf32_to_host_byte_order.html">pcre_utf32_to_host_byte_order</a></td>
<td>&nbsp;&nbsp;Convert UTF-32 string to host byte order if necessary</td></tr>
<tr><td><a href="pcre_version.html">pcre_version</a></td>
<td>&nbsp;&nbsp;Return PCRE version and release date</td></tr>
</table>
</body>
</html>


@ -0,0 +1,109 @@
<html>
<head>
<title>pcre-config specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre-config man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">OPTIONS</a>
<li><a name="TOC4" href="#SEC4">SEE ALSO</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcre-config [--prefix] [--exec-prefix] [--version] [--libs]</b>
<b>[--libs16] [--libs32] [--libs-cpp] [--libs-posix]</b>
<b>[--cflags] [--cflags-posix]</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
<b>pcre-config</b> returns the configuration of the installed PCRE
libraries and the options required to compile a program to use them. Some of
the options apply only to the 8-bit, 16-bit, or 32-bit library, and are not
available if the corresponding library has not been built. If an unavailable
option is encountered, the "usage" information is output.
</P>
<br><a name="SEC3" href="#TOC1">OPTIONS</a><br>
<P>
<b>--prefix</b>
Writes the directory prefix used in the PCRE installation for architecture
independent files (<i>/usr</i> on many systems, <i>/usr/local</i> on some
systems) to the standard output.
</P>
<P>
<b>--exec-prefix</b>
Writes the directory prefix used in the PCRE installation for architecture
dependent files (normally the same as <b>--prefix</b>) to the standard output.
</P>
<P>
<b>--version</b>
Writes the version number of the installed PCRE libraries to the standard
output.
</P>
<P>
<b>--libs</b>
Writes to the standard output the command line options required to link
with the 8-bit PCRE library (<b>-lpcre</b> on many systems).
</P>
<P>
<b>--libs16</b>
Writes to the standard output the command line options required to link
with the 16-bit PCRE library (<b>-lpcre16</b> on many systems).
</P>
<P>
<b>--libs32</b>
Writes to the standard output the command line options required to link
with the 32-bit PCRE library (<b>-lpcre32</b> on many systems).
</P>
<P>
<b>--libs-cpp</b>
Writes to the standard output the command line options required to link with
PCRE's C++ wrapper library (<b>-lpcrecpp</b> <b>-lpcre</b> on many
systems).
</P>
<P>
<b>--libs-posix</b>
Writes to the standard output the command line options required to link with
PCRE's POSIX API wrapper library (<b>-lpcreposix</b> <b>-lpcre</b> on many
systems).
</P>
<P>
<b>--cflags</b>
Writes to the standard output the command line options required to compile
files that use PCRE (this may include some <b>-I</b> options, but is blank on
many systems).
</P>
<P>
<b>--cflags-posix</b>
Writes to the standard output the command line options required to compile
files that use PCRE's POSIX API wrapper library (this may include some <b>-I</b>
options, but is blank on many systems).
</P>
<br><a name="SEC4" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre(3)</b>
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
This manual page was originally written by Mark Baker for the Debian GNU/Linux
system. It has been subsequently revised as a generic PCRE man page.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 June 2012
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,204 @@
<html>
<head>
<title>pcre specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">INTRODUCTION</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONSIDERATIONS</a>
<li><a name="TOC3" href="#SEC3">USER DOCUMENTATION</a>
<li><a name="TOC4" href="#SEC4">AUTHOR</a>
<li><a name="TOC5" href="#SEC5">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">INTRODUCTION</a><br>
<P>
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl, with just a few
differences. Some features that appeared in Python and PCRE before they
appeared in Perl are also available using the Python syntax, there is some
support for one or two .NET and Oniguruma syntax items, and there is an option
for requesting some minor changes that give better JavaScript compatibility.
</P>
<P>
Starting with release 8.30, it is possible to compile two separate PCRE
libraries: the original, which supports 8-bit character strings (including
UTF-8 strings), and a second library that supports 16-bit character strings
(including UTF-16 strings). The build process allows either one or both to be
built. The majority of the work to make this possible was done by Zoltan
Herczeg.
</P>
<P>
Starting with release 8.32 it is possible to compile a third separate PCRE
library, which supports 32-bit character strings (including
UTF-32 strings). The build process allows any set of the 8-, 16- and 32-bit
libraries. The work to make this possible was done by Christian Persch.
</P>
<P>
The three libraries contain identical sets of functions, except that the names
in the 16-bit library start with <b>pcre16_</b> instead of <b>pcre_</b>, and the
names in the 32-bit library start with <b>pcre32_</b> instead of <b>pcre_</b>. To
avoid over-complication and reduce the documentation maintenance load, most of
the documentation describes the 8-bit library, with the differences for the
16-bit and 32-bit libraries described separately in the
<a href="pcre16.html"><b>pcre16</b></a>
and
<a href="pcre32.html"><b>pcre32</b></a>
pages. References to functions or structures of the form <i>pcre[16|32]_xxx</i>
should be read as meaning "<i>pcre_xxx</i> when using the 8-bit library,
<i>pcre16_xxx</i> when using the 16-bit library, or <i>pcre32_xxx</i> when using
the 32-bit library".
</P>
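<P>
The <i>pcre[16|32]_xxx</i> naming rule can be sketched with the C preprocessor.
This is only an illustration of the convention: SKETCH_UNIT_WIDTH, SKETCH_PCRE
and <b>sketch_compile_name()</b> are hypothetical names invented for this
example and are not part of PCRE 8.32.
</P>

```c
/* Hypothetical width selector for this sketch; not a PCRE 8.32 macro. */
#ifndef SKETCH_UNIT_WIDTH
#define SKETCH_UNIT_WIDTH 16
#endif

/* Pick the prefix that the text above describes for each library. */
#if SKETCH_UNIT_WIDTH == 8
#define SKETCH_PCRE(name) pcre_##name
#elif SKETCH_UNIT_WIDTH == 16
#define SKETCH_PCRE(name) pcre16_##name
#elif SKETCH_UNIT_WIDTH == 32
#define SKETCH_PCRE(name) pcre32_##name
#else
#error "SKETCH_UNIT_WIDTH must be 8, 16 or 32"
#endif

/* Two-step stringification so that the macro argument is expanded first. */
#define SKETCH_STR_(x) #x
#define SKETCH_STR(x) SKETCH_STR_(x)

/* Returns the library-specific spelling of pcre[16|32]_compile as text. */
const char *sketch_compile_name(void)
{
    return SKETCH_STR(SKETCH_PCRE(compile));
}
```

<P>
With the default width of 16 chosen here, SKETCH_PCRE(compile) expands to the
identifier pcre16_compile. An application that really wanted such a switch
would use the macro in place of the function name at each call site.
</P>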
<P>
The current implementation of PCRE corresponds approximately with Perl 5.12,
including support for UTF-8/16/32 encoded strings and Unicode general category
properties. However, UTF-8/16/32 and Unicode support has to be explicitly
enabled; it is not the default. The Unicode tables correspond to Unicode
release 6.2.0.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a different
way. In certain circumstances, the alternative function has some advantages.
For a discussion of the two matching algorithms, see the
<a href="pcrematching.html"><b>pcrematching</b></a>
page.
</P>
<P>
PCRE is written in C and released as a C library. A number of people have
written wrappers and interfaces of various kinds. In particular, Google Inc.
have provided a comprehensive C++ wrapper for the 8-bit library. This is now
included as part of the PCRE distribution. The
<a href="pcrecpp.html"><b>pcrecpp</b></a>
page has details of this interface. Other people's contributions can be found
in the <i>Contrib</i> directory at the primary FTP site, which is:
<a href="ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre">ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre</a>
</P>
<P>
Details of exactly which Perl regular expression features are and are not
supported by PCRE are given in separate documents. See the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
and
<a href="pcrecompat.html"><b>pcrecompat</b></a>
pages. There is a syntax summary in the
<a href="pcresyntax.html"><b>pcresyntax</b></a>
page.
</P>
<P>
Some features of PCRE can be included, excluded, or changed when the library is
built. The
<a href="pcre_config.html"><b>pcre_config()</b></a>
function makes it possible for a client to discover which features are
available. The features themselves are described in the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
page. Documentation about building PCRE for various operating systems can be
found in the <b>README</b> and <b>NON-AUTOTOOLS_BUILD</b> files in the source
distribution.
</P>
<P>
The libraries contain a number of undocumented internal functions and data
tables that are used by more than one of the exported external functions, but
which are not intended for use by external callers. Their names all begin with
"_pcre_" or "_pcre16_" or "_pcre32_", which hopefully will not provoke any name
clashes. In some environments, it is possible to control which external symbols
are exported when a shared library is built, and in these cases the
undocumented symbols are not exported.
</P>
<br><a name="SEC2" href="#TOC1">SECURITY CONSIDERATIONS</a><br>
<P>
If you are using PCRE in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern, provided that PCRE
was built with UTF support. For example, an 8-bit pattern that begins with
"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and
subjects as strings of UTF-8 characters instead of individual 8-bit characters.
This causes both the pattern and any data against which it is matched to be
checked for UTF-8 validity. If the data string is very long, such a check might
use enough resources to degrade your application's performance.
</P>
<P>
The best way of guarding against this possibility is to use the
<b>pcre_fullinfo()</b> function to check the compiled pattern's options for UTF.
</P>
<P>
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
the PCRE_NO_UTF[8|16|32]_CHECK option for the second and subsequent matches to
save redundant checks.
</P>
<P>
Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE provides some protection
against this: see the PCRE_EXTRA_MATCH_LIMIT feature in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
all the sections, except the <b>pcredemo</b> section, are concatenated, for ease
of searching. The sections are as follows:
<pre>
pcre this document
pcre16 details of the 16-bit library
pcre32 details of the 32-bit library
pcre-config show PCRE installation configuration information
pcreapi details of PCRE's native C API
pcrebuild options for building PCRE
pcrecallout details of the callout feature
pcrecompat discussion of Perl compatibility
pcrecpp details of the C++ wrapper for the 8-bit library
pcredemo a demonstration C program that uses PCRE
pcregrep description of the <b>pcregrep</b> command (8-bit only)
pcrejit discussion of the just-in-time optimization support
pcrelimits details of size and other limits
pcrematching discussion of the two matching algorithms
pcrepartial details of the partial matching facility
pcrepattern syntax and semantics of supported regular expressions
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API for the 8-bit library
pcreprecompile details of saving and re-using precompiled patterns
pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
pcresyntax quick syntax reference
pcretest description of the <b>pcretest</b> testing command
pcreunicode discussion of Unicode and UTF-8/16/32 support
</pre>
In addition, in the "man" and HTML formats, there is a short page for each
C library function, listing its arguments and results.
</P>
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<P>
Putting an actual email address here seems to have been a spam magnet, so I've
taken it away. If you want to email me, use my two initials, followed by the
two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 11 November 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,383 @@
<html>
<head>
<title>pcre16 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre16 man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE 16-BIT API BASIC FUNCTIONS</a>
<li><a name="TOC2" href="#SEC2">PCRE 16-BIT API STRING EXTRACTION FUNCTIONS</a>
<li><a name="TOC3" href="#SEC3">PCRE 16-BIT API AUXILIARY FUNCTIONS</a>
<li><a name="TOC4" href="#SEC4">PCRE 16-BIT API INDIRECTED FUNCTIONS</a>
<li><a name="TOC5" href="#SEC5">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a>
<li><a name="TOC6" href="#SEC6">THE PCRE 16-BIT LIBRARY</a>
<li><a name="TOC7" href="#SEC7">THE HEADER FILE</a>
<li><a name="TOC8" href="#SEC8">THE LIBRARY NAME</a>
<li><a name="TOC9" href="#SEC9">STRING TYPES</a>
<li><a name="TOC10" href="#SEC10">STRUCTURE TYPES</a>
<li><a name="TOC11" href="#SEC11">16-BIT FUNCTIONS</a>
<li><a name="TOC12" href="#SEC12">SUBJECT STRING OFFSETS</a>
<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a>
<li><a name="TOC14" href="#SEC14">OPTION NAMES</a>
<li><a name="TOC15" href="#SEC15">CHARACTER CODES</a>
<li><a name="TOC16" href="#SEC16">ERROR NAMES</a>
<li><a name="TOC17" href="#SEC17">ERROR TEXTS</a>
<li><a name="TOC18" href="#SEC18">CALLOUTS</a>
<li><a name="TOC19" href="#SEC19">TESTING</a>
<li><a name="TOC20" href="#SEC20">NOT SUPPORTED IN 16-BIT MODE</a>
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
<li><a name="TOC22" href="#SEC22">REVISION</a>
</ul>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<br><a name="SEC1" href="#TOC1">PCRE 16-BIT API BASIC FUNCTIONS</a><br>
<P>
<b>pcre16 *pcre16_compile(PCRE_SPTR16 <i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre16 *pcre16_compile2(PCRE_SPTR16 <i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre16_extra *pcre16_study(const pcre16 *<i>code</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>);</b>
</P>
<P>
<b>void pcre16_free_study(pcre16_extra *<i>extra</i>);</b>
</P>
<P>
<b>int pcre16_exec(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
</P>
<P>
<b>int pcre16_dfa_exec(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">PCRE 16-BIT API STRING EXTRACTION FUNCTIONS</a><br>
<P>
<b>int pcre16_copy_named_substring(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR16 <i>stringname</i>,</b>
<b>PCRE_UCHAR16 *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre16_copy_substring(PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>, PCRE_UCHAR16 *<i>buffer</i>,</b>
<b>int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre16_get_named_substring(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR16 <i>stringname</i>,</b>
<b>PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre16_get_stringnumber(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>name</i>);</b>
</P>
<P>
<b>int pcre16_get_stringtable_entries(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>name</i>, PCRE_UCHAR16 **<i>first</i>, PCRE_UCHAR16 **<i>last</i>);</b>
</P>
<P>
<b>int pcre16_get_substring(PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
<b>PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre16_get_substring_list(PCRE_SPTR16 <i>subject</i>,</b>
<b>int *<i>ovector</i>, int <i>stringcount</i>, PCRE_SPTR16 **<i>listptr</i>);</b>
</P>
<P>
<b>void pcre16_free_substring(PCRE_SPTR16 <i>stringptr</i>);</b>
</P>
<P>
<b>void pcre16_free_substring_list(PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<br><a name="SEC3" href="#TOC1">PCRE 16-BIT API AUXILIARY FUNCTIONS</a><br>
<P>
<b>pcre16_jit_stack *pcre16_jit_stack_alloc(int <i>startsize</i>, int <i>maxsize</i>);</b>
</P>
<P>
<b>void pcre16_jit_stack_free(pcre16_jit_stack *<i>stack</i>);</b>
</P>
<P>
<b>void pcre16_assign_jit_stack(pcre16_extra *<i>extra</i>,</b>
<b>pcre16_jit_callback <i>callback</i>, void *<i>data</i>);</b>
</P>
<P>
<b>const unsigned char *pcre16_maketables(void);</b>
</P>
<P>
<b>int pcre16_fullinfo(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre16_refcount(pcre16 *<i>code</i>, int <i>adjust</i>);</b>
</P>
<P>
<b>int pcre16_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>const char *pcre16_version(void);</b>
</P>
<P>
<b>int pcre16_pattern_to_host_byte_order(pcre16 *<i>code</i>,</b>
<b>pcre16_extra *<i>extra</i>, const unsigned char *<i>tables</i>);</b>
</P>
<br><a name="SEC4" href="#TOC1">PCRE 16-BIT API INDIRECTED FUNCTIONS</a><br>
<P>
<b>void *(*pcre16_malloc)(size_t);</b>
</P>
<P>
<b>void (*pcre16_free)(void *);</b>
</P>
<P>
<b>void *(*pcre16_stack_malloc)(size_t);</b>
</P>
<P>
<b>void (*pcre16_stack_free)(void *);</b>
</P>
<P>
<b>int (*pcre16_callout)(pcre16_callout_block *);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a><br>
<P>
<b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b>
<b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>byte_order</i>,</b>
<b>int <i>keep_boms</i>);</b>
</P>
<br><a name="SEC6" href="#TOC1">THE PCRE 16-BIT LIBRARY</a><br>
<P>
Starting with release 8.30, it is possible to compile a PCRE library that
supports 16-bit character strings, including UTF-16 strings, as well as or
instead of the original 8-bit library. The majority of the work to make this
possible was done by Zoltan Herczeg. The two libraries contain identical sets
of functions, used in exactly the same way. Only the names of the functions and
the data types of their arguments and results are different. To avoid
over-complication and reduce the documentation maintenance load, most of the
PCRE documentation describes the 8-bit library, with only occasional references
to the 16-bit library. This page describes what is different when you use the
16-bit library.
</P>
<P>
WARNING: A single application can be linked with both libraries, but you must
take care when processing any particular pattern to use functions from just one
library. For example, if you want to study a pattern that was compiled with
<b>pcre16_compile()</b>, you must do so with <b>pcre16_study()</b>, not
<b>pcre_study()</b>, and you must free the study data with
<b>pcre16_free_study()</b>.
</P>
<br><a name="SEC7" href="#TOC1">THE HEADER FILE</a><br>
<P>
There is only one header file, <b>pcre.h</b>. It contains prototypes for all the
functions in all libraries, as well as definitions of flags, structures, error
codes, etc.
</P>
<br><a name="SEC8" href="#TOC1">THE LIBRARY NAME</a><br>
<P>
In Unix-like systems, the 16-bit library is called <b>libpcre16</b>, and can
normally be accessed by adding <b>-lpcre16</b> to the command for linking an
application that uses PCRE.
</P>
<br><a name="SEC9" href="#TOC1">STRING TYPES</a><br>
<P>
In the 8-bit library, strings are passed to PCRE library functions as vectors
of bytes with the C type "char *". In the 16-bit library, strings are passed as
vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an
appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In
very many environments, "short int" is a 16-bit data type. When PCRE is built,
it defines PCRE_UCHAR16 as "unsigned short int", but checks that it really is a
16-bit data type. If it is not, the build fails with an error message telling
the maintainer to modify the definition appropriately.
</P>
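<P>
The kind of build-time width check described above can be sketched in portable
C as follows. This is an illustration, not PCRE's own build code: the type
<b>sketch_uchar16</b> and the helper <b>sketch_strlen16()</b> are hypothetical
names standing in for PCRE_UCHAR16 and for whatever length routine an
application needs.
</P>

```c
#include <limits.h>
#include <stddef.h>

/* Refuse to compile if "unsigned short" is not exactly 16 bits wide. */
#if USHRT_MAX != 65535
#error "unsigned short is not a 16-bit type; choose another type"
#endif

/* Hypothetical stand-in for PCRE_UCHAR16 (an assumption of this sketch). */
typedef unsigned short sketch_uchar16;

/* Count the 16-bit units that precede the zero terminator. */
size_t sketch_strlen16(const sketch_uchar16 *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

<P>
A length helper of this kind is needed because the standard <b>strlen()</b>
function counts bytes and would stop at the first zero byte inside a 16-bit
unit.
</P>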
<br><a name="SEC10" href="#TOC1">STRUCTURE TYPES</a><br>
<P>
The types of the opaque structures that are used for compiled 16-bit patterns
and JIT stacks are <b>pcre16</b> and <b>pcre16_jit_stack</b> respectively. The
type of the user-accessible structure that is returned by <b>pcre16_study()</b>
is <b>pcre16_extra</b>, and the type of the structure that is used for passing
data to a callout function is <b>pcre16_callout_block</b>. These structures
contain the same fields, with the same names, as their 8-bit counterparts. The
only difference is that pointers to character strings are 16-bit instead of
8-bit types.
</P>
<br><a name="SEC11" href="#TOC1">16-BIT FUNCTIONS</a><br>
<P>
For every function in the 8-bit library there is a corresponding function in
the 16-bit library with a name that starts with <b>pcre16_</b> instead of
<b>pcre_</b>. The prototypes are listed above. In addition, there is one extra
function, <b>pcre16_utf16_to_host_byte_order()</b>. This is a utility function
that converts a UTF-16 character string to host byte order if necessary. The
other 16-bit functions expect the strings they are passed to be in host byte
order.
</P>
<P>
The <i>input</i> and <i>output</i> arguments of
<b>pcre16_utf16_to_host_byte_order()</b> may point to the same address, that is,
conversion in place is supported. The output buffer must be at least as long as
the input.
</P>
<P>
The <i>length</i> argument specifies the number of 16-bit data units in the
input string; a negative value specifies a zero-terminated string.
</P>
<P>
If <i>byte_order</i> is NULL, it is assumed that the string starts off in host
byte order. This may be changed by byte-order marks (BOMs) anywhere in the
string (commonly as the first character).
</P>
<P>
If <i>byte_order</i> is not NULL, a non-zero value of the integer to which it
points means that the input starts off in host byte order, otherwise the
opposite order is assumed. Again, BOMs in the string can change this. The final
byte order is passed back at the end of processing.
</P>
<P>
If <i>keep_boms</i> is not zero, byte-order mark characters (0xfeff) are copied
into the output string. Otherwise they are discarded.
</P>
<P>
The result of the function is the number of 16-bit units placed into the output
buffer, including the zero terminator if the string was zero-terminated.
</P>
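<P>
The behaviour described in the preceding paragraphs can be restated as a short
C sketch. It is written from the description above, not copied from the PCRE
source, and the name <b>sketch_utf16_to_host()</b> is hypothetical. It assumes
that a BOM read as 0xfeff means the data is in host order and that 0xfffe means
the bytes are swapped.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative restatement of the documented behaviour; not PCRE source. */
int sketch_utf16_to_host(uint16_t *output, const uint16_t *input,
                         int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume host order at the start". */
    int host_order = (byte_order != NULL) ? *byte_order : 1;
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units, terminator included. */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM read as 0xfeff is in host order; 0xfffe is swapped. */
            host_order = (c == 0xfeff);
            if (keep_boms != 0)
                output[written++] = 0xfeff;
        } else {
            /* Copy the unit, swapping its two bytes when required. */
            output[written++] = host_order ? c
                : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    if (byte_order != NULL)
        *byte_order = host_order;   /* report the final byte order */
    return written;
}
```

<P>
Because the write position never overtakes the read position, passing the same
pointer for <i>input</i> and <i>output</i> converts the string in place, which
matches the in-place use permitted above.
</P>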
<br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br>
<P>
The offsets within subject strings that are returned by the matching functions
are in 16-bit units rather than bytes.
</P>
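<P>
As a small illustration of the difference between units and bytes, the helpers
below convert a unit offset to a byte offset and compute a match length in
units. Both function names are hypothetical. The sketch assumes the usual
output vector layout, in which pair <i>n</i> occupies elements 2<i>n</i> and
2<i>n</i>+1.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of a position that a 16-bit match reports in 16-bit units. */
size_t sketch_unit_to_byte_offset(int unit_offset)
{
    return (size_t)unit_offset * sizeof(uint16_t);
}

/* Length, in 16-bit units, of the match described by one ovector pair. */
int sketch_match_units(const int *ovector, int pair)
{
    return ovector[2 * pair + 1] - ovector[2 * pair];
}
```

<P>
For example, a match that starts at unit 3 starts at byte 6 of the subject
buffer, so the unit offset can be added directly to a PCRE_SPTR16 pointer but
not to a byte pointer.
</P>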
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
The name-to-number translation table that is maintained for named subpatterns
uses 16-bit characters. The <b>pcre16_get_stringtable_entries()</b> function
returns the length of each entry in the table as the number of 16-bit data
units.
</P>
<br><a name="SEC14" href="#TOC1">OPTION NAMES</a><br>
<P>
There are two new general option names, PCRE_UTF16 and PCRE_NO_UTF16_CHECK,
which correspond to PCRE_UTF8 and PCRE_NO_UTF8_CHECK in the 8-bit library. In
fact, these new options define the same bits in the options word. There is a
discussion about the
<a href="pcreunicode.html#utf16strings">validity of UTF-16 strings</a>
in the
<a href="pcreunicode.html"><b>pcreunicode</b></a>
page.
</P>
<P>
For the <b>pcre16_config()</b> function there is an option PCRE_CONFIG_UTF16
that returns 1 if UTF-16 support is configured, otherwise 0. If this option is
given to <b>pcre_config()</b> or <b>pcre32_config()</b>, or if the
PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to <b>pcre16_config()</b>,
the result is the PCRE_ERROR_BADOPTION error.
</P>
<br><a name="SEC15" href="#TOC1">CHARACTER CODES</a><br>
<P>
In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the
same way as in 8-bit, non-UTF-8 mode, except, of course, that they can range
from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than
0xff can therefore be influenced by the locale in the same way as before.
Characters greater than 0xff have only one case, and no "type" (such as letter
or digit).
</P>
<P>
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with
the exception of values in the range 0xd800 to 0xdfff because those are
"surrogate" values that are used in pairs to encode values greater than 0xffff.
</P>
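<P>
As an illustration of how a surrogate pair encodes a value greater than 0xffff,
the sketch below combines a high and a low surrogate into one code point using
the standard UTF-16 arithmetic. The function names are hypothetical and are not
part of the PCRE API.
</P>

```c
#include <stdint.h>

/* High (leading) surrogates occupy 0xd800..0xdbff. */
int is_high_surrogate(uint16_t u) { return (u & 0xfc00) == 0xd800; }

/* Low (trailing) surrogates occupy 0xdc00..0xdfff. */
int is_low_surrogate(uint16_t u) { return (u & 0xfc00) == 0xdc00; }

/* Combine a valid pair: each surrogate carries 10 bits of the value
   that remains after subtracting 0x10000 from the code point. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    uint32_t top = (uint32_t)(high - 0xd800) << 10;
    uint32_t bottom = (uint32_t)(low - 0xdc00);
    return 0x10000u + (top | bottom);
}
```

<P>
For example, the pair 0xd83d 0xde00 yields 0x1f600, and the largest pair,
0xdbff 0xdfff, yields 0x10ffff, the top of the Unicode range.
</P>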
<P>
A UTF-16 string can indicate its endianness by a special code known as a
byte-order mark (BOM). The PCRE functions do not handle this, expecting strings
to be in host byte order. A utility function called
<b>pcre16_utf16_to_host_byte_order()</b> is provided to help with this (see
above).
</P>
<br><a name="SEC16" href="#TOC1">ERROR NAMES</a><br>
<P>
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to
their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled
pattern is passed to a function that processes patterns in the other
mode, for example, if a pattern compiled with <b>pcre_compile()</b> is passed to
<b>pcre16_exec()</b>.
</P>
<P>
There are new error codes whose names begin with PCRE_UTF16_ERR for invalid
UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that
are described in the section entitled
<a href="pcreapi.html#badutf8reasons">"Reason codes for invalid UTF-8 strings"</a>
in the main
<a href="pcreapi.html"><b>pcreapi</b></a>
page. The UTF-16 errors are:
<pre>
PCRE_UTF16_ERR1 Missing low surrogate at end of string
PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE_UTF16_ERR3 Isolated low surrogate
PCRE_UTF16_ERR4 Non-character
</PRE>
</P>
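<P>
The four conditions in the table can be illustrated with a small stand-alone
checker. This is a sketch, not PCRE's validity check: the function name and the
SK_ constants are hypothetical, and because this page does not say which values
PCRE treats as non-characters, the sketch assumes Unicode's own definition
(0xfdd0 to 0xfdef, and any code point ending in 0xfffe or 0xffff).
</P>

```c
#include <stdint.h>

/* Hypothetical reason codes that mirror the order of the table above. */
enum { SK_OK = 0, SK_ERR1 = 1, SK_ERR2 = 2, SK_ERR3 = 3, SK_ERR4 = 4 };

/* Assumption: Unicode's definition of a non-character. */
static int is_noncharacter(uint32_t cp)
{
    return (cp & 0xfffe) == 0xfffe || (cp >= 0xfdd0 && cp <= 0xfdef);
}

/* Scans length 16-bit units; on failure stores the offending offset. */
int sketch_check_utf16(const uint16_t *s, int length, int *erroroffset)
{
    int i;
    for (i = 0; i < length; i++) {
        uint32_t cp = s[i];
        int start = i;

        if ((cp & 0xfc00) == 0xdc00) {          /* isolated low surrogate */
            *erroroffset = i;
            return SK_ERR3;
        }
        if ((cp & 0xfc00) == 0xd800) {          /* high surrogate */
            if (i + 1 >= length) {              /* nothing follows it */
                *erroroffset = i;
                return SK_ERR1;
            }
            if ((s[i + 1] & 0xfc00) != 0xdc00) { /* wrong follower */
                *erroroffset = i;
                return SK_ERR2;
            }
            cp = 0x10000u + (((cp - 0xd800) << 10) | (uint32_t)(s[i + 1] - 0xdc00));
            i++;                                 /* consume the low surrogate */
        }
        if (is_noncharacter(cp)) {
            *erroroffset = start;
            return SK_ERR4;
        }
    }
    return SK_OK;
}
```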
<br><a name="SEC17" href="#TOC1">ERROR TEXTS</a><br>
<P>
If there is an error while compiling a pattern, the error text that is passed
back by <b>pcre16_compile()</b> or <b>pcre16_compile2()</b> is still an 8-bit
character string, zero-terminated.
</P>
<br><a name="SEC18" href="#TOC1">CALLOUTS</a><br>
<P>
The <i>subject</i> and <i>mark</i> fields in the callout block that is passed to
a callout function point to 16-bit vectors.
</P>
<br><a name="SEC19" href="#TOC1">TESTING</a><br>
<P>
The <b>pcretest</b> program continues to operate with 8-bit input and output
files, but it can be used for testing the 16-bit library. If it is run with the
command line option <b>-16</b>, patterns and subject strings are converted from
8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions
are used instead of the 8-bit ones. Returned 16-bit strings are converted to
8-bit for output. If neither the 8-bit nor the 32-bit library was compiled,
<b>pcretest</b> defaults to 16-bit and the <b>-16</b> option is ignored.
</P>
<P>
When PCRE is being built, the <b>RunTest</b> script that is called by "make
check" uses the <b>pcretest</b> <b>-C</b> option to discover which of the 8-bit,
16-bit and 32-bit libraries has been built, and runs the tests appropriately.
</P>
<br><a name="SEC20" href="#TOC1">NOT SUPPORTED IN 16-BIT MODE</a><br>
<P>
Not all the features of the 8-bit library are available with the 16-bit
library. The C++ and POSIX wrapper functions support only the 8-bit library,
and the <b>pcregrep</b> program is at present 8-bit only.
</P>
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 November 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,76 @@
<html>
<head>
<title>pcre_assign_jit_stack specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_assign_jit_stack man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>void pcre_assign_jit_stack(pcre_extra *<i>extra</i>,</b>
<b>pcre_jit_callback <i>callback</i>, void *<i>data</i>);</b>
</P>
<P>
<b>void pcre16_assign_jit_stack(pcre16_extra *<i>extra</i>,</b>
<b>pcre16_jit_callback <i>callback</i>, void *<i>data</i>);</b>
</P>
<P>
<b>void pcre32_assign_jit_stack(pcre32_extra *<i>extra</i>,</b>
<b>pcre32_jit_callback <i>callback</i>, void *<i>data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function provides control over the memory used as a stack at run-time by a
call to <b>pcre[16|32]_exec()</b> with a pattern that has been successfully
compiled with JIT optimization. The arguments are:
<pre>
  extra     the data pointer returned by <b>pcre[16|32]_study()</b>
  callback  a callback function
  data      a JIT stack or a value to be passed to the callback
              function
</PRE>
</P>
<P>
If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block on
the machine stack is used.
</P>
<P>
If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must
be a valid JIT stack, the result of calling <b>pcre[16|32]_jit_stack_alloc()</b>.
</P>
<P>
If <i>callback</i> is not NULL, it is called with <i>data</i> as an argument at
the start of matching, in order to set up a JIT stack. If the result is NULL,
the internal 32K stack is used; otherwise the return value must be a valid JIT
stack, the result of calling <b>pcre[16|32]_jit_stack_alloc()</b>.
</P>
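<P>
The three cases above amount to a simple selection rule, which the following
sketch models without using the PCRE library. All names here are hypothetical
stand-ins: <b>sketch_jit_stack</b> represents a JIT stack, and a NULL result
stands for "use the internal 32K block".
</P>

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real PCRE types. */
typedef struct sketch_jit_stack { int id; } sketch_jit_stack;
typedef sketch_jit_stack *(*sketch_jit_callback)(void *data);

/* Returns the stack to use, or NULL meaning "use the internal 32K block". */
sketch_jit_stack *sketch_select_jit_stack(sketch_jit_callback callback,
                                          void *data)
{
    if (callback == NULL) {
        /* No callback: data is either NULL or itself a JIT stack. */
        return (sketch_jit_stack *)data;
    }
    /* Callback given: it receives data and returns a stack or NULL. */
    return callback(data);
}

/* Two example callbacks: one hands back the stack passed as data,
   the other always asks for the internal stack. */
sketch_jit_stack *sketch_return_data(void *data)
{
    return (sketch_jit_stack *)data;
}

sketch_jit_stack *sketch_return_null(void *data)
{
    (void)data;
    return NULL;
}
```

<P>
A callback is useful when the stack has to be chosen at match time, for example
one stack per thread, whereas passing the stack directly as <i>data</i> fixes
the choice when the function is called.
</P>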
<P>
You may safely assign the same JIT stack to multiple patterns, as long as they
are all matched in the same thread. In a multithread application, each thread
must use its own JIT stack. For more details, see the
<a href="pcrejit.html"><b>pcrejit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,108 @@
<html>
<head>
<title>pcre_compile specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_compile man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>pcre *pcre_compile(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre16 *pcre16_compile(PCRE_SPTR16 <i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre32 *pcre32_compile(PCRE_SPTR32 <i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function compiles a regular expression into an internal form. It is the
same as <b>pcre[16|32]_compile2()</b>, except for the absence of the
<i>errorcodeptr</i> argument. Its arguments are:
<pre>
  <i>pattern</i>    A zero-terminated string containing the
               regular expression to be compiled
  <i>options</i>    Zero or more option bits
  <i>errptr</i>     Where to put an error message
  <i>erroffset</i>  Offset in pattern where error was found
  <i>tableptr</i>   Pointer to character tables, or NULL to
               use the built-in default
</pre>
The option bits are:
<pre>
  PCRE_ANCHORED           Force pattern anchoring
  PCRE_AUTO_CALLOUT       Compile automatic callouts
  PCRE_BSR_ANYCRLF        \R matches only CR, LF, or CRLF
  PCRE_BSR_UNICODE        \R matches all Unicode line endings
  PCRE_CASELESS           Do caseless matching
  PCRE_DOLLAR_ENDONLY     $ not to match newline at end
  PCRE_DOTALL             . matches anything including NL
  PCRE_DUPNAMES           Allow duplicate names for subpatterns
  PCRE_EXTENDED           Ignore white space and # comments
  PCRE_EXTRA              PCRE extra features
                            (not much use currently)
  PCRE_FIRSTLINE          Force matching to be before newline
  PCRE_JAVASCRIPT_COMPAT  JavaScript compatibility
  PCRE_MULTILINE          ^ and $ match newlines within data
  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline
                            sequences
  PCRE_NEWLINE_CR         Set CR as the newline sequence
  PCRE_NEWLINE_CRLF       Set CRLF as the newline sequence
  PCRE_NEWLINE_LF         Set LF as the newline sequence
  PCRE_NO_AUTO_CAPTURE    Disable numbered capturing parentheses
                            (named ones available)
  PCRE_NO_UTF16_CHECK     Do not check the pattern for UTF-16
                            validity (only relevant if
                            PCRE_UTF16 is set)
  PCRE_NO_UTF32_CHECK     Do not check the pattern for UTF-32
                            validity (only relevant if
                            PCRE_UTF32 is set)
  PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
                            validity (only relevant if
                            PCRE_UTF8 is set)
  PCRE_UCP                Use Unicode properties for \d, \w, etc.
  PCRE_UNGREEDY           Invert greediness of quantifiers
  PCRE_UTF16              Run <b>pcre16_compile()</b> in UTF-16 mode
  PCRE_UTF32              Run <b>pcre32_compile()</b> in UTF-32 mode
  PCRE_UTF8               Run <b>pcre_compile()</b> in UTF-8 mode
</pre>
PCRE must be built with UTF support in order to use PCRE_UTF8/16/32 and
PCRE_NO_UTF8/16/32_CHECK, and with UCP support if PCRE_UCP is used.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected. Note that
compiling regular expressions with one version of PCRE for use with a different
version is not guaranteed to work and may cause crashes.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,112 @@
<html>
<head>
<title>pcre_compile2 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_compile2 man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>pcre *pcre_compile2(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre16 *pcre16_compile2(PCRE_SPTR16 <i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre32 *pcre32_compile2(PCRE_SPTR32 <i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function compiles a regular expression into an internal form. It is the
same as <b>pcre[16|32]_compile()</b>, except for the addition of the
<i>errorcodeptr</i> argument. The arguments are:
<pre>
  <i>pattern</i>       A zero-terminated string containing the
                  regular expression to be compiled
  <i>options</i>       Zero or more option bits
  <i>errorcodeptr</i>  Where to put an error code
  <i>errptr</i>        Where to put an error message
  <i>erroffset</i>     Offset in pattern where error was found
  <i>tableptr</i>      Pointer to character tables, or NULL to
                  use the built-in default
</pre>
The option bits are:
<pre>
  PCRE_ANCHORED           Force pattern anchoring
  PCRE_AUTO_CALLOUT       Compile automatic callouts
  PCRE_BSR_ANYCRLF        \R matches only CR, LF, or CRLF
  PCRE_BSR_UNICODE        \R matches all Unicode line endings
  PCRE_CASELESS           Do caseless matching
  PCRE_DOLLAR_ENDONLY     $ not to match newline at end
  PCRE_DOTALL             . matches anything including NL
  PCRE_DUPNAMES           Allow duplicate names for subpatterns
  PCRE_EXTENDED           Ignore white space and # comments
  PCRE_EXTRA              PCRE extra features
                            (not much use currently)
  PCRE_FIRSTLINE          Force matching to be before newline
  PCRE_JAVASCRIPT_COMPAT  JavaScript compatibility
  PCRE_MULTILINE          ^ and $ match newlines within data
  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline
                            sequences
  PCRE_NEWLINE_CR         Set CR as the newline sequence
  PCRE_NEWLINE_CRLF       Set CRLF as the newline sequence
  PCRE_NEWLINE_LF         Set LF as the newline sequence
  PCRE_NO_AUTO_CAPTURE    Disable numbered capturing parentheses
                            (named ones available)
  PCRE_NO_UTF16_CHECK     Do not check the pattern for UTF-16
                            validity (only relevant if
                            PCRE_UTF16 is set)
  PCRE_NO_UTF32_CHECK     Do not check the pattern for UTF-32
                            validity (only relevant if
                            PCRE_UTF32 is set)
  PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
                            validity (only relevant if
                            PCRE_UTF8 is set)
  PCRE_UCP                Use Unicode properties for \d, \w, etc.
  PCRE_UNGREEDY           Invert greediness of quantifiers
  PCRE_UTF16              Run <b>pcre16_compile()</b> in UTF-16 mode
  PCRE_UTF32              Run <b>pcre32_compile()</b> in UTF-32 mode
  PCRE_UTF8               Run <b>pcre_compile()</b> in UTF-8 mode
</pre>
PCRE must be built with UTF support in order to use PCRE_UTF8/16/32 and
PCRE_NO_UTF8/16/32_CHECK, and with UCP support if PCRE_UCP is used.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected. Note that
compiling regular expressions with one version of PCRE for use with a different
version is not guaranteed to work and may cause crashes.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,91 @@
<html>
<head>
<title>pcre_config specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_config man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre16_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre32_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes it possible for a client program to find out which optional
features are available in the version of the PCRE library it is using. The
arguments are as follows:
<pre>
  <i>what</i>   A code specifying what information is required
  <i>where</i>  Points to where to put the data
</pre>
The <i>where</i> argument must point to an integer variable, except for
PCRE_CONFIG_MATCH_LIMIT and PCRE_CONFIG_MATCH_LIMIT_RECURSION, when it must
point to an unsigned long integer. The available codes are:
<pre>
  PCRE_CONFIG_JIT           Availability of just-in-time compiler
                            support (1=yes 0=no)
  PCRE_CONFIG_JITTARGET     String containing information about the
                            target architecture for the JIT compiler,
                            or NULL if there is no JIT support
  PCRE_CONFIG_LINK_SIZE     Internal link size: 2, 3, or 4
  PCRE_CONFIG_MATCH_LIMIT   Internal resource limit
  PCRE_CONFIG_MATCH_LIMIT_RECURSION
                            Internal recursion depth limit
  PCRE_CONFIG_NEWLINE       Value of the default newline sequence:
                              13 (0x000d)    for CR
                              10 (0x000a)    for LF
                              3338 (0x0d0a)  for CRLF
                              -2             for ANYCRLF
                              -1             for ANY
  PCRE_CONFIG_BSR           Indicates what \R matches by default:
                              0  all Unicode line endings
                              1  CR, LF, or CRLF only
  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
                            Threshold of return slots, above which
                            <b>malloc()</b> is used by the POSIX API
  PCRE_CONFIG_STACKRECURSE  Recursion implementation (1=stack 0=heap)
  PCRE_CONFIG_UTF16         Availability of UTF-16 support (1=yes
                            0=no); option for <b>pcre16_config()</b>
  PCRE_CONFIG_UTF32         Availability of UTF-32 support (1=yes
                            0=no); option for <b>pcre32_config()</b>
  PCRE_CONFIG_UTF8          Availability of UTF-8 support (1=yes 0=no);
                            option for <b>pcre_config()</b>
  PCRE_CONFIG_UNICODE_PROPERTIES
                            Availability of Unicode property support
                            (1=yes 0=no)
</pre>
The function yields 0 on success or PCRE_ERROR_BADOPTION otherwise. That error
is also given if PCRE_CONFIG_UTF16 or PCRE_CONFIG_UTF32 is passed to
<b>pcre_config()</b>, if PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 is passed to
<b>pcre16_config()</b>, or if PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 is passed to
<b>pcre32_config()</b>.
</P>
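<P>
The integer reported for PCRE_CONFIG_NEWLINE packs the newline sequence into a
single value, with CRLF encoded as (13 &#60;&#60; 8) | 10. The sketch below
decodes the values listed in the table; the helper name
<b>sketch_newline_name()</b> is hypothetical and is not part of the PCRE API.
</P>

```c
/* Map a PCRE_CONFIG_NEWLINE value, as listed in the table above,
   to a readable name. Hypothetical helper, not part of PCRE. */
const char *sketch_newline_name(int value)
{
    switch (value) {
    case 13:             return "CR";
    case 10:             return "LF";
    case (13 << 8) | 10: return "CRLF";     /* 3338, that is 0x0d0a */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return "unknown";
    }
}
```

<P>
In a real program the value would come from a call such as
<b>pcre_config(PCRE_CONFIG_NEWLINE, &#38;value)</b> with <i>value</i> declared
as an int, after checking that the call returned 0.
</P>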
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,65 @@
<html>
<head>
<title>pcre_copy_named_substring specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_copy_named_substring man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_copy_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>char *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre16_copy_named_substring(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR16 <i>stringname</i>,</b>
<b>PCRE_UCHAR16 *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre32_copy_named_substring(const pcre32 *<i>code</i>,</b>
<b>PCRE_SPTR32 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR32 <i>stringname</i>,</b>
<b>PCRE_UCHAR32 *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring, identified
by name, into a given buffer. The arguments are:
<pre>
  <i>code</i>         Pattern that was successfully matched
  <i>subject</i>      Subject that has been successfully matched
  <i>ovector</i>      Offset vector that <b>pcre[16|32]_exec()</b> used
  <i>stringcount</i>  Value returned by <b>pcre[16|32]_exec()</b>
  <i>stringname</i>   Name of the required substring
  <i>buffer</i>       Buffer to receive the string
  <i>buffersize</i>   Size of buffer
</pre>
The yield is the length of the substring, PCRE_ERROR_NOMEMORY if the buffer was
too small, or PCRE_ERROR_NOSUBSTRING if the string name is invalid.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,61 @@
<html>
<head>
<title>pcre_copy_substring specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_copy_substring man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>, char *<i>buffer</i>,</b>
<b>int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre16_copy_substring(PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>, PCRE_UCHAR16 *<i>buffer</i>,</b>
<b>int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre32_copy_substring(PCRE_SPTR32 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>, PCRE_UCHAR32 *<i>buffer</i>,</b>
<b>int <i>buffersize</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring into a given
buffer. The arguments are:
<pre>
  <i>subject</i>       Subject that has been successfully matched
  <i>ovector</i>       Offset vector that <b>pcre[16|32]_exec()</b> used
  <i>stringcount</i>   Value returned by <b>pcre[16|32]_exec()</b>
  <i>stringnumber</i>  Number of the required substring
  <i>buffer</i>        Buffer to receive the string
  <i>buffersize</i>    Size of buffer
</pre>
The yield is the length of the string, PCRE_ERROR_NOMEMORY if the buffer was
too small, or PCRE_ERROR_NOSUBSTRING if the string number is invalid.
</P>
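<P>
To show how the arguments relate, here is a stand-alone model of the 8-bit case
that does not call the PCRE library. It assumes the usual offset-vector layout,
in which substring <i>n</i> starts at <i>ovector[2n]</i> and ends just before
<i>ovector[2n+1]</i>. The function name and the SKETCH_ constants are
hypothetical stand-ins, and their numeric values should be treated as
assumptions; real code should use the names from <b>pcre.h</b>.
</P>

```c
#include <string.h>

/* Local stand-ins for the PCRE error codes named above (values assumed). */
enum { SKETCH_ERROR_NOMEMORY = -6, SKETCH_ERROR_NOSUBSTRING = -7 };

/* Copy captured substring number stringnumber into buffer and
   zero-terminate it. Returns the substring length or a negative code. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    /* Only substrings 0 .. stringcount-1 were set by the match. */
    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;

    /* Room is needed for the substring plus the terminating zero. */
    if (buffersize < length + 1)
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

<P>
With the subject "key=value" and an offset vector of {0, 9, 0, 3, 4, 9},
substring 2 is "value" and the result is 5; a four-byte buffer is too small for
it, and substring number 3 does not exist when <i>stringcount</i> is 3.
</P>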
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,128 @@
<html>
<head>
<title>pcre_dfa_exec specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_dfa_exec man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_dfa_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<P>
<b>int pcre16_dfa_exec(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<P>
<b>int pcre32_dfa_exec(const pcre32 *<i>code</i>, const pcre32_extra *<i>extra</i>,</b>
<b>PCRE_SPTR32 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (<i>not</i> Perl-compatible). Note that the main, Perl-compatible,
matching function is <b>pcre[16|32]_exec()</b>. The arguments for this function
are:
<pre>
  <i>code</i>         Points to the compiled pattern
  <i>extra</i>        Points to an associated <b>pcre[16|32]_extra</b> structure,
                 or is NULL
  <i>subject</i>      Points to the subject string
  <i>length</i>       Length of the subject string, in bytes
                 (in data units for the 16-bit and 32-bit
                 libraries)
  <i>startoffset</i>  Offset in bytes in the subject at which to
                 start matching (in data units for the
                 16-bit and 32-bit libraries)
  <i>options</i>      Option bits
  <i>ovector</i>      Points to a vector of ints for result offsets
  <i>ovecsize</i>     Number of elements in the vector
  <i>workspace</i>    Points to a vector of ints used as working space
  <i>wscount</i>      Number of elements in the vector
</pre>
The options are:
<pre>
  PCRE_ANCHORED           Match only at the first position
  PCRE_BSR_ANYCRLF        \R matches only CR, LF, or CRLF
  PCRE_BSR_UNICODE        \R matches all Unicode line endings
  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline sequences
  PCRE_NEWLINE_CR         Recognize CR as the only newline sequence
  PCRE_NEWLINE_CRLF       Recognize CRLF as the only newline sequence
  PCRE_NEWLINE_LF         Recognize LF as the only newline sequence
  PCRE_NOTBOL             Subject is not the beginning of a line
  PCRE_NOTEOL             Subject is not the end of a line
  PCRE_NOTEMPTY           An empty string is not a valid match
  PCRE_NOTEMPTY_ATSTART   An empty string at the start of the subject
                            is not a valid match
  PCRE_NO_START_OPTIMIZE  Do not do "start-match" optimizations
  PCRE_NO_UTF16_CHECK     Do not check the subject for UTF-16
                            validity (only relevant if PCRE_UTF16
                            was set at compile time)
  PCRE_NO_UTF32_CHECK     Do not check the subject for UTF-32
                            validity (only relevant if PCRE_UTF32
                            was set at compile time)
  PCRE_NO_UTF8_CHECK      Do not check the subject for UTF-8
                            validity (only relevant if PCRE_UTF8
                            was set at compile time)
  PCRE_PARTIAL          ) Return PCRE_ERROR_PARTIAL for a partial
  PCRE_PARTIAL_SOFT     )   match if no full matches are found
  PCRE_PARTIAL_HARD       Return PCRE_ERROR_PARTIAL for a partial match
                            even if there is a full match as well
  PCRE_DFA_SHORTEST       Return only the shortest match
  PCRE_DFA_RESTART        Restart after a partial match
</pre>
There are restrictions on what may appear in a pattern when using this matching
function. Details are given in the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation. For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
page.
</P>
<P>
A <b>pcre[16|32]_extra</b> structure contains the following fields:
<pre>
  <i>flags</i>                  Bits indicating which fields are set
  <i>study_data</i>             Opaque data from <b>pcre[16|32]_study()</b>
  <i>match_limit</i>            Limit on internal resource use
  <i>match_limit_recursion</i>  Limit on internal recursion depth
  <i>callout_data</i>           Opaque data passed back to callouts
  <i>tables</i>                 Points to character tables or is NULL
  <i>mark</i>                   For passing back a *MARK pointer
  <i>executable_jit</i>         Opaque data from JIT compilation
</pre>
The flag bits are PCRE_EXTRA_STUDY_DATA, PCRE_EXTRA_MATCH_LIMIT,
PCRE_EXTRA_MATCH_LIMIT_RECURSION, PCRE_EXTRA_CALLOUT_DATA,
PCRE_EXTRA_TABLES, PCRE_EXTRA_MARK and PCRE_EXTRA_EXECUTABLE_JIT. For this
matching function, the <i>match_limit</i> and <i>match_limit_recursion</i> fields
are not used, and must not be set. The PCRE_EXTRA_EXECUTABLE_JIT flag and
the corresponding variable are ignored.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,110 @@
<html>
<head>
<title>pcre_exec specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_exec man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
</P>
<P>
<b>int pcre16_exec(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
</P>
<P>
<b>int pcre32_exec(const pcre32 *<i>code</i>, const pcre32_extra *<i>extra</i>,</b>
<b>PCRE_SPTR32 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It returns
offsets to captured substrings. Its arguments are:
<pre>
  <i>code</i>         Points to the compiled pattern
  <i>extra</i>        Points to an associated <b>pcre[16|32]_extra</b> structure,
                 or is NULL
  <i>subject</i>      Points to the subject string
  <i>length</i>       Length of the subject string, in bytes
                 (in data units for the 16-bit and 32-bit
                 libraries)
  <i>startoffset</i>  Offset in bytes in the subject at which to
                 start matching (in data units for the
                 16-bit and 32-bit libraries)
  <i>options</i>      Option bits
  <i>ovector</i>      Points to a vector of ints for result offsets
  <i>ovecsize</i>     Number of elements in the vector (a multiple of 3)
</pre>
The options are:
<pre>
  PCRE_ANCHORED           Match only at the first position
  PCRE_BSR_ANYCRLF        \R matches only CR, LF, or CRLF
  PCRE_BSR_UNICODE        \R matches all Unicode line endings
  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline sequences
  PCRE_NEWLINE_CR         Recognize CR as the only newline sequence
  PCRE_NEWLINE_CRLF       Recognize CRLF as the only newline sequence
  PCRE_NEWLINE_LF         Recognize LF as the only newline sequence
  PCRE_NOTBOL             Subject string is not the beginning of a line
  PCRE_NOTEOL             Subject string is not the end of a line
  PCRE_NOTEMPTY           An empty string is not a valid match
  PCRE_NOTEMPTY_ATSTART   An empty string at the start of the subject
                            is not a valid match
  PCRE_NO_START_OPTIMIZE  Do not do "start-match" optimizations
  PCRE_NO_UTF16_CHECK     Do not check the subject for UTF-16
                            validity (only relevant if PCRE_UTF16
                            was set at compile time)
  PCRE_NO_UTF32_CHECK     Do not check the subject for UTF-32
                            validity (only relevant if PCRE_UTF32
                            was set at compile time)
  PCRE_NO_UTF8_CHECK      Do not check the subject for UTF-8
                            validity (only relevant if PCRE_UTF8
                            was set at compile time)
  PCRE_PARTIAL          ) Return PCRE_ERROR_PARTIAL for a partial
  PCRE_PARTIAL_SOFT     )   match if no full matches are found
  PCRE_PARTIAL_HARD       Return PCRE_ERROR_PARTIAL for a partial match
                            if that is found before a full match
</pre>
For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
page. A <b>pcre_extra</b> structure contains the following fields:
<pre>
  <i>flags</i>                  Bits indicating which fields are set
  <i>study_data</i>             Opaque data from <b>pcre[16|32]_study()</b>
  <i>match_limit</i>            Limit on internal resource use
  <i>match_limit_recursion</i>  Limit on internal recursion depth
  <i>callout_data</i>           Opaque data passed back to callouts
  <i>tables</i>                 Points to character tables or is NULL
  <i>mark</i>                   For passing back a *MARK pointer
  <i>executable_jit</i>         Opaque data from JIT compilation
</pre>
The flag bits are PCRE_EXTRA_STUDY_DATA, PCRE_EXTRA_MATCH_LIMIT,
PCRE_EXTRA_MATCH_LIMIT_RECURSION, PCRE_EXTRA_CALLOUT_DATA,
PCRE_EXTRA_TABLES, PCRE_EXTRA_MARK and PCRE_EXTRA_EXECUTABLE_JIT.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,46 @@
<html>
<head>
<title>pcre_free_study specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_free_study man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>void pcre_free_study(pcre_extra *<i>extra</i>);</b>
</P>
<P>
<b>void pcre16_free_study(pcre16_extra *<i>extra</i>);</b>
</P>
<P>
<b>void pcre32_free_study(pcre32_extra *<i>extra</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to free the memory used for the data generated by a call
to <b>pcre[16|32]_study()</b> when it is no longer needed. The argument must be the
result of such a call.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,46 @@
<html>
<head>
<title>pcre_free_substring specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_free_substring man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>void pcre_free_substring(const char *<i>stringptr</i>);</b>
</P>
<P>
<b>void pcre16_free_substring(PCRE_SPTR16 <i>stringptr</i>);</b>
</P>
<P>
<b>void pcre32_free_substring(PCRE_SPTR32 <i>stringptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for freeing the store obtained by a previous
call to <b>pcre[16|32]_get_substring()</b> or <b>pcre[16|32]_get_named_substring()</b>.
Its only argument is a pointer to the string.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,46 @@
<html>
<head>
<title>pcre_free_substring_list specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_free_substring_list man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>void pcre_free_substring_list(const char **<i>stringptr</i>);</b>
</P>
<P>
<b>void pcre16_free_substring_list(PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<P>
<b>void pcre32_free_substring_list(PCRE_SPTR32 *<i>stringptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for freeing the store obtained by a previous
call to <b>pcre[16|32]_get_substring_list()</b>. Its only argument is a pointer to
the list of string pointers.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,108 @@
<html>
<head>
<title>pcre_fullinfo specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_fullinfo man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_fullinfo(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre16_fullinfo(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre32_fullinfo(const pcre32 *<i>code</i>, const pcre32_extra *<i>extra</i>,</b>
<b>int <i>what</i>, void *<i>where</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns information about a compiled pattern. Its arguments are:
<pre>
<i>code</i> Compiled regular expression
<i>extra</i> Result of <b>pcre[16|32]_study()</b> or NULL
<i>what</i> What information is required
<i>where</i> Where to put the information
</pre>
The following information is available:
<pre>
PCRE_INFO_BACKREFMAX Number of highest back reference
PCRE_INFO_CAPTURECOUNT Number of capturing subpatterns
PCRE_INFO_DEFAULT_TABLES Pointer to default tables
PCRE_INFO_FIRSTBYTE Fixed first data unit for a match, or
-1 for start of string
or after newline, or
-2 otherwise
PCRE_INFO_FIRSTTABLE Table of first data units (after studying)
PCRE_INFO_HASCRORLF Return 1 if explicit CR or LF matches exist
PCRE_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE_INFO_JIT Return 1 after successful JIT compilation
PCRE_INFO_JITSIZE Size of JIT compiled code
PCRE_INFO_LASTLITERAL Literal last data unit required
PCRE_INFO_MINLENGTH Lower bound length of matching strings
PCRE_INFO_NAMECOUNT Number of named subpatterns
PCRE_INFO_NAMEENTRYSIZE Size of name table entry
PCRE_INFO_NAMETABLE Pointer to name table
PCRE_INFO_OKPARTIAL Return 1 if partial matching can be tried
(always returns 1 after release 8.00)
PCRE_INFO_OPTIONS Option bits used for compilation
PCRE_INFO_SIZE Size of compiled pattern
PCRE_INFO_STUDYSIZE Size of study data
PCRE_INFO_FIRSTCHARACTER Fixed first data unit for a match
PCRE_INFO_FIRSTCHARACTERFLAGS Returns
1 if there is a first data character set, which can
then be retrieved using PCRE_INFO_FIRSTCHARACTER,
2 if the first character is at the start of the data
string or after a newline, and
0 otherwise
PCRE_INFO_REQUIREDCHAR Literal last data unit required
PCRE_INFO_REQUIREDCHARFLAGS Returns 1 if the last data character is set (which can then
be retrieved using PCRE_INFO_REQUIREDCHAR); 0 otherwise
</pre>
The <i>where</i> argument must point to an integer variable, except for the
following <i>what</i> values:
<pre>
PCRE_INFO_DEFAULT_TABLES const unsigned char *
PCRE_INFO_FIRSTTABLE const unsigned char *
PCRE_INFO_NAMETABLE PCRE_SPTR16 (16-bit library)
PCRE_INFO_NAMETABLE PCRE_SPTR32 (32-bit library)
PCRE_INFO_NAMETABLE const unsigned char * (8-bit library)
PCRE_INFO_OPTIONS unsigned long int
PCRE_INFO_SIZE size_t
PCRE_INFO_FIRSTCHARACTER uint32_t
PCRE_INFO_REQUIREDCHAR uint32_t
</pre>
The yield of the function is zero on success or:
<pre>
PCRE_ERROR_NULL the argument <i>code</i> was NULL
the argument <i>where</i> was NULL
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of <i>what</i> was invalid
</PRE>
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,68 @@
<html>
<head>
<title>pcre_get_named_substring specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_get_named_substring man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_get_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>const char **<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre16_get_named_substring(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR16 <i>stringname</i>,</b>
<b>PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre32_get_named_substring(const pcre32 *<i>code</i>,</b>
<b>PCRE_SPTR32 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, PCRE_SPTR32 <i>stringname</i>,</b>
<b>PCRE_SPTR32 *<i>stringptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring by name. The
arguments are:
<pre>
<i>code</i> Compiled pattern
<i>subject</i> Subject that has been successfully matched
<i>ovector</i> Offset vector that <b>pcre[16|32]_exec()</b> used
<i>stringcount</i> Value returned by <b>pcre[16|32]_exec()</b>
<i>stringname</i> Name of the required substring
<i>stringptr</i> Where to put the string pointer
</pre>
The memory in which the substring is placed is obtained by calling
<b>pcre[16|32]_malloc()</b>. The convenience function
<b>pcre[16|32]_free_substring()</b> can be used to free it when it is no longer
needed. The yield of the function is the length of the extracted substring,
PCRE_ERROR_NOMEMORY if sufficient memory could not be obtained, or
PCRE_ERROR_NOSUBSTRING if the string name is invalid.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,57 @@
<html>
<head>
<title>pcre_get_stringnumber specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_get_stringnumber man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_get_stringnumber(const pcre *<i>code</i>,</b>
<b>const char *<i>name</i>);</b>
</P>
<P>
<b>int pcre16_get_stringnumber(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>name</i>);</b>
</P>
<P>
<b>int pcre32_get_stringnumber(const pcre32 *<i>code</i>,</b>
<b>PCRE_SPTR32 <i>name</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This convenience function finds the number of a named substring capturing
parenthesis in a compiled pattern. Its arguments are:
<pre>
<i>code</i> Compiled regular expression
<i>name</i> Name whose number is required
</pre>
The yield of the function is the number of the parenthesis if the name is
found, or PCRE_ERROR_NOSUBSTRING otherwise. When duplicate names are allowed
(PCRE_DUPNAMES is set), it is not defined which of the numbers is returned by
<b>pcre[16|32]_get_stringnumber()</b>. You can obtain the complete list by calling
<b>pcre[16|32]_get_stringtable_entries()</b>.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,60 @@
<html>
<head>
<title>pcre_get_stringtable_entries specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_get_stringtable_entries man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_get_stringtable_entries(const pcre *<i>code</i>,</b>
<b>const char *<i>name</i>, char **<i>first</i>, char **<i>last</i>);</b>
</P>
<P>
<b>int pcre16_get_stringtable_entries(const pcre16 *<i>code</i>,</b>
<b>PCRE_SPTR16 <i>name</i>, PCRE_UCHAR16 **<i>first</i>, PCRE_UCHAR16 **<i>last</i>);</b>
</P>
<P>
<b>int pcre32_get_stringtable_entries(const pcre32 *<i>code</i>,</b>
<b>PCRE_SPTR32 <i>name</i>, PCRE_UCHAR32 **<i>first</i>, PCRE_UCHAR32 **<i>last</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This convenience function finds, for a compiled pattern, the first and last
entries for a given name in the table that translates capturing parenthesis
names into numbers. When names are required to be unique (PCRE_DUPNAMES is
<i>not</i> set), it is usually easier to use <b>pcre[16|32]_get_stringnumber()</b>
instead.
<pre>
<i>code</i> Compiled regular expression
<i>name</i> Name whose entries are required
<i>first</i> Where to return a pointer to the first entry
<i>last</i> Where to return a pointer to the last entry
</pre>
The yield of the function is the length of each entry, or
PCRE_ERROR_NOSUBSTRING if none are found.
</P>
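<P>
As a rough illustration of what the <i>first</i> and <i>last</i> pointers refer
to, the sketch below scans a hand-built 8-bit name table. It assumes the entry
layout given in <b>pcreapi</b> (a two-byte group number, most significant byte
first, followed by the zero-terminated name). The helpers
<b>find_name_entries()</b> and <b>entry_group_number()</b> are hypothetical and
not part of PCRE, and -1 stands in for PCRE_ERROR_NOSUBSTRING.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: scans an 8-bit name table and
   reports the first and last entries that carry the given name. Each entry
   is assumed to be entry_size bytes: a two-byte group number (most
   significant byte first) followed by the zero-terminated name.
   Returns entry_size, or -1 if no entry matches. */
int find_name_entries(const unsigned char *table, int name_count,
                      int entry_size, const char *name,
                      const unsigned char **first,
                      const unsigned char **last)
{
    int found = 0;
    assert(table != NULL && name != NULL && first != NULL && last != NULL);
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0) {
            if (!found) *first = entry;   /* remember the first match only */
            *last = entry;                /* keep advancing the last match */
            found = 1;
        }
    }
    return found ? entry_size : -1;
}

/* Decode the capture group number stored in the first two bytes of an entry. */
int entry_group_number(const unsigned char *entry)
{
    return (entry[0] << 8) | entry[1];
}
```

<P>
With duplicate names, stepping from <i>first</i> to <i>last</i> in increments of
the returned entry size visits every group number that shares the name.
</P>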
<P>
There is a complete description of the PCRE native API, including the format of
the table entries, in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page, and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,64 @@
<html>
<head>
<title>pcre_get_substring specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_get_substring man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_get_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
<b>const char **<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre16_get_substring(PCRE_SPTR16 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
<b>PCRE_SPTR16 *<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre32_get_substring(PCRE_SPTR32 <i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
<b>PCRE_SPTR32 *<i>stringptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring. The
arguments are:
<pre>
<i>subject</i> Subject that has been successfully matched
<i>ovector</i> Offset vector that <b>pcre[16|32]_exec()</b> used
<i>stringcount</i> Value returned by <b>pcre[16|32]_exec()</b>
<i>stringnumber</i> Number of the required substring
<i>stringptr</i> Where to put the string pointer
</pre>
The memory in which the substring is placed is obtained by calling
<b>pcre[16|32]_malloc()</b>. The convenience function
<b>pcre[16|32]_free_substring()</b> can be used to free it when it is no longer
needed. The yield of the function is the length of the substring,
PCRE_ERROR_NOMEMORY if sufficient memory could not be obtained, or
PCRE_ERROR_NOSUBSTRING if the string number is invalid.
</P>
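<P>
The relationship between <i>ovector</i>, <i>stringcount</i> and the extracted
string can be sketched without calling PCRE at all. The helper
<b>copy_capture()</b> below is hypothetical: it uses plain <b>malloc()</b>
rather than <b>pcre_malloc()</b>, and its -1 and -2 results stand in for
PCRE_ERROR_NOSUBSTRING and PCRE_ERROR_NOMEMORY.
</P>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: copies capture group `number` out
   of `subject` using the offset pairs in `ovector`. Pair n is stored at
   ovector[2*n] (start offset) and ovector[2*n + 1] (end offset).
   Returns the substring length, -1 if the number is out of range, or -2 if
   malloc() fails. The caller frees *out with free(). */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int number, char **out)
{
    assert(subject != NULL && ovector != NULL && out != NULL);
    if (number < 0 || number >= stringcount) return -1;

    int start = ovector[2 * number];
    int length = ovector[2 * number + 1] - start;
    if (start < 0) {            /* group did not take part in the match */
        start = 0;
        length = 0;
    }

    char *copy = malloc((size_t)length + 1);
    if (copy == NULL) return -2;
    memcpy(copy, subject + start, (size_t)length);
    copy[length] = '\0';
    *out = copy;
    return length;
}
```

<P>
For the subject "key=value" and offsets {0, 9, 0, 3, 4, 9}, group 2 yields the
five-character string "value".
</P>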
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,61 @@
<html>
<head>
<title>pcre_get_substring_list specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_get_substring_list man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_get_substring_list(const char *<i>subject</i>,</b>
<b>int *<i>ovector</i>, int <i>stringcount</i>, const char ***<i>listptr</i>);</b>
</P>
<P>
<b>int pcre16_get_substring_list(PCRE_SPTR16 <i>subject</i>,</b>
<b>int *<i>ovector</i>, int <i>stringcount</i>, PCRE_SPTR16 **<i>listptr</i>);</b>
</P>
<P>
<b>int pcre32_get_substring_list(PCRE_SPTR32 <i>subject</i>,</b>
<b>int *<i>ovector</i>, int <i>stringcount</i>, PCRE_SPTR32 **<i>listptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a list of all the captured
substrings. The arguments are:
<pre>
<i>subject</i> Subject that has been successfully matched
<i>ovector</i> Offset vector that <b>pcre[16|32]_exec()</b> used
<i>stringcount</i> Value returned by <b>pcre[16|32]_exec()</b>
<i>listptr</i> Where to put a pointer to the list
</pre>
The memory in which the substrings and the list are placed is obtained by
calling <b>pcre[16|32]_malloc()</b>. The convenience function
<b>pcre[16|32]_free_substring_list()</b> can be used to free it when it is no
longer needed. A pointer to a list of pointers is put in the variable whose
address is in <i>listptr</i>. The list is terminated by a NULL pointer. The
yield of the function is zero on success or PCRE_ERROR_NOMEMORY if sufficient
memory could not be obtained.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,108 @@
<html>
<head>
<title>pcre_jit_exec specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_jit_exec man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_jit_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>pcre_jit_stack *<i>jstack</i>);</b>
</P>
<P>
<b>int pcre16_jit_exec(const pcre16 *<i>code</i>, const pcre16_extra *<i>extra</i>,</b>
<b>PCRE_SPTR16 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>pcre16_jit_stack *<i>jstack</i>);</b>
</P>
<P>
<b>int pcre32_jit_exec(const pcre32 *<i>code</i>, const pcre32_extra *<i>extra</i>,</b>
<b>PCRE_SPTR32 <i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>pcre32_jit_stack *<i>jstack</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression that has been successfully
studied with one of the JIT options against a given subject string, using a
matching algorithm that is similar to Perl's. It is a "fast path" interface to
JIT, and it bypasses some of the sanity checks that <b>pcre_exec()</b> applies.
It returns offsets to captured substrings. Its arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>extra</i> Points to an associated <b>pcre[16|32]_extra</b> structure,
or is NULL
<i>subject</i> Points to the subject string
<i>length</i> Length of the subject string, in bytes
<i>startoffset</i> Offset in bytes in the subject at which to
start matching
<i>options</i> Option bits
<i>ovector</i> Points to a vector of ints for result offsets
<i>ovecsize</i> Number of elements in the vector (a multiple of 3)
<i>jstack</i> Pointer to a JIT stack
</pre>
The allowed options are:
<pre>
PCRE_NOTBOL Subject string is not the beginning of a line
PCRE_NOTEOL Subject string is not the end of a line
PCRE_NOTEMPTY An empty string is not a valid match
PCRE_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE_NO_UTF16_CHECK Do not check the subject for UTF-16
validity (only relevant if PCRE_UTF16
was set at compile time)
PCRE_NO_UTF32_CHECK Do not check the subject for UTF-32
validity (only relevant if PCRE_UTF32
was set at compile time)
PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
validity (only relevant if PCRE_UTF8
was set at compile time)
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
if that is found before a full match
</pre>
However, the PCRE_NO_UTF[8|16|32]_CHECK options have no effect, as this check
is never applied. For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
page. A <b>pcre_extra</b> structure contains the following fields:
<pre>
<i>flags</i> Bits indicating which fields are set
<i>study_data</i> Opaque data from <b>pcre[16|32]_study()</b>
<i>match_limit</i> Limit on internal resource use
<i>match_limit_recursion</i> Limit on internal recursion depth
<i>callout_data</i> Opaque data passed back to callouts
<i>tables</i> Points to character tables or is NULL
<i>mark</i> For passing back a *MARK pointer
<i>executable_jit</i> Opaque data from JIT compilation
</pre>
The flag bits are PCRE_EXTRA_STUDY_DATA, PCRE_EXTRA_MATCH_LIMIT,
PCRE_EXTRA_MATCH_LIMIT_RECURSION, PCRE_EXTRA_CALLOUT_DATA,
PCRE_EXTRA_TABLES, PCRE_EXTRA_MARK and PCRE_EXTRA_EXECUTABLE_JIT.
</P>
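<P>
Because <i>ovecsize</i> must be a multiple of 3, it can help to compute the
vector size from the number of capturing subpatterns. The sketch below assumes
the layout that <b>pcreapi</b> describes for <b>pcre[16|32]_exec()</b>: one
start/end pair per group, including group 0 for the whole match, with the final
third of the vector used as workspace. Both helper names are hypothetical.
</P>

```c
#include <assert.h>

/* Hypothetical helper: number of ints to allocate for a pattern with
   `capture_count` capturing subpatterns. Group 0 (the whole match) needs a
   pair as well, and one extra int per pair is reserved as workspace. */
int ovector_elements(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: largest multiple of 3 that fits in a buffer of
   `available` ints, suitable for passing as ovecsize. */
int usable_ovecsize(int available)
{
    assert(available >= 0);
    return available - (available % 3);
}
```

<P>
A pattern with two capturing subpatterns therefore needs nine ints, and a
32-element buffer should be passed with an <i>ovecsize</i> of 30.
</P>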
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the JIT API in the
<a href="pcrejit.html"><b>pcrejit</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,55 @@
<html>
<head>
<title>pcre_jit_stack_alloc specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_jit_stack_alloc man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>pcre_jit_stack *pcre_jit_stack_alloc(int <i>startsize</i>,</b>
<b>int <i>maxsize</i>);</b>
</P>
<P>
<b>pcre16_jit_stack *pcre16_jit_stack_alloc(int <i>startsize</i>,</b>
<b>int <i>maxsize</i>);</b>
</P>
<P>
<b>pcre32_jit_stack *pcre32_jit_stack_alloc(int <i>startsize</i>,</b>
<b>int <i>maxsize</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to create a stack for use by the code compiled by the JIT
optimization of <b>pcre[16|32]_study()</b>. The arguments are a starting size for
the stack, and a maximum size to which it is allowed to grow. The result can be
passed to the JIT run-time code by <b>pcre[16|32]_assign_jit_stack()</b>, or that
function can set up a callback for obtaining a stack. A maximum stack size of
512K to 1M should be more than enough for any pattern. For more details, see
the
<a href="pcrejit.html"><b>pcrejit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,48 @@
<html>
<head>
<title>pcre_jit_stack_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_jit_stack_free man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>void pcre_jit_stack_free(pcre_jit_stack *<i>stack</i>);</b>
</P>
<P>
<b>void pcre16_jit_stack_free(pcre16_jit_stack *<i>stack</i>);</b>
</P>
<P>
<b>void pcre32_jit_stack_free(pcre32_jit_stack *<i>stack</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to free a JIT stack that was created by
<b>pcre[16|32]_jit_stack_alloc()</b> when it is no longer needed. For more details,
see the
<a href="pcrejit.html"><b>pcrejit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,48 @@
<html>
<head>
<title>pcre_maketables specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_maketables man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>const unsigned char *pcre_maketables(void);</b>
</P>
<P>
<b>const unsigned char *pcre16_maketables(void);</b>
</P>
<P>
<b>const unsigned char *pcre32_maketables(void);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function builds a set of character tables for character values less than
256. These can be passed to <b>pcre[16|32]_compile()</b> to override PCRE's
internal, built-in tables (which were made by <b>pcre[16|32]_maketables()</b> when
PCRE was compiled). You might want to do this if you are using a non-standard
locale. The function yields a pointer to the tables.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,58 @@
<html>
<head>
<title>pcre_pattern_to_host_byte_order specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_pattern_to_host_byte_order man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_pattern_to_host_byte_order(pcre *<i>code</i>,</b>
<b>pcre_extra *<i>extra</i>, const unsigned char *<i>tables</i>);</b>
</P>
<P>
<b>int pcre16_pattern_to_host_byte_order(pcre16 *<i>code</i>,</b>
<b>pcre16_extra *<i>extra</i>, const unsigned char *<i>tables</i>);</b>
</P>
<P>
<b>int pcre32_pattern_to_host_byte_order(pcre32 *<i>code</i>,</b>
<b>pcre32_extra *<i>extra</i>, const unsigned char *<i>tables</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function ensures that the bytes in 2-byte and 4-byte values in a compiled
pattern are in the correct order for the current host. It is useful when a
pattern that has been compiled on one host is transferred to another that might
have different endianness. The arguments are:
<pre>
<i>code</i> A compiled regular expression
<i>extra</i> Points to an associated <b>pcre[16|32]_extra</b> structure,
or is NULL
<i>tables</i> Pointer to character tables, or NULL to
set the built-in default
</pre>
The result is 0 for success, a negative PCRE_ERROR_xxx value otherwise.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,51 @@
<html>
<head>
<title>pcre_refcount specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_refcount man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre_refcount(pcre *<i>code</i>, int <i>adjust</i>);</b>
</P>
<P>
<b>int pcre16_refcount(pcre16 *<i>code</i>, int <i>adjust</i>);</b>
</P>
<P>
<b>int pcre32_refcount(pcre32 *<i>code</i>, int <i>adjust</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to maintain a reference count inside a data block that
contains a compiled pattern. Its arguments are:
<pre>
<i>code</i> Compiled regular expression
<i>adjust</i> Adjustment to reference value
</pre>
The yield of the function is the adjusted reference value, which is constrained
to lie between 0 and 65535.
</P>
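<P>
The clamping behaviour can be modelled in a few lines. This is an illustrative
sketch only, not PCRE's source. The function <b>adjust_refcount()</b> is
hypothetical and works on a plain integer rather than a compiled pattern.
</P>

```c
#include <assert.h>

/* Illustrative model only: applies `adjust` to a reference count and clamps
   the result to the documented range 0..65535. */
int adjust_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    long long next = (long long)current + adjust;  /* avoid int overflow */
    if (next < 0) next = 0;
    if (next > 65535) next = 65535;
    return (int)next;
}
```

<P>
In this model an adjustment of zero simply reads back the current value, and
decrementing below zero leaves the count at zero.
</P>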
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,68 @@
<html>
<head>
<title>pcre_study specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_study man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>);</b>
</P>
<P>
<b>pcre16_extra *pcre16_study(const pcre16 *<i>code</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>);</b>
</P>
<P>
<b>pcre32_extra *pcre32_study(const pcre32 *<i>code</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function studies a compiled pattern, to see if additional information can
be extracted that might speed up matching. Its arguments are:
<pre>
<i>code</i> A compiled regular expression
<i>options</i> Options for <b>pcre[16|32]_study()</b>
<i>errptr</i> Where to put an error message
</pre>
If the function succeeds, it returns a value that can be passed to
<b>pcre[16|32]_exec()</b> or <b>pcre[16|32]_dfa_exec()</b> via their <i>extra</i>
arguments.
</P>
<P>
If the function returns NULL, either it could not find any additional
information, or there was an error. You can tell the difference by checking the
pointer returned via <i>errptr</i>: it is NULL in the first case and points to
an error message in the second.
</P>
<P>
The only option is PCRE_STUDY_JIT_COMPILE. It requests just-in-time compilation
if possible. If PCRE has been compiled without JIT support, this option is
ignored. See the
<a href="pcrejit.html"><b>pcrejit</b></a>
page for further details.
</P>
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,57 @@
<html>
<head>
<title>pcre_utf16_to_host_byte_order specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_utf16_to_host_byte_order man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b>
<b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>host_byte_order</i>,</b>
<b>int <i>keep_boms</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function, which exists only in the 16-bit library, converts a UTF-16
string to the correct order for the current host, taking account of any byte
order marks (BOMs) within the string. Its arguments are:
<pre>
<i>output</i> pointer to output buffer, may be the same as <i>input</i>
<i>input</i> pointer to input buffer
<i>length</i> number of 16-bit units in the input, or negative for
a zero-terminated string
<i>host_byte_order</i> a NULL value or a non-zero value pointed to means
start in host byte order
<i>keep_boms</i> if non-zero, BOMs are copied to the output string
</pre>
The result of the function is the number of 16-bit units placed into the output
buffer, including the zero terminator if the string was zero-terminated.
</P>
<P>
If <i>host_byte_order</i> is not NULL, it is set to indicate the byte order that
is current at the end of the string.
</P>
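<P>
The behaviour described above can be pictured with the following stand-alone
sketch. It is an illustration only, not the library's source: the name
<b>utf16_to_host_sketch()</b> and the use of <b>uint16_t</b> are assumptions,
and it assumes that a kept BOM is written in host order.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative re-implementation of the documented behaviour. */
int utf16_to_host_sketch(uint16_t *output, const uint16_t *input, int length,
                         int *host_byte_order, int keep_boms)
{
    /* NULL, or a non-zero value pointed to, means "start in host byte order". */
    int host_bo = (host_byte_order != NULL) ? (*host_byte_order != 0) : 1;
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units, including the terminator. */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];
        if (c == 0xFEFF || c == 0xFFFE) {
            /* A BOM read as 0xFEFF is in host order; 0xFFFE is byte-swapped. */
            host_bo = (c == 0xFEFF);
            if (keep_boms)
                output[written++] = 0xFEFF;  /* assumption: kept in host order */
        } else {
            /* Swap the two bytes when the current order is not the host's. */
            output[written++] = host_bo ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    if (host_byte_order != NULL)
        *host_byte_order = host_bo;
    return written;   /* number of 16-bit units placed in the output */
}
```

<P>
Because the write position never overtakes the read position, the sketch also
works when <i>output</i> and <i>input</i> point to the same buffer.
</P>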
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,46 @@
<html>
<head>
<title>pcre_version specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre_version man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>const char *pcre_version(void);</b>
</P>
<P>
<b>const char *pcre16_version(void);</b>
</P>
<P>
<b>const char *pcre32_version(void);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function (even in the 16-bit and 32-bit libraries) returns a
zero-terminated, 8-bit character string that gives the version number of the
PCRE library and the date of its release.
</P>
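<P>
A caller that needs the numeric version might parse the returned string as
sketched below. The helper <b>parse_pcre_version()</b> is hypothetical, and
the sketch assumes the string has the shape "8.32 2012-11-30", that is, a
major.minor number followed by the release date.
</P>

```c
#include <stdio.h>

/* Hypothetical helper: extracts the numeric version from a string shaped like
   "8.32 2012-11-30". Returns 1 on success, or 0 if the string does not start
   with "major.minor". */
int parse_pcre_version(const char *version, int *major, int *minor)
{
    return sscanf(version, "%d.%d", major, minor) == 2;
}

/* Intended use with the real API (needs <pcre.h>):
 *
 *     int major, minor;
 *     if (parse_pcre_version(pcre_version(), &major, &minor)) { ... }
 */
```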
<P>
There is a complete description of the PCRE native API in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page and a description of the POSIX API in the
<a href="pcreposix.html"><b>pcreposix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


View File

@ -0,0 +1,517 @@
<html>
<head>
<title>pcrebuild specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrebuild man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE BUILD-TIME OPTIONS</a>
<li><a name="TOC2" href="#SEC2">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC3" href="#SEC3">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">C++ SUPPORT</a>
<li><a name="TOC5" href="#SEC5">UTF-8, UTF-16 AND UTF-32 SUPPORT</a>
<li><a name="TOC6" href="#SEC6">UNICODE CHARACTER PROPERTY SUPPORT</a>
<li><a name="TOC7" href="#SEC7">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC8" href="#SEC8">CODE VALUE OF NEWLINE</a>
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
<li><a name="TOC10" href="#SEC10">POSIX MALLOC USAGE</a>
<li><a name="TOC11" href="#SEC11">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC12" href="#SEC12">AVOIDING EXCESSIVE STACK USAGE</a>
<li><a name="TOC13" href="#SEC13">LIMITING PCRE RESOURCE USAGE</a>
<li><a name="TOC14" href="#SEC14">CREATING CHARACTER TABLES AT BUILD TIME</a>
<li><a name="TOC15" href="#SEC15">USING EBCDIC CODE</a>
<li><a name="TOC16" href="#SEC16">PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">PCREGREP BUFFER SIZE</a>
<li><a name="TOC18" href="#SEC18">PCRETEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC19" href="#SEC19">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC20" href="#SEC20">CODE COVERAGE REPORTING</a>
<li><a name="TOC21" href="#SEC21">SEE ALSO</a>
<li><a name="TOC22" href="#SEC22">AUTHOR</a>
<li><a name="TOC23" href="#SEC23">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE BUILD-TIME OPTIONS</a><br>
<P>
This document describes the optional features of PCRE that can be selected when
the library is compiled. It assumes use of the <b>configure</b> script, where
the optional features are selected or deselected by providing options to
<b>configure</b> before running the <b>make</b> command. However, the same
options can be selected in both Unix-like and non-Unix-like environments using
the GUI facility of <b>cmake-gui</b> if you are using <b>CMake</b> instead of
<b>configure</b> to build PCRE.
</P>
<P>
There is a lot more information about building PCRE without using
<b>configure</b> (including information about using <b>CMake</b> or building "by
hand") in the file called <i>NON-AUTOTOOLS-BUILD</i>, which is part of the PCRE
distribution. You should consult this file as well as the <i>README</i> file if
you are building in a non-Unix-like environment.
</P>
<P>
The complete list of options for <b>configure</b> (which includes the standard
ones such as the selection of the installation directory) can be obtained by
running
<pre>
./configure --help
</pre>
The following sections include descriptions of options whose names begin with
--enable or --disable. These settings specify changes to the defaults for the
<b>configure</b> command. Because of the way that <b>configure</b> works,
--enable and --disable always come in pairs, so the complementary option always
exists as well, but as it specifies the default, it is not described.
</P>
<br><a name="SEC2" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
By default, a library called <b>libpcre</b> is built, containing functions that
take string arguments contained in vectors of bytes, either as single-byte
characters, or interpreted as UTF-8 strings. You can also build a separate
library, called <b>libpcre16</b>, in which strings are contained in vectors of
16-bit data units and interpreted either as single-unit characters or UTF-16
strings, by adding
<pre>
--enable-pcre16
</pre>
to the <b>configure</b> command. You can also build a separate
library, called <b>libpcre32</b>, in which strings are contained in vectors of
32-bit data units and interpreted either as single-unit characters or UTF-32
strings, by adding
<pre>
--enable-pcre32
</pre>
to the <b>configure</b> command. If you do not want the 8-bit library, add
<pre>
--disable-pcre8
</pre>
as well. At least one of the three libraries must be built. Note that the C++
and POSIX wrappers are for the 8-bit library only, and that <b>pcregrep</b> is
an 8-bit program. None of these are built if you select only the 16-bit or
32-bit libraries.
</P>
<br><a name="SEC3" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P>
The PCRE building process uses <b>libtool</b> to build both shared and static
Unix libraries by default. You can suppress one of these by adding one of
<pre>
--disable-shared
--disable-static
</pre>
to the <b>configure</b> command, as required.
</P>
<br><a name="SEC4" href="#TOC1">C++ SUPPORT</a><br>
<P>
By default, if the 8-bit library is being built, the <b>configure</b> script
will search for a C++ compiler and C++ header files. If it finds them, it
automatically builds the C++ wrapper library (which supports only 8-bit
strings). You can disable this by adding
<pre>
--disable-cpp
</pre>
to the <b>configure</b> command.
</P>
<br><a name="SEC5" href="#TOC1">UTF-8, UTF-16 AND UTF-32 SUPPORT</a><br>
<P>
To build PCRE with support for UTF Unicode character strings, add
<pre>
--enable-utf
</pre>
to the <b>configure</b> command. This setting applies to all three libraries,
adding support for UTF-8 to the 8-bit library, support for UTF-16 to the 16-bit
library, and support for UTF-32 to the 32-bit library. There are no
separate options for enabling UTF-8, UTF-16 and UTF-32 independently because
that would allow ridiculous settings such as requesting UTF-16 support while
building only the 8-bit library. It is not possible to build one library with
UTF support and another without in the same configuration. (For backwards
compatibility, --enable-utf8 is a synonym of --enable-utf.)
</P>
<P>
Of itself, this setting does not make PCRE treat strings as UTF-8, UTF-16 or
UTF-32. As well as compiling PCRE with this option, you also have to set
the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as appropriate) when you call
one of the pattern compiling functions.
</P>
<P>
If you set --enable-utf when compiling in an EBCDIC environment, PCRE expects
its input to be either ASCII or UTF-8 (depending on the run-time option). It is
not possible to support both EBCDIC and UTF-8 codes in the same version of the
library. Consequently, --enable-utf and --enable-ebcdic are mutually
exclusive.
</P>
<br><a name="SEC6" href="#TOC1">UNICODE CHARACTER PROPERTY SUPPORT</a><br>
<P>
UTF support allows the libraries to process character codepoints up to 0x10ffff
in the strings that they handle. On its own, however, it does not provide any
facilities for accessing the properties of such characters. If you want to be
able to use the pattern escapes \P, \p, and \X, which refer to Unicode
character properties, you must add
<pre>
--enable-unicode-properties
</pre>
to the <b>configure</b> command. This implies UTF support, even if you have
not explicitly requested it.
</P>
<P>
Including Unicode property support adds around 30K of tables to the PCRE
library. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
supported. Details are given in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation.
</P>
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
<pre>
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs.
See the
<a href="pcrejit.html"><b>pcrejit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
pcregrep automatically makes use of it, unless you add
<pre>
--disable-pcregrep-jit
</pre>
to the <b>configure</b> command.
</P>
<br><a name="SEC8" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
<P>
By default, PCRE interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
compile PCRE to use carriage return (CR) instead, by adding
<pre>
--enable-newline-is-cr
</pre>
to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
<br>
<br>
Alternatively, you can specify that line endings are to be indicated by the two
character sequence CRLF. If you want this, add
<pre>
--enable-newline-is-crlf
</pre>
to the <b>configure</b> command. There is a fourth option, specified by
<pre>
--enable-newline-is-anycrlf
</pre>
which causes PCRE to recognize any of the three sequences CR, LF, or CRLF as
indicating a line ending. Finally, a fifth option, specified by
<pre>
--enable-newline-is-any
</pre>
causes PCRE to recognize any Unicode newline sequence.
</P>
<P>
Whatever line ending convention is selected when PCRE is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
</P>
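<P>
The default chosen at build time can be inspected at run time: according to the
<a href="pcreapi.html"><b>pcreapi</b></a>
page, <b>pcre_config()</b> reports it as an integer for PCRE_CONFIG_NEWLINE.
The sketch below decodes the values listed there for ASCII/Unicode builds; the
helper name <b>newline_convention_name()</b> is hypothetical.
</P>

```c
/* Hypothetical helper: names the default newline convention from the integer
   reported for PCRE_CONFIG_NEWLINE in an ASCII/Unicode build. */
const char *newline_convention_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";   /* for example, an EBCDIC build */
    }
}

/* Intended use with the real API (needs <pcre.h>):
 *
 *     int nl;
 *     if (pcre_config(PCRE_CONFIG_NEWLINE, &nl) == 0)
 *         puts(newline_convention_name(nl));
 */
```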
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify
<pre>
--enable-bsr-anycrlf
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE is built can be overridden when the library functions are
called.
</P>
<br><a name="SEC10" href="#TOC1">POSIX MALLOC USAGE</a><br>
<P>
When the 8-bit library is called through the POSIX interface (see the
<a href="pcreposix.html"><b>pcreposix</b></a>
documentation), additional working storage is required for holding the pointers
to capturing substrings, because PCRE requires three integers per substring,
whereas the POSIX interface provides only two. If the number of expected
substrings is small, the wrapper function uses space on the stack, because this
is faster than using <b>malloc()</b> for each call. The default threshold above
which the stack is no longer used is 10; it can be changed by adding a setting
such as
<pre>
--with-posix-malloc-threshold=20
</pre>
to the <b>configure</b> command.
</P>
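<P>
The stack-or-heap decision described above can be sketched as follows. This is
an illustration under stated assumptions, not the wrapper's actual code: the
names <b>acquire_offset_vector()</b> and SKETCH_MALLOC_THRESHOLD are
hypothetical, and the threshold is fixed at the documented default of 10.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MALLOC_THRESHOLD 10   /* mirrors the documented default */

/* Hypothetical helper: returns a vector with room for three ints per
   substring. It uses the caller's stack buffer when nmatch does not exceed
   the threshold, and malloc() otherwise. *from_heap tells the caller whether
   free() is needed afterwards. */
int *acquire_offset_vector(size_t nmatch,
                           int stack_buf[SKETCH_MALLOC_THRESHOLD * 3],
                           int *from_heap)
{
    if (nmatch <= SKETCH_MALLOC_THRESHOLD) {
        *from_heap = 0;
        return stack_buf;
    }
    *from_heap = 1;
    return malloc(sizeof(int) * 3 * nmatch);   /* may return NULL */
}
```

<P>
Raising the threshold trades a larger stack frame for fewer <b>malloc()</b>
calls.
</P>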
<br><a name="SEC11" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<P>
Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process truly enormous patterns, so it is
possible to compile PCRE to use three-byte or four-byte offsets by adding a
setting such as
<pre>
--with-link-size=3
</pre>
to the <b>configure</b> command. The value given must be 2, 3, or 4. For the
16-bit library, a value of 3 is rounded up to 4. In these libraries, using
longer offsets slows down the operation of PCRE because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
</P>
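<P>
The rules above can be summarized in a few lines of C. Both helper names are
hypothetical; the sketch only restates the rounding rules and the arithmetic
behind the "around 64K" figure.
</P>

```c
/* Link size actually used, given the requested --with-link-size value (2, 3
   or 4) and the width of the library's data units in bits (8, 16 or 32). */
int effective_link_size(int requested, int unit_bits)
{
    if (unit_bits == 32)
        return 4;                  /* always 4; the setting is ignored */
    if (unit_bits == 16 && requested == 3)
        return 4;                  /* 3 is rounded up to 4 */
    return requested;
}

/* Largest offset value that fits in link_size bytes (2, 3 or 4). */
unsigned long long max_link_offset(int link_size)
{
    return (1ULL << (8 * link_size)) - 1;
}
```

<P>
With the default link size of 2 the largest offset is 65535, which is where
the limit of around 64K on a compiled pattern comes from.
</P>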
<br><a name="SEC12" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
<P>
When matching with the <b>pcre_exec()</b> function, PCRE implements backtracking
by making recursive calls to an internal function called <b>match()</b>. In
environments where the size of the stack is limited, this can severely limit
PCRE's operation. (The Unix environment does not usually suffer from this
problem, but it may sometimes be necessary to increase the maximum stack size.
There is a discussion in the
<a href="pcrestack.html"><b>pcrestack</b></a>
documentation.) An alternative approach to recursion that uses memory from the
heap to remember data, instead of using recursive function calls, has been
implemented to work round the problem of limited stack size. If you want to
build a version of PCRE that works this way, add
<pre>
--disable-stack-for-recursion
</pre>
to the <b>configure</b> command. With this configuration, PCRE will use the
<b>pcre_stack_malloc</b> and <b>pcre_stack_free</b> variables to call memory
management functions. By default these point to <b>malloc()</b> and
<b>free()</b>, but you can replace the pointers so that your own functions are
used instead.
</P>
<P>
Separate functions are provided rather than using <b>pcre_malloc</b> and
<b>pcre_free</b> because the usage is very predictable: the block sizes
requested are always the same, and the blocks are always freed in reverse
order. A calling program might be able to implement optimized functions that
perform better than <b>malloc()</b> and <b>free()</b>. PCRE runs noticeably more
slowly when built in this way. This option affects only the <b>pcre_exec()</b>
function; it is not relevant for <b>pcre_dfa_exec()</b>.
</P>
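<P>
Because blocks are always freed in reverse order, a replacement allocator can
be as simple as a pointer that moves up and down a fixed pool. The following is
a minimal sketch under that assumption: the names <b>lifo_stack_malloc()</b>
and <b>lifo_stack_free()</b> and the pool size are hypothetical, and the code
is not thread-safe.
</P>

```c
#include <stddef.h>

/* Allocation unit chosen so that every block is suitably aligned. */
typedef union { long double ld; long long ll; void *p; } align_unit;

#define POOL_UNITS 4096
static align_unit pool[POOL_UNITS];
static size_t pool_top = 0;             /* index of the first free unit */

void *lifo_stack_malloc(size_t size)
{
    size_t units = (size + sizeof(align_unit) - 1) / sizeof(align_unit);
    void *block;

    if (units > POOL_UNITS - pool_top)
        return NULL;                    /* pool exhausted */
    block = &pool[pool_top];
    pool_top += units;
    return block;
}

void lifo_stack_free(void *block)
{
    /* Blocks are freed in reverse order, so the block being freed is the most
       recent allocation: rewinding the top to its start releases it. */
    pool_top = (size_t)((align_unit *)block - pool);
}

/* Installing the functions (needs <pcre.h> and a library built with
 * --disable-stack-for-recursion):
 *
 *     pcre_stack_malloc = lifo_stack_malloc;
 *     pcre_stack_free = lifo_stack_free;
 */
```

<P>
A production version would need one pool per thread, or a fallback to
<b>malloc()</b> when the pool is exhausted.
</P>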
<br><a name="SEC13" href="#TOC1">LIMITING PCRE RESOURCE USAGE</a><br>
<P>
Internally, PCRE has a function called <b>match()</b>, which it calls repeatedly
(sometimes recursively) when matching a pattern with the <b>pcre_exec()</b>
function. By controlling the maximum number of times this function may be
called during a single matching operation, a limit can be placed on the
resources used by a single call to <b>pcre_exec()</b>. The limit can be changed
at run time, as described in the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation. The default is 10 million, but this can be changed by adding a
setting such as
<pre>
--with-match-limit=500000
</pre>
to the <b>configure</b> command. This setting has no effect on the
<b>pcre_dfa_exec()</b> matching function.
</P>
<P>
In some environments it is desirable to limit the depth of recursive calls of
<b>match()</b> more strictly than the total number of calls, in order to
restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
is specified) that is used. A second limit controls this; it defaults to the
value that is set for --with-match-limit, which imposes no additional
constraints. However, you can set a lower limit by adding, for example,
<pre>
--with-match-limit-recursion=10000
</pre>
to the <b>configure</b> command. This value can also be overridden at run time.
</P>
<br><a name="SEC14" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<P>
PCRE uses fixed tables for processing characters whose code values are less
than 256. By default, PCRE is built with a set of tables that are distributed
in the file <i>pcre_chartables.c.dist</i>. These tables are for ASCII codes
only. If you add
<pre>
--enable-rebuild-chartables
</pre>
to the <b>configure</b> command, the distributed tables are no longer used.
Instead, a program called <b>dftables</b> is compiled and run. This outputs the
source for a new set of tables, created in the default locale of your C run-time
system. (This method of replacing the tables does not work if you are cross
compiling, because <b>dftables</b> is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
hand".)
</P>
<br><a name="SEC15" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
most computer operating systems. PCRE can, however, be compiled to run in an
EBCDIC environment by adding
<pre>
--enable-ebcdic
</pre>
to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-utf.
</P>
<P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
such an environment you should use
<pre>
--enable-ebcdic-nl25
</pre>
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR has the
same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is <i>not</i>
chosen as LF is made to correspond to the Unicode NEL character (which, in
Unicode, is 0x85).
</P>
<P>
The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
</P>
<br><a name="SEC16" href="#TOC1">PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<P>
By default, <b>pcregrep</b> reads all files as plain text. You can build it so
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
them with <b>libz</b> or <b>libbz2</b>, respectively, by adding one or both of
<pre>
--enable-pcregrep-libz
--enable-pcregrep-libbz2
</pre>
to the <b>configure</b> command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
</P>
<br><a name="SEC17" href="#TOC1">PCREGREP BUFFER SIZE</a><br>
<P>
<b>pcregrep</b> uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
finds a match. The size of the buffer is controlled by a parameter whose
default value is 20K. The buffer itself is three times this size, but because
of the way it is used for holding "before" lines, the longest line that is
guaranteed to be processable is the parameter size. You can change the default
parameter value by adding, for example,
<pre>
--with-pcregrep-bufsize=50K
</pre>
to the <b>configure</b> command. The caller of <b>pcregrep</b> can, however,
override this value by specifying a run-time option.
</P>
<br><a name="SEC18" href="#TOC1">PCRETEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
If you add
<pre>
--enable-pcretest-libreadline
</pre>
to the <b>configure</b> command, <b>pcretest</b> is linked with the
<b>libreadline</b> library, and when its input is from a terminal, it reads it
using the <b>readline()</b> function. This provides line-editing and history
facilities. Note that <b>libreadline</b> is GPL-licensed, so if you distribute a
binary of <b>pcretest</b> linked in this way, there may be licensing issues.
</P>
<P>
Setting this option causes the <b>-lreadline</b> option to be added to the
<b>pcretest</b> build. In many operating environments with a system-installed
<b>libreadline</b> this is sufficient. However, in some environments (e.g.
if an unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for <b>libreadline</b> says
this:
<pre>
"Readline uses the termcap functions, but does not link with the
termcap or curses library itself, allowing applications which link
with readline the to choose an appropriate library."
</pre>
If your environment has not been set up so that an appropriate library is
automatically included, you may need to add something like
<pre>
LIBS="-lncurses"
</pre>
immediately before the <b>configure</b> command.
</P>
<br><a name="SEC19" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
By adding the
<pre>
--enable-valgrind
</pre>
option to the <b>configure</b> command, PCRE will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE itself.
</P>
<br><a name="SEC20" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE that can generate a
code coverage report for its test suite. To enable this, you must install
<b>lcov</b> version 1.6 or above. Then specify
<pre>
--enable-coverage
</pre>
to the <b>configure</b> command and build PCRE in the usual way.
</P>
<P>
Note that using <b>ccache</b> (a caching C compiler) is incompatible with code
coverage reporting. If you have configured <b>ccache</b> to run automatically
on your system, you must set the environment variable
<pre>
CCACHE_DISABLE=1
</pre>
before running <b>make</b> to build PCRE, so that <b>ccache</b> is not used.
</P>
<P>
When --enable-coverage is used, the following additional targets are added to the
<i>Makefile</i>:
<pre>
make coverage
</pre>
This creates a fresh coverage report for the PCRE test suite. It is equivalent
to running "make coverage-reset", "make coverage-baseline", "make check", and
then "make coverage-report".
<pre>
make coverage-reset
</pre>
This zeroes the coverage counters, but does nothing else.
<pre>
make coverage-baseline
</pre>
This captures baseline coverage information.
<pre>
make coverage-report
</pre>
This creates the coverage report.
<pre>
make coverage-clean-report
</pre>
This removes the generated coverage report without cleaning the coverage data
itself.
<pre>
make coverage-clean-data
</pre>
This removes the captured coverage data without removing the coverage files
created at compile time (*.gcno).
<pre>
make coverage-clean
</pre>
This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
<br><a name="SEC21" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3), <b>pcre16</b>, <b>pcre32</b>, <b>pcre_config</b>(3).
</P>
<br><a name="SEC22" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC23" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 October 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,243 @@
<html>
<head>
<title>pcrecallout specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecallout man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MISSING CALLOUTS</a>
<li><a name="TOC4" href="#SEC4">THE CALLOUT INTERFACE</a>
<li><a name="TOC5" href="#SEC5">RETURN VALUES</a>
<li><a name="TOC6" href="#SEC6">AUTHOR</a>
<li><a name="TOC7" href="#SEC7">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>#include &#60;pcre.h&#62;</b>
</P>
<P>
<b>int (*pcre_callout)(pcre_callout_block *);</b>
</P>
<P>
<b>int (*pcre16_callout)(pcre16_callout_block *);</b>
</P>
<P>
<b>int (*pcre32_callout)(pcre32_callout_block *);</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
PCRE provides a feature called "callout", which is a means of temporarily
passing control to the caller of PCRE in the middle of pattern matching. The
caller of PCRE provides an external function by putting its entry point in the
global variable <i>pcre_callout</i> (<i>pcre16_callout</i> for the 16-bit
library, <i>pcre32_callout</i> for the 32-bit library). By default, this
variable contains NULL, which disables all calling out.
</P>
<P>
Within a regular expression, (?C) indicates the points at which the external
function is to be called. Different callout points can be identified by putting
a number less than 256 after the letter C. The default value is zero.
For example, this pattern has two callout points:
<pre>
(?C1)abc(?C2)def
</pre>
If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE
automatically inserts callouts, all with number 255, before each item in the
pattern. For example, if PCRE_AUTO_CALLOUT is used with the pattern
<pre>
A(\d{2}|--)
</pre>
it is processed as if it were
<br>
<br>
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
<br>
<br>
Notice that there is a callout before and after each parenthesis and
alternation bar. Automatic callouts can be used for tracking the progress of
pattern matching. The
<a href="pcretest.html"><b>pcretest</b></a>
command has an option that sets automatic callouts; when it is used, the output
indicates how the pattern is matched. This is useful information when you are
trying to optimize the performance of a particular pattern.
</P>
<P>
The use of callouts in a pattern makes it ineligible for optimization by the
just-in-time compiler. Studying such a pattern with the PCRE_STUDY_JIT_COMPILE
option always fails.
</P>
<br><a name="SEC3" href="#TOC1">MISSING CALLOUTS</a><br>
<P>
You should be aware that, because of optimizations in the way PCRE matches
patterns by default, callouts sometimes do not happen. For example, if the
pattern is
<pre>
ab(?C4)cd
</pre>
PCRE knows that any matching string must contain the letter "d". If the subject
string is "abyz", the lack of "d" means that matching doesn't ever start, and
the callout is never reached. However, with "abyd", though the result is still
no match, the callout is obeyed.
</P>
<P>
If the pattern is studied, PCRE knows the minimum length of a matching string,
and will immediately give a "no match" return without actually running a match
if the subject is not long enough, or, for unanchored patterns, if it has
been scanned far enough.
</P>
<P>
You can disable these optimizations by passing the PCRE_NO_START_OPTIMIZE
option to the matching function, or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure that
callouts such as the example above are obeyed.
</P>
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
During matching, when PCRE reaches a callout point, the external function
defined by <i>pcre_callout</i> or <i>pcre[16|32]_callout</i> is called
(if it is set). This applies to both normal and DFA matching. The only
argument to the callout function is a pointer to a <b>pcre_callout</b>
or <b>pcre[16|32]_callout</b> block.
These structures contain the following fields:
<pre>
int <i>version</i>;
int <i>callout_number</i>;
int *<i>offset_vector</i>;
const char *<i>subject</i>; (8-bit version)
PCRE_SPTR16 <i>subject</i>; (16-bit version)
PCRE_SPTR32 <i>subject</i>; (32-bit version)
int <i>subject_length</i>;
int <i>start_match</i>;
int <i>current_position</i>;
int <i>capture_top</i>;
int <i>capture_last</i>;
void *<i>callout_data</i>;
int <i>pattern_position</i>;
int <i>next_item_length</i>;
const unsigned char *<i>mark</i>; (8-bit version)
const PCRE_UCHAR16 *<i>mark</i>; (16-bit version)
const PCRE_UCHAR32 *<i>mark</i>; (32-bit version)
</pre>
The <i>version</i> field is an integer containing the version number of the
block format. The initial version was 0; the current version is 2. The version
number will change again in future if additional fields are added, but the
intention is never to remove any of the existing fields.
</P>
<P>
The <i>callout_number</i> field contains the number of the callout, as compiled
into the pattern (that is, the number after ?C for manual callouts, and 255 for
automatically generated callouts).
</P>
<P>
The <i>offset_vector</i> field is a pointer to the vector of offsets that was
passed by the caller to the matching function. When <b>pcre_exec()</b> or
<b>pcre[16|32]_exec()</b> is used, the contents can be inspected, in order to extract
substrings that have been matched so far, in the same way as for extracting
substrings after a match has completed. For the DFA matching functions, this
field is not useful.
</P>
<P>
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
that were passed to the matching function.
</P>
<P>
The <i>start_match</i> field normally contains the offset within the subject at
which the current match attempt started. However, if the escape sequence \K
has been encountered, this value is changed to reflect the modified starting
point. If the pattern is not anchored, the callout function may be called
several times from the same point in the pattern for different starting points
in the subject.
</P>
<P>
The <i>current_position</i> field contains the offset within the subject of the
current match pointer.
</P>
<P>
When <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b> is used, the
<i>capture_top</i> field contains one more than the number of the highest
numbered captured substring so far. If no substrings have been captured, the
value of <i>capture_top</i> is one. This is always the case when the DFA
functions are used, because they do not support captured substrings.
</P>
<P>
The <i>capture_last</i> field contains the number of the most recently captured
substring. If no substrings have been captured, its value is -1. This is always
the case for the DFA matching functions.
</P>
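<P>
As a rough illustration of how <i>offset_vector</i>, <i>subject</i>, and
<i>capture_top</i> fit together, the sketch below copies one captured substring
out of a callout block. It is only a sketch under stated assumptions:
<b>demo_capture_block</b> is a hypothetical stand-in that mirrors a few of the
fields described above so that the code compiles without pcre.h, and
<b>demo_copy_capture()</b> is an invented helper, not a PCRE function. It
assumes the usual layout of the offsets vector, in which group n occupies
elements 2n and 2n+1 and an unset group holds -1.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a few fields of the real callout block. */
typedef struct {
  int        *offset_vector;   /* pairs of start/end offsets */
  const char *subject;         /* the subject string */
  int         subject_length;  /* length of the subject */
  int         capture_top;     /* one more than the highest captured group */
} demo_capture_block;

/* Copy captured group n (n >= 1) into buf as a zero-terminated string.
   Returns the substring length, or -1 if the group has not been captured
   so far or buf is too small. */
static int demo_copy_capture(const demo_capture_block *cb, int n,
                             char *buf, size_t bufsize)
{
  int start, end, length;

  if (n < 1 || n >= cb->capture_top) return -1;   /* not captured so far */
  start = cb->offset_vector[2*n];
  end = cb->offset_vector[2*n + 1];
  if (start < 0 || end < start) return -1;        /* group is unset */

  length = end - start;
  if ((size_t)length + 1 > bufsize) return -1;    /* no room for the copy */
  memcpy(buf, cb->subject + start, (size_t)length);
  buf[length] = 0;
  return length;
}
```

<P>
A real callout function would receive the block from PCRE and could call a
helper of this kind to look at the groups matched so far before deciding what
to return.
</P>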
<P>
The <i>callout_data</i> field contains a value that is passed to a matching
function specifically so that it can be passed back in callouts. It is passed
in the <i>callout_data</i> field of a <b>pcre_extra</b> or <b>pcre[16|32]_extra</b>
data structure. If no such data was passed, the value of <i>callout_data</i> in
a callout block is NULL. There is a description of the <b>pcre_extra</b>
structure in the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
<P>
The <i>pattern_position</i> field is present from version 1 of the callout
structure. It contains the offset to the next item to be matched in the pattern
string.
</P>
<P>
The <i>next_item_length</i> field is present from version 1 of the callout
structure. It contains the length of the next item to be matched in the pattern
string. When the callout immediately precedes an alternation bar, a closing
parenthesis, or the end of the pattern, the length is zero. When the callout
precedes an opening parenthesis, the length is that of the entire subpattern.
</P>
<P>
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
help in distinguishing between different automatic callouts, which all have the
same callout number. However, they are set for all callouts.
</P>
<P>
The <i>mark</i> field is present from version 2 of the callout structure. In
callouts from <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b> it contains a pointer to
the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed. Instances
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching functions this field always contains NULL.
</P>
<br><a name="SEC5" href="#TOC1">RETURN VALUES</a><br>
<P>
The external callout function returns an integer to PCRE. If the value is zero,
matching proceeds as normal. If the value is greater than zero, matching fails
at the current point, but the testing of other matching possibilities goes
ahead, just as if a lookahead assertion had failed. If the value is less than
zero, the match is abandoned, and the matching function returns the negative value.
</P>
<P>
Negative values should normally be chosen from the set of PCRE_ERROR_xxx
values. In particular, PCRE_ERROR_NOMATCH forces a standard "no match" failure.
The error number PCRE_ERROR_CALLOUT is reserved for use by callout functions;
it will never be used by PCRE itself.
</P>
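<P>
The three kinds of return value can be sketched as follows. This is an
illustration only: <b>demo_callout_info</b> is a hypothetical stand-in for the
few callout block fields that the sketch reads, the policy it applies is
invented, and DEMO_ERROR_CALLOUT repeats the value that PCRE_ERROR_CALLOUT is
believed to have in pcre.h (-9) so that the code compiles on its own. Real code
should include pcre.h and use the macro.
</P>

```c
#include <stddef.h>

/* Hypothetical stand-in for the fields this sketch needs. */
typedef struct {
  int   callout_number;     /* number after ?C, or 255 for auto callouts */
  int   current_position;   /* offset of the current match pointer */
  void *callout_data;       /* value passed in by the caller, or NULL */
} demo_callout_info;

/* Assumed to equal PCRE_ERROR_CALLOUT; defined locally for the sketch. */
#define DEMO_ERROR_CALLOUT (-9)

/* Invented policy for illustration:
     callout 2 abandons the match altogether;
     callout 1 rejects paths that have gone past a position limit, which
       the caller supplies as an int via callout_data;
     anything else lets matching proceed. */
static int demo_callout(const demo_callout_info *cb)
{
  const int *limit = (const int *)cb->callout_data;

  if (cb->callout_number == 2) return DEMO_ERROR_CALLOUT;  /* abandon */
  if (cb->callout_number == 1 && limit != NULL &&
      cb->current_position > *limit)
    return 1;                       /* fail here, try other possibilities */
  return 0;                         /* proceed as normal */
}
```

<P>
Returning a positive value here behaves like a failed assertion at the callout
point, whereas the negative value would be handed back to the caller of the
matching function.
</P>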
<br><a name="SEC6" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 June 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,216 @@
<html>
<head>
<title>pcrecompat specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecompat man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
DIFFERENCES BETWEEN PCRE AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.10 and above.
</P>
<P>
1. PCRE has only a subset of Perl's Unicode support. Details of what it does
have are given in the
<a href="pcreunicode.html"><b>pcreunicode</b></a>
page.
</P>
<P>
2. PCRE allows repeat quantifiers only on parenthesized assertions, but they do
not mean what you might think. For example, (?!a){3} does not assert that the
next three characters are not "a". It just asserts that the next character is
not "a" three times (in principle: PCRE optimizes this to run the assertion
just once). Perl allows repeat quantifiers on other assertions such as \b, but
these do not seem to have any use.
</P>
<P>
3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sets its
numerical variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but only if the
negative lookahead assertion contains just one branch.
</P>
<P>
4. Though binary zero characters are supported in the subject string, they are
not allowed in a pattern string because it is passed as a normal C string,
terminated by zero. The escape sequence \0 can be used in the pattern to
represent a binary zero.
</P>
<P>
5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on its
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE, an error is
generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
\U and \u are interpreted as JavaScript interprets them.
</P>
<P>
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is
built with Unicode character property support. The properties that can be
tested with \p and \P are limited to the general category properties such as
Lu and Nd, script names such as Greek or Han, and the derived properties Any
and L&. PCRE does support the Cs (surrogate) property, which Perl does not; the
Perl documentation says "Because Perl hides the need for the user to understand
the internal representation of Unicode characters, there is no need to
implement the somewhat messy concept of surrogates."
</P>
<P>
7. PCRE does support the \Q...\E escape for quoting substrings. Characters in
between are treated as literals. This is slightly different from Perl in that $
and @ are also handled as literals inside the quotes. In Perl, they cause
variable interpolation (but of course PCRE does not have variables). Note the
following examples:
<pre>
  Pattern            PCRE matches    Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
</P>
<P>
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE "callout"
feature allows an external function to be called during pattern matching. See
the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation for details.
</P>
<P>
9. Subpatterns that are called as subroutines (whether or not recursively) are
always treated as atomic groups in PCRE. This is like Python, but unlike Perl.
Captured values that are set outside a subroutine call can be referenced from
inside in PCRE, but not in Perl. There is a discussion that explains these
differences in more detail in the
<a href="pcrepattern.html#recursiondifference">section on recursion differences from Perl</a>
in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
page.
</P>
<P>
10. If any of the backtracking control verbs are used in an assertion or in a
subpattern that is called as a subroutine (whether or not recursively), their
effect is confined to that subpattern; it does not extend to the surrounding
pattern. This is not always the case in Perl. In particular, if (*THEN) is
present in a group that is called as a subroutine, its action is limited to
that group, even if the group does not contain any | characters. There is one
exception to this: the name from a (*MARK), (*PRUNE), or (*THEN) that is
encountered in a successful positive assertion <i>is</i> passed back when a
match succeeds (compare capturing parentheses in assertions). Note that such
subpatterns are processed as anchored at the point where they are tested.
</P>
<P>
11. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
</P>
<P>
12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact that PCRE
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1.
</P>
<P>
13. Perl recognizes comments in some places that PCRE does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
Perl allows white space between ( and ? but PCRE never does, even if the
PCRE_EXTENDED option is set.
</P>
<P>
14. PCRE provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
of which (such as named parentheses) have been in PCRE for some time. This list
is with respect to Perl 5.10:
<br>
<br>
(a) Although lookbehind assertions in PCRE must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
<br>
<br>
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
meaning is faulted. Otherwise, like Perl, the backslash is quietly ignored.
(Perl can be made to issue a warning.)
<br>
<br>
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, and
PCRE_NO_AUTO_CAPTURE options for <b>pcre_exec()</b> have no Perl equivalents.
<br>
<br>
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE_BSR_ANYCRLF option.
<br>
<br>
(h) The callout facility is PCRE-specific.
<br>
<br>
(i) The partial matching facility is PCRE-specific.
<br>
<br>
(j) Patterns compiled by PCRE can be saved and re-used at a later time, even on
different hosts that have the other endianness. However, this does not apply to
optimized data created by the just-in-time compiler.
<br>
<br>
(k) The alternative matching functions (<b>pcre_dfa_exec()</b>,
<b>pcre16_dfa_exec()</b>, and <b>pcre32_dfa_exec()</b>) match in a different way
and are not Perl-compatible.
<br>
<br>
(l) PCRE recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 25 August 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,368 @@
<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
<P>
<b>#include &#60;pcrecpp.h&#62;</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.
</P>
<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
<P>
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
<pre>
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");

  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
</P>
<P>
You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\\w+):(\\d+)");
     re.FullMatch("ruby:1234", &s, &i);

  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);

  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);

  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);

  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true iff all of the following conditions are satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is &#62;= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
   int number;
   pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
<b>DoMatch</b>.
</P>
<P>
NOTE: Do not use <b>no_arg</b>, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
</P>
<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
<P>
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
Example:
string quoted = RE::QuoteMeta(unquoted);
</pre>
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
</P>
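<P>
The quoting rule described above can be sketched in plain C. This is not the
wrapper's implementation: <b>demo_quote_meta()</b> is a hypothetical helper
written for illustration. It assumes that ASCII letters, digits, and underscore
are copied unchanged, that bytes with the top bit set are also left alone so
that multi-byte UTF-8 characters survive (an assumption), and that every other
byte gets a backslash in front of it.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Illustrative helper, not part of PCRE: write a quoted copy of src into
   dst, which has room for dstsize bytes. Returns the quoted length, or -1
   if dst is too small. */
static int demo_quote_meta(const char *src, char *dst, size_t dstsize)
{
  size_t used = 0;

  for (; *src != 0; src++)
    {
    unsigned char c = (unsigned char)*src;
    /* "plain" bytes are copied as they are; the rest are escaped */
    int plain = (c < 128 && isalnum(c)) || c == '_' || c >= 128;

    /* need room for this item plus the terminating zero */
    if (used + (plain ? 1 : 2) + 1 > dstsize) return -1;
    if (!plain) dst[used++] = '\\';
    dst[used++] = (char)c;
    }
  if (dstsize == 0) return -1;
  dst[used] = 0;
  return (int)used;
}
```

<P>
Applied to "1.5-2.0?", this helper produces the same quoted string as the
example in the text.
</P>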
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
<P>
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\\d+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
<P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match up to four bytes of a multi-byte character.
<pre>
Example:
pcrecpp::RE_Options options;
options.set_utf8();
pcrecpp::RE re(utf8_pattern, options);
re.FullMatch(utf8_string);
Example: using the convenience function UTF8():
pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if pcre was not configured with the
<pre>
--enable-utf8 flag.
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
<P>
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
<pre>
   modifier              description                 Perl corresponding

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore white spaces         /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non capturing parentheses by means of the
"?:" modifier within the pattern itself. e.g. (?:ab|cd) does not
capture, while (ab|cd) does.
</P>
<P>
For a full account of how each modifier works, please check the
PCRE API reference page.
</P>
<P>
For each modifier, there are two member functions whose names are formed
from the modifier name in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
RE_Options & set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. <b>match_limit()</b> limits the number of matches PCRE does;
<b>match_limit_recursion()</b> limits the depth of internal recursion, and
therefore the amount of stack that is used.
</P>
<P>
Normally, to pass one or more modifiers to a RE class, you declare
a <i>RE_Options</i> object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
RE_Options opt;
opt.set_caseless(true);
if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are all off. The other takes an
<i>option_flags</i> argument, which is intended to ease the transfer of legacy
code from C programs. This lets you do
<pre>
RE(pattern,
RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
RE(pattern,
RE_Options().set_caseless(true).set_multiline(true))
.PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
</P>
<P>
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
RE(" ^ xyz \\s+ .* blah$",
RE_Options()
.set_caseless(true)
.set_extended(true)
.set_multiline(true)).PartialMatch(sometext);
</PRE>
</P>
<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
<P>
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

     string var;
     int value;
     pcrecpp::RE re("(\\w+) = (\\d+)\n");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
</pre>
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
</P>
<P>
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
<P>
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
Example:
int a, b, c, d;
pcrecpp::RE re("(.*) (.*) (.*) (.*)");
re.FullMatch("100 40 0100 0x40",
pcrecpp::Octal(&a), pcrecpp::Hex(&b),
pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
</pre>
will leave 64 in a, b, c, and d.
</P>
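<P>
The effect of the different radix choices can be reproduced with the standard C
function strtol(), which is used here only to illustrate the arithmetic; how
the wrapper parses numbers internally is not stated on this page.
<b>demo_parse_int()</b> is a hypothetical helper. A radix of 0 asks strtol() to
honour C-style "0" and "0x" prefixes, which corresponds to the behaviour
described for CRadix().
</P>

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Illustrative helper, not part of PCRE: parse the whole of text as an int
   in the given radix (0 means "work it out from a C-style prefix").
   Returns 1 on success and 0 on failure. */
static int demo_parse_int(const char *text, int radix, int *result)
{
  char *end;
  long value;

  if (*text == 0) return 0;             /* the empty string is not a number */
  errno = 0;
  value = strtol(text, &end, radix);
  if (end == text || *end != 0) return 0;   /* nothing parsed, or junk left */
  if (errno == ERANGE || value < INT_MIN || value > INT_MAX) return 0;
  *result = (int)value;
  return 1;
}
```

<P>
With this helper, "100" in radix 8, "40" in radix 16, and both "0100" and
"0x40" in radix 0 all yield 64, which agrees with the example above. The empty
string is rejected, in the same spirit as the caveat about optional
sub-patterns in the matching interface.
</P>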
<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
<P>
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert text matching corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
string s = "yabba dabba doo";
pcrecpp::RE("b+").Replace("d", &s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
</P>
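<P>
How a "rewrite" string might be expanded can be sketched in plain C. This is an
illustration, not the wrapper's code: <b>demo_expand_rewrite()</b> is a
hypothetical helper that takes the subject text and a vector of start/end
offset pairs of the kind a match produces (pair 0 being the whole match). It
assumes that a doubled backslash stands for a literal backslash and that any
other backslash sequence is an error.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Illustrative helper, not part of PCRE: expand rewrite into out, replacing
   \0 to \9 with the text of the corresponding group. ngroups is the number
   of offset pairs in ovector. Returns 1 on success, 0 on a bad group
   reference, a stray backslash, or lack of space. */
static int demo_expand_rewrite(const char *rewrite, const char *text,
                               const int *ovector, int ngroups,
                               char *out, size_t outsize)
{
  size_t used = 0;
  const char *p;

  for (p = rewrite; *p != 0; p++)
    {
    const char *piece = p;      /* default: copy one literal character */
    size_t length = 1;

    if (*p == '\\')
      {
      p++;
      if (*p >= '0' && *p <= '9')
        {
        int n = *p - '0';
        if (n >= ngroups) return 0;               /* no such group */
        if (ovector[2*n] < 0) length = 0;         /* unset group: empty */
        else
          {
          piece = text + ovector[2*n];
          length = (size_t)(ovector[2*n + 1] - ovector[2*n]);
          }
        }
      else if (*p == '\\') piece = p;             /* literal backslash */
      else return 0;                              /* stray backslash */
      }

    if (used + length + 1 > outsize) return 0;    /* keep room for the zero */
    memcpy(out + used, piece, length);
    used += length;
    }
  if (outsize == 0) return 0;
  out[used] = 0;
  return 1;
}
```

<P>
For the subject "yabba dabba doo" and a match of "b+" at offsets 2 to 4, the
rewrite "d" expands to "d"; splicing that over the matched range gives the
"yada dabba doo" result described above.
</P>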
<P>
<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
string s = "yabba dabba doo";
pcrecpp::RE("b+").GlobalReplace("d", &s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
</P>
<P>
<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
</P>
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
The C++ wrapper was contributed by Google Inc.
<br>
Copyright &copy; 2007 Google Inc.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2012
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,426 @@
<html>
<head>
<title>pcredemo specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcredemo man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
</ul>
<PRE>
/*************************************************
* PCRE DEMONSTRATION PROGRAM *
*************************************************/
/* This is a demonstration program to illustrate the most straightforward ways
of calling the PCRE regular expression library from a C program. See the
pcresample documentation for a short discussion ("man pcresample" if you have
the PCRE man pages installed).
In Unix-like environments, if PCRE is installed in your standard system
libraries, you should be able to compile this program using this command:
gcc -Wall pcredemo.c -lpcre -o pcredemo
If PCRE is not installed in a standard place, it is likely to be installed with
support for the pkg-config mechanism. If you have pkg-config, you can compile
this program using this command:
gcc -Wall pcredemo.c `pkg-config --cflags --libs libpcre` -o pcredemo
If you do not have pkg-config, you may have to use this:
gcc -Wall pcredemo.c -I/usr/local/include -L/usr/local/lib \
-R/usr/local/lib -lpcre -o pcredemo
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
library files for PCRE are installed on your system. Only some operating
systems (e.g. Solaris) use the -R option.
Building under Windows:
If you want to statically link this program against a non-dll .a file, you must
define PCRE_STATIC before including pcre.h, otherwise the pcre_malloc() and
pcre_free() exported functions will be declared __declspec(dllimport), with
unwanted results. So in this environment, uncomment the following line. */
/* #define PCRE_STATIC */
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;pcre.h&gt;
#define OVECCOUNT 30 /* should be a multiple of 3 */
int main(int argc, char **argv)
{
pcre *re;
const char *error;
char *pattern;
char *subject;
unsigned char *name_table;
unsigned int option_bits;
int erroffset;
int find_all;
int crlf_is_newline;
int namecount;
int name_entry_size;
int ovector[OVECCOUNT];
int subject_length;
int rc, i;
int utf8;
/**************************************************************************
* First, sort out the command line. There is only one possible option at *
* the moment, "-g" to request repeated matching to find all occurrences, *
* like Perl's /g option. We set the variable find_all to a non-zero value *
* if the -g option is present. Apart from that, there must be exactly two *
* arguments. *
**************************************************************************/
find_all = 0;
for (i = 1; i &lt; argc; i++)
{
if (strcmp(argv[i], "-g") == 0) find_all = 1;
else break;
}
/* After the options, we require exactly two arguments, which are the pattern,
and the subject string. */
if (argc - i != 2)
{
printf("Two arguments required: a regex and a subject string\n");
return 1;
}
pattern = argv[i];
subject = argv[i+1];
subject_length = (int)strlen(subject);
/*************************************************************************
* Now we are going to compile the regular expression pattern, and handle *
* any errors that are detected. *
*************************************************************************/
re = pcre_compile(
pattern, /* the pattern */
0, /* default options */
&amp;error, /* for error message */
&amp;erroffset, /* for error offset */
NULL); /* use default character tables */
/* Compilation failed: print the error message and exit */
if (re == NULL)
{
printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
return 1;
}
/*************************************************************************
* If the compilation succeeded, we call PCRE again, in order to do a *
* pattern match against the subject string. This does just ONE match. If *
* further matching is needed, it will be done below. *
*************************************************************************/
rc = pcre_exec(
re, /* the compiled pattern */
NULL, /* no extra data - we didn't study the pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* output vector for substring information */
OVECCOUNT); /* number of elements in the output vector */
/* Matching failed: handle error cases */
if (rc &lt; 0)
{
switch(rc)
{
case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\n", rc); break;
}
pcre_free(re); /* Release memory used for the compiled pattern */
return 1;
}
/* Match succeeded */
printf("\nMatch succeeded at offset %d\n", ovector[0]);
/*************************************************************************
* We have found the first match within the subject string. If the output *
* vector wasn't big enough, say so. Then output any substrings that were *
* captured. *
*************************************************************************/
/* The output vector wasn't big enough */
if (rc == 0)
{
rc = OVECCOUNT/3;
printf("ovector only has room for %d captured substrings\n", rc - 1);
}
/* Show substrings stored in the output vector by number. Obviously, in a real
application you might want to do things other than print them. */
for (i = 0; i &lt; rc; i++)
{
char *substring_start = subject + ovector[2*i];
int substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, substring_length, substring_start);
}
/**************************************************************************
* That concludes the basic part of this demonstration program. We have *
* compiled a pattern, and performed a single match. The code that follows *
* shows first how to access named substrings, and then how to code for *
* repeated matches on the same subject. *
**************************************************************************/
/* See if there are any named substrings, and if so, show them by name. First
we have to extract the count of named parentheses from the pattern. */
(void)pcre_fullinfo(
re, /* the compiled pattern */
NULL, /* no extra data - we didn't study the pattern */
PCRE_INFO_NAMECOUNT, /* number of named substrings */
&amp;namecount); /* where to put the answer */
if (namecount &lt;= 0) printf("No named substrings\n"); else
{
unsigned char *tabptr;
printf("Named substrings\n");
/* Before we can access the substrings, we must extract the table for
translating names to numbers, and the size of each entry in the table. */
(void)pcre_fullinfo(
re, /* the compiled pattern */
NULL, /* no extra data - we didn't study the pattern */
PCRE_INFO_NAMETABLE, /* address of the table */
&amp;name_table); /* where to put the answer */
(void)pcre_fullinfo(
re, /* the compiled pattern */
NULL, /* no extra data - we didn't study the pattern */
PCRE_INFO_NAMEENTRYSIZE, /* size of each entry in the table */
&amp;name_entry_size); /* where to put the answer */
/* Now we can scan the table and, for each entry, print the number, the name,
and the substring itself. */
tabptr = name_table;
for (i = 0; i &lt; namecount; i++)
{
int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
tabptr += name_entry_size;
}
}
/*************************************************************************
* If the "-g" option was given on the command line, we want to continue *
* to search for additional matches in the subject string, in a similar *
* way to the /g option in Perl. This turns out to be trickier than you *
* might think because of the possibility of matching an empty string. *
* What happens is as follows: *
* *
* If the previous match was NOT for an empty string, we can just start *
* the next match at the end of the previous one. *
* *
* If the previous match WAS for an empty string, we can't do that, as it *
* would lead to an infinite loop. Instead, a special call of pcre_exec() *
* is made with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED flags set. *
* The first of these tells PCRE that an empty string at the start of the *
* subject is not a valid match; other possibilities must be tried. The *
* second flag restricts PCRE to one match attempt at the initial string *
* position. If this match succeeds, an alternative to the empty string *
* match has been found, and we can print it and proceed round the loop, *
* advancing by the length of whatever was found. If this match does not *
* succeed, we still stay in the loop, advancing by just one character. *
* In UTF-8 mode, which can be set by (*UTF8) in the pattern, this may be *
* more than one byte. *
* *
* However, there is a complication concerned with newlines. When the *
* newline convention is such that CRLF is a valid newline, we must *
* advance by two characters rather than one. The newline convention can *
* be set in the regex by (*CR), etc.; if not, we must find the default. *
*************************************************************************/
if (!find_all) /* Check for -g */
{
pcre_free(re); /* Release the memory used for the compiled pattern */
return 0; /* Finish unless -g was given */
}
/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
sequence. First, find the options with which the regex was compiled; extract
the UTF-8 state, and mask off all but the newline options. */
(void)pcre_fullinfo(re, NULL, PCRE_INFO_OPTIONS, &amp;option_bits);
utf8 = option_bits &amp; PCRE_UTF8;
option_bits &amp;= PCRE_NEWLINE_CR|PCRE_NEWLINE_LF|PCRE_NEWLINE_CRLF|
PCRE_NEWLINE_ANY|PCRE_NEWLINE_ANYCRLF;
/* If no newline options were set, find the default newline convention from the
build configuration. */
if (option_bits == 0)
{
int d;
(void)pcre_config(PCRE_CONFIG_NEWLINE, &amp;d);
/* Note that these values are always the ASCII ones, even in
EBCDIC environments. CR = 13, NL = 10. */
option_bits = (d == 13)? PCRE_NEWLINE_CR :
(d == 10)? PCRE_NEWLINE_LF :
(d == (13&lt;&lt;8 | 10))? PCRE_NEWLINE_CRLF :
(d == -2)? PCRE_NEWLINE_ANYCRLF :
(d == -1)? PCRE_NEWLINE_ANY : 0;
}
/* See if CRLF is a valid newline sequence. */
crlf_is_newline =
option_bits == PCRE_NEWLINE_ANY ||
option_bits == PCRE_NEWLINE_CRLF ||
option_bits == PCRE_NEWLINE_ANYCRLF;
/* Loop for second and subsequent matches */
for (;;)
{
int options = 0; /* Normally no options */
int start_offset = ovector[1]; /* Start at end of previous match */
/* If the previous match was for an empty string, we are finished if we are
at the end of the subject. Otherwise, arrange to run another match at the
same point to see if a non-empty match can be found. */
if (ovector[0] == ovector[1])
{
if (ovector[0] == subject_length) break;
options = PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED;
}
/* Run the next matching operation */
rc = pcre_exec(
re, /* the compiled pattern */
NULL, /* no extra data - we didn't study the pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
start_offset, /* starting offset in the subject */
options, /* options */
ovector, /* output vector for substring information */
OVECCOUNT); /* number of elements in the output vector */
/* This time, a result of NOMATCH isn't an error. If the value in "options"
is zero, it just means we have found all possible matches, so the loop ends.
Otherwise, it means we have failed to find a non-empty-string match at a
point where there was a previous empty-string match. In this case, we do what
Perl does: advance the matching position by one character, and continue. We
do this by setting the "end of previous match" offset, because that is picked
up at the top of the loop as the point at which to start again.
There are two complications: (a) When CRLF is a valid newline sequence, and
the current position is just before it, advance by an extra byte. (b)
Otherwise we must ensure that we skip an entire UTF-8 character if we are in
UTF-8 mode. */
if (rc == PCRE_ERROR_NOMATCH)
{
if (options == 0) break; /* All matches found */
ovector[1] = start_offset + 1; /* Advance one byte */
if (crlf_is_newline &amp;&amp; /* If CRLF is newline &amp; */
start_offset &lt; subject_length - 1 &amp;&amp; /* we are at CRLF, */
subject[start_offset] == '\r' &amp;&amp;
subject[start_offset + 1] == '\n')
ovector[1] += 1; /* Advance by one more. */
else if (utf8) /* Otherwise, ensure we */
{ /* advance a whole UTF-8 */
while (ovector[1] &lt; subject_length) /* character. */
{
if ((subject[ovector[1]] &amp; 0xc0) != 0x80) break;
ovector[1] += 1;
}
}
continue; /* Go round the loop again */
}
/* Other matching errors are not recoverable. */
if (rc &lt; 0)
{
printf("Matching error %d\n", rc);
pcre_free(re); /* Release memory used for the compiled pattern */
return 1;
}
/* Match succeeded */
printf("\nMatch succeeded again at offset %d\n", ovector[0]);
/* The match succeeded, but the output vector wasn't big enough. */
if (rc == 0)
{
rc = OVECCOUNT/3;
printf("ovector only has room for %d captured substrings\n", rc - 1);
}
/* As before, show substrings stored in the output vector by number, and then
also any named substrings. */
for (i = 0; i &lt; rc; i++)
{
char *substring_start = subject + ovector[2*i];
int substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, substring_length, substring_start);
}
if (namecount &lt;= 0) printf("No named substrings\n"); else
{
unsigned char *tabptr = name_table;
printf("Named substrings\n");
for (i = 0; i &lt; namecount; i++)
{
int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
tabptr += name_entry_size;
}
}
} /* End of loop to find second and subsequent matches */
printf("\n");
pcre_free(re); /* Release memory used for the compiled pattern */
return 0;
}
/* End of pcredemo.c */
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,757 @@
<html>
<head>
<title>pcregrep specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcregrep man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">SUPPORT FOR COMPRESSED FILES</a>
<li><a name="TOC4" href="#SEC4">BINARY FILES</a>
<li><a name="TOC5" href="#SEC5">OPTIONS</a>
<li><a name="TOC6" href="#SEC6">ENVIRONMENT VARIABLES</a>
<li><a name="TOC7" href="#SEC7">NEWLINES</a>
<li><a name="TOC8" href="#SEC8">OPTIONS COMPATIBILITY</a>
<li><a name="TOC9" href="#SEC9">OPTIONS WITH DATA</a>
<li><a name="TOC10" href="#SEC10">MATCHING ERRORS</a>
<li><a name="TOC11" href="#SEC11">DIAGNOSTICS</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcregrep [options] [long options] [pattern] [path1 path2 ...]</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
<b>pcregrep</b> searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
<a href="pcrepattern.html"><b>pcrepattern</b>(3)</a>
for a full description of syntax and semantics of the regular expressions
that PCRE supports.
</P>
<P>
Patterns, whether supplied on the command line or in a separate file, are given
without delimiters. For example:
<pre>
pcregrep Thursday /etc/motd
</pre>
If you attempt to use delimiters (for example, by surrounding a pattern with
slashes, as is common in Perl scripts), they are interpreted as part of the
pattern. Quotes can of course be used to delimit patterns on the command line
because they are interpreted by the shell, and indeed quotes are required if a
pattern contains white space or shell metacharacters.
</P>
<P>
The first argument that follows any option settings is treated as the single
pattern to be matched when neither <b>-e</b> nor <b>-f</b> is present.
Conversely, when one or both of these options are used to specify patterns, all
arguments are treated as path names. At least one of <b>-e</b>, <b>-f</b>, or an
argument pattern must be provided.
</P>
<P>
If no files are specified, <b>pcregrep</b> reads the standard input. The
standard input can also be referenced by a name consisting of a single hyphen.
For example:
<pre>
pcregrep some-pattern /file1 - /file3
</pre>
By default, each line that matches a pattern is copied to the standard
output, and if there is more than one file, the file name is output at the
start of each line, followed by a colon. However, there are options that can
change how <b>pcregrep</b> behaves. In particular, the <b>-M</b> option makes it
possible to search for patterns that span line boundaries. What defines a line
boundary is controlled by the <b>-N</b> (<b>--newline</b>) option.
</P>
<P>
The amount of memory used for buffering files that are being scanned is
controlled by a parameter that can be set by the <b>--buffer-size</b> option.
The default value for this parameter is specified when <b>pcregrep</b> is built,
with 20K used if no value is given at build time. A block of memory three times this size is
used (to allow for buffering "before" and "after" lines). An error occurs if a
line overflows the buffer.
</P>
<P>
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
BUFSIZ is defined in <b>&#60;stdio.h&#62;</b>. When there is more than one pattern
(specified by the use of <b>-e</b> and/or <b>-f</b>), each pattern is applied to
each line in the order in which they are defined, except that all the <b>-e</b>
patterns are tried before the <b>-f</b> patterns.
</P>
<P>
By default, as soon as one pattern matches a line, no further patterns are
considered. However, if <b>--colour</b> (or <b>--color</b>) is used to colour the
matching substrings, or if <b>--only-matching</b>, <b>--file-offsets</b>, or
<b>--line-offsets</b> is used to output only the part of the line that matched
(either shown literally, or as an offset), scanning resumes immediately
following the match, so that further matches on the same line can be found. If
there are multiple patterns, they are all tried on the remainder of the line,
but patterns that follow the one that matched are not tried on the earlier part
of the line.
</P>
<P>
This behaviour means that the order in which multiple patterns are specified
can affect the output when one of the above options is used. This is no longer
the same behaviour as GNU grep, which now manages to display earlier matches
for later patterns (as long as there is no overlap).
</P>
<P>
Patterns that can match an empty string are accepted, but empty string
matches are never recognized. An example is the pattern "(super)?(man)?", in
which all components are optional. This pattern finds all occurrences of both
"super" and "man"; the output differs from matching with "super|man" when only
the matching substrings are being shown.
</P>
<P>
If the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variable is set,
<b>pcregrep</b> uses the value to set a locale when calling the PCRE library.
The <b>--locale</b> option can be used to override this.
</P>
<br><a name="SEC3" href="#TOC1">SUPPORT FOR COMPRESSED FILES</a><br>
<P>
It is possible to compile <b>pcregrep</b> so that it uses <b>libz</b> or
<b>libbz2</b> to read files whose names end in <b>.gz</b> or <b>.bz2</b>,
respectively. You can find out whether your binary has support for one or both
of these file types by running it with the <b>--help</b> option. If the
appropriate support is not present, files are treated as plain text. The
standard input is always so treated.
</P>
<br><a name="SEC4" href="#TOC1">BINARY FILES</a><br>
<P>
By default, a file that contains a binary zero byte within the first 1024 bytes
is identified as a binary file, and is processed specially. (GNU grep also
identifies binary files in this manner.) See the <b>--binary-files</b> option
for a means of changing the way binary files are handled.
</P>
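<P>
The check described above can be sketched in C as follows. This is only an
illustration of the rule as stated (a zero byte within the first 1024 bytes),
not the code that <b>pcregrep</b> itself uses; the function name
<b>looks_binary</b> and the constant <b>BINARY_CHECK_BYTES</b> are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of leading bytes inspected, as stated in the text above. */
#define BINARY_CHECK_BYTES 1024

/* Hypothetical helper: returns 1 if a zero byte occurs within the first
   BINARY_CHECK_BYTES bytes of buf, and 0 otherwise. */
int looks_binary(const unsigned char *buf, size_t len)
{
  size_t limit = (len < BINARY_CHECK_BYTES) ? len : BINARY_CHECK_BYTES;
  assert(buf != NULL || len == 0);
  if (limit == 0) return 0;              /* an empty buffer is not binary */
  return memchr(buf, 0, limit) != NULL;  /* look only at the leading bytes */
}
```

<P>
A zero byte that first appears after the leading 1024 bytes does not cause the
buffer to be classified as binary in this sketch.
</P>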
<br><a name="SEC5" href="#TOC1">OPTIONS</a><br>
<P>
The order in which some of the options appear can affect the output. For
example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
names. Whichever comes later in the command line will be the one that takes
effect. Similarly, except where noted below, if an option is given twice, the
later setting is used. Numerical values for options may be followed by K or M,
to signify multiplication by 1024 or 1024*1024 respectively.
</P>
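<P>
As an illustration of the K and M multipliers, the following C sketch decodes
such a value. It is an assumption about how the decoding might be written, not
an extract from <b>pcregrep</b>; the name <b>parse_scaled_number</b> is
hypothetical, only upper case suffixes are accepted because those are the ones
documented here, and overflow is not checked.
</P>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal number optionally followed by K or M.
   Returns the scaled value, or -1 if the text is not of that form. */
long parse_scaled_number(const char *text)
{
  char *end;
  long value;
  assert(text != NULL);
  value = strtol(text, &end, 10);
  if (end == text || value < 0) return -1;        /* no digits, or negative */
  if (*end == 'K') { value *= 1024L; end++; }
  else if (*end == 'M') { value *= 1024L * 1024L; end++; }
  return (*end == '\0') ? value : -1;             /* reject trailing text */
}
```

<P>
With this sketch, "20K" decodes to 20480 and "2M" to 2097152.
</P>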
<P>
<b>--</b>
This terminates the list of options. It is useful if the next item on the
command line starts with a hyphen but is not an option. This allows for the
processing of patterns and filenames that start with hyphens.
</P>
<P>
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
Output <i>number</i> lines of context after each matching line. If filenames
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcregrep</b>
guarantees to have up to 8K of following text available for context output.
</P>
<P>
<b>-a</b>, <b>--text</b>
Treat binary files as text. This is equivalent to
<b>--binary-files</b>=<i>text</i>.
</P>
<P>
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
Output <i>number</i> lines of context before each matching line. If filenames
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcregrep</b>
guarantees to have up to 8K of preceding text available for context output.
</P>
<P>
<b>--binary-files=</b><i>word</i>
Specify how binary files are to be processed. If the word is "binary" (the
default), pattern matching is performed on binary files, but the only output is
"Binary file &#60;name&#62; matches" when a match succeeds. If the word is "text",
which is equivalent to the <b>-a</b> or <b>--text</b> option, binary files are
processed in the same way as any other file. In this case, when a match
succeeds, the output may be binary garbage, which can have nasty effects if
sent to a terminal. If the word is "without-match", which is equivalent to the
<b>-I</b> option, binary files are not processed at all; they are assumed not to
be of interest.
</P>
<P>
<b>--buffer-size=</b><i>number</i>
Set the parameter that controls how much memory is used for buffering files
that are being scanned.
</P>
<P>
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
Output <i>number</i> lines of context both before and after each matching line.
This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
</P>
<P>
<b>-c</b>, <b>--count</b>
Do not output individual lines from the files that are being scanned; instead
output the number of lines that would otherwise have been shown. If no lines
are selected, the number zero is output. If several files are being
scanned, a count is output for each of them. However, if the
<b>--files-with-matches</b> option is also used, only those files whose counts
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
<b>-B</b>, and <b>-C</b> options are ignored.
</P>
<P>
<b>--colour</b>, <b>--color</b>
If this option is given without any data, it is equivalent to "--colour=auto".
If data is required, it must be given in the same shell item, separated by an
equals sign.
</P>
<P>
<b>--colour=</b><i>value</i>, <b>--color=</b><i>value</i>
This option specifies under what circumstances the parts of a line that matched
a pattern should be coloured in the output. By default, the output is not
coloured. The value (which is optional, see above) may be "never", "always", or
"auto". In the last case, colouring happens only if the standard output is
connected to a terminal. More resources are used when colouring is enabled,
because <b>pcregrep</b> has to search for all possible matches in a line, not
just one, in order to colour them all.
<br>
<br>
The colour that is used can be specified by setting the environment variable
PCREGREP_COLOUR or PCREGREP_COLOR. The value of this variable should be a
string of two numbers, separated by a semicolon. They are copied directly into
the control string for setting colour on a terminal, so it is your
responsibility to ensure that they make sense. If neither of the environment
variables is set, the default is "1;31", which gives red.
</P>
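<P>
The following C sketch shows how a colour string such as "1;31" might be copied
into a terminal control string. It assumes the usual ANSI/ECMA-48 form (ESC [
<i>parameters</i> m to set attributes, ESC [ 0 m to reset them). The function
names are hypothetical, and the order in which the two environment variables
are consulted is an assumption, as the text above does not specify it.
</P>

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: choose the colour string. Checking PCREGREP_COLOUR
   before PCREGREP_COLOR is an assumption of this sketch. */
const char *choose_colour(void)
{
  const char *cs = getenv("PCREGREP_COLOUR");
  if (cs == NULL) cs = getenv("PCREGREP_COLOR");
  return (cs != NULL) ? cs : "1;31";     /* documented default: red */
}

/* Hypothetical helper: write "match" into "out", wrapped in a set-colour
   sequence and a reset sequence. The colour string is copied verbatim, so a
   malformed value produces a malformed control string. Returns the value
   returned by snprintf(). */
int format_coloured(char *out, size_t size, const char *colour,
  const char *match)
{
  assert(out != NULL && colour != NULL && match != NULL);
  return snprintf(out, size, "\x1b[%sm%s\x1b[0m", colour, match);
}
```

<P>
Because the value is inserted without validation, it is the caller's
responsibility to supply numbers that the terminal understands.
</P>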
<P>
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
If an input path is not a regular file or a directory, "action" specifies how
it is to be processed. Valid values are "read" (the default) or "skip"
(silently skip the path).
</P>
<P>
<b>-d</b> <i>action</i>, <b>--directories=</b><i>action</i>
If an input path is a directory, "action" specifies how it is to be processed.
Valid values are "read" (the default in non-Windows environments, for
compatibility with GNU grep), "recurse" (equivalent to the <b>-r</b> option), or
"skip" (silently skip the path, the default in Windows environments). In the
"read" case, directories are read as if they were ordinary files. In some
operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
single pattern that starts with a hyphen. When <b>-e</b> is used, no argument
pattern is taken from the command line; all arguments are treated as file
names. There is no limit to the number of patterns. They are applied to each
line in the order in which they are defined until one matches.
<br>
<br>
If <b>-f</b> is used with <b>-e</b>, the command line patterns are matched first,
followed by the patterns from the file(s), independent of the order in which
these options are specified. Note that multiple use of <b>-e</b> is not the same
as a single pattern with alternatives. For example, X|Y finds the first
character in a line that is X or Y, whereas if the two patterns are given
separately, with X first, <b>pcregrep</b> finds X if it is present, even if it
follows Y in the line. It finds Y only if there is no X in the line. This
matters only if you are using <b>-o</b> or <b>--colo(u)r</b> to show the part(s)
of the line that matched.
</P>
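<P>
The difference between X|Y and two separate patterns can be illustrated without
a regular expression library at all, because X and Y are single characters. The
C sketch below uses <b>strpbrk()</b> and <b>strchr()</b> as stand-ins for the
matching; the function names are hypothetical and this is not how
<b>pcregrep</b> is implemented.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Behaves like the single pattern X|Y: returns the offset of the first
   character in the line that is X or Y, or -1 if there is none. */
long first_by_alternation(const char *line)
{
  const char *p;
  assert(line != NULL);
  p = strpbrk(line, "XY");
  return (p != NULL) ? (long)(p - line) : -1;
}

/* Behaves like "-e X -e Y": X is reported if it occurs anywhere in the line;
   Y is reported only if the line contains no X. */
long first_by_pattern_order(const char *line)
{
  const char *p;
  assert(line != NULL);
  p = strchr(line, 'X');
  if (p == NULL) p = strchr(line, 'Y');
  return (p != NULL) ? (long)(p - line) : -1;
}
```

<P>
For the line "aYbX", the alternation reports the Y at offset 1, whereas the two
separate patterns report the X at offset 3.
</P>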
<P>
<b>--exclude</b>=<i>pattern</i>
Files (but not directories) whose names match the pattern are skipped without
being processed. This applies to all files, whether listed on the command line,
obtained from <b>--file-list</b>, or by scanning a directory. The pattern is a
PCRE regular expression, and is matched against the final component of the file
name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not
apply to this pattern. The option may be given any number of times in order to
specify multiple patterns. If a file name matches both an <b>--include</b>
and an <b>--exclude</b> pattern, it is excluded. There is no short form for this
option.
</P>
<P>
<b>--exclude-from=</b><i>filename</i>
Treat each non-empty line of the file as the data for an <b>--exclude</b>
option. What constitutes a newline when reading the file is the operating
system's default. The <b>--newline</b> option has no effect on this option. This
option may be given more than once in order to specify a number of files to
read.
</P>
<P>
<b>--exclude-dir</b>=<i>pattern</i>
Directories whose names match the pattern are skipped without being processed,
whatever the setting of the <b>--recursive</b> option. This applies to all
directories, whether listed on the command line, obtained from
<b>--file-list</b>, or by scanning a parent directory. The pattern is a PCRE
regular expression, and is matched against the final component of the directory
name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not
apply to this pattern. The option may be given any number of times in order to
specify more than one pattern. If a directory matches both <b>--include-dir</b>
and <b>--exclude-dir</b>, it is excluded. There is no short form for this
option.
</P>
<P>
<b>-F</b>, <b>--fixed-strings</b>
Interpret each data-matching pattern as a list of fixed strings, separated by
newlines, instead of as a regular expression. What constitutes a newline for
this purpose is controlled by the <b>--newline</b> option. The <b>-w</b> (match
as a word) and <b>-x</b> (match whole line) options can be used with <b>-F</b>.
They apply to each of the fixed strings. A line is selected if any of the fixed
strings are found in it (subject to <b>-w</b> or <b>-x</b>, if present). This
option applies only to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the <b>--include</b> or
<b>--exclude</b> options.
</P>
<P>
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
Read patterns from the file, one per line, and match them against
each line of input. What constitutes a newline when reading the file is the
operating system's default. The <b>--newline</b> option has no effect on this
option. Trailing white space is removed from each line, and blank lines are
ignored. An empty file contains no patterns and therefore matches nothing. See
also the comments about multiple patterns versus a single pattern with
alternatives in the description of <b>-e</b> above.
<br>
<br>
If this option is given more than once, all the specified files are
read. A data line is output if any of the patterns match it. A filename can
be given as "-" to refer to the standard input. When <b>-f</b> is used, patterns
specified on the command line using <b>-e</b> may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
</P>
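<P>
The treatment of each line in a pattern file (trailing white space removed,
blank lines ignored) might be coded along the following lines. This is a sketch
under the assumption that "white space" means whatever <b>isspace()</b> accepts
in the current locale; the function name <b>trim_trailing_space</b> is
hypothetical.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove trailing white space (including any newline)
   from a line, in place, and return the new length. A result of zero means
   the line was blank and should be ignored. Leading white space is kept. */
size_t trim_trailing_space(char *line)
{
  size_t len;
  assert(line != NULL);
  len = strlen(line);
  while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
  line[len] = '\0';
  return len;
}
```

<P>
A caller would read each line, apply this function, and add the line to the
list of patterns only when the returned length is greater than zero.
</P>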
<P>
<b>--file-list</b>=<i>filename</i>
Read a list of files and/or directories that are to be scanned from the given
file, one per line. Trailing white space is removed from each line, and blank
lines are ignored. These paths are processed before any that are listed on the
command line. The filename can be given as "-" to refer to the standard input.
If <b>--file</b> and <b>--file-list</b> are both specified as "-", patterns are
read first. This is useful only when the standard input is a terminal, from
which further lines (the list of files) can be read after an end-of-file
indication. If this option is given more than once, all the specified files are
read.
</P>
<P>
<b>--file-offsets</b>
Instead of showing lines or parts of lines that match, show each match as an
offset from the start of the file and a length, separated by a comma. In this
mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b>
options are ignored. If there is more than one match in a line, each of them is
shown separately. This option is mutually exclusive with <b>--line-offsets</b>
and <b>--only-matching</b>.
</P>
<P>
<b>-H</b>, <b>--with-filename</b>
Force the inclusion of the filename at the start of output lines when searching
a single file. By default, the filename is not shown in this case. For matching
lines, the filename is followed by a colon; for context lines, a hyphen
separator is used. If a line number is also being output, it follows the file
name.
</P>
<P>
<b>-h</b>, <b>--no-filename</b>
Suppress the output of filenames when searching multiple files. By default,
filenames are shown when multiple files are searched. For matching lines, the
filename is followed by a colon; for context lines, a hyphen separator is used.
If a line number is also being output, it follows the file name.
</P>
<P>
<b>--help</b>
Output a help message, giving brief details of the command options and file
type support, and then exit. Anything else on the command line is
ignored.
</P>
<P>
<b>-I</b>
Treat binary files as never matching. This is equivalent to
<b>--binary-files</b>=<i>without-match</i>.
</P>
<P>
<b>-i</b>, <b>--ignore-case</b>
Ignore upper/lower case distinctions during comparisons.
</P>
<P>
<b>--include</b>=<i>pattern</i>
If any <b>--include</b> patterns are specified, the only files that are
processed are those that match one of the patterns (and do not match an
<b>--exclude</b> pattern). This option does not affect directories, but it
applies to all files, whether listed on the command line, obtained from
<b>--file-list</b>, or by scanning a directory. The pattern is a PCRE regular
expression, and is matched against the final component of the file name, not
the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not apply to
this pattern. The option may be given any number of times. If a file name
matches both an <b>--include</b> and an <b>--exclude</b> pattern, it is excluded.
There is no short form for this option.
</P>
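<P>
The way <b>--include</b> and <b>--exclude</b> combine can be sketched in C as
shown below. To keep the example self-contained it uses the POSIX
<b>regcomp()</b> and <b>regexec()</b> functions in place of PCRE, handles one
pattern of each kind rather than a list, and assumes that "/" is the path
separator. All the function names are hypothetical.
</P>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Returns the final component of a path, assuming '/' separators. */
const char *final_component(const char *path)
{
  const char *slash;
  assert(path != NULL);
  slash = strrchr(path, '/');
  return (slash != NULL) ? slash + 1 : path;
}

/* Returns 1 if name matches the POSIX extended regex (a stand-in for a PCRE
   pattern in this sketch), and 0 if it does not or the pattern is invalid. */
int name_matches(const char *pattern, const char *name)
{
  regex_t re;
  int matched;
  assert(pattern != NULL && name != NULL);
  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
  matched = (regexec(&re, name, 0, NULL, 0) == 0);
  regfree(&re);
  return matched;
}

/* Decide whether a file is processed. A NULL pattern means that the
   corresponding option was not given. Exclusion takes precedence. */
int should_process_file(const char *path, const char *include,
  const char *exclude)
{
  const char *name = final_component(path);
  if (exclude != NULL && name_matches(exclude, name)) return 0;
  if (include != NULL && !name_matches(include, name)) return 0;
  return 1;
}
```

<P>
Only the final component is tested, so in this sketch a directory named "lib.c"
does not cause the file "lib.c/readme" to match an include pattern of "\.c$".
</P>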
<P>
<b>--include-from=</b><i>filename</i>
Treat each non-empty line of the file as the data for an <b>--include</b>
option. What constitutes a newline for this purpose is the operating system's
default. The <b>--newline</b> option has no effect on this option. This option
may be given any number of times; all the files are read.
</P>
<P>
<b>--include-dir</b>=<i>pattern</i>
If any <b>--include-dir</b> patterns are specified, the only directories that
are processed are those that match one of the patterns (and do not match an
<b>--exclude-dir</b> pattern). This applies to all directories, whether listed
on the command line, obtained from <b>--file-list</b>, or by scanning a parent
directory. The pattern is a PCRE regular expression, and is matched against the
final component of the directory name, not the entire path. The <b>-F</b>,
<b>-w</b>, and <b>-x</b> options do not apply to this pattern. The option may be
given any number of times. If a directory matches both <b>--include-dir</b> and
<b>--exclude-dir</b>, it is excluded. There is no short form for this option.
</P>
<P>
<b>-L</b>, <b>--files-without-match</b>
Instead of outputting lines from the files, just output the names of the files
that do not contain any lines that would have been output. Each file name is
output once, on a separate line.
</P>
<P>
<b>-l</b>, <b>--files-with-matches</b>
Instead of outputting lines from the files, just output the names of the files
containing lines that would have been output. Each file name is output
once, on a separate line. Searching normally stops as soon as a matching line
is found in a file. However, if the <b>-c</b> (count) option is also used,
matching continues in order to obtain the correct count, and those files that
have at least one match are listed along with their counts. Using this option
with <b>-c</b> is a way of suppressing the listing of files with no matches.
</P>
<P>
<b>--label</b>=<i>name</i>
This option supplies a name to be used for the standard input when file names
are being output. If not supplied, "(standard input)" is used. There is no
short form for this option.
</P>
<P>
<b>--line-buffered</b>
When this option is given, input is read and processed line by line, and the
output is flushed after each write. By default, input is read in large chunks,
unless <b>pcregrep</b> can determine that it is reading from a terminal (which
is currently possible only in Unix-like environments). Output to a terminal is
normally automatically flushed by the operating system. This option can be
useful when the input or output is attached to a pipe and you do not want
<b>pcregrep</b> to buffer up large amounts of data. However, its use will affect
performance, and the <b>-M</b> (multiline) option ceases to work.
</P>
<P>
<b>--line-offsets</b>
Instead of showing lines or parts of lines that match, show each match as a
line number, the offset from the start of the line, and a length. The line
number is terminated by a colon (as usual; see the <b>-n</b> option), and the
offset and length are separated by a comma. In this mode, no context is shown.
That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. If there is
more than one match in a line, each of them is shown separately. This option is
mutually exclusive with <b>--file-offsets</b> and <b>--only-matching</b>.
</P>
<P>
<b>--locale</b>=<i>locale-name</i>
This option specifies a locale to be used for pattern matching. It overrides
the value in the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variables. If no
locale is specified, the PCRE library's default (usually the "C" locale) is
used. There is no short form for this option.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre_exec()</b> function that is called by <b>pcregrep</b> to do
the matching has two parameters that can limit the resources that it uses.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
large number of possibilities in their search trees. The classic example is a
pattern that uses nested unlimited repeats. Internally, PCRE uses a function
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
limit set by <b>--match-limit</b> is imposed on the number of times this
function is called during a match, which has the effect of limiting the amount
of backtracking that can take place.
<br>
<br>
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
instead of limiting the total number of times that <b>match()</b> is called, it
limits the depth of recursive calls, which in turn limits the amount of memory
that can be used. The recursion depth is a smaller number than the total number
of calls, because not all calls to <b>match()</b> are recursive. This limit is
of use only if it is set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
when the PCRE library is compiled, with the built-in default being 10 million.
</P>
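<P>
The difference between the two limits can be illustrated with a toy
backtracking matcher. This is a sketch under stated assumptions, not PCRE's
internal <b>match()</b> function: the names <b>toy_match()</b> and
<b>toy_budget</b> are hypothetical, and the toy understands only literal
characters, "." and a postfix "*". One counter caps the total number of calls,
the other caps the nesting depth.
</P>

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOY_NOMATCH = 0,
    TOY_MATCH = 1,
    TOY_ERR_MATCHLIMIT = -1,      /* too many calls in total */
    TOY_ERR_RECURSIONLIMIT = -2   /* calls nested too deeply */
};

typedef struct {
    unsigned long calls;            /* calls made so far */
    unsigned long match_limit;      /* cap on total calls */
    unsigned long depth;            /* current nesting depth */
    unsigned long recursion_limit;  /* cap on nesting depth */
} toy_budget;

/* Matches re against the start of s. Every call is counted against
   match_limit; every nested call is counted against recursion_limit. */
static int toy_match(const char *re, const char *s, toy_budget *b)
{
    int rc;

    assert(re != NULL && s != NULL && b != NULL);
    if (++b->calls > b->match_limit) return TOY_ERR_MATCHLIMIT;
    if (b->depth + 1 > b->recursion_limit) return TOY_ERR_RECURSIONLIMIT;
    if (re[0] == '\0') return TOY_MATCH;

    b->depth++;
    if (re[1] == '*') {
        /* Try the rest of the pattern after 0, 1, 2, ... repeats. */
        const char *t = s;
        for (;;) {
            rc = toy_match(re + 2, t, b);
            if (rc != TOY_NOMATCH) break;   /* match or limit error */
            if (*t == '\0' || (re[0] != '.' && re[0] != *t)) break;
            t++;
        }
    } else if (*s != '\0' && (re[0] == '.' || re[0] == *s)) {
        rc = toy_match(re + 1, s + 1, b);
    } else {
        rc = TOY_NOMATCH;
    }
    b->depth--;
    return rc;
}
```

<P>
Matching the toy pattern "a*a*a*b" against ten a's fails only after a few
hundred calls, yet the calls are never nested more than four deep. A small
<b>match_limit</b> therefore stops the search early, while a
<b>recursion_limit</b> has an effect only when it is set below the depth that
is actually reached.
</P>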
<P>
<b>-M</b>, <b>--multiline</b>
Allow patterns to match more than one line. When this option is given, patterns
may usefully contain literal newline characters and internal occurrences of ^
and $ characters. The output for a successful match may consist of more than
one line, the last of which is the one in which the match ended. If the matched
string ends with a newline sequence the output ends at the end of that line.
<br>
<br>
When this option is set, the PCRE library is called in "multiline" mode.
There is a limit to the number of lines that can be matched, imposed by the way
that <b>pcregrep</b> buffers the input file as it scans it. However,
<b>pcregrep</b> ensures that at least 8K characters or the rest of the document
(whichever is the shorter) are available for forward matching, and similarly
the previous 8K characters (or all the previous characters, if fewer than 8K)
are guaranteed to be available for lookbehind assertions. This option does not
work when input is read line by line (see <b>--line-buffered</b>).
</P>
<P>
<b>-N</b> <i>newline-type</i>, <b>--newline</b>=<i>newline-type</i>
The PCRE library supports five different conventions for indicating
the ends of lines. They are the single-character sequences CR (carriage return)
and LF (linefeed), the two-character sequence CRLF, an "anycrlf" convention,
which recognizes any of the preceding three types, and an "any" convention, in
which any Unicode line ending sequence is assumed to end a line. The Unicode
sequences are the three just mentioned, plus VT (vertical tab, U+000B), FF
(form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and
PS (paragraph separator, U+2029).
<br>
<br>
When the PCRE library is built, a default line-ending sequence is specified.
This is normally the standard sequence for the operating system. Unless
otherwise specified by this option, <b>pcregrep</b> uses the library's default.
The possible values for this option are CR, LF, CRLF, ANYCRLF, or ANY. This
makes it possible to use <b>pcregrep</b> to scan files that have come from other
environments without having to modify their line endings. If the data that is
being scanned does not agree with the convention set by this option,
<b>pcregrep</b> may behave in strange ways. Note that this option does not
apply to files specified by the <b>-f</b>, <b>--exclude-from</b>, or
<b>--include-from</b> options, which are expected to use the operating system's
standard newline sequence.
</P>
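<P>
The five conventions can be summarized by a small C function that reports how
many bytes of line ending, if any, start at a given position. This is a sketch
only: the names <b>nl_type</b> and <b>newline_length()</b> are hypothetical, and
for the "any" convention it assumes the data is UTF-8 encoded, so that NEL, LS,
and PS appear as multi-byte sequences.
</P>

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_type;

/* Returns the length in bytes of the line ending that starts at p[0],
   or 0 if no line ending of the given convention starts there.
   avail is the number of readable bytes at p. */
static size_t newline_length(const unsigned char *p, size_t avail,
                             nl_type type)
{
    assert(p != NULL);
    if (avail == 0) return 0;

    switch (type) {
    case NL_CR:
        return p[0] == '\r' ? 1 : 0;
    case NL_LF:
        return p[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (avail >= 2 && p[0] == '\r' && p[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
    case NL_ANY:
        if (p[0] == '\r') return (avail >= 2 && p[1] == '\n') ? 2 : 1;
        if (p[0] == '\n') return 1;
        if (type == NL_ANYCRLF) return 0;
        if (p[0] == 0x0B || p[0] == 0x0C) return 1;          /* VT, FF */
        if (avail >= 2 && p[0] == 0xC2 && p[1] == 0x85)      /* NEL */
            return 2;
        if (avail >= 3 && p[0] == 0xE2 && p[1] == 0x80 &&
            (p[2] == 0xA8 || p[2] == 0xA9))                  /* LS, PS */
            return 3;
        return 0;
    }
    return 0;
}
```

<P>
Under the CR convention the bytes CR LF count as a one-byte line ending
followed by an LF that belongs to the next line, which is one way in which
data that disagrees with the chosen convention can produce surprising results.
</P>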
<P>
<b>-n</b>, <b>--line-number</b>
Precede each output line by its line number in the file, followed by a colon
for matching lines or a hyphen for context lines. If the filename is also being
output, it precedes the line number. This option is forced if
<b>--line-offsets</b> is used.
</P>
<P>
<b>--no-jit</b>
If the PCRE library is built with support for just-in-time compiling (which
speeds up matching), <b>pcregrep</b> automatically makes use of this, unless it
was explicitly disabled at build time. This option can be used to disable the
use of JIT at run time. It is provided for testing and working round problems.
It should never be needed in normal use.
</P>
<P>
<b>-o</b>, <b>--only-matching</b>
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
<b>-C</b> options are ignored. If there is more than one match in a line, each
of them is shown separately. If <b>-o</b> is combined with <b>-v</b> (invert the
sense of the match to find non-matching lines), no output is generated, but the
return code is set appropriately. If the matched portion of the line is empty,
nothing is output unless the file name or line number are being printed, in
which case they are shown on an otherwise empty line. This option is mutually
exclusive with <b>--file-offsets</b> and <b>--line-offsets</b>.
</P>
<P>
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
Show only the part of the line that matched the capturing parentheses of the
given number. Up to 32 capturing parentheses are supported, and -o0 is
equivalent to <b>-o</b> without a number. Because these options can be given
without an argument (see above), if an argument is present, it must be given in
the same shell item, for example, -o3 or --only-matching=2. The comments given
for the non-argument case above also apply to this case. If the specified
capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being printed.
<br>
<br>
If this option is given multiple times, multiple substrings are output, in the
order the options are given. For example, -o3 -o1 -o3 causes the substrings
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
default, there is no separator (but see the next option).
</P>
<P>
<b>--om-separator</b>=<i>text</i>
Specify a separating string for multiple occurrences of <b>-o</b>. The default
is an empty string. Separating strings are never coloured.
</P>
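<P>
The effect of repeated <b>-o</b><i>number</i> options together with
<b>--om-separator</b> can be sketched as follows. The function
<b>join_groups()</b> is hypothetical and is not taken from <b>pcregrep</b>. It
assumes a vector of (start, end) offset pairs in the style of the
<i>ovector</i> filled in by <b>pcre_exec()</b>, and it assumes that the
separator is written only between substrings that are actually output.
</P>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds in out the substrings for the requested groups, in the order
   requested, with sep between them. ovector holds npairs (start, end)
   offset pairs into subject. Returns the number of substrings written. */
static int join_groups(const char *subject, const int *ovector, int npairs,
                       const int *groups, int ngroups, const char *sep,
                       char *out, size_t outsize)
{
    size_t used = 0;
    int written = 0;
    int i;

    assert(out != NULL && outsize > 0);
    out[0] = '\0';
    for (i = 0; i < ngroups; i++) {
        int n = groups[i];
        int start, end, k;

        if (n < 0 || n >= npairs) continue;        /* no such group */
        start = ovector[2 * n];
        end = ovector[2 * n + 1];
        if (start < 0 || end <= start) continue;   /* unset or empty */

        k = snprintf(out + used, outsize - used, "%s%.*s",
                     written > 0 ? sep : "", end - start, subject + start);
        if (k < 0 || (size_t)k >= outsize - used) {
            out[used] = '\0';                      /* drop the partial piece */
            break;
        }
        used += (size_t)k;
        written++;
    }
    return written;
}
```

<P>
With the subject "2014-07-05", three capturing groups for year, month, and day,
the group list 3, 1, 3 and a comma as separator, the sketch builds
"05,2014,05". A group number that does not exist in the match contributes
nothing.
</P>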
<P>
<b>-q</b>, <b>--quiet</b>
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
</P>
<P>
<b>-r</b>, <b>--recursive</b>
If any given path is a directory, recursively scan the files it contains,
taking note of any <b>--include</b> and <b>--exclude</b> settings. By default, a
directory is read as a normal file; in some operating systems this gives an
immediate end-of-file. This option is a shorthand for setting the <b>-d</b>
option to "recurse".
</P>
<P>
<b>--recursion-limit</b>=<i>number</i>
See <b>--match-limit</b> above.
</P>
<P>
<b>-s</b>, <b>--no-messages</b>
Suppress error messages about non-existent or unreadable files. Such files are
quietly skipped. However, the return code is still 2, even if matches were
found in other files.
</P>
<P>
<b>-u</b>, <b>--utf-8</b>
Operate in UTF-8 mode. This option is available only if PCRE has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
<b>--include</b> options) and all subject lines that are scanned must be valid
strings of UTF-8 characters.
</P>
<P>
<b>-V</b>, <b>--version</b>
Write the version numbers of <b>pcregrep</b> and the PCRE library to the
standard output and then exit. Anything else on the command line is
ignored.
</P>
<P>
<b>-v</b>, <b>--invert-match</b>
Invert the sense of the match, so that lines which do <i>not</i> match any of
the patterns are the ones that are found.
</P>
<P>
<b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
Force the patterns to match only whole words. This is equivalent to having \b
at the start and end of the pattern. This option applies only to the patterns
that are matched against the contents of files; it does not apply to patterns
specified by any of the <b>--include</b> or <b>--exclude</b> options.
</P>
<P>
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
Force the patterns to be anchored (each must start matching at the beginning of
a line) and in addition, require them to match entire lines. This is equivalent
to having ^ and $ characters at the start and end of each alternative branch in
every pattern. This option applies only to the patterns that are matched
against the contents of files; it does not apply to patterns specified by any
of the <b>--include</b> or <b>--exclude</b> options.
</P>
<br><a name="SEC6" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
<P>
The environment variables <b>LC_ALL</b> and <b>LC_CTYPE</b> are examined, in that
order, for a locale. The first one that is set is used. This can be overridden
by the <b>--locale</b> option. If no locale is set, the PCRE library's default
(usually the "C" locale) is used.
</P>
<br><a name="SEC7" href="#TOC1">NEWLINES</a><br>
<P>
The <b>-N</b> (<b>--newline</b>) option allows <b>pcregrep</b> to scan files with
different newline conventions from the default. Any parts of the input files
that are written to the standard output are copied identically, with whatever
newline sequences they have in the input. However, the setting of this option
does not affect the interpretation of files specified by the <b>-f</b>,
<b>--exclude-from</b>, or <b>--include-from</b> options, which are assumed to use
the operating system's standard newline sequence, nor does it affect the way in
which <b>pcregrep</b> writes informational messages to the standard error and
output streams. For these it uses the string "\n" to indicate newlines,
relying on the C I/O library to convert this to an appropriate sequence.
</P>
<br><a name="SEC8" href="#TOC1">OPTIONS COMPATIBILITY</a><br>
<P>
Many of the short and long forms of <b>pcregrep</b>'s options are the same
as in the GNU <b>grep</b> program. Any long option of the form
<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
(PCRE terminology). However, the <b>--file-list</b>, <b>--file-offsets</b>,
<b>--include-dir</b>, <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>,
<b>-M</b>, <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
<b>--recursion-limit</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
<b>pcregrep</b>, as is the use of the <b>--only-matching</b> option with a
capturing parentheses number.
</P>
<P>
Although most of the common options work the same way, a few are different in
<b>pcregrep</b>. For example, the <b>--include</b> option's argument is a glob
for GNU <b>grep</b>, but a regular expression for <b>pcregrep</b>. If both the
<b>-c</b> and <b>-l</b> options are given, GNU grep lists only file names,
without counts, but <b>pcregrep</b> gives the counts.
</P>
<br><a name="SEC9" href="#TOC1">OPTIONS WITH DATA</a><br>
<P>
There are four different ways in which an option with data can be specified.
If a short form option is used, the data may follow immediately, or (with one
exception) in the next command line item. For example:
<pre>
-f/some/file
-f /some/file
</pre>
The exception is the <b>-o</b> option, which may appear with or without data.
Because of this, if data is present, it must follow immediately in the same
item, for example -o3.
</P>
<P>
If a long form option is used, the data may appear in the same command line
item, separated by an equals character, or (with two exceptions) it may appear
in the next command line item. For example:
<pre>
--file=/some/file
--file /some/file
</pre>
Note, however, that if you want to supply a file name beginning with ~ as data
in a shell command, and have the shell expand ~ to a home directory, you must
separate the file name from the option, because the shell does not treat ~
specially unless it is at the start of an item.
</P>
<P>
The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
<b>--only-matching</b> options, for which the data is optional. If one of these
options does have data, it must be given in the first form, using an equals
character. Otherwise <b>pcregrep</b> will assume that it has no data.
</P>
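<P>
The rules for long options can be sketched in C as follows. The function
<b>long_option_data()</b> is a hypothetical helper, not part of
<b>pcregrep</b>; it only shows where the data is taken from in each case.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* item is one command line item such as "--file=/some/file" or "--file";
   next is the following item, or NULL if there is none. optional_data is
   non-zero for options such as --colour and --only-matching, whose data
   is optional. Sets *used_next to 1 if the data was taken from next.
   Returns the data, or NULL if the option has no data. */
static const char *long_option_data(const char *item, const char *next,
                                    int optional_data, int *used_next)
{
    const char *eq;

    assert(item != NULL && used_next != NULL);
    *used_next = 0;
    eq = strchr(item, '=');
    if (eq != NULL) return eq + 1;       /* --name=data form */
    if (optional_data) return NULL;      /* never look at the next item */
    if (next != NULL) {
        *used_next = 1;                  /* --name data form */
        return next;
    }
    return NULL;
}
```

<P>
For an option with optional data, an item that follows the option is never
consumed as its data, which is why "--only-matching 2" is treated as the option
without data followed by a separate item.
</P>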
<br><a name="SEC10" href="#TOC1">MATCHING ERRORS</a><br>
<P>
It is possible to supply a regular expression that takes a very long time to
fail to match certain lines. Such patterns normally involve nested indefinite
repeats, for example: (a+)*\d when matched against a line of a's with no final
digit. The PCRE matching function has a resource limit that causes it to abort
in these circumstances. If this happens, <b>pcregrep</b> outputs an error
message and the line that caused the problem to the standard error stream. If
there are more than 20 such errors, <b>pcregrep</b> gives up.
</P>
<P>
The <b>--match-limit</b> option of <b>pcregrep</b> can be used to set the overall
resource limit; there is a second option called <b>--recursion-limit</b> that
sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
</P>
<br><a name="SEC11" href="#TOC1">DIAGNOSTICS</a><br>
<P>
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
for syntax errors, overlong lines, non-existent or inaccessible files (even if
matches were found in other files) or too many matching errors. Using the
<b>-s</b> option to suppress error messages about inaccessible files does not
affect the return code.
</P>
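<P>
The precedence between the three exit codes can be written as a short C
function. The names used here are hypothetical; the point of the sketch is
that an error takes priority over a successful match.
</P>

```c
#include <assert.h>

enum { GREP_MATCHED = 0, GREP_NO_MATCH = 1, GREP_ERROR = 2 };

/* any_match: non-zero if at least one match was found in any file.
   any_error: non-zero if any syntax error, overlong line, inaccessible
   file, or excess of matching errors occurred. */
static int grep_exit_status(int any_match, int any_error)
{
    if (any_error) return GREP_ERROR;    /* errors win, even with matches */
    return any_match ? GREP_MATCHED : GREP_NO_MATCH;
}
```

<P>
A script that treats exit status 2 as "no match" would therefore miss matches
found in other files when one of the files is unreadable.
</P>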
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcresyntax</b>(3), <b>pcretest</b>(1).
</P>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 13 September 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

@ -0,0 +1,458 @@
<html>
<head>
<title>pcrejit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrejit man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC2" href="#SEC2">8-BIT, 16-BIT AND 32-BIT SUPPORT</a>
<li><a name="TOC3" href="#SEC3">AVAILABILITY OF JIT SUPPORT</a>
<li><a name="TOC4" href="#SEC4">SIMPLE USE OF JIT</a>
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT EXECUTION</a>
<li><a name="TOC7" href="#SEC7">SAVING AND RESTORING COMPILED PATTERNS</a>
<li><a name="TOC8" href="#SEC8">CONTROLLING THE JIT STACK</a>
<li><a name="TOC9" href="#SEC9">JIT STACK FAQ</a>
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
going to be matched many times. This does not necessarily mean many calls of a
matching function; if the pattern is not anchored, matching attempts may take
place many times at various positions in the subject, even for a single call.
Therefore, if the subject string is very long, it may still pay to use JIT for
one-off matches.
</P>
<P>
JIT support applies only to the traditional Perl-compatible matching function.
It does not apply when the DFA matching function is being used. The code for
this support was written by Zoltan Herczeg.
</P>
<br><a name="SEC2" href="#TOC1">8-BIT, 16-BIT AND 32-BIT SUPPORT</a><br>
<P>
JIT support is available for all of the 8-bit, 16-bit and 32-bit PCRE
libraries. To keep this documentation simple, only the 8-bit interface is
described in what follows. If you are using the 16-bit library, substitute the
16-bit functions and 16-bit structures (for example, <i>pcre16_jit_stack</i>
instead of <i>pcre_jit_stack</i>). If you are using the 32-bit library,
substitute the 32-bit functions and 32-bit structures (for example,
<i>pcre32_jit_stack</i> instead of <i>pcre_jit_stack</i>).
</P>
<br><a name="SEC3" href="#TOC1">AVAILABILITY OF JIT SUPPORT</a><br>
<P>
JIT support is an optional feature of PCRE. The "configure" option --enable-jit
(or equivalent CMake option) must be set when PCRE is built if you want to use
JIT. The support is limited to the following hardware platforms:
<pre>
ARM v5, v7, and Thumb2
Intel x86 32-bit and 64-bit
MIPS 32-bit
Power PC 32-bit and 64-bit
SPARC 32-bit (experimental)
</pre>
If --enable-jit is set on an unsupported platform, compilation fails.
</P>
<P>
A program that is linked with PCRE 8.20 or later can tell if JIT support is
available by calling <b>pcre_config()</b> with the PCRE_CONFIG_JIT option. The
result is 1 when JIT is available, and 0 otherwise. However, a simple program
does not need to check this in order to use JIT. The normal API is implemented
in a way that falls back to the interpretive code if JIT is not available. For
programs that need the best possible performance, there is also a "fast path"
API that is JIT-specific.
</P>
<P>
If your program may sometimes be linked with versions of PCRE that are older
than 8.20, but you want to use JIT when it is available, you can test
the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT macro such
as PCRE_CONFIG_JIT, for compile-time control of your code.
</P>
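<P>
A compile-time test of this kind might look like the following sketch. The
macro names <b>HAS_JIT_API</b> and <b>MY_STUDY_OPTIONS</b> are hypothetical.
The sketch assumes that <b>pcre.h</b> is included before it in a real program;
when the PCRE macros are not defined, the test is false and the fallback
branch is used.
</P>

```c
#include <assert.h>

/* True for version 8.20 or later, written so that it can be used both
   in #if directives and in ordinary expressions. */
#define HAS_JIT_API(major, minor) \
    ((major) > 8 || ((major) == 8 && (minor) >= 20))

/* PCRE_MAJOR and PCRE_MINOR come from pcre.h. If the header has not
   been included, or the version is too old, no JIT option is used. */
#if defined(PCRE_MAJOR) && HAS_JIT_API(PCRE_MAJOR, PCRE_MINOR)
#define MY_STUDY_OPTIONS PCRE_STUDY_JIT_COMPILE
#else
#define MY_STUDY_OPTIONS 0
#endif

/* Self-check of the version comparison itself. */
static void check_version_macro(void)
{
    assert(HAS_JIT_API(8, 20));
    assert(HAS_JIT_API(8, 32));
    assert(!HAS_JIT_API(8, 12));
    assert(!HAS_JIT_API(7, 9));
}
```

<P>
Testing for the existence of a JIT macro such as PCRE_CONFIG_JIT with
<b>#ifdef</b> is the simpler alternative when no version comparison is needed.
</P>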
<br><a name="SEC4" href="#TOC1">SIMPLE USE OF JIT</a><br>
<P>
You have to do two things to make use of the JIT support in the simplest way:
<pre>
(1) Call <b>pcre_study()</b> with the PCRE_STUDY_JIT_COMPILE option for
each compiled pattern, and pass the resulting <b>pcre_extra</b> block to
<b>pcre_exec()</b>.
(2) Use <b>pcre_free_study()</b> to free the <b>pcre_extra</b> block when it is
no longer needed, instead of just freeing it yourself. This ensures that
any JIT data is also freed.
</pre>
For a program that may be linked with pre-8.20 versions of PCRE, you can insert
<pre>
#ifndef PCRE_STUDY_JIT_COMPILE
#define PCRE_STUDY_JIT_COMPILE 0
#endif
</pre>
so that no option is passed to <b>pcre_study()</b>, and then use something like
this to free the study data:
<pre>
#ifdef PCRE_CONFIG_JIT
pcre_free_study(study_ptr);
#else
pcre_free(study_ptr);
#endif
</pre>
PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE_PARTIAL_HARD or
PCRE_PARTIAL_SOFT options of <b>pcre_exec()</b>, you should set one or both of
the following options in addition to, or instead of, PCRE_STUDY_JIT_COMPILE
when you call <b>pcre_study()</b>:
<pre>
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
</pre>
The JIT compiler generates different optimized code for each of the three
modes (normal, soft partial, hard partial). When <b>pcre_exec()</b> is called,
the appropriate code is run if it is available. Otherwise, the pattern is
matched using interpretive code.
</P>
<P>
In some circumstances you may need to call additional functions. These are
described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below.
</P>
<P>
If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are ignored, and
no JIT data is created. Otherwise, the compiled pattern is passed to the JIT
compiler, which turns it into machine code that executes much faster than the
normal interpretive code. When <b>pcre_exec()</b> is passed a <b>pcre_extra</b>
block containing a pointer to JIT code of the appropriate mode (normal or
hard/soft partial), it obeys that code instead of running the interpreter. The
result is identical, but the compiled JIT code runs much faster.
</P>
<P>
There are some <b>pcre_exec()</b> options that are not supported for JIT
execution. There are also some pattern items that JIT cannot handle. Details
are given below. In both cases, execution automatically falls back to the
interpretive code. If you want to know whether JIT was actually used for a
particular match, you should arrange for a JIT callback function to be set up
as described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below, even if you do not need to supply a non-default JIT stack. Such a
callback function is called whenever JIT code is about to be obeyed. If the
execution options are not right for JIT execution, the callback function is not
obeyed.
</P>
<P>
If the JIT compiler finds an unsupported item, no JIT data is generated. You
can find out if JIT execution is available after studying a pattern by calling
<b>pcre_fullinfo()</b> with the PCRE_INFO_JIT option. A result of 1 means that
JIT compilation was successful. A result of 0 means that JIT support is not
available, or the pattern was not studied with PCRE_STUDY_JIT_COMPILE etc., or
the JIT compiler was not able to handle the pattern.
</P>
<P>
Once a pattern has been studied, with or without JIT, it can be used as many
times as you like for matching different subject strings.
</P>
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
The only <b>pcre_exec()</b> options that are supported for JIT execution are
PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOTBOL,
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and
PCRE_PARTIAL_SOFT.
</P>
<P>
The unsupported pattern items are:
<pre>
  \C             match a single byte; not supported in UTF-8 mode
  (?Cn)          callouts
  (*PRUNE)       )
  (*SKIP)        ) backtracking control verbs
  (*THEN)        )
</pre>
Support for some of these may be added in future.
</P>
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT EXECUTION</a><br>
<P>
When a pattern is matched using JIT execution, the return values are the same
as those given by the interpretive <b>pcre_exec()</b> code, with the addition of
one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means that the memory used
for the JIT stack was insufficient. See
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below for a discussion of JIT stack usage. For compatibility with the
interpretive <b>pcre_exec()</b> code, no more than two-thirds of the
<i>ovector</i> argument is used for passing back captured substrings.
</P>
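<P>
The two-thirds rule can be turned into a small calculation. The helper
<b>max_returned_pairs()</b> is hypothetical. It assumes, as with the
interpretive <b>pcre_exec()</b>, that each substring uses a pair of integers
and that a vector size which is not a multiple of three is rounded down.
</P>

```c
#include <assert.h>

/* Number of (start, end) pairs that can be passed back in a vector of
   ovecsize integers. Pair 0 is the whole match; the remaining pairs
   are capturing substrings. */
static int max_returned_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

<P>
For a vector of 30 integers this gives 10 pairs, that is, the whole match plus
up to nine captured substrings.
</P>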
<P>
The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if searching a
very large pattern tree goes on for too long, just as it is when JIT is not
used, but the details of exactly what is counted are not the same. The
PCRE_ERROR_RECURSIONLIMIT error code is never returned by JIT execution.
</P>
<br><a name="SEC7" href="#TOC1">SAVING AND RESTORING COMPILED PATTERNS</a><br>
<P>
The code that is generated by the JIT compiler is architecture-specific, and is
also position dependent. For those reasons it cannot be saved (in a file or
database) and restored later like the bytecode and other data of a compiled
pattern. Saving and restoring compiled patterns is not something many people
do. More detail about this facility is given in the
<a href="pcreprecompile.html"><b>pcreprecompile</b></a>
documentation. It should be possible to run <b>pcre_study()</b> on a saved and
restored pattern, and thereby recreate the JIT data, but because JIT
compilation uses significant resources, it is probably not worth doing this;
you might as well recompile the original pattern.
<a name="stackcontrol"></a></P>
<br><a name="SEC8" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
<P>
When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
managing blocks of memory for use as JIT stacks. There is further discussion
about the use of JIT stacks in the section entitled
<a href="#stackfaq">"JIT stack FAQ"</a>
below.
</P>
<P>
The <b>pcre_jit_stack_alloc()</b> function creates a JIT stack. Its arguments
are a starting size and a maximum size, and it returns a pointer to an opaque
structure of type <b>pcre_jit_stack</b>, or NULL if there is an error. The
<b>pcre_jit_stack_free()</b> function can be used to free a stack that is no
longer needed. (For the technically minded: the address space is allocated by
mmap or VirtualAlloc.)
</P>
<P>
JIT uses far less memory for recursion than the interpretive code,
and a maximum stack size of 512K to 1M should be more than enough for any
pattern.
</P>
<P>
The <b>pcre_assign_jit_stack()</b> function specifies which stack JIT code
should use. Its arguments are as follows:
<pre>
pcre_extra *extra
pcre_jit_callback callback
void *data
</pre>
The <i>extra</i> argument must be the result of studying a pattern with
PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the other
two options:
<pre>
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
on the machine stack is used.
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
a valid JIT stack, the result of calling <b>pcre_jit_stack_alloc()</b>.
(3) If <i>callback</i> is not NULL, it must point to a function that is
called with <i>data</i> as an argument at the start of matching, in
order to set up a JIT stack. If the return from the callback
function is NULL, the internal 32K stack is used; otherwise the
return value must be a valid JIT stack, the result of calling
<b>pcre_jit_stack_alloc()</b>.
</pre>
A callback function is obeyed whenever JIT code is about to be run; it is not
obeyed when <b>pcre_exec()</b> is called with options that are incompatible with
JIT execution. A callback function can therefore be used to determine whether a
match operation was executed by JIT or by the interpreter.
</P>
<P>
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are all matched
sequentially in the same thread. In a multithread application, if you do not
specify a JIT stack, or if you assign or pass back NULL from a callback, that
is thread-safe, because each thread has its own machine stack. However, if you
assign or pass back a non-NULL JIT stack, this must be a different stack for
each thread so that the application is thread-safe.
</P>
<P>
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
to any number of patterns as long as they are not used for matching by multiple
threads at the same time. For example, you can assign the same stack to all
compiled patterns, and use a global mutex in the callback to wait until the
stack is available for use. However, this is an inefficient solution, and not
recommended.
</P>
<P>
This is a suggestion for how a multithreaded program that needs to set up
non-default JIT stacks might operate:
<pre>
  During thread initialization
    thread_local_var = pcre_jit_stack_alloc(...)

  During thread exit
    pcre_jit_stack_free(thread_local_var)

  Use a one-line callback function
    return thread_local_var
</pre>
All the functions described in this section do nothing if JIT is not available,
and <b>pcre_assign_jit_stack()</b> does nothing unless the <b>extra</b> argument
is non-NULL and points to a <b>pcre_extra</b> block that is the result of a
successful study with PCRE_STUDY_JIT_COMPILE etc.
<a name="stackfaq"></a></P>
<br><a name="SEC9" href="#TOC1">JIT STACK FAQ</a><br>
<P>
(1) Why do we need JIT stacks?
<br>
<br>
PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where
the local data of the current node is pushed before checking its child nodes.
Extending the real machine stack is difficult on some platforms. For example,
on PowerPC the stack chain needs to be updated every time the stack is
extended. Although this is possible, the overhead of the updates reduces
performance. So we do the recursion in a separately allocated block of memory.
</P>
<P>
(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
<br>
<br>
Modern operating systems have a nice feature: they can reserve an address space
instead of allocating memory. We can safely allocate memory pages inside this
address space, so the stack could grow without moving memory data (this is
important because of pointers). Thus we can allocate 1M address space, and use
only a single memory page (usually 4K) if that is enough. However, the stack
can still grow up to 1M at any time if needed.
</P>
<P>
(3) Who "owns" a JIT stack?
<br>
<br>
The owner of the stack is the user program, not the JIT studied pattern or
anything else. The user program must ensure that while a stack is in use by
<b>pcre_exec()</b> (that is, it is assigned to the pattern currently running),
it is not used by any other thread (to avoid overwriting the same
memory area). The best practice for multithreaded programs is to allocate a
stack for each thread, and return this stack through the JIT callback function.
</P>
<P>
(4) When should a JIT stack be freed?
<br>
<br>
You can free a JIT stack at any time, as long as it will not be used by
<b>pcre_exec()</b> again. When you assign the stack to a pattern, only a pointer
is set. There is no reference counting or any other magic. You can free the
patterns and stacks in any order, at any time. Just <i>do not</i> call
<b>pcre_exec()</b> with a pattern pointing to an already freed stack, as that
will cause a segmentation fault. (Also, do not free a stack currently used by
<b>pcre_exec()</b> in another thread.) You can also replace the stack for a
pattern at any time. You can even free the previous stack before assigning a
replacement.
</P>
<P>
(5) Should I allocate/free a stack every time before/after calling
<b>pcre_exec()</b>?
<br>
<br>
No, because this is too costly in terms of resources. However, you could
implement a scheme that releases the stack if it has not been used for, say,
two minutes. The JIT callback can help to achieve this without keeping a
list of the currently JIT studied patterns.
</P>
<P>
(6) OK, the stack is for long term memory allocation. But what happens if a
pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
stack is freed?
<br>
<br>
Especially on embedded systems, it might be a good idea to release memory
sometimes without freeing the stack. There is no API for this at the moment.
A function that returns the amount of memory currently allocated for a stack,
and another that releases memory (shrinking the stack), would probably be a
good idea if someone needs this.
</P>
<P>
(7) This is too much of a headache. Isn't there any better solution for JIT
stack handling?
<br>
<br>
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
out this complicated API.
</P>
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback.
<pre>
  int rc;
  int ovector[30];
  const char *error;   /* error message from pcre_compile() or pcre_study() */
  int erroffset;       /* offset in the pattern where compilation failed */
  pcre *re;
  pcre_extra *extra;
  pcre_jit_stack *jit_stack;

  /* pattern, subject, and length are assumed to be supplied by the caller */
  re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
  /* Check for errors */
  extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
  jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
  /* Check for error (NULL) */
  pcre_assign_jit_stack(extra, NULL, jit_stack);
  rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
  /* Check results */
  pcre_free(re);
  pcre_free_study(extra);
  pcre_jit_stack_free(jit_stack);
</pre>
</P>
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
<P>
Because the API described above falls back to interpreted execution when JIT is
not available, it is convenient for programs that are written for general use
in many environments. However, calling JIT via <b>pcre_exec()</b> does have a
performance impact. Programs that are written for use where JIT is known to be
available, and which need the best possible performance, can use a
"fast path" API to call JIT execution directly instead of calling
<b>pcre_exec()</b> (obviously only for patterns that have been successfully
studied by JIT).
</P>
<P>
The fast path function is called <b>pcre_jit_exec()</b>, and it takes exactly
the same arguments as <b>pcre_exec()</b>, plus one additional argument that
must point to a JIT stack. The JIT stack arrangements described above do not
apply. The return values are the same as for <b>pcre_exec()</b>.
</P>
<P>
When you call <b>pcre_exec()</b>, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
the subject pointer is NULL, or its length is negative, an immediate error is
given. Also, unless PCRE_NO_UTF[8|16|32] is set, a UTF subject string is tested
for validity. In the interests of speed, these checks do not happen on the JIT
fast path, and if invalid data is passed, the result is undefined.
</P>
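<P>
Because the fast path skips these checks, a caller that cannot fully trust its
input may want to repeat the cheap ones itself. The sketch below is a
hypothetical helper, not a PCRE function. The NULL and negative-length tests
come from the paragraph above; the start offset test is an assumption based on
the PCRE_ERROR_BADOFFSET condition described in the <b>pcreapi</b>
documentation. UTF validity is deliberately not checked here.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the arguments pass the basic checks that pcre_exec() would
   have made, 0 otherwise. This does NOT validate UTF sequences. */
int fast_path_args_ok(const char *subject, int length, int start_offset)
{
    if (subject == NULL) return 0;                 /* NULL subject pointer */
    if (length < 0) return 0;                      /* negative length */
    if (start_offset < 0 || start_offset > length) /* offset outside subject */
        return 0;
    return 1;
}

/* Debug-build guard: abort before invalid data reaches the fast path. */
void require_fast_path_args(const char *subject, int length, int start_offset)
{
    assert(fast_path_args_ok(subject, length, start_offset));
}
```

<P>
Running such a check on every call gives back part of the speed that the fast
path gains, so it is probably best kept at the point where untrusted data
enters the program rather than around each match.
</P>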
<P>
Bypassing the sanity checks and the <b>pcre_exec()</b> wrapping can give
speedups of more than 10%.
</P>
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3)
</P>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel (FAQ by Zoltan Herczeg)
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 31 October 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
@ -0,0 +1,86 @@
<html>
<head>
<title>pcrelimits specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrelimits man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
SIZE AND OTHER LIMITATIONS
</b><br>
<P>
There are some size limitations in PCRE but it is hoped that they will never in
practice be relevant.
</P>
<P>
The maximum length of a compiled pattern is approximately 64K data units (bytes
for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for
the 32-bit library) if PCRE is compiled with the default internal linkage size
of 2 bytes. If you want to process regular expressions that are truly enormous,
you can compile PCRE with an internal linkage size of 3 or 4 (when building the
16-bit or 32-bit library, 3 is rounded up to 4). See the <b>README</b> file in
the source distribution and the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower.
</P>
<P>
All values in repeating quantifiers must be less than 65536.
</P>
<P>
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns.
</P>
<P>
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
</P>
<P>
The maximum length of name for a named subpattern is 32 characters, and the
maximum number of named subpatterns is 10000.
</P>
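<P>
PCRE reports violations of these limits itself when a pattern is compiled. An
application that builds patterns from user input may still want to reject
out-of-range values early, for example to give a clearer message. The helpers
below are a hypothetical sketch of such a pre-check for the quantifier and
group name limits stated above; the names are illustrative only, and the name
check assumes the 8-bit library with one byte per character.
</P>

```c
#include <assert.h>
#include <string.h>

enum {
    MAX_QUANTIFIER_VALUE  = 65535,  /* values must be less than 65536 */
    MAX_GROUP_NAME_LENGTH = 32      /* characters in a subpattern name */
};

/* Return 1 if value is usable inside a {min,max} quantifier. */
int quantifier_value_ok(long value)
{
    return value >= 0 && value <= MAX_QUANTIFIER_VALUE;
}

/* Return 1 if name is non-empty and within the documented length limit. */
int group_name_length_ok(const char *name)
{
    size_t len;
    assert(name != NULL);
    len = strlen(name);
    return len >= 1 && len <= (size_t)MAX_GROUP_NAME_LENGTH;
}
```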
<P>
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
</P>
<P>
The maximum length of a subject string is the largest positive number that an
integer variable can hold. However, when using the traditional matching
function, PCRE uses recursion to handle subpatterns and indefinite repetition.
This means that the available stack space may limit the size of a subject
string that can be processed by certain patterns. For a discussion of stack
issues, see the
<a href="pcrestack.html"><b>pcrestack</b></a>
documentation.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 04 May 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
@ -0,0 +1,233 @@
<html>
<head>
<title>pcrematching specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrematching man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>
<P>
This document describes the two different algorithms that are available in PCRE
for matching a compiled regular expression against a given subject string. The
"standard" algorithm is the one provided by the <b>pcre_exec()</b>,
<b>pcre16_exec()</b> and <b>pcre32_exec()</b> functions. These work in the same
way as Perl's matching function, and provide a Perl-compatible matching operation.
The just-in-time (JIT) optimization that is described in the
<a href="pcrejit.html"><b>pcrejit</b></a>
documentation is compatible with these functions.
</P>
<P>
An alternative algorithm is provided by the <b>pcre_dfa_exec()</b>,
<b>pcre16_dfa_exec()</b> and <b>pcre32_dfa_exec()</b> functions; they operate in
a different way, and are not Perl-compatible. This alternative has advantages
and disadvantages compared with the standard algorithm, and these are described
below.
</P>
<P>
When there is only one possible way in which a given subject string can match a
pattern, the two algorithms give the same answer. A difference arises, however,
when there are multiple possibilities. For example, if the pattern
<pre>
^&#60;.*&#62;
</pre>
is matched against the string
<pre>
&#60;something&#62; &#60;something else&#62; &#60;something further&#62;
</pre>
there are three possible answers. The standard algorithm finds only one of
them, whereas the alternative algorithm finds all three.
</P>
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
<P>
The set of strings that are matched by a regular expression can be represented
as a tree structure. An unlimited repetition in the pattern makes the tree of
infinite size, but it is still a tree. Matching the pattern to a given subject
string (from a given starting point) can be thought of as a search of the tree.
There are two ways to search a tree: depth-first and breadth-first, and these
correspond to the two matching algorithms provided by PCRE.
</P>
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
<P>
In the terminology of Jeffrey Friedl's book "Mastering Regular
Expressions", the standard algorithm is an "NFA algorithm". It conducts a
depth-first search of the pattern tree. That is, it proceeds along a single
path through the tree, checking that the subject matches what is required. When
there is a mismatch, the algorithm tries any alternatives at the current point,
and if they all fail, it backs up to the previous branch point in the tree, and
tries the next alternative branch at that level. This often involves backing up
(moving to the left) in the subject string as well. The order in which
repetition branches are tried is controlled by the greedy or ungreedy nature of
the quantifier.
</P>
<P>
If a leaf node is reached, a matching string has been found, and at that point
the algorithm stops. Thus, if there is more than one possible match, this
algorithm returns the first one that it finds. Whether this is the shortest,
the longest, or some intermediate length depends on the way the greedy and
ungreedy repetition quantifiers are specified in the pattern.
</P>
<P>
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
</P>
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
<P>
This algorithm conducts a breadth-first search of the tree. Starting from the
first matching point in the subject, it scans the subject string from left to
right, once, character by character, and as it does this, it remembers all the
paths through the tree that represent valid matches. In Friedl's terminology,
this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
</P>
<P>
Although the general principle of this matching algorithm is that it scans the
subject string only once, without backtracking, there is one exception: when a
lookaround assertion is encountered, the characters following or preceding the
current point have to be independently inspected.
</P>
<P>
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
them, and in particular, it finds the longest. The matches are returned in
decreasing order of length. There is an option to stop the algorithm after the
first match (which is necessarily the shortest) is found.
</P>
<P>
Note that all the matches that are found start at the same point in the
subject. If the pattern
<pre>
cat(er(pillar)?)?
</pre>
is matched against the string "the caterpillar catchment", the result will be
the three strings "caterpillar", "cater", and "cat" that start at the fifth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
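<P>
The behaviour in this example can be imitated by a toy model. The function
below is not a regular expression engine and does not use PCRE: it hard-codes
the single pattern cat(er(pillar)?)? and the name <i>all_matches_at()</i> is
hypothetical. It walks the subject once from a given starting offset, records
every point at which the pattern could stop, and reports the match lengths
longest first.
</P>

```c
#include <assert.h>
#include <string.h>

/* Toy model for the fixed pattern cat(er(pillar)?)? only.
   Stores the lengths of all matches that start at "start", longest first,
   and returns how many were found (0 to 3). */
int all_matches_at(const char *subject, int start, int lengths[3])
{
    static const char *const parts[3] = { "cat", "er", "pillar" };
    int found[3];
    int count = 0;
    int pos = start;
    int i;

    assert(subject != NULL && lengths != NULL);
    assert(start >= 0 && (size_t)start <= strlen(subject));

    /* Single left-to-right pass: each part that matches ends one more
       terminated path; the first mismatch ends the scan. */
    for (i = 0; i < 3; i++) {
        size_t plen = strlen(parts[i]);
        if (strncmp(subject + pos, parts[i], plen) != 0)
            break;
        pos += (int)plen;
        found[count++] = pos - start;
    }

    /* Report in decreasing order of length. */
    for (i = 0; i < count; i++)
        lengths[i] = found[count - 1 - i];
    return count;
}
```

<P>
For the subject "the caterpillar catchment" and a starting offset of 4, this
model reports the lengths 11, 5 and 3, corresponding to "caterpillar", "cater"
and "cat". Like the real algorithm, it does not move on to the later "cat" in
"catchment" unless it is called again with a later starting offset.
</P>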
<P>
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
</P>
<P>
1. Because the algorithm finds all possible matches, the greedy or ungreedy
nature of repetition quantifiers is not relevant. Greedy and ungreedy
quantifiers are treated in exactly the same way. However, possessive
quantifiers can make a difference when what follows could also match what is
quantified, for example in a pattern like this:
<pre>
^a++\w!
</pre>
This pattern matches "aaab!" but not "aaa!", which would be matched by a
non-possessive quantifier. Similarly, if an atomic group is present, it is
matched as if it were a standalone pattern at the current point, and the
longest match is then "locked in" for the rest of the overall pattern.
</P>
<P>
2. When dealing with multiple paths through the tree simultaneously, it is not
straightforward to keep track of captured substrings for the different matching
possibilities, and PCRE's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
</P>
<P>
3. Because no substrings are captured, back references within the pattern are
not supported, and cause errors if encountered.
</P>
<P>
4. For the same reason, conditional expressions that use a backreference as the
condition or test for a specific group recursion are not supported.
</P>
<P>
5. Because many paths through the tree may be active, the \K escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported. It causes an error if encountered.
</P>
<P>
6. Callouts are supported, but the value of the <i>capture_top</i> field is
always 1, and the value of the <i>capture_last</i> field is always -1.
</P>
<P>
7. The \C escape sequence, which (in the standard algorithm) always matches a
single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is not supported in
these modes, because the alternative algorithm moves through the subject string
one character (not data unit) at a time, for all active paths through the tree.
</P>
<P>
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
Using the alternative matching algorithm provides the following advantages:
</P>
<P>
1. All possible matches (at a single point in the subject) are automatically
found, and in particular, the longest match is found. To find more than one
match using the standard algorithm, you have to do kludgy things with
callouts.
</P>
<P>
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack (except for lookbehinds), it is possible to pass very
long subject strings to the matching function in several pieces, checking for
partial matching each time. Although it is possible to do multi-segment
matching using the standard algorithm by retaining partially matched
substrings, it is more complicated. The
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation gives details of partial matching and discusses multi-segment
matching.
</P>
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
The alternative algorithm suffers from a number of disadvantages:
</P>
<P>
1. It is substantially slower than the standard algorithm. This is partly
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses and back references are not supported.
</P>
<P>
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
@ -0,0 +1,474 @@
<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()</a>
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()</a>
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()</a>
<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()</a>
<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
<li><a name="TOC10" href="#SEC10">AUTHOR</a>
<li><a name="TOC11" href="#SEC11">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to a matching
function matches as far as it goes, but is too short to match the entire
pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might
be helpful to distinguish this case from other cases in which there is no
match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very
long and is not all available at once.
</P>
<P>
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling any of the matching
functions. For backwards compatibility, PCRE_PARTIAL is a synonym for
PCRE_PARTIAL_SOFT. The essential difference between the two options is whether
or not a partial match is preferred to an alternative complete match, though
the details differ between the two types of matching function. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
If you want to use partial matching with just-in-time optimized code, you must
call <b>pcre_study()</b>, <b>pcre16_study()</b> or <b>pcre32_study()</b> with one
or both of these options:
<pre>
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
</pre>
PCRE_STUDY_JIT_COMPILE should also be set if you are going to run non-partial
matches on the same pattern. If the appropriate JIT study mode has not been set
for a match, the interpretive matching code is used.
</P>
<P>
Setting a partial matching option disables two of PCRE's standard
optimizations. PCRE remembers the last literal data unit in a pattern, and
abandons matching immediately if it is not present in the subject string. This
optimization cannot be used for a subject string that might match only
partially. If the pattern was studied, PCRE knows the minimum length of a
matching string, and does not bother to run the matching function on shorter
strings. This optimization is also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()</a><br>
<P>
A partial match occurs during a call to <b>pcre_exec()</b> or
<b>pcre[16|32]_exec()</b> when the end of the subject string is reached successfully,
but matching cannot continue because more characters are needed. However, at
least one character in the subject must have been inspected. This character
need not form part of the final matched string; lookbehind assertions and the
\K escape sequence provide ways of inspecting characters before the start of a
matched substring. The requirement for inspecting at least one character exists
because an empty string can always be matched; without such a restriction there
would always be a partial match of an empty string at the end of the subject.
</P>
<P>
If there are at least two slots in the offsets vector when a partial match is
returned, the first slot is set to the offset of the earliest character that
was inspected. For convenience, the second offset points to the end of the
subject so that a substring can easily be identified.
</P>
<P>
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
assertions, or \K, or begin with \b or \B, earlier characters have been
inspected while carrying out the match. For example:
<pre>
/(?&#60;=abc)123/
</pre>
This pattern matches "123", but only if it is preceded by "abc". If the subject
string is "xyzabc12", the offsets after a partial match are for the substring
"abc12", because all these characters are needed if another match is tried
with extra characters added to the subject.
</P>
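<P>
Once the two offsets are available, extracting the inspected text is plain
string handling. The helper below is a hypothetical sketch (the name
<i>copy_partial()</i> is not a PCRE function) that copies the range described
by the first two slots of the offsets vector into a caller-supplied buffer.
</P>

```c
#include <assert.h>
#include <string.h>

/* Copy subject[ovector[0] .. ovector[1]) into out and NUL-terminate it.
   Returns the number of data units copied, or -1 if out is too small. */
int copy_partial(const char *subject, const int ovector[2],
                 char *out, size_t out_size)
{
    int start = ovector[0];
    int end = ovector[1];
    size_t len;

    assert(subject != NULL && out != NULL);
    assert(start >= 0 && end >= start);

    len = (size_t)(end - start);
    if (len + 1 > out_size)
        return -1;              /* no room for the text plus the NUL */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

<P>
With the subject "xyzabc12" and the offsets 3 and 8 implied by the example
above, the buffer receives "abc12".
</P>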
<P>
What happens when a partial match is identified depends on which of the two
partial matching options is set.
</P>
<br><b>
PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()
</b><br>
<P>
If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b>
identifies a partial match, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, PCRE_ERROR_PARTIAL is returned instead of
PCRE_ERROR_NOMATCH.
</P>
<P>
This option is "soft" because it prefers a complete match over a partial match.
All the various matching items in a pattern behave as if the subject string is
potentially complete. For example, \z, \Z, and $ match at the end of the
subject, as normal, and for \b and \B the end of the subject is treated as a
non-alphanumeric.
</P>
<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
/123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
identifying "123dog" as the first partial match that was found. (In this
example, there are two partial matches, because "dog" on its own partially
matches the second alternative.)
</P>
<br><b>
PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()
</b><br>
<P>
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b>,
PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without
continuing to search for possible complete matches. This option is "hard"
because it prefers an earlier partial match over a later complete match. For
this reason, the assumption is made that the end of the supplied subject string
may not be the true end of the available data, and so, if \z, \Z, \b, \B,
or $ are encountered at the end of the subject, the result is
PCRE_ERROR_PARTIAL, provided that at least one character in the subject has
been inspected.
</P>
<P>
Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16
subject strings are checked for validity. Normally, an invalid sequence
causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
special case of a truncated character at the end of the subject,
PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
PCRE_PARTIAL_HARD is set.
</P>
<br><b>
Comparing hard and soft partial matching
</b><br>
<P>
The difference between the two partial matching options can be illustrated by a
pattern such as:
<pre>
/dog(sbody)?/
</pre>
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
if the pattern is made ungreedy the result is different:
<pre>
/dog(sbody)??/
</pre>
In this case the result is always a complete match because that is found first,
and matching never continues after finding a complete match. It might be easier
to follow this explanation by thinking of the two patterns like this:
<pre>
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
</pre>
The second pattern will never match "dogsbody", because it will always find the
shorter match first.
</P>
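<P>
The rewritten forms above suggest a small model of the two options. The code
below is a toy, not PCRE: it treats a pattern as an ordered list of literal
alternatives tried at the start of the subject, and all names in it are
hypothetical. It only illustrates the order of decisions: a complete match
ends the search at once, a hard partial match also ends it at once, and a soft
partial match is remembered while later alternatives are tried.
</P>

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

/* Toy model: try each literal alternative in order at the start of subject.
   "hard" selects PCRE_PARTIAL_HARD-like behaviour, otherwise soft. */
enum outcome try_alternatives(const char *const *alts, int n_alts,
                              const char *subject, int hard)
{
    size_t slen;
    int saw_partial = 0;
    int i;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);

    for (i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);

        /* Whole alternative present: the first complete match wins. */
        if (slen >= alen && strncmp(subject, alts[i], alen) == 0)
            return COMPLETE;

        /* Subject ran out while still matching, with at least one
           character inspected: this is a partial match. */
        if (slen > 0 && slen < alen && strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return PARTIAL;     /* hard: stop immediately */
            saw_partial = 1;        /* soft: remember and carry on */
        }
    }
    return saw_partial ? PARTIAL : NO_MATCH;
}
```

<P>
With the alternatives in the order "dogsbody", "dog" and the subject "dog",
the model gives a complete match for soft and a partial match for hard. With
the order "dog", "dogsbody" it gives a complete match in both cases, which
agrees with the description above.
</P>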
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()</a><br>
<P>
The DFA functions move along the subject string character by character, without
backtracking, searching for all possible matches simultaneously. If the end of
the subject is reached before the end of the pattern, there is the possibility
of a partial match, again provided that at least one character has been
inspected.
</P>
<P>
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
complete matches. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string, provided there are
at least two slots in the offsets vector.
</P>
<P>
Because the DFA functions always search for all possible matches, and there is
no difference between greedy and ungreedy repetition, their behaviour is
different from the standard functions when PCRE_PARTIAL_HARD is set. Consider
the string "dog" matched against the ungreedy pattern shown above:
<pre>
/dog(sbody)??/
</pre>
Whereas the standard functions stop as soon as they find the complete match for
"dog", the DFA functions also find the partial match for "dogsbody", and so
return that when PCRE_PARTIAL_HARD is set.
</P>
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
<P>
If a pattern ends with one of the sequences \b or \B, which test for word
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
results. Consider this pattern:
<pre>
/\bcat\b/
</pre>
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
character cannot take place, so a partial match is found. However, normal
matching carries on, and \b matches at the end of the subject when the last
character is a letter, so a complete match is found. The result, therefore, is
<i>not</i> PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield
PCRE_ERROR_PARTIAL, because then the partial match takes precedence.
</P>
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
<P>
For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the <b>pcre_exec()</b> function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
all patterns. From release 8.00 onwards, the restrictions no longer apply, and
partial matching can be requested for any pattern.
</P>
<P>
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
conform to the restrictions, <b>pcre_exec()</b> returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
pattern can be used for partial matching now always returns 1.
</P>
<br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
that uses the date example quoted above:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P
0: 25jun04
1: jun
data&#62; 25dec3\P
Partial match: 25dec3
data&#62; 3ju\P
Partial match: 3ju
data&#62; 3juj\P
No match
data&#62; j\P
No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. Similar output is obtained
if DFA matching is used.
</P>
<P>
If the escape sequence \P is present more than once in a <b>pcretest</b> data
line, the PCRE_PARTIAL_HARD option is set for the match.
</P>
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()</a><br>
<P>
When a partial match has been found using a DFA matching function, it is
possible to continue the match by providing additional subject data and calling
the function again with the same compiled regular expression, this time setting
the PCRE_DFA_RESTART option. You must pass the same working space as before,
because this is where details of the previous partial match are stored. Here is
an example using <b>pcretest</b>, using the \R escape sequence to set the
PCRE_DFA_RESTART option (\D specifies the use of the DFA matching function):
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\P\D
Partial match: 23ja
data&#62; n05\R\D
0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
<P>
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
facility can be used to pass very long subject strings to the DFA matching
functions.
</P>
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()</a><br>
<P>
From release 8.00, the standard matching functions can also be used to do
multi-segment matching. Unlike the DFA functions, it is not possible to
restart the previous match with a new segment of data. Instead, new data must
be added to the previous subject string, and the entire match re-run, starting
from the point where the partial match occurred. Earlier data can be discarded.
</P>
<P>
It is best to use PCRE_PARTIAL_HARD in this situation, because it does not
treat the end of a segment as the end of the subject when matching \z, \Z,
\b, \B, and $. Consider an unanchored pattern that matches dates:
<pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data&#62; The date is 23ja\P\P
Partial match: 23ja
</pre>
At this stage, an application could discard the text preceding "23ja", add on
text from the next segment, and call the matching function again. Unlike the
DFA matching functions, the entire matching string must always be available,
and the complete matching process occurs for each call, so more memory and more
processing time is needed.
</P>
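<P>
The discard-and-append step described above is ordinary buffer handling. The
helper below is a hypothetical sketch (<i>carry_over()</i> is not a PCRE
function): it drops everything before a given offset, typically the first
offset returned for the partial match, and appends the next segment so that
the matching function can be called again on the combined text.
</P>

```c
#include <assert.h>
#include <string.h>

/* Drop the first keep_from data units of buf (current length len), then
   append segment. Returns the new length, or -1 if buf is too small, in
   which case buf is left unchanged. */
int carry_over(char *buf, size_t buf_size, int len, int keep_from,
               const char *segment)
{
    size_t kept;
    size_t seg_len;

    assert(buf != NULL && segment != NULL);
    assert(keep_from >= 0 && keep_from <= len);

    kept = (size_t)(len - keep_from);
    seg_len = strlen(segment);
    if (kept + seg_len + 1 > buf_size)
        return -1;

    memmove(buf, buf + keep_from, kept);    /* ranges may overlap */
    memcpy(buf + kept, segment, seg_len);
    buf[kept + seg_len] = '\0';
    return (int)(kept + seg_len);
}
```

<P>
For the example above, the buffer holds "The date is 23ja" and the partial
match starts at offset 12. Calling the helper with a next segment of "n05."
leaves "23jan05." in the buffer, ready for another matching attempt.
</P>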
<P>
<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match includes
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
However, in some cases you may need to retain even earlier characters, as
discussed in the next section.
</P>
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
<P>
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
</P>
<P>
1. If the pattern contains a test for the beginning of a line, you need to pass
the PCRE_NOTBOL option when the subject string for any call does not start at the
beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
includes the effect of PCRE_NOTEOL.
</P>
<P>
2. Lookbehind assertions that have already been obeyed are catered for in the
offsets that are returned for a partial match. However, a lookbehind assertion
later in the pattern could require even earlier characters to be inspected. You
can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the
<b>pcre_fullinfo()</b> or <b>pcre[16|32]_fullinfo()</b> functions to obtain the length
of the largest lookbehind in the pattern. This length is given in characters,
not bytes. If you always retain at least that many characters before the
partially matched string, all should be well. (Of course, near the start of the
subject, fewer characters may be present; in that case all characters should be
retained.)
</P>
<P>
3. Because a partial match must always contain at least one character, what
might be considered a partial match of an empty string actually gives a "no
match" result. For example:
<pre>
re&#62; /c(?&#60;=abc)x/
data&#62; ab\P
No match
</pre>
If the next segment begins "cx", a match should be found, but this will only
happen if characters from the previous segment are retained. For this reason, a
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
</P>
<P>
4. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\b or \B. Another kind of difference may occur when there are multiple
matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
is given only when there are no completed matches. This means that as soon as
the shortest match has been found, continuation to a new subject segment is no
longer possible. Consider again this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
data&#62; dogsb\P
0: dog
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
0: g
data&#62; dogsbody\D
0: dogsbody
1: dog
</pre>
The first data line passes the string "dogsb" to a standard matching function,
setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter
string "dog" is a complete match. Similarly, when the subject is presented to
a DFA matching function in several parts ("do" and "gsb" being the first two)
the match stops when "dog" has been found, and it is not possible to continue.
On the other hand, if "dogsbody" is presented as a single string, a DFA
matching function finds both matches.
</P>
<P>
Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
multi-segment data. The example above then behaves differently:
<pre>
re&#62; /dog(sbody)?/
data&#62; dogsb\P\P
Partial match: dogsb
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\P\D
Partial match: gsb
</pre>
</P>
<P>
5. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE_DFA_RESTART is
used. For example, consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
1234|ABCD
</pre>
where no string can be a partial match for both alternatives. This is not a
problem if a standard matching function is used, because the entire match has
to be rerun each time:
<pre>
re&#62; /1234|3789/
data&#62; ABC123\P\P
Partial match: 123
data&#62; 1237890
0: 3789
</pre>
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
the entire match can also be used with the DFA matching functions. Another
possibility is to work with two buffers. If a partial match at offset <i>n</i>
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
the second buffer, you can then try a new match starting at offset <i>n+1</i> in
the first buffer.
</P>
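<P>
The retention rule in items 2 and 3 above amounts to a small piece of
arithmetic, sketched here in C. The helper <b>retain_from()</b> is hypothetical
and not part of PCRE. It assumes that <i>max_lookbehind</i> holds the value
obtained via PCRE_INFO_MAXLOOKBEHIND, and that one character occupies one code
unit (that is, no UTF mode), so that character counts and offsets coincide.
</P>

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PCRE function).
 * Returns the offset from which subject data should be retained.
 * partial_start is the offset at which the partial match began; for a
 * "no match" result with a pattern that contains lookbehinds, pass the
 * length of the subject instead. If fewer than max_lookbehind characters
 * precede that point, everything is retained (offset 0).
 * Assumes one character per code unit.
 */
size_t retain_from(size_t partial_start, size_t max_lookbehind)
{
    return partial_start > max_lookbehind
         ? partial_start - max_lookbehind
         : 0;
}
```

<P>
For the pattern /c(?&#60;=abc)x/ and the segment "ab", the longest lookbehind is
three characters and the subject length is two, so the function returns 0 and
the whole of "ab" is kept for the next attempt.
</P>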
<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 June 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


View File

@ -0,0 +1,195 @@
<html>
<head>
<title>pcreperform specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreperform man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
PCRE PERFORMANCE
</b><br>
<P>
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><b>
COMPILED PATTERN MEMORY USAGE
</b><br>
<P>
Patterns are compiled by PCRE into a reasonably efficient interpretive code, so
that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
<pre>
(abc|def){2,4}
</pre>
is compiled as if it were
<pre>
(abc|def)(abc|def)((abc|def)(abc|def)?)?
</pre>
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
</P>
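<P>
To get a feel for how quickly the repeated copies add up, the textual expansion
shown above can be generated by a short C function. This is an illustration
only: <b>expand_repeat()</b> is a hypothetical name, and the code mimics the
written-out form given in this document, not PCRE's internal compiled code.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends s to out if it fits (including the terminating zero). */
static int append(char *out, size_t cap, size_t *used, const char *s)
{
    size_t n = strlen(s);
    if (n >= cap - *used)
        return -1;
    memcpy(out + *used, s, n + 1);
    *used += n;
    return 0;
}

/*
 * Hypothetical helper (not a PCRE function).
 * Writes the textual expansion of group{min,max} into out: min mandatory
 * copies followed by (max - min) nested optional copies.
 * Returns 0 on success, or -1 if out is too small.
 */
int expand_repeat(const char *group, int min, int max, char *out, size_t cap)
{
    size_t used = 0;
    int optional, i;

    assert(cap > 0 && min >= 0 && max >= min);
    out[0] = '\0';
    optional = max - min;

    for (i = 0; i < min; i++)
        if (append(out, cap, &used, group) < 0) return -1;

    for (i = 0; i < optional; i++) {
        if (i < optional - 1) {          /* open a nesting level */
            if (append(out, cap, &used, "(") < 0) return -1;
            if (append(out, cap, &used, group) < 0) return -1;
        } else {                         /* innermost optional copy */
            if (append(out, cap, &used, group) < 0) return -1;
            if (append(out, cap, &used, "?") < 0) return -1;
        }
    }
    for (i = 0; i + 1 < optional; i++)   /* close the nesting levels */
        if (append(out, cap, &used, ")?") < 0) return -1;

    return 0;
}
```

<P>
Calling it with the group "(abc|def)", a minimum of 2, and a maximum of 4
reproduces the expanded form given above; the group text appears <i>max</i>
times in the result.
</P>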
<P>
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
<pre>
((ab){1,1000}c){1,3}
</pre>
uses 51K bytes when compiled using the 8-bit library. When PCRE is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K data units, and this is reached with the above pattern
if the outer repetition is increased from 3 to 4. PCRE can be compiled to use
larger internal pointers and thus handle larger compiled patterns, but it is
better to try to rewrite your pattern to use less memory if you can.
</P>
<P>
One way of reducing the memory usage for such patterns is to make use of PCRE's
<a href="pcrepattern.html#subpatternsassubroutines">"subroutine"</a>
facility. Re-writing the above pattern as
<pre>
((ab)(?2){0,999}c)(?1){0,2}
</pre>
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
<a href="pcrepattern.html#atomicgroup">atomic groups</a>
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE cannot otherwise handle.
</P>
<br><b>
STACK USAGE AT RUN TIME
</b><br>
<P>
When <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b> is used for matching, certain
kinds of pattern can cause it to use large amounts of the process stack. In
some environments the default process stack is quite small, and if it runs out
the result is often SIGSEGV. This issue is probably the most frequently raised
problem with PCRE. Rewriting your pattern can often help. The
<a href="pcrestack.html"><b>pcrestack</b></a>
documentation discusses this issue in detail.
</P>
<br><b>
PROCESSING TIME
</b><br>
<P>
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
about optimizing regular expressions for efficient performance. This document
contains a few observations about PCRE.
</P>
<P>
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
because PCRE has to use a multi-stage table lookup whenever it needs a
character's property. If you can find an alternative pattern that does not use
character properties, it will probably be faster.
</P>
<P>
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set PCRE_UCP if you want Unicode character properties to be used. This can
double the matching time for items such as \d, when matched with
a traditional matching function; the performance loss is less with
a DFA matching function, and in both cases there is not much difference for
\b.
</P>
<P>
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE_DOTALL option is set, the
pattern is implicitly anchored by PCRE, since it can match only at the start of
a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this
optimization, because the . metacharacter does not then match a newline, and if
the subject string contains newlines, the pattern may match from the character
immediately following one of them instead of from the very start. For example,
the pattern
<pre>
.*second
</pre>
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE has to retry the match starting after every newline in the subject.
</P>
<P>
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE
from having to scan along the subject looking for a newline to restart at.
</P>
<P>
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
<pre>
^(a+)*
</pre>
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
</P>
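<P>
The figure of 16 can be checked with a short counting routine. The sketch below
is not PCRE code and the name <b>count_ways()</b> is hypothetical. It counts,
for a run of <i>n</i> "a" characters, every way of choosing how many times the
* repeat iterates and how many characters each a+ takes, including the case of
zero iterations.
</P>

```c
#include <assert.h>

/*
 * Hypothetical helper (not a PCRE function).
 * Counts the distinct ways in which ^(a+)* can match at the start of a
 * run of n 'a' characters. ways_k is the number of ways of consuming
 * exactly k characters with one or more non-empty a+ iterations; it is
 * the sum of the counts for all shorter lengths, because the final a+
 * may take any non-zero number of the k characters.
 */
unsigned long long count_ways(unsigned n)
{
    unsigned long long total = 1;   /* zero iterations: the empty match */
    unsigned long long shorter = 1; /* sum of counts for lengths 0..k-1 */
    unsigned k;

    assert(n < 64);                 /* keeps the result within 64 bits */
    for (k = 1; k <= n; k++) {
        unsigned long long ways_k = shorter;
        total += ways_k;
        shorter += ways_k;
    }
    return total;
}
```

<P>
The count doubles with each additional character, which is why the time taken
for a failing match grows so quickly with the length of the subject.
</P>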
<P>
An optimization catches some of the more simple cases such as
<pre>
(a+)*b
</pre>
where a literal character follows. Before embarking on the standard matching
procedure, PCRE checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
<pre>
(a+)*\d
</pre>
with that of the pattern above. The earlier pattern, (a+)*b, gives a failure
almost instantly when applied to a whole line of "a" characters, whereas
(a+)*\d takes an appreciable time with strings longer than about 20 characters.
</P>
<P>
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 25 August 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,292 @@
<html>
<head>
<title>pcreposix specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreposix man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF POSIX API</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a>
<li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a>
<li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a>
<li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a>
<li><a name="TOC7" href="#SEC7">MEMORY USAGE</a>
<li><a name="TOC8" href="#SEC8">AUTHOR</a>
<li><a name="TOC9" href="#SEC9">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF POSIX API</a><br>
<P>
<b>#include &#60;pcreposix.h&#62;</b>
</P>
<P>
<b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
<b>int <i>cflags</i>);</b>
</P>
<P>
<b>int regexec(regex_t *<i>preg</i>, const char *<i>string</i>,</b>
<b>size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
</P>
<P>
<b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
<b>char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
</P>
<P>
<b>void regfree(regex_t *<i>preg</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
This set of functions provides a POSIX-style API for the PCRE regular
expression 8-bit library. See the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation for a description of PCRE's native API, which contains much
additional functionality. There is no POSIX-style wrapper for PCRE's 16-bit
and 32-bit libraries.
</P>
<P>
The functions described here are just wrapper functions that ultimately call
the PCRE native API. Their prototypes are defined in the <b>pcreposix.h</b>
header file, and on Unix systems the library itself is called
<b>libpcreposix.a</b>, so it can be accessed by adding <b>-lpcreposix</b> to the
command for linking an application that uses them. Because the POSIX functions
call the native ones, it is also necessary to add <b>-lpcre</b>.
</P>
<P>
I have implemented only those POSIX option bits that can be reasonably mapped
to PCRE native options. In addition, the option REG_EXTENDED is defined with
the value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE as a
replacement library. Other POSIX options are not even defined.
</P>
<P>
There are also some other options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE-specific features via the POSIX calling interface.
</P>
<P>
When PCRE is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-byte encoding
domains it is probably even less compatible.
</P>
<P>
The header for these functions is supplied as <b>pcreposix.h</b> to avoid any
potential clash with other POSIX libraries. It can, of course, be renamed or
aliased as <b>regex.h</b>, which is the "correct" name. It provides two
structure types, <i>regex_t</i> for compiled internal forms, and
<i>regmatch_t</i> for returning captured substrings. It also defines some
constants whose names start with "REG_"; these are used for setting options and
identifying error codes.
</P>
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
The function <b>regcomp()</b> is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
to a <b>regex_t</b> structure that is used as a base for storing information
about the compiled regular expression.
</P>
<P>
The argument <i>cflags</i> is either zero, or contains one or more of the bits
defined by the following macros:
<pre>
REG_DOTALL
</pre>
The PCRE_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
<pre>
REG_ICASE
</pre>
The PCRE_CASELESS option is set when the regular expression is passed for
compilation to the native function.
<pre>
REG_NEWLINE
</pre>
The PCRE_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSUB
</pre>
The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to <b>regexec()</b> for matching, the
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
are returned.
<pre>
REG_UCP
</pre>
The PCRE_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE to use Unicode properties
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
that REG_UCP is not part of the POSIX standard.
<pre>
REG_UNGREEDY
</pre>
The PCRE_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
POSIX standard.
<pre>
REG_UTF8
</pre>
The PCRE_UTF8 option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF8
is not part of the POSIX standard.
</P>
<P>
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by . (they are not) or by a negative class such as [^a]
(they are).
</P>
<P>
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
<i>preg</i> structure is filled in on success, and one member of the structure
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
</P>
<P>
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
use the contents of the <i>preg</i> structure. If, for example, you pass it to
<b>regexec()</b>, the result is undefined and your program is likely to crash.
</P>
<br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
<P>
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE to obey POSIX semantics, but then PCRE was never
intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE:
<pre>
                        Default  Change with
. matches newline         no     PCRE_DOTALL
newline matches [^a]      yes    not changeable
$ matches \n at end       yes    PCRE_DOLLAR_ENDONLY
$ matches \n in middle    no     PCRE_MULTILINE
^ matches \n in middle    no     PCRE_MULTILINE
</pre>
This is the equivalent table for POSIX:
<pre>
                        Default  Change with
. matches newline         yes    REG_NEWLINE
newline matches [^a]      yes    REG_NEWLINE
$ matches \n at end       no     REG_NEWLINE
$ matches \n in middle    no     REG_NEWLINE
^ matches \n in middle    no     REG_NEWLINE
</pre>
PCRE's behaviour is the same as Perl's, except that there is no equivalent for
PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is no way to stop
newline from matching [^a].
</P>
<P>
The default POSIX newline handling can be obtained by setting PCRE_DOTALL and
PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE behave exactly as for the
REG_NEWLINE action.
</P>
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
<P>
The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i>
against a given <i>string</i>, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can
be:
<pre>
REG_NOTBOL
</pre>
The PCRE_NOTBOL option is set when calling the underlying PCRE matching
function.
<pre>
REG_NOTEMPTY
</pre>
The PCRE_NOTEMPTY option is set when calling the underlying PCRE matching
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
setting this option can give more POSIX-like behaviour in some situations.
<pre>
REG_NOTEOL
</pre>
The PCRE_NOTEOL option is set when calling the underlying PCRE matching
function.
<pre>
REG_STARTEND
</pre>
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
(there need not actually be a NUL at that location), regardless of the value of
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
</P>
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
<b>regexec()</b> are ignored.
</P>
<P>
If the value of <i>nmatch</i> is zero, or if the value of <i>pmatch</i> is NULL,
no data about any matched strings is returned.
</P>
<P>
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the <i>pmatch</i> argument, which points to an
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
members <i>rm_so</i> and <i>rm_eo</i>. These contain the offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of <i>string</i> that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
</P>
<P>
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
</P>
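<P>
The calling sequence described above might look like the following sketch. Two
things here are assumptions rather than facts from this page: the helper name
<b>find_value()</b> is hypothetical, and the code includes the system
<b>regex.h</b> so that it builds without PCRE installed. To use the PCRE
wrapper, include <b>pcreposix.h</b> instead and link as described in the
DESCRIPTION section. The pattern is deliberately simple, so it should behave in
the same way under either library.
</P>

```c
#include <regex.h>   /* assumption: system header; use <pcreposix.h> for PCRE */
#include <stddef.h>

/*
 * Hypothetical helper (not part of PCRE).
 * Looks for "name=number" in subject and reports the offsets of the
 * number (capturing subpattern 2) via *start and *end.
 * Returns 0 on a match, REG_NOMATCH if there is none, or another
 * non-zero error code from regcomp() or regexec().
 */
int find_value(const char *subject, int *start, int *end)
{
    regex_t re;
    regmatch_t pm[3];   /* element 0: whole match; 1 and 2: captures */
    int rc;

    /* REG_EXTENDED is zero in pcreposix.h but is needed by POSIX libraries. */
    rc = regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED);
    if (rc != 0)
        return rc;      /* re must not be used after a failed compile */

    rc = regexec(&re, subject, 3, pm, 0);
    if (rc == 0) {
        *start = (int)pm[2].rm_so;  /* offset of first matched character */
        *end = (int)pm[2].rm_eo;    /* offset just past the last one */
    }
    regfree(&re);
    return rc;
}
```

<P>
For the subject "set width=120 now", the function returns zero with offsets 10
and 13, which delimit "120".
</P>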
<br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br>
<P>
The <b>regerror()</b> function maps a non-zero error code from either
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
function is the size of buffer needed to hold the whole message.
</P>
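<P>
Because the yield is the size of buffer needed, one possible approach is to call
<b>regerror()</b> twice: once with a zero-length buffer to learn the size, and
once to fetch the message. The sketch below makes the same assumptions as any
code written against the POSIX interface: the helper name
<b>compile_error_message()</b> is hypothetical, and the system <b>regex.h</b>
is included so that the code builds without PCRE; substitute <b>pcreposix.h</b>
to use the wrapper.
</P>

```c
#include <regex.h>   /* assumption: system header; use <pcreposix.h> for PCRE */
#include <stdlib.h>

/*
 * Hypothetical helper (not part of PCRE).
 * Tries to compile pattern. Returns NULL if it compiles cleanly;
 * otherwise returns a malloc'd, zero-terminated error message that the
 * caller must free (NULL is also returned if memory cannot be obtained).
 */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&re);
        return NULL;
    }
    need = regerror(rc, &re, NULL, 0);   /* size only; nothing is written */
    msg = malloc(need > 0 ? need : 1);
    if (msg != NULL) {
        msg[0] = '\0';
        regerror(rc, &re, msg, need);    /* fetch the whole message */
    }
    return msg;
}
```

<P>
The text of the message differs between libraries, so an application should
display it rather than compare it with a fixed string.
</P>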
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
<P>
Compiling a regular expression causes memory to be allocated and associated
with the <i>preg</i> structure. The function <b>regfree()</b> frees all such
memory, after which <i>preg</i> may no longer be used as a compiled expression.
</P>
<br><a name="SEC8" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
Last updated: 09 January 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

View File

@ -0,0 +1,158 @@
<html>
<head>
<title>pcreprecompile specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreprecompile man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE PATTERNS</a>
<li><a name="TOC2" href="#SEC2">SAVING A COMPILED PATTERN</a>
<li><a name="TOC3" href="#SEC3">RE-USING A PRECOMPILED PATTERN</a>
<li><a name="TOC4" href="#SEC4">COMPATIBILITY WITH DIFFERENT PCRE RELEASES</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE PATTERNS</a><br>
<P>
If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled form
instead of having to compile them every time the application is run.
If you are not using any private character tables (see the
<a href="pcre_maketables.html"><b>pcre_maketables()</b></a>
documentation), this is relatively straightforward. If you are using private
tables, it is a little bit more complicated. However, if you are using the
just-in-time optimization feature, it is not possible to save and reload the
JIT data.
</P>
<P>
If you save compiled patterns to a file, you can copy them to a different host
and run them there. If the two hosts have different endianness (byte order),
you should run the <b>pcre[16|32]_pattern_to_host_byte_order()</b> function on the
new host before trying to match the pattern. The matching functions return
PCRE_ERROR_BADENDIANNESS if they detect a pattern with the wrong endianness.
</P>
<P>
Compiling regular expressions with one version of PCRE for use with a different
version is not guaranteed to work and may cause crashes, and saving and
restoring a compiled pattern loses any JIT optimization data.
</P>
<br><a name="SEC2" href="#TOC1">SAVING A COMPILED PATTERN</a><br>
<P>
The value returned by <b>pcre[16|32]_compile()</b> points to a single block of
memory that holds the compiled pattern and associated data. You can find the
length of this block in bytes by calling <b>pcre[16|32]_fullinfo()</b> with an
argument of PCRE_INFO_SIZE. You can then save the data in any appropriate
manner. Here is sample code for the 8-bit library that compiles a pattern and
writes it to a file. It assumes that the variable <i>fd</i> refers to a file
that is open for output:
<pre>
int erroroffset, rc;
size_t size, written;   /* PCRE_INFO_SIZE yields a size_t value */
const char *error;
pcre *re;
re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
if (re == NULL) { ... handle errors ... }
rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
if (rc &#60; 0) { ... handle errors ... }
written = fwrite(re, 1, size, fd);   /* fd is a FILE * open for output */
if (written != size) { ... handle errors ... }
</pre>
In this example, the bytes that comprise the compiled pattern are copied
exactly. Note that this is binary data that may contain any of the 256 possible
byte values. On systems that make a distinction between binary and non-binary
data, be sure that the file is opened for binary output.
</P>
<P>
If you want to write more than one pattern to a file, you will have to devise a
way of separating them. For binary data, preceding each pattern with its length
is probably the most straightforward approach. Another possibility is to write
out the data in hexadecimal instead of binary, one pattern to a line.
</P>
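<P>
A length-prefixed layout of the kind suggested above could be sketched as
follows. The function names <b>write_record()</b> and <b>read_record()</b> are
hypothetical and not part of PCRE, and the blocks are treated as opaque bytes.
Note one assumption in particular: the length is written as a raw
<b>size_t</b>, so such a file is readable only on a host with the same
<b>size_t</b> width and byte order as the one that wrote it.
</P>

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a PCRE function).
 * Writes one record: the block's length, then the block itself.
 * Returns 0 on success, -1 on a write error.
 */
int write_record(FILE *f, const void *data, size_t size)
{
    if (fwrite(&size, sizeof size, 1, f) != 1)
        return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size)
        return -1;
    return 0;
}

/*
 * Hypothetical helper (not a PCRE function).
 * Reads the next record into a malloc'd block and stores its length in
 * *size. Returns NULL at end of file or on error; the caller frees the
 * block.
 */
void *read_record(FILE *f, size_t *size)
{
    void *data;

    if (fread(size, sizeof *size, 1, f) != 1)
        return NULL;
    data = malloc(*size > 0 ? *size : 1);
    if (data == NULL)
        return NULL;
    if (*size > 0 && fread(data, 1, *size, f) != *size) {
        free(data);
        return NULL;
    }
    return data;
}
```

<P>
Each block passed to <b>write_record()</b> would be a compiled pattern together
with the length obtained via PCRE_INFO_SIZE. The file should be opened in binary
mode on systems that make the distinction.
</P>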
<P>
Saving compiled patterns in a file is only one possible way of storing them for
later use. They could equally well be saved in a database, or in the memory of
some daemon process that passes them via sockets to the processes that want
them.
</P>
<P>
If the pattern has been studied, it is also possible to save the normal study
data in a similar way to the compiled pattern itself. However, if the
PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that is created cannot
be saved because it is too dependent on the current environment. When studying
generates additional information, <b>pcre[16|32]_study()</b> returns a pointer to a
<b>pcre[16|32]_extra</b> data block. Its format is defined in the
<a href="pcreapi.html#extradata">section on matching a pattern</a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation. The <i>study_data</i> field points to the binary study data, and
this is what you must save (not the <b>pcre[16|32]_extra</b> block itself). The
length of the study data can be obtained by calling <b>pcre[16|32]_fullinfo()</b>
with an argument of PCRE_INFO_STUDYSIZE. Remember to check that
<b>pcre[16|32]_study()</b> did return a non-NULL value before trying to save the
study data.
</P>
<br><a name="SEC3" href="#TOC1">RE-USING A PRECOMPILED PATTERN</a><br>
<P>
Re-using a precompiled pattern is straightforward. Having reloaded it into main
memory, called <b>pcre[16|32]_pattern_to_host_byte_order()</b> if necessary,
you pass its pointer to <b>pcre[16|32]_exec()</b> or <b>pcre[16|32]_dfa_exec()</b> in
the usual way.
</P>
<P>
However, if you passed a pointer to custom character tables when the pattern
was compiled (the <i>tableptr</i> argument of <b>pcre[16|32]_compile()</b>), you
must now pass a similar pointer to <b>pcre[16|32]_exec()</b> or
<b>pcre[16|32]_dfa_exec()</b>, because the value saved with the compiled pattern
will obviously be nonsense. A field in a <b>pcre[16|32]_extra</b> block is used
to pass this data, as described in the
<a href="pcreapi.html#extradata">section on matching a pattern</a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
<P>
If you did not provide custom character tables when the pattern was compiled,
the pointer in the compiled pattern is NULL, which causes the matching
functions to use PCRE's internal tables. Thus, you do not need to take any
special action at run time in this case.
</P>
<P>
If you saved study data with the compiled pattern, you need to create your own
<b>pcre[16|32]_extra</b> data block and set the <i>study_data</i> field to point to the
reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA bit in the
<i>flags</i> field to indicate that study data is present. Then pass the
<b>pcre[16|32]_extra</b> block to the matching function in the usual way. If the
pattern was studied for just-in-time optimization, that data cannot be saved,
and so is lost by a save/restore cycle.
</P>
<br><a name="SEC4" href="#TOC1">COMPATIBILITY WITH DIFFERENT PCRE RELEASES</a><br>
<P>
In general, it is safest to recompile all saved patterns when you update to a
new PCRE release, though not all updates actually require this.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 June 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,110 @@
<html>
<head>
<title>pcresample specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresample man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
PCRE SAMPLE PROGRAM
</b><br>
<P>
A simple, complete demonstration program, to get you started with using PCRE,
is supplied in the file <i>pcredemo.c</i> in the PCRE distribution. A listing of
this program is given in the
<a href="pcredemo.html"><b>pcredemo</b></a>
documentation. If you do not have a copy of the PCRE distribution, you can save
this listing to re-create <i>pcredemo.c</i>.
</P>
<P>
The demonstration program, which uses the original PCRE 8-bit library, compiles
the regular expression that is its first argument, and matches it against the
subject string in its second argument. No PCRE options are set, and default
character tables are used. If matching succeeds, the program outputs the
portion of the subject that matched, together with the contents of any captured
substrings.
</P>
<P>
If the -g option is given on the command line, the program then goes on to
check for further matches of the same regular expression in the same subject
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
</P>
<P>
If PCRE is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
this command:
<pre>
gcc -o pcredemo pcredemo.c -lpcre
</pre>
If PCRE is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE installed in
<i>/usr/local</i>, you can compile the demonstration program using a command
like this:
<pre>
gcc -o pcredemo -I/usr/local/include pcredemo.c -L/usr/local/lib -lpcre
</pre>
In a Windows environment, if you want to statically link the program against a
non-dll <b>pcre.a</b> file, you must uncomment the line that defines PCRE_STATIC
before including <b>pcre.h</b>, because otherwise the <b>pcre_malloc()</b> and
<b>pcre_free()</b> exported functions will be declared
<b>__declspec(dllimport)</b>, with unwanted results.
</P>
<P>
Once you have compiled and linked the demonstration program, you can run simple
tests like this:
<pre>
./pcredemo 'cat|dog' 'the cat sat on the mat'
./pcredemo -g 'cat|dog' 'the dog sat on the cat'
</pre>
Note that there is a much more comprehensive test program, called
<a href="pcretest.html"><b>pcretest</b></a>,
which supports many more facilities for testing regular expressions and all
three PCRE libraries (8-bit, 16-bit, and 32-bit). The
<a href="pcredemo.html"><b>pcredemo</b></a>
program is provided as a simple coding example.
</P>
<P>
If you try to run
<a href="pcredemo.html"><b>pcredemo</b></a>
when PCRE is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
<pre>
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
</pre>
This is caused by the way shared library support works on those systems. You
need to add
<pre>
-R/usr/local/lib
</pre>
(for example) to the compile command to get round this problem.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 10 January 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,225 @@
<html>
<head>
<title>pcrestack specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrestack man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
PCRE DISCUSSION OF STACK USAGE
</b><br>
<P>
When you call <b>pcre[16|32]_exec()</b>, it makes use of an internal function
called <b>match()</b>. This calls itself recursively at branch points in the
pattern, in order to remember the state of the match so that it can back up and
try a different alternative if the first one fails. As matching proceeds deeper
and deeper into the tree of possibilities, the recursion depth increases. The
<b>match()</b> function is also called in other circumstances, for example,
whenever a parenthesized sub-pattern is entered, and in certain cases of
repetition.
</P>
<P>
Not all calls of <b>match()</b> increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the result of
the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead.
</P>
<P>
The above comments apply when <b>pcre[16|32]_exec()</b> is run in its normal
interpretive manner. If the pattern was studied with the
PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling was successful, and
the options passed to <b>pcre[16|32]_exec()</b> were not incompatible, the matching
process uses the JIT-compiled code instead of the <b>match()</b> function. In
this case, the memory requirements are handled entirely differently. See the
<a href="pcrejit.html"><b>pcrejit</b></a>
documentation for details.
</P>
<P>
The <b>pcre[16|32]_dfa_exec()</b> function operates in an entirely different way,
and uses recursion only when there is a regular expression recursion or
subroutine call in the pattern. This includes the processing of assertion and
"once-only" subpatterns, which are handled like subroutine calls. Normally,
these are never very deep, and the limit on the complexity of
<b>pcre[16|32]_dfa_exec()</b> is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause <b>pcre[16|32]_dfa_exec()</b> to run out of stack. At
present, there is no protection against this.
</P>
<P>
The comments that follow do NOT apply to <b>pcre[16|32]_dfa_exec()</b>; they are
relevant only for <b>pcre[16|32]_exec()</b> without the JIT optimization.
</P>
<br><b>
Reducing <b>pcre[16|32]_exec()</b>'s stack usage
</b><br>
<P>
Each time that <b>match()</b> is actually called recursively, it uses memory
from the process stack. For certain kinds of pattern and data, very large
amounts of stack may be needed, despite the recognition of "tail recursion".
You can often reduce the amount of recursion, and therefore the amount of stack
used, by modifying the pattern that is being matched. Consider, for example,
this pattern:
<pre>
([^&#60;]|&#60;(?!inet))+
</pre>
It matches from wherever it starts until it encounters "&#60;inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
parenthesis is processed, a recursion occurs, so this formulation uses a stack
frame for each matched character. For a long string, a lot of stack is
required. Consider now this rewritten pattern, which matches exactly the same
strings:
<pre>
([^&#60;]++|&#60;(?!inet))+
</pre>
This uses very much less stack, because runs of characters that do not contain
"&#60;" are "swallowed" in one item inside the parentheses. Recursion happens only
when a "&#60;" character that is not followed by "inet" is encountered (and we
assume this is relatively rare). A possessive quantifier is used to stop any
backtracking into the runs of non-"&#60;" characters, but that is not related to
stack usage.
</P>
<P>
This example shows that one way of avoiding stack problems when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
</P>
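<P>
To make the difference concrete, the sketch below counts how many times the
outer group would iterate for each of the two formulations. It is only a
hand-written model of the behaviour described above, not PCRE's matcher. The
function names are hypothetical, and the model assumes one recursion per
iteration of the outer group, as the text states.
</P>

```c
#include <stddef.h>
#include <string.h>

/* True if the text at p starts with "<inet", which is where both patterns stop. */
static int at_inet_tag(const char *p)
{
    return strncmp(p, "<inet", 5) == 0;
}

/* Model of ([^<]|<(?!inet))+ : every matched character is one group iteration. */
size_t iterations_single_char(const char *s)
{
    size_t n = 0;
    while (*s != '\0' && !at_inet_tag(s)) {
        s++;
        n++;
    }
    return n;
}

/* Model of ([^<]++|<(?!inet))+ : a whole run of non-'<' characters is
   swallowed in one group iteration; a lone '<' is one iteration too. */
size_t iterations_with_runs(const char *s)
{
    size_t n = 0;
    while (*s != '\0' && !at_inet_tag(s)) {
        if (*s == '<') {
            s++;                                   /* '<' not followed by "inet" */
        } else {
            while (*s != '\0' && *s != '<') s++;   /* swallow the whole run */
        }
        n++;
    }
    return n;
}
```

<P>
For the subject "abc&#60;b&#62;def&#60;inet&#62;x", the first function returns 9
and the second returns 3. The gap grows with the length of the runs between
"&#60;" characters.
</P>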
<br><b>
Compiling PCRE to use heap instead of stack for <b>pcre[16|32]_exec()</b>
</b><br>
<P>
In environments where stack memory is constrained, you might want to compile
PCRE to use heap memory instead of stack for remembering back-up points when
<b>pcre[16|32]_exec()</b> is running. This makes it run a lot more slowly, however.
Details of how to do this are given in the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
documentation. When built in this way, instead of using the stack, PCRE obtains
and frees memory by calling the functions that are pointed to by the
<b>pcre[16|32]_stack_malloc</b> and <b>pcre[16|32]_stack_free</b> variables. By
default, these point to <b>malloc()</b> and <b>free()</b>, but you can replace
the pointers to cause PCRE to use your own functions. Since the block sizes are
always the same, and are always freed in reverse order, it may be possible to
implement customized memory handlers that are more efficient than the standard
functions.
</P>
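<P>
As an illustration of such a customized handler, here is a minimal sketch of a
cache that keeps freed blocks on a last-in, first-out list and hands them out
again. The names <b>frame_alloc()</b>, <b>frame_free()</b> and
<b>frame_cache_release()</b> are hypothetical. The sketch is not thread-safe
and does not depend on PCRE; it only assumes the allocation pattern described
above (equal block sizes, freed in reverse order).
</P>

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The extra union members exist only
   to keep the payload that follows suitably aligned. */
typedef union block_header {
    struct {
        size_t size;                 /* payload size the block was created with */
        union block_header *next;    /* next cached block, while on the free list */
    } info;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} block_header;

static block_header *free_list = NULL;   /* most recently freed block first */

/* Reuse the most recently freed block if it is big enough, else call malloc(). */
void *frame_alloc(size_t size)
{
    block_header *h = free_list;
    if (h != NULL && h->info.size >= size) {
        free_list = h->info.next;
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    return h + 1;                         /* payload starts after the header */
}

/* Push the block onto the free list instead of returning it to the system. */
void frame_free(void *block)
{
    if (block == NULL) return;
    block_header *h = (block_header *)block - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Give all cached blocks back to the system, e.g. after matching is finished. */
void frame_cache_release(void)
{
    while (free_list != NULL) {
        block_header *h = free_list;
        free_list = h->info.next;
        free(h);
    }
}
```

<P>
The two functions have the same signatures as <b>malloc()</b> and
<b>free()</b>, so in a program linked against a PCRE library that was built to
use the heap they could be assigned to the <b>pcre[16|32]_stack_malloc</b> and
<b>pcre[16|32]_stack_free</b> variables before matching starts.
</P>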
<br><b>
Limiting <b>pcre[16|32]_exec()</b>'s stack usage
</b><br>
<P>
You can set limits on the number of times that <b>match()</b> is called, both in
total and recursively. If a limit is exceeded, <b>pcre[16|32]_exec()</b> returns an
error code. Setting suitable limits should prevent it from running out of
stack. The default values of the limits are very large, and are unlikely ever
to take effect. They can be changed when PCRE is built, and they can also be set when
<b>pcre[16|32]_exec()</b> is called. For details of these interfaces, see the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
documentation and the
<a href="pcreapi.html#extradata">section on extra data for <b>pcre[16|32]_exec()</b></a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
<P>
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8MB, you should set
the limit at 16000 recursions (8MB divided by 500 bytes, rounded down). A 64MB
stack, on the other hand, can support around 128000 recursions.
</P>
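<P>
The arithmetic behind this rule of thumb can be written down as a small helper.
This is a sketch only: the function name is hypothetical, and the frame size is
whatever estimate you choose to trust (500 bytes here). The result is the kind
of value you might store in the <i>match_limit_recursion</i> field of a
<b>pcre[16|32]_extra</b> block, with the PCRE_EXTRA_MATCH_LIMIT_RECURSION bit
set in its <i>flags</i> field.
</P>

```c
#include <stddef.h>

/* Number of match() recursions that fit into stack_bytes when each recursion
   is assumed to need frame_bytes of stack. Returns 0 if frame_bytes is 0. */
unsigned long recursion_limit_for_stack(size_t stack_bytes, size_t frame_bytes)
{
    if (frame_bytes == 0) return 0;
    return (unsigned long)(stack_bytes / frame_bytes);
}
```

<P>
With 500-byte frames this gives 16777 for an 8MB stack and 134217 for a 64MB
stack. The rounded-down figures quoted above leave some headroom for the rest
of the program.
</P>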
<P>
In Unix-like environments, the <b>pcretest</b> test program has a command line
option (<b>-S</b>) that can be used to increase the size of its stack. As long
as the stack is large enough, another option (<b>-M</b>) can be used to find the
smallest limits that allow a particular pattern to match a given subject
string. This is done by calling <b>pcre[16|32]_exec()</b> repeatedly with different
limits.
</P>
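<P>
The text does not say how <b>pcretest</b> chooses the limits it tries. One
plausible approach, assuming that a match which succeeds with some limit also
succeeds with every larger limit, is a binary search. The sketch below shows
that idea with hypothetical names; <b>probe_threshold()</b> is only a stand-in
for "run the match with this limit and report whether it succeeded".
</P>

```c
/* Callback type: returns non-zero if the match succeeds with the given limit. */
typedef int (*limit_probe)(unsigned long limit, void *ctx);

/* Smallest limit in 1..max_limit for which probe() succeeds, or 0 if even
   max_limit is not enough. Assumes success is monotonic in the limit. */
unsigned long smallest_passing_limit(limit_probe probe, void *ctx,
                                     unsigned long max_limit)
{
    unsigned long lo = 1, hi = max_limit;
    if (max_limit == 0 || !probe(hi, ctx)) return 0;
    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (probe(mid, ctx)) {
            hi = mid;          /* mid is enough: the answer is mid or lower */
        } else {
            lo = mid + 1;      /* mid is too small */
        }
    }
    return lo;
}

/* Stand-in probe for demonstration: "succeeds" once the limit reaches the
   threshold that ctx points to. A real probe would call the matcher. */
int probe_threshold(unsigned long limit, void *ctx)
{
    return limit >= *(const unsigned long *)ctx;
}
```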
<br><b>
Obtaining an estimate of stack usage
</b><br>
<P>
The actual amount of stack used per recursion can vary quite a lot, depending
on the compiler that was used to build PCRE and the optimization or debugging
options that were set for it. The rule of thumb value of 500 bytes mentioned
above may be larger or smaller than what is actually needed. A better
approximation can be obtained by running this command:
<pre>
pcretest -m -C
</pre>
The <b>-C</b> option causes <b>pcretest</b> to output information about the
options with which PCRE was compiled. When <b>-m</b> is also given (before
<b>-C</b>), information about stack use is given in a line like this:
<pre>
Match recursion uses stack: approximate frame size = 640 bytes
</pre>
The value is approximate because some recursions need a bit more (up to perhaps
16 more bytes).
</P>
<P>
If the above command is given when PCRE is compiled to use the heap instead of
the stack for recursion, the value that is output is the size of each block
that is obtained from the heap.
</P>
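<P>
If you want to pick up this figure in a build or test script written in C, a
helper along the following lines could extract it. This is a sketch with a
hypothetical function name, and it assumes that the line has exactly the
wording shown in the sample output above.
</P>

```c
#include <stddef.h>
#include <stdio.h>

/* Extract the frame size from a line of "pcretest -m -C" output.
   Returns 0 if the line does not have the expected wording. */
size_t parse_frame_size(const char *line)
{
    size_t bytes = 0;
    /* The leading space in the format skips any indentation before "Match". */
    if (sscanf(line,
               " Match recursion uses stack: approximate frame size = %zu bytes",
               &bytes) != 1) {
        return 0;
    }
    return bytes;
}
```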
<br><b>
Changing stack size in Unix-like systems
</b><br>
<P>
In Unix-like environments, the stack is rarely a problem unless very long
strings are involved, though the default limit on stack size varies from
system to system. Values from 8MB to 64MB are common. You can find your
default limit by running the command:
<pre>
ulimit -s
</pre>
Unfortunately, the effect of running out of stack is often SIGSEGV, though
sometimes a more explicit error message is given. You can normally increase the
limit on stack size by code such as this:
<pre>
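#include &#60;sys/resource.h&#62;

struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);      /* read the soft and hard limits */
rlim.rlim_cur = 100*1024*1024;       /* ask for a 100MB soft limit */
setrlimit(RLIMIT_STACK, &rlim);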
</pre>
This reads the current limits (soft and hard) using <b>getrlimit()</b>, then
attempts to increase the soft limit to 100MB using <b>setrlimit()</b>. The
attempt fails if the requested soft limit exceeds the hard limit. You must
do this before calling <b>pcre[16|32]_exec()</b>.
</P>
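<P>
A slightly more defensive variant of the same idea is sketched below. The
function name is hypothetical. It checks the return values, leaves the limit
alone if it is already large enough, and clamps the request to the hard limit
rather than failing.
</P>

```c
#include <sys/resource.h>

/* Try to make the soft stack limit at least wanted_bytes.
   Returns 0 on success (including when the request had to be clamped to the
   hard limit) and -1 if getrlimit() or setrlimit() fails. */
int raise_stack_soft_limit(rlim_t wanted_bytes)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

    /* Nothing to do if the soft limit is unlimited or already big enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted_bytes) return 0;

    /* An unprivileged process cannot raise the soft limit above the hard one. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted_bytes > rlim.rlim_max) {
        wanted_bytes = rlim.rlim_max;
    }

    rlim.rlim_cur = wanted_bytes;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

<P>
A program would call, for example,
<b>raise_stack_soft_limit((rlim_t)100 * 1024 * 1024)</b> early in
<b>main()</b>, before the first call to <b>pcre[16|32]_exec()</b>.
</P>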
<br><b>
Changing stack size in Mac OS X
</b><br>
<P>
Using <b>setrlimit()</b>, as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
<a href="http://developer.apple.com/qa/qa2005/qa1419.html">http://developer.apple.com/qa/qa2005/qa1419.html</a>.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 24 June 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>


@ -0,0 +1,521 @@
<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
\x where x is non-alphanumeric is a literal x
\Q...\E treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or backreference
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
. any character except newline;
in dotall mode, any character whatsoever
\C one data unit, even in UTF mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal white space character
\H a character that is not a horizontal white space character
\N a character that is not a newline
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
\R a newline sequence
\s a white space character
\S a character that is not a white space character
\v a vertical white space character
\V a character that is not a vertical white space character
\w a "word" character
\W a "non-word" character
\X a Unicode extended grapheme cluster
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
Xan Alphanumeric: union of properties L and N
Xps POSIX space: property Z or tab, NL, VT, FF, CR
Xsp Perl space: property Z or tab, NL, FF, CR
Xwd Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
[...] positive character class
[^...] negative character class
[x-y] range (can be used for hex characters)
[[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set
alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space white space
upper upper case letter
word same as \w
xdigit hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
\b word boundary
\B not a word boundary
^ start of subject
also after internal newline in multiline mode
\A start of subject
$ end of subject
also before newline at end of subject
also before internal newline in multiline mode
\Z end of subject
also before newline at end of subject
\z end of subject
\G first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
\K reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
(...) capturing group
(?&#60;name&#62;...) named capturing group (Perl)
(?'name'...) named capturing group (Perl)
(?P&#60;name&#62;...) named capturing group (Python)
(?:...) non-capturing group
(?|...) non-capturing group; reset group numbers for
capturing groups in each alternative
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
(?&#62;...) atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
(?#....) comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
(?i) caseless
(?J) allow duplicate names
(?m) multiline
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) extended (ignore white space)
(?-...) unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
(*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) positive look ahead
(?!...) negative look ahead
(?&#60;=...) positive look behind
(?&#60;!...) negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g{-n} relative reference by number
\k&#60;name&#62; reference by name (Perl)
\k'name' reference by name (Perl)
\g{name} reference by name (Perl)
\k{name} reference by name (.NET)
(?P=name) reference by name (Python)
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
(?n) call subpattern by absolute number
(?+n) call subpattern by relative number
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P&#62;name) call subpattern by name (Python)
\g&#60;name&#62; call subpattern by name (Oniguruma)
\g'name' call subpattern by name (Oniguruma)
\g&#60;n&#62; call subpattern by absolute number (Oniguruma)
\g'n' call subpattern by absolute number (Oniguruma)
\g&#60;+n&#62; call subpattern by relative number (PCRE extension)
\g'+n' call subpattern by relative number (PCRE extension)
\g&#60;-n&#62; call subpattern by relative number (PCRE extension)
\g'-n' call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n)... absolute reference condition
(?(+n)... relative reference condition
(?(-n)... relative reference condition
(?(&#60;name&#62;)... named reference condition (Perl)
(?('name')... named reference condition (Perl)
(?(name)... named reference condition (PCRE)
(?(R)... overall recursion condition
(?(Rn)... specific group recursion condition
(?(R&name)... specific recursion condition
(?(DEFINE)... define subpattern for reference
(?(assert)... assertion condition
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
(*SKIP) advance to current matching position
(*SKIP:NAME) advance to position corresponding to an earlier
(*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
<pre>
(*CR) carriage return only
(*LF) linefeed only
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
(*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout
(?Cn) callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 11 November 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>

File diff suppressed because it is too large


@ -0,0 +1,270 @@
<html>
<head>
<title>pcreunicode specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreunicode man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT
</b><br>
<P>
As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30) and
UTF-32 (from release 8.32), by means of two additional libraries. They can be
built as well as, or instead of, the 8-bit library.
</P>
<br><b>
UTF-8 SUPPORT
</b><br>
<P>
In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF
support, and, in addition, you must call
<a href="pcre_compile.html"><b>pcre_compile()</b></a>
with the PCRE_UTF8 option flag, or the pattern must start with the sequence
(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
subject strings that are matched against it are treated as UTF-8 strings
instead of strings of individual 1-byte characters.
</P>
<br><b>
UTF-16 AND UTF-32 SUPPORT
</b><br>
<P>
In order to process UTF-16 or UTF-32 strings, you must build PCRE's 16-bit or
32-bit library with UTF support, and, in addition, you must call
<a href="pcre16_compile.html"><b>pcre16_compile()</b></a>
or
<a href="pcre32_compile.html"><b>pcre32_compile()</b></a>
with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
(*UTF), which can be used with either library. When UTF mode is set, both the
pattern and any subject strings that are matched against it are treated as
UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
characters.
</P>
<br><b>
UTF SUPPORT OVERHEAD
</b><br>
<P>
If you compile PCRE with UTF support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
to testing the PCRE_UTF[8|16|32] flag occasionally, so should not be very big.
</P>
<br><b>
UNICODE PROPERTY SUPPORT
</b><br>
<P>
If PCRE is built with Unicode character property support (which implies UTF
support), the escape sequences \p{..}, \P{..}, and \X can be used.
The available properties that can be tested are limited to the general
category properties such as Lu for an upper case letter or Nd for a decimal
number, the Unicode script names such as Arabic or Han, and the derived
properties Any and L&. Full lists are given in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
and
<a href="pcresyntax.html"><b>pcresyntax</b></a>
documentation. Only the short names for properties are supported. For example,
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE does not support this.
<a name="utf8strings"></a></P>
<br><b>
Validity of UTF-8 strings
</b><br>
<P>
When you set the PCRE_UTF8 flag, the byte strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. The entire string is checked before any other processing takes
place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629,
which are themselves derived from the Unicode specification. Earlier releases
of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit
values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0
to U+10FFFF, excluding the surrogate area and the non-characters.
</P>
<P>
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
where they are used in pairs to encode code points with values greater than
0xFFFF. The code points that are encoded by UTF-16 pairs are available
independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
UTF-32.)
</P>
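<P>
The pairing arithmetic is standard UTF-16 and can be sketched as follows. The
function names are hypothetical, and the caller is assumed to pass a valid
high surrogate (0xD800 to 0xDBFF) and low surrogate (0xDC00 to 0xDFFF), or a
code point above 0xFFFF, respectively.
</P>

```c
#include <stdint.h>

/* Combine a high and a low surrogate into the code point they encode. */
uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10) + ((uint32_t)low - 0xDC00u);
}

/* Split a code point in the range 0x10000..0x10FFFF into a surrogate pair. */
void encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000u;                 /* 20 significant bits remain */
    *high = (uint16_t)(0xD800u + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (v & 0x3FFu)); /* bottom 10 bits */
}
```

<P>
For example, the pair 0xD83D 0xDE00 decodes to U+1F600, and encoding U+1F600
gives the same pair back.
</P>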
<P>
Also excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF
and the last two code points in each plane, U+??FFFE and U+??FFFF.
</P>
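<P>
Taken together, the range, surrogate and non-character rules described above
amount to a simple test on a code point value. The following is a sketch of
those rules as stated in this section, with a hypothetical function name; it
is not PCRE's own code.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp passes the checks described in the text: it lies in the range
   U+0 to U+10FFFF and is neither a surrogate nor a non-character. */
bool is_allowed_code_point(uint32_t cp)
{
    if (cp > 0x10FFFFu) return false;                    /* beyond the Unicode range */
    if (cp >= 0xD800u && cp <= 0xDFFFu) return false;    /* surrogate area */
    if (cp >= 0xFDD0u && cp <= 0xFDEFu) return false;    /* non-characters */
    if ((cp & 0xFFFEu) == 0xFFFEu) return false;         /* U+??FFFE and U+??FFFF */
    return true;
}
```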
<P>
If an invalid UTF-8 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first byte
of the failing character. The run-time functions <b>pcre_exec()</b> and
<b>pcre_dfa_exec()</b> also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do this.
</P>
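<P>
To show what "the offset to the first byte of the failing character" means in
practice, here is a minimal sketch of a validator that follows the rules
described in this section. It is an independent illustration with a
hypothetical function name, not the check that PCRE itself performs, and it
reports only the offset, not a reason code.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the offset of the first byte of the first invalid character in the
   len bytes at s, or -1 if the whole string passes the checks. */
long first_bad_utf8_offset(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* number of continuation bytes expected */
        uint32_t cp, min;    /* decoded value and smallest value for this length */

        if (c < 0x80) { i++; continue; }                 /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation byte, or 5/6-byte lead */

        if (extra > len - i - 1) return (long)i;         /* truncated character */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;   /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                        /* overlong encoding */
        if (cp > 0x10FFFF) return (long)i;                   /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;    /* surrogate */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
            return (long)i;                                  /* non-character */

        i += extra + 1;
    }
    return -1;
}
```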
<P>
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE
assumes that the pattern or subject it is given (respectively) contains only
valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8 string.
</P>
<P>
Note that passing PCRE_NO_UTF8_CHECK to <b>pcre_compile()</b> just disables the
check for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>.
</P>
<P>
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the result
is undefined and your program may crash.
<a name="utf16strings"></a></P>
<br><b>
Validity of UTF-16 strings
</b><br>
<P>
When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are
passed as patterns and subjects are (by default) checked for validity on entry
to the relevant functions. Values other than those in the surrogate range
U+D800 to U+DFFF are independent code points. Values in the surrogate range
must be used in pairs in the correct manner.
</P>
<P>
Excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF
and the last two code points in each plane, U+??FFFE and U+??FFFF.
</P>
<P>
If an invalid UTF-16 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first data
unit of the failing character. The run-time functions <b>pcre16_exec()</b> and
<b>pcre16_dfa_exec()</b> also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do this.
</P>
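<P>
A minimal sketch of such a check for 16-bit data units is shown below. It is an
illustration under the rules stated in this section, not PCRE's implementation,
and <i>utf16_first_invalid</i> is a hypothetical name. It returns the offset,
in data units, of the first unit of the failing character.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): return -1 if s[0..length)
   passes the check, otherwise the offset in 16-bit units of the first
   data unit of the failing character. */
long utf16_first_invalid(const uint16_t *s, size_t length)
{
    size_t i = 0;
    while (i < length) {
        uint32_t cp = s[i];
        size_t units = 1;

        /* A low surrogate may only appear as the second half of a pair. */
        if (cp >= 0xDC00 && cp <= 0xDFFF) return (long)i;

        if (cp >= 0xD800 && cp <= 0xDBFF) {              /* high surrogate */
            if (i + 1 >= length) return (long)i;         /* second half missing */
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) return (long)i;
            cp = 0x10000 + (((cp - 0xD800) << 10) |
                            (uint32_t)(s[i + 1] - 0xDC00));
            units = 2;
        }

        /* "Non-Character" code points: U+FDD0..U+FDEF, U+??FFFE, U+??FFFF. */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
            return (long)i;

        i += units;
    }
    return -1;
}
```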
<P>
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance. If you set
the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that
the pattern or subject it is given (respectively) contains only valid UTF-16
sequences. In this case, it does not diagnose an invalid UTF-16 string.
However, if an invalid string is passed, the result is undefined.
<a name="utf32strings"></a></P>
<br><b>
Validity of UTF-32 strings
</b><br>
<P>
When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are
passed as patterns and subjects are (by default) checked for validity on entry
to the relevant functions. This check allows only values in the range U+0
to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the
"Non-Character" code points, which are U+FDD0 to U+FDEF and the last two
characters in each plane, U+??FFFE and U+??FFFF.
</P>
<P>
If an invalid UTF-32 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first data
unit of the failing character. The run-time functions <b>pcre32_exec()</b> and
<b>pcre32_dfa_exec()</b> also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do this.
</P>
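<P>
Because every UTF-32 data unit is a complete code point, the equivalent check
is a simple range test on each unit. The sketch below illustrates the rules
stated in this section; it is not PCRE's implementation, and
<i>utf32_first_invalid</i> is a hypothetical name.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): return -1 if s[0..length)
   passes the check, otherwise the offset in 32-bit units of the failing
   character. */
long utf32_first_invalid(const uint32_t *s, size_t length)
{
    size_t i;
    for (i = 0; i < length; i++) {
        uint32_t cp = s[i];
        if (cp > 0x10FFFF) return (long)i;                   /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;    /* surrogate area */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) ||                /* non-characters */
            (cp & 0xFFFE) == 0xFFFE)
            return (long)i;
    }
    return -1;
}
```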
<P>
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance. If you set
the PCRE_NO_UTF32_CHECK flag at compile time or at run time, PCRE assumes that
the pattern or subject it is given (respectively) contains only valid UTF-32
sequences. In this case, it does not diagnose an invalid UTF-32 string.
However, if an invalid string is passed, the result is undefined.
</P>
<br><b>
General comments about UTF modes
</b><br>
<P>
1. Code points less than 256 can be specified in patterns by either braced or
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger
values have to use braced sequences.
</P>
<P>
2. Octal numbers up to \777 are recognized, and in UTF-8 mode they match
two-byte characters for values greater than \177.
</P>
<P>
3. Repeat quantifiers apply to complete UTF characters, not to individual
data units, for example: \x{100}{3}.
</P>
<P>
4. The dot metacharacter matches one UTF character instead of a single data
unit.
</P>
<P>
5. The escape sequence \C can be used to match a single byte in UTF-8 mode, or
a single 16-bit data unit in UTF-16 mode, or a single 32-bit data unit in
UTF-32 mode, but its use can lead to some strange effects because it breaks up
multi-unit characters (see the description of \C in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation). The use of \C is not supported in the alternative matching
function <b>pcre[16|32]_dfa_exec()</b>, nor is it supported in UTF mode by the
JIT optimization of <b>pcre[16|32]_exec()</b>. If JIT optimization is requested
for a UTF pattern that contains \C, it will not succeed, and so the matching
will be carried out by the normal interpretive function.
</P>
<P>
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as in
non-UTF mode, all with values less than 256. This remains true even when PCRE
is built to include Unicode property support, because to do otherwise would
slow down PCRE in many common cases. Note in particular that this applies to
\b and \B, because they are defined in terms of \w and \W. If you really
want to test for a wider sense of, say, "digit", you can use explicit Unicode
property tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option,
the way that the character escapes work is changed so that Unicode properties
are used to determine which characters match. There are more details in the
section on
<a href="pcrepattern.html#genericchartypes">generic character types</a>
in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation.
</P>
<P>
7. Similarly, characters that match the POSIX named character classes are all
low-valued characters, unless the PCRE_UCP option is set.
</P>
<P>
8. However, the horizontal and vertical white space matching escapes (\h, \H,
\v, and \V) do match all the appropriate Unicode characters, whether or not
PCRE_UCP is set.
</P>
<P>
9. Case-insensitive matching applies only to characters whose values are less
than 128, unless PCRE is built with Unicode property support. A few Unicode
characters such as Greek sigma have more than two code points that are
case-equivalent. Up to and including PCRE release 8.31, only one-to-one case
mappings were supported, but later releases (with Unicode property support) do
treat as case-equivalent all versions of characters such as Greek sigma.
</P>
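<P>
Comments 2 to 4 above depend on the difference between a character and the
data units that encode it. As a sketch of that difference in UTF-8 mode, the
following encoder returns the number of bytes a code point occupies. The name
<i>utf8_encode</i> is a hypothetical helper, not a PCRE function, and it makes
no attempt to reject surrogates or non-characters.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): write the UTF-8 encoding of
   cp into out, which must have room for 4 bytes. Returns the number of
   bytes written, or 0 if cp is greater than 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* up to octal \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* includes octal \200 to \777 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

<P>
With this encoder, \177 occupies one byte while \200 and \777 each occupy two,
as stated in comment 2. Likewise \x{100} occupies two bytes, so a quantifier
such as {3} or a single dot has to consume both bytes together in order to act
on the whole character.
</P>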
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 11 November 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>