/*
 * DFA routines
 * This file is #included by regexec.c.
 *
 * Copyright (c) 1998, 1999 Henry Spencer.  All rights reserved.
 *
 * Development of this software was funded, in part, by Cray Research Inc.,
 * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
 * Corporation, none of whom are responsible for the results.  The author
 * thanks all of them.
 *
 * Redistribution and use in source and binary forms -- with or without
 * modification -- are permitted for any purpose, provided that
 * redistributions in source form retain this entire copyright notice and
 * indicate the origin and nature of any modifications.
 *
 * I'd appreciate being given credit for this package in the documentation
 * of software which uses it, but that is not a requirement.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL
 * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
 * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
 * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * src/backend/regex/rege_dfa.c
 *
 */

/*
 * longest - longest-preferred matching engine
 *
 * On success, returns match endpoint address.  Returns NULL on no match.
 * Internal errors also return NULL, with v->err set.
 */
static chr *
longest(struct vars *v,
        struct dfa *d,
        chr *start,         /* where the match should start */
        chr *stop,          /* match must end at or before here */
        int *hitstopp)      /* record whether hit v->stop, if non-NULL */
{
    chr        *cp;
    chr        *realstop = (stop == v->stop) ? stop : stop + 1;
    color       co;
    struct sset *css;
    struct sset *ss;
    chr        *post;
    int         i;
    struct colormap *cm = d->cm;

    /* prevent "uninitialized variable" warnings */
    if (hitstopp != NULL)
        *hitstopp = 0;

    /* fast path for matchall NFAs */
    if (d->cnfa->flags & MATCHALL)
    {
        size_t      nchr = stop - start;
        size_t      maxmatchall = d->cnfa->maxmatchall;

        if (nchr < d->cnfa->minmatchall)
            return NULL;
        if (maxmatchall == DUPINF)
        {
            if (stop == v->stop && hitstopp != NULL)
                *hitstopp = 1;
        }
        else
        {
            if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
                *hitstopp = 1;
            if (nchr > maxmatchall)
                return start + maxmatchall;
        }
        return stop;
    }

    /* initialize */
    css = initialize(v, d, start);
    if (css == NULL)
        return NULL;
    cp = start;

    /* startup */
    FDEBUG(("+++ startup +++\n"));
    if (cp == v->start)
    {
        co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));
    }
    else
    {
        co = GETCOLOR(cm, *(cp - 1));
        FDEBUG(("char %c, color %ld\n", (char) *(cp - 1), (long) co));
    }
    css = miss(v, d, css, co, cp, start);
    if (css == NULL)
        return NULL;
    css->lastseen = cp;

    /*
     * This is the main text-scanning loop.  It seems worth having two copies
     * to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
     * builds, when you're not actively tracing.
     */
#ifdef REG_DEBUG
    if (v->eflags & REG_FTRACE)
    {
        while (cp < realstop)
        {
            FDEBUG(("+++ at c%d +++\n", (int) (css - d->ssets)));
            co = GETCOLOR(cm, *cp);
            FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
        }
    }
    else
#endif
    {
        while (cp < realstop)
        {
            co = GETCOLOR(cm, *cp);
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
        }
    }

    if (ISERR())
        return NULL;

    /* shutdown */
    FDEBUG(("+++ shutdown at c%d +++\n", (int) (css - d->ssets)));
    if (cp == v->stop && stop == v->stop)
    {
        if (hitstopp != NULL)
            *hitstopp = 1;
        co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));
        ss = miss(v, d, css, co, cp, start);
        if (ISERR())
            return NULL;
        /* special case: match ended at eol? */
        if (ss != NULL && (ss->flags & POSTSTATE))
            return cp;
        else if (ss != NULL)
            ss->lastseen = cp;  /* to be tidy */
    }

    /* find last match, if any */
    post = d->lastpost;
    for (ss = d->ssets, i = d->nssused; i > 0; ss++, i--)
        if ((ss->flags & POSTSTATE) && post != ss->lastseen &&
            (post == NULL || post < ss->lastseen))
            post = ss->lastseen;
    if (post != NULL)           /* found one */
        return post - 1;

    return NULL;
}

/*
 * shortest - shortest-preferred matching engine
 *
 * On success, returns match endpoint address.  Returns NULL on no match.
 * Internal errors also return NULL, with v->err set.
 */
static chr *
|
|
|
|
|
shortest(struct vars *v,
|
|
|
|
|
struct dfa *d,
|
|
|
|
|
chr *start, /* where the match should start */
|
|
|
|
|
chr *min, /* match must end at or after here */
|
|
|
|
|
chr *max, /* match must end at or before here */
|
         chr **coldp,           /* store coldstart pointer here, if non-NULL */
         int *hitstopp)         /* record whether hit v->stop, if non-NULL */
{
    chr        *cp;
    chr        *realmin = (min == v->stop) ? min : min + 1;
    chr        *realmax = (max == v->stop) ? max : max + 1;
    color       co;
    struct sset *css;
    struct sset *ss;
    struct colormap *cm = d->cm;
    /* prevent "uninitialized variable" warnings */
    if (coldp != NULL)
        *coldp = NULL;
    if (hitstopp != NULL)
        *hitstopp = 0;
Recognize "match-all" NFAs within the regex engine.
This builds on the previous "rainbow" patch to detect NFAs that will
match any string, though possibly with constraints on the string length.
This definition is chosen to match constructs such as ".*", ".+", and
".{1,100}". Recognizing such an NFA after the optimization pass is
fairly cheap, since we basically just have to verify that all arcs
are RAINBOW arcs and count the number of steps to the end state.
(Well, there's a bit of complication with pseudo-color arcs for string
boundary conditions, but not much.)
Once we have these markings, the regex executor functions longest(),
shortest(), and matchuntil() don't have to expend per-character work
to determine whether a given substring satisfies such an NFA; they
just need to check its length against the bounds. Since some matching
problems require O(N) invocations of these functions, we've reduced
the runtime for an N-character string from O(N^2) to O(N). Of course,
this is no help for non-matchall sub-patterns, but those usually have
constraints that allow us to avoid needing O(N) substring checks in the
first place. It's precisely the unconstrained "match-all" cases that
cause the most headaches.
This is part of a patch series that in total reduces the regex engine's
runtime by about a factor of four on a large corpus of real-world regexes.
Patch by me, reviewed by Joel Jacobson
Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
    /* fast path for matchall NFAs */
    if (d->cnfa->flags & MATCHALL)
    {
        size_t      nchr = min - start;

        if (d->cnfa->maxmatchall != DUPINF &&
            nchr > d->cnfa->maxmatchall)
            return NULL;
        if ((max - start) < d->cnfa->minmatchall)
            return NULL;
        if (nchr < d->cnfa->minmatchall)
            min = start + d->cnfa->minmatchall;
        if (coldp != NULL)
            *coldp = start;
        /* there is no case where we should set *hitstopp */
        return min;
    }

    /* initialize */
    css = initialize(v, d, start);
    if (css == NULL)
        return NULL;
    cp = start;

    /* startup */
    FDEBUG(("--- startup ---\n"));
    if (cp == v->start)
    {
        co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));
    }
    else
    {
        co = GETCOLOR(cm, *(cp - 1));
        FDEBUG(("char %c, color %ld\n", (char) *(cp - 1), (long) co));
    }
    css = miss(v, d, css, co, cp, start);
    if (css == NULL)
        return NULL;
    css->lastseen = cp;
    ss = css;
    /*
     * This is the main text-scanning loop.  It seems worth having two copies
     * to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
     * builds, when you're not actively tracing.
     */
#ifdef REG_DEBUG
    if (v->eflags & REG_FTRACE)
    {
        while (cp < realmax)
        {
            FDEBUG(("--- at c%d ---\n", (int) (css - d->ssets)));
            co = GETCOLOR(cm, *cp);
            FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
            if ((ss->flags & POSTSTATE) && cp >= realmin)
                break;          /* NOTE BREAK OUT */
        }
    }
    else
#endif
    {
        while (cp < realmax)
        {
            co = GETCOLOR(cm, *cp);
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
            if ((ss->flags & POSTSTATE) && cp >= realmin)
                break;          /* NOTE BREAK OUT */
        }
    }

    if (ss == NULL)
        return NULL;

    if (coldp != NULL)          /* report last no-progress state set, if any */
        *coldp = lastcold(v, d);

    if ((ss->flags & POSTSTATE) && cp > min)
    {
        assert(cp >= realmin);
        cp--;
    }
    else if (cp == v->stop && max == v->stop)
    {
        co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));
        ss = miss(v, d, css, co, cp, start);
        /* match might have ended at eol */
        if ((ss == NULL || !(ss->flags & POSTSTATE)) && hitstopp != NULL)
            *hitstopp = 1;
    }

    if (ss == NULL || !(ss->flags & POSTSTATE))
        return NULL;

    return cp;
}
Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point. This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.
Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked. In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall. To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops. This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space. So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine. This avoids the overhead
of the LACON machinery entirely for these rather common cases.
The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely. So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares. (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
/*
 * matchuntil - incremental matching engine
 *
 * This is meant for use with a search-style NFA (that is, the pattern is
 * known to act as though it had a leading .*).  We determine whether a
 * match exists starting at v->start and ending at probe.  Multiple calls
 * require only O(N) time not O(N^2) so long as the probe values are
 * nondecreasing.  *lastcss and *lastcp must be initialized to NULL before
 * starting a series of calls.
 *
 * Returns 1 if a match exists, 0 if not.
 * Internal errors also return 0, with v->err set.
 */
static int
matchuntil(struct vars *v,
           struct dfa *d,
           chr *probe,          /* we want to know if a match ends here */
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
           struct sset **lastcss,   /* state storage across calls */
           chr **lastcp)        /* state storage across calls */
{
    chr        *cp = *lastcp;
    color       co;
    struct sset *css = *lastcss;
    struct sset *ss;
    struct colormap *cm = d->cm;
    /* fast path for matchall NFAs */
    if (d->cnfa->flags & MATCHALL)
    {
        size_t      nchr = probe - v->start;

        /*
         * It might seem that we should check maxmatchall too, but the .* at
         * the front of the pattern absorbs any extra characters (and it was
         * tacked on *after* computing minmatchall/maxmatchall).  Thus, we
         * should match if there are at least minmatchall characters.
         */
        if (nchr < d->cnfa->minmatchall)
            return 0;
        return 1;
    }
    /* initialize and startup, or restart, if necessary */
    if (cp == NULL || cp > probe)
    {
        cp = v->start;
        css = initialize(v, d, cp);
        if (css == NULL)
            return 0;

        FDEBUG((">>> startup >>>\n"));
        co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));

        css = miss(v, d, css, co, cp, v->start);
        if (css == NULL)
            return 0;
        css->lastseen = cp;
    }
    else if (css == NULL)
    {
        /* we previously found that no match is possible beyond *lastcp */
        return 0;
    }
    ss = css;

    /*
     * This is the main text-scanning loop.  It seems worth having two copies
     * to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
     * builds, when you're not actively tracing.
     */
#ifdef REG_DEBUG
    if (v->eflags & REG_FTRACE)
    {
        while (cp < probe)
        {
            FDEBUG((">>> at c%d >>>\n", (int) (css - d->ssets)));
            co = GETCOLOR(cm, *cp);
            FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, v->start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
        }
    }
    else
#endif
    {
        while (cp < probe)
        {
            co = GETCOLOR(cm, *cp);
            ss = css->outs[co];
            if (ss == NULL)
            {
                ss = miss(v, d, css, co, cp + 1, v->start);
                if (ss == NULL)
                    break;      /* NOTE BREAK OUT */
            }
            cp++;
            ss->lastseen = cp;
            css = ss;
        }
    }

    *lastcss = ss;
    *lastcp = cp;

    if (ss == NULL)
        return 0;               /* impossible match, or internal error */

    /* We need to process one more chr, or the EOS symbol, to check match */
    if (cp < v->stop)
    {
        FDEBUG((">>> at c%d >>>\n", (int) (css - d->ssets)));
        co = GETCOLOR(cm, *cp);
        FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
        ss = css->outs[co];
        if (ss == NULL)
            ss = miss(v, d, css, co, cp + 1, v->start);
    }
    else
    {
        assert(cp == v->stop);
        co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
        FDEBUG(("color %ld\n", (long) co));
        ss = miss(v, d, css, co, cp, v->start);
    }

    if (ss == NULL || !(ss->flags & POSTSTATE))
        return 0;

    return 1;
}

/*
 * lastcold - determine last point at which no progress had been made
 */
static chr *                    /* endpoint, or NULL */
lastcold(struct vars *v,
         struct dfa *d)
{
    struct sset *ss;
    chr        *nopr;
    int         i;

    nopr = d->lastnopr;
    if (nopr == NULL)
        nopr = v->start;
    for (ss = d->ssets, i = d->nssused; i > 0; ss++, i--)
        if ((ss->flags & NOPROGRESS) && nopr < ss->lastseen)
            nopr = ss->lastseen;
    return nopr;
}

/*
 * newdfa - set up a fresh DFA
 */
static struct dfa *
newdfa(struct vars *v,
       struct cnfa *cnfa,
       struct colormap *cm,
       struct smalldfa *sml)    /* preallocated space, may be NULL */
{
    struct dfa *d;
    size_t      nss = cnfa->nstates * 2;
    int         wordsper = (cnfa->nstates + UBITS - 1) / UBITS;
    struct smalldfa *smallwas = sml;

    assert(cnfa != NULL && cnfa->nstates != 0);

    if (nss <= FEWSTATES && cnfa->ncolors <= FEWCOLORS)
    {
        assert(wordsper == 1);
        if (sml == NULL)
        {
            sml = (struct smalldfa *) MALLOC(sizeof(struct smalldfa));
            if (sml == NULL)
            {
                ERR(REG_ESPACE);
                return NULL;
            }
        }
        d = &sml->dfa;
        d->ssets = sml->ssets;
        d->statesarea = sml->statesarea;
        d->work = &d->statesarea[nss];
        d->outsarea = sml->outsarea;
        d->incarea = sml->incarea;
        d->cptsmalloced = 0;
        d->mallocarea = (smallwas == NULL) ? (char *) sml : NULL;
    }
    else
    {
        d = (struct dfa *) MALLOC(sizeof(struct dfa));
        if (d == NULL)
        {
            ERR(REG_ESPACE);
            return NULL;
        }
        d->ssets = (struct sset *) MALLOC(nss * sizeof(struct sset));
        d->statesarea = (unsigned *) MALLOC((nss + WORK) * wordsper *
                                            sizeof(unsigned));
        d->work = &d->statesarea[nss * wordsper];
        d->outsarea = (struct sset **) MALLOC(nss * cnfa->ncolors *
                                              sizeof(struct sset *));
        d->incarea = (struct arcp *) MALLOC(nss * cnfa->ncolors *
                                            sizeof(struct arcp));
        d->cptsmalloced = 1;
        d->mallocarea = (char *) d;
        if (d->ssets == NULL || d->statesarea == NULL ||
            d->outsarea == NULL || d->incarea == NULL)
        {
            freedfa(d);
            ERR(REG_ESPACE);
            return NULL;
        }
    }

    d->nssets = (v->eflags & REG_SMALL) ? 7 : nss;
    d->nssused = 0;
    d->nstates = cnfa->nstates;
    d->ncolors = cnfa->ncolors;
    d->wordsper = wordsper;
    d->cnfa = cnfa;
    d->cm = cm;
    d->lastpost = NULL;
    d->lastnopr = NULL;
    d->search = d->ssets;

    /* initialization of sset fields is done as needed */

    return d;
}

/*
 * freedfa - free a DFA
 */
static void
freedfa(struct dfa *d)
{
    if (d->cptsmalloced)
    {
        if (d->ssets != NULL)
            FREE(d->ssets);
        if (d->statesarea != NULL)
            FREE(d->statesarea);
        if (d->outsarea != NULL)
            FREE(d->outsarea);
        if (d->incarea != NULL)
            FREE(d->incarea);
    }

    if (d->mallocarea != NULL)
        FREE(d->mallocarea);
}

/*
 * hash - construct a hash code for a bitvector
 *
 * There are probably better ways, but they're more expensive.
 */
static unsigned
hash(unsigned *uv,
     int n)
{
    int         i;
    unsigned    h;

    h = 0;
    for (i = 0; i < n; i++)
        h ^= uv[i];
    return h;
}

/*
 * initialize - hand-craft a cache entry for startup, otherwise get ready
 */
static struct sset *
initialize(struct vars *v,
           struct dfa *d,
           chr *start)
{
    struct sset *ss;
    int         i;

    /* is previous one still there? */
    if (d->nssused > 0 && (d->ssets[0].flags & STARTER))
        ss = &d->ssets[0];
    else
    {                           /* no, must (re)build it */
        ss = getvacant(v, d, start, start);
        if (ss == NULL)
            return NULL;
        for (i = 0; i < d->wordsper; i++)
            ss->states[i] = 0;
        BSET(ss->states, d->cnfa->pre);
        ss->hash = HASH(ss->states, d->wordsper);
        assert(d->cnfa->pre != d->cnfa->post);
        ss->flags = STARTER | LOCKED | NOPROGRESS;
        /* lastseen dealt with below */
    }

    for (i = 0; i < d->nssused; i++)
        d->ssets[i].lastseen = NULL;
    ss->lastseen = start;       /* maybe untrue, but harmless */
    d->lastpost = NULL;
    d->lastnopr = NULL;
    return ss;
}

/*
 * miss - handle a stateset cache miss
 *
 * css is the current stateset, co is the color of the current input character,
 * cp points to the character after that (which is where we may need to test
 * LACONs).  start does not affect matching behavior but is needed for pickss'
 * heuristics about which stateset cache entry to replace.
 *
 * Ordinarily, returns the address of the next stateset (the one that is
 * valid after consuming the input character).  Returns NULL if no valid
 * NFA states remain, ie we have a certain match failure.
 * Internal errors also return NULL, with v->err set.
 */
static struct sset *
|
|
|
|
|
miss(struct vars *v,
|
|
|
|
|
struct dfa *d,
|
|
|
|
|
struct sset *css,
|
|
|
|
|
color co,
|
|
|
|
|
chr *cp, /* next chr */
|
|
|
|
|
chr *start) /* where the attempt got started */
|
|
|
|
|
{
|
|
|
|
|
struct cnfa *cnfa = d->cnfa;
|
|
|
|
|
int i;
|
|
|
|
|
unsigned h;
|
|
|
|
|
struct carc *ca;
|
|
|
|
|
struct sset *p;
|
|
|
|
|
int ispseudocolor;
|
|
|
|
|
int ispost;
|
|
|
|
|
int noprogress;
|
|
|
|
|
int gotstate;
|
|
|
|
|
int dolacons;
|
|
|
|
|
int sawlacons;
|
|
|
|
|
|
|
|
|
|
/* for convenience, we can be called even if it might not be a miss */
|
|
|
|
|
if (css->outs[co] != NULL)
|
|
|
|
|
{
|
|
|
|
|
FDEBUG(("hit\n"));
|
|
|
|
|
return css->outs[co];
|
|
|
|
|
}
|
|
|
|
|
FDEBUG(("miss\n"));
|
|
|
|
|
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
10 years ago
|
|
|
/*
|
|
|
|
|
* Checking for operation cancel in the inner text search loop seems
|
|
|
|
|
* unduly expensive. As a compromise, check during cache misses.
|
|
|
|
|
*/
|
|
|
|
|
if (CANCEL_REQUESTED(v->re))
|
|
|
|
|
{
|
|
|
|
|
ERR(REG_CANCEL);
|
|
|
|
|
return NULL;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* What set of states would we end up in after consuming the co character?
|
|
|
|
|
* We first consider PLAIN arcs that consume the character, and then look
|
|
|
|
|
* to see what LACON arcs could be traversed after consuming it.
|
|
|
|
|
*/
|
|
|
|
|
for (i = 0; i < d->wordsper; i++)
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
10 years ago
|
|
|
d->work[i] = 0; /* build new stateset bitmap in d->work */
|
|
|
|
|
ispseudocolor = d->cm->cd[co].flags & PSEUDO;
|
|
|
|
|
ispost = 0;
|
|
|
|
|
noprogress = 1;
|
|
|
|
|
gotstate = 0;
|
|
|
|
|
for (i = 0; i < d->nstates; i++)
|
|
|
|
|
if (ISBSET(css->states, i))
|
|
|
|
|
for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
|
|
|
|
|
if (ca->co == co ||
|
|
|
|
|
(ca->co == RAINBOW && !ispseudocolor))
|
|
|
|
|
{
|
|
|
|
|
BSET(d->work, ca->to);
|
|
|
|
|
gotstate = 1;
|
|
|
|
|
if (ca->to == cnfa->post)
|
|
|
|
|
ispost = 1;
|
|
|
|
|
if (!(cnfa->stflags[ca->to] & CNFA_NOPROGRESS))
|
|
|
|
|
noprogress = 0;
|
|
|
|
|
FDEBUG(("%d -> %d\n", i, ca->to));
|
|
|
|
|
}
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
10 years ago
|
|
|
if (!gotstate)
|
|
|
|
|
return NULL; /* character cannot reach any new state */
|
|
|
|
|
dolacons = (cnfa->flags & HASLACONS);
|
|
|
|
|
sawlacons = 0;
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
10 years ago
|
|
|
/* outer loop handles transitive closure of reachable-by-LACON states */
|
|
|
|
|
while (dolacons)
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
10 years ago
|
|
|
{
|
|
|
|
|
dolacons = 0;
|
|
|
|
|
for (i = 0; i < d->nstates; i++)
|
|
|
|
|
if (ISBSET(d->work, i))
|
|
|
|
|
for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
|
|
|
|
|
{
|
|
|
|
|
if (ca->co < cnfa->ncolors)
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
|
|
|
continue; /* not a LACON arc */
|
|
|
|
|
if (ISBSET(d->work, ca->to))
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
|
|
|
continue; /* arc would be a no-op anyway */
|
|
|
|
|
sawlacons = 1; /* this LACON affects our result */
|
|
|
|
|
if (!lacon(v, cnfa, cp, ca->co))
|
Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
					{
						if (ISERR())
							return NULL;
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
						continue;	/* LACON arc cannot be traversed */
					}
					if (ISERR())
						return NULL;
					BSET(d->work, ca->to);
					dolacons = 1;
					if (ca->to == cnfa->post)
						ispost = 1;
					if (!(cnfa->stflags[ca->to] & CNFA_NOPROGRESS))
						noprogress = 0;
					FDEBUG(("%d :> %d\n", i, ca->to));
				}
	}

	h = HASH(d->work, d->wordsper);
	/* Is this stateset already in the cache? */
	for (p = d->ssets, i = d->nssused; i > 0; p++, i--)
		if (HIT(h, d->work, p, d->wordsper))
		{
			FDEBUG(("cached c%d\n", (int) (p - d->ssets)));
			break;				/* NOTE BREAK OUT */
		}
	if (i == 0)
	{							/* nope, need a new cache entry */
		p = getvacant(v, d, cp, start);
		if (p == NULL)
			return NULL;
		assert(p != css);
		for (i = 0; i < d->wordsper; i++)
			p->states[i] = d->work[i];
		p->hash = h;
		p->flags = (ispost) ? POSTSTATE : 0;
		if (noprogress)
			p->flags |= NOPROGRESS;
		/* lastseen to be dealt with by caller */
	}
	/*
	 * Link new stateset to old, unless a LACON affected the result, in which
	 * case we don't create the link.  That forces future transitions across
	 * this same arc (same prior stateset and character color) to come through
	 * miss() again, so that we can recheck the LACON(s), which might or might
	 * not pass since context will be different.
	 */
	if (!sawlacons)
	{
		FDEBUG(("c%d[%d]->c%d\n",
				(int) (css - d->ssets), co, (int) (p - d->ssets)));
		css->outs[co] = p;
		css->inchain[co] = p->ins;
		p->ins.ss = css;
		p->ins.co = co;
	}
	return p;
}
Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point. This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.
Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked. In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall. To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops. This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space. So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine. This avoids the overhead
of the LACON machinery entirely for these rather common cases.
The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely. So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares. (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)

/*
 * lacon - lookaround-constraint checker for miss()
 */
static int						/* predicate: constraint satisfied? */
lacon(struct vars *v,
	  struct cnfa *pcnfa,		/* parent cnfa */
	  chr *cp,
	  color co)					/* "color" of the lookaround constraint */
{
	int			n;
	struct subre *sub;
	struct dfa *d;
	chr		   *end;
	int			satisfied;

	/* Since this is recursive, it could be driven to stack overflow */
	if (STACK_TOO_DEEP(v->re))
	{
		ERR(REG_ETOOBIG);
		return 0;
	}

	n = co - pcnfa->ncolors;
	assert(n > 0 && n < v->g->nlacons && v->g->lacons != NULL);
	FDEBUG(("=== testing lacon %d\n", n));
	sub = &v->g->lacons[n];
	d = getladfa(v, n);
	if (d == NULL)
		return 0;
Avoid generating extra subre tree nodes for capturing parentheses.
Previously, each pair of capturing parentheses gave rise to a separate
subre tree node, whose only function was to identify that we ought to
capture the match details for this particular sub-expression. In
most cases we don't really need that, since we can perfectly well
put a "capture this" annotation on the child node that does the real
matching work. As with the two preceding commits, the main value
of this is to avoid generating and optimizing an NFA for a tree node
that's not really pulling its weight.
The chosen data representation only allows one capture annotation
per subre node. In the legal-per-spec, but seemingly not very useful,
case where there are multiple capturing parens around the exact same
bit of the regex (i.e. "((xyz))"), wrap the child node in N-1 capture
nodes that act the same as before. We could work harder at that but
I'll refrain, pending some evidence that such cases are worth troubling
over.
In passing, improve the comments in regex.h to say what all the
different re_info bits mean. Some of them were pretty obvious
but others not so much, so reverse-engineer some documentation.
This is part of a patch series that in total reduces the regex engine's
runtime by about a factor of four on a large corpus of real-world regexes.
Patch by me, reviewed by Joel Jacobson
Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
	if (LATYPE_IS_AHEAD(sub->latype))
	{
		/* used to use longest() here, but shortest() could be much cheaper */
		end = shortest(v, d, cp, cp, v->stop,
					   (chr **) NULL, (int *) NULL);
		satisfied = LATYPE_IS_POS(sub->latype) ? (end != NULL) : (end == NULL);
	}
	else
	{
		/*
		 * To avoid doing O(N^2) work when repeatedly testing a lookbehind
		 * constraint in an N-character string, we use matchuntil() which can
		 * cache the DFA state across calls.  We only need to restart if the
		 * probe point decreases, which is not common.  The NFA we're using is
		 * a search NFA, so it doesn't mind scanning over stuff before the
		 * nominal match.
		 */
		satisfied = matchuntil(v, d, cp, &v->lblastcss[n], &v->lblastcp[n]);
|
|
|
if (!LATYPE_IS_POS(sub->latype))
|
Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point. This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.
Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked. In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall. To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops. This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space. So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine. This avoids the overhead
of the LACON machinery entirely for these rather common cases.
The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely. So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares. (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
10 years ago
|
|
|
satisfied = !satisfied;
|
|
|
|
|
}
|
Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point. This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.
Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked. In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall. To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops. This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space. So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine. This avoids the overhead
of the LACON machinery entirely for these rather common cases.
The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely. So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares. (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
10 years ago
|
|
|
FDEBUG(("=== lacon %d satisfied %d\n", n, satisfied));
|
|
|
|
|
return satisfied;
|
|
|
|
|
}

/*
 * getvacant - get a vacant state set
 *
 * This routine clears out the inarcs and outarcs, but does not otherwise
 * clear the innards of the state set -- that's up to the caller.
 */
static struct sset *
getvacant(struct vars *v,
		  struct dfa *d,
		  chr *cp,
		  chr *start)
{
	int			i;
	struct sset *ss;
	struct sset *p;
	struct arcp ap;
	color		co;

	ss = pickss(v, d, cp, start);
	if (ss == NULL)
		return NULL;
	assert(!(ss->flags & LOCKED));

	/* clear out its inarcs, including self-referential ones */
	ap = ss->ins;
	while ((p = ap.ss) != NULL)
	{
		co = ap.co;
		FDEBUG(("zapping c%d's %ld outarc\n", (int) (p - d->ssets), (long) co));
		p->outs[co] = NULL;
		ap = p->inchain[co];
		p->inchain[co].ss = NULL;	/* paranoia */
	}
	ss->ins.ss = NULL;

	/* take it off the inarc chains of the ssets reached by its outarcs */
	for (i = 0; i < d->ncolors; i++)
	{
		p = ss->outs[i];
		assert(p != ss);		/* not self-referential */
		if (p == NULL)
			continue;			/* NOTE CONTINUE */
		FDEBUG(("del outarc %d from c%d's in chn\n", i, (int) (p - d->ssets)));
		if (p->ins.ss == ss && p->ins.co == i)
			p->ins = ss->inchain[i];
		else
		{
			struct arcp lastap = {NULL, 0};

			assert(p->ins.ss != NULL);
			for (ap = p->ins; ap.ss != NULL &&
				 !(ap.ss == ss && ap.co == i);
				 ap = ap.ss->inchain[ap.co])
				lastap = ap;
			assert(ap.ss != NULL);
			lastap.ss->inchain[lastap.co] = ss->inchain[i];
		}
		ss->outs[i] = NULL;
		ss->inchain[i].ss = NULL;
	}

	/* if ss was a success state, may need to remember location */
	if ((ss->flags & POSTSTATE) && ss->lastseen != d->lastpost &&
		(d->lastpost == NULL || d->lastpost < ss->lastseen))
		d->lastpost = ss->lastseen;

	/* likewise for a no-progress state */
	if ((ss->flags & NOPROGRESS) && ss->lastseen != d->lastnopr &&
		(d->lastnopr == NULL || d->lastnopr < ss->lastseen))
		d->lastnopr = ss->lastseen;

	return ss;
}

/*
 * pickss - pick the next stateset to be used
 */
static struct sset *
pickss(struct vars *v,
	   struct dfa *d,
	   chr *cp,
	   chr *start)
{
	int			i;
	struct sset *ss;
	struct sset *end;
	chr		   *ancient;

	/* shortcut for cases where cache isn't full */
	if (d->nssused < d->nssets)
	{
		i = d->nssused;
		d->nssused++;
		ss = &d->ssets[i];
		FDEBUG(("new c%d\n", i));
		/* set up innards */
		ss->states = &d->statesarea[i * d->wordsper];
		ss->flags = 0;
		ss->ins.ss = NULL;
		ss->ins.co = WHITE;		/* give it some value */
		ss->outs = &d->outsarea[i * d->ncolors];
		ss->inchain = &d->incarea[i * d->ncolors];
		for (i = 0; i < d->ncolors; i++)
		{
			ss->outs[i] = NULL;
			ss->inchain[i].ss = NULL;
		}
		return ss;
	}

	/* look for oldest, or old enough anyway */
	if (cp - start > d->nssets * 2 / 3) /* oldest 33% are expendable */
		ancient = cp - d->nssets * 2 / 3;
	else
		ancient = start;
	for (ss = d->search, end = &d->ssets[d->nssets]; ss < end; ss++)
		if ((ss->lastseen == NULL || ss->lastseen < ancient) &&
			!(ss->flags & LOCKED))
		{
			d->search = ss + 1;
			FDEBUG(("replacing c%d\n", (int) (ss - d->ssets)));
			return ss;
		}
	for (ss = d->ssets, end = d->search; ss < end; ss++)
		if ((ss->lastseen == NULL || ss->lastseen < ancient) &&
			!(ss->flags & LOCKED))
		{
			d->search = ss + 1;
			FDEBUG(("replacing c%d\n", (int) (ss - d->ssets)));
			return ss;
		}

	/* nobody's old enough?!? -- something's really wrong */
	FDEBUG(("cannot find victim to replace!\n"));
	ERR(REG_ASSERT);
	return NULL;
}