You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
postgres/src/backend/regex/rege_dfa.c

990 lines
24 KiB

/*
* DFA routines
* This file is #included by regexec.c.
*
* Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
23 years ago
*
* Development of this software was funded, in part, by Cray Research Inc.,
* UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
* Corporation, none of whom are responsible for the results. The author
23 years ago
* thanks all of them.
*
* Redistribution and use in source and binary forms -- with or without
* modification -- are permitted for any purpose, provided that
* redistributions in source form retain this entire copyright notice and
* indicate the origin and nature of any modifications.
23 years ago
*
* I'd appreciate being given credit for this package in the documentation
* of software which uses it, but that is not a requirement.
23 years ago
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
* HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
* src/backend/regex/rege_dfa.c
*
*/
/*
* longest - longest-preferred matching engine
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
*
* On success, returns match endpoint address. Returns NULL on no match.
* Internal errors also return NULL, with v->err set.
*/
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
static chr *
9 years ago
longest(struct vars *v,
struct dfa *d,
chr *start, /* where the match should start */
chr *stop, /* match must end at or before here */
int *hitstopp) /* record whether hit v->stop, if non-NULL */
{
23 years ago
chr *cp;
chr *realstop = (stop == v->stop) ? stop : stop + 1;
color co;
struct sset *css;
struct sset *ss;
23 years ago
chr *post;
int i;
struct colormap *cm = d->cm;
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/* prevent "uninitialized variable" warnings */
if (hitstopp != NULL)
*hitstopp = 0;
/* fast path for matchall NFAs */
if (d->cnfa->flags & MATCHALL)
{
size_t nchr = stop - start;
size_t maxmatchall = d->cnfa->maxmatchall;
if (nchr < d->cnfa->minmatchall)
return NULL;
if (maxmatchall == DUPINF)
{
if (stop == v->stop && hitstopp != NULL)
*hitstopp = 1;
}
else
{
if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
*hitstopp = 1;
if (nchr > maxmatchall)
return start + maxmatchall;
}
return stop;
}
/* initialize */
css = initialize(v, d, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (css == NULL)
return NULL;
cp = start;
/* startup */
FDEBUG(("+++ startup +++\n"));
23 years ago
if (cp == v->start)
{
co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
}
else
{
co = GETCOLOR(cm, *(cp - 1));
23 years ago
FDEBUG(("char %c, color %ld\n", (char) *(cp - 1), (long) co));
}
css = miss(v, d, css, co, cp, start);
if (css == NULL)
return NULL;
css->lastseen = cp;
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/*
* This is the main text-scanning loop. It seems worth having two copies
* to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
* builds, when you're not actively tracing.
*/
#ifdef REG_DEBUG
23 years ago
if (v->eflags & REG_FTRACE)
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
{
23 years ago
while (cp < realstop)
{
FDEBUG(("+++ at c%d +++\n", (int) (css - d->ssets)));
co = GETCOLOR(cm, *cp);
23 years ago
FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
ss = css->outs[co];
23 years ago
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, start);
if (ss == NULL)
23 years ago
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
}
else
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
#endif
{
23 years ago
while (cp < realstop)
{
co = GETCOLOR(cm, *cp);
ss = css->outs[co];
23 years ago
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, start);
if (ss == NULL)
23 years ago
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
}
if (ISERR())
return NULL;
/* shutdown */
FDEBUG(("+++ shutdown at c%d +++\n", (int) (css - d->ssets)));
23 years ago
if (cp == v->stop && stop == v->stop)
{
if (hitstopp != NULL)
*hitstopp = 1;
23 years ago
co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
ss = miss(v, d, css, co, cp, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (ISERR())
return NULL;
/* special case: match ended at eol? */
23 years ago
if (ss != NULL && (ss->flags & POSTSTATE))
return cp;
else if (ss != NULL)
ss->lastseen = cp; /* to be tidy */
}
/* find last match, if any */
post = d->lastpost;
for (ss = d->ssets, i = d->nssused; i > 0; ss++, i--)
23 years ago
if ((ss->flags & POSTSTATE) && post != ss->lastseen &&
(post == NULL || post < ss->lastseen))
post = ss->lastseen;
23 years ago
if (post != NULL) /* found one */
return post - 1;
return NULL;
}
/*
* shortest - shortest-preferred matching engine
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
*
* On success, returns match endpoint address. Returns NULL on no match.
* Internal errors also return NULL, with v->err set.
*/
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
static chr *
9 years ago
shortest(struct vars *v,
struct dfa *d,
chr *start, /* where the match should start */
chr *min, /* match must end at or after here */
chr *max, /* match must end at or before here */
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
chr **coldp, /* store coldstart pointer here, if non-NULL */
int *hitstopp) /* record whether hit v->stop, if non-NULL */
{
23 years ago
chr *cp;
chr *realmin = (min == v->stop) ? min : min + 1;
chr *realmax = (max == v->stop) ? max : max + 1;
color co;
struct sset *css;
struct sset *ss;
struct colormap *cm = d->cm;
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/* prevent "uninitialized variable" warnings */
if (coldp != NULL)
*coldp = NULL;
if (hitstopp != NULL)
*hitstopp = 0;
/* fast path for matchall NFAs */
if (d->cnfa->flags & MATCHALL)
{
size_t nchr = min - start;
if (d->cnfa->maxmatchall != DUPINF &&
nchr > d->cnfa->maxmatchall)
return NULL;
if ((max - start) < d->cnfa->minmatchall)
return NULL;
if (nchr < d->cnfa->minmatchall)
min = start + d->cnfa->minmatchall;
if (coldp != NULL)
*coldp = start;
/* there is no case where we should set *hitstopp */
return min;
}
/* initialize */
css = initialize(v, d, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (css == NULL)
return NULL;
cp = start;
/* startup */
FDEBUG(("--- startup ---\n"));
23 years ago
if (cp == v->start)
{
co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
}
else
{
co = GETCOLOR(cm, *(cp - 1));
23 years ago
FDEBUG(("char %c, color %ld\n", (char) *(cp - 1), (long) co));
}
css = miss(v, d, css, co, cp, start);
if (css == NULL)
return NULL;
css->lastseen = cp;
ss = css;
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/*
* This is the main text-scanning loop. It seems worth having two copies
* to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
* builds, when you're not actively tracing.
*/
#ifdef REG_DEBUG
23 years ago
if (v->eflags & REG_FTRACE)
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
{
23 years ago
while (cp < realmax)
{
FDEBUG(("--- at c%d ---\n", (int) (css - d->ssets)));
co = GETCOLOR(cm, *cp);
23 years ago
FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
ss = css->outs[co];
23 years ago
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, start);
if (ss == NULL)
23 years ago
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
23 years ago
if ((ss->flags & POSTSTATE) && cp >= realmin)
break; /* NOTE BREAK OUT */
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
}
else
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
#endif
{
23 years ago
while (cp < realmax)
{
co = GETCOLOR(cm, *cp);
ss = css->outs[co];
23 years ago
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, start);
if (ss == NULL)
23 years ago
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
23 years ago
if ((ss->flags & POSTSTATE) && cp >= realmin)
break; /* NOTE BREAK OUT */
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
}
if (ss == NULL)
return NULL;
if (coldp != NULL) /* report last no-progress state set, if any */
*coldp = lastcold(v, d);
23 years ago
if ((ss->flags & POSTSTATE) && cp > min)
{
assert(cp >= realmin);
cp--;
23 years ago
}
else if (cp == v->stop && max == v->stop)
{
co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
ss = miss(v, d, css, co, cp, start);
/* match might have ended at eol */
23 years ago
if ((ss == NULL || !(ss->flags & POSTSTATE)) && hitstopp != NULL)
*hitstopp = 1;
}
23 years ago
if (ss == NULL || !(ss->flags & POSTSTATE))
return NULL;
return cp;
}
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
/*
* matchuntil - incremental matching engine
*
* This is meant for use with a search-style NFA (that is, the pattern is
* known to act as though it had a leading .*). We determine whether a
* match exists starting at v->start and ending at probe. Multiple calls
* require only O(N) time not O(N^2) so long as the probe values are
* nondecreasing. *lastcss and *lastcp must be initialized to NULL before
* starting a series of calls.
*
* Returns 1 if a match exists, 0 if not.
* Internal errors also return 0, with v->err set.
*/
static int
9 years ago
matchuntil(struct vars *v,
struct dfa *d,
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
chr *probe, /* we want to know if a match ends here */
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
struct sset **lastcss, /* state storage across calls */
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
chr **lastcp) /* state storage across calls */
{
chr *cp = *lastcp;
color co;
struct sset *css = *lastcss;
struct sset *ss;
struct colormap *cm = d->cm;
/* fast path for matchall NFAs */
if (d->cnfa->flags & MATCHALL)
{
size_t nchr = probe - v->start;
/*
* It might seem that we should check maxmatchall too, but the .* at
* the front of the pattern absorbs any extra characters (and it was
* tacked on *after* computing minmatchall/maxmatchall). Thus, we
* should match if there are at least minmatchall characters.
*/
if (nchr < d->cnfa->minmatchall)
return 0;
return 1;
}
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
/* initialize and startup, or restart, if necessary */
if (cp == NULL || cp > probe)
{
cp = v->start;
css = initialize(v, d, cp);
if (css == NULL)
return 0;
FDEBUG((">>> startup >>>\n"));
co = d->cnfa->bos[(v->eflags & REG_NOTBOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
css = miss(v, d, css, co, cp, v->start);
if (css == NULL)
return 0;
css->lastseen = cp;
}
else if (css == NULL)
{
/* we previously found that no match is possible beyond *lastcp */
return 0;
}
ss = css;
/*
* This is the main text-scanning loop. It seems worth having two copies
* to avoid the overhead of REG_FTRACE tests here, even in REG_DEBUG
* builds, when you're not actively tracing.
*/
#ifdef REG_DEBUG
if (v->eflags & REG_FTRACE)
{
while (cp < probe)
{
FDEBUG((">>> at c%d >>>\n", (int) (css - d->ssets)));
co = GETCOLOR(cm, *cp);
FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
ss = css->outs[co];
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, v->start);
if (ss == NULL)
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
}
}
else
#endif
{
while (cp < probe)
{
co = GETCOLOR(cm, *cp);
ss = css->outs[co];
if (ss == NULL)
{
ss = miss(v, d, css, co, cp + 1, v->start);
if (ss == NULL)
break; /* NOTE BREAK OUT */
}
cp++;
ss->lastseen = cp;
css = ss;
}
}
*lastcss = ss;
*lastcp = cp;
if (ss == NULL)
return 0; /* impossible match, or internal error */
/* We need to process one more chr, or the EOS symbol, to check match */
if (cp < v->stop)
{
FDEBUG((">>> at c%d >>>\n", (int) (css - d->ssets)));
co = GETCOLOR(cm, *cp);
FDEBUG(("char %c, color %ld\n", (char) *cp, (long) co));
ss = css->outs[co];
if (ss == NULL)
ss = miss(v, d, css, co, cp + 1, v->start);
}
else
{
assert(cp == v->stop);
co = d->cnfa->eos[(v->eflags & REG_NOTEOL) ? 0 : 1];
FDEBUG(("color %ld\n", (long) co));
ss = miss(v, d, css, co, cp, v->start);
}
if (ss == NULL || !(ss->flags & POSTSTATE))
return 0;
return 1;
}
/*
* lastcold - determine last point at which no progress had been made
*/
23 years ago
static chr * /* endpoint, or NULL */
9 years ago
lastcold(struct vars *v,
struct dfa *d)
{
struct sset *ss;
23 years ago
chr *nopr;
int i;
nopr = d->lastnopr;
if (nopr == NULL)
nopr = v->start;
for (ss = d->ssets, i = d->nssused; i > 0; ss++, i--)
23 years ago
if ((ss->flags & NOPROGRESS) && nopr < ss->lastseen)
nopr = ss->lastseen;
return nopr;
}
/*
* newdfa - set up a fresh DFA
*/
static struct dfa *
9 years ago
newdfa(struct vars *v,
struct cnfa *cnfa,
struct colormap *cm,
struct smalldfa *sml) /* preallocated space, may be NULL */
{
struct dfa *d;
23 years ago
size_t nss = cnfa->nstates * 2;
int wordsper = (cnfa->nstates + UBITS - 1) / UBITS;
struct smalldfa *smallwas = sml;
assert(cnfa != NULL && cnfa->nstates != 0);
23 years ago
if (nss <= FEWSTATES && cnfa->ncolors <= FEWCOLORS)
{
assert(wordsper == 1);
if (sml == NULL)
23 years ago
{
sml = (struct smalldfa *) MALLOC(sizeof(struct smalldfa));
if (sml == NULL)
23 years ago
{
ERR(REG_ESPACE);
return NULL;
}
}
d = &sml->dfa;
d->ssets = sml->ssets;
d->statesarea = sml->statesarea;
d->work = &d->statesarea[nss];
d->outsarea = sml->outsarea;
d->incarea = sml->incarea;
d->cptsmalloced = 0;
d->mallocarea = (smallwas == NULL) ? (char *) sml : NULL;
23 years ago
}
else
{
d = (struct dfa *) MALLOC(sizeof(struct dfa));
if (d == NULL)
{
ERR(REG_ESPACE);
return NULL;
}
23 years ago
d->ssets = (struct sset *) MALLOC(nss * sizeof(struct sset));
d->statesarea = (unsigned *) MALLOC((nss + WORK) * wordsper *
sizeof(unsigned));
d->work = &d->statesarea[nss * wordsper];
23 years ago
d->outsarea = (struct sset **) MALLOC(nss * cnfa->ncolors *
sizeof(struct sset *));
d->incarea = (struct arcp *) MALLOC(nss * cnfa->ncolors *
sizeof(struct arcp));
d->cptsmalloced = 1;
23 years ago
d->mallocarea = (char *) d;
if (d->ssets == NULL || d->statesarea == NULL ||
23 years ago
d->outsarea == NULL || d->incarea == NULL)
{
freedfa(d);
ERR(REG_ESPACE);
return NULL;
}
}
23 years ago
d->nssets = (v->eflags & REG_SMALL) ? 7 : nss;
d->nssused = 0;
d->nstates = cnfa->nstates;
d->ncolors = cnfa->ncolors;
d->wordsper = wordsper;
d->cnfa = cnfa;
d->cm = cm;
d->lastpost = NULL;
d->lastnopr = NULL;
d->search = d->ssets;
/* initialization of sset fields is done as needed */
return d;
}
/*
* freedfa - free a DFA
*/
static void
9 years ago
freedfa(struct dfa *d)
{
23 years ago
if (d->cptsmalloced)
{
if (d->ssets != NULL)
FREE(d->ssets);
if (d->statesarea != NULL)
FREE(d->statesarea);
if (d->outsarea != NULL)
FREE(d->outsarea);
if (d->incarea != NULL)
FREE(d->incarea);
}
if (d->mallocarea != NULL)
FREE(d->mallocarea);
}
/*
* hash - construct a hash code for a bitvector
*
* There are probably better ways, but they're more expensive.
*/
static unsigned
hash(unsigned *uv,
int n)
{
23 years ago
int i;
unsigned h;
h = 0;
for (i = 0; i < n; i++)
h ^= uv[i];
return h;
}
/*
* initialize - hand-craft a cache entry for startup, otherwise get ready
*/
static struct sset *
9 years ago
initialize(struct vars *v,
struct dfa *d,
chr *start)
{
struct sset *ss;
23 years ago
int i;
/* is previous one still there? */
23 years ago
if (d->nssused > 0 && (d->ssets[0].flags & STARTER))
ss = &d->ssets[0];
23 years ago
else
{ /* no, must (re)build it */
ss = getvacant(v, d, start, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (ss == NULL)
return NULL;
for (i = 0; i < d->wordsper; i++)
ss->states[i] = 0;
BSET(ss->states, d->cnfa->pre);
ss->hash = HASH(ss->states, d->wordsper);
assert(d->cnfa->pre != d->cnfa->post);
23 years ago
ss->flags = STARTER | LOCKED | NOPROGRESS;
/* lastseen dealt with below */
}
for (i = 0; i < d->nssused; i++)
d->ssets[i].lastseen = NULL;
ss->lastseen = start; /* maybe untrue, but harmless */
d->lastpost = NULL;
d->lastnopr = NULL;
return ss;
}
/*
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
* miss - handle a stateset cache miss
*
* css is the current stateset, co is the color of the current input character,
* cp points to the character after that (which is where we may need to test
* LACONs). start does not affect matching behavior but is needed for pickss'
* heuristics about which stateset cache entry to replace.
*
* Ordinarily, returns the address of the next stateset (the one that is
* valid after consuming the input character). Returns NULL if no valid
* NFA states remain, ie we have a certain match failure.
* Internal errors also return NULL, with v->err set.
*/
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
static struct sset *
9 years ago
miss(struct vars *v,
struct dfa *d,
struct sset *css,
color co,
chr *cp, /* next chr */
chr *start) /* where the attempt got started */
{
struct cnfa *cnfa = d->cnfa;
23 years ago
int i;
unsigned h;
struct carc *ca;
struct sset *p;
int ispseudocolor;
23 years ago
int ispost;
int noprogress;
int gotstate;
int dolacons;
int sawlacons;
/* for convenience, we can be called even if it might not be a miss */
23 years ago
if (css->outs[co] != NULL)
{
FDEBUG(("hit\n"));
return css->outs[co];
}
FDEBUG(("miss\n"));
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/*
* Checking for operation cancel in the inner text search loop seems
* unduly expensive. As a compromise, check during cache misses.
*/
if (CANCEL_REQUESTED(v->re))
{
ERR(REG_CANCEL);
return NULL;
}
/*
* What set of states would we end up in after consuming the co character?
* We first consider PLAIN arcs that consume the character, and then look
* to see what LACON arcs could be traversed after consuming it.
*/
for (i = 0; i < d->wordsper; i++)
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
d->work[i] = 0; /* build new stateset bitmap in d->work */
ispseudocolor = d->cm->cd[co].flags & PSEUDO;
ispost = 0;
noprogress = 1;
gotstate = 0;
for (i = 0; i < d->nstates; i++)
if (ISBSET(css->states, i))
for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
if (ca->co == co ||
(ca->co == RAINBOW && !ispseudocolor))
23 years ago
{
BSET(d->work, ca->to);
gotstate = 1;
if (ca->to == cnfa->post)
ispost = 1;
if (!(cnfa->stflags[ca->to] & CNFA_NOPROGRESS))
noprogress = 0;
FDEBUG(("%d -> %d\n", i, ca->to));
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (!gotstate)
return NULL; /* character cannot reach any new state */
dolacons = (cnfa->flags & HASLACONS);
sawlacons = 0;
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/* outer loop handles transitive closure of reachable-by-LACON states */
23 years ago
while (dolacons)
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
{
dolacons = 0;
for (i = 0; i < d->nstates; i++)
if (ISBSET(d->work, i))
for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
23 years ago
{
if (ca->co < cnfa->ncolors)
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
continue; /* not a LACON arc */
if (ISBSET(d->work, ca->to))
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
continue; /* arc would be a no-op anyway */
sawlacons = 1; /* this LACON affects our result */
if (!lacon(v, cnfa, cp, ca->co))
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
{
if (ISERR())
return NULL;
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
continue; /* LACON arc cannot be traversed */
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
}
if (ISERR())
return NULL;
BSET(d->work, ca->to);
dolacons = 1;
if (ca->to == cnfa->post)
ispost = 1;
if (!(cnfa->stflags[ca->to] & CNFA_NOPROGRESS))
noprogress = 0;
FDEBUG(("%d :> %d\n", i, ca->to));
}
}
h = HASH(d->work, d->wordsper);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/* Is this stateset already in the cache? */
for (p = d->ssets, i = d->nssused; i > 0; p++, i--)
23 years ago
if (HIT(h, d->work, p, d->wordsper))
{
FDEBUG(("cached c%d\n", (int) (p - d->ssets)));
23 years ago
break; /* NOTE BREAK OUT */
}
23 years ago
if (i == 0)
{ /* nope, need a new cache entry */
p = getvacant(v, d, cp, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (p == NULL)
return NULL;
assert(p != css);
for (i = 0; i < d->wordsper; i++)
p->states[i] = d->work[i];
p->hash = h;
p->flags = (ispost) ? POSTSTATE : 0;
if (noprogress)
p->flags |= NOPROGRESS;
/* lastseen to be dealt with by caller */
}
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
/*
* Link new stateset to old, unless a LACON affected the result, in which
* case we don't create the link. That forces future transitions across
* this same arc (same prior stateset and character color) to come through
* miss() again, so that we can recheck the LACON(s), which might or might
* not pass since context will be different.
*/
23 years ago
if (!sawlacons)
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
{
FDEBUG(("c%d[%d]->c%d\n",
(int) (css - d->ssets), co, (int) (p - d->ssets)));
css->outs[co] = p;
css->inchain[co] = p->ins;
p->ins.ss = css;
p->ins.co = co;
}
return p;
}
/*
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
* lacon - lookaround-constraint checker for miss()
*/
23 years ago
static int /* predicate: constraint satisfied? */
9 years ago
lacon(struct vars *v,
struct cnfa *pcnfa, /* parent cnfa */
chr *cp,
color co) /* "color" of the lookaround constraint */
{
23 years ago
int n;
struct subre *sub;
struct dfa *d;
23 years ago
chr *end;
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
int satisfied;
/* Since this is recursive, it could be driven to stack overflow */
if (STACK_TOO_DEEP(v->re))
{
ERR(REG_ETOOBIG);
return 0;
}
n = co - pcnfa->ncolors;
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
assert(n > 0 && n < v->g->nlacons && v->g->lacons != NULL);
FDEBUG(("=== testing lacon %d\n", n));
sub = &v->g->lacons[n];
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
d = getladfa(v, n);
23 years ago
if (d == NULL)
return 0;
if (LATYPE_IS_AHEAD(sub->latype))
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
{
/* used to use longest() here, but shortest() could be much cheaper */
end = shortest(v, d, cp, cp, v->stop,
(chr **) NULL, (int *) NULL);
satisfied = LATYPE_IS_POS(sub->latype) ? (end != NULL) : (end == NULL);
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
}
else
{
/*
* To avoid doing O(N^2) work when repeatedly testing a lookbehind
* constraint in an N-character string, we use matchuntil() which can
* cache the DFA state across calls. We only need to restart if the
* probe point decreases, which is not common. The NFA we're using is
* a search NFA, so it doesn't mind scanning over stuff before the
* nominal match.
*/
satisfied = matchuntil(v, d, cp, &v->lblastcss[n], &v->lblastcp[n]);
if (!LATYPE_IS_POS(sub->latype))
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
satisfied = !satisfied;
}
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
10 years ago
FDEBUG(("=== lacon %d satisfied %d\n", n, satisfied));
return satisfied;
}
/*
* getvacant - get a vacant state set
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
*
* This routine clears out the inarcs and outarcs, but does not otherwise
* clear the innards of the state set -- that's up to the caller.
*/
static struct sset *
9 years ago
getvacant(struct vars *v,
struct dfa *d,
chr *cp,
chr *start)
{
23 years ago
int i;
struct sset *ss;
struct sset *p;
struct arcp ap;
23 years ago
color co;
ss = pickss(v, d, cp, start);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
if (ss == NULL)
return NULL;
23 years ago
assert(!(ss->flags & LOCKED));
/* clear out its inarcs, including self-referential ones */
ap = ss->ins;
23 years ago
while ((p = ap.ss) != NULL)
{
co = ap.co;
FDEBUG(("zapping c%d's %ld outarc\n", (int) (p - d->ssets), (long) co));
p->outs[co] = NULL;
ap = p->inchain[co];
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
9 years ago
p->inchain[co].ss = NULL; /* paranoia */
}
ss->ins.ss = NULL;
/* take it off the inarc chains of the ssets reached by its outarcs */
23 years ago
for (i = 0; i < d->ncolors; i++)
{
p = ss->outs[i];
assert(p != ss); /* not self-referential */
if (p == NULL)
23 years ago
continue; /* NOTE CONTINUE */
FDEBUG(("del outarc %d from c%d's in chn\n", i, (int) (p - d->ssets)));
if (p->ins.ss == ss && p->ins.co == i)
p->ins = ss->inchain[i];
23 years ago
else
{
struct arcp lastap = {NULL, 0};
assert(p->ins.ss != NULL);
for (ap = p->ins; ap.ss != NULL &&
23 years ago
!(ap.ss == ss && ap.co == i);
ap = ap.ss->inchain[ap.co])
lastap = ap;
assert(ap.ss != NULL);
lastap.ss->inchain[lastap.co] = ss->inchain[i];
}
ss->outs[i] = NULL;
ss->inchain[i].ss = NULL;
}
/* if ss was a success state, may need to remember location */
23 years ago
if ((ss->flags & POSTSTATE) && ss->lastseen != d->lastpost &&
(d->lastpost == NULL || d->lastpost < ss->lastseen))
d->lastpost = ss->lastseen;
/* likewise for a no-progress state */
23 years ago
if ((ss->flags & NOPROGRESS) && ss->lastseen != d->lastnopr &&
(d->lastnopr == NULL || d->lastnopr < ss->lastseen))
d->lastnopr = ss->lastseen;
return ss;
}
/*
* pickss - pick the next stateset to be used
*/
static struct sset *
9 years ago
pickss(struct vars *v,
struct dfa *d,
chr *cp,
chr *start)
{
23 years ago
int i;
struct sset *ss;
struct sset *end;
23 years ago
chr *ancient;
/* shortcut for cases where cache isn't full */
23 years ago
if (d->nssused < d->nssets)
{
i = d->nssused;
d->nssused++;
ss = &d->ssets[i];
FDEBUG(("new c%d\n", i));
/* set up innards */
ss->states = &d->statesarea[i * d->wordsper];
ss->flags = 0;
ss->ins.ss = NULL;
ss->ins.co = WHITE; /* give it some value */
ss->outs = &d->outsarea[i * d->ncolors];
ss->inchain = &d->incarea[i * d->ncolors];
23 years ago
for (i = 0; i < d->ncolors; i++)
{
ss->outs[i] = NULL;
ss->inchain[i].ss = NULL;
}
return ss;
}
/* look for oldest, or old enough anyway */
23 years ago
if (cp - start > d->nssets * 2 / 3) /* oldest 33% are expendable */
ancient = cp - d->nssets * 2 / 3;
else
ancient = start;
for (ss = d->search, end = &d->ssets[d->nssets]; ss < end; ss++)
if ((ss->lastseen == NULL || ss->lastseen < ancient) &&
23 years ago
!(ss->flags & LOCKED))
{
d->search = ss + 1;
FDEBUG(("replacing c%d\n", (int) (ss - d->ssets)));
return ss;
}
for (ss = d->ssets, end = d->search; ss < end; ss++)
if ((ss->lastseen == NULL || ss->lastseen < ancient) &&
23 years ago
!(ss->flags & LOCKED))
{
d->search = ss + 1;
FDEBUG(("replacing c%d\n", (int) (ss - d->ssets)));
return ss;
}
/* nobody's old enough?!? -- something's really wrong */
FDEBUG(("cannot find victim to replace!\n"));
ERR(REG_ASSERT);
Add some more query-cancel checks to regular expression matching. Commit 9662143f0c35d64d7042fbeaf879df8f0b54be32 added infrastructure to allow regular-expression operations to be terminated early in the event of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there are still cases where regex compilation could run for a long time without noticing a cancel request. Specifically, the fixempties() phase never adds new states, only new arcs, so it doesn't hit the cancel check I'd put in newstate(). Add one to newarc() as well to cover that. Some experimentation of my own found that regex execution could also run for a long time despite a pending cancel. We'd put a high-level cancel check into cdissect(), but there was none inside the core text-matching routines longest() and shortest(). Ordinarily those inner loops are very very fast ... but in the presence of lookahead constraints, not so much. As a compromise, stick a cancel check into the stateset cache-miss function, which is enough to guarantee a cancel check at least once per lookahead constraint test. Making this work required more attention to error handling throughout the regex executor. Henry Spencer had apparently originally intended longest() and shortest() to be incapable of incurring errors while running, so neither they nor their subroutines had well-defined error reporting behaviors. However, that was already broken by the lookahead constraint feature, since lacon() can surely suffer an out-of-memory failure --- which, in the code as it stood, might never be reported to the user at all, but just silently be treated as a non-match of the lookahead constraint. Normalize all that by inserting explicit error tests as needed. I took the opportunity to add some more comments to the code, too. Back-patch to all supported branches, like the previous patch.
10 years ago
return NULL;
}