public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* [Bug runtime/11334] regular expression string matching
       [not found] <bug-11334-6586@http.sourceware.org/bugzilla/>
@ 2012-10-10 16:22 ` smakarov at redhat dot com
  2012-10-11 19:58 ` smakarov at redhat dot com
  2012-10-17 19:26 ` fche at redhat dot com
  2 siblings, 0 replies; 3+ messages in thread
From: smakarov at redhat dot com @ 2012-10-10 16:22 UTC (permalink / raw)
  To: systemtap


http://sourceware.org/bugzilla/show_bug.cgi?id=11334

Serguei Makarov <smakarov at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smakarov at redhat dot com

--- Comment #1 from Serguei Makarov <smakarov at redhat dot com> 2012-10-10 16:21:50 UTC ---
Evaluating use of re2c as the basis for our regex translator, according to a
mailing list recommendation. (see http://re2c.org) Some rough thoughts as to
its suitability follow:

Difficulty integrating into systemtap:

There are some fiddly points here. First of all, re2c has a flex-like interface
where it simply preprocesses a C file searching for re2c invocations. Each
invocation consists of a set of alternate regular expressions with associated
actions at the end. The actions don't access numbered subexpressions; to get
the actual matched content we need to do a bunch of hackery where we save the
cursor position at the start of the match, and use that plus the position at
the end to get the entire matched string. The lessons/ and examples/ folders of
the re2c distribution give a good idea of how this works in practice.

Thus we need to hack the engine to instead generate code that can support a
scheme where we run a match EXPR =~ REGEXP and perhaps then extract numbered
subexpressions as substrings from the original string. This is possible to code
into re2c if we figure out how we want to keep track of subexpression indices
during matching, and define a new action type for the DFA engine, to be
associated with ( ) brackets, whose generated code saves the indices at the
appropriate point.

Robustness as kernel code:

Fairly optimistic about this part. As it stands now, re2c simply emits the DFA
in the form of a series of case statements. (This relies on the magic of gcc to
optimize the resulting spaghetti down into a compact form, which probably
stands up well next to any hand-coded optimizations we might feel inclined to
implement on a DFA table for a project of this scope.)

Actual manipulation / reading of the string is done by calling out to
user-provided macros:

YYCTYPE -- character type, e.g. char or unsigned char or whatnot
YYCURSOR -- gives pointer to current character in buffer 
YYLIMIT -- gives the position of the end of the buffer
YYMARKER -- stores (limited) backtracking info for the special case where the
regex has an optional postfix that may or may not be matched.
YYFILL() -- used when re2c runs out of buffer data, to request the buffer to be
refilled/updated. If we are matching the string in-place, this is a no-op, and
(unlike other re2c use cases) we do not have to worry about YYMARKER remaining
consistent with the state of the buffer.

The only thing re2c assumes is that there is a buffer of consecutive characters
used for the matching. Hence, by carefully coding the hooks above, the
resulting "runtime" should be robust enough for in-kernel use.

{unresolved questions -- while robust, can the matching go on for too long?
given that strings are limited by MAXSTRINGLEN, for instance?}

Conformity to a standard such as POSIX ERE:

Almost, but not quite there. As far as I can tell, re2c does not support named
character classes, and does something weird and non-standard with single and
double quotes. (Double quotes quote stuff, but single quotes quote stuff and
make it case insensitive?)

This is not at all difficult to remedy as long as the rest of re2c is compliant
(though we'd need someone else's testsuite to be sure of that -- see "testing
testing testing" below).

License/maintenance issues:

The project is public domain, so I don't think there will be a problem
incorporating it into SystemTap. I am more uncertain about how to proceed with
contributing or separately maintaining the inevitable changes and extensions
needed for SystemTap use.

The project is fairly stagnant but not completely abandoned. (The home page
mentions that it had been abandoned once and re-adopted.) While there are very
occasional mailing list inquiries, the last commit to the repositories was in
2011. I notice that several people using the project seem to have made
disconnected GitHub forks to hack on it rather than going to the maintainers,
which is also not an encouraging sign.

Difficulty of rewriting almost-completely-from-scratch if we decide that doing
so would better fit our needs:

Not too high, as far as I can tell. sloccount gives ~10,000 lines for the whole
project, a decided plurality of which is either fluff we don't need (e.g. their
C preprocessor), or we must replace with our own stuff anyways (e.g. the
libre2c component is completely irrelevant to our needs), or stuff that in the
event of a rewrite would be redundant infrastructure that SystemTap already
provides.

Either way, testing testing testing:

Probably the most valuable component of any regex engine to us, at least the
one I'd have most difficulty recreating on my own, would be the testsuite :)
Note that re2c's testsuite is not ideal, since its method of invocation
diverges from the EXPR=~REGEXP mechanism, and it doesn't implement POSIX ERE.

We'd probably need to look into cribbing a testsuite from GNU for full
satisfaction and coding re2c (or our own roughly-equivalent engine) to
successfully test against it.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug runtime/11334] regular expression string matching
       [not found] <bug-11334-6586@http.sourceware.org/bugzilla/>
  2012-10-10 16:22 ` [Bug runtime/11334] regular expression string matching smakarov at redhat dot com
@ 2012-10-11 19:58 ` smakarov at redhat dot com
  2012-10-17 19:26 ` fche at redhat dot com
  2 siblings, 0 replies; 3+ messages in thread
From: smakarov at redhat dot com @ 2012-10-11 19:58 UTC (permalink / raw)
  To: systemtap


http://sourceware.org/bugzilla/show_bug.cgi?id=11334

--- Comment #2 from Serguei Makarov <smakarov at redhat dot com> 2012-10-11 19:58:37 UTC ---
Okay, irc discussions seem to favour an "adapt and assimilate" approach for the
existing re2c code. That leaves the SystemTap side of things to figure out.

Here's a rough sketch of the proposed in-language regex interface.

str =~ pat -- regex matching operator
str !~ pat -- regex non-matching operator

matched:str(n:long) -- n'th subgroup of most recent match

matched_start:long(n:long)
matched_end:long(n:long)   -- start and end indices of subgroup

The functions can be actual functions in string.stp, implemented in
embedded-C, which access match information stored in the probe
context.

* * *

We want to avoid generating several copies of the same code for the
same regex used in more than one place. One reasonable solution might
be to allow definition of globally available regexes, as in:

global ident_re = /[_a-zA-Z][_a-zA-Z0-9]*/

All of the uses of indent_re then refer to the same matching code.

Another strategy (both might be worth using) is to keep a table of
declared regex patterns to detect when the same regex appears more
than once.

Another (more involved) idea for an optimization is to check whether we need to
save subgroup locations for a given match (or just compute a binary yes/no
answer without the need to do so). This might be handled by something as simple
as a pragma marking the matched_* tapset functions to signal that we need
subgroup saving, somewhat similar to how the existing /* pragma:unwind */
works.

* * *

In the long term, if we want other features such as substitution, we
might either use the Perl str =~ s/pat/sub/ syntax conventions, or we
might hack up the translator to detect some function-call-like construct
such as str_replace(prnt_str,/pat/,/rplc_pat/) and substitute the appropriate
generated code for that.

* * *

Short and incomplete list of files that will be affected:

* regcomp.cxx -- NEW file interfacing with (and eventually assimilating) the
re2c DFA compiler
  - defines a table of regular expressions used by the program (stored in the
session object)
  - action: register a new regular expression (save it in the table and compile
to internal DFA representation)
  - action: emit C code for given DFA (called at the appropriate point from
translate.cxx)

* staptree.cxx
* parse.cxx
  - support for parsing and internally representing the =~ operator
* elaborate.cxx
  - after appropriate semantic checks & code elision are done, register all of
the surviving regexes using the regcomp.cxx interface
* translate.cxx 
  - call regcomp.cxx code to emit regex matching subroutines
  - replace =~ with appropriate regex invocations

* string.stp
  - access the regex match data after the fact

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug runtime/11334] regular expression string matching
       [not found] <bug-11334-6586@http.sourceware.org/bugzilla/>
  2012-10-10 16:22 ` [Bug runtime/11334] regular expression string matching smakarov at redhat dot com
  2012-10-11 19:58 ` smakarov at redhat dot com
@ 2012-10-17 19:26 ` fche at redhat dot com
  2 siblings, 0 replies; 3+ messages in thread
From: fche at redhat dot com @ 2012-10-17 19:26 UTC (permalink / raw)
  To: systemtap


http://sourceware.org/bugzilla/show_bug.cgi?id=11334

Frank Ch. Eigler <fche at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|systemtap at sourceware dot |smakarov at redhat dot com
                   |org                         |

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2012-10-17 19:26 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-11334-6586@http.sourceware.org/bugzilla/>
2012-10-10 16:22 ` [Bug runtime/11334] regular expression string matching smakarov at redhat dot com
2012-10-11 19:58 ` smakarov at redhat dot com
2012-10-17 19:26 ` fche at redhat dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).