public inbox for glibc-bugs-regex@sourceware.org
help / color / mirror / Atom feed
From: "steve98 at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sourceware.org
Subject: [Bug regex/25934] New: re_token_t.mb_partial used before initialization
Date: Thu, 07 May 2020 00:49:10 +0000	[thread overview]
Message-ID: <bug-25934-132@http.sourceware.org/bugzilla/> (raw)

https://sourceware.org/bugzilla/show_bug.cgi?id=25934

            Bug ID: 25934
           Summary: re_token_t.mb_partial used before initialization
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: steve98 at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

I was debugging my own program the the Valgrind/Memcheck tool, and discovered a
case of regex using a variable without first initializing it.

The offending code is in regcomp.c near line 328:

*p++ = dfa->nodes[node].opr.c;
              while (++node < dfa->nodes_len
                     && dfa->nodes[node].type == CHARACTER
                     && dfa->nodes[node].mb_partial)
                *p++ = dfa->nodes[node].opr.c;

The Valgrind/Memcheck reported the problem as:

==31536== Conditional jump or move depends on uninitialised value(s)
==31536==    at 0x56F213D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==31536==    by 0x57023F0: __re_compile_fastmap (regcomp.c:282)
==31536==    by 0x57023F0: regcomp (regcomp.c:509)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)
(The rest, plus the line above, is from our own code)

The tool also reported where/how such a variable was allocated (from the heap,
not the stack)

==31536==  Uninitialised value was created by a heap allocation
==31536==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31536==    by 0x56F2D6A: create_token_tree.isra.14.constprop.39
(regcomp.c:3749)
==31536==    by 0x56FB133: parse_bracket_exp (regcomp.c:3299)
==31536==    by 0x56FB133: parse_expression (regcomp.c:2262)
==31536==    by 0x56FB519: parse_branch (regcomp.c:2190)
==31536==    by 0x56FB68B: parse_reg_exp (regcomp.c:2138)
==31536==    by 0x56FBD7C: parse (regcomp.c:2107)
==31536==    by 0x56FBD7C: re_compile_internal (regcomp.c:788)
==31536==    by 0x5702331: regcomp (regcomp.c:498)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)

I reviewed the related code, and saw that the "mb_partial" field is part of the
following structure (in regex_internal.h):

typedef struct
{
  union
  {
    unsigned char c;                /* for CHARACTER */
    re_bitset_ptr_t sbcset;        /* for SIMPLE_BRACKET */
#ifdef RE_ENABLE_I18N
    re_charset_t *mbcset;        /* for COMPLEX_BRACKET */
#endif /* RE_ENABLE_I18N */
    Idx idx;                        /* for BACK_REF */
    re_context_type ctx_type;        /* for ANCHOR */
  } opr;
#if __GNUC__ >= 2 && !defined __STRICT_ANSI__
  re_token_type_t type : 8;
#else
  re_token_type_t type;
#endif
  unsigned int constraint : 10;        /* context constraint */
  unsigned int duplicated : 1;
  unsigned int opt_subexp : 1;
#ifdef RE_ENABLE_I18N
  unsigned int accept_mb : 1;
  /* These 2 bits can be moved into the union if needed (e.g. if running out
     of bits; move opr.c to opr.c.c and move the flags to opr.c.flags).  */
  unsigned int mb_partial : 1;
#endif
  unsigned int word_char : 1;
} re_token_t;

I then compared the write operation of "mb_partial" against others, such as the
near-by "accept_mb" member, and saw that indeed there are times that the
"accept_mb" field is set, but the "mb_partial" field is not. For example (in
regex_internal.c, near line 1450):

  dfa->nodes[dfa->nodes_len].constraint = 0;
#ifdef RE_ENABLE_I18N
  dfa->nodes[dfa->nodes_len].accept_mb =
    ((token.type == OP_PERIOD && dfa->mb_cur_max > 1)
     || token.type == COMPLEX_BRACKET);
#endif

The code above seems to be performing initialization for the newly allocated
"node", but clearly missed the mb_partial field.

My problem occurred with 2.27, but I also read through the latest code, and it
seems that mb_partial is not being initialized in additional places. So the
problem should still exist in the latest version.

If you need me to validate the problem against the latest glibc version, or
provide a program to demonstrate this problem with Valgrind/Memcheck, I'll be
happy to do so.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

             reply	other threads:[~2020-05-07  0:49 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-07  0:49 steve98 at gmail dot com [this message]
2020-05-07  6:07 ` [Bug regex/25934] " steve98 at gmail dot com
2020-05-07  6:37 ` sangshuduo at gmail dot com

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-25934-132@http.sourceware.org/bugzilla/ \
    --to=sourceware-bugzilla@sourceware.org \
    --cc=glibc-bugs-regex@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).