public inbox for glibc-bugs-regex@sourceware.org
help / color / mirror / Atom feed
* [Bug regex/25934] New: re_token_t.mb_partial used before initialization
@ 2020-05-07  0:49 steve98 at gmail dot com
  2020-05-07  6:07 ` [Bug regex/25934] " steve98 at gmail dot com
  2020-05-07  6:37 ` sangshuduo at gmail dot com
  0 siblings, 2 replies; 3+ messages in thread
From: steve98 at gmail dot com @ 2020-05-07  0:49 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=25934

            Bug ID: 25934
           Summary: re_token_t.mb_partial used before initialization
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: steve98 at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

I was debugging my own program the the Valgrind/Memcheck tool, and discovered a
case of regex using a variable without first initializing it.

The offending code is in regcomp.c near line 328:

*p++ = dfa->nodes[node].opr.c;
              while (++node < dfa->nodes_len
                     && dfa->nodes[node].type == CHARACTER
                     && dfa->nodes[node].mb_partial)
                *p++ = dfa->nodes[node].opr.c;

The Valgrind/Memcheck reported the problem as:

==31536== Conditional jump or move depends on uninitialised value(s)
==31536==    at 0x56F213D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==31536==    by 0x57023F0: __re_compile_fastmap (regcomp.c:282)
==31536==    by 0x57023F0: regcomp (regcomp.c:509)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)
(The rest, plus the line above, is from our own code)

The tool also reported where/how such a variable was allocated (from the heap,
not the stack)

==31536==  Uninitialised value was created by a heap allocation
==31536==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31536==    by 0x56F2D6A: create_token_tree.isra.14.constprop.39
(regcomp.c:3749)
==31536==    by 0x56FB133: parse_bracket_exp (regcomp.c:3299)
==31536==    by 0x56FB133: parse_expression (regcomp.c:2262)
==31536==    by 0x56FB519: parse_branch (regcomp.c:2190)
==31536==    by 0x56FB68B: parse_reg_exp (regcomp.c:2138)
==31536==    by 0x56FBD7C: parse (regcomp.c:2107)
==31536==    by 0x56FBD7C: re_compile_internal (regcomp.c:788)
==31536==    by 0x5702331: regcomp (regcomp.c:498)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)

I reviewed the related code, and saw that the "mb_partial" field is part of the
following structure (in regex_internal.h):

typedef struct
{
  union
  {
    unsigned char c;                /* for CHARACTER */
    re_bitset_ptr_t sbcset;        /* for SIMPLE_BRACKET */
#ifdef RE_ENABLE_I18N
    re_charset_t *mbcset;        /* for COMPLEX_BRACKET */
#endif /* RE_ENABLE_I18N */
    Idx idx;                        /* for BACK_REF */
    re_context_type ctx_type;        /* for ANCHOR */
  } opr;
#if __GNUC__ >= 2 && !defined __STRICT_ANSI__
  re_token_type_t type : 8;
#else
  re_token_type_t type;
#endif
  unsigned int constraint : 10;        /* context constraint */
  unsigned int duplicated : 1;
  unsigned int opt_subexp : 1;
#ifdef RE_ENABLE_I18N
  unsigned int accept_mb : 1;
  /* These 2 bits can be moved into the union if needed (e.g. if running out
     of bits; move opr.c to opr.c.c and move the flags to opr.c.flags).  */
  unsigned int mb_partial : 1;
#endif
  unsigned int word_char : 1;
} re_token_t;

I then compared the write operation of "mb_partial" against others, such as the
near-by "accept_mb" member, and saw that indeed there are times that the
"accept_mb" field is set, but the "mb_partial" field is not. For example (in
regex_internal.c, near line 1450):

  dfa->nodes[dfa->nodes_len].constraint = 0;
#ifdef RE_ENABLE_I18N
  dfa->nodes[dfa->nodes_len].accept_mb =
    ((token.type == OP_PERIOD && dfa->mb_cur_max > 1)
     || token.type == COMPLEX_BRACKET);
#endif

The code above seems to be performing initialization for the newly allocated
"node", but clearly missed the mb_partial field.

My problem occurred with 2.27, but I also read through the latest code, and it
seems that mb_partial is not being initialized in additional places. So the
problem should still exist in the latest version.

If you need me to validate the problem against the latest glibc version, or
provide a program to demonstrate this problem with Valgrind/Memcheck, I'll be
happy to do so.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug regex/25934] re_token_t.mb_partial used before initialization
  2020-05-07  0:49 [Bug regex/25934] New: re_token_t.mb_partial used before initialization steve98 at gmail dot com
@ 2020-05-07  6:07 ` steve98 at gmail dot com
  2020-05-07  6:37 ` sangshuduo at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: steve98 at gmail dot com @ 2020-05-07  6:07 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=25934

--- Comment #1 from Steven Li <steve98 at gmail dot com> ---
OK, I managed to create a simple problem to recreate this problem (on Ubuntu
18.04, using 2.27). The code is super simple:

$ cat a.c
#include <stdio.h>
#include <regex.h>
#include <locale.h>

int main() {
  char * pattern = "^[ab]*(c)$"; // any simpler, the problem goes away
  int flags = REG_ICASE; // has to be there for problem to appear

  setlocale(LC_CTYPE, ""); // without this, there is no problem

  regex_t regex;
  regcomp(&regex, pattern, flags);
}

The interesting thing is with the 1st 3 lines of code, each of them is a
necessary condition for the problem. Compiling the code is easy enough:

$ rm a.out; gcc a.c

Running the code under Valgrind yields really interesting/disturbing result
(with my home directory name masked out in the messages):

$ valgrind --track-origins=yes --leak-check=yes ./a.out
==12322== Memcheck, a memory error detector
==12322== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12322== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==12322== Command: ./a.out
==12322==
==12322== Conditional jump or move depends on uninitialised value(s)
==12322==    at 0x4F2D13D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==12322==    by 0x4F3D3D0: __re_compile_fastmap (regcomp.c:280)
==12322==    by 0x4F3D3D0: regcomp (regcomp.c:509)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==  Uninitialised value was created by a heap allocation
==12322==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F2DD6A: create_token_tree.isra.14.constprop.39
(regcomp.c:3749)
==12322==    by 0x4F35885: parse_expression (regcomp.c:2356)
==12322==    by 0x4F364CB: parse_branch (regcomp.c:2183)
==12322==    by 0x4F3668B: parse_reg_exp (regcomp.c:2138)
==12322==    by 0x4F36D7C: parse (regcomp.c:2107)
==12322==    by 0x4F36D7C: re_compile_internal (regcomp.c:788)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== Conditional jump or move depends on uninitialised value(s)
==12322==    at 0x4F2D13D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==12322==    by 0x4F3D3F0: __re_compile_fastmap (regcomp.c:282)
==12322==    by 0x4F3D3F0: regcomp (regcomp.c:509)
==12322==    by 0x108749: main (in /home/stevel/workspace/TDengine/a.out)
==12322==  Uninitialised value was created by a heap allocation
==12322==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F2DD6A: create_token_tree.isra.14.constprop.39
(regcomp.c:3749)
==12322==    by 0x4F35885: parse_expression (regcomp.c:2356)
==12322==    by 0x4F364CB: parse_branch (regcomp.c:2183)
==12322==    by 0x4F3668B: parse_reg_exp (regcomp.c:2138)
==12322==    by 0x4F36D7C: parse (regcomp.c:2107)
==12322==    by 0x4F36D7C: re_compile_internal (regcomp.c:788)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322==
==12322== HEAP SUMMARY:
==12322==     in use at exit: 2,680 bytes in 48 blocks
==12322==   total heap usage: 82 allocs, 34 frees, 9,003 bytes allocated
==12322==
==12322== 256 bytes in 1 blocks are definitely lost in loss record 35 of 39
==12322==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F3D2C9: regcomp (regcomp.c:479)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== 2,424 (224 direct, 2,200 indirect) bytes in 1 blocks are definitely
lost in loss record 39 of 39
==12322==    at 0x4C2FA3F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4C31D84: realloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F37CDB: re_compile_internal (regcomp.c:749)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== LEAK SUMMARY:
==12322==    definitely lost: 480 bytes in 2 blocks
==12322==    indirectly lost: 2,200 bytes in 46 blocks
==12322==      possibly lost: 0 bytes in 0 blocks
==12322==    still reachable: 0 bytes in 0 blocks
==12322==         suppressed: 0 bytes in 0 blocks
==12322==
==12322== For counts of detected and suppressed errors, rerun with: -v
==12322== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug regex/25934] re_token_t.mb_partial used before initialization
  2020-05-07  0:49 [Bug regex/25934] New: re_token_t.mb_partial used before initialization steve98 at gmail dot com
  2020-05-07  6:07 ` [Bug regex/25934] " steve98 at gmail dot com
@ 2020-05-07  6:37 ` sangshuduo at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: sangshuduo at gmail dot com @ 2020-05-07  6:37 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=25934

Shuduo Sang <sangshuduo at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sangshuduo at gmail dot com

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-05-07  6:37 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07  0:49 [Bug regex/25934] New: re_token_t.mb_partial used before initialization steve98 at gmail dot com
2020-05-07  6:07 ` [Bug regex/25934] " steve98 at gmail dot com
2020-05-07  6:37 ` sangshuduo at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).