From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 32418 invoked by alias); 4 Jul 2002 15:25:52 -0000
Mailing-List: contact libc-hacker-help@sources.redhat.com; run by ezmlm
Precedence: bulk
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: libc-hacker-owner@sources.redhat.com
Received: (qmail 32398 invoked from network); 4 Jul 2002 15:25:50 -0000
Received: from unknown (HELO sunsite.mff.cuni.cz) (195.113.19.66) by sources.redhat.com with SMTP; 4 Jul 2002 15:25:50 -0000
Received: (from jakub@localhost) by sunsite.mff.cuni.cz (8.11.6/8.11.6) id g64FPhi00687; Thu, 4 Jul 2002 17:25:43 +0200
Date: Thu, 04 Jul 2002 08:25:00 -0000
From: Jakub Jelinek
To: Ulrich Drepper, Isamu Hasegawa
Cc: Glibc hackers
Subject: [PATCH] Decrease regex memory usage
Message-ID: <20020704172542.U20867@sunsite.ms.mff.cuni.cz>
Reply-To: Jakub Jelinek
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
X-SW-Source: 2002-07/txt/msg00009.txt.bz2

Hi!

Just reshuffling some structures can save memory, especially on 64-bit
arches.  E.g. re_token_t, which used to take 12 resp. 24 bytes (on 32-bit
resp. 64-bit platforms), now occupies 8 resp. 16 bytes, and as the current
code allocates sizeof(re_token_t)*length_of_regexp, this can add up to a
lot of memory for long regular expressions (consider e.g. ksymoops, which
uses patterns several kilobytes long).

BTW: I wonder whether it wouldn't be good to have a STRING node to replace
a lot of consecutive CHARACTER nodes, especially in MB_CUR_MAX == 1 mode.
Right now, searching for the regular expression
Veryveryveryloooooooooooooooooooooooooooooooooooooooooongstring[ ]*endofit
will occupy a lot of memory and will be slower than if the string could be
compared with strcmp once the first character is found.

2002-07-04  Jakub Jelinek

	* posix/regex_internal.h (re_token_t): Shrink the structure to 8
	resp. 16 bytes on 32-bit resp. 64-bit platforms.
	(re_charset_t, re_string_t): Reorder structure members for 64-bit
	arches.

--- libc/posix/regex_internal.h.jj	Wed Jun 5 10:27:39 2002
+++ libc/posix/regex_internal.h	Thu Jul 4 16:13:16 2002
@@ -142,24 +142,18 @@ typedef enum
 #ifdef RE_ENABLE_I18N
 typedef struct
 {
-  /* If this character set is the non-matching list. */
-  unsigned int non_match : 1;
-
   /* Multibyte characters. */
   wchar_t *mbchars;
-  int nmbchars;
 
   /* Collating symbols. */
 # ifdef _LIBC
   int32_t *coll_syms;
 # endif
-  int ncoll_syms;
 
   /* Equivalence classes. */
 # ifdef _LIBC
   int32_t *equiv_classes;
 # endif
-  int nequiv_classes;
 
   /* Range expressions. */
 # ifdef _LIBC
@@ -169,17 +163,32 @@ typedef struct
   wchar_t *range_starts;
   wchar_t *range_ends;
 # endif /* not _LIBC */
-  int nranges;
 
   /* Character classes. */
   wctype_t *char_classes;
+
+  /* If this character set is the non-matching list. */
+  unsigned int non_match : 1;
+
+  /* # of multibyte characters. */
+  int nmbchars;
+
+  /* # of collating symbols. */
+  int ncoll_syms;
+
+  /* # of equivalence classes. */
+  int nequiv_classes;
+
+  /* # of range expressions. */
+  int nranges;
+
+  /* # of character classes. */
   int nchar_classes;
 } re_charset_t;
 #endif /* RE_ENABLE_I18N */
 
 typedef struct
 {
-  re_token_type_t type;
   union
   {
     unsigned char c;	/* for CHARACTER */
@@ -195,6 +204,11 @@ typedef struct
       re_node_set *bkref_eclosure;
     } *ctx_info;
   } opr;
+#if __GNUC__ >= 2
+  re_token_type_t type : 8;
+#else
+  re_token_type_t type;
+#endif
   unsigned int constraint : 10;	/* context constraint */
   unsigned int duplicated : 1;
 #ifdef RE_ENABLE_I18N
@@ -214,9 +228,6 @@ struct re_string_t
   /* Indicate the raw buffer which is the original string passed as an
      argument of regexec(), re_search(), etc.. */
   const unsigned char *raw_mbs;
-  /* Index in RAW_MBS. Each character mbs[i] corresponds to
-     raw_mbs[raw_mbs_idx + i]. */
-  int raw_mbs_idx;
   /* Store the multibyte string. In case of "case insensitive mode" like
      REG_ICASE, upper cases of the string are stored, otherwise MBS points
      the same address that RAW_MBS points. */
@@ -230,6 +241,9 @@ struct re_string_t
   wint_t *wcs;
   mbstate_t cur_state;
 #endif
+  /* Index in RAW_MBS. Each character mbs[i] corresponds to
+     raw_mbs[raw_mbs_idx + i]. */
+  int raw_mbs_idx;
   /* The length of the valid characters in the buffers. */
   int valid_len;
   /* The length of the buffers MBS, MBS_CASE, and WCS. */

	Jakub