From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id 5B2873858D3C for ; Mon, 19 Jun 2023 16:14:21 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5B2873858D3C Authentication-Results: sourceware.org; dmarc=fail (p=none dis=none) header.from=ucw.cz Authentication-Results: sourceware.org; spf=none smtp.mailfrom=kam.mff.cuni.cz Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 9B58228AEC4; Mon, 19 Jun 2023 18:14:19 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ucw.cz; s=gen1; t=1687191259; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=xsIbqv29ZVxAqt5MhukuajEJ+BViXOQ5wqNqFo/NJ/c=; b=qwFDu+oIqbruNGOq+MP3qgZIvi7XxDKQQ+Op5OVZb6eCBPwT5g49mFFx37gu99xMIqsE7h ZVaRsvNcD4dS/3mIdyQy+7TKMRBpwybk3UuZSvQu43ovzarrF8adQKSrSJPtWsf95HoBtF na1hwi2+T6tgZL7UvBn0wRsnp2GwHp8= Date: Mon, 19 Jun 2023 18:14:19 +0200 From: Jan Hubicka To: Jonathan Wakely Cc: Jakub Jelinek , gcc-patches@gcc.gnu.org Subject: Re: [libstdc++] Improve M_check_len Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: > On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote: > > > On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches > > wrote: > > > - if (max_size() - size() < __n) > > > - __throw_length_error(__N(__s)); > > > + const size_type __max_size = max_size(); > > > + // On 64bit systems vectors can not reach overflow by growing > > > + // by small sizes; before this happens, we will run out of memory. > > > + if (__builtin_constant_p(__n) > > > + && __builtin_constant_p(__max_size) > > > + && sizeof(ptrdiff_t) >= 8 > > > + && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60) > > > > Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation? > > > > For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp). But > for a user-defined allocator that has a silly max_size(), yes, that's > possible. > > I still don't really understand why any change is needed here. The PR says > that the current _M_check_len brings in the EH code, but how/why does that > happen? The __throw_length_error function is not inline, it's defined in > libstdc++.so, so why isn't it just an extern call? Is the problem that it It is really quite interesting peformance problem which does affect real code. Extra extern call counts (especially since it is seen as 3 calls by inliner). This is _M_check_len after early optimizations (so as seen by inline heuristics): [local count: 1073741824]: _15 = this_7(D)->D.26656._M_impl.D.25963._M_finish; _14 = this_7(D)->D.26656._M_impl.D.25963._M_start; _13 = _15 - _14; _10 = _13 /[ex] 8; _8 = (long unsigned int) _10; _1 = 1152921504606846975 - _8; __n.3_2 = __n; if (_1 < __n.3_2) goto ; [0.04%] else goto ; [99.96%] [local count: 429496]: std::__throw_length_error (__s_16(D)); [local count: 1073312329]: D.27696 = _8; if (__n.3_2 > _8) goto ; [34.00%] else goto ; [66.00%] [local count: 364926196]: [local count: 1073312330]: # _18 = PHI <&D.27696(4), &__n(5)> _3 = *_18; __len_11 = _3 + _8; D.27696 ={v} {CLOBBER(eol)}; if (_8 > __len_11) goto ; [35.00%] else goto ; [65.00%] [local count: 697653013]: _5 = MIN_EXPR <__len_11, 1152921504606846975>; [local count: 1073312330]: # iftmp.4_4 = PHI <1152921504606846975(6), _5(7)> return iftmp.4_4; So a lot of code that is essnetially semantically equivalent to: return __size + MAX_EXPR (__n, __size) at least with the default allocator. Early inliner decides that it is not good idea to early inline. At this stage we inline mostly calls where we expect code to get smaller after inlining and since the function contains another uninlinable call, this does not seem likely. With -O3 we will inline it later at IPA stage, but only when the code is considered hot. With -O2 we decide to keep it offline if the unit contians multiple calls to the function otherwise we inline it since it wins in the code size estimation model. The problem is that _M_check_len is used by _M_realloc_insert that later feeds result to the allocator. There is extra redundancy since allocator can call std::__throw_bad_array_new_length and std::__throw_bad_alloc for bad sizes, but _M_check_len will not produce them which is something we won't work out before inlning it. As a result _M_realloc_insert is seen as very large function by inliner heuristics (71 instructions). Functions that are not declared inline are inlined if smaller than 15 instructions with -O2 and 30 instructions with -O3. So we don't inline. This hurts common lops that use vector as a stack and calls push_back in internal loop. Not inlining prevents SRA and we end up saving and loading the end of vector pointer on every iteration of the loop. The following testcase: typedef unsigned int uint32_t; std::pair pair; void test() { std::vector> stack; stack.push_back (pair); while (!stack.empty()) { std::pair cur = stack.back(); stack.pop_back(); if (!cur.first) { cur.second++; stack.push_back (cur); } if (cur.second > 10000) break; } } int main() { for (int i = 0; i < 10000; i++) test(); } Runs for me 0.5s with _M_realoc_insert not inlined and 0.048s with _M_realloc_insert inlined. Clang inlines it even at -O2 and does 0.063s. I believe it is the reason why jpegxl library is slower when built with GCC and since such loops are quite common in say DFS walk, I think it is frequent problem. > makes _M_check_len potentially-throwing? Because that's basically the > entire point of _M_check_len: to throw the exception that is required by > the C++ standard. We need to be very careful about removing that required > throw! And after we call _M_check_len we call allocate unconditionally, so > _M_realloc_insert can always throw (we only call _M_realloc_insert in the > case where we've already decided a reallocation is definitely needed). > > Would this version of _M_check_len help? > > size_type > _M_check_len(size_type __n, const char* __s) const > { > const size_type __size = size(); > const size_type __max_size = max_size(); > > if (__is_same(allocator_type, allocator<_Tp>) > && __size > __max_size / 2) > __builtin_unreachable(); // Assume std::allocator can't fill > memory. > else if (__size > __max_size) > __builtin_unreachable(); > > if (__max_size - __size < __n) > __throw_length_error(__N(__s)); > > const size_type __len = __size + (std::max)(__size, __n); > return (__len < __size || __len > __max_size) ? __max_size : __len; > } > > This only applies to std::allocator, not user-defined allocators (because > we don't know their semantics). It also seems like less of a big hack! This helps my testcase :) _M_check_len is now: size_type std::vector >::_M_check_len (const struct vector * const this, size_type __n, const char * __s) { const size_type __len; long unsigned int _1; long unsigned int __n.5_2; long int _5; long unsigned int _6; struct pair * _7; struct pair * _8; long int _12; long unsigned int _13; long unsigned int _14; long unsigned int _15; [local count: 1073741824]: _8 = this_4(D)->D.26651._M_impl.D.25958._M_finish; _7 = this_4(D)->D.26651._M_impl.D.25958._M_start; _5 = _8 - _7; _12 = _5 /[ex] 8; _13 = (long unsigned int) _12; _15 = (long unsigned int) _5; if (_15 > 4611686018427387896) goto ; [0.00%] else goto ; [100.00%] [count: 0]: __builtin_unreachable (); [local count: 1073741824]: _1 = 1152921504606846975 - _13; __n.5_2 = __n; if (_1 < __n.5_2) goto ; [0.04%] else goto ; [99.96%] [local count: 429496]: std::__throw_length_error (__s_10(D)); [local count: 1073312329]: _14 = MAX_EXPR <__n.5_2, _13>; __len_9 = _13 + _14; _6 = MIN_EXPR <__len_9, 1152921504606846975>; return _6; } Originally it counted as 32 instructions in inliner metrics (load/calls counts as 2), while now it is 28 and 13 of them are expected to survive optimizations after inlining. With the patch we inline _M_realloc_insert at -O3. With -O2 we still do not inline it (and I am not convinced we should), but reduce its code size from 347 bytes to 227 bytes in the final binary that also counts. Estimated size of offlined _M_realloc_insert is still 67 instructions instead of formed 71 (still long way to 15). How common are nonstandard allocators? Honza