From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=EuDN=CH=kam.mff.cuni.cz=hubicka@sourceware.org>
Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16])
	by sourceware.org (Postfix) with ESMTPS id 5B2873858D3C
	for <gcc-patches@gcc.gnu.org>; Mon, 19 Jun 2023 16:14:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5B2873858D3C
Authentication-Results: sourceware.org; dmarc=fail (p=none dis=none) header.from=ucw.cz
Authentication-Results: sourceware.org; spf=none smtp.mailfrom=kam.mff.cuni.cz
Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202)
	id 9B58228AEC4; Mon, 19 Jun 2023 18:14:19 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ucw.cz; s=gen1;
	t=1687191259;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=xsIbqv29ZVxAqt5MhukuajEJ+BViXOQ5wqNqFo/NJ/c=;
	b=qwFDu+oIqbruNGOq+MP3qgZIvi7XxDKQQ+Op5OVZb6eCBPwT5g49mFFx37gu99xMIqsE7h
	ZVaRsvNcD4dS/3mIdyQy+7TKMRBpwybk3UuZSvQu43ovzarrF8adQKSrSJPtWsf95HoBtF
	na1hwi2+T6tgZL7UvBn0wRsnp2GwHp8=
Date: Mon, 19 Jun 2023 18:14:19 +0200
From: Jan Hubicka <hubicka@ucw.cz>
To: Jonathan Wakely <jwakely@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>, gcc-patches@gcc.gnu.org
Subject: Re: [libstdc++] Improve M_check_len
Message-ID: <ZJB+26dR4DWSXraS@kam.mff.cuni.cz>
References: <ZI9MmdQ+OMehcdeg@kam.mff.cuni.cz>
 <CACb0b4kgHajswPwumWtjLeXOMt75tt0jJhmb1-JZ4wzrfXvB4A@mail.gmail.com>
 <ZJA2gJzMkigiwqkZ@kam.mff.cuni.cz>
 <ZJA5/fuVUNpkc3ms@tucnak>
 <CACb0b4nt=mNKYS7+Fwnrf0Kq4umLu0M7Z-SotzCbx7=oEs021g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CACb0b4nt=mNKYS7+Fwnrf0Kq4umLu0M7Z-SotzCbx7=oEs021g@mail.gmail.com>
X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

> On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:
> 
> > On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
> > wrote:
> > > -     if (max_size() - size() < __n)
> > > -       __throw_length_error(__N(__s));
> > > +     const size_type __max_size = max_size();
> > > +     // On 64bit systems vectors can not reach overflow by growing
> > > +     // by small sizes; before this happens, we will run out of memory.
> > > +     if (__builtin_constant_p(__n)
> > > +         && __builtin_constant_p(__max_size)
> > > +         && sizeof(ptrdiff_t) >= 8
> > > +         && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
> >
> > Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation?
> >
> 
> For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp). But
> for a user-defined allocator that has a silly max_size(), yes, that's
> possible.
> 
> I still don't really understand why any change is needed here. The PR says
> that the current _M_check_len brings in the EH code, but how/why does that
> happen? The __throw_length_error function is not inline, it's defined in
> libstdc++.so, so why isn't it just an extern call? Is the problem that it

It is really quite interesting peformance problem which does affect real
code. Extra extern call counts (especially since it is seen as
3 calls by inliner).  

This is _M_check_len after early optimizations (so as seen by inline
heuristics):

  <bb 2> [local count: 1073741824]:
  _15 = this_7(D)->D.26656._M_impl.D.25963._M_finish;
  _14 = this_7(D)->D.26656._M_impl.D.25963._M_start;
  _13 = _15 - _14;
  _10 = _13 /[ex] 8;
  _8 = (long unsigned int) _10;
  _1 = 1152921504606846975 - _8;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
    goto <bb 3>; [0.04%]
  else
    goto <bb 4>; [99.96%]

  <bb 3> [local count: 429496]:
  std::__throw_length_error (__s_16(D));

  <bb 4> [local count: 1073312329]:
  D.27696 = _8;
  if (__n.3_2 > _8)
    goto <bb 5>; [34.00%]
  else
    goto <bb 6>; [66.00%]

  <bb 5> [local count: 364926196]:

  <bb 6> [local count: 1073312330]:
  # _18 = PHI <&D.27696(4), &__n(5)>
  _3 = *_18;
  __len_11 = _3 + _8;
  D.27696 ={v} {CLOBBER(eol)};
  if (_8 > __len_11)
    goto <bb 8>; [35.00%]
  else
    goto <bb 7>; [65.00%]

  <bb 7> [local count: 697653013]:
  _5 = MIN_EXPR <__len_11, 1152921504606846975>;

  <bb 8> [local count: 1073312330]:
  # iftmp.4_4 = PHI <1152921504606846975(6), _5(7)>
  return iftmp.4_4;

So a lot of code that is essnetially semantically equivalent to:

   return __size + MAX_EXPR (__n, __size)

at least with the default allocator.

Early inliner decides that it is not good idea to early inline. 
At this stage we inline mostly calls where we expect code to get
smaller after inlining and since the function contains another
uninlinable call, this does not seem likely.

With -O3 we will inline it later at IPA stage, but only when the code is
considered hot. 
With -O2 we decide to keep it offline if the unit contians multiple
calls to the function otherwise we inline it since it wins in the code
size estimation model.

The problem is that _M_check_len is used by _M_realloc_insert that later
feeds result to the allocator.  There is extra redundancy since
allocator can call std::__throw_bad_array_new_length and 
std::__throw_bad_alloc for bad sizes, but _M_check_len will not produce
them which is something we won't work out before inlning it.

As a result _M_realloc_insert is seen as very large function by
inliner heuristics (71 instructions).  Functions that are not
declared inline are inlined if smaller than 15 instructions with -O2
and 30 instructions with -O3. So we don't inline.

This hurts common lops that use vector as a stack and calls push_back
in internal loop. Not inlining prevents SRA and we end up saving and
loading the end of vector pointer on every iteration of the loop.

The following testcase:

typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
        std::vector<std::pair<uint32_t, uint32_t>> stack;
        stack.push_back (pair);
        while (!stack.empty()) {
                std::pair<uint32_t, uint32_t> cur = stack.back();
                stack.pop_back();
                if (!cur.first)
                {
                        cur.second++;
                        stack.push_back (cur);
                }
                if (cur.second > 10000)
                        break;
        }
}
int
main()
{
        for (int i = 0; i < 10000; i++)
          test();
}

Runs for me 0.5s with _M_realoc_insert not inlined and 0.048s with
_M_realloc_insert inlined.  Clang inlines it even at -O2 and does
0.063s.  I believe it is the reason why jpegxl library is slower
when built with GCC and since such loops are quite common in say
DFS walk, I think it is frequent problem.
> makes _M_check_len potentially-throwing? Because that's basically the
> entire point of _M_check_len: to throw the exception that is required by
> the C++ standard. We need to be very careful about removing that required
> throw! And after we call _M_check_len we call allocate unconditionally, so
> _M_realloc_insert can always throw (we only call _M_realloc_insert in the
> case where we've already decided a reallocation is definitely needed).
> 
> Would this version of _M_check_len help?
> 
>       size_type
>       _M_check_len(size_type __n, const char* __s) const
>       {
>         const size_type __size = size();
>         const size_type __max_size = max_size();
> 
>         if (__is_same(allocator_type, allocator<_Tp>)
>               && __size > __max_size / 2)
>           __builtin_unreachable(); // Assume std::allocator can't fill
> memory.
>         else if (__size > __max_size)
>           __builtin_unreachable();
> 
>         if (__max_size - __size < __n)
>           __throw_length_error(__N(__s));
> 
>         const size_type __len = __size + (std::max)(__size, __n);
>         return (__len < __size || __len > __max_size) ? __max_size : __len;
>       }
> 
> This only applies to std::allocator, not user-defined allocators (because
> we don't know their semantics). It also seems like less of a big hack!

This helps my testcase :)
_M_check_len is now:

size_type std::vector<std::pair<unsigned int, unsigned int> >::_M_check_len (const struct vector * const this, size_type __n, const char * __s)
{
  const size_type __len;
  long unsigned int _1;
  long unsigned int __n.5_2;
  long int _5;
  long unsigned int _6;
  struct pair * _7;
  struct pair * _8;
  long int _12;
  long unsigned int _13;
  long unsigned int _14;
  long unsigned int _15;

  <bb 2> [local count: 1073741824]:
  _8 = this_4(D)->D.26651._M_impl.D.25958._M_finish;
  _7 = this_4(D)->D.26651._M_impl.D.25958._M_start;
  _5 = _8 - _7;
  _12 = _5 /[ex] 8;
  _13 = (long unsigned int) _12;
  _15 = (long unsigned int) _5;
  if (_15 > 4611686018427387896)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 3> [count: 0]:
  __builtin_unreachable ();

  <bb 4> [local count: 1073741824]:
  _1 = 1152921504606846975 - _13;
  __n.5_2 = __n;
  if (_1 < __n.5_2)
    goto <bb 5>; [0.04%]
  else
    goto <bb 6>; [99.96%]

  <bb 5> [local count: 429496]:
  std::__throw_length_error (__s_10(D));

  <bb 6> [local count: 1073312329]:
  _14 = MAX_EXPR <__n.5_2, _13>;
  __len_9 = _13 + _14;
  _6 = MIN_EXPR <__len_9, 1152921504606846975>;
  return _6;
}

Originally it counted as 32 instructions in inliner metrics (load/calls
counts as 2), while now it is 28 and 13 of them are expected to survive
optimizations after inlining.

With the patch we inline _M_realloc_insert at -O3.  With -O2 we still do
not inline it (and I am not convinced we should), but reduce its code
size from 347 bytes to 227 bytes in the final binary that also counts.

Estimated size of offlined _M_realloc_insert is still 67 instructions
instead of formed 71 (still long way to 15).

How common are nonstandard allocators?

Honza