After some very painful analysis, I was able to reduce the degradation
we are experiencing in VRP to a handful of lines in the new
implementation of prange.

What happens is that any series of small changes to a new prange class
causes changes in the inlining of wide_int_storage elsewhere.  With
the attached patch, one difference lies in irange::singleton_p(tree
*).  Note that this is in irange, which is completely unrelated to the
new (unused) code.

Using trunk as the stage1 compiler, we can see the assembly for
irange::singleton_p(tree *) in value-range.cc is different with and
without my patch.

The number of calls into wide_int within irange::singleton_p(tree *) changes:

awk '/^_ZNK6irange11singleton_pEPP9tree_node/,/endproc/' value-range.s \
  | grep 'call.*wide_int'

With mainline sources:

        call    _ZN16wide_int_storageC2ERKS_
        call    _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1EEEE

With the attached patch:

        call    _ZN16wide_int_storageC2ERKS_
        call    _ZN16wide_int_storageC2ERKS_
        call    _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1EEEE
        call    _ZN16wide_int_storageC2ERKS_

The additional calls correspond to the wide_int_storage constructor:

        $ c++filt _ZN16wide_int_storageC2ERKS_
        wide_int_storage::wide_int_storage(wide_int_storage const&)

Using -fno-semantic-interposition makes no difference.

Here are the relevant bits in the difference from -Winline with and
without my patch:

>     inlined from ‘virtual bool irange::singleton_p(tree_node**) const’ at /home/aldyh/src/gcc/gcc/value-range.cc:1254:40:
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>       |        ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>       |       ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>       |        ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>       |       ^~~~~~~~~~~~~~~~
> In copy constructor ‘generic_wide_int<wide_int_storage>::generic_wide_int(const generic_wide_int<wide_int_storage>&)’,
>     inlined from ‘wide_int irange::lower_bound(unsigned int) const’ at /home/aldyh/src/gcc/gcc/value-range.h:1122:25,

Note that this is just one example.  There are also inlining
differences in irange::get_bitmask(), irange::union_bitmask(),
irange::operator=, among others.  Most of the inlining failures seem
to be related to wide_int_storage.  I am attaching the difference in
-Winline for the curious.
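
To make the shape of the problem concrete, here is a minimal sketch of
the kind of code involved: a storage class with a non-trivial inline
copy constructor, and a range class whose accessors return it by
value.  All names (small_storage, small_range, max_elts) are
hypothetical stand-ins, not the real wide_int_storage or irange, and a
translation unit this small will most likely have everything inlined.
It only illustrates where the copy constructor calls come from.

```cpp
// Hypothetical stand-in for a storage class with a non-trivial,
// inline copy constructor.  This is NOT GCC's wide_int_storage.
struct small_storage
{
  static constexpr unsigned max_elts = 4;
  long val[max_elts];
  unsigned len;

  small_storage () : val (), len (0) {}
  small_storage (const small_storage &x);
};

// Declared inline, but the inliner is still free to emit a call,
// for example once a growth limit such as --param inline-unit-growth
// has been reached for the translation unit.
inline small_storage::small_storage (const small_storage &x)
  : len (x.len)
{
  for (unsigned i = 0; i < max_elts; ++i)
    val[i] = i < len ? x.val[i] : 0;
}

// Hypothetical stand-in for a range class holding two bounds.
struct small_range
{
  small_storage bounds[2];

  // Returning by value goes through the copy constructor.
  small_storage lower_bound () const { return bounds[0]; }
  bool singleton_p (long *result) const;
};

// Each by-value copy below is either inlined or turned into a call
// to the copy constructor, depending on the inliner's decisions.
bool
small_range::singleton_p (long *result) const
{
  small_storage lo = lower_bound ();
  small_storage hi = bounds[1];
  if (lo.len == 1 && hi.len == 1 && lo.val[0] == hi.val[0])
    {
      if (result)
        *result = lo.val[0];
      return true;
    }
  return false;
}
```

Compiling something like this with g++ -O2 -Winline -S and grepping
the assembly for calls to the copy constructor is the same kind of
check as the awk/grep pipeline above.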

Tracking this down is tricky because the slightest change in the patch
causes different inlining in irange.  Even using a slightly different
stage1 compiler produces different inlining decisions.  For example,
using GCC 13 as the stage1 compiler, VRP exhibits a slowdown of 2%
with the full prange class.  Although this is virtually identical to
the slowdown when using trunk as the stage1 compiler, the inlining
failures are a tad different.

I am tempted to commit the attached patch to mainline.  It slows down
VRP by only 0.3%, but that is measurable enough to analyze, and it
would give us a base commit point from which to do the analysis.  My
wife is about to give birth any day now, so I'm afraid if I drop off
for a few months, we'll lose the analysis and the point in time from
where to do it.

One final thing.  The full prange class patch, even when disabled,
slows VRP by 2%.  I tried to implement the class in small increments,
and every small change caused a further slowdown.  I don't know if
this 2% is final, or if further tweaks in this space will slow us down
more.

On a positive note, with the entirety of prange implemented (not just
the base class, but with range-ops implemented and prange enabled),
there is no overall change to VRP, and IPA-cp speeds up by 7%.  This is because
holding pointers in prange is a net win that overcomes the 2% handicap
the inliner is hitting us with.

I would love to hear thoughts, and to know whether y'all agree that
committing a small skeleton now can help us track this down in the
future.

Aldy

On Tue, Apr 30, 2024 at 11:37 PM Jason Merrill <jason@redhat.com> wrote:
>
> On 4/30/24 12:22, Jakub Jelinek wrote:
> > On Tue, Apr 30, 2024 at 03:09:51PM -0400, Jason Merrill via Gcc wrote:
> >> On Fri, Apr 26, 2024 at 5:44 AM Aldy Hernandez via Gcc <gcc@gcc.gnu.org> wrote:
> >>>
> >>> In implementing prange (pointer ranges), I have found a 1.74% slowdown
> >>> in VRP, even without any code path actually using the code.  I have
> >>> tracked this down to irange::get_bitmask() being compiled differently
> >>> with and without the bare bones patch.  With the patch,
> >>> irange::get_bitmask() has a lot of code inlined into it, particularly
> >>> get_bitmask_from_range() and consequently the wide_int_storage code.
> >> ...
> >>> +static irange_bitmask
> >>> +get_bitmask_from_range (tree type,
> >>> +                     const wide_int &min, const wide_int &max)
> >> ...
> >>> -irange_bitmask
> >>> -irange::get_bitmask_from_range () const
> >>
> >> My guess is that this is the relevant change: the old function has
> >> external linkage, and is therefore interposable, which inhibits
> >> inlining.  The new function has internal linkage, which allows
> >> inlining.
> >
> > Even when a function is exported, when not compiled with -fpic/-fPIC
> > if we know the function is defined in current TU, it can't be interposed,
>
> Ah, I was misremembering the effect of the change.  Rather, it's that if
> we see that a function with internal linkage has only a single caller,
> we try harder to inline it.
>
> Jason
>