After some very painful analysis, I was able to reduce the degradation we are experiencing in VRP to a handful of lines in the new implementation of prange.

What happens is that any series of small changes to a new prange class causes changes in the inlining of wide_int_storage elsewhere. With the attached patch, one difference lies in irange::singleton_p(tree *). Note that this is in irange, which is completely unrelated to the new (unused) code.

Using trunk as the stage1 compiler, we can see the assembly for irange::singleton_p(tree *) in value-range.cc is different with and without my patch.

The number of calls into wide_int within irange::singleton_p(tree *) changes:

    awk '/^_ZNK6irange11singleton_pEPP9tree_node/,/endproc/' value-range.s | grep call.*wide_int

With mainline sources:

    call _ZN16wide_int_storageC2ERKS_
    call _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1EEEE

With the attached patch:

    call _ZN16wide_int_storageC2ERKS_
    call _ZN16wide_int_storageC2ERKS_
    call _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1EEEE
    call _ZN16wide_int_storageC2ERKS_

The additional calls correspond to the wide_int_storage constructor:

    $ c++filt _ZN16wide_int_storageC2ERKS_
    wide_int_storage::wide_int_storage(wide_int_storage const&)

Using -fno-semantic-interposition makes no difference.
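For anyone who wants to poke at the same kind of check outside of GCC's sources, here is a minimal sketch. It is not wide_int_storage: small_storage, bound_holder and first_word are hypothetical stand-ins that only mirror the shape involved, namely an inline copy constructor defined out of class and a by-value accessor that has to invoke it.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a small value type with a hand-written copy
// constructor. This is NOT GCC's wide_int_storage.
struct small_storage {
    long val[4];
    unsigned len;

    small_storage() : val{}, len(0) {}
    small_storage(const small_storage &x);
};

// Defined out of class and marked inline, like the declaration that the
// -Winline warning points at. Only the first `len` words are copied.
inline small_storage::small_storage(const small_storage &x)
    : val{}, len(x.len)
{
    assert(x.len <= 4);
    std::memcpy(val, x.val, len * sizeof(val[0]));
}

// Hypothetical holder with a by-value accessor. Every call to lower_bound()
// needs a copy; whether that copy is a call instruction or inlined code is
// up to the inliner.
struct bound_holder {
    small_storage lo;
    small_storage lower_bound() const { return lo; }
};

// The function whose assembly one would inspect for copy constructor calls.
long first_word(const bound_holder &r)
{
    return r.lower_bound().val[0];
}
```

Compiling this with -O2 -S -Winline and grepping the body of first_word for calls follows the same procedure as above. In a translation unit this small I would expect the constructor to be inlined; the interesting cases only show up once the unit is large enough to run into the growth limits.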
Here are the relevant bits in the difference from -Winline with and without my patch:

> inlined from ‘virtual bool irange::singleton_p(tree_node**) const’ at /home/aldyh/src/gcc/gcc/value-range.cc:1254:40:
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>       |        ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>       |       ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>       |        ^~~~~~~~~~~~~~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>       |       ^~~~~~~~~~~~~~~~
> In copy constructor ‘generic_wide_int::generic_wide_int(const generic_wide_int&)’,
> inlined from ‘wide_int irange::lower_bound(unsigned int) const’ at /home/aldyh/src/gcc/gcc/value-range.h:1122:25,

Note that this is just one example. There are also inlining differences in irange::get_bitmask(), irange::union_bitmask(), and irange::operator=, among others. Most of the inlining failures seem to be related to wide_int_storage. I am attaching the difference in -Winline for the curious.

Tracking this down is tricky because the slightest change in the patch causes different inlining in irange. Even using a slightly different stage1 compiler produces different changes. For example, using GCC 13 as the stage1 compiler, VRP exhibits a slowdown of 2% with the full prange class. Although this is virtually identical to the slowdown when using trunk as the stage1 compiler, the inlining failures are a tad different.
I am tempted to commit the attached to mainline, which slows down VRP by 0.3% (small, but measurable enough to analyze), just so we have a base commit-point from where to do the analysis. My wife is about to give birth any day now, so I'm afraid that if I drop off for a few months, we'll lose the analysis and the point in time from where to do it.

One final thing. The full prange class patch, even when disabled, slows VRP by 2%. I tried to implement the class in small increments, and every small change caused a further slowdown. I don't know if this 2% is final, or if further tweaks in this space will slow us down more.

On a positive note, with the entirety of prange implemented (not just the base class, but range-ops implemented and prange enabled), there is no overall change to VRP, and IPA-cp speeds up by 7%. This is because holding pointers in prange is a net win that overcomes the 2% handicap the inliner is hitting us with.

I would love to hear thoughts, and whether y'all agree that committing a small skeleton now can help us track this down in the future.

Aldy

On Tue, Apr 30, 2024 at 11:37 PM Jason Merrill wrote:
>
> On 4/30/24 12:22, Jakub Jelinek wrote:
> > On Tue, Apr 30, 2024 at 03:09:51PM -0400, Jason Merrill via Gcc wrote:
> >> On Fri, Apr 26, 2024 at 5:44 AM Aldy Hernandez via Gcc wrote:
> >>>
> >>> In implementing prange (pointer ranges), I have found a 1.74% slowdown
> >>> in VRP, even without any code path actually using the code. I have
> >>> tracked this down to irange::get_bitmask() being compiled differently
> >>> with and without the bare bones patch. With the patch,
> >>> irange::get_bitmask() has a lot of code inlined into it, particularly
> >>> get_bitmask_from_range() and consequently the wide_int_storage code.
> >> ...
> >>> +static irange_bitmask
> >>> +get_bitmask_from_range (tree type,
> >>> +                        const wide_int &min, const wide_int &max)
> >> ...
> >>> -irange_bitmask
> >>> -irange::get_bitmask_from_range () const
> >>
> >> My guess is that this is the relevant change: the old function has
> >> external linkage, and is therefore interposable, which inhibits
> >> inlining. The new function has internal linkage, which allows
> >> inlining.
> >
> > Even when a function is exported, when not compiled with -fpic/-fPIC
> > if we know the function is defined in current TU, it can't be interposed,
>
> Ah, I was misremembering the effect of the change. Rather, it's that if
> we see that a function with internal linkage has only a single caller,
> we try harder to inline it.
>
> Jason
>
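To make the linkage point in the quoted exchange concrete, here is a minimal sketch of the two shapes being compared. All names are hypothetical and the body is only loosely modelled on deriving a mask from two bounds; it is not the code from value-range.cc. The assumption being illustrated is the one Jason describes: a function with internal linkage and a single caller is a stronger inlining candidate than an otherwise identical function with external linkage (GCC's -finline-functions-called-once option covers the static, called-once case).

```cpp
#include <cassert>
#include <cstdint>

// External linkage: other translation units may call this, so an
// out-of-line copy has to be kept even if the body is also inlined here.
std::uint64_t mask_from_range_extern(std::uint64_t min, std::uint64_t max)
{
    std::uint64_t unknown = min ^ max;   // bits that differ between the bounds
    while (unknown & (unknown + 1))      // smear the top differing bit downwards
        unknown |= unknown >> 1;
    return unknown;
}

// Internal linkage: the compiler can see every caller. With exactly one
// caller, inlining it lets the out-of-line copy be dropped entirely.
static std::uint64_t mask_from_range_static(std::uint64_t min, std::uint64_t max)
{
    std::uint64_t unknown = min ^ max;
    while (unknown & (unknown + 1))
        unknown |= unknown >> 1;
    return unknown;
}

// The single caller of the static helper (hypothetical counterpart of a
// member function such as get_bitmask()).
std::uint64_t get_mask(std::uint64_t min, std::uint64_t max)
{
    assert(min <= max);
    return mask_from_range_static(min, max);
}
```

Both helpers compute the same value, so any difference in the generated code for get_mask versus a caller of mask_from_range_extern comes from the inliner's heuristics rather than from the source. Comparing the -O2 -S output for the two is a cheap way to see whether the static helper was folded into its caller.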