From: Jan Hubicka
To: Richard Biener
Cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Date: Mon, 10 Jul 2023 10:33:37 +0200
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

Hi,
over the weekend I found that the vectorizer is missing scale_loop_profile
for epilogues.  It already adjusts loop_info to set max iterations, so
adding it was easy.
However, it now predicts the first loop to iterate at most once (which is
too much; I suppose it forgets to divide by the epilogue unrolling factor)
and the second loop to never iterate.

> The -O2 cost model doesn't want to do epilogues:
>
>   /* If using the "very cheap" model, reject cases in which we'd keep
>      a copy of the scalar code (even if we might be able to vectorize
>      it).  */
>   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "some scalar iterations would need to be peeled\n");
>       return 0;
>     }
>
> it's because of the code size increase.

I know, but -O2 is not -Os, and here the performance/code-size tradeoff
seems a lot better than for other code-expanding things we do at -O2
(such as unrolling 3 times).  I think we set the very cheap cost model
very conservatively in order to get -ftree-vectorize enabled with -O2,
and there is some room for finding the right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void test()
{
        for (int i = 0; i < 99; i++)
                a[i]++;
}
void main()
{
        for (int j = 0; j < 10000000; j++)
                test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real    0m0.529s
user    0m0.528s
sys     0m0.000s
jan@localhost:~> gcc -O2 t.c ; time ./a.out

real    0m0.427s
user    0m0.426s
sys     0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real    0m0.136s
user    0m0.135s
sys     0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out

real    0m0.116s
user    0m0.116s
sys     0m0.000s

Code size (of function test):

  gcc -O2 -fno-unroll-loops   17 bytes
  gcc -O2                     29 bytes
  gcc -O3                     50 bytes
  clang -O2                  510 bytes

So unrolling is a 71% code size growth for a 24% speedup, vectorizing
(-O3) is a 194% code size growth for a 289% speedup, and clang does a
2900% code size growth for a 356% speedup.

> That's clearly much larger code.
> On x86 we're also fighting with large instruction encodings here, in
> particular EVEX for AVX512 is "bad" here.  We hardly get more than two
> instructions decoded per cycle due to their size.

Agreed; I found it surprising that clang does that much complete
unrolling at -O2.  However, vectorizing and not unrolling here seems
like it may be a better default for -O2 than what we do currently...

Honza

> Richard.
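As a sanity check on the figures above, the growth/speedup percentages can
be recomputed from the quoted sizes and timings, taking the -fno-unroll-loops
build as baseline, with growth = (new - old)/old and speedup =
(t_old - t_new)/t_new (a quick sketch; the byte counts and wall-clock times
are the ones measured in this mail, the helper names are arbitrary):

```python
# Sizes of test() in bytes and wall-clock times in seconds, as quoted above.
sizes = {"gcc -O2 -fno-unroll-loops": 17, "gcc -O2": 29,
         "gcc -O3": 50, "clang -O2": 510}
times = {"gcc -O2 -fno-unroll-loops": 0.529, "gcc -O2": 0.427,
         "gcc -O3": 0.136, "clang -O2": 0.116}

base = "gcc -O2 -fno-unroll-loops"
for cfg in ("gcc -O2", "gcc -O3", "clang -O2"):
    # Relative code-size growth and runtime speedup versus the baseline.
    growth = (sizes[cfg] - sizes[base]) / sizes[base] * 100
    speedup = (times[base] - times[cfg]) / times[cfg] * 100
    print(f"{cfg}: {growth:.0f}% size growth, {speedup:.0f}% speedup")
```

which gives roughly 71%/24% for plain unrolling, 194%/289% for -O3
vectorization, and 2900%/356% for clang's aggressive unrolling.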