Date: Mon, 10 Jul 2023 09:24:23 +0000 (UTC)
From: Richard Biener
To: Jan Hubicka
cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

On Mon, 10 Jul 2023, Jan Hubicka wrote:

> Hi,
> over the weekend I found that the vectorizer is missing scale_loop_profile
> for epilogues.  It already adjusts loop_info to set max iterations, so
> adding it was easy.  However it now predicts the first loop to iterate at
> most once (which is too much, I suppose it forgets to divide by the
> epilogue unrolling factor) and the second never.
> >
> > The -O2 cost model doesn't want to do epilogues:
> >
> >   /* If using the "very cheap" model. reject cases in which we'd keep
> >      a copy of the scalar code (even if we might be able to vectorize
> >      it).  */
> >   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "some scalar iterations would need to be "
> >                          "peeled\n");
> >       return 0;
> >     }
> >
> > it's because of the code size increase.
>
> I know, however -O2 is not -Os and here the tradeoffs of
> performance/code size seem a lot better than for other code-expanding
> things we do at -O2 (such as unrolling 3 times).
> I think we set the very cheap cost model very conservatively in order to
> get -ftree-vectorize enabled with -O2 and there is some room for finding
> the right balance.
>
> I get:
>
> jan@localhost:~> cat t.c
> int a[99];
> __attribute((noipa, weak))
> void
> test()
> {
>   for (int i = 0; i < 99; i++)
>     a[i]++;
> }
> void
> main()
> {
>   for (int j = 0; j < 10000000; j++)
>     test();
> }
> jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out
>
> real    0m0.529s
> user    0m0.528s
> sys     0m0.000s
>
> jan@localhost:~> gcc -O2 t.c ; time ./a.out
>
> real    0m0.427s
> user    0m0.426s
> sys     0m0.000s
>
> jan@localhost:~> gcc -O3 t.c ; time ./a.out
>
> real    0m0.136s
> user    0m0.135s
> sys     0m0.000s
>
> jan@localhost:~> clang -O2 t.c ; time ./a.out
>
> real    0m0.116s
> user    0m0.116s
> sys     0m0.000s
>
> Code size (of function test):
>   gcc -O2 -fno-unroll-loops    17 bytes
>   gcc -O2                      29 bytes
>   gcc -O3                      50 bytes
>   clang -O2                   510 bytes
>
> So unrolling is 70% code size growth for a 23% speedup.
> Vectorizing is 294% code size growth for a 388% speedup.
> Clang does 3000% code size growth for a 456% speedup.
>
> > That's clearly much larger code.  On x86 we're also fighting with
> > large instruction encodings here, in particular EVEX for AVX512 is
> > "bad" here.  We hardly get more than two instructions decoded per
> > cycle due to their size.
>
> Agreed, I found it surprising that clang does that much complete unrolling
> at -O2.  However vectorizing and not unrolling here seems like it may be
> a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here, which avoid
the epilogue, but due to the instruction encoding size that doesn't
usually win.

I agree that size isn't everything, at least for -O2.

Richard.
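
A minimal plain-C sketch of the loop shape the peeling discussion above is
about, assuming a vector width of four ints; `test_shape' is a made-up name
and this is not actual GCC output, just an illustration of why a scalar copy
of the loop survives when the trip count (99) is not a multiple of the
vector length:

  /* Vector main body plus scalar epilogue -- the "kept copy of the scalar
     code" that the very-cheap cost model check rejects.  */
  void
  test_shape (int *a, int n)
  {
    int i = 0;
    /* Stand-in for the SIMD main loop, four ints per iteration.  */
    for (; i + 4 <= n; i += 4)
      {
        a[i] += 1;
        a[i + 1] += 1;
        a[i + 2] += 1;
        a[i + 3] += 1;
      }
    /* Scalar epilogue: handles the n % 4 leftover iterations.  */
    for (; i < n; i++)
      a[i] += 1;
  }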
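
A minimal sketch of the fully masked AVX512 alternative mentioned above,
written with AVX-512F intrinsics (compile with -mavx512f); `test_masked'
is a made-up name and this is only one way to write such a loop by hand,
not what GCC emits:

  #include <immintrin.h>

  /* Fully predicated loop for a[i]++: the tail is handled with a lane
     mask instead of a separate scalar epilogue.  */
  void
  test_masked (int *a, int n)
  {
    for (int i = 0; i < n; i += 16)
      {
        int rem = n - i;
        /* All 16 lanes active, or only the remaining lanes on the tail.  */
        __mmask16 m = rem >= 16 ? (__mmask16) 0xffff
                                : (__mmask16) ((1u << rem) - 1);
        __m512i v = _mm512_maskz_loadu_epi32 (m, a + i);
        v = _mm512_add_epi32 (v, _mm512_set1_epi32 (1));
        _mm512_mask_storeu_epi32 (a + i, m, v);
      }
  }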