From: Jan Hubicka
To: Richard Biener
Cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Date: Mon, 10 Jul 2023 10:33:37 +0200
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

Hi,
over the weekend I found that the vectorizer is missing scale_loop_profile
for epilogues.  It already adjusts loop_info to set max iterations, so
adding it was easy.
However, it now predicts the first loop to iterate at most once (which is
too much; I suppose it forgets to divide by the epilogue unrolling factor)
and the second loop to never iterate.

> The -O2 cost model doesn't want to do epilogues:
>
>   /* If using the "very cheap" model, reject cases in which we'd keep
>      a copy of the scalar code (even if we might be able to vectorize
>      it).  */
>   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "some scalar iterations would need to be peeled\n");
>       return 0;
>     }
>
> it's because of the code size increase.

I know, but -O2 is not -Os, and here the performance/code-size tradeoff
seems a lot better than for other code-expanding things we do at -O2
(such as unrolling 3 times).  I think we set the very cheap cost model
very conservatively in order to get -ftree-vectorize enabled with -O2,
and there is some room for finding the right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void test()
{
        for (int i = 0; i < 99; i++)
                a[i]++;
}
void main()
{
        for (int j = 0; j < 10000000; j++)
                test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real    0m0.529s
user    0m0.528s
sys     0m0.000s
jan@localhost:~> gcc -O2 t.c ; time ./a.out

real    0m0.427s
user    0m0.426s
sys     0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real    0m0.136s
user    0m0.135s
sys     0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out

real    0m0.116s
user    0m0.116s
sys     0m0.000s

Code size (of function test):

  gcc -O2 -fno-unroll-loops   17 bytes
  gcc -O2                     29 bytes
  gcc -O3                     50 bytes
  clang -O2                  510 bytes

So unrolling is a 71% code size growth for a 24% speedup, vectorizing
(-O3) is a 194% code size growth for a 289% speedup, and clang does a
2900% code size growth for a 356% speedup.

> That's clearly much larger code.
> On x86 we're also fighting with large instruction encodings here, in
> particular EVEX for AVX512 is "bad" here.  We hardly get more than two
> instructions decoded per cycle due to their size.

Agreed; I found it surprising that clang does that much complete
unrolling at -O2.  However, vectorizing and not unrolling here seems
like it may be a better default for -O2 than what we do currently...

Honza

> Richard.
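As a sanity check on the figures above, the growth/speedup percentages can
be recomputed from the quoted sizes and timings, taking the -fno-unroll-loops
build as baseline, with growth = (new - old)/old and speedup =
(t_old - t_new)/t_new (a quick sketch; the byte counts and wall-clock times
are the ones measured in this mail, the helper names are arbitrary):

```python
# Sizes of test() in bytes and wall-clock times in seconds, as quoted above.
sizes = {"gcc -O2 -fno-unroll-loops": 17, "gcc -O2": 29,
         "gcc -O3": 50, "clang -O2": 510}
times = {"gcc -O2 -fno-unroll-loops": 0.529, "gcc -O2": 0.427,
         "gcc -O3": 0.136, "clang -O2": 0.116}

base = "gcc -O2 -fno-unroll-loops"
for cfg in ("gcc -O2", "gcc -O3", "clang -O2"):
    # Relative code-size growth and runtime speedup versus the baseline.
    growth = (sizes[cfg] - sizes[base]) / sizes[base] * 100
    speedup = (times[base] - times[cfg]) / times[cfg] * 100
    print(f"{cfg}: {growth:.0f}% size growth, {speedup:.0f}% speedup")
```

which gives roughly 71%/24% for plain unrolling, 194%/289% for -O3
vectorization, and 2900%/356% for clang's aggressive unrolling.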