Date: Mon, 10 Jul 2023 09:24:23 +0000 (UTC)
From: Richard Biener
To: Jan Hubicka
cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

On Mon, 10 Jul 2023, Jan Hubicka wrote:

> Hi,
> over the weekend I found that the vectorizer is missing scale_loop_profile
> for epilogues.  It already adjusts loop_info to set max iterations, so
> adding it was easy.  However it now predicts the first loop to iterate at
> most once (which is too much, I suppose it forgets to divide by the
> epilogue unrolling factor) and the second never.
> >
> > The -O2 cost model doesn't want to do epilogues:
> >
> >   /* If using the "very cheap" model. reject cases in which we'd keep
> >      a copy of the scalar code (even if we might be able to vectorize
> >      it).  */
> >   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "some scalar iterations would need to be "
> >                          "peeled\n");
> >       return 0;
> >     }
> >
> > it's because of the code size increase.
>
> I know, however -O2 is not -Os and here the tradeoffs of
> performance/code size seem a lot better than for other code-expanding
> things we do at -O2 (such as unrolling 3 times).
> I think we set the very cheap cost model very conservatively in order to
> get -ftree-vectorize enabled with -O2 and there is some room for finding
> the right balance.
>
> I get:
>
> jan@localhost:~> cat t.c
> int a[99];
> __attribute((noipa, weak))
> void
> test()
> {
>   for (int i = 0; i < 99; i++)
>     a[i]++;
> }
> void
> main()
> {
>   for (int j = 0; j < 10000000; j++)
>     test();
> }
> jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out
>
> real    0m0.529s
> user    0m0.528s
> sys     0m0.000s
>
> jan@localhost:~> gcc -O2 t.c ; time ./a.out
>
> real    0m0.427s
> user    0m0.426s
> sys     0m0.000s
>
> jan@localhost:~> gcc -O3 t.c ; time ./a.out
>
> real    0m0.136s
> user    0m0.135s
> sys     0m0.000s
>
> jan@localhost:~> clang -O2 t.c ; time ./a.out
>
> real    0m0.116s
> user    0m0.116s
> sys     0m0.000s
>
> Code size (of function test):
>   gcc -O2 -fno-unroll-loops    17 bytes
>   gcc -O2                      29 bytes
>   gcc -O3                      50 bytes
>   clang -O2                   510 bytes
>
> So unrolling is 70% code size growth for a 23% speedup.
> Vectorizing is 294% code size growth for a 388% speedup.
> Clang does 3000% code size growth for a 456% speedup.
>
> > That's clearly much larger code.  On x86 we're also fighting with
> > large instruction encodings here, in particular EVEX for AVX512 is
> > "bad" here.  We hardly get more than two instructions decoded per
> > cycle due to their size.
>
> Agreed, I found it surprising that clang does that much complete unrolling
> at -O2.  However vectorizing and not unrolling here seems like it may be
> a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here, which avoid
the epilogue, but due to the instruction encoding size that doesn't
usually win.

I agree that size isn't everything, at least for -O2.

Richard.
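
A minimal plain-C sketch of the loop shape the peeling discussion above is
about, assuming a vector width of four ints; `test_shape' is a made-up name
and this is not actual GCC output, just an illustration of why a scalar copy
of the loop survives when the trip count (99) is not a multiple of the
vector length:

  /* Vector main body plus scalar epilogue -- the "kept copy of the scalar
     code" that the very-cheap cost model check rejects.  */
  void
  test_shape (int *a, int n)
  {
    int i = 0;
    /* Stand-in for the SIMD main loop, four ints per iteration.  */
    for (; i + 4 <= n; i += 4)
      {
        a[i] += 1;
        a[i + 1] += 1;
        a[i + 2] += 1;
        a[i + 3] += 1;
      }
    /* Scalar epilogue: handles the n % 4 leftover iterations.  */
    for (; i < n; i++)
      a[i] += 1;
  }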
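
A minimal sketch of the fully masked AVX512 alternative mentioned above,
written with AVX-512F intrinsics (compile with -mavx512f); `test_masked'
is a made-up name and this is only one way to write such a loop by hand,
not what GCC emits:

  #include <immintrin.h>

  /* Fully predicated loop for a[i]++: the tail is handled with a lane
     mask instead of a separate scalar epilogue.  */
  void
  test_masked (int *a, int n)
  {
    for (int i = 0; i < n; i += 16)
      {
        int rem = n - i;
        /* All 16 lanes active, or only the remaining lanes on the tail.  */
        __mmask16 m = rem >= 16 ? (__mmask16) 0xffff
                                : (__mmask16) ((1u << rem) - 1);
        __m512i v = _mm512_maskz_loadu_epi32 (m, a + i);
        v = _mm512_add_epi32 (v, _mm512_set1_epi32 (1));
        _mm512_mask_storeu_epi32 (a + i, m, v);
      }
  }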