Date: Fri, 7 Jul 2023 14:20:15 +0200
From: Jan Hubicka
To: Richard Biener
Cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

> Looks good, but I wonder what we can do to at least make the
> multiple exit case behave reasonably?  The vectorizer keeps track
> of a "canonical" exit, would it be possible to pass in the main
> exit edge and use that instead of single_exit (), would other
> exits then behave somewhat reasonable or would we totally screw
> things up here?  That is, the "canonical" exit would be the
> counting exit while the other exits are on data driven conditions
> and thus wouldn't change probability when we reduce the number
> of iterations(?)

I can add a canonical_exit parameter and make the function direct the
flow to it if possible.  However, overall I think the fixup depends on
what transformation led to the change.

Assuming that the vectorizer produced no prologues and epilogues and we
vectorized with factor N, then I think the update could be done more
precisely as follows.  We know that the header block count dropped by N.
So we can start from that, and each time we reach a basic block with an
exit edge, we know the original count of that edge.  This count is
unchanged, so one can rescale the probabilities out of that BB
accordingly.  If the loop has no inner loops, we can just walk the body
in RPO, propagate the scales downwards, and we should arrive at the
right result.

I originally added the bound parameter to handle prologues/epilogues,
which get a new artificial bound.  In the prologue I think you are right
that the flow will probably be directed to the condition counting the
iterations.  In the epilogue we add no artificial iteration cap, so
maybe it is more realistic to simply scale up the probability of all
exits?
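To make the rescaling step concrete, here is a small standalone sketch
in plain C (it does not use GCC's internal profile types; the struct and
helper name are made up purely for illustration):

/* Hypothetical, simplified model: each block has a count and at most
   one exit edge whose absolute count must stay unchanged.  */
struct block
{
  double count;       /* execution count of the block */
  double exit_count;  /* original (preserved) count of its exit edge */
  double exit_prob;   /* probability of taking the exit edge */
};

/* After the block's incoming count changed (e.g. the header was scaled
   down by the vectorization factor), keep the exit edge's absolute
   count and recompute its probability; the returned value is the
   fall-through count to propagate to the next block in the RPO walk.  */
double
rescale_block (struct block *bb, double new_count)
{
  bb->count = new_count;
  if (new_count > 0)
    bb->exit_prob = bb->exit_count / new_count;
  return new_count - bb->exit_count;
}

For example, a header that originally ran 100 times with an early exit
taken once gets count 25 after vectorizing by 4; the exit probability
becomes 1/25 = 4% and a fall-through count of 24 is propagated onwards.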
To see what is going on I tried the following testcase:

int a[99];
test ()
{
  for (int i = 0; i < 99; i++)
    a[i]++;
}

What surprises me is that the vectorizer at -O2 does nothing and we end
up unrolling the loop:

.L2:
	addl	$1, (%rax)
	addl	$1, 4(%rax)
	addl	$1, 8(%rax)
	addq	$12, %rax
	cmpq	$a+396, %rax

which seems a silly thing to do.  A vectorized loop with an epilogue
doing 2 and 1 additions would be better.  With -O3 we vectorize it:

.L2:
	movdqa	(%rax), %xmm0
	addq	$16, %rax
	paddd	%xmm1, %xmm0
	movaps	%xmm0, -16(%rax)
	cmpq	%rax, %rdx
	jne	.L2
	movq	a+384(%rip), %xmm0
	addl	$1, a+392(%rip)
	movq	.LC1(%rip), %xmm1
	paddd	%xmm1, %xmm0
	movq	%xmm0, a+384(%rip)

and correctly drop the vectorized loop body to 24 iterations.  However,
the epilogue has a loop for vector size 2 predicted to iterate once (it
won't):

;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot
;;    prev block 5, next block 8, flags: (NEW, VISITED)
;;    pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       8 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot
;;    prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
;;    pred:       9 [always]  count:10737417 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                7 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
  # i_9 = PHI
  # ivtmp_13 = PHI
  # vectp_a.14_40 = PHI [(void *)&a + 384B](7)>
  # vectp_a.18_46 = PHI [(void *)&a + 384B](7)>
  # ivtmp_49 = PHI
  vect__14.16_42 = MEM [(int *)vectp_a.14_40];
  _14 = a[i_9];
  vect__15.17_44 = vect__14.16_42 + { 1, 1 };
  _15 = _14 + 1;
  MEM [(int *)vectp_a.18_46] = vect__15.17_44;
  i_17 = i_9 + 1;
  ivtmp_18 = ivtmp_13 - 1;
  vectp_a.14_41 = vectp_a.14_40 + 8;
  vectp_a.18_47 = vectp_a.18_46 + 8;
  ivtmp_50 = ivtmp_49 + 1;
  if (ivtmp_50 < 1)
    goto ; [50.00%]
  else
    goto ; [50.00%]

and finally the scalar copy

;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
;;    prev block 9, next block 13, flags: (NEW, VISITED)
;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)

;;   basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
;;    pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)
  # i_30 = PHI
  # ivtmp_32 = PHI
  _33 = a[i_30];
  _34 = _33 + 1;
  a[i_30] = _34;
  i_36 = i_30 + 1;
  ivtmp_37 = ivtmp_32 - 1;
  if (ivtmp_37 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

also with a small but non-zero iteration probability.  This is papered
over by my yesterday's patch.  But it seems to me that it would be a lot
better if the vectorizer understood that the epilogue will be loopless
and accounted for that in the cost model; that would probably make it
easy to enable it with cheap costs too.
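To illustrate the shape I mean by a loopless epilogue, here is the
99-iteration testcase written by hand in plain C (just an illustration
of the intended control flow, not vectorizer output; the function name
is made up):

void
test_by_hand (int *a)
{
  /* Main body: 24 iterations, each handling 4 elements, so it can be
     vectorized at width 4 with no loop-carried remainder handling.  */
  for (int i = 0; i < 96; i += 4)
    {
      a[i] += 1;
      a[i + 1] += 1;
      a[i + 2] += 1;
      a[i + 3] += 1;
    }
  /* Loopless epilogue: one 2-wide step for a[96] and a[97], then one
     scalar step for a[98]; neither part needs to iterate.  */
  a[96] += 1;
  a[97] += 1;
  a[98] += 1;
}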
Clang 16 at -O2 is much more aggressive, both vectorizing and unrolling:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movdqa	a(%rip), %xmm1
	pcmpeqd	%xmm0, %xmm0
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a(%rip)
	movdqa	a+16(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+16(%rip)
	movdqa	a+32(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+32(%rip)
	movdqa	a+48(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+48(%rip)
	movdqa	a+64(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+64(%rip)
	movdqa	a+80(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+80(%rip)
	movdqa	a+96(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+96(%rip)
	movdqa	a+112(%rip), %xmm1
	psubd	%xmm0, %xmm1
	....
	movdqa	%xmm1, a+240(%rip)
	movdqa	a+256(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+256(%rip)
	movdqa	a+272(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+272(%rip)
	movdqa	a+288(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+288(%rip)
	movdqa	a+304(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+304(%rip)
	movdqa	a+320(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+320(%rip)
	movdqa	a+336(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+336(%rip)
	movdqa	a+352(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+352(%rip)
	movdqa	a+368(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+368(%rip)
	addl	$1, a+384(%rip)
	addl	$1, a+388(%rip)
	addl	$1, a+392(%rip)
	retq

Honza