Date: Mon, 10 Jul 2023 07:07:03 +0000 (UTC)
From: Richard Biener
To: Jan Hubicka
Cc: Tamar Christina, gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

On Fri, 7 Jul 2023, Jan Hubicka wrote:

> > Looks good, but I wonder what we can do to at least make the
> > multiple exit case behave reasonably?  The vectorizer keeps track
> > of a "canonical" exit, would it be possible to pass in the main
> > exit edge and use that instead of single_exit (), would other
> > exits then behave somewhat reasonably or would we totally screw
> > things up here?  That is, the "canonical" exit would be the
> > counting exit while the other exits are on data driven conditions
> > and thus wouldn't change probability when we reduce the number
> > of iterations(?)
>
> I can add a canonical_exit parameter and make the function direct flow
> to it if possible.  However, overall I think the fixup depends on what
> transformation led to the change.

I think the vectorizer knows there's a single counting IV and all
other exits are dependent on the data processed, so the scaling the
vectorizer does just changes the counting IV.  So I think it makes
sense to pass that exit to the function in all cases.
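
To make that concrete, a minimal two-exit example (illustrative only,
not from the patch):

  /* The latch test is the counting, "canonical" exit; vectorizing by
     VF divides its expected trip count.  The early return is a
     data-driven exit whose per-element probability is a property of
     the data and should not be rescaled.  */
  int first_match (int *a, int n)
  {
    for (int i = 0; i < n; i++)   /* counting IV exit  */
      if (a[i] == 42)             /* data-driven exit  */
        return i;
    return -1;
  }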
> Assuming that the vectorizer did no prologues and epilogues and we
> vectorized with factor N, then I think the update could be done more
> specifically as follows.
>
> We know that the header block count dropped by a factor of 4.  So we
> can start from that, and each time we reach a basic block with an exit
> edge, we know the original count of the edge.  This count is unchanged,
> so one can rescale the probabilities out of that BB accordingly.  If
> the loop has no inner loops, we can just walk the body in RPO and
> propagate the scales downwards, and we should arrive at the right
> result.

That should work for alternate exits as well, no?

> I originally added the bound parameter to handle prologues/epilogues,
> which get a new artificial bound.  In the prologue I think you are
> right that the flow will probably be directed to the conditional
> counting iterations.

I suppose we'd need to scale both main and epilogue together since
the epilogue "steals" from the main loop counts.  Likewise if there's
a skip edge around the vector loop.  I think currently we simply set
the edge probability of those skip conds rather than basing this off
the niter values they work on.  Aka

  if (niter < VF) goto epilogue;
  do {} while (niter / VF);
  epilogue:
  do {} while (niter);

There's also the cost model which might require niter > VF to enter
the main loop body.

> In the epilogue we add no artificial iteration cap, so maybe it is
> more realistic to simply scale up the probability of all exits?

Probably.

> To see what is going on I tried the following testcase:
>
> int a[99];
> test()
> {
>   for (int i = 0; i < 99; i++)
>     a[i]++;
> }
>
> What surprises me is that the vectorizer at -O2 does nothing and we
> end up unrolling the loop:
>
> L2:
>         addl    $1, (%rax)
>         addl    $1, 4(%rax)
>         addl    $1, 8(%rax)
>         addq    $12, %rax
>         cmpq    $a+396, %rax
>
> Which seems a silly thing to do.  A vectorized loop with an epilogue
> doing 2 and 1 additions would be better.
>
> With -O3 we vectorize it:
>
> .L2:
>         movdqa  (%rax), %xmm0
>         addq    $16, %rax
>         paddd   %xmm1, %xmm0
>         movaps  %xmm0, -16(%rax)
>         cmpq    %rax, %rdx
>         jne     .L2
>         movq    a+384(%rip), %xmm0
>         addl    $1, a+392(%rip)
>         movq    .LC1(%rip), %xmm1
>         paddd   %xmm1, %xmm0
>         movq    %xmm0, a+384(%rip)

The -O2 cost model doesn't want to do epilogues:

  /* If using the "very cheap" model. reject cases in which we'd keep
     a copy of the scalar code (even if we might be able to vectorize
     it).  */
  if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "some scalar iterations would need to be "
                         "peeled\n");
      return 0;
    }

it's because of the code size increase.
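
The shape being rejected there, a scalar epilogue copy next to the
vector body behind the skip edge mentioned above, written out as
plain C (a sketch with VF = 4 and invented names, not the
vectorizer's literal output):

  /* The n >= 4 guard is the "skip edge": when it is not taken, all of
     the count flows to the epilogue, which is why main-loop and
     epilogue profiles have to be scaled together.  */
  void test_shape (int *a, int n)
  {
    int i = 0;
    if (n >= 4)                      /* niter / cost-model guard */
      for (; i + 4 <= n; i += 4)     /* vector body, niter / VF  */
        {
          a[i] += 1;
          a[i + 1] += 1;
          a[i + 2] += 1;
          a[i + 3] += 1;
        }
    for (; i < n; i++)               /* epilogue "steals" n % 4  */
      a[i] += 1;
  }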
> and correctly drop the vectorized loop body to 24 iterations.
> However, the epilogue has a loop for vector size 2 predicted to
> iterate once (it won't):
>
> ;; basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot
> ;;  prev block 5, next block 8, flags: (NEW, VISITED)
> ;;  pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> ;;  succ:       8 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
>
> ;; basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot
> ;;  prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
> ;;  pred:       9 [always]  count:10737417 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;              7 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
>   # i_9 = PHI
>   # ivtmp_13 = PHI
>   # vectp_a.14_40 = PHI [(void *)&a + 384B](7)>
>   # vectp_a.18_46 = PHI [(void *)&a + 384B](7)>
>   # ivtmp_49 = PHI
>   vect__14.16_42 = MEM [(int *)vectp_a.14_40];
>   _14 = a[i_9];
>   vect__15.17_44 = vect__14.16_42 + { 1, 1 };
>   _15 = _14 + 1;
>   MEM [(int *)vectp_a.18_46] = vect__15.17_44;
>   i_17 = i_9 + 1;
>   ivtmp_18 = ivtmp_13 - 1;
>   vectp_a.14_41 = vectp_a.14_40 + 8;
>   vectp_a.18_47 = vectp_a.18_46 + 8;
>   ivtmp_50 = ivtmp_49 + 1;
>   if (ivtmp_50 < 1)
>     goto ; [50.00%]
>   else
>     goto ; [50.00%]
>
> and finally the scalar copy:
>
> ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
> ;;  prev block 9, next block 13, flags: (NEW, VISITED)
> ;;  pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> ;;  succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
>
> ;; basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
> ;;  prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> ;;  pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;              12 [always]  count:10737416 (estimated locally) (FALLTHRU)
>   # i_30 = PHI
>   # ivtmp_32 = PHI
>   _33 = a[i_30];
>   _34 = _33 + 1;
>   a[i_30] = _34;
>   i_36 = i_30 + 1;
>   ivtmp_37 = ivtmp_32 - 1;
>   if (ivtmp_37 != 0)
>     goto ; [98.99%]
>   else
>     goto ; [1.01%]
>
> also with a small but non-zero iteration probability.  This is
> papered over by my patch from yesterday.  But it seems to me that it
> would be a lot better if the vectorizer understood that the epilogue
> will be loopless and accounted for that in the cost model; that would
> probably make it easy to enable it at cheap cost too.

The epilogue will be "unrolled" later I think because we can
correctly compute it won't iterate.
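
(For the 99-element testcase the arithmetic works out as: 24
iterations at VF = 4 cover a[0..95], the VF = 2 epilogue executes
exactly once for a[96..97], and a[98] is left for the scalar tail.
So once the constant bound is known, the epilogue should collapse to
straight-line code roughly like the following sketch, my reading
rather than actual compiler output:)

  int a[99];
  void peeled_tail (void)
  {
    /* former VF = 2 vector epilogue; trip count is provably 1 */
    a[96] += 1;
    a[97] += 1;
    /* scalar tail */
    a[98] += 1;
  }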
> Clang 16 at -O2 is much more aggressive, both vectorizing and
> unrolling:
>
> test:                                   # @test
>         .cfi_startproc
> # %bb.0:
>         movdqa  a(%rip), %xmm1
>         pcmpeqd %xmm0, %xmm0
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a(%rip)
>         movdqa  a+16(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+16(%rip)
>         movdqa  a+32(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+32(%rip)
>         movdqa  a+48(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+48(%rip)
>         movdqa  a+64(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+64(%rip)
>         movdqa  a+80(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+80(%rip)
>         movdqa  a+96(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+96(%rip)
>         movdqa  a+112(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         ....
>         movdqa  %xmm1, a+240(%rip)
>         movdqa  a+256(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+256(%rip)
>         movdqa  a+272(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+272(%rip)
>         movdqa  a+288(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+288(%rip)
>         movdqa  a+304(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+304(%rip)
>         movdqa  a+320(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+320(%rip)
>         movdqa  a+336(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+336(%rip)
>         movdqa  a+352(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+352(%rip)
>         movdqa  a+368(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+368(%rip)
>         addl    $1, a+384(%rip)
>         addl    $1, a+388(%rip)
>         addl    $1, a+392(%rip)
>         retq

That's clearly much larger code.  On x86 we're also fighting with
large instruction encodings; in particular, EVEX for AVX512 is "bad"
here.  We hardly get more than two instructions decoded per cycle
due to their size.

Richard.