I suggest specifying -fno-schedule-insns to force the tests' assembler
output to stay the same for any scheduling model.

juzhe.zhong@rivai.ai

From: Palmer Dabbelt
Date: 2024-02-28 08:55
To: jeffreyalaw
CC: juzhe.zhong; Robin Dapp; ewlu; gcc-patches; gnu-toolchain; pan2.li
Subject: Re: [PATCH] RISC-V: Update test expectancies with recent scheduler change

On Tue, 27 Feb 2024 15:53:19 PST (-0800), jeffreyalaw@gmail.com wrote:
>
>
> On 2/27/24 15:56, 钟居哲 wrote:
>>
>> I don't think it's that simple.  On some uarchs vsetvls are nearly free
>>>> while on others they can be fairly expensive.  It's not clear (to me)
>>>> yet if one approach or the other is going to be the more common.
>>
>> That's uarch dependent, which is not the stuff I am talking about.
>> What I want to say is that this patch breaks those testcases I added
>> for VSETVL PASS testing.
>> And those testcases are uarch independent.
> No, uarch impacts things like latency, which in turn impacts scheduling,
> which in turn impacts vsetvl generation/optimization.

Ya, and I think that's just what's expected for this sort of approach.
Edwin and I were working through that possibility in the office earlier,
but we didn't have the code up.  So I figured I'd just go through one in
more detail to see if what we were talking about was sane.

Grabbing some arbitrary function in the changed set:

    void test_vbool1_then_vbool64(int8_t * restrict in, int8_t * restrict out) {
        vbool1_t v1 = *(vbool1_t*)in;
        vbool64_t v2 = *(vbool64_t*)in;
        *(vbool1_t*)(out + 100) = v1;
        *(vbool64_t*)(out + 200) = v2;
    }

we currently get (from generic-ooo)

    test_vbool1_then_vbool64:
        vsetvli  a4,zero,e8,m8,ta,ma
        vlm.v    v2,0(a0)
        vsetvli  a5,zero,e8,mf8,ta,ma
        vlm.v    v1,0(a0)
        addi     a3,a1,100
        vsetvli  a4,zero,e8,m8,ta,ma
        addi     a1,a1,200
        vsm.v    v2,0(a3)
        vsetvli  a5,zero,e8,mf8,ta,ma
        vsm.v    v1,0(a1)
        ret

but we could generate correct code with 2, 3, or 4 vsetvli instructions
depending on how things are scheduled.  For example, with
-fno-schedule-insns I happen to get 3

    test_vbool1_then_vbool64:
        vsetvli  a5,zero,e8,mf8,ta,ma
        vlm.v    v1,0(a0)
        vsetvli  a4,zero,e8,m8,ta,ma
        vlm.v    v2,0(a0)
        addi     a3,a1,100
        addi     a1,a1,200
        vsm.v    v2,0(a3)
        vsetvli  a5,zero,e8,mf8,ta,ma
        vsm.v    v1,0(a1)
        ret

because the loads/stores with the same vcfg end up scheduled back-to-back.
I don't see any reason why something along the lines of

    test_vbool1_then_vbool64:
        vsetvli  a4,zero,e8,m8,ta,ma
        vlm.v    v2,0(a0)
        addi     a3,a1,100
        vsm.v    v2,0(a3)
        vsetvli  a5,zero,e8,mf8,ta,ma
        vlm.v    v1,0(a0)
        addi     a1,a1,200
        vsm.v    v1,0(a1)
        ret

wouldn't be correct (though I just reordered the loads/stores and then
removed the redundant vsetvlis, so I might have some address calculation
wrong in there).

The validity of removing a vsetvli depends on how the dependent
instructions get scheduled, which is very much under the control of the
pipeline model -- it's entirely possible the code with more vsetvlis is
faster, if vsetvli is cheap and scheduling ends up hiding latency better.

So IMO it's completely reasonable to have vsetvli count ranges for a test
like this.  I haven't looked at the others in any detail, but I remember
seeing similar things elsewhere last time I was poking around these tests.
We should probably double-check all these and write some comments, just to
make sure we're not missing any bugs, but I'd bet there's a bunch of valid
testsuite changes.

Like we talked about in the call this morning, we should probably make the
tests more precise, but that's a larger effort.
After working through this I'm thinking it's a bit higher priority, though,
as in this case the bounds are so wide we're not even really testing the
pass any more.

>
> jeff
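
As a concrete illustration of the -fno-schedule-insns suggestion at the top
of the thread, a testcase along these lines could pin the schedule so the
vsetvli count no longer drifts with the pipeline model.  This is only a
sketch: the dg-options string (the -march/-mabi pair in particular) and the
expected count are assumptions for illustration, not the directives from the
actual testsuite -- the count of 3 is simply what the -fno-schedule-insns
output quoted above shows.

    /* Sketch only: the option string and the expected count are assumptions
       for illustration, not the directives from the real testcase.  */
    /* { dg-do compile } */
    /* { dg-options "-march=rv64gcv -mabi=lp64d -O2 -fno-schedule-insns" } */

    #include <stdint.h>
    #include "riscv_vector.h"

    void test_vbool1_then_vbool64 (int8_t * restrict in, int8_t * restrict out)
    {
      vbool1_t v1 = *(vbool1_t *)in;
      vbool64_t v2 = *(vbool64_t *)in;
      *(vbool1_t *)(out + 100) = v1;
      *(vbool64_t *)(out + 200) = v2;
    }

    /* 3 is the count seen in the -fno-schedule-insns output quoted above;
       with the first scheduling pass disabled an exact count is meaningful.  */
    /* { dg-final { scan-assembler-times {vsetvli} 3 } } */

The alternative discussed above -- accepting a range of vsetvli counts --
keeps the scheduler in play but tests the VSETVL pass less precisely; either
way, an exact expected count only means something once the schedule is
either pinned or accounted for.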