Hi Hans,

On 10/13/22 13:54, Hans Boehm wrote:
> The generated code here is correct in both cases. In the RISC--V case, 
> I believe it is conservative, at a minimum, in that atomics should not 
> imply IO ordering. We had an earlier discussion, which seemed to have 
> consensus in favor of that opinion. I believe clang does not enforce 
> IO ordering.
>
> You can think of a "sequentially consistent" load roughly as enforcing 
> two properties:
>
> 1) It behaves as an "acquire" load. Later (in program order) memory 
> operations do not advance past it. This is implicit for x86. It 
> requires the trailing fence on RISC-V, which could probably be 
> weakened to r,rw.

Acq implies later things won't leak out, but prior things could still 
leak-in, meaning prior write could happen after load which contradicts 
what user is asking by load(seq_cst) on x86 ?

>
> 2) It ensures that seq_cst operations are fully ordered. This means 
> that, in addition to (1), and the corresponding fence for stores, 
> every seq_cst store must be separated from a seq_cst load by at least 
> a w,r fence, so a seq_cst store followed by a seq_cst load is not 
> reordered.

This makes sense when both store -> load are seq_cst.
But the question is what happens when that store is non atomic. IOW if 
we had a store(relaxed) -> load(seq_cst) would the generated code still 
ensure that load had a full barrier to prevent


> w,r fences are discouraged on RISC-V, and probably no better than 
> rw,rw, so that's how the leading fence got there. (Again the io 
> ordering should disappear. It's the responsibility of IO code to 
> insert that explicitly, rather than paying for it everywhere.)

Thanks for explaining the RV semantics.

>
> x86 does (2) by associating that fence with stores instead of loads, 
> either by using explicit fences after stores, or by turning stores 
> into xchg.

That makes sense as x86 has ld->ld and ld -> st architecturally ordered, 
so any fences ought to be associated with st.

Thx,
-Vineet

> RISC-V could do the same. And I believe that if the current A 
> extension were the final word on the architecture, it should. But that 
> convention is not compatible with the later introduction of an 
> "acquire load", which I think is essential for performance, at least 
> on larger cores. So I think the two fence mapping for loads should be 
> maintained for now, as I suggested in the document I posted to the list.
>
> Hans
>
> On Thu, Oct 13, 2022 at 12:31 PM Vineet Gupta <vineetg@rivosinc.com> 
> wrote:
>
>     Hi,
>
>     I have a testcase (from real workloads) involving C++ atomics and
>     trying
>     to understand the codegen (gcc 12) for RVWMO and x86.
>     It does mix atomics with non-atomics so not obvious what the
>     behavior is
>     intended to be hence some explicit CC of subject matter experts
>     (apologies for that in advance).
>
>     Test has a non-atomic store followed by an atomic_load(SEQ_CST). I
>     assume that unadorned direct access defaults to
>     safest/conservative seq_cst.
>
>         extern int g;
>         std::atomic<int> a;
>
>         int bar_noaccessor(int n, int *n2)
>         {
>              *n2 = g;
>              return n + a;
>         }
>
>         int bar_seqcst(int n, int *n2)
>         {
>              *n2 = g;
>              return n + a.load(std::memory_order_seq_cst);
>         }
>
>     On RV (rvwmo), with current gcc 12 we get 2 full fences around the
>     load
>     as prescribed by Privileged Spec, Chpater A, Table A.6 (Mappings from
>     C/C++ to RISC-V primitives).
>
>         _Z10bar_seqcstiPi:
>         .LFB382:
>              .cfi_startproc
>              lui    a5,%hi(g)
>              lw    a5,%lo(g)(a5)
>              sw    a5,0(a1)
>         *fence    iorw,iorw*
>              lui    a5,%hi(a)
>              lw    a5,%lo(a)(a5)
>         *fence    iorw,iorw*
>              addw    a0,a5,a0
>              ret
>
>
>     OTOH, for x86 (same default toggles) there's no barriers at all.
>
>         _Z10bar_seqcstiPi:
>              endbr64
>              movl    g(%rip), %eax
>              movl    %eax, (%rsi)
>              movl    a(%rip), %eax
>              addl    %edi, %eax
>              ret
>
>
>     My naive intuition was x86 TSO would require a fence before
>     load(seq_cst) for a prior store, even if that store was non
>     atomic, so
>     ensure load didn't bubble up ahead of store.
>
>     Perhaps this begs the general question of intermixing non atomic
>     accesses with atomics and if that is undefined behavior or some
>     such. I
>     skimmed through C++14 specification chapter Atomic Operations library
>     but nothing's jumping out on the topic.
>
>     Or is it much deeper, related to As-if rule or something.
>
>     Thx,
>     -Vineet
>