Failure to optimize?

public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed

* Failure to optimize?
@ 2021-01-12 13:32 ☂Josh Chia (謝任中)
  2021-01-12 13:36 ` Florian Weimer
  0 siblings, 1 reply; 9+ messages in thread
From: ☂Josh Chia (謝任中) @ 2021-01-12 13:32 UTC (permalink / raw)
  To: gcc-help

I have a code snippet that I'm wondering why GCC didn't optimize the way I
think it should:
https://godbolt.org/z/1qKvax

bar2() is a variant of bar1() that has been manually tweaked to avoid
branches. I haven't done any benchmarks but, I would expect the branchless
bar2() to perform better than bar1() but GCC does not automatically
optimize bar1() to be like bar2(); the generated code for bar1() and bar2()
are different and the generated code for bar1() contains a branch.

I'm generally trying to get an idea of how smart GCC optimization is and
how much hand-holding I should provide, so could someone help me understand
why GCC didn't generate the same branchless code for bar1() and bar2()? Or,
perhaps avoiding branches here doesn't actually help performance?

Josh

*SOURCE*
char const* foo();

int cursor = 0;

char const* bar1() {
    char const* result = foo();
    if (result)
        ++cursor;
    return result;
}

char const* bar2() {
    char const* result = foo();
    cursor += !!result;
    return result;
}

*GENERATED CODE*
bar1():
        sub     rsp, 8
        call    foo()
        test    rax, rax
        je      .L1
        add     DWORD PTR cursor[rip], 1
.L1:
        add     rsp, 8
        ret
bar2():
        sub     rsp, 8
        call    foo()
        cmp     rax, 1
        sbb     DWORD PTR cursor[rip], -1
        add     rsp, 8
        ret
cursor:
        .zero   4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 13:32 Failure to optimize? ☂Josh Chia (謝任中)
@ 2021-01-12 13:36 ` Florian Weimer
  2021-01-12 14:05   ` ☂Josh Chia (謝任中)
  2021-01-12 14:20   ` Jonathan Wakely
  0 siblings, 2 replies; 9+ messages in thread
From: Florian Weimer @ 2021-01-12 13:36 UTC (permalink / raw)
  To: ☂Josh Chia (謝任中) via Gcc-help

* ☂Josh Chia (謝任中) via Gcc-help:

> I have a code snippet that I'm wondering why GCC didn't optimize the way I
> think it should:
> https://godbolt.org/z/1qKvax
>
> bar2() is a variant of bar1() that has been manually tweaked to avoid
> branches. I haven't done any benchmarks but, I would expect the branchless
> bar2() to perform better than bar1() but GCC does not automatically
> optimize bar1() to be like bar2(); the generated code for bar1() and bar2()
> are different and the generated code for bar1() contains a branch.

The optimization is probably valid for C99, but not for C11, where the
memory model prevents the compiler from introducing spurious writes:
Another thread may modify the variable concurrently, and if this happens
only if foo returns NULL, the original bar1 function does not contain a
data race, but the branchless version would.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 13:36 ` Florian Weimer
@ 2021-01-12 14:05   ` ☂Josh Chia (謝任中)
  2021-01-12 14:15     ` Alexander Monakov
  2021-01-12 14:20     ` Florian Weimer
  2021-01-12 14:20   ` Jonathan Wakely
  1 sibling, 2 replies; 9+ messages in thread
From: ☂Josh Chia (謝任中) @ 2021-01-12 14:05 UTC (permalink / raw)
  To: Florian Weimer; +Cc: ☂Josh Chia (謝任中) via Gcc-help

I didn't mention it earlier, but I'd like to clarify that this is for "g++
-O3 -std=c++17" on GCC 10.2.

If I use "static thread_local int cursor = 0;" then cursor is thread-local
and such a race should be impossible, but the generated code for bar1()
still has a branch.

The branch persists also if I do this instead for bar1():
char const* bar1() {
    char const* result = foo();
    if (result)
        cursor += 1;
    else
        cursor += 0;
    return result;
}

bar1():
        sub     rsp, 8
        call    foo()
        test    rax, rax
        je      .L1
        add     DWORD PTR fs:cursor@tpoff, 1
.L1:
        add     rsp, 8
        ret

On Tue, Jan 12, 2021 at 9:36 PM Florian Weimer <fweimer@redhat.com> wrote:

> * ☂Josh Chia (謝任中) via Gcc-help:
>
> > I have a code snippet that I'm wondering why GCC didn't optimize the way
> I
> > think it should:
> > https://godbolt.org/z/1qKvax
> >
> > bar2() is a variant of bar1() that has been manually tweaked to avoid
> > branches. I haven't done any benchmarks but, I would expect the
> branchless
> > bar2() to perform better than bar1() but GCC does not automatically
> > optimize bar1() to be like bar2(); the generated code for bar1() and
> bar2()
> > are different and the generated code for bar1() contains a branch.
>
> The optimization is probably valid for C99, but not for C11, where the
> memory model prevents the compiler from introducing spurious writes:
> Another thread may modify the variable concurrently, and if this happens
> only if foo returns NULL, the original bar1 function does not contain a
> data race, but the branchless version would.
>
> Thanks,
> Florian
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael
> O'Neill
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 14:05   ` ☂Josh Chia (謝任中)
@ 2021-01-12 14:15     ` Alexander Monakov
  2021-01-12 14:20     ` Florian Weimer
  1 sibling, 0 replies; 9+ messages in thread
From: Alexander Monakov @ 2021-01-12 14:15 UTC (permalink / raw)
  To: ☂Josh Chia (謝任中)
  Cc: Florian Weimer, ☂Josh Chia (謝任中) via Gcc-help

On Tue, 12 Jan 2021, ☂Josh Chia (謝任中) via Gcc-help wrote:

> I didn't mention it earlier, but I'd like to clarify that this is for "g++
> -O3 -std=c++17" on GCC 10.2.
> 
> If I use "static thread_local int cursor = 0;" then cursor is thread-local
> and such a race should be impossible, but the generated code for bar1()
> still has a branch.

Whether 'cursor' is thread-local or not does not matter, it's possible to take
the address of a thread-local variable and pass it to another thread.

> The branch persists also if I do this instead for bar1():
> char const* bar1() {
>     char const* result = foo();
>     if (result)
>         cursor += 1;
>     else
>         cursor += 0;
>     return result;
> }

This is hard to optimize as 'cursor += 0' is optimized out quite early, and
it's hard to keep track that 'cursor' is unconditionally written in the
abstract machine.

Alexander

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 14:05   ` ☂Josh Chia (謝任中)
  2021-01-12 14:15     ` Alexander Monakov
@ 2021-01-12 14:20     ` Florian Weimer
  1 sibling, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2021-01-12 14:20 UTC (permalink / raw)
  To: ☂Josh Chia (謝任中)
  Cc: ☂Josh Chia (謝任中) via Gcc-help

* ☂Josh Chia (謝任中):

> If I use "static thread_local int cursor = 0;" then cursor is
> thread-local and such a race should be impossible, but the generated
> code for bar1() still has a branch.

The GNU toolchain supports access to thread-local storage from other
threads as an extension (subject to some constraints, e.g., TLS storage
of other threads is gone after fork).  So the optimization still
wouldn't be valid.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 13:36 ` Florian Weimer
  2021-01-12 14:05   ` ☂Josh Chia (謝任中)
@ 2021-01-12 14:20   ` Jonathan Wakely
  2021-01-12 14:22     ` Florian Weimer
  2021-01-12 16:53     ` Liu Hao
  1 sibling, 2 replies; 9+ messages in thread
From: Jonathan Wakely @ 2021-01-12 14:20 UTC (permalink / raw)
  To: Florian Weimer; +Cc: ☂Josh Chia (謝任中) via Gcc-help

On Tue, 12 Jan 2021 at 13:37, Florian Weimer via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
>
> * ☂Josh Chia (謝任中) via Gcc-help:
>
> > I have a code snippet that I'm wondering why GCC didn't optimize the way I
> > think it should:
> > https://godbolt.org/z/1qKvax
> >
> > bar2() is a variant of bar1() that has been manually tweaked to avoid
> > branches. I haven't done any benchmarks but, I would expect the branchless
> > bar2() to perform better than bar1() but GCC does not automatically
> > optimize bar1() to be like bar2(); the generated code for bar1() and bar2()
> > are different and the generated code for bar1() contains a branch.
>
> The optimization is probably valid for C99, but not for C11, where the
> memory model prevents the compiler from introducing spurious writes:
> Another thread may modify the variable concurrently, and if this happens
> only if foo returns NULL, the original bar1 function does not contain a
> data race, but the branchless version would.

I'm not sure about the rules for C, but in C++ the compiler can assume
there is no race, because the increment is not atomic. If there were
another access to the variable then a non-atomic store would be a race
even in the bar1 version.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 14:20   ` Jonathan Wakely
@ 2021-01-12 14:22     ` Florian Weimer
  2021-01-12 16:53     ` Liu Hao
  1 sibling, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2021-01-12 14:22 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: ☂Josh Chia (謝任中) via Gcc-help

* Jonathan Wakely:

> On Tue, 12 Jan 2021 at 13:37, Florian Weimer via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
>>
>> * ☂Josh Chia (謝任中) via Gcc-help:
>>
>> > I have a code snippet that I'm wondering why GCC didn't optimize the way I
>> > think it should:
>> > https://godbolt.org/z/1qKvax
>> >
>> > bar2() is a variant of bar1() that has been manually tweaked to avoid
>> > branches. I haven't done any benchmarks but, I would expect the branchless
>> > bar2() to perform better than bar1() but GCC does not automatically
>> > optimize bar1() to be like bar2(); the generated code for bar1() and bar2()
>> > are different and the generated code for bar1() contains a branch.
>>
>> The optimization is probably valid for C99, but not for C11, where the
>> memory model prevents the compiler from introducing spurious writes:
>> Another thread may modify the variable concurrently, and if this happens
>> only if foo returns NULL, the original bar1 function does not contain a
>> data race, but the branchless version would.
>
> I'm not sure about the rules for C, but in C++ the compiler can assume
> there is no race, because the increment is not atomic.

The problem is that the store is conditional as written.  The compiler
cannot introduce an unconditional store due to that.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 14:20   ` Jonathan Wakely
  2021-01-12 14:22     ` Florian Weimer
@ 2021-01-12 16:53     ` Liu Hao
  2021-01-12 18:16       ` Jonathan Wakely
  1 sibling, 1 reply; 9+ messages in thread
From: Liu Hao @ 2021-01-12 16:53 UTC (permalink / raw)
  To: Jonathan Wakely, Florian Weimer
  Cc: ☂Josh Chia (謝任中) via Gcc-help


[-- Attachment #1.1: Type: text/plain, Size: 1612 bytes --]

在 2021-01-12 22:20, Jonathan Wakely via Gcc-help 写道:
> 
> I'm not sure about the rules for C, but in C++ the compiler can assume
> there is no race, because the increment is not atomic. If there were
> another access to the variable then a non-atomic store would be a race
> even in the bar1 version.
> 

What about this code:

     // -- beginning of copy-n-pasted code

     char const* foo();

     int cursor = 0;

     char const* bar1() {
         char const* result = foo();
         if (result)
             ++cursor;
         return result;
     }

     char const* bar2() {
         char const* result = foo();
         cursor += !!result;
         return result;
     }

     // -- end of copy-n-pasted code


     #include <atomic>
     #include <thread>
     #include <cstdio>

     char const* foo() {
         static ::std::atomic<char const*> str("meow");
         return str.exchange(nullptr);
     }

     int main() {
       ::std::thread thrs[10];
       for(auto& r : thrs)
         r = ::std::thread(bar1);

       for(auto& r : thrs)
         r.join();

       ::std::printf("cursor = %d\n", cursor);
     }


`foo()` will return non-null for exactly one thread. Increment of `cursor` by that thread is 
sequenced before its termination, which synchronizes with exactly one `join()`, which is sequenced 
before the final read of `cursor`. There is no race in this program, but there would be if `bar2` 
was called in place of `bar1`, where all threads could modify `cursor` concurrently.



-- 
Best regards,
LH_Mouse


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failure to optimize?
  2021-01-12 16:53     ` Liu Hao
@ 2021-01-12 18:16       ` Jonathan Wakely
  0 siblings, 0 replies; 9+ messages in thread
From: Jonathan Wakely @ 2021-01-12 18:16 UTC (permalink / raw)
  To: Liu Hao
  Cc: Florian Weimer, ☂Josh Chia (謝任中) via Gcc-help

On Tue, 12 Jan 2021, 16:55 Liu Hao, <lh_mouse@126.com> wrote:

> 在 2021-01-12 22:20, Jonathan Wakely via Gcc-help 写道:
> >
> > I'm not sure about the rules for C, but in C++ the compiler can assume
> > there is no race, because the increment is not atomic. If there were
> > another access to the variable then a non-atomic store would be a race
> > even in the bar1 version.
> >
>
> What about this code:
>
>      // -- beginning of copy-n-pasted code
>
>      char const* foo();
>
>      int cursor = 0;
>
>      char const* bar1() {
>          char const* result = foo();
>          if (result)
>              ++cursor;
>          return result;
>      }
>
>      char const* bar2() {
>          char const* result = foo();
>          cursor += !!result;
>          return result;
>      }
>
>      // -- end of copy-n-pasted code
>
>
>      #include <atomic>
>      #include <thread>
>      #include <cstdio>
>
>      char const* foo() {
>          static ::std::atomic<char const*> str("meow");
>          return str.exchange(nullptr);
>      }
>
>      int main() {
>        ::std::thread thrs[10];
>        for(auto& r : thrs)
>          r = ::std::thread(bar1);
>
>        for(auto& r : thrs)
>          r.join();
>
>        ::std::printf("cursor = %d\n", cursor);
>      }
>
>
> `foo()` will return non-null for exactly one thread. Increment of `cursor`
> by that thread is
> sequenced before its termination, which synchronizes with exactly one
> `join()`, which is sequenced
> before the final read of `cursor`. There is no race in this program, but
> there would be if `bar2`
> was called in place of `bar1`, where all threads could modify `cursor`
> concurrently.
>


Yes, that's Florian's point. Introducing a write in the other threads would
introduce a data race, which has undefined behaviour.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-01-12 18:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-12 13:32 Failure to optimize? ☂Josh Chia (謝任中)
2021-01-12 13:36 ` Florian Weimer
2021-01-12 14:05   ` ☂Josh Chia (謝任中)
2021-01-12 14:15     ` Alexander Monakov
2021-01-12 14:20     ` Florian Weimer
2021-01-12 14:20   ` Jonathan Wakely
2021-01-12 14:22     ` Florian Weimer
2021-01-12 16:53     ` Liu Hao
2021-01-12 18:16       ` Jonathan Wakely

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).