[Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new
@ 2023-05-17 18:34 thiago at kde dot org
  2023-05-17 18:45 ` [Bug target/109896] " pinskia at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: thiago at kde dot org @ 2023-05-17 18:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

            Bug ID: 109896
           Summary: Missed optimisation: overflow detection in
                    multiplication instructions for operator new
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

In the following code:
struct S
{
    char buf[47];       // weird size
};

void *f(unsigned long paramCount)
{
    return new S[paramCount];
}

GCC generates (see https://gcc.godbolt.org/z/o5eocj5n9):
        movabsq $196241958230952676, %rax
        cmpq    %rdi, %rax
        jb      .L2
        imulq   $47, %rdi, %rdi
        jmp     operator new[](unsigned long)
f(unsigned long) [clone .cold]:
.L2:
        pushq   %rax
        call    __cxa_throw_bad_array_new_length

That's a slight pessimisation of the typical, non-exceptional case because of
the presence of the compare instructions. On modern x86, that's 3 retire slots
and 2 uops, in addition to the multiplication's 3 cycles (which may be
speculated and start early). But the presence of a 10-byte instruction and the
fact that the jump is further than 8-bit displacement range mean those three
instructions occupy 18 bytes, meaning the front-end is sub-utilised, requiring
2 cycles to decode the 5 instructions (pre-GLC [I think] CPUs decode 4
instructions in 16 bytes per cycle).

Instead, GCC should emit the multiplication and check if the overflow flag was
set. I believe the optimal code for GCC would be:

        imulq   $47, %rdi, %rdi
        jo      .L2
        jmp     operator new[](unsigned long)

That's 15 bytes, so 1 cycle for the decoder to decode all 3 instructions.
That's 3+1 cycles and 2 retire slots before the JMP.

In the Godbolt link above, Clang and MSVC emitted a CMOV:

        mulq    %rcx
        movq    $-1, %rdi
        cmovnoq %rax, %rdi
        jmp     operator new[](unsigned long)@PLT

This is slightly worse (19 bytes, 4 instructions, though also 3+1 cycles). For
GCC's -fno-exceptions case, I recommend keeping the IMUL+JO case and only load
-1 in the .text.unlikely section. But see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109895

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
@ 2023-05-17 18:45 ` pinskia at gcc dot gnu.org
  2023-05-17 20:39 ` hjl.tools at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-05-17 18:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect the overflow code was added before __builtin_*_overflow were added
which is why the generated code is this way.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
  2023-05-17 18:45 ` [Bug target/109896] " pinskia at gcc dot gnu.org
@ 2023-05-17 20:39 ` hjl.tools at gmail dot com
  2023-05-17 23:42 ` thiago at kde dot org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: hjl.tools at gmail dot com @ 2023-05-17 20:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Andrew Pinski from comment #1)
> I suspect the overflow code was added before __builtin_*_overflow were added
> which is why the generated code is this way.

Should the C++ front-end use __builtin_mul_overflow?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
  2023-05-17 18:45 ` [Bug target/109896] " pinskia at gcc dot gnu.org
  2023-05-17 20:39 ` hjl.tools at gmail dot com
@ 2023-05-17 23:42 ` thiago at kde dot org
  2023-05-17 23:46 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: thiago at kde dot org @ 2023-05-17 23:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #3 from Thiago Macieira <thiago at kde dot org> ---
(In reply to H.J. Lu from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > I suspect the overflow code was added before __builtin_*_overflow were added
> > which is why the generated code is this way.
> 
> Should the C++ front-end use __builtin_mul_overflow?

That's what that code is doing, yes.

But mind you that not all examples are doing actual multiplications. That's why
I had the weird size of 47.

A size that is a power of 2 is just doing bit checks. For example, 16:
        movq    %rdi, %rax
        shrq    $59, %rax
        jne     .L2

Other sizes do the compare, but there's no multiplication involved. For 24:
        movabsq $384307168202282325, %rax
        cmpq    %rdi, %rax
        jb      .L2
        leaq    (%rdi,%rdi,2), %rdi
        salq    $3, %rdi
5 instructions, 4 cycles (not including front-end decode), so roughly the same
as the imulq example above (4 cycles), but with far more ports to dispatch to.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
                   ` (2 preceding siblings ...)
  2023-05-17 23:42 ` thiago at kde dot org
@ 2023-05-17 23:46 ` pinskia at gcc dot gnu.org
  2023-05-18  0:13 ` thiago at kde dot org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-05-17 23:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Thiago Macieira from comment #3)
> 5 instructions, 4 cycles (not including front-end decode), so roughly the
> same as the imulq example above (4 cycles), but with far more ports to
> dispatch to.

If you are that picky for cycles, these cycles are not going to be a problem
compared to the dynamic allocation that is just about to happen ......

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
                   ` (3 preceding siblings ...)
  2023-05-17 23:46 ` pinskia at gcc dot gnu.org
@ 2023-05-18  0:13 ` thiago at kde dot org
  2023-05-18  8:33 ` redi at gcc dot gnu.org
  2023-05-18 14:08 ` thiago at kde dot org
  6 siblings, 0 replies; 8+ messages in thread
From: thiago at kde dot org @ 2023-05-18  0:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #5 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Andrew Pinski from comment #4)
> If you are that picky for cycles, these cycles are not going to be a problem
> compared to the dynamic allocation that is just about to happen ......

Yeah, I realised that after I posted the reply. If the calculation is
successful, we're going to allocate memory and that's neither fast nor
determinstic. If it overflows, we're going to unwind the stack, which is even
worse. I had only looked at the multiplication and failed to consider what
comes after it.

So, yeah, do this if it's a low-hanging fruit.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
                   ` (4 preceding siblings ...)
  2023-05-18  0:13 ` thiago at kde dot org
@ 2023-05-18  8:33 ` redi at gcc dot gnu.org
  2023-05-18 14:08 ` thiago at kde dot org
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2023-05-18  8:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #6 from Jonathan Wakely <redi at gcc dot gnu.org> ---
With placement-new there's no allocation:
https://gcc.godbolt.org/z/68e4PaeYz

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new
  2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
                   ` (5 preceding siblings ...)
  2023-05-18  8:33 ` redi at gcc dot gnu.org
@ 2023-05-18 14:08 ` thiago at kde dot org
  6 siblings, 0 replies; 8+ messages in thread
From: thiago at kde dot org @ 2023-05-18 14:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Jonathan Wakely from comment #6)
> With placement-new there's no allocation:
> https://gcc.godbolt.org/z/68e4PaeYz

Is the exception expected there, though?

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-05-18 14:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-17 18:34 [Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new thiago at kde dot org
2023-05-17 18:45 ` [Bug target/109896] " pinskia at gcc dot gnu.org
2023-05-17 20:39 ` hjl.tools at gmail dot com
2023-05-17 23:42 ` thiago at kde dot org
2023-05-17 23:46 ` pinskia at gcc dot gnu.org
2023-05-18  0:13 ` thiago at kde dot org
2023-05-18  8:33 ` redi at gcc dot gnu.org
2023-05-18 14:08 ` thiago at kde dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).