public inbox for gcc-bugs@sourceware.org
* [Bug c++/62080] New: Suboptimal code generation with eigen library
@ 2014-08-10 10:20 beschindler at gmail dot com
  2014-08-10 10:21 ` [Bug c++/62080] " beschindler at gmail dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: beschindler at gmail dot com @ 2014-08-10 10:20 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

            Bug ID: 62080
           Summary: Suboptimal code generation with eigen library
           Product: gcc
           Version: 4.8.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: beschindler at gmail dot com

Created attachment 33281
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33281&action=edit
Source code used to get the provided assembly

I'm currently optimizing some code using the eigen library and I'm stumbling
over an interesting problem. 
I have a function, which I wrote in two different ways (the attributes are
there to provide some optimization barriers, dimEigen is a member variable of
the containing class): 


__attribute__((noinline, noclone))
void eigenClamp(Eigen::Vector4i& vec)
{
    vec = vec.array().min(dimEigen.array()).max(Eigen::Array4i::Zero());
}

__attribute__((noinline, noclone))
void eigenClamp2(Eigen::Vector4i& vec)
{
    vec = vec.array().min(dimEigen.array());
    vec = vec.array().max(Eigen::Array4i::Zero());
}

I'm compiling this on a Core i7 920 using -O2 -fno-exceptions -fno-rtti
-std=c++11 -march=native.

The first function generates this assembly, which looks great: 

movdqu    (%rsi), %xmm1
movdqu    (%rdi), %xmm0
pminsd    %xmm1, %xmm0
pxor    %xmm1, %xmm1
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

The second version does this: 

movdqa    (%rsi), %xmm0
pminsd    (%rdi), %xmm0
movdqa    %xmm0, (%rsi) <-- 
pxor    %xmm0, %xmm0
movdqu    (%rsi), %xmm1 <-- 
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

It seems that, because the original source has two statements, the result of
the first expression is written to memory and then read back from memory two
instructions later. In my measurements this makes the function almost 50%
slower. As I find the latter code much easier to read than the former, it would
be great if the same assembly were generated for both.

Also, I note that in the second version pminsd takes its operand directly from
memory, while in the first version the operand is first loaded into a register
and pminsd is then applied to that register. Thus, I'd love to see this code:

movdqu    (%rsi), %xmm0
pminsd    (%rdi), %xmm0
pxor    %xmm1, %xmm1
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

As a reference, I'm attaching the complete source code and the generated
assembly
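
For readers without the attachment, the intended semantics of the two functions can be written without Eigen. This is only a scalar reference sketch: `Vec4i`, `clampOneExpr` and `clampTwoStmts` are hypothetical names, `std::array` stands in for `Eigen::Vector4i`, and the code is not expected to reproduce the code generation problem itself.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical stand-in for Eigen::Vector4i (assumption, not the Eigen type).
using Vec4i = std::array<int, 4>;

// One expression: componentwise min against the bound, then max against zero.
inline void clampOneExpr(Vec4i& vec, const Vec4i& dim) {
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::max(std::min(vec[i], dim[i]), 0);
}

// Two statements: the intermediate minimum is assigned back to vec first,
// and the maximum is computed from vec in a second pass.
inline void clampTwoStmts(Vec4i& vec, const Vec4i& dim) {
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::min(vec[i], dim[i]);
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::max(vec[i], 0);
}
```

Both forms compute the same values, which is why identical assembly would be the ideal outcome.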



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
@ 2014-08-10 10:21 ` beschindler at gmail dot com
  2014-08-10 11:30 ` glisse at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: beschindler at gmail dot com @ 2014-08-10 10:21 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #1 from Benjamin Schindler <beschindler at gmail dot com> ---
Created attachment 33282
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33282&action=edit
Generated assembly in full



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
  2014-08-10 10:21 ` [Bug c++/62080] " beschindler at gmail dot com
@ 2014-08-10 11:30 ` glisse at gcc dot gnu.org
  2014-08-10 12:05 ` beschindler at gmail dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-08-10 11:30 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #2 from Marc Glisse <glisse at gcc dot gnu.org> ---
(note that a minimal, self-contained testcase would be much better and
shouldn't be hard to produce)

We write to memory with:

(insn 10 8 11 2 (set (mem:V2DI (reg/v/f:DI 97 [ vec ]) [0 MEM[(__m128i *
{ref-all})vec_4(D)]+0 S16 A128])
        (subreg:V2DI (reg:V4SI 98) 0))
/usr/lib/gcc-snapshot/lib/gcc/x86_64-linux-gnu/4.10.0/include/emmintrin.h:706
1147 {*movv2di_internal}
     (expr_list:REG_DEAD (reg:V4SI 98)
        (nil)))

and then read back with:

(insn 15 12 17 2 (set (reg:V2DF 100)
        (vec_concat:V2DF (mem:DF (reg/v/f:DI 97 [ vec ]) [5 MEM[(const double
*)vec_4(D)]+0 S8 A64])
            (mem:DF (plus:DI (reg/v/f:DI 97 [ vec ])
                    (const_int 8 [0x8])) [0  S8 A8])))
/usr/lib/gcc-snapshot/lib/gcc/x86_64-linux-gnu/4.10.0/include/emmintrin.h:925
2016 {*vec_concatv2df}
     (nil))

The vec_concat of the 2 adjacent memory locations is not merged into a single
memory read, although from the previous insn it looks like it is suitably
aligned.
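
The shape of that store and read-back can be sketched with plain SSE2 intrinsics. This is an illustration under assumptions, not the code from the attachment: `storeThenSplitReload` is a hypothetical name, an x86 target with SSE2 is assumed, `memcpy` is used for the two 8-byte reads to stay within the aliasing rules, and it has not been confirmed that this triggers the same RTL.

```cpp
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Store four ints as one 16-byte block, then read the same memory back as
// two adjacent doubles and reinterpret the pair as four ints again.
// p must be 16-byte aligned because _mm_store_si128 is an aligned store.
inline __m128i storeThenSplitReload(int* p, __m128i v) {
    _mm_store_si128(reinterpret_cast<__m128i*>(p), v);  // one 128-bit store
    double lo, hi;
    std::memcpy(&lo, p, sizeof lo);      // bytes 0..7
    std::memcpy(&hi, p + 2, sizeof hi);  // bytes 8..15
    // _mm_set_pd takes the high element first, so lo ends up in the low half.
    return _mm_castpd_si128(_mm_set_pd(hi, lo));
}
```

Ideally the returned value would be recognised as identical to `v`, so that no reload from memory is needed.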



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
  2014-08-10 10:21 ` [Bug c++/62080] " beschindler at gmail dot com
  2014-08-10 11:30 ` glisse at gcc dot gnu.org
@ 2014-08-10 12:05 ` beschindler at gmail dot com
  2014-08-11 19:16 ` beschindler at gmail dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: beschindler at gmail dot com @ 2014-08-10 12:05 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #3 from Benjamin Schindler <beschindler at gmail dot com> ---
I just looked at what gcc-4.9.1 does, and its output differs:

movdqu    (%rsi), %xmm1
movdqu    (%rdi), %xmm0 <-- 
pminsd    %xmm1, %xmm0 <-- 
pxor    %xmm1, %xmm1
pmaxsd    %xmm1, %xmm0
movaps    %xmm0, (%rsi)

So, the first version still has a needless movdqu (I don't know how much it
hurts). Second version:

movdqa    (%rsi), %xmm0
pminsd    (%rdi), %xmm0 <-- good
pxor    %xmm1, %xmm1
movdqu    %xmm0, %xmm0 <-- bad?
pmaxsd    %xmm1, %xmm0
movaps    %xmm0, (%rsi)

So gcc-4.9 fares better in that it no longer goes through memory, but it emits
an odd mov instruction. Maybe this is a separate issue?



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (2 preceding siblings ...)
  2014-08-10 12:05 ` beschindler at gmail dot com
@ 2014-08-11 19:16 ` beschindler at gmail dot com
  2014-08-11 20:40 ` glisse at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: beschindler at gmail dot com @ 2014-08-11 19:16 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #4 from Benjamin Schindler <beschindler at gmail dot com> ---
(In reply to Marc Glisse from comment #2)
> (note that a minimal, self-contained testcase would be much better and
> shouldn't be hard to produce)


I don't mind doing so, but I don't quite know what is required to trigger this
issue.

After chatting with a friend, I realized yet another issue with the generated
assembly: it makes a lot of use of unaligned reads (movdqu) as opposed to
movdqa. Eigen types are aligned by design, and thus it should be possible to
use the (from what I've been told) faster aligned reads.
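
What "aligned by design" means for the compiler can be sketched as follows. The type `AlignedVec4i` and the helper `isAligned16` are hypothetical stand-ins rather than Eigen declarations. The point is only that an aligned 16-byte access such as movdqa requires an address that is a multiple of 16, which a type-level alignment guarantee provides.

```cpp
#include <cstdint>

// Hypothetical stand-in for a fixed-size vector type with 16-byte alignment.
struct alignas(16) AlignedVec4i {
    int v[4];
};

static_assert(alignof(AlignedVec4i) == 16, "type carries 16-byte alignment");

// True if p is a multiple of 16, i.e. usable with an aligned 16-byte access.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Every object of such a type satisfies `isAligned16`, whereas a pointer into the middle of the object generally does not.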

Cheers



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (3 preceding siblings ...)
  2014-08-11 19:16 ` beschindler at gmail dot com
@ 2014-08-11 20:40 ` glisse at gcc dot gnu.org
  2014-08-27  9:58 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-08-11 20:40 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
With the intrinsics patch, I notice that we don't simplify in gimple either:

  _40 = VIEW_CONVERT_EXPR<__m128i>(_39);
  MEM[(__m128i * {ref-all})vec_4(D)] = _40;
  _60 = MEM[(const double *)vec_4(D)];
  _61 = MEM[(const double *)vec_4(D) + 8B];
  _62 = {_60, _61};
  _63 = VIEW_CONVERT_EXPR<__v4si>(_62);

(_39 and _63 have the same type)



* [Bug c++/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (4 preceding siblings ...)
  2014-08-11 20:40 ` glisse at gcc dot gnu.org
@ 2014-08-27  9:58 ` rguenth at gcc dot gnu.org
  2020-04-06  9:43 ` [Bug middle-end/62080] " pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-08-27  9:58 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #5)
> With the intrinsics patch, I notice that we don't simplify in gimple either:
> 
>   _40 = VIEW_CONVERT_EXPR<__m128i>(_39);
>   MEM[(__m128i * {ref-all})vec_4(D)] = _40;
>   _60 = MEM[(const double *)vec_4(D)];
>   _61 = MEM[(const double *)vec_4(D) + 8B];
>   _62 = {_60, _61};
>   _63 = VIEW_CONVERT_EXPR<__v4si>(_62);
> 
> (_39 and _63 have the same type)

Value numbering has difficulty seeing through so many stmts (read: not
implemented), and it has no way of expressing "partial" values. That is, it
knows that at MEM[(__m128i * {ref-all})vec_4(D)] we stored _39, but when
value-numbering the partial reads it cannot assign the value _39 to them
(as said, "partial" values are not supported).

So one way to optimize this is to special-case the composition
operations and try looking up a proper memory operation.

Another possibility is to value-number compound operations also as
piecewise operations, introducing fake value-numbers (that is,
"lower" everything to component-wise operations internally).

I suppose pattern-matching

>   _60 = MEM[(const double *)vec_4(D)];
>   _61 = MEM[(const double *)vec_4(D) + 8B];
>   _62 = {_60, _61};

and generating a single read (possibly followed by a permute?) would be
more profitable and easier to implement.
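
The idea of special-casing the composition can be illustrated with a toy model. None of the names below (`MemRef`, `StoreTable`, `lookupComposition`) correspond to GCC internals. The table only remembers exact stores, so partial reads miss, while a lookup that first merges two adjacent reads into one wider read finds the stored value. Type conversions such as VIEW_CONVERT_EXPR are ignored in this sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Toy model only; these types are assumptions made for illustration.
struct MemRef {
    std::string base;    // e.g. "vec"
    std::size_t offset;  // byte offset from base
    std::size_t size;    // access size in bytes
    bool operator<(const MemRef& o) const {
        return std::tie(base, offset, size) < std::tie(o.base, o.offset, o.size);
    }
};

class StoreTable {
public:
    void recordStore(const MemRef& ref, const std::string& value) {
        stores_[ref] = value;
    }

    // Exact-match lookup: a partial read of a wider store finds nothing,
    // which models the lack of "partial" values.
    const std::string* lookup(const MemRef& ref) const {
        auto it = stores_.find(ref);
        return it == stores_.end() ? nullptr : &it->second;
    }

    // Special case for a composition {lo, hi}: if the two reads are adjacent
    // in memory, retry the lookup as one combined read.
    const std::string* lookupComposition(const MemRef& lo, const MemRef& hi) const {
        if (lo.base != hi.base || lo.offset + lo.size != hi.offset)
            return nullptr;
        return lookup(MemRef{lo.base, lo.offset, lo.size + hi.size});
    }

private:
    std::map<MemRef, std::string> stores_;
};
```

With a 16-byte store of `_39` recorded at `vec`, the two 8-byte reads miss individually, but the composed lookup returns `_39`.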



* [Bug middle-end/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (5 preceding siblings ...)
  2014-08-27  9:58 ` rguenth at gcc dot gnu.org
@ 2020-04-06  9:43 ` pinskia at gcc dot gnu.org
  2020-04-06 10:34 ` glisse at gcc dot gnu.org
  2021-07-18 21:18 ` pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-04-06  9:43 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect this has been improved already.


* [Bug middle-end/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (6 preceding siblings ...)
  2020-04-06  9:43 ` [Bug middle-end/62080] " pinskia at gcc dot gnu.org
@ 2020-04-06 10:34 ` glisse at gcc dot gnu.org
  2021-07-18 21:18 ` pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-04-06 10:34 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2020-04-06

--- Comment #8 from Marc Glisse <glisse at gcc dot gnu.org> ---
Even with gcc-4.8.4 (the oldest I have), I cannot reproduce the original
report. Maybe Eigen changed since then. That's why we ask for self-contained
testcases (possibly just the preprocessed source code).


* [Bug middle-end/62080] Suboptimal code generation with eigen library
  2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
                   ` (7 preceding siblings ...)
  2020-04-06 10:34 ` glisse at gcc dot gnu.org
@ 2021-07-18 21:18 ` pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-18 21:18 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|WAITING                     |RESOLVED

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
No feedback in over a year and no self-contained preprocessed source, so there
is no way to reproduce this any more. Closing as invalid.


Thread overview: 10+ messages
2014-08-10 10:20 [Bug c++/62080] New: Suboptimal code generation with eigen library beschindler at gmail dot com
2014-08-10 10:21 ` [Bug c++/62080] " beschindler at gmail dot com
2014-08-10 11:30 ` glisse at gcc dot gnu.org
2014-08-10 12:05 ` beschindler at gmail dot com
2014-08-11 19:16 ` beschindler at gmail dot com
2014-08-11 20:40 ` glisse at gcc dot gnu.org
2014-08-27  9:58 ` rguenth at gcc dot gnu.org
2020-04-06  9:43 ` [Bug middle-end/62080] " pinskia at gcc dot gnu.org
2020-04-06 10:34 ` glisse at gcc dot gnu.org
2021-07-18 21:18 ` pinskia at gcc dot gnu.org
