public inbox for gcc-help@gcc.gnu.org
* Re: Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?
       [not found] <OFA6ABE30F.F93D2C18-ON00258523.00744B2A-00258523.007540B7@notes.na.collabserv.com>
@ 2020-03-09  5:54 ` Hongtao Liu
  2020-03-09 18:01 ` Hong X
From: Hongtao Liu @ 2020-03-09  5:54 UTC
  To: Hong X; +Cc: gcc-help

On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>
> Hi all,
>
> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>
>     double tmp_values[4] = {0};
>
> and
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
> The first code snippet leads to
>
>     vmovaps XMMWORD PTR [rsp], xmm0
>     vmovaps XMMWORD PTR [rsp+16], xmm0
>
> But the second leads to only
>
>     vmovapd YMMWORD PTR [rsp], ymm0
>
> which is less efficient than the previous one. Am I missing something?
>
Assuming you're working on Skylake, the latency and throughput of
vmovaps/vmovapd are:

                    | lat     | throughput  | uops | port  |
VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |

Refer to https://uops.info/table.html
So the latter seems better.
> For the full code, see this godbolt link: https://godbolt.org/z/jonf72 , and I paste the full input and output below:
>
> Input code
>
> #include <cstring>
>
> double loadu1(const void* ptr, int count) {
>
>     double tmp_values[4] = {0};
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> double loadu2(const void* ptr, int count) {
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> Output assemblies:
>
> loadu1(void const*, int):
>         sub     rsp, 40
>         movsx   rdx, esi
>         vpxor   xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rdi, rsp
>         vmovaps XMMWORD PTR [rsp], xmm0
>         vmovaps XMMWORD PTR [rsp+16], xmm0
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         add     rsp, 40
>         ret
> loadu2(void const*, int):
>         push    rbp
>         movsx   rdx, esi
>         vxorpd  xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rbp, rsp
>         and     rsp, -32
>         sub     rsp, 32
>         mov     rdi, rsp
>         vmovapd YMMWORD PTR [rsp], ymm0
>         vzeroupper
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         leave
>         ret
>
> Thanks!
> Hong
>


-- 
BR,
Hongtao
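
One detail visible in the quoted assembly: loadu2 realigns the stack with `and rsp, -32` before the single `vmovapd YMMWORD PTR [rsp], ymm0`, because an aligned 256-bit store needs a 32-byte-aligned address. The sketch below only shows how to request and check that alignment in source. The names `is_aligned_32` and `check_alignment` are made up for illustration, and whether GCC then picks one 256-bit store for the `= {0}` form is not something this sketch claims.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when p sits on a 32-byte boundary,
// which is what an aligned YMM store (vmovaps/vmovapd) requires.
bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Hypothetical check: request 32-byte alignment explicitly for the
// zero-initialized buffer used in the thread.
void check_alignment() {
    alignas(32) double tmp_values[4] = {0};
    assert(is_aligned_32(tmp_values));
    assert(tmp_values[0] == 0.0 && tmp_values[3] == 0.0);
}
```

Comparing the generated code with and without `alignas(32)` (for example on godbolt, with the same "-mavx2 -O3" options) would show whether alignment is what drives the choice of store width.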


* RE: Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?
       [not found] <OFA6ABE30F.F93D2C18-ON00258523.00744B2A-00258523.007540B7@notes.na.collabserv.com>
  2020-03-09  5:54 ` Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero? Hongtao Liu
@ 2020-03-09 18:01 ` Hong X
From: Hong X @ 2020-03-09 18:01 UTC
  To: Hongtao Liu; +Cc: gcc-help



-----Hongtao Liu <crazylht@gmail.com> wrote: -----

>To: Hong X <hongx@ibm.com>
>From: Hongtao Liu <crazylht@gmail.com>
>Date: 03/08/2020 22:54
>Cc: gcc-help@gcc.gnu.org
>Subject: [EXTERNAL] Re: Initializing a vector to zero leads to less
>efficient assemblies than manually assigning a vector to zero?
>
>On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>>
>> Hi all,
>>
>> I tried to compile the following two code snippets with
>"--std=c++14 -mavx2 -O3" options:
>>
>>     double tmp_values[4] = {0};
>>
>> and
>>
>>     double tmp_values[4];
>>
>>     for (auto i = 0; i < 4; ++i) {
>>         tmp_values[i] = 0.0;
>>     }
>>
>> The first code snippet leads to
>>
>>     vmovaps XMMWORD PTR [rsp], xmm0
>>     vmovaps XMMWORD PTR [rsp+16], xmm0
>>
>> But the second leads to only
>>
>>     vmovapd YMMWORD PTR [rsp], ymm0
>>
>> which is less efficient than the previous one. Am I missing
>something?
>>
>Assuming you're working on Skylake, the latency and throughput of
>vmovaps/vmovapd are:
>
>                    | lat     | throughput  | uops | port  |
>VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
>VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |
>
>Refer to https://uops.info/table.html
>So the latter seems better.

Oops, I said it the other way around. I meant that the second is *more* (not *less*, as in my original post) efficient than the first, even though the two are functionally equivalent and the first is likely the form an average C++ programmer would prefer. This looks odd to me.

Thanks,
Hong
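
The claim that the two forms are functionally equivalent can be checked with a small sketch. `loadu1` and `loadu2` are copied from the original post; the helper `check_equivalence` and its test data are assumptions added here for illustration only.

```cpp
#include <cassert>
#include <cstring>

// From the thread: aggregate initialization zeroes all four elements.
double loadu1(const void* ptr, int count) {
    double tmp_values[4] = {0};
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// From the thread: explicit loop assigning 0.0 to each element.
double loadu2(const void* ptr, int count) {
    double tmp_values[4];
    for (auto i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Hypothetical helper: both functions should agree for every valid
// count (0 through 4 doubles copied from the source buffer).
void check_equivalence() {
    const double src[4] = {1.0, 2.0, 3.0, 4.0};
    for (int count = 0; count <= 4; ++count) {
        assert(loadu1(src, count) == loadu2(src, count));
    }
}
```

Calling `check_equivalence()` from a test or from `main` exercises both versions. It says nothing about performance; it only confirms that the difference discussed in this thread lies in the generated code, not in the results.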

