* Re: Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?
[not found] <OFA6ABE30F.F93D2C18-ON00258523.00744B2A-00258523.007540B7@notes.na.collabserv.com>
@ 2020-03-09 5:54 ` Hongtao Liu
2020-03-09 18:01 ` Hong X
1 sibling, 0 replies; 2+ messages in thread
From: Hongtao Liu @ 2020-03-09 5:54 UTC (permalink / raw)
To: Hong X; +Cc: gcc-help
On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>
> Hi all,
>
> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>
> double tmp_values[4] = {0};
>
> and
>
> double tmp_values[4];
>
> for (auto i = 0; i < 4; ++i) {
>     tmp_values[i] = 0.0;
> }
>
> The first code snippet leads to
>
> vmovaps XMMWORD PTR [rsp], xmm0
> vmovaps XMMWORD PTR [rsp+16], xmm0
>
> But the second leads to only
>
> vmovapd YMMWORD PTR [rsp], ymm0
>
> which is less efficient than the previous one. Am I missing something?
>
Assuming you're working on Skylake, the latency and throughput of
vmovaps/vmovapd are:

                    | lat     | throughput  | uops | port  |
VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |

Refer to https://uops.info/table.html
So the latter seems better.
> For the full code, see this godbolt link: https://godbolt.org/z/jonf72 , and I paste the full input and output below:
>
> Input code
>
> #include <cstring>
>
> double loadu1(const void* ptr, int count) {
>
>     double tmp_values[4] = {0};
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> double loadu2(const void* ptr, int count) {
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> Output assemblies:
>
> loadu1(void const*, int):
>         sub rsp, 40
>         movsx rdx, esi
>         vpxor xmm0, xmm0, xmm0
>         mov rsi, rdi
>         sal rdx, 3
>         mov rdi, rsp
>         vmovaps XMMWORD PTR [rsp], xmm0
>         vmovaps XMMWORD PTR [rsp+16], xmm0
>         call memcpy
>         vmovsd xmm0, QWORD PTR [rsp]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+24]
>         add rsp, 40
>         ret
> loadu2(void const*, int):
>         push rbp
>         movsx rdx, esi
>         vxorpd xmm0, xmm0, xmm0
>         mov rsi, rdi
>         sal rdx, 3
>         mov rbp, rsp
>         and rsp, -32
>         sub rsp, 32
>         mov rdi, rsp
>         vmovapd YMMWORD PTR [rsp], ymm0
>         vzeroupper
>         call memcpy
>         vmovsd xmm0, QWORD PTR [rsp]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd xmm0, xmm0, QWORD PTR [rsp+24]
>         leave
>         ret
>
> Thanks!
> Hong
>
--
BR,
Hongtao
* RE: Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?
From: Hong X @ 2020-03-09 18:01 UTC (permalink / raw)
To: Hongtao Liu; +Cc: gcc-help
-----Hongtao Liu <crazylht@gmail.com> wrote: -----
>To: Hong X <hongx@ibm.com>
>From: Hongtao Liu <crazylht@gmail.com>
>Date: 03/08/2020 22:54
>Cc: gcc-help@gcc.gnu.org
>Subject: [EXTERNAL] Re: Initializing a vector to zero leads to less
>efficient assemblies than manually assigning a vector to zero?
>
>On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>>
>> Hi all,
>>
>> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>>
>> double tmp_values[4] = {0};
>>
>> and
>>
>> double tmp_values[4];
>>
>> for (auto i = 0; i < 4; ++i) {
>>     tmp_values[i] = 0.0;
>> }
>>
>> The first code snippet leads to
>>
>> vmovaps XMMWORD PTR [rsp], xmm0
>> vmovaps XMMWORD PTR [rsp+16], xmm0
>>
>> But the second leads to only
>>
>> vmovapd YMMWORD PTR [rsp], ymm0
>>
>> which is less efficient than the previous one. Am I missing something?
>>
>Assuming you're working on Skylake, the latency and throughput of
>vmovaps/vmovapd are:
>
>                    | lat     | throughput  | uops | port  |
>VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
>VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |
>
>Refer to https://uops.info/table.html
>So the latter seems better.
Oops, I said it the wrong way around. I meant that the second is *more* (not *less*, as I wrote in my original post) efficient than the first, even though the two are functionally equivalent, and the first is probably the form an average C++ programmer would prefer. This looks odd to me.
Thanks,
Hong
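
For reference, the claim that the two snippets are functionally equivalent can be checked with a minimal sketch like the one below. It is not part of the original post: the names sum_init, sum_loop and variants_agree are hypothetical, and the precondition 0 <= count <= 4 is an assumption (the posted loadu1/loadu2 do not check count). It only compares results and says nothing about which stores the compiler emits.

```cpp
#include <cassert>
#include <cstring>

// Same shape as loadu1 in the thread: zero via aggregate initialization.
// Assumed precondition (not in the original post): 0 <= count <= 4.
double sum_init(const void* ptr, int count) {
    double tmp_values[4] = {0};  // remaining elements are zero-initialized too
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Same shape as loadu2 in the thread: zero via an explicit loop.
double sum_loop(const void* ptr, int count) {
    double tmp_values[4];
    for (auto i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Hypothetical helper: true if both variants return the same sum
// for every valid count in [0, 4].
bool variants_agree(const double (&src)[4]) {
    for (int count = 0; count <= 4; ++count) {
        if (sum_init(src, count) != sum_loop(src, count)) {
            return false;
        }
    }
    return true;
}
```

Calling variants_agree on any four-element array of doubles exercises both variants for every valid count; differences in the generated stores, if any, would still have to be read off the compiler output as in the godbolt link above.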