* Question about vectorization optimization during RTL-PASS
@ 2023-11-13 7:55 Hanke Zhang
From: Hanke Zhang @ 2023-11-13 7:55 UTC (permalink / raw)
To: gcc
Hi, I've been working on vectorization-related optimization lately.
GCC seems to miss an optimization here, and I would like to ask
whether it can be fixed.
For example, consider the following loop written with AVX2 intrinsics:
#include <immintrin.h>
// reg->node2[i].state is an unsigned long long variable
// reg->size is an integer variable that represents the iterations
for (int i = 0; i < reg->size; i += 4) {
    /* original code:
       unsigned long long state = reg->node2[i].state;
       if (state & (1LLU << (j + 1) | 1LLU << (width + j)))
           state ^= (1LLU << j);
       state ^= (1LLU << (width + j));
    */
    __m256i state = _mm256_loadu_si256(
        (__m256i *)((char *)(reg->node2) + i * sizeof(unsigned long long)));
    __m256i mask1 = _mm256_set1_epi64x(1LLU << (j + 1) | 1LLU << (width + j));
    // cmp: each lane becomes all-ones where (state & mask1) == mask1
    __m256i tmp1 = _mm256_and_si256(state, mask1);
    __m256i cmp1 = _mm256_cmpeq_epi64(tmp1, mask1);
    // first xor: flip bit j only in the lanes selected by cmp1
    __m256i xor_param = _mm256_set1_epi64x(1LLU << j);
    __m256i tmp2 = _mm256_and_si256(xor_param, cmp1);
    __m256i xor_result = _mm256_xor_si256(state, tmp2);
    // second xor: flip bit (width + j) in every lane
    __m256i xor_param2 = _mm256_set1_epi64x(1LLU << (width + j));
    __m256i xor_res2 = _mm256_xor_si256(xor_result, xor_param2);
    _mm256_storeu_si256(
        (__m256i *)((char *)(reg->node2) + i * sizeof(unsigned long long)),
        xor_res2);
}
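For checking any vectorized variant against the intended per-element
behavior, a scalar reference can help. The following is only a sketch
under assumptions: the function name apply_gate_scalar and the flat
uint64_t array are hypothetical stand-ins for reg->node2[i].state, and
the condition follows the intrinsics above, where _mm256_cmpeq_epi64
selects a lane only when both mask bits are set. The commented scalar
code tests "state & mask" instead, which is true when either bit is
set, so one of the two probably needs adjusting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for one pass over n state words.
 * Mirrors the intrinsics: bit j is flipped only when BOTH bit (j + 1)
 * and bit (width + j) are set; bit (width + j) is always flipped. */
static void apply_gate_scalar(uint64_t *state, size_t n, int j, int width)
{
    /* All shift counts must stay below 64 to avoid undefined behavior. */
    assert(j >= 0 && j + 1 < 64 && width >= 0 && width + j < 64);

    const uint64_t mask   = (1ULL << (j + 1)) | (1ULL << (width + j));
    const uint64_t flip_j = 1ULL << j;
    const uint64_t flip_w = 1ULL << (width + j);

    for (size_t i = 0; i < n; i++) {
        uint64_t s = state[i];
        if ((s & mask) == mask)   /* same test as cmpeq against the mask */
            s ^= flip_j;
        s ^= flip_w;              /* unconditional flip */
        state[i] = s;
    }
}
```

Running this and the AVX2 loop on the same input and comparing the
arrays is a cheap way to confirm that a source-level change made to
influence code generation did not alter the result.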
My expectation is to generate assembly code like this:
vpxor ymm6, ymm2, ymmword ptr [r9+r15*8]
vpand ymm4, ymm1, ymm6
vpcmpeqq ymm5, ymm4, ymm1
vpand ymm7, ymm3, ymm5
vpxor ymm8, ymm6, ymm7
vmovdqu ymmword ptr [r9+r15*8], ymm8
But the actual generated assembly code looks like this:
vpand ymm0, ymm2, ymmword ptr [rsi+rax*8]
vpxor ymm1, ymm4, ymmword ptr [rsi+rax*8]
vpcmpeqq ymm0, ymm0, ymm2
vpand ymm0, ymm0, ymm5
vpxor ymm0, ymm0, ymm1
vmovdqu ymmword ptr [rsi+rax*8], ymm0
That is, GCC has moved the second XOR ahead of the compare, and as a
result it reads the same memory location twice (the second ymmword ptr
[rsi+rax*8] operand). I think this may reduce efficiency, and perf
also attributes a large share of the samples to this instruction.
From the dump files, I found that these transformations happen in the
RTL passes, and they don't seem to be easy to change. Is there a good
way to get GCC to generate the assembly code I want, either by
modifying my own source files or by modifying the GCC source code?
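One source-level approach that might be worth trying is an empty
inline asm statement right after the load. With the "+x" constraint,
GCC has to treat the loaded value as read and modified in a vector
register, so it can no longer assume the register still equals the
memory contents and cannot substitute a second memory operand for it.
The sketch below is written under assumptions: apply_gate_avx2 and the
flat uint64_t array are hypothetical, n is assumed to be a multiple of
4 (any tail is left untouched), and there is no guarantee that the
resulting instruction order matches the expected listing, since GCC
may still reassociate the two XORs.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the loop that loads each vector once and
 * pins the loaded value in a register. Requires GCC or Clang on x86. */
static __attribute__((target("avx2")))
void apply_gate_avx2(uint64_t *state, size_t n, int j, int width)
{
    assert(j >= 0 && j + 1 < 64 && width >= 0 && width + j < 64);

    const __m256i mask = _mm256_set1_epi64x(
        (long long)((1ULL << (j + 1)) | (1ULL << (width + j))));
    const __m256i flip_j = _mm256_set1_epi64x((long long)(1ULL << j));
    const __m256i flip_w =
        _mm256_set1_epi64x((long long)(1ULL << (width + j)));

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256i s = _mm256_loadu_si256((const __m256i *)(state + i));

        /* Empty asm: s is an in/out register operand, so its value is
         * opaque afterwards and later uses must read the register. */
        __asm__("" : "+x"(s));

        /* All-ones in lanes where both mask bits are set. */
        __m256i both = _mm256_cmpeq_epi64(_mm256_and_si256(s, mask), mask);
        s = _mm256_xor_si256(s, _mm256_and_si256(flip_j, both));
        s = _mm256_xor_si256(s, flip_w);
        _mm256_storeu_si256((__m256i *)(state + i), s);
    }
}
```

The target("avx2") attribute only lets this one function use AVX2
without compiling the whole file with -mavx2; callers should still
check for AVX2 support at run time. Whether the single load is
actually faster is worth measuring, because the second read of the
same address in the current output will normally hit the L1 cache.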