From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
Segher Boessenkool <segher@kernel.crashing.org>,
Bill Schmidt <wschmidt@linux.ibm.com>,
linkw@gcc.gnu.org, David Edelsohn <dje.gcc@gmail.com>
Subject: Re: [PATCH] New hook adjust_iv_update_pos
Date: Fri, 25 Jun 2021 17:41:30 +0800 [thread overview]
Message-ID: <ef640ef9-8ec7-77b8-785e-13052a63e0ae@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc3izF3re7bG_otVyhX9cRc_Dghv=GJHwxu_iDtXvQxn2Q@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3756 bytes --]
On 2021/6/25 16:54, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>
>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>> on Power. For example, it generates mismatched address offset after
>> adjust iv update statement position:
>>
>> <bb 32> [local count: 70988443]:
>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>> ivtmp.30_415 = ivtmp.30_414 + 1;
>> _34 = ref_180 + 18446744073709551615;
>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>> if (_84 == _86)
>> goto <bb 56>; [94.50%]
>> else
>> goto <bb 87>; [5.50%]
>>
>> Disable it will produce:
>>
>> <bb 32> [local count: 70988443]:
>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>> ivtmp.30_415 = ivtmp.30_414 + 1;
>> if (_84 == _86)
>> goto <bb 56>; [94.50%]
>> else
>> goto <bb 87>; [5.50%]
>>
>> Then later pass loop unroll could benefit from same address offset
>> with different base address and reduces register dependency.
>> This patch could improve performance by 10% for typical case on Power,
>> no performance change observed for X86 or Aarch64 due to small loops
>> not unrolled on these platforms. Any comments?
>
> The case you quote is special in that if we hoisted the IV update before
> the other MEM _also_ used in the condition it would be fine again.
Thanks. I tried to hoist the IV update statement before the first MEM (Fix 2), it
shows even worse performance due to not unroll(two more "base-1" is generated in gimple,
then loop->ninsns is 11 so small loops is not unrolled), change the threshold from
10 to 12 in rs6000_loop_unroll_adjust would make it also unroll 2 times, the
performance is SAME to the one that IV update statement in the *MIDDLE* (trunk).
From the ASM, we can see the index register %r4 is used in two iterations which
maybe a bottle neck for hiding instruction latency?
Then it seems reasonable the performance would be better if keep the IV update
statement at *LAST* (Fix 1).
(Fix 2):
<bb 32> [local count: 70988443]:
ivtmp.30_415 = ivtmp.30_414 + 1;
_34 = ip_229 + 18446744073709551615;
_84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
_33 = ref_180 + 18446744073709551615;
_86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
if (_84 == _86)
goto <bb 56>; [94.50%]
else
goto <bb 87>; [5.50%]
.L67:
lbzx %r12,%r24,%r4
lbzx %r25,%r7,%r4
cmpw %cr0,%r12,%r25
bne %cr0,.L11
mr %r26,%r4
addi %r4,%r4,1
lbzx %r12,%r24,%r4
lbzx %r25,%r7,%r4
mr %r6,%r26
cmpw %cr0,%r12,%r25
bne %cr0,.L11
mr %r26,%r4
.L12:
cmpdi %cr0,%r10,1
addi %r4,%r26,1
mr %r6,%r26
addi %r10,%r10,-1
bne %cr0,.L67
>
> Now, adjust_iv_update_pos doesn't seem to check that the
> condition actually uses the IV use stmt def, so it likely applies to
> too many cases.
>
> Unfortunately the introducing rev didn't come with a testcase,
> but still I think fixing up adjust_iv_update_pos is better than
> introducing a way to short-cut it per target decision.
>
> One "fix" might be to add a check that either the condition
> lhs or rhs is the def of the IV use and the other operand
> is invariant. Or if it's of similar structure hoist across the
> other iv-use as well. Not that I understand the argument
> about the overlapping life-range.
>
> You also don't provide a complete testcase ...
>
Attached the test code, will also add it it patch in future version.
The issue comes from a very small hot loop:
do {
len++;
} while(len < maxlen && ip[len] == ref[len]);
--
Thanks,
Xionghu
[-- Attachment #2: test_i2_4_6.c --]
[-- Type: text/plain, Size: 4754 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
# define HLOG 16
#define MAX_LIT (1 << 5)
typedef const uint8_t *LZF_HSLOT;
typedef LZF_HSLOT LZF_STATE[1 << (HLOG)];
int compute_on_bytes(uint8_t *, int, uint8_t *, int );
int main(int argc, char **argv)
{
//Declarations
FILE *fptr;
int len_limit;
uint8_t *inputbuf,*outputbuf;
uint8_t *ip,*op;
int len, i;
long int sum=0;
//randomness
if (argv[1] != NULL )
len=atoi(argv[1]);
//read
fptr = fopen(argv[2],"rb");
if(fptr == NULL)
{
printf("Error!");
exit(1);
}
fseek( fptr , 0L , SEEK_END);
len_limit = ftell( fptr );
rewind( fptr );
/* allocate memory for entire input content */
inputbuf = calloc( 1, len_limit+1 );
if( !inputbuf ) fclose(fptr),fputs("memory alloc fails",stderr),exit(1);
/* allocate memory for entire output */
outputbuf = calloc( 1, len_limit+1 );
if( !inputbuf ) fclose(fptr),fputs("memory alloc fails",stderr),exit(1);
/* copy the file into the buffer */
if( 1!=fread( inputbuf , len_limit, 1 , fptr) )
fclose(fptr),free(inputbuf),fputs("entire read fails",stderr),exit(1);
//compare
ip=inputbuf;
op=outputbuf;
for (i=0; i < len_limit; i=i+(len/4))
sum+=compute_on_bytes(ip+i,len,op,len-4);
fclose(fptr);
free(inputbuf);
return sum;
}
//__attribute__((noinline)) int compute_on_bytes(uint8_t *ip, uint8_t *ref, int len_limit, int temp2 )
int compute_on_bytes(uint8_t *in_data, int in_len, uint8_t *out_data, int out_len)
{
LZF_STATE htab;
uint8_t *ip = in_data;
uint8_t *op = out_data;
uint8_t *in_end = ip + in_len;
uint8_t *out_end = op + out_len;
uint8_t *ref;
unsigned long off;
unsigned int hval;
int lit;
if (!in_len || !out_len)
return 0;
lit = 0; op++;
hval = (((ip[0]) << 8) | ip[1]);
while( ip < in_end - 2 )
{
uint8_t *hslot;
hval = (((hval) << 8) | ip[2]);
hslot = htab + ((( hval >> (3*8 - 16)) - hval*5) & ((1 << (16)) - 1));
ref = *hslot + in_data;
*hslot = ip - in_data;
if (1 && (off = ip - ref - 1) < (1 << 13) && ref > in_data && ref[2] == ip[2] && ((ref[1] << 8) | ref[0]) == ((ip[1] << 8) | ip[0]) )
{
unsigned int len = 2;
unsigned int maxlen = in_end - ip - len;
maxlen = maxlen > ((1 << 8) + (1 << 3)) ? ((1 << 8) + (1 << 3)) : maxlen;
if ((op + 3 + 1 >= out_end) != 0)
if (op - !lit + 3 + 1 >= out_end)
return 0;
op [- lit - 1] = lit - 1;
op -= !lit;
for (;;)
{
if ( maxlen > 16 )
{
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
len++; if (ref [len] != ip [len]) break;
}
do {
len++;
}while(len < maxlen && ip[len] == ref[len]);
break;
}
len -= 2;
ip++;
if (len < 7)
{
*op++ = (off >> 8) + (len << 5);
}
else
{
*op++ = (off >> 8) + ( 7 << 5);
*op++ = len - 7;
}
*op++ = off;
lit = 0; op++;
ip += len + 1;
if (ip >= in_end - 2)
break;
--ip;
--ip;
hval = (((ip[0]) << 8) | ip[1]);
hval = (((hval) << 8) | ip[2]);
htab[((( hval >> (3*8 - 16)) - hval*5) & ((1 << (16)) - 1))] = ip - in_data;
ip++;
hval = (((hval) << 8) | ip[2]);
htab[((( hval >> (3*8 - 16)) - hval*5) & ((1 << (16)) - 1))] = ip - in_data;
ip++;
}
else {
if(op >= out_end)
return 0;
lit++; *op++ = *ip++;
if (lit == (1 << 5))
{
op [- lit - 1] = lit - 1;
lit = 0; op++;
}
}
}
if (op + 3 > out_end) /* at most 3 bytes can be missing here */
return 0;
while (ip < in_end)
{
lit++; *op++ = *ip++;
if (lit == MAX_LIT)
{
op [- lit - 1] = lit - 1; /* stop run */
lit = 0; op++; /* start run */
}
}
op [- lit - 1] = lit - 1; /* end run */
op -= !lit; /* undo run if length is zero */
return op - out_data;
}
next prev parent reply other threads:[~2021-06-25 9:41 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-25 8:31 Xionghu Luo
2021-06-25 8:54 ` Richard Biener
2021-06-25 9:41 ` Xionghu Luo [this message]
2021-06-25 10:02 ` Richard Biener
2021-06-25 12:58 ` Xionghu Luo
2021-06-25 13:12 ` Xionghu Luo
2021-06-28 8:07 ` Xionghu Luo
2021-06-28 8:25 ` Richard Biener
2021-06-29 9:19 ` Xionghu Luo
2021-07-07 13:20 ` Richard Biener
2021-07-12 8:13 ` Xionghu Luo
2021-07-12 9:05 ` Hongtao Liu
2021-07-12 9:49 ` Richard Biener
2021-07-07 5:51 ` Bin.Cheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ef640ef9-8ec7-77b8-785e-13052a63e0ae@linux.ibm.com \
--to=luoxhu@linux.ibm.com \
--cc=dje.gcc@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=linkw@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
--cc=segher@kernel.crashing.org \
--cc=wschmidt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).