From: Carl Love <cel@us.ibm.com>
To: "Kewen.Lin" <linkw@linux.ibm.com>, cel@us.ibm.com
Cc: Peter Bergner <bergner@vnet.ibm.com>,
Segher Boessenkool <segher@kernel.crashing.org>,
gcc-patches@gcc.gnu.org, David Edelsohn <dje.gcc@gmail.com>
Subject: Re: [PATCH] rs6000: Update the vsx-vector-6.* tests.
Date: Fri, 30 Jun 2023 15:20:45 -0700
Message-ID: <d176fbc44ba6a4c43092c0a53718acaba3b04d31.camel@us.ibm.com>
In-Reply-To: <c4614546-8b2c-895e-472a-0cf6818079ec@linux.ibm.com>
Kewen:
On Fri, 2023-06-30 at 11:37 +0800, Kewen.Lin wrote:
> Hi Carl,
>
> on 2023/6/30 05:36, Carl Love wrote:
> > Kewen:
> >
> > On Wed, 2023-06-28 at 16:35 +0800, Kewen.Lin wrote:
> > > > Yea, I was going with a runnable test and didn't include the
> > > > instruction counts. Added back in. Rather than doing by processor
> > > > version (P8, P9, P10) I was able to do it by BE/LE. The instruction
> > > > counts were the same for LE across processor versions but there are
> > > > a few instruction counts that vary with BE and LE.
> > >
> > > But the original test case only checks for cpu-types (processor
> > > version) but not for endianness, which means that for the bif usages
> > > there should be no difference by endianness. Why does this change
> > > with your new test case?
> > > Could you have a further look and make it consistent with some
> > > adjustment if possible? As we know, checking insn counts is sometimes
> > > fragile, so I think we should try our best to make it as robust as
> > > possible in the first place.
> > >
> > > Besides, the original case also has some differences between p7/p8
> > > and p9.
> > >
> >
> > There are differences on P8 LE versus BE. I did a diff between the
> > P8 and P9 tests:
> >
> > diff vsx-vector-6.p8.c vsx-vector-6.p9.c
> > 3,4c3,4
> > < /* { dg-require-effective-target powerpc_p8vector_ok } */
> > < /* { dg-options "-O2 -mdejagnu-cpu=power8" } */
> > ---
> > > /* { dg-require-effective-target powerpc_p9vector_ok } */
> > > /* { dg-options "-O2 -mdejagnu-cpu=power9" } */
> > 12c12
> > < /* { dg-final { scan-assembler-times {\mvperm\M} 1 } } */
> > ---
> > > /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
> > 23d22
> > < /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> > 37c36
> > < /* { dg-final { scan-assembler-times {\mxvsubdp\M} 1 } } */
> > ---
> > > /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> >
> > So we can see the vperm, vpermr, xxpermr, xvmsubadp, xvmsubmdp,
> > xvsubdp, xvmsubadp, xvmsubmdp instruction count checks are different
> > between the two architectures. I then wrote a script to compile the
> > CPU specific test on Power 8, Power 9 and Power 10 architectures and
> > then grep for the above list of instructions. If I run the script on
> > P8 BE and LE I get:
> >
> > instruction  Power 8 BE    Power 8 LE  Power 9 LE  Power 9 BE  Power 10 LE*
> >              (makalu-lp1)  (genoa)     (marlin)    (nilram)    (ltcd97-lp3)
> >              count         count       count       count       count
> > vperm        1             1           0           0           0
> > vpermr       0             0           0           0           0
> > xxpermr      0             0           1           0           1
> > xvmsubadp    1             0           1           1           1
> > xvmsubmdp    0             1           0           0           0
> > xvsubdp      1             1           1           1           1
> >
>
> Thanks for looking into this and making this statistics.
>
> Is there a typo for column nilram? Otherwise, the below insn check
>
> /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
>
> would fail there.
Yes, there is a typo in the nilram column. The test generates a vperm
instruction.
#if defined (__BIG_ENDIAN__) || defined (_ARCH_PWR9)
dst[8].d = vec_perm (src0[8].d, src1[8].d, src2[8].uc);
f74: e9 3f 00 78 ld r9,120(r31)
f78: 39 29 07 00 addi r9,r9,1792
f7c: f5 89 00 01 lxv vs12,0(r9)
f80: e9 3f 00 80 ld r9,128(r31)
f84: 39 29 07 00 addi r9,r9,1792
f88: f4 09 00 01 lxv vs0,0(r9)
f8c: e9 3f 00 88 ld r9,136(r31)
f90: 39 29 07 00 addi r9,r9,1792
f94: f4 09 00 89 lxv vs32,128(r9)
f98: e9 3f 00 70 ld r9,112(r31)
f9c: 39 29 07 00 addi r9,r9,1792
fa0: f0 2c 64 91 xxmr vs33,vs12
fa4: f1 a0 04 91 xxmr vs45,vs0
fa8: 10 01 68 2b vperm v0,v1,v13,v0
...
> <snip>
> >
> > I had played with putting -Wno-inline on the command line but that
> > didn't seem to make any difference. However, your suggestion of
> > __attribute__ ((noipa)) does prevent the inlining and we don't get
> > the second copy of the instructions showing up. The inlining
> > eliminated the LE/BE differences for xvmaxsp, xvminsp and xvmaxdp.
>
> -Winline is an option for warning: "Warn if a function that is
> declared as inline cannot be inlined.", I think what you wanted is
> -fno-inline, and it's good to know noipa helps here.
Yea, my bad. Didn't read the manual very carefully.
>
> > The instruction count test for xxlor in vsx-vector-6-func-2lop.c
> > differs on LE and BE vsx-vector-6-func-2op.c. I believe the
> > instruction is used with loads to reorder the data. I don't see
> > any way to get around the extra xxlor instructions and verify the
> > vec_or builtin test generates the instruction.
> >
>
> OK, I'm still curious how the loads cause the difference.
Yea, looks like there is something screwy going on. So, I started by
running the test:
make -j 1 && make check-gcc RUNTESTFLAGS="-v -v powerpc.exp=vsx-vector-6-func-2lop.c" > out
on makalu, P8 BE, and verified the test gives 7 passes and no failures.
On genoa, P8 LE, I also verified the test gives 7 passes and no
failures.
Then I went in and intentionally changed the expected counts down by one
for each platform. The idea was to verify that the
dg-final { scan-assembler-times {\mxxlor\M} } check was being called.
On makalu, I now get an error, as expected:
check_cached_effective_target be: returning 1 for unix
is-effective-target: be 1 <<<< NOTE BE
gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 32 times
FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 31
On genoa, I now get the error, as expected:
check_cached_effective_target le: returning 1 for unix
is-effective-target: le 1
gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 22 times
FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 21
So, running the tests, gcc definitely thinks there should be 22 xxlor
on LE and 32 on BE.
So, went to look at the assembly to verify my comment on the difference
being related to the loads. I decided to actually count the
instructions just to verify the number in the assembly files. Before,
I just looked at the assembly briefly but didn't dig in very deep.
If I compile the tests and dump the assembly with:
gcc -g -mcpu=power8 -o vsx-vector-6-func-2lop vsx-vector-6-func-2lop.c
objdump -S -d vsx-vector-6-func-2lop > vsx-vector-6-func-2lop.dump
grep xxlor vsx-vector-6-func-2lop.dump | wc
4 28 192
So we see 4 xxlor instructions, not 32 as expected for BE or 22 as
expected for LE as the test claims. I get the same count of 4 on both
makalu and on genoa. I like this approach because you can easily see
the relationship of the source and assembly. So, there seems to be
something screwy here as that is not even close to what the make
script/scan-assembler thinks the counts should be.
Segher never liked the above way of looking at the assembly. He
prefers:
gcc -S -g -mcpu=power8 -o vsx-vector-6-func-2lop.s vsx-vector-6-func-2lop.c
grep xxlor vsx-vector-6-func-2lop.s | wc
34 68 516
So, again, I get the same count of 34 on both makalu and genoa. But
again, that doesn't agree with what make script/scan-assembler thinks
the counts should be.
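One thing worth ruling out when comparing these numbers: "grep xxlor | wc" counts lines that contain the substring xxlor, while the {\mxxlor\M} pattern asks for xxlor as a whole word. The sketch below shows the two ways of counting on a small hypothetical sample (the four lines are made up for illustration, not taken from the real vsx-vector-6-func-2lop.s). Whether this accounts for any of the mismatch above is an assumption, not something verified against the testsuite.

```shell
#!/bin/sh
# Hypothetical sample assembly; NOT the real vsx-vector-6-func-2lop.s.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
    xxlor 0,12,0
    xxlorc 0,12,0
    xxlor 1,1,1
    xxlnor 0,0,0
EOF

# Lines containing the substring, i.e. what "grep PATTERN file | wc" reports first.
count_lines() { grep -c -- "$1" "$2"; }

# Whole-word occurrences, closer in spirit to the {\mxxlor\M} pattern.
count_words() { grep -o -w -- "$1" "$2" | wc -l | tr -d ' '; }

echo "lines containing xxlor:   $(count_lines xxlor "$sample")"
echo "whole-word xxlor matches: $(count_words xxlor "$sample")"
```

On this sample the substring count is 3 (it also picks up xxlorc) and the whole-word count is 2. Neither count matches xxlnor.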
When I looked at the vsx-vector-6-func-2lop.s I see on BE:
....
lxvd2x 0,10,9
xxlor 0,12,0
xxlnor 0,0,0
...
I was guessing that it was adjusting the data layout from the load.
But looking again more carefully versus LE:
....
lxvd2x 0,31,9
xxpermdi 0,0,0,2
xxlor 0,12,0
xxlnor 0,0,0
xxpermdi 0,0,0,2
....
the xxpermdi is probably what is really doing the data layout change.
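As a quick check on just the two excerpts above, the sketch below counts each mnemonic as a whole word in the BE and LE snippets. The snippets are copied from this message with the elided lines left out, so this says nothing about the full .s files; the helper name count_insn is made up for the example.

```shell
#!/bin/sh
# The BE and LE excerpts quoted in this message (elided lines not reproduced).
be=$(mktemp); le=$(mktemp)
trap 'rm -f "$be" "$le"' EXIT
cat > "$be" <<'EOF'
lxvd2x 0,10,9
xxlor 0,12,0
xxlnor 0,0,0
EOF
cat > "$le" <<'EOF'
lxvd2x 0,31,9
xxpermdi 0,0,0,2
xxlor 0,12,0
xxlnor 0,0,0
xxpermdi 0,0,0,2
EOF

# Count lines that hold the mnemonic as a whole word (hypothetical helper).
count_insn() { grep -c -w -- "$1" "$2"; }

for insn in xxlor xxlnor xxpermdi; do
  echo "$insn: BE=$(count_insn "$insn" "$be") LE=$(count_insn "$insn" "$le")"
done
```

In these excerpts xxlor and xxlnor each appear once on both BE and LE; only xxpermdi differs (0 on BE, 2 on LE), which fits the reading that xxpermdi, not xxlor, handles the layout change here.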
So, we have the issue that looking at the assembly gives different
instruction counts than what
dg-final { scan-assembler-times {\mxxlor\M} }
comes up with??? Now I am really confused. I don't know how
scan-assembler-times works but I will go see if I can find it and see
if I can figure out what the issue is. I would expect that
scan-assembler is working off the -save-temps files, which get deleted
as part of the run. I would guess that scan-assembler does a grep to
find the instructions and then maybe uses wc to count them??? I will go
see if I can figure out how scan-assembler-times works.
Carl