public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction
@ 2011-06-01 20:17 oleg.endo@t-online.de
  2011-06-01 20:42 ` [Bug target/49263] " oleg.endo@t-online.de
                   ` (44 more replies)
  0 siblings, 45 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-06-01 20:17 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

           Summary: SH Target: underutilized "TST #imm, R0" instruction
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: oleg.endo@t-online.de


Created attachment 24411
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24411
test for various integer types and constant values 0...255

The "TST #imm, R0" instruction is a bit underutilized on SH targets. For some
bit patterns of the immediate constant it tries to extract the bits in question
by various means and test the result against zero/non-zero and misses the
straight forward instruction.
In particular: 
* one single bit 
* n contiguous bits starting at bit 0

When testing a byte against 0x80 it uses "CMP/PZ", which is as good as a "TST"
instruction (in terms of costs), so this is OK. 

I've spotted this in version around 4.4. It still happens with 4.6 and the
following config:

Using built-in specs.
COLLECT_GCC=sh-elf-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.6.1/lto-wrapper
Target: sh-elf
Configured with: ../gcc-4.6-20110527/configure --target=sh-elf
--prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp
--without-headers --disable-nls --disable-werror --enable-lto --with-newlib
--with-gnu-as --with-gnu-ld --with-system-zlib
Thread model: single
gcc version 4.6.1 20110527 (prerelease) (GCC)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
@ 2011-06-01 20:42 ` oleg.endo@t-online.de
  2011-06-12 23:12 ` kkojima at gcc dot gnu.org
                   ` (43 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-06-01 20:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #1 from Oleg Endo <oleg.endo@t-online.de> 2011-06-01 20:42:00 UTC ---
Created attachment 24412
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24412
Proposed Patch

Although the patch gets the job done, programmer's sense tells me it is fishy,
or at least pretty much brute forced cure of the symptoms, not the cause. It's
my first GCC patch, so any feedback is highly appreciated.

What I did was looking at the RTL, in particular the combine pass, identifying
patterns it failed to find a "shortcut" (tst insn) for and adding them to the
insn descriptions. 

I also had to expand the costs calculation of the AND instruction to cover AND,
OR and XOR (on SH they are the same anyways), or else the cost of a matched
replacement insn would result in a rejection in the combine pass.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
  2011-06-01 20:42 ` [Bug target/49263] " oleg.endo@t-online.de
@ 2011-06-12 23:12 ` kkojima at gcc dot gnu.org
  2011-06-19 16:42 ` oleg.endo@t-online.de
                   ` (42 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-06-12 23:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Kazumoto Kojima <kkojima at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |sh*-*-*

--- Comment #2 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-06-12 23:11:19 UTC ---
It looks that playing with the internal behavior of the combine
pass is a bit fragile.  Perhaps an alternative way is to define
a peephole for tst #imm8,r0, something like:

;; A peephole for the TST immediate instruction.

(define_peephole2
  [(set (match_operand:SI 2 "arith_reg_operand" "r")
    (and:SI (match_operand:SI 0 "arith_reg_operand" "z")
        (match_operand:SI 1 "const_int_operand" "i")))
   (set (reg:SI T_REG) (eq:SI (match_dup 2) (const_int 0)))]
  "TARGET_SH1
   && peep2_reg_dead_p (2, operands[2])
   && (satisfies_constraint_I08 (operands[1])
       || satisfies_constraint_K08 (operands[1]))"
  [(set (reg:SI T_REG)
    (eq:SI (and:SI (match_dup 0) (match_dup 1)) (const_int 0)))]
  "
{
  operands[1] = GEN_INT (INTVAL (operands[1]) & 0xff);
}")

which will work at -O2 or -fpeephole2, though there are pros
and cons of peephole approach.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
  2011-06-01 20:42 ` [Bug target/49263] " oleg.endo@t-online.de
  2011-06-12 23:12 ` kkojima at gcc dot gnu.org
@ 2011-06-19 16:42 ` oleg.endo@t-online.de
  2011-06-22 22:34 ` kkojima at gcc dot gnu.org
                   ` (41 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-06-19 16:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #3 from Oleg Endo <oleg.endo@t-online.de> 2011-06-19 16:42:01 UTC ---
Thanks for having a look at it Kaz.
Yes, the fiddling with the combine pass is fragile. I've tried out your
peephole idea. It works in most cases but not all the time. E.g. it doesn't
catch the following example:

int test_imm (int i)
{
  return (i & 3) ? 20 : 40;
}


    mov    #3,r1
    and    r1,r4
    tst    r4,r4
    bt/s    .L5
    mov    #40,r0
    mov    #20,r0
.L5:
    rts    
    nop


Also the following... 

int test_imm (int* i)
{
   return (*i & 255) ? 20 : 40;
}

becomes on little endian (OK):
    mov.b    @r4,r1
    tst    r1,r1
    bt/s    .L10
    mov    #40,r0
    mov    #20,r0
.L10:
    rts    
    nop

but on big endian (NG):
    mov.l    @r4,r1
    extu.b    r1,r1
    tst    r1,r1
    bt/s    .L10
    mov    #40,r0
    mov    #20,r0
.L10:
    rts    
    nop


I'll give the thing another try.

Regarding the change to the andcosts function, the following should be better:

--- sh.c    (revision 175188)
+++ sh.c    (working copy)
@@ -243,7 +243,7 @@
 static int flow_dependent_p (rtx, rtx);
 static void flow_dependent_p_1 (rtx, const_rtx, void *);
 static int shiftcosts (rtx);
-static int andcosts (rtx);
+static int and_xor_ior_costs (rtx, int code);
 static int addsubcosts (rtx);
 static int multcosts (rtx);
 static bool unspec_caller_rtx_p (rtx);
@@ -2841,14 +2841,14 @@
     return shift_insns[value];
 }

-/* Return the cost of an AND operation.  */
+/* Return the cost of an AND/XOR/IOR operation.  */

 static inline int
-andcosts (rtx x)
+and_xor_ior_costs (rtx x, int code)
 {
   int i;

-  /* Anding with a register is a single cycle and instruction.  */
+  /* register x register is a single cycle instruction.  */
   if (!CONST_INT_P (XEXP (x, 1)))
     return 1;

@@ -2864,17 +2864,17 @@
     }

   /* These constants are single cycle extu.[bw] instructions.  */
-  if (i == 0xff || i == 0xffff)
+  if ((i == 0xff || i == 0xffff) && code == AND)
     return 1;
-  /* Constants that can be used in an and immediate instruction in a single
+  /* Constants that can be used in an instruction with an immediate are a
single
      cycle, but this requires r0, so make it a little more expensive.  */
   if (CONST_OK_FOR_K08 (i))
     return 2;
-  /* Constants that can be loaded with a mov immediate and an and.
+  /* Constants that can be loaded with a mov immediate need one more cycle.
      This case is probably unnecessary.  */
   if (CONST_OK_FOR_I08 (i))
     return 2;
-  /* Any other constants requires a 2 cycle pc-relative load plus an and.
+  /* Any other constant requires an additional 2 cycle pc-relative load.
      This case is probably unnecessary.  */
   return 3;
 }
@@ -3043,7 +3043,9 @@
       return true;

     case AND:
-      *total = COSTS_N_INSNS (andcosts (x));
+    case XOR:
+    case IOR:
+      *total = COSTS_N_INSNS (and_xor_ior_costs (x, code));
       return true;

     case MULT:


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (2 preceding siblings ...)
  2011-06-19 16:42 ` oleg.endo@t-online.de
@ 2011-06-22 22:34 ` kkojima at gcc dot gnu.org
  2011-06-26 22:31 ` oleg.endo@t-online.de
                   ` (40 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-06-22 22:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #4 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-06-22 22:34:04 UTC ---
Yes, that peephole doesn't catch all the patterns which could
make tst #imm8,r0 use.  Perhaps it would be a good idea to get
numbers for the test like CSiBE test with the vanilla and new
insns/peepholes patched compilers.  Something covers 80% of
the possible cases in the usual working set, it would be enough
successful for such a micro-optimization, I guess.

Cost patch looks fine to me.  Could you propose it as a separate
patch on gcc-patches list with an appropriate ChangeLog entry?
When proposing it, please refer how you've tested it.  Also
the numbers got with the patch are highly welcome.

BTW, do you have FSF copyright assignment for your GCC work?
Although the cost patch itself is essentially several lines which
doesn't require copyright assignment, the other changes you've
proposed clearly require the paper work, I think.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (3 preceding siblings ...)
  2011-06-22 22:34 ` kkojima at gcc dot gnu.org
@ 2011-06-26 22:31 ` oleg.endo@t-online.de
  2011-06-27  5:15 ` kkojima at gcc dot gnu.org
                   ` (39 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-06-26 22:31 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #5 from Oleg Endo <oleg.endo@t-online.de> 2011-06-26 22:30:05 UTC ---
> Yes, that peephole doesn't catch all the patterns which could
> make tst #imm8,r0 use.  Perhaps it would be a good idea to get
> numbers for the test like CSiBE test with the vanilla and new
> insns/peepholes patched compilers.  Something covers 80% of
> the possible cases in the usual working set, it would be enough
> successful for such a micro-optimization, I guess.

I'd also like to see some numbers on those micro-improvements. I'll have a look
at CSiBE.
Anyway, why not just add all the currently known-to-work cases? What are your
concerns regarding that? I can imagine that it is a maintenance burden to keep
all those definitions and special cases in the MD up-to-date (bit rot etc). Do
you have anything other than that in mind? 

It seems that others have been trying to solve the same problem in a very
similar way: http://patchwork.ozlabs.org/patch/58832/ ;)

I've figured that the following pattern works quite well for this particular
case. It seems to catch all the bit patterns. (sh_tst_const_ok and
sh_zero_extract_val are added functions in sh.c)

(define_insn_and_split "*tstsi3"
  [(set (reg:SI T_REG)
    (zero_extract:SI (match_operand:SI 0 "arith_reg_operand" "")
            (match_operand:SI 1 "const_int_operand")
            (match_operand:SI 2 "const_int_operand")))]
  "TARGET_SH1 && sh_tst_const_ok (sh_zero_extract_val (operands[1],
operands[2]))"
  "#"
  "&& 1"
  [(const_int 0)]
  "{
    int tstval = sh_zero_extract_val (operands[1], operands[2]);
    emit_insn (gen_tstsi (operands[0], GEN_INT (tstval)));
    DONE;
  }"
)

...however, the problem with that one is that the T bit is inverted afterwards,
and thus the following branch logic (or whatever) gets inverted as well. One
option could be to emit some T bit inversion insn after gen_tstsi and then
optimize it away later on with e.g. a peephole (?), but I haven't tried that
yet.

The actual "problem" is that the combine pass first sees the andsi, eq, ...
insns and correctly matches them to those in the MD. But instead of continuing
to match patterns from the MD it expands the and insn into something like
zero_extract, which in turn will never be combined back into the tst insn.
Isn't there a way to tell the combine pass not to do so, but instead first look
deeper at what is in the MD?


Regarding the peephole:

> (satisfies_constraint_I08 (operands[1])
>       || satisfies_constraint_K08 (operands[1])

I guess this might generate wrong code for e.g. "if (x & -2)". When x has any
bits[31:1] set this must return true. The code after the peephole optimization
will look only at the lower 8 bits and would possibly return false for x =
0xFF000000, which is wrong. So it should be satisfies_constraint_K08 only,
shouldn't it?


> 
> Cost patch looks fine to me.  Could you propose it as a separate
> patch on gcc-patches list with an appropriate ChangeLog entry?
> When proposing it, please refer how you've tested it.  Also
> the numbers got with the patch are highly welcome.
> 
> BTW, do you have FSF copyright assignment for your GCC work?
> Although the cost patch itself is essentially several lines which
> doesn't require copyright assignment, the other changes you've
> proposed clearly require the paper work, I think.

I'll contact you directly regarding that.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (4 preceding siblings ...)
  2011-06-26 22:31 ` oleg.endo@t-online.de
@ 2011-06-27  5:15 ` kkojima at gcc dot gnu.org
  2011-10-09 23:35 ` oleg.endo@t-online.de
                   ` (38 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-06-27  5:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #6 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-06-27 05:14:36 UTC ---
(In reply to comment #5)
> Anyway, why not just add all the currently known-to-work cases? What are your
> concerns regarding that? I can imagine that it is a maintenance burden to keep
> all those definitions and special cases in the MD up-to-date (bit rot etc). Do
> you have anything other than that in mind? 

Yep, maintenance burden but I don't mean ack/nak for anything.
If it's enough fruitful, we should take that route.  When it
gives 5% improvement in the usual working set like as CSiBE,
hundreds lines would be OK, but if it's ~0.5% or less, it doesn't
look worth to add many patterns for that.

> Isn't there a way to tell the combine pass not to do so, but instead first look
> deeper at what is in the MD?

I don't know how to do it cleanly.

> I guess this might generate wrong code for e.g. "if (x & -2)". When x has any
> bits[31:1] set this must return true. The code after the peephole optimization
> will look only at the lower 8 bits and would possibly return false for x =
> 0xFF000000, which is wrong. So it should be satisfies_constraint_K08 only,
> shouldn't it?

You are right.  That peephole was simply 'something like this'.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (5 preceding siblings ...)
  2011-06-27  5:15 ` kkojima at gcc dot gnu.org
@ 2011-10-09 23:35 ` oleg.endo@t-online.de
  2011-10-10  1:32 ` kkojima at gcc dot gnu.org
                   ` (37 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-10-09 23:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #7 from Oleg Endo <oleg.endo@t-online.de> 2011-10-09 23:34:45 UTC ---
(In reply to comment #6)

> Yep, maintenance burden but I don't mean ack/nak for anything.
> If it's enough fruitful, we should take that route.  When it
> gives 5% improvement in the usual working set like as CSiBE,
> hundreds lines would be OK, but if it's ~0.5% or less, it doesn't
> look worth to add many patterns for that.
> 
> > Isn't there a way to tell the combine pass not to do so, but instead first look
> > deeper at what is in the MD?
> 
> I don't know how to do it cleanly.
> 

I've tried out a couple of things and got some CSiBE numbers based on 
trunk rev 179430. Unfortunately only code size comparisons, no run time 
performance numbers. The tests were compiled with 
-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return

Option 1)
  Use many (~10) patterns in the MD and some cost calculation tuning.
  The last patch required some adaptation, because the combine pass 
  started trying to match things slightly differently. I've also 
  noticed that it requires a special case for one pattern on SH4A...

  size of all modules: 2916390 -> 2909026    -7364 / -0.252504 %
  avg difference over all modules: -409.111111 / -0.273887 %
  max: compiler       22808 -> 22812           +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97804        -1316 / -1.327684 %

Option 2)
  I've added another combine pass which has the make_compound_operation
  function turned off. The make_compound_operation function is used to
  produce zero_extract patterns. If the resulting "simplified" pattern does
  not match anything in the MD, combine reverts the transformation and 
  proceeds with the next insn. That way, it never tries to match the 
  tst #imm pattern in the MD. 
  With this option only ~5 patterns seem to be required and a small
  extension of the costs calculation.

  size of all modules:  2916390 -> 2909170    -7220 / -0.247566 %
  avg difference over all modules: -401.111111 / -0.254423 %
  max: compiler       22808 -> 22812           +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97836        -1284 / -1.295400 %

Not so spectacular on average. It highly depends on the type of SW being
compiled, but it hits quite a lot of files in CSiBE.

Option 2 seems more robust even if it seems less effective, what do you think?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (6 preceding siblings ...)
  2011-10-09 23:35 ` oleg.endo@t-online.de
@ 2011-10-10  1:32 ` kkojima at gcc dot gnu.org
  2011-10-10 23:48 ` oleg.endo@t-online.de
                   ` (36 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-10-10  1:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #8 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-10-10 01:31:42 UTC ---
(In reply to comment #7)
> Option 2 seems more robust even if it seems less effective, what do you think?

Another combine pass to reduce size less than 0.3% on one target
would be not acceptable, I guess.  ~10 new patterns would be
overkill for that result, though I'm still expecting that a few
patterns of them were dominant.  Could you get numbers which pattern
was used in the former option?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (7 preceding siblings ...)
  2011-10-10  1:32 ` kkojima at gcc dot gnu.org
@ 2011-10-10 23:48 ` oleg.endo@t-online.de
  2011-10-11  1:47 ` kkojima at gcc dot gnu.org
                   ` (35 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-10-10 23:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #9 from Oleg Endo <oleg.endo@t-online.de> 2011-10-10 23:48:17 UTC ---
Created attachment 25461
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25461
CSiBE comparisons

(In reply to comment #8)
> 
> Another combine pass to reduce size less than 0.3% on one target
> would be not acceptable, I guess.  

I'm sorry, I forgot to mention that it was just a proof of concept hack
of mine, just to see whether it has any chance to work at all.
I think it would be better to change/fix the behavior of the combine pass
in this regard, so that it tries matching combined patterns without
sophisticated transformations. I will try asking on the gcc list about that.

> ~10 new patterns would be
> overkill for that result, though I'm still expecting that a few
> patterns of them were dominant. 

Yep, even if it turned out to be actually only 8 patterns in total, but
still.. I haven't looked at the issue with SH4A but most likely it would add
another one or two patterns... so basically ~10 :)

> Could you get numbers which pattern
> was used in the former option?

I think it would be a bit too much checking out each individual pattern.
Instead I grouped them into what they are effectively doing.
While I was at it, I also added your peephole idea, and a top 10 listing of
the individual files.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (8 preceding siblings ...)
  2011-10-10 23:48 ` oleg.endo@t-online.de
@ 2011-10-11  1:47 ` kkojima at gcc dot gnu.org
  2011-10-13 22:55 ` oleg.endo@t-online.de
                   ` (34 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-10-11  1:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #10 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-10-11 01:47:03 UTC ---
(In reply to comment #9)
> 3) only zero_extract special cases

looks to be dominant.

> I'm sorry, I forgot to mention that it was just a proof of concept hack
> of mine, just to see whether it has any chance to work at all.
> I think it would be better to change/fix the behavior of the combine pass
> in this regard, so that it tries matching combined patterns without
> sophisticated transformations. I will try asking on the gcc list about that.

I see.  I also expect that the experts have some idea for
this issue.

> I think it would be a bit too much checking out each individual pattern.

I don't think that it's too much.  Those numbers can be easily
collected for CSiBE.  If your patterns are named, you could
simply add "-dap -save-temps" to the compiler option which is
specified when ruining CSiBE's create-config and then get
the occurrences of testsi_6, for example, with something like
  grep "testsi_6" `find . -name "*.s" -print` | wc -l
after running the CSiBE size test.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (9 preceding siblings ...)
  2011-10-11  1:47 ` kkojima at gcc dot gnu.org
@ 2011-10-13 22:55 ` oleg.endo@t-online.de
  2011-10-14 23:06 ` kkojima at gcc dot gnu.org
                   ` (33 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-10-13 22:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Oleg Endo <oleg.endo@t-online.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #24412|0                           |1
        is obsolete|                            |

--- Comment #11 from Oleg Endo <oleg.endo@t-online.de> 2011-10-13 22:54:54 UTC ---
Created attachment 25491
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25491
Proposed patch including test case


> > 3) only zero_extract special cases
> 
> looks to be dominant.

Yes.  I've briefly looked through the test sources.  A popular use case
for bit test instructions seem to be single bit tests, which the patch
is basically adding. 

> I see.  I also expect that the experts have some idea for
> this issue.

Hm .. http://gcc.gnu.org/ml/gcc/2011-10/msg00189.html
Eric pointed me to the i386 back-end.  Unfortunately, what I found there is
where I originally started...

;; Combine likes to form bit extractions for some tests.  Humor it.

I.e. it is also coded against the behavior of the combine pass with a bunch
of pattern variations.  I guess that's the way it's supposed to be done then :T

> I don't think that it's too much.  Those numbers can be easily
> collected for CSiBE.  If your patterns are named, you could
> simply add "-dap -save-temps" to the compiler option which is
> specified when ruining CSiBE's create-config and then get
> the occurrences of testsi_6, for example, with something like
>   grep "testsi_6" `find . -name "*.s" -print` | wc -l
> after running the CSiBE size test.

Ah, right!  With the attached latest patch applied to trunk rev 179778
the numbers for 
  "-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd
-freg-struct-return"
look something like that:

tstsi_t: 1391
tsthi_t: 4
tstqi_t: 23
tstqi_t_zero: 667
tstsi_t_and_not: 598
tstsi_t_zero_extract_eq: 70
tstsi_t_zero_extract_xor: 923

Notice that the split contributes to the tstsi_t number.
Also, the 3 patterns 
  tstsi_t_zero_extract_xor
  tstsi_t_zero_extract_subreg_xor_little
  tstsi_t_zero_extract_subreg_xor_big

are basically one and the same. On SH4A the subreg variants are required,
because tstsi_t_zero_extract_xor will never match.

I've also added a special case to sh_rtx_costs to detect at least the tstsi_t
pattern. However, the other patterns are not really covered by that and the 
combine pass calculates the cost as a sum of all the operations of the pattern.
I guess the selection of the test instruction could be stimulated a bit more 
by a more accurate costs calculation, but my feeling is that it won't do a lot.


Cheers,
Oleg


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (10 preceding siblings ...)
  2011-10-13 22:55 ` oleg.endo@t-online.de
@ 2011-10-14 23:06 ` kkojima at gcc dot gnu.org
  2011-10-15  2:33 ` kkojima at gcc dot gnu.org
                   ` (32 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-10-14 23:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #12 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-10-14 23:06:06 UTC ---
(In reply to comment #11)
> Created attachment 25491 [details]
> Proposed patch including test case

Looks fine.  A very minor style nits:

> +      if (GET_CODE (XEXP (x, 0)) == AND		/* tst instruction.  */

This comment looks a bit bogus.  A full sentence comment would
be better.

> +
> +

There are some extra empty lines.  GNU/GCC coding style says
that only one empty line is needed.  I know that there are
extra empty lines already, but we should not add new ones :-)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (11 preceding siblings ...)
  2011-10-14 23:06 ` kkojima at gcc dot gnu.org
@ 2011-10-15  2:33 ` kkojima at gcc dot gnu.org
  2011-11-20 14:20 ` oleg.endo@t-online.de
                   ` (31 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: kkojima at gcc dot gnu.org @ 2011-10-15  2:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #13 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-10-15 02:32:56 UTC ---
Author: kkojima
Date: Sat Oct 15 02:32:53 2011
New Revision: 180020

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180020
Log:
    PR target/49263
    * config/sh/sh.h (ZERO_EXTRACT_ANDMASK): New macro.
    * config/sh/sh.c (sh_rtx_costs): Add test instruction case.
    * config/sh/sh.md (tstsi_t): Name existing insn.  Make inner
    and instruction commutative.
    (tsthi_t, tstqi_t, tstqi_t_zero, tstsi_t_and_not,
    tstsi_t_zero_extract_eq, tstsi_t_zero_extract_xor,
    tstsi_t_zero_extract_subreg_xor_little,
    tstsi_t_zero_extract_subreg_xor_big): New insns.
    (*movsicc_t_false, *movsicc_t_true): Replace space with tab in
    asm output.
    (*andsi_compact): Reorder alternatives so that K08 is considered
    first.
    * gcc.target/sh/pr49263.c: New.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.h
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (12 preceding siblings ...)
  2011-10-15  2:33 ` kkojima at gcc dot gnu.org
@ 2011-11-20 14:20 ` oleg.endo@t-online.de
  2011-12-29  1:09 ` oleg.endo@t-online.de
                   ` (30 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-11-20 14:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #14 from Oleg Endo <oleg.endo@t-online.de> 2011-11-20 13:04:39 UTC ---
With trunk rev 181517 I have observed the following problem, which happens when
compiling for -m2*, -m3*, -m4* and -Os:

The function compiled with -m2 -Os

int test_int64_lowword (long long* x)
{
  return *x & 0xFFFFFFFF ? -20 : -40;
}

results in the following code:

    mov    #0,r2    ! 42    movsi_i/3    [length = 2]
    tst    r2,r2    ! 44    cmpeqsi_t/1    [length = 2]
    bf.s    .L4    ! 45    branch_false    [length = 2]
    mov.l    @(4,r4),r3    ! 12    movsi_i/5    [length = 2]
    tst    r3,r3    ! 46    cmpeqsi_t/1    [length = 2]
.L4:
    bt    .L3    ! 14    branch_true    [length = 2]
    mov    #-20,r0    ! 4    movsi_i/3    [length = 2]
    rts    
    nop    ! 52    *return_i    [length = 4]
.L3:
    rts        ! 54    *return_i    [length = 2]
    mov    #-40,r0    ! 5    movsi_i/3    [length = 2]

When compiled with -O1 or -O2 the high word test is completely eliminated
correctly:

    mov.l    @(4,r4),r1    ! 10    movsi_i/5    [length = 2]
    tst    r1,r1    ! 17    cmpeqsi_t/1    [length = 2]
    bt    .L4    ! 18    branch_true    [length = 2]
    mov    #-20,r0    ! 4    movsi_i/3    [length = 2]
    rts    
    nop    ! 50    *return_i    [length = 4]
.L4:
    rts        ! 52    *return_i    [length = 2]
    mov    #-40,r0    ! 5    movsi_i/3    [length = 2]

I'm not sure whether this is actually related to this PR, but have noticed it
with the test cases of this PR.  It seems only to happen for comparison-like
insns.  If I remember correctly, this problem did not exist when I started
working on this PR.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (13 preceding siblings ...)
  2011-11-20 14:20 ` oleg.endo@t-online.de
@ 2011-12-29  1:09 ` oleg.endo@t-online.de
  2012-02-26 16:28 ` olegendo at gcc dot gnu.org
                   ` (29 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: oleg.endo@t-online.de @ 2011-12-29  1:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #15 from Oleg Endo <oleg.endo@t-online.de> 2011-12-29 00:34:53 UTC ---
(In reply to comment #14)
> With trunk rev 181517 I have observed the following problem, which happens when
> compiling for -m2*, -m3*, -m4* and -Os:
> 

This is still present as of rev 182713 and seems to be a different issue.
I've created PR51697 for it.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (14 preceding siblings ...)
  2011-12-29  1:09 ` oleg.endo@t-online.de
@ 2012-02-26 16:28 ` olegendo at gcc dot gnu.org
  2012-02-26 23:29 ` olegendo at gcc dot gnu.org
                   ` (28 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-02-26 16:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #16 from olegendo at gcc dot gnu.org 2012-02-26 13:31:36 UTC ---
Author: olegendo
Date: Sun Feb 26 13:31:32 2012
New Revision: 184585

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184585
Log:
    PR target/49263
    * gcc.target/sh/pr49263.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
Modified:
    trunk/gcc/testsuite/ChangeLog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (15 preceding siblings ...)
  2012-02-26 16:28 ` olegendo at gcc dot gnu.org
@ 2012-02-26 23:29 ` olegendo at gcc dot gnu.org
  2012-08-27 19:52 ` olegendo at gcc dot gnu.org
                   ` (27 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-02-26 23:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

olegendo at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |olegendo at gcc dot gnu.org
         Resolution|                            |FIXED

--- Comment #17 from olegendo at gcc dot gnu.org 2012-02-26 23:26:28 UTC ---
I guess this one is done now.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (16 preceding siblings ...)
  2012-02-26 23:29 ` olegendo at gcc dot gnu.org
@ 2012-08-27 19:52 ` olegendo at gcc dot gnu.org
  2012-10-28 22:02 ` olegendo at gcc dot gnu.org
                   ` (26 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-08-27 19:52 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
   Last reconfirmed|                            |2012-08-27
         Resolution|FIXED                       |
         AssignedTo|unassigned at gcc dot       |olegendo at gcc dot gnu.org
                   |gnu.org                     |
     Ever Confirmed|0                           |1

--- Comment #18 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-08-27 19:51:42 UTC ---
Not quite so done, actually.  The following example case does not work
properly:


void test00 (int* x, int xb)
{
  if (! (xb & 128))
    x[0] = 0;
}

-O2 -m4:
        mov     r5,r0
        and     #128,r0
        tst     r0,r0
        bf      .L3
        mov.l   r0,@r4
.L3:
        rts    
        nop



void test01 (int* x, int xb)
{
  if (! (xb & 0x55))
    x[0] = 0;
}

-O2 -m4:
        mov     r5,r0
        and     #85,r0
        tst     r0,r0
        bf      .L7
        mov.l   r0,@r4
.L7:
        rts
        nop


It seems that combine is trying to look for the following patterns:

Failed to match this instruction:
(set (pc)
    (if_then_else (ne (zero_extract:SI (reg:SI 5 r5 [ xb ])
                (const_int 1 [0x1])
                (const_int 7 [0x7]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))


Failed to match this instruction:
(set (pc)
    (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
                (const_int 85 [0x55]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))



Another case that could see some improvement ...

bool test04 (int* x, int xb)
{
  return ! (xb & 0x55);
}

-O2 -m4 (OK):
        mov     r5,r0
        tst     #85,r0
        rts
        movt    r0


bool test02 (int* x, int xb)
{
  return ! (xb & (1 << 6));
}

-O2 -m4 (NG):
        mov     r5,r0
        mov     #-6,r1
        shld    r1,r0
        xor     #1,r0
        rts
        and     #1,r0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (17 preceding siblings ...)
  2012-08-27 19:52 ` olegendo at gcc dot gnu.org
@ 2012-10-28 22:02 ` olegendo at gcc dot gnu.org
  2012-10-31 13:47 ` olegendo at gcc dot gnu.org
                   ` (25 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-10-28 22:02 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #19 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-10-28 22:01:47 UTC ---
Another thing I've noticed...
Cases such as:

        mov.l   r0,@r2        ! LS
        mov     r13,r0        ! MT
        and     #7,r0         ! EX
        tst     r0,r0         ! MT
        bt/s    .L8           ! BR
        mov.l   r0,@(16,r1)

where the result of the and op is re-used would be slightly better as:

        mov.l   r0,@r2       ! LS
        mov     r13,r0       ! MT
        tst     #7,r0        ! MT
        and     #7,r0        ! EX
        bt/s    .L8          ! BR
        mov.l   r0,@(16,r1)

because it reduces dependency on the result of the and op and thus has a higher
chance to be executed in parallel.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (18 preceding siblings ...)
  2012-10-28 22:02 ` olegendo at gcc dot gnu.org
@ 2012-10-31 13:47 ` olegendo at gcc dot gnu.org
  2013-12-08 13:47 ` olegendo at gcc dot gnu.org
                   ` (24 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-10-31 13:47 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #20 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-10-31 13:47:07 UTC ---
(In reply to comment #19)
> Another thing I've noticed...
> Cases such as:
> 
>         mov.l   r0,@r2        ! LS
>         mov     r13,r0        ! MT
>         and     #7,r0         ! EX
>         tst     r0,r0         ! MT
>         bt/s    .L8           ! BR
>         mov.l   r0,@(16,r1)
> 
> where the result of the and op is re-used would be slightly better as:
> 
>         mov.l   r0,@r2       ! LS
>         mov     r13,r0       ! MT
>         tst     #7,r0        ! MT
>         and     #7,r0        ! EX
>         bt/s    .L8          ! BR
>         mov.l   r0,@(16,r1)
> 
> because it reduces dependency on the result of the and op and thus has a higher
> chance to be executed in parallel.

Other similar cases where hoisting the tst insn would make sense:
        mov.b   @(13,r2),r0
        extu.b  r0,r0
        tst     #1,r0
        bt      .L53

        mov.b   @(13,r1),r0
        extu.b  r0,r0
        tst     r0,r0
        bt/s    .L37


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (19 preceding siblings ...)
  2012-10-31 13:47 ` olegendo at gcc dot gnu.org
@ 2013-12-08 13:47 ` olegendo at gcc dot gnu.org
  2013-12-17 12:37 ` olegendo at gcc dot gnu.org
                   ` (23 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-08 13:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #21 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Oleg Endo from comment #18)
> 
> It seems that combine is trying to look for the following patterns:
> 
> Failed to match this instruction:
> (set (pc)
>     (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
>                 (const_int 85 [0x55]))
>             (const_int 0 [0]))
>         (label_ref:SI 15)
>         (pc)))

Implementing such a combine pattern like ...
(define_insn_and_split "*tst_cbranch"
  [(set (pc)
    (if_then_else (ne (and:SI (match_operand:SI 0 "logical_operand")
                  (match_operand:SI 1 "const_int_operand"))
              (const_int 0))
              (label_ref (match_operand 2))
              (pc)))
   (clobber (reg:SI T_REG))]
  "TARGET_SH1"
  "#"
  "&& 1"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 0) (match_dup 1))
                  (const_int 0)))
   (set (pc) (if_then_else (eq (reg:SI T_REG) (const_int 0))
               (label_ref (match_dup 2))
               (pc)))])


results in code such as following code:
        mov     #33,r1
        mov     r5,r0
        tst     #33,r0
        bf/s    .L3
        and     r5,r1
        mov.l   r1,@r4
.L3:
        rts
        nop

which is worse.
What happens is that the sequence is expanded to RTL as follows:

(insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
        (and:SI (reg/v:SI 162 [ xb ])
            (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
     (nil))
(insn 8 7 9 2 (set (reg:SI 147 t)
        (eq:SI (reg:SI 163 [ D.1856 ])
            (const_int 0 [0]))) sh_tmp.cpp:17 -1
     (nil))
(jump_insn 9 8 10 2 (set (pc)
        (if_then_else (eq (reg:SI 147 t)
                (const_int 0 [0]))
            (label_ref:SI 15)
            (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
     (int_list:REG_BR_PROB 3900 (nil))
 -> 15)
(note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 11 10 12 4 (set (reg:SI 164)
        (const_int 0 [0])) sh_tmp.cpp:18 -1
     (nil))
(insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 164)) sh_tmp.cpp:18 -1
     (nil))


and the cse1 pass decides that the result of the and operation can be shared
and replaces the operand in insn 12 with reg:SI 163:

(insn 12 11 15 3 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 163 [ D.1856 ])) sh_tmp.cpp:18 258 {movsi_ie}
     (expr_list:REG_DEAD (reg:SI 164)
        (expr_list:REG_DEAD (reg/v/f:SI 161 [ x ])
            (nil))))

and insn 11 becomes dead code and is eliminated.
All of that happens long time before combine, so the tst combine patterns have
no chance to reconstruct the original code.

A sequence such as

        mov     r5,r0
        mov     #0,r1
        tst     #33,r0
        bf      .L3
        mov.l   r1,@r4
.L3:
        rts
        nop

could probably be achieved by combining insn 7 and insn 8 shortly after RTL
expansion, or even during the expansion of insn 8 (by looking at previous
already expanded insns and emitting a tst insn directly).
The idea would be to reduce dependencies on the tested register which allows
better scheduling.  In addition to that, on SH4A "mov #imm8,Rn" is an MT group
instruction which has a higher probability of being executed in parallel.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (20 preceding siblings ...)
  2013-12-08 13:47 ` olegendo at gcc dot gnu.org
@ 2013-12-17 12:37 ` olegendo at gcc dot gnu.org
  2014-12-30 18:45 ` olegendo at gcc dot gnu.org
                   ` (22 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-17 12:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #22 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Oleg Endo from comment #21)
> What happens is that the sequence is expanded to RTL as follows:
> 
> (insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
>         (and:SI (reg/v:SI 162 [ xb ])
>             (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
>      (nil))
> (insn 8 7 9 2 (set (reg:SI 147 t)
>         (eq:SI (reg:SI 163 [ D.1856 ])
>             (const_int 0 [0]))) sh_tmp.cpp:17 -1
>      (nil))
> (jump_insn 9 8 10 2 (set (pc)
>         (if_then_else (eq (reg:SI 147 t)
>                 (const_int 0 [0]))
>             (label_ref:SI 15)
>             (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
>      (int_list:REG_BR_PROB 3900 (nil))
>  -> 15)
> (note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 11 10 12 4 (set (reg:SI 164)
>         (const_int 0 [0])) sh_tmp.cpp:18 -1
>      (nil))
> (insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
>         (reg:SI 164)) sh_tmp.cpp:18 -1
>      (nil))
> 
> 
> and insn 11 becomes dead code and is eliminated.
> All of that happens long time before combine, so the tst combine patterns
> have no chance to reconstruct the original code.
> 

Adding an early peephole pass as described in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59533#c2 and then adding the
following peephole:

;; Peephole after initial expansion.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
    (and:SI (match_operand:SI 1 "arith_reg_operand")
        (match_operand:SI 2 "logical_operand")))
   (set (reg:SI T_REG) (eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && can_create_pseudo_p ()"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 1) (match_dup 2))
                  (const_int 0)))
   (set (match_dup 0) (and:SI (match_dup 1) (match_dup 2)))])

... fixes the problem and results in more uses of the tst #imm,r0 insn
according to the CSiBE set.  On the other hand there is a total code size
increase of 792 bytes on the whole set.  Below are some things that get worse
in the Linux source (mm/filemap.c):

        mov.b   @(15,r1),r0    ->    mov.b   @(15,r1),r0
        cmp/pz  r0                   tst     #128,r0     // cmp/pz has less
        bf      .L1016               bf      .L1001      // pressure on r0


        mov.b   @(15,r0),r0     ->   mov.b   @(15,r0),r0
        tst     #4,r0                shar    r0
        bf      .L107                shar    r0
                                     tst     #1,r0


        add     #16,r0          ->   add     #16,r0
        mov.b   @(15,r0),r0          mov.b   @(15,r0),r0
        tst     #16,r0               mov     #-4,r1
        bf/s    .L509                shad    r1,r0
                                     tst     #1,r0
                                     bf/s    .L509


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (21 preceding siblings ...)
  2013-12-17 12:37 ` olegendo at gcc dot gnu.org
@ 2014-12-30 18:45 ` olegendo at gcc dot gnu.org
  2015-01-24 13:05 ` olegendo at gcc dot gnu.org
                   ` (21 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2014-12-30 18:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #23 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Author: olegendo
Date: Tue Dec 30 18:44:27 2014
New Revision: 219111

URL: https://gcc.gnu.org/viewcvs?rev=219111&root=gcc&view=rev
Log:
gcc/testsuite/
    PR target/49263
    * gcc.target/sh/pr49263-1.c: New.
    * gcc.target/sh/pr49263-2.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
Modified:
    trunk/gcc/testsuite/ChangeLog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (22 preceding siblings ...)
  2014-12-30 18:45 ` olegendo at gcc dot gnu.org
@ 2015-01-24 13:05 ` olegendo at gcc dot gnu.org
  2015-01-26 23:57 ` olegendo at gcc dot gnu.org
                   ` (20 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-01-24 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #25 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Author: olegendo
Date: Sat Jan 24 13:04:53 2015
New Revision: 220081

URL: https://gcc.gnu.org/viewcvs?rev=220081&root=gcc&view=rev
Log:
gcc/
    PR target/49263
    PR target/53987
    PR target/64345
    PR target/59533
    PR target/52933
    PR target/54236
    PR target/51244
    * config/sh/sh-protos.h
    (sh_extending_set_of_reg::can_use_as_unextended_reg,
    sh_extending_set_of_reg::use_as_unextended_reg,
    sh_is_nott_insn, sh_movt_set_dest, sh_movrt_set_dest, sh_is_movt_insn,
    sh_is_movrt_insn, sh_insn_operands_modified_between_p,
    sh_reg_dead_or_unused_after_insn, sh_in_recog_treg_set_expr,
    sh_recog_treg_set_expr, sh_split_treg_set_expr): New functions.
    (sh_treg_insns): New class.
    * config/sh/sh.c (TARGET_LEGITIMATE_COMBINED_INSN): Define target hook.
    (scope_counter): New class.
    (sh_legitimate_combined_insn, sh_is_nott_insn, sh_movt_set_dest,
    sh_movrt_set_dest, sh_reg_dead_or_unused_after_insn,
    sh_extending_set_of_reg::can_use_as_unextended_reg,
    sh_extending_set_of_reg::use_as_unextended_reg, sh_recog_treg_set_expr,
    sh_in_recog_treg_set_expr, sh_try_split_insn_simple,
    sh_split_treg_set_expr): New functions.
    (addsubcosts): Handle treg_set_expr.
    (sh_rtx_costs): Handle IF_THEN_ELSE and ZERO_EXTRACT.
    (sh_rtx_costs): Use arith_reg_operand in SIGN_EXTEND and ZERO_EXTEND.
    (sh_rtx_costs): Handle additional bit test patterns in EQ and AND cases.
    (sh_insn_operands_modified_between_p): Make non-static.
    * config/sh/predicates.md (zero_extend_movu_operand): Allow
    simple_mem_operand in addition to displacement_mem_operand.
    (zero_extend_operand): Don't allow zero_extend_movu_operand.
    (treg_set_expr, treg_set_expr_not_const01,
    arith_reg_or_treg_set_expr): New predicates.
    * config/sh/sh.md (tstsi_t): Use arith_reg_operand and
    arith_or_int_operand instead of logical_operand.  Convert to
    insn_and_split.  Try to optimize constant operand in splitter.
    (tsthi_t, tstqi_t): Fold into *tst<mode>_t.  Convert to insn_and_split.
    (*tstqi_t_zero): Delete.
    (*tst<mode>_t_subregs): Add !sh_in_recog_treg_set_expr split condition.
    (tstsi_t_and_not): Delete.
    (tst<mode>_t_zero_extract_eq): Rename to *tst<mode>_t_zero_extract.
    Convert to insn_and_split.
    (unnamed split, tstsi_t_zero_extract_xor,
    tstsi_t_zero_extract_subreg_xor_little,
    tstsi_t_zero_extract_subreg_xor_big): Delete.
    (*tstsi_t_shift_mask): New insn_and_split.
    (cmpeqsi_t, cmpgesi_t): Add new split for const_int 0 operands and try
    to recombine with surrounding insns when splitting.
    (*negtstsi): Add !sh_in_recog_treg_set_expr condition.
    (cmp_div0s_0, cmp_div0s_1, *cmp_div0s_0, *cmp_div0s_1): Rewrite as ...
    (cmp_div0s, *cmp_div0s_1, *cmp_div0s_2, *cmp_div0s_3, *cmp_div0s_4,
    *cmp_div0s_5, *cmp_div0s_6): ... these new insn_and_split patterns.
    (*cbranch_div0s: Delete.
    (*addc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
    Try to recombine with surrounding insns when splitting.  Add operand
    order variants.
    (*addc_t_r, *addc_r_t): Use treg_set_expr_not_const01.
    (*addc_r_r_1, *addc_r_lsb, *addc_r_r_lsb, *addc_r_lsb_r, *addc_r_msb,
    *addc_r_r_msb, *addc_2r_msb): Delete.
    (*addc_2r_lsb): Rename to *addc_2r_t.  Use treg_set_expr.  Add operand
    order variant.
    (*addc_negreg_t): New insn_and_split.
    (*subc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
    Try to recombine with surrounding insns when splitting.
    Add operand order variants.  
    (*subc_negt_reg, *subc_negreg_t, *reg_lsb_t, *reg_msb_t): New
    insn_and_split patterns.
    (*rotcr): Use arith_reg_or_treg_set_expr.  Try to recombine with
    surrounding insns when splitting.
    (unnamed rotcr split): Use arith_reg_or_treg_set_expr.
    (*rotcl): Likewise.  Add zero_extract variant.
    (*ashrsi2_31): New insn_and_split.
    (*negc): Convert to insn_and_split.  Use treg_set_expr.
    (*zero_extend<mode>si2_disp_mem): Update comment.
    (movrt_negc, *movrt_negc, nott): Add !sh_in_recog_treg_set_expr split
    condition.
    (*mov_t_msb_neg, mov_neg_si_t): Use treg_set_expr.  Try to recombine
    with surrounding insns when splitting.
    (any_treg_expr_to_reg): New insn_and_split.
    (*neg_zero_extract_0, *neg_zero_extract_1, *neg_zero_extract_2,
    *neg_zero_extract_3, *neg_zero_extract_4, *neg_zero_extract_5,
    *neg_zero_extract_6, *zero_extract_0, *zero_extract_1,
    *zero_extract_2): New single bit zero extract patterns.
    (bld_reg, *bld_regqi): Fold into bld<mode>_reg.
    (*get_thread_pointersi, store_gbr, *mov<mode>_gbr_load,
    *mov<mode>_gbr_load, *mov<mode>_gbr_load, *mov<mode>_gbr_load,
    *movdi_gbr_load): Use arith_reg_dest instead of register_operand for
    set destination.
    (set_thread_pointersi, load_gbr): Use arith_reg_operand instead of
    register_operand for set source.

gcc/testsuite/
    PR target/49263
    PR target/53987
    PR target/64345
    PR target/59533
    PR target/52933
    PR target/54236
    PR target/51244
    * gcc.target/sh/pr64345-1.c: New.
    * gcc.target/sh/pr64345-2.c: New.
    * gcc.target/sh/pr59533-1.c: New.
    * gcc.target/sh/pr49263.c: Adjust matching of expected insns.
    * gcc.target/sh/pr52933-2.c: Likewise.
    * gcc.target/sh/pr54089-1.c: Likewise.
    * gcc.target/sh/pr54236-1.c: Likewise.
    * gcc.target/sh/pr51244-20-sh2a.c: Likewise.
    * gcc.target/sh/pr49263-1.c: Remove xfails.
    * gcc.target/sh/pr49263-2.c: Likewise.
    * gcc.target/sh/pr49263-3.c: Likewise.
    * gcc.target/sh/pr53987-1.c: Likewise.
    * gcc.target/sh/pr52933-1.c: Adjust matching of expected insns.
    (test_24, test_25, test_26, test_27, test_28, test_29, test_30): New.
    * gcc.target/sh/pr51244-12.c: Adjust matching of expected insns.
    (test05, test06, test07, test08, test09, test10, test11, test12): New.
    * gcc.target/sh/pr54236-3.c: Adjust matching of expected insns.
    (test_002, test_003, test_004, test_005, test_006, test_007, test_008,
    test_009): New.
    * gcc.target/sh/pr51244-4.c: Adjust matching of expected insns.
    (test_02): New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr59533-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/predicates.md
    trunk/gcc/config/sh/sh-protos.h
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-12.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-20-sh2a.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr53987-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54089-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-3.c


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (23 preceding siblings ...)
  2015-01-24 13:05 ` olegendo at gcc dot gnu.org
@ 2015-01-26 23:57 ` olegendo at gcc dot gnu.org
  2023-05-12 11:46 ` klepikov.alex+bugs at gmail dot com
                   ` (19 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-01-26 23:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #26 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Author: olegendo
Date: Mon Jan 26 23:56:05 2015
New Revision: 220144

URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
Log:
gcc/
    PR target/49263
    * config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn before
    remove_insn.
    * config/sh/sh.md (tstsi_t): Don't try to optimize constant with right
    shifts if it already fits into K08.

gcc/testsuite/
    PR target/49263
    * gcc.target/sh/pr49263-4.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (24 preceding siblings ...)
  2015-01-26 23:57 ` olegendo at gcc dot gnu.org
@ 2023-05-12 11:46 ` klepikov.alex+bugs at gmail dot com
  2023-05-23 12:34 ` klepikov.alex+bugs at gmail dot com
                   ` (18 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-12 11:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Alexander Klepikov <klepikov.alex+bugs at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |klepikov.alex+bugs at gmail dot co
                   |                            |m

--- Comment #31 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
I've found new cases for SH2 and SH2E CPUs only:

#define FLAG 0x40

unsigned int f(char v){
    return (v & FLAG) == FLAG;
}

For both big and little endian translates to dynamic shift call:

-O -m2 (or -m2e)

_f:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts     
        nop
.L3:
        .long   ___ashiftrt_r4_6

And

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

Translates to

_f:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        mov.b   @r1,r4
        mov.l   .L4,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts     
        nop
.L3:
        .long   -65536
.L4:
        .long   ___ashiftrt_r4_6

Assembler output does not depend on ADDR value, but depends on variable (or
pointer) type. When type is integer, assembler code uses 'tst' for all options
'-m4', '-m2', '-m2e':

#define FLAG 0x40

unsigned int f(unsigned int v){
    return (v & FLAG) == FLAG;
}

translates to 

_f:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts     
        negc    r0,r0

and

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned int *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

translates to 

_f:
        mov.l   .L2,r1
        mov.l   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts     
        negc    r0,r0
.L2:
        .long   -65536

Interesting that when '-m4' flag is specified, later GCC always translates to
code with 'tst' instruction. I played with godbolt and that's what I found. GCC
ver 4 uses dynamic shift 'shad' with '-m4' option and library call with '-m2'
or '-m2e' options. GCC 9.5 and later uses 'tst' with '-m4' and library call
with both '-m2' and '-m2e' when FLAG==0x40 and 'shll' instruction with both
'-m2' and '-m2e' when FLAG==0x80. I remind you that this is happening when char
type used only.

Maybe SH4 solution can be extended to support SH2/SH2E?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (25 preceding siblings ...)
  2023-05-12 11:46 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-23 12:34 ` klepikov.alex+bugs at gmail dot com
  2023-05-23 12:35 ` klepikov.alex+bugs at gmail dot com
                   ` (17 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-23 12:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #32 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
I'm not sure whether I should write here or open new discussion, but these
topics are related very closely. I've been writing a patch to eliminate the
generation of dynamic shift instructions 'shad' and 'shld' completely at least
for SH4 CPU. And then I get a surprising result - in all the examples I gave
earlier, library call converted to 'tst' instructions!

Here is the patch itself (I also will attach a file):

--- ../gcc-12.3.0.orig/gcc/config/sh/sh.cc      2023-05-08 15:14:39.681161695
+0300
+++ ./gcc/config/sh/sh.cc       2023-05-23 12:23:25.964375731 +0300
@@ -3061,7 +3061,7 @@
   else
     insn_count = ashl_lshr_seq[shift_amount_i].insn_count;

-  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST);
+  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST) && !
disable_dynshift;
 }

 /* Assuming we have a value that has been sign-extended by at least one bit,
@@ -3812,8 +3812,10 @@
   rtx wrk;
   char func[18];
   int value;
+  int long_shift  = disable_dynshift ? 30 : 19;
+  int short_shift = disable_dynshift ? 15 : 5;

-  if (TARGET_DYNSHIFT)
+  if (TARGET_DYNSHIFT && ! disable_dynshift)
     {
       if (!CONST_INT_P (operands[2]))
        {
@@ -3851,7 +3853,7 @@
       emit_insn (gen_ashrsi2_31 (operands[0], operands[1]));
       return true;
     }
-  else if (value >= 16 && value <= 19)
+  else if (value >= 16 && value <= long_shift)
     {
       wrk = gen_reg_rtx (SImode);
       emit_insn (gen_ashrsi2_16 (wrk, operands[1]));
@@ -3862,7 +3864,7 @@
       return true;
     }
   /* Expand a short sequence inline, longer call a magic routine.  */
-  else if (value <= 5)
+  else if (value <= short_shift)
     {
       wrk = gen_reg_rtx (SImode);
       emit_move_insn (wrk, operands[1]);
diff -ur ../gcc-12.3.0.orig/gcc/config/sh/sh.opt ./gcc/config/sh/sh.opt
--- ../gcc-12.3.0.orig/gcc/config/sh/sh.opt     2023-05-08 15:14:39.689161810
+0300
+++ ./gcc/config/sh/sh.opt      2023-05-23 10:45:36.814371159 +0300
@@ -301,3 +301,7 @@
 mlra
 Target Var(sh_lra_flag) Init(0) Save
 Use LRA instead of reload (transitional).
+
+mdisable-dynshift
+Target Var(disable_dynshift) Init(0)
+Disable dynamic shift 'shad' and 'shld' instructions

And here are my tests:
$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f1(void){
    return (*P & FLAG) == FLAG;
}

int f_signed_rshift(int v){
    return v >> S;
}

int f_signed_lshift(int v){
    return v << S;
}

unsigned int f_unsigned_rshift(unsigned int v){
    return v >> S;
}

unsigned int f_unsigned_lshift(unsigned int v){
    return v << S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -c -mrenesas -m2e -mb -O
-fno-toplevel-reorder -mdisable-dynshift -S f.c
$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f
        .type   _f, @function
_f:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f, .-_f
        .align 1
        .global _f_
        .type   _f_, @function
_f_:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_, .-_f_
        .align 1
        .global _f1
        .type   _f1, @function
_f1:
        mov.l   .L4,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L5:
        .align 2
.L4:
        .long   -65536
        .size   _f1, .-_f1
        .align 1
        .global _f_signed_rshift
        .type   _f_signed_rshift, @function
_f_signed_rshift:
        mov     r4,r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        rts
        shar    r0
        .size   _f_signed_rshift, .-_f_signed_rshift
        .align 1
        .global _f_signed_lshift
        .type   _f_signed_lshift, @function
_f_signed_lshift:
        mov     r4,r0
        shll2   r0
        shll2   r0
        add     r0,r0
        rts
        shll2   r0
        .size   _f_signed_lshift, .-_f_signed_lshift
        .align 1
        .global _f_unsigned_rshift
        .type   _f_unsigned_rshift, @function
_f_unsigned_rshift:
        mov     r4,r0
        shlr2   r0
        shlr2   r0
        shlr    r0
        rts
        shlr2   r0
        .size   _f_unsigned_rshift, .-_f_unsigned_rshift
        .align 1
        .global _f_unsigned_lshift
        .type   _f_unsigned_lshift, @function
_f_unsigned_lshift:
        mov     r4,r0
        shll2   r0
        shll2   r0
        add     r0,r0
        rts
        shll2   r0
        .size   _f_unsigned_lshift, .-_f_unsigned_lshift
        .ident  "GCC: (GNU) 12.3.0"

I also compiled my project with '-m2e' and new '-mdisable-dynshift' options and
tested it in SH-2E mone on Renesas's emulator that comes with High-performance
Embedded Workshop and all unit tests run as expected.

If this patch is useful let's include it in GCC.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (26 preceding siblings ...)
  2023-05-23 12:34 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-23 12:35 ` klepikov.alex+bugs at gmail dot com
  2023-05-23 19:05 ` olegendo at gcc dot gnu.org
                   ` (16 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-23 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #33 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
Created attachment 55142
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55142&action=edit
Disable dynamic shift instructions patch

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (27 preceding siblings ...)
  2023-05-23 12:35 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-23 19:05 ` olegendo at gcc dot gnu.org
  2023-05-24 11:40 ` klepikov.alex+bugs at gmail dot com
                   ` (15 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-23 19:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #34 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #33)
> Created attachment 55142 [details]
> Disable dynamic shift instructions patch

First of all, thanks for digging into this.  This issue has been a can of
worms, due to all sorts of reasons.

As you have discovered, some code patterns take the shift instruction route,
which is basically decided earlier by the various middle-end optimizations. 
There have also been some changes to those parts recently, but I haven't been
watching what it does for SH.

> unsigned int f(char v){
>     return (v & FLAG) == FLAG;
> }

Bit-tests of char and unsigned char should be covered by the test-suite and
should work -- at least originally.  However, what might be triggering this
problem is the '== FLAG' comparison.  When I was working on this issue I only
used '== 0' or '!= 0' comparison.  I can imagine that your test code triggers
some other middle end optimizations and hence we get this.

Can you try to rewrite your test code to something like this?

unsigned int f(char v){
    return (v & FLAG) != 0;
}

... and see if it generates the tst instruction as expected?


> I also compiled my project with '-m2e' and new '-mdisable-dynshift'
> options and tested it in SH-2E mone on Renesas's emulator that comes
> with High-performance Embedded Workshop and all unit tests run as expected.

I'm not sure what the purpose of the '-mdisable-dynshift' option would be here
though.  For '-m2e' TARGET_DYNSHIFT is already 'false'.  So the option seems
misnamed.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (28 preceding siblings ...)
  2023-05-23 19:05 ` olegendo at gcc dot gnu.org
@ 2023-05-24 11:40 ` klepikov.alex+bugs at gmail dot com
  2023-05-24 11:57 ` olegendo at gcc dot gnu.org
                   ` (14 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-24 11:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #35 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
(In reply to Oleg Endo from comment #34)
> Bit-tests of char and unsigned char should be covered by the test-suite and
> should work -- at least originally.  However, what might be triggering this
> problem is the '== FLAG' comparison.  When I was working on this issue I
> only used '== 0' or '!= 0' comparison.  I can imagine that your test code
> triggers some other middle end optimizations and hence we get this.

Yes, I am sure that the problem is the '== FLAG' comparison. Before I reported
that bug, I tried to bypass it and this macro does not produce shift
instructions even on GCC 4.7:

#define BIT_MASK_IS_SET_(VALUE, BITMASK)\
        ({int _value = VALUE & BITMASK,\
        _result;\
        if (_value == BITMASK){\
                _result = 1;\
        }\
        else {\
                _result = 0;\
        }\
        _result;})

So this is definitely the comparison.

> 
> Can you try to rewrite your test code to something like this?
> 
> unsigned int f(char v){
>     return (v & FLAG) != 0;
> }
> 
> ... and see if it generates the tst instruction as expected?
> 

As I understand, you meant the following (I added new functions at the end of
file):

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
    return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
    return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
    return (*P & FLAG) != 0;
}

Compiler flags: -c -mrenesas -m2e -mb -O -fno-toplevel-reorder

With patch disabled:

$ cat f_clean.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L4:
        .align 2
.L3:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        sts.l   pr,@-r15
        mov.l   .L7,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L8:
        .align 2
.L7:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        sts.l   pr,@-r15
        mov.l   .L11,r1
        mov.b   @r1,r4
        mov.l   .L12,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L13:
        .align 2
.L11:
        .long   -65536
.L12:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L15,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L16:
        .align 2
.L15:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        sts.l   pr,@-r15
        mov.l   .L19,r1
        mov.b   @r1,r4
        mov.l   .L20,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L21:
        .align 2
.L19:
        .long   -65536
.L20:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled:

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L4,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L5:
        .align 2
.L4:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L7,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L8:
        .align 2
.L7:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L10,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .ident  "GCC: (GNU) 12.3.0"

> I'm not sure what the purpose of the '-mdisable-dynshift' option would be
> here though.  For '-m2e' TARGET_DYNSHIFT is already 'false'.  So the option
> seems misnamed.

I choose that name because I wanted to disable dynamic shift instructions for
all CPUs. I did not hope that it will affect SH-2E code in such way.

I can rewrite the patch so that it only affects CPUs that do not support
dynamic shifts and disables library call for dynamic shifts. I'll do it anyway
because I need it badly. How do you think, what name of option would be better:
'-mdisable-dynshift-libcall' or '-mhw-shift'? Or if you want, please suggest
another one. Thank you!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (29 preceding siblings ...)
  2023-05-24 11:40 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-24 11:57 ` olegendo at gcc dot gnu.org
  2023-05-24 13:34 ` klepikov.alex+bugs at gmail dot com
                   ` (13 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-24 11:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #36 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #35)
> 
> As I understand, you meant the following (I added new functions at the end
> of file):
> 
> $ cat f.c
> #define ADDR 0xFFFF0000
> #define P ((unsigned char *)ADDR)
> #define FLAG 0x40
> #define S 7
> ....

Yes, that's what I meant, thanks.

Can you also compile for little endian, and most of all, use -O2 optimization
level.  Some optimizations are not done below -O2.

> 
> I choose that name because I wanted to disable dynamic shift instructions
> for all CPUs. I did not hope that it will affect SH-2E code in such way.
> 
> I can rewrite the patch so that it only affects CPUs that do not support
> dynamic shifts and disables library call for dynamic shifts. I'll do it
> anyway because I need it badly. How do you think, what name of option would
> be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or if you want,
> please suggest another one. Thank you!

'-mdisable-dynshift-libcall' would be more appropriate for what it tries to do,
I think.  Although that is a whole different issue ... but what is it going to
do for real dynamic shifts on SH2?

What kind of code is it supposed to emit for things like

unsigned int dyn_shift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (30 preceding siblings ...)
  2023-05-24 11:57 ` olegendo at gcc dot gnu.org
@ 2023-05-24 13:34 ` klepikov.alex+bugs at gmail dot com
  2023-05-24 15:00 ` olegendo at gcc dot gnu.org
                   ` (12 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-24 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #37 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
> Can you also compile for little endian, and most of all, use -O2
> optimization level.  Some optimizations are not done below -O2.

Here's source file, I added functions with non-constant shifts

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
    return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
    return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
    return (*P & FLAG) != 0;
}

unsigned int dyn_lshift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}

unsigned int dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> (y & 31);
}

unsigned int really_dyn_lshift (unsigned int x, unsigned int y)
{
  return x << y;
}

unsigned int really_dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> y;
}

With patch disabled, -O2 -mb:

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov.l   .L4,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L5:
        .align 2
.L4:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov.l   .L8,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L9:
        .align 2
.L8:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L12,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L13,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L14:
        .align 2
.L12:
        .long   -65536
.L13:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L16,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L17:
        .align 2
.L16:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L20,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L21,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L22:
        .align 2
.L20:
        .long   -65536
.L21:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L25,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L26:
        .align 2
.L25:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L29,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L30:
        .align 2
.L29:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L33,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L34:
        .align 2
.L33:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L37,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L38:
        .align 2
.L37:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch disabled, -O2 -ml

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov.l   .L4,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L5:
        .align 2
.L4:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov.l   .L8,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L9:
        .align 2
.L8:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L12,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L13,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L14:
        .align 2
.L12:
        .long   -65536
.L13:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L16,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L17:
        .align 2
.L16:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L20,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L21,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L22:
        .align 2
.L20:
        .long   -65536
.L21:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L25,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L26:
        .align 2
.L25:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L29,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L30:
        .align 2
.L29:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L33,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L34:
        .align 2
.L33:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L37,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L38:
        .align 2
.L37:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled -O2 -mb

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L5,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L6:
        .align 2
.L5:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L8,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L9:
        .align 2
.L8:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L11,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L12:
        .align 2
.L11:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L15,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L16:
        .align 2
.L15:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L19,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L20:
        .align 2
.L19:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L23,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L24:
        .align 2
.L23:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L27,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L28:
        .align 2
.L27:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled, -O2 -ml

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L5,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L6:
        .align 2
.L5:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L8,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L9:
        .align 2
.L8:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L11,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L12:
        .align 2
.L11:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L15,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L16:
        .align 2
.L15:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L19,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L20:
        .align 2
.L19:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L23,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L24:
        .align 2
.L23:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L27,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L28:
        .align 2
.L27:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0" 

> '-mdisable-dynshift-libcall' would be more appropriate for what it tries to
> do, I think.  Although that is a whole different issue ... but what is it
> going to do for real dynamic shifts on SH2?
> 
> What kind of code is it supposed to emit for things like
> 
> unsigned int dyn_shift (unsigned int x, unsigned int y)
> {
>   return x << (y & 31);
> }

As far as I understand from GCC sources, function I patched 'expand_ashiftrt'
process only constant values of shift. As you can see earlier, I added your and
other examples to tests. It looks like really dynamic shifts translate to
library calls.

Should I test more exotic situations? If so, could you please help me with
really exotic or weired examples?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (31 preceding siblings ...)
  2023-05-24 13:34 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-24 15:00 ` olegendo at gcc dot gnu.org
  2023-05-25 17:53 ` klepikov.alex+bugs at gmail dot com
                   ` (11 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-24 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #38 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #37)
> 
> As far as I understand from GCC sources, function I patched
> 'expand_ashiftrt' process only constant values of shift. As you can see
> earlier, I added your and other examples to tests.

OK, thanks for the additional test cases.  It really looks like the way the
constant shift is expanded (via ashrsi3_n insn) on SH1/SH2 is getting in the
way.

The tst insn is mainly formed by the combine pass, which relies on certain insn
patterns and combinations thereof.  See also sh.md, around line 530.

You can look at the debug output with the -fdump-rtl-all option to see what's
happening in the RTL passes.

What your patch is doing is to make it not emit the ashrsi3_n insn for constant
shifts altogether?  I guess it will make code that actually needs those real
shifts larger, as it will always emit the whole shift stitching sequence.  That
might be a good thing or not.


> It looks like really
> dynamic shifts translate to library calls.

So the option name '-mdisable-dynshift-libcall' doesn't make sense.
What it actually does is more like '-mdisable-constshift-libcall'.

> 
> Should I test more exotic situations? If so, could you please help me with
> really exotic or weired examples?

Have you had a look at the existing test cases for this in
gcc/testsuite/gcc.target/sh ?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (32 preceding siblings ...)
  2023-05-24 15:00 ` olegendo at gcc dot gnu.org
@ 2023-05-25 17:53 ` klepikov.alex+bugs at gmail dot com
  2023-05-25 21:32 ` olegendo at gcc dot gnu.org
                   ` (10 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-25 17:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #39 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
> The tst insn is mainly formed by the combine pass, which relies on certain
> insn patterns and combinations thereof.  See also sh.md, around line 530.

I'm sorry, but .md lang is too complicated for me.

> You can look at the debug output with the -fdump-rtl-all option to see
> what's happening in the RTL passes.

It looks like with optimization enabled it converts bitwise AND to right shift
and then optimizes again. But SH4 has 'shad' and 'shad' can be optimized to
'tst'. And SH2E has libcall instead of dynamic shift and libcll cannot be
converted. It seems that very first optimization spoils things.

But when we have numerous 'shar' instructions, optimization joins the game
again and converts them to 'tst'.

> What your patch is doing is to make it not emit the ashrsi3_n insn for
> constant shifts altogether?  I guess it will make code that actually needs
> those real shifts larger, as it will always emit the whole shift stitching
> sequence.  That might be a good thing or not.

You are absolutely right, the code will be larger when we do right shifts. But
there's situations when you can't use library. For exmple, SH7055 engine
control unit can barely contain the program. The library just won't fit.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (33 preceding siblings ...)
  2023-05-25 17:53 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-25 21:32 ` olegendo at gcc dot gnu.org
  2023-05-26 12:03 ` klepikov.alex+bugs at gmail dot com
                   ` (9 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-25 21:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #40 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #39)
> 
> I'm sorry, but .md lang is too complicated for me.

Yeah, it looks alien at first sight.  But it's where a lot of things happen
w.r.t. instruction selection.

> It looks like with optimization enabled it converts bitwise AND to right
> shift and then optimizes again. But SH4 has 'shad' and 'shad' can be
> optimized to 'tst'. And SH2E has libcall instead of dynamic shift and libcll
> cannot be converted. It seems that very first optimization spoils things.
> 
> But when we have numerous 'shar' instructions, optimization joins the game
> again and converts them to 'tst'.

Yes, something like this is what I remember happening there.  I'll try to look
into the issue with your test cases and see if it's possible to add some
patterns to catch those.

BTW, have you tried it on a more recent GCC?  There have also been some
optimizations in the middle-end (a bit more backend independent) for this kind
of thing.

> You are absolutely right, the code will be larger when we do right shifts.
> But there's situations when you can't use library. For exmple, SH7055 engine
> control unit can barely contain the program. The library just won't fit.

Have you tried to use whole program optimization via -flto and
-ffunction-sections? It  should be able to strip out all unnecessary library
functions.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (34 preceding siblings ...)
  2023-05-25 21:32 ` olegendo at gcc dot gnu.org
@ 2023-05-26 12:03 ` klepikov.alex+bugs at gmail dot com
  2023-05-26 17:44 ` olegendo at gcc dot gnu.org
                   ` (8 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-26 12:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #41 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
> > It looks like with optimization enabled it converts bitwise AND to right
> > shift and then optimizes again. But SH4 has 'shad' and 'shad' can be
> > optimized to 'tst'. And SH2E has libcall instead of dynamic shift and libcll
> > cannot be converted. It seems that very first optimization spoils things.
> > 
> > But when we have numerous 'shar' instructions, optimization joins the game
> > again and converts them to 'tst'.
> 
> Yes, something like this is what I remember happening there.  I'll try to
> look into the issue with your test cases and see if it's possible to add
> some patterns to catch those.

Thank you! I have an idea. If it's impossible to defer initial optimization,
maybe it's possible to emit some intermediate insn and catch it and optimize
later?

> BTW, have you tried it on a more recent GCC?  There have also been some
> optimizations in the middle-end (a bit more backend independent) for this
> kind of thing.

I tried 13.1 about week or two ago with the same result.

> Have you tried to use whole program optimization via -flto and
> -ffunction-sections? It  should be able to strip out all unnecessary library
> functions.

Thank you for advice, I'll take a try.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (35 preceding siblings ...)
  2023-05-26 12:03 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-26 17:44 ` olegendo at gcc dot gnu.org
  2023-05-28 10:24 ` klepikov.alex+bugs at gmail dot com
                   ` (7 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-26 17:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #42 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #41)
> 
> Thank you! I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?

This is basically what is supposed to be happening there already.

However, it's a bit of a dilemma.

1) If we don't have a dynamic shift insn and we smash the constant shift into
individual 
stitching shifts early, it might open some new optimization opportunities, e.g.
by sharing intermediate shift results.  Not sure though if that actually
happens in practice though.

2) Whether to use the dynamic shift insn or emit a function call or use
stitching shifts sequence, it all has an impact on register allocation and
other instruction use.  This can be problematic during the course of RTL
optimization passes.

3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a
shorter stitching shift sequence.  Which one is better depends on the
surrounding code.  I don't think there is anything good there to make the
proper choice.

Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR 52628,
PR 58017


> > BTW, have you tried it on a more recent GCC?  There have also been some
> > optimizations in the middle-end (a bit more backend independent) for this
> > kind of thing.
> 
> I tried 13.1 about week or two ago with the same result.
> 

Good to know.  Thanks for checking it!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (36 preceding siblings ...)
  2023-05-26 17:44 ` olegendo at gcc dot gnu.org
@ 2023-05-28 10:24 ` klepikov.alex+bugs at gmail dot com
  2023-05-28 10:48 ` olegendo at gcc dot gnu.org
                   ` (6 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-28 10:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #43 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
> > Thank you! I have an idea. If it's impossible to defer initial optimization,
> > maybe it's possible to emit some intermediate insn and catch it and optimize
> > later?
> 
> This is basically what is supposed to be happening there already.

Well, not really. Look what's happening during expand pass when 'ashrsi3' is
expanding. Function 'expand_ashiftrt' is called and what it does at the end -
it explicitly emits 3 insns:

wrk = gen_reg_rtx (Pmode);

  //This one
  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);

  sprintf (func, "__ashiftrt_r4_%d", value);
  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;

  //This one
  emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));

  //And this one
  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));

As far as I understand these insns could be catched later by a peephole and
converted to 'tstsi_t' insn like it is done for other much simple insn
sequences.

What I'm thinkig about is to emit only one, say 'compound', insn. Which could
be then splitted later somwhere in split pass to function call, to those 3
insns.

I wrote test code that emits only one bogus insn. This insn expands to pure asm
code. Of course, that asm code is invalid, because it is impossible to place a
libcall label at the end of function with pure asm code injection. But then all
what is could be coverted to 'tst', converts to 'tst'. And all pure right
shifts converts to invalid asm code, of course.

That's why I am thinking about possibility of emitting some intermediate insn
at expand pass that will defer it real expanding. But I still don't know how to
do it right and even if it is possible.

By the way, right shift for integers expands to only one 'lshiftrt' insn and
that's why it can be catched and converted to 'tst'.

> 
> However, it's a bit of a dilemma.
> 
> 1) If we don't have a dynamic shift insn and we smash the constant shift
> into individual 
> stitching shifts early, it might open some new optimization opportunities,
> e.g. by sharing intermediate shift results.  Not sure though if that
> actually happens in practice though.
> 
> 2) Whether to use the dynamic shift insn or emit a function call or use
> stitching shifts sequence, it all has an impact on register allocation and
> other instruction use.  This can be problematic during the course of RTL
> optimization passes.
> 
> 3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a
> shorter stitching shift sequence.  Which one is better depends on the
> surrounding code.  I don't think there is anything good there to make the
> proper choice.
> 
> Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR
> 52628, PR 58017

Thank you for your time and detailed explanations! I agree with you on all
points. Software cannot be perfect and it's OK for GCC not to be super
optimized, so this part better sholud be left intact.

By the way, I tried to link library to my project and I figured out that linker
is smart enough to link only necessary library functions even without LTO. So
increase in size is about 100 or 200 bytes, that is acceptable. Thank you very
much for help!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (37 preceding siblings ...)
  2023-05-28 10:24 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-28 10:48 ` olegendo at gcc dot gnu.org
  2023-05-29 14:54 ` klepikov.alex+bugs at gmail dot com
                   ` (5 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-28 10:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #44 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #43)
> 
> Well, not really. Look what's happening during expand pass when 'ashrsi3' is
> expanding. Function 'expand_ashiftrt' is called and what it does at the end
> - it explicitly emits 3 insns:
> [...]

> 
> By the way, right shift for integers expands to only one 'lshiftrt' insn and
> that's why it can be catched and converted to 'tst'.
> 

Yeah, I might have dropped the ball on the right shift patterns back then and
only reworked the left shift patterns to do that. 


> 
> As far as I understand these insns could be catched later by a peephole and
> converted to 'tstsi_t' insn like it is done for other much simple insn
> sequences.

It's the combine RTL pass and split1 RTL pass that does most of this work here.
 Peephole pass in GCC is too simplistic for this.


> 
> Thank you for your time and detailed explanations! I agree with you on all
> points. Software cannot be perfect and it's OK for GCC not to be super
> optimized, so this part better sholud be left intact.

We can't have it perfect, but we can try ;)

> 
> By the way, I tried to link library to my project and I figured out that
> linker is smart enough to link only necessary library functions even without
> LTO. So increase in size is about 100 or 200 bytes, that is acceptable.
> Thank you very much for help!

You're welcome.

Yes, to strip out unused library functions it doesn't need LTO.  But using LTO
for embedded/MCU  firmware, I find it can reduce the code size by about 20%. 
For example, it can also inline small library functions in your code (if the
library was also compiled with LTO).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (38 preceding siblings ...)
  2023-05-28 10:48 ` olegendo at gcc dot gnu.org
@ 2023-05-29 14:54 ` klepikov.alex+bugs at gmail dot com
  2023-05-30  1:48 ` egallager at gcc dot gnu.org
                   ` (4 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-29 14:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #45 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
>I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?


Good news. I've made a proof of concept. It works at least sometimes - on
simple tests.

$ cat f.c
#define A 0xFFFF0000
#define P ((unsigned char *)A)
#define F 64
#define S 8

unsigned char f_non_zero(unsigned char v){
    return (v & F) != 0;
}

unsigned f_sym_non_zero(void){
    return (*P & F) != 0;
}

unsigned f_sym_mask(void){
    return (*P & F) == F;
}

int f_rshift(char v){
    return v >> S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -mb -m2e -da -S f.c

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_non_zero
        .type   _f_non_zero, @function
_f_non_zero:
        mov     r4,r0
        sts.l   pr,@-r15
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
        .size   _f_non_zero, .-_f_non_zero
        .align 1
        .align 2
        .global _f_sym_non_zero
        .type   _f_sym_non_zero, @function
_f_sym_non_zero:
        mov.l   .L6,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L7:
        .align 2
.L6:
        .long   -65536
        .size   _f_sym_non_zero, .-_f_sym_non_zero
        .align 1
        .align 2
        .global _f_sym_mask
        .type   _f_sym_mask, @function
_f_sym_mask:
        mov.l   .L10,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_sym_mask, .-_f_sym_mask
        .align 1
        .align 2
        .global _f_rshift
        .type   _f_rshift, @function
_f_rshift:
        mov.l   .L14,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        lds.l   @r15+,pr
        rts
        nop
.L15:
        .align 2
.L14:
        .long   ___ashiftrt_r4_8
        .size   _f_rshift, .-_f_rshift
        .ident  "GCC: (GNU) 13.1.0"

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -ml -m2e -da -S f.c

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_non_zero
        .type   _f_non_zero, @function
_f_non_zero:
        mov     r4,r0
        sts.l   pr,@-r15
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
        .size   _f_non_zero, .-_f_non_zero
        .align 1
        .align 2
        .global _f_sym_non_zero
        .type   _f_sym_non_zero, @function
_f_sym_non_zero:
        mov.l   .L6,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L7:
        .align 2
.L6:
        .long   -65536
        .size   _f_sym_non_zero, .-_f_sym_non_zero
        .align 1
        .align 2
        .global _f_sym_mask
        .type   _f_sym_mask, @function
_f_sym_mask:
        mov.l   .L10,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_sym_mask, .-_f_sym_mask
        .align 1
        .align 2
        .global _f_rshift
        .type   _f_rshift, @function
_f_rshift:
        mov.l   .L14,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        lds.l   @r15+,pr
        rts
        nop
.L15:
        .align 2
.L14:
        .long   ___ashiftrt_r4_8
        .size   _f_rshift, .-_f_rshift
        .ident  "GCC: (GNU) 13.1.0"

Splitting takes place at split1 pass as expected. Here is the patch itself.

$ cat gcc-13.1.0-ashrsi3_libcall.patch
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh-protos.h
gcc-13.1.0/gcc/config/sh/sh-protos.h
--- gcc-13.1.0.orig/gcc/config/sh/sh-protos.h   2023-04-26 10:09:39.000000000
+0300
+++ gcc-13.1.0/gcc/config/sh/sh-protos.h        2023-05-29 11:45:05.134723435
+0300
@@ -78,6 +78,7 @@
 extern void gen_shifty_op (int, rtx *);
 extern void gen_shifty_hi_op (int, rtx *);
 extern bool expand_ashiftrt (rtx *);
+extern bool expand_ashrsi3_libcall (rtx *);//delete
 extern bool sh_dynamicalize_shift_p (rtx);
 extern int shl_and_kind (rtx, rtx, int *);
 extern int shl_and_length (rtx);
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.cc gcc-13.1.0/gcc/config/sh/sh.cc
--- gcc-13.1.0.orig/gcc/config/sh/sh.cc 2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.cc      2023-05-29 17:09:54.602787537 +0300
@@ -3875,11 +3877,37 @@
   wrk = gen_reg_rtx (Pmode);

   /* Load the value into an arg reg and call a helper.  */
-  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+  /*emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
   sprintf (func, "__ashiftrt_r4_%d", value);
   rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
   emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));
   emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+  return true;*/
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3: Emitting collapsed libcall\n");
+  emit_insn (gen_ashrsi3_libcall_collapsed (operands[0], operands[1],
GEN_INT(value)));//delete
+  return true;//delete
+}
+
+//delete
+bool
+expand_ashrsi3_libcall (rtx *operands) {
+  char func[18];
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3_libcall_collapsed: Expanding ashrsi3
libcall\n");
+
+  rtx wrk = gen_reg_rtx (Pmode);
+  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+
+  sprintf (func, "__ashiftrt_r4_%d", INTVAL (operands[2]));
+
+  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
+
+  emit_insn (gen_ashrsi3_n (operands[2], wrk, lab));
+  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+
   return true;
 }

diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.md gcc-13.1.0/gcc/config/sh/sh.md
--- gcc-13.1.0.orig/gcc/config/sh/sh.md 2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.md      2023-05-29 17:10:42.752779922 +0300
@@ -3867,6 +3867,35 @@
   [(set_attr "type" "sfunc")
    (set_attr "needs_delay_slot" "yes")])

+(define_insn "ashrsi3_libcall_collapsed"
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
+                    (match_operand:SI 2 "const_int_operand")))
+   (clobber (reg:SI T_REG))
+   (clobber (reg:SI PR_REG))]
+  "TARGET_SH1"
+  "OOPS"
+  [(set_attr "type" "dyn_shift")
+   (set_attr "needs_delay_slot" "yes")])
+
+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+           (match_operand:SI 2 "const_int_operand"))
+       )(clobber (reg:SI T_REG))
+       (clobber (reg:SI PR_REG))
+  ])]
+  "TARGET_SH1"
+  "OOPS_1"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (expand_ashrsi3_libcall(operands))
+    DONE;
+  else
+    FAIL;
+})
+
 ;; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 ;; DImode arithmetic shift right

I did it by feel, actually picking up the parameters until it worked. So please
check it and improve it because I'm sure it will break something. Thank you.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (39 preceding siblings ...)
  2023-05-29 14:54 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-30  1:48 ` egallager at gcc dot gnu.org
  2023-05-30  1:56 ` olegendo at gcc dot gnu.org
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: egallager at gcc dot gnu.org @ 2023-05-30  1:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #46 from Eric Gallager <egallager at gcc dot gnu.org> ---
Reminder that patches go to the gcc-patches mailing list

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (40 preceding siblings ...)
  2023-05-30  1:48 ` egallager at gcc dot gnu.org
@ 2023-05-30  1:56 ` olegendo at gcc dot gnu.org
  2023-05-30 12:42 ` klepikov.alex+bugs at gmail dot com
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-30  1:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #47 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Eric Gallager from comment #46)
> Reminder that patches go to the gcc-patches mailing list

It's OK.  We're just throwing around ideas, nothing final

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (41 preceding siblings ...)
  2023-05-30  1:56 ` olegendo at gcc dot gnu.org
@ 2023-05-30 12:42 ` klepikov.alex+bugs at gmail dot com
  2023-05-30 19:57 ` olegendo at gcc dot gnu.org
  2023-05-30 20:00 ` olegendo at gcc dot gnu.org
  44 siblings, 0 replies; 46+ messages in thread
From: klepikov.alex+bugs at gmail dot com @ 2023-05-30 12:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #48 from Alexander Klepikov <klepikov.alex+bugs at gmail dot com> ---
I made tests (including *.c files from GCC testsuite) and everything looks fine
for now. But I'm still afraid that pattern for 'ashrsi3_libcall_expand' is too
wide. It is possible to narrow it down as much as possible by adding distinct
attribute and set when emitting 'ashrsi3_libcall_collapsed' and then check it
and fail if not set:

(define_attr "libcall_collapsed"
 "ashrsi3,nil"
 (const_string "nil"))

(define_insn "ashrsi3_libcall_collapsed"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
        (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
                     (match_operand:SI 2 "const_int_operand")))
   (clobber (reg:SI T_REG))
   (clobber (reg:SI PR_REG))]
  "TARGET_SH1"
  "OOPS"
  [(set_attr "type" "dyn_shift")
   (set_attr "libcall_collapsed" "ashrsi3")
   (set_attr "needs_delay_slot" "yes")])

 if (get_attr_libcall_collapsed(insn) != LIBCALL_COLLAPSED_ASHRSI3)
    return false;

It will be super safe then but ugly a little bit.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (42 preceding siblings ...)
  2023-05-30 12:42 ` klepikov.alex+bugs at gmail dot com
@ 2023-05-30 19:57 ` olegendo at gcc dot gnu.org
  2023-05-30 20:00 ` olegendo at gcc dot gnu.org
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-30 19:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #49 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Alexander Klepikov from comment #48)
> I made tests (including *.c files from GCC testsuite) and everything looks
> fine for now. But I'm still afraid that pattern for 'ashrsi3_libcall_expand'
> is too wide. It is possible to narrow it down as much as possible by adding
> distinct attribute and set when emitting 'ashrsi3_libcall_collapsed' and
> then check it and fail if not set:
> 

For this kind of change, the whole GCC test suite needs to be ran for at least
big/little -m2,-m4 variants.


+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+           (match_operand:SI 2 "const_int_operand"))
+       )(clobber (reg:SI T_REG))
+       (clobber (reg:SI PR_REG))
+  ])]

The 'parallel' construct looks strange.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
  2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
                   ` (43 preceding siblings ...)
  2023-05-30 19:57 ` olegendo at gcc dot gnu.org
@ 2023-05-30 20:00 ` olegendo at gcc dot gnu.org
  44 siblings, 0 replies; 46+ messages in thread
From: olegendo at gcc dot gnu.org @ 2023-05-30 20:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #50 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Actually, let's take any further discussion of shift patterns to PR 54089.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2023-05-30 20:00 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-01 20:17 [Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction oleg.endo@t-online.de
2011-06-01 20:42 ` [Bug target/49263] " oleg.endo@t-online.de
2011-06-12 23:12 ` kkojima at gcc dot gnu.org
2011-06-19 16:42 ` oleg.endo@t-online.de
2011-06-22 22:34 ` kkojima at gcc dot gnu.org
2011-06-26 22:31 ` oleg.endo@t-online.de
2011-06-27  5:15 ` kkojima at gcc dot gnu.org
2011-10-09 23:35 ` oleg.endo@t-online.de
2011-10-10  1:32 ` kkojima at gcc dot gnu.org
2011-10-10 23:48 ` oleg.endo@t-online.de
2011-10-11  1:47 ` kkojima at gcc dot gnu.org
2011-10-13 22:55 ` oleg.endo@t-online.de
2011-10-14 23:06 ` kkojima at gcc dot gnu.org
2011-10-15  2:33 ` kkojima at gcc dot gnu.org
2011-11-20 14:20 ` oleg.endo@t-online.de
2011-12-29  1:09 ` oleg.endo@t-online.de
2012-02-26 16:28 ` olegendo at gcc dot gnu.org
2012-02-26 23:29 ` olegendo at gcc dot gnu.org
2012-08-27 19:52 ` olegendo at gcc dot gnu.org
2012-10-28 22:02 ` olegendo at gcc dot gnu.org
2012-10-31 13:47 ` olegendo at gcc dot gnu.org
2013-12-08 13:47 ` olegendo at gcc dot gnu.org
2013-12-17 12:37 ` olegendo at gcc dot gnu.org
2014-12-30 18:45 ` olegendo at gcc dot gnu.org
2015-01-24 13:05 ` olegendo at gcc dot gnu.org
2015-01-26 23:57 ` olegendo at gcc dot gnu.org
2023-05-12 11:46 ` klepikov.alex+bugs at gmail dot com
2023-05-23 12:34 ` klepikov.alex+bugs at gmail dot com
2023-05-23 12:35 ` klepikov.alex+bugs at gmail dot com
2023-05-23 19:05 ` olegendo at gcc dot gnu.org
2023-05-24 11:40 ` klepikov.alex+bugs at gmail dot com
2023-05-24 11:57 ` olegendo at gcc dot gnu.org
2023-05-24 13:34 ` klepikov.alex+bugs at gmail dot com
2023-05-24 15:00 ` olegendo at gcc dot gnu.org
2023-05-25 17:53 ` klepikov.alex+bugs at gmail dot com
2023-05-25 21:32 ` olegendo at gcc dot gnu.org
2023-05-26 12:03 ` klepikov.alex+bugs at gmail dot com
2023-05-26 17:44 ` olegendo at gcc dot gnu.org
2023-05-28 10:24 ` klepikov.alex+bugs at gmail dot com
2023-05-28 10:48 ` olegendo at gcc dot gnu.org
2023-05-29 14:54 ` klepikov.alex+bugs at gmail dot com
2023-05-30  1:48 ` egallager at gcc dot gnu.org
2023-05-30  1:56 ` olegendo at gcc dot gnu.org
2023-05-30 12:42 ` klepikov.alex+bugs at gmail dot com
2023-05-30 19:57 ` olegendo at gcc dot gnu.org
2023-05-30 20:00 ` olegendo at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).