public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/63464] New: compare one character to many: faster
@ 2014-10-05 21:28 glisse at gcc dot gnu.org
  2014-10-06 11:27 ` [Bug tree-optimization/63464] " jakub at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-10-05 21:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

            Bug ID: 63464
           Summary: compare one character to many: faster
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org

This is inspired by reading this post:

http://stackoverflow.com/questions/26124620/why-does-msvc-emit-a-useless-movsx-before-performing-this-bit-test

which shows that bit testing can provide a very efficient lookup table for
small sizes, and in particular when we want to test if a character belongs to a
predefined set.

char*f1(char*s){
  while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n')
    ++s;
  return s;
}
char*f2(char*s){
  long n = (1L << ' ') | (1L << ',') | (1L << '\r') | (1L << '\n');
  int m = max(max(' ',','),max('\r','\n'));
  while(*s <= m && (n & (1L << *s)))
    ++s;
  return s;
}

On x86_64, the first one compiles to a bunch of cmpb+sete, combined with orl.
The second one has one cmpb+jg but uses btq+jc for the main job. A simple
benchmark on a string full of ',' shows running times of 158 vs 32, a very
large win for f2 (suspicious actually, but if I only test ' ', ',' and '\r', I
get the less surprising 50 for f1, which still shows f2 far ahead).

As is, it only works with characters less than 64, but more general versions
are possible with a bit more overhead (like applying the same to *s-64 after
testing its sign, or looking up the (*s/64)th element in a regular lookup table
and bit testing against that, etc).

The running time of 32 is exactly the same I get with a larger lookup table:
char issep[256]={0,0,...};
while(issep[*s])...

In this particular case, vectorization might also be an option, either the loop
kind if the string is likely to be long, but we don't even try because we can't
compute the number of iterations, or the slp kind by broadcasting *s, comparing
to all chars at once and reducing.

This PR is a bit vague and we cannot implement every weird optimization, but I
expect this type of comparison might be common enough that it would be worth
looking at. If you disagree, feel free to close, I never write code like that
myself ;-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
@ 2014-10-06 11:27 ` jakub at gcc dot gnu.org
  2014-10-06 11:52 ` rguenther at suse dot de
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-06 11:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-10-06
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |steven at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
We have this optimization implemented for switches, if you compile
char*f3(char*s){
  do
    {
      switch (*s)
        {
        case ' ':
        case ',':
        case '\r':
        case '\n':
          ++s;
          continue;
        default:
          return s;
        }
    }
  while (1);
}

then it will do the bit test, see r189173 (and various follow-up fixes for
that).
Now, we can argue whether in this case it is beneficial to perform the
MINUS_EXPR or if maxval is small enough (e.g. when maxval is smaller than
BITS_PER_WORD), just assume minval is 0.

And then the question is, if we should teach reassoc range optimizations to
reuse emit_case_bit_tests, or convert such tests into a GIMPLE_SWITCH and
expand as such.

Richard/Steven, thoughts about this?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
  2014-10-06 11:27 ` [Bug tree-optimization/63464] " jakub at gcc dot gnu.org
@ 2014-10-06 11:52 ` rguenther at suse dot de
  2014-10-06 16:59 ` jakub at gcc dot gnu.org
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenther at suse dot de @ 2014-10-06 11:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #2 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 6 Oct 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2014-10-06
>                  CC|                            |jakub at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org,
>                    |                            |steven at gcc dot gnu.org
>      Ever confirmed|0                           |1
> 
> --- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> We have this optimization implemented for switches, if you compile
> char*f3(char*s){
>   do
>     {
>       switch (*s)
>         {
>         case ' ':
>         case ',':
>         case '\r':
>         case '\n':
>           ++s;
>           continue;
>         default:
>           return s;
>         }
>     }
>   while (1);
> }
> 
> then it will do the bit test, see r189173 (and various follow-up fixes for
> that).
> Now, we can argue whether in this case it is beneficial to perform the
> MINUS_EXPR or if maxval is small enough (e.g. when maxval is smaller than
> BITS_PER_WORD), just assume minval is 0.
> 
> And then the question is, if we should teach reassoc range optimizations to
> reuse emit_case_bit_tests, or convert such tests into a GIMPLE_SWITCH and
> expand as such.
> 
> Richard/Steven, thoughts about this?

Factoring out the code-gen part of switch-conversion and use it
from reassoc range optimization?

I don't think creating a GIMPLE_SWITCH from reassoc is a good idea
(given that switch conversion runs early).

Richard.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
  2014-10-06 11:27 ` [Bug tree-optimization/63464] " jakub at gcc dot gnu.org
  2014-10-06 11:52 ` rguenther at suse dot de
@ 2014-10-06 16:59 ` jakub at gcc dot gnu.org
  2014-10-07 11:11 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-06 16:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 33654
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33654&action=edit
gcc5-pr63464.patch

Untested patch to avoid the subtraction of info.range_min from index.
Might not always be a net win, if the mask constant(s) is(are) more costly if
we don't subtract minval than when we do.
E.g. on x86_64, if without subtracting we need to use movabsq + and with a reg,
while with subtracting just sub + and with an immediate mask.
So perhaps we need to build all the masks for both cases and sum up rtx cost of
all the masks and the sub.

This is for the switch opt.  In tree-ssa-reassoc, I'll see what can be reused,
but probably not very much (perhaps the rtx_cost code if we add it for the
switch conversion).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2014-10-06 16:59 ` jakub at gcc dot gnu.org
@ 2014-10-07 11:11 ` jakub at gcc dot gnu.org
  2014-10-07 11:15 ` rguenther at suse dot de
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-07 11:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 33658
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33658&action=edit
gcc5-pr63464.patch

Updated patch for the switchconv, this time checking rtx costs.

As for reassoc, the problem I see is that this kind of optimization needs to
split basic blocks, as left shift by negative or >= word bit size is undefined
behavior, so the expected generated code is probably jump around the left
shift.
I think reassoc pass is not prepared to see splitting of basic blocks, nor
adding new PHI nodes etc.  In the:
int
foo (int x)
{
  return x == 1 || x == 2 || x == 4 || x == 6 || x == 15 || x == 17;
}
case we actually have 2 basic blocks and there is no other test ored in in
either of the basic blocks, so we could perform it even without creating a new
bb, but I'd say that very often we will not be that lucky.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2014-10-07 11:11 ` jakub at gcc dot gnu.org
@ 2014-10-07 11:15 ` rguenther at suse dot de
  2014-10-07 20:15 ` glisse at gcc dot gnu.org
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenther at suse dot de @ 2014-10-07 11:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 7 Oct 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464
> 
> --- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Created attachment 33658
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33658&action=edit
> gcc5-pr63464.patch
> 
> Updated patch for the switchconv, this time checking rtx costs.
> 
> As for reassoc, the problem I see is that this kind of optimization needs to
> split basic blocks, as left shift by negative or >= word bit size is undefined
> behavior, so the expected generated code is probably jump around the left
> shift.
> I think reassoc pass is not prepared to see splitting of basic blocks, nor
> adding new PHI nodes etc.  In the:
> int
> foo (int x)
> {
>   return x == 1 || x == 2 || x == 4 || x == 6 || x == 15 || x == 17;
> }
> case we actually have 2 basic blocks and there is no other test ored in in
> either of the basic blocks, so we could perform it even without creating a new
> bb, but I'd say that very often we will not be that lucky.

But you could do preparations and do the actual transform splitting blocks
as a 2nd phase?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2014-10-07 11:15 ` rguenther at suse dot de
@ 2014-10-07 20:15 ` glisse at gcc dot gnu.org
  2014-10-07 21:01 ` glisse at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-10-07 20:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> We have this optimization implemented for switches,

Thanks, that's indeed the most natural place for it, although I hadn't thought
of testing that...

Glibc's strspn has:

__STRING_INLINE size_t
__strspn_c3 (const char *__s, int __accept1, int __accept2, int __accept3)
{
  size_t __result = 0;
  /* Please note that __accept1 to __accept3 never can be '\0'.  */
  while (__s[__result] == __accept1 || __s[__result] == __accept2
         || __s[__result] == __accept3)
    ++__result;
  return __result;
}


This is only called when optimizing and with a second argument that is a string
literal, but it is still inconvenient to turn that into a switch, so it seems
useful to optimize this form as well (well, maybe we could expand
__builtin_strspn in gcc instead so it also works for larger constant second
arguments, but creating a loop is not so nice and there are plenty of other
similar functions).

(By the way, those optimizations are protected by a test __builtin_constant_p
(s) which only seems to be true if passing _directly_ a string literal, maybe
__builtin_constant_p could be made to return true a bit more often, or glibc
could test instead __builtin_constant_p (s[0]) etc)

(For completeness, I also compared with a "repne scasb; je" version I found
somewhere in glibc, which was more than 20 times slower, and quite difficult to
generate since we don't allow clobbers on asm goto...)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2014-10-07 20:15 ` glisse at gcc dot gnu.org
@ 2014-10-07 21:01 ` glisse at gcc dot gnu.org
  2014-10-10 12:16 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-10-07 21:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #7 from Marc Glisse <glisse at gcc dot gnu.org> ---
The SLP version is slightly slower than the bit test in this case (at least on
my old desktop), but it can more easily handle testing for characters that are
not within 64 of each other.

  __m128i b=_mm_set1_epi8(*s);
  __m128i m=_mm_set_epi8('\n','\r',',',' ',' ',' ',' ',' ',' ',' ',' ',' ','
',' ',' ',' ');
  __m128i e=_mm_cmpeq_epi8(b,m);
  bool found=_mm_movemask_epi8(e)!=0;

Though we are missing REDUC_TRUTH_OR_EXPR to model the last line.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2014-10-07 21:01 ` glisse at gcc dot gnu.org
@ 2014-10-10 12:16 ` jakub at gcc dot gnu.org
  2014-10-13 15:51 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-10 12:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Fri Oct 10 12:15:30 2014
New Revision: 216072

URL: https://gcc.gnu.org/viewcvs?rev=216072&root=gcc&view=rev
Log:
    PR tree-optimization/63464
    * tree-switch-conversion.c (struct case_bit_test): Remove
    hi and lo fields, add wide_int mask field.
    (emit_case_bit_tests): Add MAXVAL argument, rewrite uses of
    hi/lo fields into wide_int mask operations, optimize by pretending
    minval to be 0 if maxval is small enough.
    (process_switch): Adjust caller.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-switch-conversion.c


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2014-10-10 12:16 ` jakub at gcc dot gnu.org
@ 2014-10-13 15:51 ` jakub at gcc dot gnu.org
  2014-10-13 15:53 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-13 15:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 33697
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33697&action=edit
gcc5-pr63464.patch

WIP patch.  What is missing:
1) the optimize_range_tests_to_bit_test call should be guarded by
lshift_cheap_p (), Richard, any preference on where to declare that function
(tree-switch-conversion.h and include that in tree-ssa-reassoc.c, somewhere
else?)
2) much more importantly, it right now doesn't actually fixup the IL, so
instead of the desirable jump around the shift we have there just BIT_IOR_EXPR
(and the shift actually happens to be done first, before the range test).
Richard, any preference how to represent it in the IL from in between the
optimization and fixup once all bbs are reassociated?  I've been thinking about
e.g. some pass-through internal call on the TRUTH_ORIF_EXPR operand.  Another
option might be, if we'd adjust update_range_test, so that it would not only
emit a gimplification of the range test, but also an optional gimple_seq before
that, would be to push all the SSA_NAMEs that would be otherwise passed to the
internal call, into some vector, and just split bb after the def stmt of those
SSA_NAMEs and also before the (single, hopefully) user of that SSA_NAME, adding
an edge around the middle of the bb and PHI.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2014-10-13 15:51 ` jakub at gcc dot gnu.org
@ 2014-10-13 15:53 ` jakub at gcc dot gnu.org
  2014-10-15 14:19 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-13 15:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 33698
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33698&action=edit
bittest.c

Testcase I've been playing with.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2014-10-13 15:53 ` jakub at gcc dot gnu.org
@ 2014-10-15 14:19 ` jakub at gcc dot gnu.org
  2014-10-17 10:55 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-15 14:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 33724
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33724&action=edit
gcc5-pr63464.patch

Untested patch.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2014-10-15 14:19 ` jakub at gcc dot gnu.org
@ 2014-10-17 10:55 ` jakub at gcc dot gnu.org
  2014-10-17 11:20 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-17 10:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Fri Oct 17 10:54:54 2014
New Revision: 216393

URL: https://gcc.gnu.org/viewcvs?rev=216393&root=gcc&view=rev
Log:
    PR tree-optimization/63464
    * gimple.h (gimple_seq_discard): New prototype.
    * gimple.c: Include stringpool.h and tree-ssanames.h.
    (gimple_seq_discard): New function.
    * optabs.h (lshift_cheap_p): New prototype.
    * optabs.c (lshift_cheap_p): New function, moved from...
    * tree-switch-conversion.c (lshift_cheap_p): ... here.
    * tree-ssa-reassoc.c: Include gimplify.h and optabs.h.
    (reassoc_branch_fixups): New variable.
    (update_range_test): Add otherrangep and seq arguments.
    Unshare exp.  If otherrange is NULL, use for other ranges
    array of pointers pointed by otherrangep instead.
    Emit seq before gimplified statements for tem.
    (optimize_range_tests_diff): Adjust update_range_test
    caller.
    (optimize_range_tests_xor): Likewise.  Fix up comment.
    (extract_bit_test_mask, optimize_range_tests_to_bit_test): New
    functions.
    (optimize_range_tests): Adjust update_range_test caller.
    Call optimize_range_tests_to_bit_test.
    (branch_fixup): New function.
    (execute_reassoc): Call branch_fixup.

    * gcc.dg/torture/pr63464.c: New test.
    * gcc.dg/tree-ssa/reassoc-37.c: New test.
    * gcc.dg/tree-ssa/reassoc-38.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/torture/pr63464.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-37.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-38.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple.c
    trunk/gcc/gimple.h
    trunk/gcc/optabs.c
    trunk/gcc/optabs.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c
    trunk/gcc/tree-switch-conversion.c


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2014-10-17 10:55 ` jakub at gcc dot gnu.org
@ 2014-10-17 11:20 ` jakub at gcc dot gnu.org
  2014-10-17 11:49 ` ubizjak at gmail dot com
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-17 11:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Patch committed.  There is still an issue that remains, e.g. on vrp66.c
testcase
we often get code like:
  _120 = 3314649325744685057 >> _119;
  _121 = _120 & 1;
  _25 = _121 ^ 1;
  _122 = (_Bool) _25;
  if (_122 != 0)
when actually
  _120 = 3314649325744685057 >> _119;
  _121 = _120 & 1;
  if (_121 == 0)
would be enough and shorter.  And it isn't just a GIMPLE preference, it
actually
shows up in the generated code.  On vrp66.c on x86_64 at -O2, I'm seeing just
24 btq instructions (before the optimization 0) and 22 shrq instructions
(again, before 0), where the assembly typically looks like:
        movabsq $-9223372032543031257, %rax
        movl    %edi, %ecx
        shrq    %cl, %rax
        andl    $1, %eax
        xorq    $1, %rax
        andl    $1, %eax
i.e. pretty much the same nonsense, at least the last andl $1 is redundant,
because the first andl ensures %eax is 0 or 1, the xorq turns that into 1 or 0
and so the second andl is useless.  It might be nicer if the code used btq/setc
instead.

So, supposedly there is something we want to match-and-simplify, perhaps also
something we want to simplify at the RTL level, and check if bt+set{,n}c might
not be beneficial here compared to the shift + and + xor.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2014-10-17 11:20 ` jakub at gcc dot gnu.org
@ 2014-10-17 11:49 ` ubizjak at gmail dot com
  2015-01-16 17:14 ` jakub at gcc dot gnu.org
  2015-01-16 17:28 ` jakub at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: ubizjak at gmail dot com @ 2014-10-17 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #15 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #14)

> So, supposedly there is something we want to match-and-simplify, perhaps
> also something we want to simplify at the RTL level, and check if
> bt+set{,n}c might not be beneficial here compared to the shift + and + xor.

Generation of "bt" is quite fragile and heavily depends on combiner to do its
job correctly. IIRC, there were some talks to introduce bit-test tree code (and
corresponding RTX ?) to enable further optimizations. Something like how bswap
is now handled throughout the compilation.
>From gcc-bugs-return-464346-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org Fri Oct 17 11:59:17 2014
Return-Path: <gcc-bugs-return-464346-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Delivered-To: listarch-gcc-bugs@gcc.gnu.org
Received: (qmail 8078 invoked by alias); 17 Oct 2014 11:59:17 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Delivered-To: mailing list gcc-bugs@gcc.gnu.org
Received: (qmail 8043 invoked by uid 48); 17 Oct 2014 11:59:13 -0000
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug ipa/63575] [5 Regression] ICF miscompiles libstdc++
Date: Fri, 17 Oct 2014 11:59:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: ipa
X-Bugzilla-Version: 5.0
X-Bugzilla-Keywords: wrong-code
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: RESOLVED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 5.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_status resolution
Message-ID: <bug-63575-4-ugGHxRGvyJ@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-63575-4@http.gcc.gnu.org/bugzilla/>
References: <bug-63575-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-10/txt/msg01367.txt.bz2
Content-length: 675

https://gcc.gnu.org/bugzilla/show_bug.cgi?idc575

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Seems to be fixed with

2014-10-17  Martin Liska  <mliska@suse.cz>

        * ipa-icf.c (sem_function::merge): Local flags are set to false
        to enforce equal calling convention to be used.
        * opts.c (common_handle_option): Indentation fix.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2014-10-17 11:49 ` ubizjak at gmail dot com
@ 2015-01-16 17:14 ` jakub at gcc dot gnu.org
  2015-01-16 17:28 ` jakub at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-01-16 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, I had a brief look at the #c14 issues.

--- gcc/match.pd.jj    2015-01-05 13:07:14.000000000 +0100
+++ gcc/match.pd    2015-01-16 16:27:08.053817209 +0100
@@ -614,6 +614,16 @@ (define_operator_list inverted_tcc_compa
   (cmp (bit_and (lshift integer_onep @0) integer_onep) integer_zerop)
   (icmp @0 { build_zero_cst (TREE_TYPE (@0)); })))

+/* ((type) ((A & 1) ^ 1)) == 0 -> (A & 1) != 0
+   ((type) ((A & 1) ^ 1)) != 0 -> (A & 1) == 0 */
+(for cmp (ne eq)
+     icmp (eq ne)
+ (simplify
+  (cmp (convert?@0 (bit_xor (bit_and@1 @2 integer_onep) integer_onep))
+       integer_zerop)
+  (if (INTEGRAL_TYPE_P (TREE_TYPE (@1)) && INTEGRAL_TYPE_P (TREE_TYPE (@0)))
+   (icmp @1 { build_zero_cst (TREE_TYPE (@1)); }))))
+
 /* Simplifications of conversions.  */

 /* Basic strip-useless-type-conversions / strip_nops.  */

worked (tried vrp66.c only) and managed to simplify *.optimized dump, though
there isn't DCE post forwprop4 so the dead ^1 and casts still stayed.
But apparently RTL optimization passes managed to fix that up and we get the
same assembly, so not sure if the above is worth pushing in.

Then I wanted to do at least something with the useless andl $1, %eax; xorl $1,
%eax; andl $1, %eax where the second andl is definitely complete unneeded.
I see two issues there.  One is that the ^ 1 does not have range info on it, as
it has been created by VRP2 pass.  For that I have:

--- gcc/tree-vrp.c.jj    2015-01-15 20:25:11.000000000 +0100
+++ gcc/tree-vrp.c    2015-01-16 17:33:21.041941783 +0100
@@ -8993,11 +8993,21 @@ simplify_truth_ops_using_ranges (gimple_
   /* For A != B we substitute A ^ B.  Either with conversion.  */
   else if (need_conversion)
     {
-      tree tem = make_ssa_name (TREE_TYPE (op0));
+      tree type = TREE_TYPE (op0);
+      tree tem = make_ssa_name (type);
       gassign *newop
     = gimple_build_assign (tem, BIT_XOR_EXPR, op0, op1);
       gsi_insert_before (gsi, newop, GSI_SAME_STMT);
       gimple_assign_set_rhs_with_ops (gsi, NOP_EXPR, tem);
+
+      /* Record value range info for TEM.  */
+      value_range_t vr = VR_INITIALIZER;
+      extract_range_from_binary_expr (&vr, BIT_XOR_EXPR, type, op0, op1);
+      if (INTEGRAL_TYPE_P (type)
+      && (vr.type == VR_RANGE || vr.type == VR_ANTI_RANGE)
+      && TREE_CODE (vr.min) == INTEGER_CST
+      && TREE_CODE (vr.max) == INTEGER_CST)
+    set_range_info (tem, vr.type, vr.min, vr.max);
     }
   /* Or without.  */
   else

patch (also tested just on vrp66.c).  And the second issue is that nothing
actually uses the range info during expansion.  In particular, this is from
reduce_to_bit_field_precision, but unfortunately we don't pass the original
tree to that function.  If we did, say by adding (optionally NULL?) an extra
tree argument to REDUCE_BIT_FIELD macro, then we could in that function check
if it is an SSA_NAME for which we do have range info, and if the range info
suggests we don't have to mask (for unsigned types) or shift left/right (for
signed types), then it could avoid that step.  Richard, what do you think about
that?  Of course, for rtxes that don't have any corresponding tree we would
just pass NULL or some other value to make it clear that we don't know the
range info.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug tree-optimization/63464] compare one character to many: faster
  2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2015-01-16 17:14 ` jakub at gcc dot gnu.org
@ 2015-01-16 17:28 ` jakub at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-01-16 17:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Or try to deal with this in combine.  We have:
(insn 91 90 92 18 (parallel [
            (set (reg:DI 149 [ D.1943 ])
                (and:DI (reg:DI 147 [ D.1943 ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 379 {*anddi_1}
     (expr_list:REG_DEAD (reg:DI 147 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 92 91 94 18 (parallel [
            (set (reg:DI 150 [ D.1943 ])
                (xor:DI (reg:DI 149 [ D.1943 ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 401 {*xordi_1}
     (expr_list:REG_DEAD (reg:DI 149 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 94 92 95 18 (parallel [
            (set (reg:QI 93 [ D.1944 ])
                (and:QI (subreg:QI (reg:DI 150 [ D.1943 ]) 0)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 383 {*andqi_1}
     (expr_list:REG_DEAD (reg:DI 150 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

and I believe nonzero_bits is smart enough to say that nonzero_bits of
(subreg:QI (reg:DI 150 [ D.1943 ]) 0) is 1.  But the problem is that
simplify-rtl attempts to match all kinds of weird patterns, like:
(parallel [
        (set (reg:QI 93 [ D.1944 ])
            (eq:QI (zero_extract:DI (reg:DI 147 [ D.1943 ])
                    (const_int 1 [0x1])
                    (const_int 0 [0]))
                (const_int 0 [0])))
        (clobber (reg:CC 17 flags))
    ])
etc.  Perhaps if everything fails, if for and with constant it would just try
to call nonzero_bits on the non-constant operand and if only bits set in the
constant mask are set, it would try to match the stmt into a plain assignment?


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-01-16 17:28 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-05 21:28 [Bug tree-optimization/63464] New: compare one character to many: faster glisse at gcc dot gnu.org
2014-10-06 11:27 ` [Bug tree-optimization/63464] " jakub at gcc dot gnu.org
2014-10-06 11:52 ` rguenther at suse dot de
2014-10-06 16:59 ` jakub at gcc dot gnu.org
2014-10-07 11:11 ` jakub at gcc dot gnu.org
2014-10-07 11:15 ` rguenther at suse dot de
2014-10-07 20:15 ` glisse at gcc dot gnu.org
2014-10-07 21:01 ` glisse at gcc dot gnu.org
2014-10-10 12:16 ` jakub at gcc dot gnu.org
2014-10-13 15:51 ` jakub at gcc dot gnu.org
2014-10-13 15:53 ` jakub at gcc dot gnu.org
2014-10-15 14:19 ` jakub at gcc dot gnu.org
2014-10-17 10:55 ` jakub at gcc dot gnu.org
2014-10-17 11:20 ` jakub at gcc dot gnu.org
2014-10-17 11:49 ` ubizjak at gmail dot com
2015-01-16 17:14 ` jakub at gcc dot gnu.org
2015-01-16 17:28 ` jakub at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).