[Bug middle-end/110692] New: epilogues for loop which can be also vectorized with half size can be improved.

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/110692] New: epilogues for loop which can be also vectorized with half size can be improved.
Date: Sun, 16 Jul 2023 19:31:27 +0000	[thread overview]
Message-ID: <bug-110692-4@http.gcc.gnu.org/bugzilla/> (raw)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692

            Bug ID: 110692
           Summary: epilogues for loop which can be also vectorized with
                    half size can be improved.
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

for:
int a[99];
test()
{
        for (int i = 0 ; i < 99; i++)
                a[i]++;
}
we produce:
   0:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 8 <test+0x8>
   7:   00 
   8:   b8 00 00 00 00          mov    $0x0,%eax
   d:   ba 00 00 00 00          mov    $0x0,%edx
  12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  18:   66 0f 6f 00             movdqa (%rax),%xmm0
  1c:   48 83 c0 10             add    $0x10,%rax
  20:   66 0f fe c1             paddd  %xmm1,%xmm0
  24:   0f 29 40 f0             movaps %xmm0,-0x10(%rax)
  28:   48 39 c2                cmp    %rax,%rdx
  2b:   75 eb                   jne    18 <test+0x18>
  2d:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 35 <test+0x35>
  34:   00 
  35:   83 05 00 00 00 00 01    addl   $0x1,0x0(%rip)        # 3c <test+0x3c>
  3c:   f3 0f 7e 0d 00 00 00    movq   0x0(%rip),%xmm1        # 44 <test+0x44>
  43:   00 
  44:   66 0f fe c1             paddd  %xmm1,%xmm0
  48:   66 0f d6 05 00 00 00    movq   %xmm0,0x0(%rip)        # 50 <test+0x50>
  4f:   00 
  50:   c3                      ret

which does the 4x vectorized loop followed by 2x vectorized loopless epilogue
and copy of remaining byte.

When bound is unknow we do:
int a[99];
test(int n)
{
        for (int i = 0 ; i < n; i++)
                a[i]++;
}
   0:   85 ff                   test   %edi,%edi
   2:   7e 70                   jle    74 <test+0x74>
   4:   8d 47 ff                lea    -0x1(%rdi),%eax
   7:   83 f8 02                cmp    $0x2,%eax
   a:   76 6d                   jbe    79 <test+0x79>
   c:   89 fa                   mov    %edi,%edx
   e:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 16 <test+0x16>
  15:   00
  16:   31 c0                   xor    %eax,%eax
  18:   c1 ea 02                shr    $0x2,%edx
  1b:   48 c1 e2 04             shl    $0x4,%rdx
  1f:   66 0f 6f 80 00 00 00    movdqa 0x0(%rax),%xmm0
  26:   00
  27:   48 83 c0 10             add    $0x10,%rax
  2b:   66 0f fe c1             paddd  %xmm1,%xmm0
  2f:   0f 29 80 00 00 00 00    movaps %xmm0,0x0(%rax)
  36:   48 39 d0                cmp    %rdx,%rax
  39:   75 e4                   jne    1f <test+0x1f>
  3b:   89 f8                   mov    %edi,%eax
  3d:   83 e0 fc                and    $0xfffffffc,%eax
  40:   40 f6 c7 03             test   $0x3,%dil
  44:   74 32                   je     78 <test+0x78>
  46:   48 63 d0                movslq %eax,%rdx
  49:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  50:   01
  51:   8d 50 01                lea    0x1(%rax),%edx
  54:   39 d7                   cmp    %edx,%edi
  56:   7e 1c                   jle    74 <test+0x74>
  58:   48 63 d2                movslq %edx,%rdx
  5b:   83 c0 02                add    $0x2,%eax
  5e:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  65:   01
  66:   39 c7                   cmp    %eax,%edi
  68:   7e 0a                   jle    74 <test+0x74>
  6a:   48 98                   cltq
  6c:   83 04 85 00 00 00 00    addl   $0x1,0x0(,%rax,4)
  73:   01
  74:   c3                      ret
  75:   0f 1f 00                nopl   (%rax)
  78:   c3                      ret
  79:   31 c0                   xor    %eax,%eax
  7b:   eb c9                   jmp    46 <test+0x46>

the profitability threshold is 4.
Producing the loopless epilogue just as in the first case with additional tests
for block size would work better. The code looks quite bad for small trip
counts since there is extra jump down to 79.

next             reply	other threads:[~2023-07-16 19:31 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-16 19:31 hubicka at gcc dot gnu.org [this message]
2023-07-17  7:45 ` [Bug tree-optimization/110692] " rguenth at gcc dot gnu.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-110692-4@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).