public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/111894] New: Missed vectorization opportunity
@ 2023-10-20 13:57 Mark_B53 at yahoo dot com
  2023-10-23  9:00 ` [Bug tree-optimization/111894] " rguenth at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: Mark_B53 at yahoo dot com @ 2023-10-20 13:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111894

            Bug ID: 111894
           Summary: Missed vectorization opportunity
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Mark_B53 at yahoo dot com
  Target Milestone: ---

Consider the following code that uses views to implement a two dimensional
`iota`:

    #include <array>
    #include <ranges>

    template<std::integral T>
    std::ranges::range auto iota2D(T xbound, T ybound) {
        auto fn = [=](T idx) { return std::tuple{idx / ybound, idx % ybound};
};
        return std::views::iota(T{0}, xbound * ybound) |
std::views::transform(fn);
    }

    constexpr std::size_t N = 20;
    std::array<std::array<int, N>, N> data;

    __attribute__((noinline)) void init1() {
        for (auto i : std::views::iota(size_t{}, N)) {
            for (auto j : std::views::iota(size_t{}, N)) {
                data[i][j] = 123;
            }
        }
    }

    __attribute__((noinline)) void init2() {
        for (auto [i, j] : iota2D(N,N)) {
            data[i][j] = 123;
        }
    }

Using gcc 13.2 with -O3, we see that the code using a nested loop is nicely
vectorized:

    init1():
            movdqa  xmm0, XMMWORD PTR .LC0[rip]
            mov     eax, OFFSET FLAT:data
    .L2:
            movaps  XMMWORD PTR [rax], xmm0
            add     rax, 80
            movaps  XMMWORD PTR [rax-64], xmm0
            movaps  XMMWORD PTR [rax-48], xmm0
            movaps  XMMWORD PTR [rax-32], xmm0
            movaps  XMMWORD PTR [rax-16], xmm0
            cmp     rax, OFFSET FLAT:data+1600
            jne     .L2
            ret

The code using iota2D is not vectorized:

    init2():
            xor     eax, eax
    .L6:
            mov     DWORD PTR data[0+rax*4], 123
            add     rax, 1
            cmp     rax, 400
            jne     .L6
            ret

Although GCC 13 produces much higher quality assembly than previous versions,
it fails to vectorize the loop.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* [Bug tree-optimization/111894] Missed vectorization opportunity
  2023-10-20 13:57 [Bug tree-optimization/111894] New: Missed vectorization opportunity Mark_B53 at yahoo dot com
@ 2023-10-23  9:00 ` rguenth at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-23  9:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111894

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Blocks|                            |53947
   Last reconfirmed|                            |2023-10-23
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.

Creating dr for MEM <struct array> [(value_type
&)&data]._M_elems[_7]._M_elems[_8]
analyze_innermost: t.C:23:24: missed:  failed: evolution of offset is not
affine.

we have

void init2 ()
{
  long unsigned int SR.100;
  long unsigned int ivtmp_1;
  long unsigned int _4;
  long unsigned int ivtmp_6;
  long unsigned int _7;
  long unsigned int _8;

  <bb 2> [local count: 10737416]:

  <bb 3> [local count: 1063004409]:
  # SR.100_13 = PHI <_4(5), 0(2)>
  # ivtmp_6 = PHI <ivtmp_1(5), 400(2)>
  _7 = SR.100_13 / 20;
  _8 = SR.100_13 % 20;
  MEM <struct array> [(value_type &)&data]._M_elems[_7]._M_elems[_8] = 123;
  _4 = SR.100_13 + 1;
  ivtmp_1 = ivtmp_6 - 1;
  if (ivtmp_1 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  return;

and the issue is the use of division/modulo for the array accesses, that is,
that this C++ idiom results in a "flattened" loop instead of a nest of
level two.

GCC doesn't support "un-flattening" this.  In general, when we can
see that the number of iterations is so that unrolling the loop
by VF will not cross dimensions we might be able to apply regular loop
vectorization but we currently do not have code doing that.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-10-23  9:00 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-20 13:57 [Bug tree-optimization/111894] New: Missed vectorization opportunity Mark_B53 at yahoo dot com
2023-10-23  9:00 ` [Bug tree-optimization/111894] " rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).