From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-432470-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 25860 invoked by alias); 22 Oct 2013 14:56:58 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 25770 invoked by uid 48); 22 Oct 2013 14:56:55 -0000
From: "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/47477] [4.7/4.8/4.9 regression] Sub-optimal mov at end of method
Date: Tue, 22 Oct 2013 14:56:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 4.6.0
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jakub at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 4.8.3
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-47477-4-cIGD3klrYd@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-47477-4@http.gcc.gnu.org/bugzilla/>
References: <bug-47477-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-10/txt/msg01614.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477
--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Kai Tietz from comment #17)
> What optimization you expect here?  I see by the new type-demotion pass some
> changes in optimized tree-output:

This one is for vectorization: try it with -O3 -mavx2 and look at what
vectorized loop we get.  With type demotion and promotion for the vectorized
loops (perhaps only for those and not for the scalar loops), you could get
vectorization similar to what we get for, say:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = ((short)(a[i] << 8) >> 8) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}
though even in this case I still couldn't get the sign extension to actually
be performed as a 16-bit left shift followed by a (signed) right shift, which
I guess would lead to even better code.
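
For reference, here is a small sketch of why the shift pair in that loop is a
sign extension, assuming the extension meant is that of the low 8 bits of the
16-bit value.  The helper names (sext8_by_shifts, sext8_by_cast,
count_sext_mismatches) are made up for illustration and are not from the
testcase.  The code relies on GCC's documented implementation-defined
behaviour: narrowing conversions to signed types wrap modulo 2^N and >> on a
negative value is an arithmetic shift.

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of X with a 16-bit left shift followed by a
   signed right shift.  The left shift is done on an unsigned value so that
   negative inputs do not trigger undefined behaviour.  */
static int16_t
sext8_by_shifts (int16_t x)
{
  int16_t shifted = (int16_t) (uint16_t) ((unsigned) x << 8);
  return (int16_t) (shifted >> 8);
}

/* The same value obtained by narrowing to 8 bits and widening again.  */
static int16_t
sext8_by_cast (int16_t x)
{
  return (int16_t) (int8_t) x;
}

/* Count the 16-bit inputs on which the two forms disagree.  */
static long
count_sext_mismatches (void)
{
  long bad = 0;
  for (int v = INT16_MIN; v <= INT16_MAX; v++)
    if (sext8_by_shifts ((int16_t) v) != sext8_by_cast ((int16_t) v))
      bad++;
  return bad;
}
```

Calling count_sext_mismatches () from a small main is enough to check the
claim; whether the vectorizer actually emits the two 16-bit shifts is a
separate question that only the generated assembly can answer.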
Or look at how we vectorize:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned char e = a[i];
      short c = e + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}
(note that here the forwprop pass already performs type promotion: instead of
converting a[i] to unsigned char and back to short, it computes a[i] & 255 in
short mode) and at how we could vectorize it instead with type demotions:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = (a[i] & 0xff) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}
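
A rough way to convince yourself that the demoted loop body stores the same
values as the long long one is to compare the two bodies as scalar functions.
This is only a sketch: body_wide, body_demoted and count_body_mismatches are
hypothetical names, b is sampled with a stride rather than checked
exhaustively to keep the run time short, and the final narrowing to short is
assumed to wrap modulo 2^16 as GCC documents.

```c
#include <limits.h>

/* Loop body of the testcase that uses unsigned char, short and long long.  */
static short
body_wide (short a, short b)
{
  unsigned char e = a;
  short c = e + 5;
  long long d = (long long) b + 12;
  return (short) (c + d);
}

/* Loop body after type demotion: everything is done in 16-bit types.  */
static short
body_demoted (short a, short b)
{
  unsigned short c = (a & 0xff) + 5U;
  unsigned short d = b + 12U;
  return (short) (c + d);
}

/* Compare the two bodies for every value of A and a strided sample of B;
   return the number of disagreements.  */
static long
count_body_mismatches (void)
{
  long bad = 0;
  for (int a = SHRT_MIN; a <= SHRT_MAX; a++)
    for (int b = SHRT_MIN; b <= SHRT_MAX; b += 251)
      if (body_wide ((short) a, (short) b)
          != body_demoted ((short) a, (short) b))
        bad++;
  return bad;
}
```

The two agree because only the low 16 bits of the sum are stored back into
a[i], so the wider intermediate types contribute nothing to the result.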

These are all admittedly artificial testcases, but I've seen tons of
vectorized loops that mix several types, and I think in some portion of those
loops we could either use just a single type size, or at least decrease the
number of conversions and different type sizes in the vectorized loops.