From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
Date: Thu, 21 Jun 2012 08:47:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726

--- Comment #18 from rguenther at suse dot de 2012-06-21 08:46:11 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #17 from H.J. Lu 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't always be
> > preferable to a byte-wise copy.  We could at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
>
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.

Yes.  The question is what "very small" is and how we can possibly detect
"very small".  For this testcase we can derive an upper bound on the size,
which is 8, but the size is not constant.  I think unless we know we can
expand the variable-size memcpy inline with, say, three CPU instructions,
there is no reason not to call memcpy.  Thus, if the CPU could do

  tem  = unaligned-load-8-bytes-from-src-and-ignore-faults;
  mask = generate mask from size
  store-unaligned-8-bytes-with-mask

then expanding the memcpy call inline would be a win, I suppose.  AVX has
VMASKMOV, but I'm not sure using that for sizes <= 16 bytes is profitable.
Note that from the specs of VMASKMOV it seems the memory operands need to
be aligned and the mask does not support byte granularity, which would
leave us with inline expanding only the case of an at most 2-byte memcpy.

Of course, there is currently no way to record an upper bound on the size
(we do not retain value-range information - but we of course should).
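
For illustration only, a rough C sketch of the load/mask/store sequence
above, written with the SSE2 byte-granular masked store MASKMOVDQU
(_mm_maskmoveu_si128) rather than the AVX VMASKMOV forms, since VMASKMOV
lacks byte granularity.  The helper name is made up, the unaligned 16-byte
load is assumed not to fault past the end of src, and MASKMOVDQU's
non-temporal store hint would probably make this a loss for small copies
in practice - it only shows the shape of code an expander would have to emit,
not what GCC actually generates:

/* Illustrative sketch (not GCC output): byte-masked copy of n <= 16 bytes.
   Assumes reading 16 bytes from src cannot fault.  */

#include <emmintrin.h>

static void
copy_up_to_16 (void *dst, const void *src, unsigned int n)
{
  /* tem = unaligned-load-16-bytes-from-src (must not fault).  */
  __m128i data = _mm_loadu_si128 ((const __m128i *) src);

  /* mask = generate mask from size: byte i is selected iff i < n.  */
  const __m128i idx = _mm_setr_epi8 (0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
  __m128i mask = _mm_cmpgt_epi8 (_mm_set1_epi8 ((char) n), idx);

  /* store-unaligned-bytes-with-mask: MASKMOVDQU writes only the bytes of
     data whose corresponding mask byte has its MSB set.  */
  _mm_maskmoveu_si128 (data, mask, (char *) dst);
}

For the testcase here the derived bound is 8, so a scalar 8-byte
read-modify-write would also do, but the same questions about faulting
loads and store granularity apply.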