In-Reply-To: <20121213062128.GK2315@tucnak.redhat.com>
References: <20121212163722.GA21037@atrey.karlin.mff.cuni.cz> <20121212183036.GB5303@atrey.karlin.mff.cuni.cz> <20121213011933.GB21037@atrey.karlin.mff.cuni.cz> <20121213062128.GK2315@tucnak.redhat.com>
Date: Thu, 13 Dec 2012 10:22:00 -0000
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
From: Richard Biener
To: Jakub Jelinek
Cc: Xinliang David Li, Jan Hubicka, GCC Patches, Teresa Johnson
X-SW-Source: 2012-12/txt/msg00899.txt.bz2

On Thu, Dec 13, 2012 at 7:21 AM, Jakub Jelinek wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> >> > cache than fully blowin function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> libcall. The PLT call overhead is simply too high.

I believe the PLT call overhead may be effectively zero if the benchmarking
is just a loop around a memcpy.  Thus for measuring the PLT overhead
I call the benchmark broken ;)

Richard.

> Jakub
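
For illustration only, the kind of benchmark being criticised above (a tight loop around memcpy) might look like the sketch below. It is not the memtest script mentioned in the thread, which is not shown here. The names time_copies, now_ns, src_buf and dst_buf are hypothetical, and the empty asm statement is a GCC/Clang extension used only as a compiler barrier. The assumed reason the loop hides call overhead is that, with default lazy binding, the symbol is resolved on the first call and every later iteration takes the same already-resolved, presumably well-predicted path; that explanation is an assumption and is not stated in the thread.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical scratch buffers for the sketch. */
static unsigned char src_buf[64];
static unsigned char dst_buf[64];

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Time `iters` back-to-back copies of `size` bytes and return the total
   elapsed nanoseconds.  The size is read through a volatile so the
   compiler cannot treat it as a compile-time constant; whether the copy
   then becomes a library call or an inline sequence depends on the
   build flags (for example the -mstringop-strategy values discussed in
   the thread). */
static uint64_t time_copies(size_t size, unsigned long iters)
{
    assert(size <= sizeof src_buf);
    volatile size_t n = size;
    uint64_t start = now_ns();
    for (unsigned long i = 0; i < iters; i++) {
        memcpy(dst_buf, src_buf, n);
        /* Compiler barrier (GCC/Clang extension): keeps the repeated
           copies from being merged or dropped. */
        __asm__ volatile("" ::: "memory");
    }
    return now_ns() - start;
}
```

Comparing time_copies(16, 1) on a fresh process with time_copies(16, 1000000) divided by the iteration count gives a rough feel for how much of the per-call cost a loop like this amortises away. The numbers depend heavily on the CPU, the libc, and whether the binary uses lazy binding, so they say little about a call made once from cold code.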