From: Richard Biener
Date: Thu, 13 Oct 2016 09:43:00 -0000
Subject: Re: [RFC] Speed-up -fprofile-update=atomic
To: Martin Liška
Cc: Jakub Jelinek, Andi Kleen, Jeff Law, Nathan Sidwell, GCC Patches,
 "Hubicka, Jan"

On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška wrote:
> On 10/04/2016 11:45 AM, Richard Biener wrote:
>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška wrote:
>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška wrote:
>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek wrote:
>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>>
>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>>
>>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>>
>>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>>
>>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>>> increment. Also they sometimes are really slow depending on the
>>>>>>>> state of the machine.
>>>>>>>
>>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>>> using a __thread pointer as base) and just atomically merge at
>>>>>>> thread termination?
>>>>>>
>>>>>> I suggested that as well, but of course it'll have its own class of
>>>>>> issues (short-lived threads, so we need to somehow re-use counters
>>>>>> from terminated threads; a large number of threads and thus using
>>>>>> too much memory for the counters).
>>>>>>
>>>>>> Richard.
>>>>>
>>>>> Hello.
>>>>>
>>>>> I've put the approach on my TODO list; let's see whether it is doable
>>>>> in a reasonable amount of time.
>>>>>
>>>>> I've just finished some measurements to illustrate the slow-down of
>>>>> the -fprofile-update=atomic approach.
>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>>>>> -fprofile-update=atomic
>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1 s
>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38 s
>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168 s
>>>>>
>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I
>>>>> don't have much experience with default option selection, but these
>>>>> numbers can probably help.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Look at the generated code for an instrumented simple loop and see
>>>> that for the non-atomic updates we happily apply store motion to the
>>>> counter update, and thus we only get one counter update per loop exit
>>>> rather than one per loop iteration. Now see what happens for the
>>>> atomic case (I suspect you get one per iteration).
>>>>
>>>> I'll bet this accounts for most of the slowdown.
>>>>
>>>> Back in time, ICC, which had atomic counter updates (but using
>>>> function calls - ugh!), had a > 1000% overhead with FDO for tramp3d
>>>> (they also didn't have early inlining -- removing abstraction helps
>>>> reduce the number of counters significantly).
>>>>
>>>> Richard.
>>>
>>> Hi.
>>>
>>> During Cauldron I discussed with Richi approaches to speed up ARCS
>>> profile counter updates. My first attempt is to utilize TLS storage,
>>> where every function accumulates its arc counters. These are
>>> eventually added (using atomic operations) to the global counters at
>>> the very end of a function. Currently I rely on target support for
>>> TLS; it is questionable whether to have such a requirement for
>>> -fprofile-update=atomic, or whether to add a new option value like
>>> -fprofile-update=atomic-tls.
>>>
>>> Running the patch on tramp3d, compared to the previous numbers, it
>>> takes 88 s to finish. The time shrinks to 50% of the current
>>> implementation.
>>>
>>> Thoughts?
>>
>> Hmm, I thought I suggested that you can simply use automatic storage
>> (which effectively is TLS...) for regions that are not forked or
>> abnormally left (which means SESE regions that have no calls that
>> eventually terminate or throw externally).
>>
>> So why did you end up with TLS?
>
> Hi.
>
> Using TLS does not make sense - stupid mistake ;)
>
> By using SESE regions, do you mean the infrastructure that is utilized
> by the Graphite machinery?

No, just a "single-entry single-exit region", which means the placement
of the zero-initializations of the internal counters and of the updates
of the actual counters is "obvious".

Note that this "optimization" isn't one if the SESE region does not
contain cycle(s) - unless there is a way to do an atomic update of a
bunch of counters faster than doing them separately. The optimization
will also increase register pressure (or force the internal counters to
the stack), so selecting which counters to "optimize" and which ones to
leave in place might be necessary.

Richard.

> Thanks,
> Martin
>
>>
>> Richard.
>>
>>> Martin
>>>
>>>>
>>>>> Martin
>>>>>
>>>>>>
>>>>>>> Jakub
>>>>>>
>>>>>>
>>>>>
>>>
>
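For illustration, a minimal C sketch of the two update styles discussed
above, assuming a single hypothetical global counter arc_counter (the
real instrumentation emits GIMPLE updates of gcov arc counters, not C
source):

#include <stdint.h>

static uint64_t arc_counter;	/* stand-in for one gcov arc counter */

/* Plain -fprofile-update=atomic: one atomic RMW per loop iteration,
   which store motion cannot hoist out of the loop.  */
void
count_atomic (int n)
{
  for (int i = 0; i < n; i++)
    __atomic_fetch_add (&arc_counter, 1, __ATOMIC_RELAXED);
}

/* The SESE-region variant: accumulate into an automatic counter that is
   zero-initialized at region entry and flushed with a single atomic add
   at the region exit.  Only valid if the region cannot be left
   abnormally (no fork, no call that may throw or terminate).  */
void
count_sese (int n)
{
  uint64_t local = 0;		/* internal counter */
  for (int i = 0; i < n; i++)
    local++;			/* cheap non-atomic update */
  __atomic_fetch_add (&arc_counter, local, __ATOMIC_RELAXED);
}

This also shows why the transformation only pays off when the region
contains a cycle: without the loop, both variants execute exactly one
atomic add, and the second one additionally occupies a register (or a
stack slot) for the internal counter.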