From: Teresa Johnson
To: Jan Hubicka
Cc: reply@codereview.appspotmail.com, davidxl@google.com, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Add working-set size and hotness information to fdo summary (issue6465057)
Date: Mon, 20 Aug 2012 14:53:00 -0000
In-Reply-To: <20120820094833.GB5505@kam.mff.cuni.cz>
References: <20120816142137.7F5EA61414@tjsboxrox.mtv.corp.google.com> <20120818081956.GD26255@kam.mff.cuni.cz> <20120820094833.GB5505@kam.mff.cuni.cz>

On Mon, Aug 20, 2012 at 2:48 AM, Jan Hubicka wrote:
>> Well, it should store the largest working set in BBs, or one that came from a much longer run. But I see the issue you are pointing out. The num_counters (the number of hot bbs) should be reasonable, as the total number of BBs is the same between all runs, and I want to show the largest or most dominant working set in terms of the number of hot bbs. But the min_bb_counter will become more and more inaccurate as the number of runs increases, since the counter values are accumulated.
>
> Yes, and that is the one that we plan to use to determine hot/cold decisions on, right?

Yes, so I agree it is important to do something to update this as profiles are merged.

> Note that there is really no 1-1 correspondence between BBs and counters. For each function there should be num_edges-num_bbs+1 counters. What do you plan to use BB counts for?

True, although what I am using this for is to get an idea of the size of the working set, to help identify large icache footprints in order to control code-size-increasing optimizations such as unrolling. The idea is that a huge number of hot counters indicates a huge number of hot bbs, which in turn indicates a huge number of hot instructions and therefore high icache pressure. We're using it in the google branch to control unrolling successfully.

>> I typed up a detailed email below on why getting this right would be difficult, but then I realized there might be a fairly simple accurate solution, which I'll describe first:
>>
>> The only way I see to do this completely accurately is to take two passes through the existing gcda files that we are merging into: one to read in all the counter values and merge them into all the counter values in the arrays from the current run (after which we can do the histogramming/working set computation accurately from the merged counters), and a second to rewrite them. In libgcov this doesn't seem like it would be too difficult to do, although it is a little bit of restructuring of the main merging loop and needs some special handling for buffered functions (which could increase the memory footprint a bit if there are many of these, since they will all need to be buffered across the iteration over all the gcda files).
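
To make that two-pass restructuring concrete, here is a rough sketch of the shape the dump loop could take. The type and helper names below are simplified placeholders rather than libgcov's real interfaces, so treat this only as an outline under those assumptions, not the actual implementation:

/* Sketch only: the structs and the three helpers are hypothetical
   placeholders standing in for libgcov's per-object data and for
   existing or new logic; they are not real libgcov entry points.  */

struct gcov_working_set_summary
{
  /* Histogram / working set data computed from the merged counters.  */
  unsigned num_hot_counters;
};

struct gcov_info
{
  struct gcov_info *next;            /* next profiled object (simplified) */
  /* ... filename, counter arrays, merge functions, ...  */
};

extern struct gcov_info *gcov_list;  /* objects registered at startup */

void read_and_merge_counters (struct gcov_info *gi);
void compute_working_set (struct gcov_info *list,
                          struct gcov_working_set_summary *ws);
void rewrite_gcda_with_summary (struct gcov_info *gi,
                                const struct gcov_working_set_summary *ws);

void
gcov_dump_two_pass (void)
{
  struct gcov_info *gi;
  struct gcov_working_set_summary ws;

  /* Pass 1: merge the on-disk counters of every gcda file into the
     in-memory arrays of the current run.  Buffered functions have to
     stay buffered across this whole loop, which is the memory cost
     mentioned above.  */
  for (gi = gcov_list; gi; gi = gi->next)
    read_and_merge_counters (gi);

  /* With the merged counters complete, the histogram / working set can
     be computed accurately, once, over all objects.  */
  compute_working_set (gcov_list, &ws);

  /* Pass 2: rewrite each gcda file with the merged counters plus the
     now-accurate program summary.  */
  for (gi = gcov_list; gi; gi = gi->next)
    rewrite_gcda_with_summary (gi, &ws);
}

Note that the locking/races question discussed further down applies directly to this structure: the gcda files would have to stay locked (or be re-checked) between the two passes.
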
>> The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.
>
> This is something Nathan introduced years back. The idea was IMO to handle more accurately objects linked into multiple binaries. I am not sure if the code really works or worked at some point.
>
> The idea, as I recall it, was to produce an overall checksum of all objects and have different profile records for each combination.
>
> This is not really useful for profile feedback, as you generate a single object file for all uses, but it might become useful for LTO, where you know which binary you are linking into. I am not really sure it is worth all the infrastructure needed to support this, though.

Ok, so perhaps for the merging in coverage.c we can do something less accurate, and worry more about the libgcov merging.

>> > Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. doing the accumulation loop in GCC instead of within libgcov)?
>>
>> That doesn't completely solve the problem, unfortunately. The reason is that you don't know which histogram entry contains a particular block each run, and therefore you don't know how to compute the new combined histogram value and index for that bb counter. For example, a particular histogram index may have 5 counters (bbs) in it for one run and 2 counters (bbs) in it for a second run, so the question is how to compute the new entry for all of those bb counters, as the 5 bbs from the first run may or may not be a superset of the 2 from the second run. You could assume that the bbs have the same relative order of hotness between runs, and combine the bb counters accordingly, but there is still some trickiness because you have to apportion the cumulative counter sum stored in the histogram entry between new entries.
>>
>> For example, assume the highest indexed non-zero entries (the histogram buckets containing the largest counter values) in the two incoming histograms are:
>>
>> histogram 1:
>>   index 100: 4 counters, cumulative value X, min counter value minx
>>   ...
>>
>> histogram 2:
>>   index 100: 2 counters, cumulative value Y, min counter value miny
>>   index  99: 3 counters, cumulative value W, min counter value minw
>>   ...
>>
>> To merge, you could assume something like 2 counters with a new cumulative value (Y + X*2/4) and new min counter value minx+miny, which go into the merged histogram at the index corresponding to counter value minx+miny. And then 2 counters with a new cumulative value (W*2/3 + X*2/4) and new min counter value minx+minw, which go into the merged histogram at the index corresponding to counter value minw+minx. Etc. Not entirely accurate, although it might be a reasonable approximation, but it also requires a number of division operations during the merge in libgcov.
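
A minimal sketch of the apportionment arithmetic for one such pairing is below. hist_entry and its fields are illustrative names, not the patch's actual structures, and a full merge would still have to walk both histograms from the hottest bucket downward, carrying leftover counters forward:

/* Illustrative only; not the patch's actual structures.  One histogram
   bucket: how many counters landed in it, their summed value, and the
   smallest of them.  */
typedef long long gcov_type;   /* simplified stand-in for gcov's counter type */

struct hist_entry
{
  unsigned num_counters;
  gcov_type cum_value;
  gcov_type min_value;
};

/* Pair the N hottest counters of E1 with the N hottest of E2, where
   N = min (e1->num_counters, e2->num_counters), assuming the relative
   hotness order of the BBs is the same in both runs.  Each side's
   cumulative value is apportioned proportionally, which is where the
   divisions at merge time come from.  The merged entry would then be
   placed in the bucket corresponding to its new min_value.  */
static struct hist_entry
merge_hist_entries (const struct hist_entry *e1, const struct hist_entry *e2)
{
  struct hist_entry m;
  unsigned n = e1->num_counters < e2->num_counters
               ? e1->num_counters : e2->num_counters;

  m.num_counters = n;
  m.cum_value = e1->cum_value * n / e1->num_counters
                + e2->cum_value * n / e2->num_counters;
  m.min_value = e1->min_value + e2->min_value;
  return m;
}

Applied to the example above, merging histogram 1's index-100 entry {4, X, minx} with histogram 2's index-100 entry {2, Y, miny} yields {2, Y + X*2/4, minx+miny}, and the two leftover counters from histogram 1 would then be paired with histogram 2's index-99 entry in the same way.
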
>> Another possibility, which might also provide a reasonable approximation, would be to scale the min_bb_counter values in the working sets by sum_all_merged/sum_all_orig, where sum_all_merged is the new sum_all and sum_all_orig is the sum_all from the summary whose working_set was chosen to be propagated to the new merged summary. This also requires some divisions at libgcov merge time, unless we save the original sum_all along with the working set in the summary and do the scaling at profile-use time.
>>
>> > By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...
>>
>> It is a bit more, as the histogram will have 64*4 = 256 entries for 64-bit counters, and each entry contains 2 gcov_type counter values and one unsigned int. The working set entries only have one gcov_type counter and one unsigned. So it could be close to 4x.
>>
>> What do you think?
>
> So we have options:
> 1) ignore the problem that summaries become inaccurate with many train runs, as we do now.

Or 1b) do some sum_all-based scaling of the min_bb_counter values, as I describe a couple of paragraphs above.

> 2) write the histogram only, do not track BB counts or approximate them by scaling, and perhaps retire max_counter from the summary in favour of a histogram estimate.

By max_counter do you mean sum_max?

> We will pay by more data being written (I think the histogram may actually compress pretty well by skipping zero entries, don't know) and by getting only an estimated max_count from the histogram, which should still be good in practice (max_count is used only for the hot/cold decisions, and those will be histogram based anyway).

Regarding the data being written, I think we can minimize this since the histogram should be pretty sparse. So store only non-zero entries. The histogram indices can be recomputed on read from the min counter value stored in the entry.
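
For instance, if the 64*4 bucketing mentioned above means one log2 bucket per bit position with 4 linear sub-buckets, the index is a pure function of the counter value and can be recomputed when a sparse entry is read back. The sketch below is only illustrative; the function name and the exact encoding are assumptions, not necessarily what the patch implements:

/* Illustrative only: assumes 64 * 4 = 256 buckets, selected by the
   position of the counter's leading 1 bit plus 4 linear sub-buckets
   taken from the next two bits.  The patch's actual encoding may
   differ.  */
typedef long long gcov_type;   /* simplified stand-in for gcov's counter type */

static unsigned
counter_to_hist_index (gcov_type value)
{
  unsigned h = 0;
  gcov_type v = value;

  /* Find the position of the most significant set bit.  */
  while (v >>= 1)
    h++;

  if (h < 2)
    return (unsigned) value;   /* counter values 0..3 map directly */

  /* The two bits below the leading one select a linear sub-bucket.  */
  return h * 4 + (unsigned) ((value >> (h - 2)) & 3);
}

With something like this, a sparse on-disk record only needs each non-zero entry's counter population, cumulative value, and min counter value; the reader recomputes the index from the stored min counter value to place the entry back in the in-memory array.
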
> 3) do two-stage processing, reading the data into memory first, producing the summaries, and writing them next.
>
> 2) seems appealing to me because it is simple, but there are limitations in what it can handle.
> 3) solves the precision problems, but how do we handle the locking & races? In a case like bootstrap, where GCCs are executed in parallel, we will end up with random results if the files get modified by another profiled run of GCC in between read-in and write-out.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does it work for LIPO?

I think the locking issue is reduced for LIPO because each gcda file is read/merged/written in a single pass as it is on trunk, and the LIPO module info is written subsequently to a separate file. Although that could certainly get a little stale if there are many simultaneous profile runs merging the files.

To address this for the working set summary info, I think we would have to maintain the current per-gcda-file read/merge/write for the counters but not write any working set summary info on the first pass, then compute the working set summary and rewrite all the gcda files with it. The working set summary may get a little stale before it is written (just as in the LIPO case for the module info), although at least the individual counters will not. Since the individual counters will get merged properly, there is a good chance that one of the final profile runs will compute an accurate working set as it is finishing up.

If this approach seems feasible, then we could stick with the current approach of emitting the working set array in the summary, mitigating it somewhat by doing the sum_all-based scaling of the counter values, and then in a follow-on patch restructure the merging code to delay the working set computation as described above. Otherwise, either go with the approach of scaling the counter values using the sum_all ratio, or encode/merge the histograms instead.

Teresa

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413