On 09/28/2014 04:20 AM, Jan Hubicka wrote:
>>
>> Hi.
>>
>> Thank you Markus for presenting the numbers; they correspond with what I measured. If I see correctly, the IPA ICF pass takes about 7 seconds,
>> the rest is distributed between the verifier (not interesting for a release version of the compiler) and 'phase opt and generate'. Any idea
>> what can make the difference?
>
> phase opt and generate just combines all the optimization times together, so it
> is the same 7 seconds as in the ICF pass :)
> 1GB of function bodies just to eliminate 2-3% of code seems quite a lot. Do you
> have any idea how many of those turn out to be different?
> It would be nice to be able to release the duplicate bodies from memory after
> the equivalency was established...
>
> Honza
>
>>
>> Martin

Hello.

After a few days of measurement and tuning, I was able to get the numbers into the following shape:
Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1412 kB ( 0%) ggc
 phase opt and generate  :  27.83 (59%) usr   0.66 (19%) sys  28.52 (37%) wall 1028813 kB (24%) ggc
 phase stream in         :  16.90 (36%) usr   0.63 (18%) sys  17.60 (23%) wall 3246453 kB (76%) ggc
 phase stream out        :   2.76 ( 6%) usr   2.19 (63%) sys  31.34 (40%) wall       2 kB ( 0%) ggc
 callgraph optimization  :   0.36 ( 1%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall      40 kB ( 0%) ggc
 ipa dead code removal   :   3.31 ( 7%) usr   0.01 ( 0%) sys   3.25 ( 4%) wall       0 kB ( 0%) ggc
 ipa virtual call target :   3.69 ( 8%) usr   0.03 ( 1%) sys   3.80 ( 5%) wall      21 kB ( 0%) ggc
 ipa devirtualization    :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall   13704 kB ( 0%) ggc
 ipa cp                  :   1.11 ( 2%) usr   0.07 ( 2%) sys   1.17 ( 2%) wall  188558 kB ( 4%) ggc
 ipa inlining heuristics :   8.17 (17%) usr   0.14 ( 4%) sys   8.27 (11%) wall  494738 kB (12%) ggc
 ipa comdats             :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   1.86 ( 4%) usr   0.40 (11%) sys   2.20 ( 3%) wall  537970 kB (13%) ggc
 ipa lto gimple out      :   0.19 ( 0%) usr   0.08 ( 2%) sys   0.27 ( 0%) wall       2 kB ( 0%) ggc
 ipa lto decl in         :  12.20 (26%) usr   0.37 (11%) sys  12.64 (16%) wall 2441687 kB (57%) ggc
 ipa lto decl out        :   2.51 ( 5%) usr   0.21 ( 6%) sys   2.71 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto constructors in :   0.13 ( 0%) usr   0.02 ( 1%) sys   0.17 ( 0%) wall   15692 kB ( 0%) ggc
 ipa lto constructors out:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.54 ( 1%) usr   0.09 ( 3%) sys   0.63 ( 1%) wall  407182 kB (10%) ggc
 ipa lto decl merge      :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.34 ( 2%) wall    8220 kB ( 0%) ggc
 ipa lto cgraph merge    :   1.00 ( 2%) usr   0.00 ( 0%) sys   1.00 ( 1%) wall   14605 kB ( 0%) ggc
 whopr wpa               :   0.92 ( 2%) usr   0.00 ( 0%) sys   0.89 ( 1%) wall       1 kB ( 0%) ggc
 whopr wpa I/O           :   0.01 ( 0%) usr   1.90 (55%) sys  28.31 (37%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   2.81 ( 6%) usr   0.01 ( 0%) sys   2.83 ( 4%) wall    4943 kB ( 0%) ggc
 ipa reference           :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.35 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.20 ( 0%) usr   0.01 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   1.62 ( 3%) usr   0.00 ( 0%) sys   1.63 ( 2%) wall       0 kB ( 0%) ggc
 ipa icf                 :   2.65 ( 6%) usr   0.02 ( 1%) sys   2.68 ( 3%) wall    1352 kB ( 0%) ggc
 inline parameters       :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.11 ( 0%) usr   0.01 ( 0%) sys   0.08 ( 0%) wall   18919 kB ( 0%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.24 ( 1%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall   11325 kB ( 0%) ggc
 tree operand scan       :   0.15 ( 0%) usr   0.02 ( 1%) sys   0.18 ( 0%) wall  116283 kB ( 3%) ggc
 dominance frontiers     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.13 ( 0%) usr   0.01 ( 0%) sys   0.16 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.01 ( 0%) usr   0.02 ( 1%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 loop fini               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.55 ( 1%) usr   0.00 ( 0%) sys   0.56 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :  47.49             3.48            77.46            4276682 kB

I was also able to reduce the function bodies loaded in WPA to 35% (from the previous 55%). The main
speed problem was hidden in the work list for congruence classes, where hash_set was used. I had chosen that
data structure because it supports a delete operation, but it was really slow. Thus, hash_set was replaced with
a linked list, and a flag is used to identify whether a set has been removed or not.
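
To illustrate the idea only (this is not the code from the patch), here is a minimal sketch of a
work list backed by a linked list with a per-class flag. The names congruence_class, in_worklist
and worklist are hypothetical, and std::list stands in for whatever list type the patch really uses.
Removal just clears the flag; stale list entries are skipped when popping.

```cpp
#include <list>

// Hypothetical stand-in for a congruence class; not GCC's actual type.
struct congruence_class
{
  unsigned id;
  bool in_worklist = false;   // true while the class counts as queued

  explicit congruence_class (unsigned i) : id (i) {}
};

// Work list with lazy deletion.  Assumes every class outlives the list,
// because removed entries stay in the list until pop () skips them.
class worklist
{
  std::list<congruence_class *> m_items;

public:
  // Queue a class unless it is already queued.
  void push (congruence_class *cls)
  {
    if (cls->in_worklist)
      return;
    cls->in_worklist = true;
    m_items.push_back (cls);
  }

  // "Delete" in constant time: only the flag is cleared, no list search.
  void remove (congruence_class *cls)
  {
    cls->in_worklist = false;
  }

  // Return the next class that is still queued, or nullptr when none is left.
  congruence_class *pop ()
  {
    while (!m_items.empty ())
      {
        congruence_class *cls = m_items.front ();
        m_items.pop_front ();
        if (cls->in_worklist)
          {
            cls->in_worklist = false;
            return cls;
          }
        // Otherwise the entry was removed earlier; skip it.
      }
    return nullptr;
  }
};
```

The trade-off in this sketch is that removed entries stay in the list until they are popped, so the
list can temporarily hold more nodes than there are live work items, but no hashing is needed for
insertion or removal.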

I have no clue how complicated it would be to turn the release_body function into an operation that
really releases the memory.

Markus' problem with -fprofile-use has been removed; IPA-ICF precedes the devirtualization pass. I hope that is fine?

There's a new version of the patch, and I plan to reply to both of Honza's emails where he pointed out some nits.

Thanks,
Martin