public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/114563] New: ggc_internal_alloc is slow
@ 2024-04-02 10:28 rguenth at gcc dot gnu.org
  2024-04-02 10:29 ` [Bug middle-end/114563] " rguenth at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-02 10:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

            Bug ID: 114563
           Summary: ggc_internal_alloc is slow
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

We seem to have a single list of free pages that we walk linearly:

  /* Check the list of free pages for one we can use.  */
  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp) 
    if (p->bytes == entry_size)
      break;

and if that list becomes large this is a very bad bottleneck.


* [Bug middle-end/114563] ggc_internal_alloc is slow
  2024-04-02 10:28 [Bug middle-end/114563] New: ggc_internal_alloc is slow rguenth at gcc dot gnu.org
@ 2024-04-02 10:29 ` rguenth at gcc dot gnu.org
  2024-04-03  8:44 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-02 10:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |compile-time-hog
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
See PR114480 for a testcase which shows

Samples: 299K of event 'cycles', Event count (approx.): 338413178083            
Overhead       Samples  Command  Shared Object       Symbol                     
  23.16%         67756  cc1plus  cc1plus             [.] ggc_internal_alloc


* [Bug middle-end/114563] ggc_internal_alloc is slow
  2024-04-02 10:28 [Bug middle-end/114563] New: ggc_internal_alloc is slow rguenth at gcc dot gnu.org
  2024-04-02 10:29 ` [Bug middle-end/114563] " rguenth at gcc dot gnu.org
@ 2024-04-03  8:44 ` rguenth at gcc dot gnu.org
  2024-04-03  8:49 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-03  8:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note this is likely because of release_pages keeping a large freelist when
using madvise.  After r14-9767 this improved to

   5.15%         35482  cc1plus  cc1plus           [.] ggc_internal_alloc

I've tried a quick hack to use the 'prev' field to implement a skip-list,
skipping to the next page entry with a different size.  That works
reasonably well but it also shows the freelist is heavily fragmented.

N: M P Q
100001: 98767 19662 17321
200001: 176918 68336 27167
300001: 228676 164683 27185

These are stats after N calls to alloc_page, of which M found a free
page to re-use; in that process the skip-list was used P times to skip
at least one entry and the ->next link was followed directly Q times.

It does get alloc_page out of the profile.

It might be worth keeping the list sorted in ascending ->bytes order with
this, making pagesize allocations O(1) and other sizes O(constant).

Of course using N buckets would be the straightforward thing, but then
release_pages would become complicated, especially for the malloc page
groups I guess.
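
For illustration, a minimal stand-alone sketch of the bucket idea
(hypothetical names and a fixed 4k base page size; the real allocator
would key off G.pagesize and its order sizes, and would also have to
track page groups):

  #include <stddef.h>

  #define NUM_FREE_BUCKETS 16
  #define BASE_PAGE_SIZE 4096        /* assumption; really G.pagesize  */

  struct free_page
  {
    struct free_page *next;
    size_t bytes;
  };

  static struct free_page *free_buckets[NUM_FREE_BUCKETS];
  static struct free_page *free_other;   /* sizes with no dedicated bucket  */

  /* Map a request size to a bucket index, or -1 if it has none.  */
  static int
  bucket_index (size_t bytes)
  {
    size_t pages = bytes / BASE_PAGE_SIZE;
    if (pages >= 1 && pages <= NUM_FREE_BUCKETS && bytes % BASE_PAGE_SIZE == 0)
      return (int) pages - 1;
    return -1;
  }

  /* O(1) lookup for bucketed sizes; only unusual sizes still walk a list.  */
  static struct free_page *
  pop_free_page (size_t entry_size)
  {
    int idx = bucket_index (entry_size);
    if (idx >= 0)
      {
        struct free_page *p = free_buckets[idx];
        if (p)
          free_buckets[idx] = p->next;
        return p;
      }
    for (struct free_page **pp = &free_other; *pp; pp = &(*pp)->next)
      if ((*pp)->bytes == entry_size)
        {
          struct free_page *p = *pp;
          *pp = p->next;
          return p;
        }
    return NULL;
  }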

But as said, this is likely a symptom of the MADVISE path keeping too many
page entries for the testcase, so another angle of attack is to release
them more aggressively.  I don't know how fragmented they are; we don't
seem to try sorting them before unmapping the >= free_unit chunks.


* [Bug middle-end/114563] ggc_internal_alloc is slow
  2024-04-02 10:28 [Bug middle-end/114563] New: ggc_internal_alloc is slow rguenth at gcc dot gnu.org
  2024-04-02 10:29 ` [Bug middle-end/114563] " rguenth at gcc dot gnu.org
  2024-04-03  8:44 ` rguenth at gcc dot gnu.org
@ 2024-04-03  8:49 ` rguenth at gcc dot gnu.org
  2024-04-03  9:32 ` rguenth at gcc dot gnu.org
  2024-04-03  9:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-03  8:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 57856
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57856&action=edit
quick skip-list patch

Before:

> /usr/bin/time ./cc1plus -quiet -o /dev/null /tmp/a-test-poly.ii -O
173.29user 3.25system 2:56.59elapsed 99%CPU (0avgtext+0avgdata
11311472maxresident)k
0inputs+0outputs (0major+2867887minor)pagefaults 0swaps

After:

> /usr/bin/time ./cc1plus -quiet -o /dev/null /tmp/a-test-poly.ii -O
161.23user 3.15system 2:44.44elapsed 99%CPU (0avgtext+0avgdata
11308852maxresident)k
0inputs+0outputs (0major+2868137minor)pagefaults 0swaps

The patch uses the ->prev pointer to point to the previous entry of the
next entry with differing ->bytes.  I re-compute the pointers from scratch
during release_pages and update during alloc/free but do not merge ranges
when allocating.
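
For illustration, a stand-alone sketch of what the lookup with such a
skip pointer could look like (simplified types, not the attached patch;
the real maintenance of the pointer across alloc/free is more involved):

  #include <stddef.h>

  /* Sketch: 'prev' is reused to point at the last entry of the current
     run of equal-sized free pages.  */
  struct page_entry
  {
    struct page_entry *next;
    struct page_entry *prev;   /* last entry of this run of same ->bytes  */
    size_t bytes;
  };

  static struct page_entry *
  find_free_page (struct page_entry *free_pages, size_t entry_size)
  {
    struct page_entry *p = free_pages;
    while (p)
      {
        if (p->bytes == entry_size)
          return p;
        /* Everything up to p->prev has the same ->bytes as p, so jump
           straight past the whole run instead of visiting each entry.  */
        if (p->prev && p->prev->bytes == p->bytes)
          p = p->prev->next;
        else
          p = p->next;
      }
    return NULL;
  }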

It would be possible to compute/update the skip list pointer during the
walk itself at a bit of extra cost there.

As said, if we were to maintain a sorted free_pages list this might
speed up the walk some more, as we could stop after checking the
right-size chunk.

Doing a better job in release_pages for the madvise case would be good as well.

I wonder if anybody has a preference / priority?


* [Bug middle-end/114563] ggc_internal_alloc is slow
  2024-04-02 10:28 [Bug middle-end/114563] New: ggc_internal_alloc is slow rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-04-03  8:49 ` rguenth at gcc dot gnu.org
@ 2024-04-03  9:32 ` rguenth at gcc dot gnu.org
  2024-04-03  9:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-03  9:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 57858
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57858&action=edit
better release_pages

Ah, and it's not so much fragmentation but rather the large free_unit
(1MB) threshold that's hard to reach.  The attached patch sorts the
pages before releasing contiguous spans.
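
Roughly, the idea looks like this stand-alone sketch (hypothetical
names, not the attached patch):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdlib.h>

  struct page_entry
  {
    char *page;     /* start address of the mapping  */
    size_t bytes;   /* size of the mapping  */
  };

  static int
  cmp_by_address (const void *a, const void *b)
  {
    uintptr_t pa = (uintptr_t) (*(struct page_entry * const *) a)->page;
    uintptr_t pb = (uintptr_t) (*(struct page_entry * const *) b)->page;
    return (pa > pb) - (pa < pb);
  }

  /* Sort the free page entries by address, then walk each contiguous run
     and release it in one go once it reaches at least FREE_UNIT bytes.  */
  static void
  release_sorted (struct page_entry **entries, size_t n, size_t free_unit)
  {
    qsort (entries, n, sizeof *entries, cmp_by_address);
    for (size_t i = 0; i < n; )
      {
        char *start = entries[i]->page;
        char *end = start + entries[i]->bytes;
        size_t j = i + 1;
        while (j < n && entries[j]->page == end)
          end += entries[j++]->bytes;
        if ((size_t) (end - start) >= free_unit)
          {
            /* munmap (start, end - start), or madvise with MADV_DONTNEED,
               and mark the covered entries as discarded.  */
          }
        i = j;
      }
  }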

Printing all occurrences > G.pagesize yields the list below, where the
first column is the number of times (sort -n | uniq -c) and the rest is
the size of the contiguous area and across how many page entries it is
spread.  (I notice we don't merge the page entries either, but that
would not obviously be a good thing, I guess.)

      5 8192 in 1
    465 8192 in 2
    410 12288 in 3
    317 16384 in 4
    162 20480 in 5
    158 24576 in 6
    145 28672 in 7
     20 32768 in 1
     94 32768 in 8
     81 36864 in 9
     59 40960 in 10
     61 45056 in 11
     50 49152 in 12
      1 49152 in 6
     27 53248 in 13
     20 57344 in 14
     13 61440 in 15
      5 65536 in 1
     14 65536 in 16
      5 65536 in 2
      3 69632 in 17
      1 73728 in 18
      1 77824 in 19
      1 81920 in 20
      1 86016 in 21
      1 94208 in 23
      2 98304 in 2
      1 114688 in 14
      1 118784 in 29
      1 126976 in 31
      2 131072 in 1
      1 131072 in 16
      2 131072 in 3
      1 155648 in 38
      1 159744 in 39
      1 167936 in 41
      1 196608 in 2
      1 204800 in 50
      6 204800 in 7
      4 229376 in 3
      2 262144 in 1
      1 278528 in 34
      1 344064 in 42
      1 393216 in 48
      1 524288 in 1
      1 544768 in 133
      2 565248 in 138
      1 569344 in 139
      1 573440 in 140
      1 737280 in 90
      1 835584 in 204
      1 884736 in 54
      1 999424 in 61

-- below is then released
      2 1048576 in 1
      1 1048576 in 2
      1 1310720 in 65
      1 1359872 in 129
      1 1400832 in 149
      1 1597440 in 8
      1 1605632 in 392
      3 1605632 in 7
      1 1835008 in 7
      1 1867776 in 8
      1 2023424 in 494
      3 3145728 in 6
      3 3211264 in 7
      1 4194304 in 1
      1 4685824 in 319
      1 8388608 in 1
      1 8617984 in 439
      1 8896512 in 450
      1 9363456 in 297
      1 9732096 in 480
      1 9740288 in 508
      1 9764864 in 478
      1 10485760 in 2560
      2 14049280 in 616
      3 14336000 in 628
      1 16777216 in 1
      1 33554432 in 1
      1 85975040 in 1966
      1 210501632 in 3040
      1 275062784 in 3586
      1 302637056 in 1298
      1 339976192 in 3853
      1 429260800 in 2384
      1 476405760 in 4017
      1 556277760 in 3465
      1 561905664 in 4372
      1 563216384 in 4379
      1 645537792 in 5451
      1 1515700224 in 8173
      1 1524654080 in 8195
      1 1525112832 in 8196


* [Bug middle-end/114563] ggc_internal_alloc is slow
  2024-04-02 10:28 [Bug middle-end/114563] New: ggc_internal_alloc is slow rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-04-03  9:32 ` rguenth at gcc dot gnu.org
@ 2024-04-03  9:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-03  9:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I'd say that for the sake of avoiding virtual memory fragmentation
free_unit should be equal to GGC_QUIRE_SIZE.  But we should possibly
merge adjacent entries we don't free into power-of-two chunks, and
possibly have alloc_page split a larger page when the small ones are
exhausted (at least up to GGC_QUIRE_SIZE?).
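
A stand-alone sketch of the splitting idea (hypothetical helper and
bookkeeping, not actual ggc-page.cc code):

  #include <stddef.h>

  /* Carve ENTRY_SIZE bytes off the first large-enough free chunk instead
     of mapping new memory, shrinking the chunk in place.  */
  struct free_chunk
  {
    struct free_chunk *next;
    char *start;
    size_t bytes;
  };

  static char *
  split_free_chunk (struct free_chunk *head, size_t entry_size)
  {
    for (struct free_chunk *c = head; c; c = c->next)
      if (c->bytes >= entry_size)
        {
          char *piece = c->start;
          c->start += entry_size;
          c->bytes -= entry_size;
          /* A real version would unlink chunks that shrink to zero and
             round ENTRY_SIZE up to a multiple of the page size.  */
          return piece;
        }
    return NULL;
  }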

I'll also note that while we chunk G.pagesize allocations with
GGC_QUIRE_SIZE, we don't seem to do that for the larger allocations.
We also immediately split down to G.pagesize instead of also filling
the larger orders with free pages.  Maybe, since we don't chunk the
larger orders, we shouldn't force them to be released only in large
chunks either?
