public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses
@ 2020-09-05  9:42 dmitriy.ovdienko at gmail dot com
  2020-09-05  9:42 ` [Bug libstdc++/96942] " dmitriy.ovdienko at gmail dot com
                   ` (42 more replies)
  0 siblings, 43 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-05  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

            Bug ID: 96942
           Summary: std::pmr::monotonic_buffer_resource causes CPU cache
                    misses
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dmitriy.ovdienko at gmail dot com
  Target Milestone: ---

Created attachment 49183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49183&action=edit
Original implementation

There is a webpage that compares performance of different programming
languages: C++, C, Rust, Java, etc.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/

There is a "binary trees" test there. In this test application creates `perfect
binary tree` and traverses it.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees

The fastest solution for this test is created in Rust. 

https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html

C++ implementation of this problem uses `std::pmr::monotonic_buffer_resource`
class as a memory storage.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-4.html

I like C++ very much and I've started an investigation why application compiled
in Rust is faster than C++.

At first, I've run a `perf` tool and had found that application compiled in C++
generates a lot of CPU cache misses (54%):

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_orig 21
 Performance counter stats for './bt_orig 21':

        45,104,136      cache-references                                        
        24,448,475      cache-misses              #   54.205 % of all cache
refs    
    19,904,251,283      cycles                                                  
    30,462,013,065      instructions              #    1.53  insn per cycle     
     4,834,392,341      branches                                                
           234,796      faults                                                  
                 2      migrations                                              

       2.083603709 seconds time elapsed

       5.559471000 seconds user
       0.309529000 seconds sys
```

I thought that it is caused by tree traversing. But after I've modified the
code, I found that a lot of cache misses are caused by
`std::pmr::monotonic_buffer_resource` class, which is used as a memory pool.

I've modified that sample to pre-allocate memory required to hold entire binary
tree instead of grow in geometric progression, but it had made things even
worse.

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_orig_prealloc 21
 Performance counter stats for './bt_orig_prealloc 21':

        66,400,545      cache-references                                        
        45,740,962      cache-misses              #   68.886 % of all cache
refs    
    21,461,610,267      cycles                                                  
    31,296,637,782      instructions              #    1.46  insn per cycle     
     4,967,611,660      branches                                                
           575,100      faults                                                  
                 9      migrations                                              

       2.219161594 seconds time elapsed

       5.464583000 seconds user
       0.854839000 seconds sys
```

That looks really weird and I've implemented my own allocator that behaves like
`std::pmr::monotonic_buffer_resource` and with my memory storage CPU cache
misses are dropped to 34%.

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_malloc 21
 Performance counter stats for './bt_malloc 21':

        40,713,525      cache-references                                        
        14,147,648      cache-misses              #   34.749 % of all cache
refs    
    14,823,743,812      cycles                                                  
    22,306,442,507      instructions              #    1.50  insn per cycle     
     4,331,968,591      branches                                                
            60,227      faults                                                  
                 6      migrations                                              

       1.474751692 seconds time elapsed

       4.282074000 seconds user
       0.092476000 seconds sys
```

Execution time is also dropped from 2.12s to 1.52s (on my laptop).

For completness, following is the report for application compiled in Rust:

```txt
 Performance counter stats for './rust/target/release/rust 21':

        29,268,774      cache-references                                        
        12,147,041      cache-misses              #   41.502 % of all cache
refs    
    24,539,557,585      cycles                                                  
    31,784,741,964      instructions              #    1.30  insn per cycle     
     4,829,547,556      branches                                                
            68,023      faults                                                  
                 8      migrations                                              

       2.087679654 seconds time elapsed

       7.127183000 seconds user
       0.123777000 seconds sys
```

Execution time is 2.09s

Given all above, could you review the implementation of
`std::pmr::monotonic_buffer_resource`. This memory storage has to be very fast.
In fact any attempt to speedup the application makes things slower.

Attached are original C++ implmentation, C++ implementation with preallocated
memory and implementation based on my memory allocator, which uses malloc.

I understand that any class in the `std` namespace has to conform to C++
standard, so it can be hard to fix this issue. Probably G++ compiled can be
tweaked to compile more faster code for `std::pmr::monotonic_buffer_resource`.



Environment:
* CPU: Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
* OS: Ubuntu 20.04.1 LTS
* Command line: g++ --std=c++17 -O3 --omit-frame-pointer -march=ivybridge
-lpthread
* Target: x86_64-linux-gnu
* Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.3.0-10ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic --enable-offload-targets=nvptx-none,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
* Thread model: posix
* gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
@ 2020-09-05  9:42 ` dmitriy.ovdienko at gmail dot com
  2020-09-05  9:43 ` dmitriy.ovdienko at gmail dot com
                   ` (41 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-05  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #1 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49184
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49184&action=edit
Original implementation with preallocated buffer

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
  2020-09-05  9:42 ` [Bug libstdc++/96942] " dmitriy.ovdienko at gmail dot com
@ 2020-09-05  9:43 ` dmitriy.ovdienko at gmail dot com
  2020-09-07  7:21 ` dmitriy.ovdienko at gmail dot com
                   ` (40 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-05  9:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #2 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49185
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49185&action=edit
Modified solution with custom allocator based on malloc

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
  2020-09-05  9:42 ` [Bug libstdc++/96942] " dmitriy.ovdienko at gmail dot com
  2020-09-05  9:43 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-07  7:21 ` dmitriy.ovdienko at gmail dot com
  2020-09-07  7:24 ` dmitriy.ovdienko at gmail dot com
                   ` (39 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-07  7:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #3 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49189
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49189&action=edit
Original implementation (simplified, single threaded)

Attached is a simplified original version of the benchmark.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (2 preceding siblings ...)
  2020-09-07  7:21 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-07  7:24 ` dmitriy.ovdienko at gmail dot com
  2020-09-07 10:49 ` amonakov at gcc dot gnu.org
                   ` (38 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-07  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #4 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49190
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49190&action=edit
Modified solution with custom allocator based on malloc (simplified, single
threaded)

Attached is a benchmark based on Malloc allocator, modified simplified single
threaded.

Following is a execution time for different tree depth:

                    depth_17    depth_18    depth_19
bt_pmr_0thrd        0.105s      0.313s      0.577s
bt_malloc_0thrd     0.087s      0.147s      0.448s

Commandline is:

  time ./bt_pmr_0thrd <depth>
  time ./bt_malloc_0thrd <depth>

On depth=18 boundary there is 2x times difference.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (3 preceding siblings ...)
  2020-09-07  7:24 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-07 10:49 ` amonakov at gcc dot gnu.org
  2020-09-07 16:13 ` dmitriy.ovdienko at gmail dot com
                   ` (37 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-07 10:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
You raise valid points (i.e. it would be good to understand why preallocation
is not beneficial, or what's causing the performance gap w.r.t malloc), but
looking at cache-misses counter does not make sense here (perf is not explicit
about that, but it counts misses in L3, and as you see the count is three
magnitudes lower than that of cycles&instructions, so it's not the main factor
in overall performance picture).

As for comparison against Rust, it spreads more work over available cores: you
can see that its "user time" is higher, though "wall-clock time" is same or
lower. In other words, the C++ variant does not achieve good multicore scaling.

The main gotcha here is m_b_r does not allocate on construction, but rather
allocates 2x of the preallocation size on first call to 'allocate', and then
deallocates when 'release' is called. So it repeatedly calls malloc/free in the
inner benchmark loop, whereas you custom allocator allocates on construction
and deallocates on destruction, avoiding repeated malloc/free calls in the loop
and associated lock contention when multithreaded.

(also obviously it simply does more work in 'allocate', which costs extra
cycles)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (4 preceding siblings ...)
  2020-09-07 10:49 ` amonakov at gcc dot gnu.org
@ 2020-09-07 16:13 ` dmitriy.ovdienko at gmail dot com
  2020-09-07 20:53 ` dmitriy.ovdienko at gmail dot com
                   ` (36 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-07 16:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #6 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> looking at cache-misses counter does not make sense here

Well, if you compare Rust and C++, cache-misses CPU counter differs
dramatically... and page-faults too... while amount of instructions is the
same.

Page-faults btw, can significantly affect performance too. It could happen that
that is the reason.

I've put all numbers into one table for convenience:


| CPU counter      | PMR            | Malloc         | Rust           |
|------------------|----------------|----------------|----------------|
| cache-references |     45,104,136 |     40,713,525 |     29,268,774 |
| cache-misses     |     24,448,475 |     14,147,648 |     12,147,041 |
| cycles           | 19,904,251,283 | 14,823,743,812 | 24,539,557,585 |
| instructions     | 30,462,013,065 | 22,306,442,507 | 31,784,741,964 |
| branches         |  4,834,392,341 |  4,331,968,591 |  4,829,547,556 |
| faults           |        234,796 |         60,227 |         68,023 |
| migrations       |              2 |              6 |              8 |

> The main gotcha here is m_b_r does not allocate on construction, but rather allocates 2x of the preallocation size on first call to 'allocate'

In the two previous posts I've attached a code that does not create any thread
and allocates/deallocates memory in the loop. So, both samples have the same
behaviour.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (5 preceding siblings ...)
  2020-09-07 16:13 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-07 20:53 ` dmitriy.ovdienko at gmail dot com
  2020-09-07 21:21 ` dmitriy.ovdienko at gmail dot com
                   ` (35 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-07 20:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #7 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Following are CPU counters for single threaded code. Pre-allocation is enabled.
Memory pool is created inside the loop.

```cpp
  int poolSize(int depth)
  {
    return (1 << (depth + 1)) * sizeof(Node);
  }

  int count = 0;
  for(int i = 0; i < 20; ++i)
  {
    MemoryPool store (poolSize(stretch_depth));

    Node *c = make(stretch_depth, store);
    count += c->check();
    store.release();
  }
```

Depth = 21, Pool size = 134,217,728 Bytes

|                   |            PMR |        Malloc |
|-------------------|----------------|---------------|
| cache-references  |     60,180,483 |    60,205,187 |
| cache-misses      |     50,288,765 |    50,426,418 |
| cycles            |  7,587,314,879 | 6,076,106,356 |
| instructions      | 14,347,088,112 | 8,138,591,245 |
| branches          |  2,224,641,671 | 1,550,701,277 |
| branch-misses     |      8,074,211 |     7,307,996 |
| faults            |        655,503 |       655,485 |
| migrations        |              1 |             2 |
| time elapsed, sec |           2.16 |          1.75 |
| time (user, sec)  |           1.46 |          1.01 |
| time (sys, sec)   |           0.69 |          0.73 |

Depth = 18, Pool size = 16,777,216 Bytes

|                   |           PMR |      Malloc |
|-------------------|---------------|-------------|
| cache-references  |     8,186,788 |   3,450,642 |
| cache-misses      |     6,504,691 |   1,592,945 |
| cycles            |   992,526,559 | 472,979,689 |
| instructions      | 1,806,230,679 | 766,527,818 |
| branches          |   279,352,274 | 151,353,530 |
| branch-misses     |     1,072,404 |     474,648 |
| faults            |        82,063 |       8,314 |
| migrations        |             0 |           0 |
| time elapsed, sec |          0.28 |        0.14 |
| time (user, sec)  |          0.17 |        0.13 |
| time (sys, sec)   |          0.11 |        0.01 |

Depth = 17, Pool size: 8,388,608 Bytes

|                   |         PMR |      Malloc |
|-------------------|-------------|-------------|
| cache-references  |   1,624,992 |   1,707,061 |
| cache-misses      |     867,310 |     718,011 |
| cycles            | 312,687,116 | 255,951,365 |
| instructions      | 765,410,795 | 389,671,180 |
| branches          | 118,619,222 |  74,686,565 |
| branch-misses     |     272,286 |     263,916 |
| faults            |       4,221 |       4,219 |
| migrations        |           0 |           0 |
| time elapsed, sec |        0.10 |        0.08 |
| time (user, sec)  |        0.10 |        0.07 |
| time (sys, sec)   |        0.00 |        0.01 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (6 preceding siblings ...)
  2020-09-07 20:53 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-07 21:21 ` dmitriy.ovdienko at gmail dot com
  2020-09-08  7:11 ` amonakov at gcc dot gnu.org
                   ` (34 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-07 21:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #8 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Same as above for Depth = 19

|                   |           PMR |        Malloc |
|-------------------|---------------|---------------|
| cache-references  |    16,571,923 |    16,260,256 |
| cache-misses      |    13,576,560 |    13,197,813 |
| cycles            | 2,000,406,391 | 1,566,192,030 |
| instructions      | 3,566,826,120 | 2,021,390,552 |
| branches          |   554,315,206 |   386,308,307 |
| branch-misses     |     1,832,371 |     1,790,865 |
| faults            |       163,983 |       163,966 |
| migrations        |             0 |             0 |
| time elapsed, sec |          0.58 |          0.45 |
| time (user, sec)  |          0.40 |          0.24 |
| time (sys, sec)   |          0.17 |          0.21 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (7 preceding siblings ...)
  2020-09-07 21:21 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08  7:11 ` amonakov at gcc dot gnu.org
  2020-09-08  9:39 ` dmitriy.ovdienko at gmail dot com
                   ` (33 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-08  7:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
The most pronounced difference for depth=18 seems to be caused by m_b_r
over-allocating by 2x: internally it mallocs 2x of the size given to the
constructor, and then Linux pre-faults those extra pages, penalizing the
benchmark.

Dividing estimated size by 2 to counter the over-allocation effect:

    MemoryPool store (poolSize(stretch_depth) / 2);

substantially improves the benchmark for me.

I think the rest of the slowdown can be attributed to m_b_r simply doing more
work internally compared to your bare-bones malloc allocator (I'm seeing less
pronounced differences though, I'm testing on a Sandybridge CPU with -O2).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (8 preceding siblings ...)
  2020-09-08  7:11 ` amonakov at gcc dot gnu.org
@ 2020-09-08  9:39 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 11:19 ` redi at gcc dot gnu.org
                   ` (32 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08  9:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #10 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Looks like I know why C++ sample does not use all the CPU resources.

C++ does not load threads equally. Last thread gets the most heavy task
(MAX_DEPTH) and performs N iterations alone.

Rust code instead creates a loop from MIN_DEPTH..MAX_DEPTH and runs iterations
in the threads. So at first, N threads are working on Depth=1, then all the
threads are working on Depth=2... and then all the threads are working on
Depth=MAX

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (9 preceding siblings ...)
  2020-09-08  9:39 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 11:19 ` redi at gcc dot gnu.org
  2020-09-08 11:23 ` redi at gcc dot gnu.org
                   ` (31 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #11 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #9)
> The most pronounced difference for depth=18 seems to be caused by m_b_r
> over-allocating by 2x: internally it mallocs 2x of the size given to the
> constructor, and then Linux pre-faults those extra pages, penalizing the
> benchmark.

It adds 11 bytes to the size given to the constructor (for its internal
bookkeeping) and then rounds up to a power of two.

Since the poolSize function actually returns sizeof(Node) more than it needs,
and sizeof(Node) > 11, the overallocation should be avoidable by simply fixing
poolSize to return the right value:

int poolSize(int depth)
{
  return ((1 << (depth + 1)) - 1) * sizeof(Node);
}

The original function returns a power of two, but the code actually creates an
odd number of nodes (there is only one node at depth zero, not two).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (10 preceding siblings ...)
  2020-09-08 11:19 ` redi at gcc dot gnu.org
@ 2020-09-08 11:23 ` redi at gcc dot gnu.org
  2020-09-08 11:32 ` redi at gcc dot gnu.org
                   ` (30 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #12 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #5)
> The main gotcha here is m_b_r does not allocate on construction, but rather
> allocates 2x of the preallocation size on first call to 'allocate', and then
> deallocates when 'release' is called. So it repeatedly calls malloc/free in
> the inner benchmark loop, whereas you custom allocator allocates on
> construction and deallocates on destruction, avoiding repeated malloc/free
> calls in the loop and associated lock contention when multithreaded.

m_b_r really needs a "rewind()" member to mark all allocated memory as unused
again, without actually deallocating it and returning it to the upstream
resource.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (11 preceding siblings ...)
  2020-09-08 11:23 ` redi at gcc dot gnu.org
@ 2020-09-08 11:32 ` redi at gcc dot gnu.org
  2020-09-08 11:35 ` amonakov at gcc dot gnu.org
                   ` (29 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #13 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #9)
> I think the rest of the slowdown can be attributed to m_b_r simply doing
> more work internally compared to your bare-bones malloc allocator

The malloc-based one only supports fundamental alignments and has a fixed upper
limit, whereas std::m_b_r supports extended alignments and can grow beyond the
initial capacity.

If you don't need the additional features of std::m_b_r, then of course you can
beat its performance.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (12 preceding siblings ...)
  2020-09-08 11:32 ` redi at gcc dot gnu.org
@ 2020-09-08 11:35 ` amonakov at gcc dot gnu.org
  2020-09-08 11:48 ` redi at gcc dot gnu.org
                   ` (28 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-08 11:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #14 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> It adds 11 bytes to the size given to the constructor (for its internal
> bookkeeping) and then rounds up to a power of two.

What is the purpose of this rounding up?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (13 preceding siblings ...)
  2020-09-08 11:35 ` amonakov at gcc dot gnu.org
@ 2020-09-08 11:48 ` redi at gcc dot gnu.org
  2020-09-08 11:49 ` redi at gcc dot gnu.org
                   ` (27 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #15 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Jonathan Wakely from comment #12)
> m_b_r really needs a "rewind()" member to mark all allocated memory as
> unused again, without actually deallocating it and returning it to the
> upstream resource.

Alternatively, you can use a pooling allocator as the upstream memory resource
for the std::m_b_r resource, so that when you call release() on the std::m_b_r
object it only releases it upstream, not back to malloc.

Here's an upstream allocator that only supports fundamental alignments and will
only allocate once, but keeps it around for reuse when the std::m_b_r
deallocates:

struct OnceResource : std::pmr::memory_resource
{
  void* do_allocate(std::size_t n, std::size_t alignment)
  {
    if (alignment > alignof(std::max_align_t))
      throw std::bad_alloc();

    if (!m_ptr)
    {
      m_ptr = std::malloc(n);
      if (m_ptr == nullptr)
        throw std::bad_alloc();
      m_size = n;
      m_free = true;
    }
    else if (!m_free || n > m_size)
      throw std::bad_alloc();

    return m_ptr;
  }

  void do_deallocate(void*, std::size_t, std::size_t)
  { m_free = true; }

  bool do_is_equal(const std::pmr::memory_resource& m) const noexcept
  { return std::addressof(m) == this; }

  ~OnceResource()
  { std::free(m_ptr); }

  void* m_ptr = nullptr;
  std::size_t m_size = 0;
  bool m_free = false;
};

and then ...

  OnceResource o;
  // Alloc then dealloc stretchdepth tree.
  int count = 0;
  for(int i = 0; i < 20; ++i)
  {
    MemoryPool store (poolSize(stretch_depth), &o);

This makes a big difference.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (14 preceding siblings ...)
  2020-09-08 11:48 ` redi at gcc dot gnu.org
@ 2020-09-08 11:49 ` redi at gcc dot gnu.org
  2020-09-08 11:50 ` redi at gcc dot gnu.org
                   ` (26 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #16 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #14)
> > It adds 11 bytes to the size given to the constructor (for its internal
> > bookkeeping) and then rounds up to a power of two.
> 
> What is the purpose of this rounding up?

To request a power of two size from the upstream resource, so that we don't
e.g. ask for 4090 bytes and waste the end of a page.

It might make more sense to round up to a page boundary, or some other size
instead.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (15 preceding siblings ...)
  2020-09-08 11:49 ` redi at gcc dot gnu.org
@ 2020-09-08 11:50 ` redi at gcc dot gnu.org
  2020-09-08 11:58 ` amonakov at gcc dot gnu.org
                   ` (25 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 11:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #17 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Jonathan Wakely from comment #16)
> To request a power of two size from the upstream resource, so that we don't
> e.g. ask for 4090 bytes and waste the end of a page.

Or even worse, allocate (n*4096+11) and get n+1 pages from malloc, and waste
4085 bytes from the last page.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (16 preceding siblings ...)
  2020-09-08 11:50 ` redi at gcc dot gnu.org
@ 2020-09-08 11:58 ` amonakov at gcc dot gnu.org
  2020-09-08 12:03 ` redi at gcc dot gnu.org
                   ` (24 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-08 11:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #18 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Huh? malloc is capable of splitting the tail of the last page for reuse in
subsequent small allocations, why not let it do it? It will not be "wasted".

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (17 preceding siblings ...)
  2020-09-08 11:58 ` amonakov at gcc dot gnu.org
@ 2020-09-08 12:03 ` redi at gcc dot gnu.org
  2020-09-08 12:17 ` amonakov at gcc dot gnu.org
                   ` (23 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 12:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #19 from Jonathan Wakely <redi at gcc dot gnu.org> ---
The upstream isn't necessarily malloc (although we could add a check to find
out whether it is the std::new_delete_resource() if we wanted to).

Suggestions for improvement are welcome. This is only the second time I've ever
received any indication anybody is even using the <memory_resource> header, so
I've not wasted my time tuning it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (18 preceding siblings ...)
  2020-09-08 12:03 ` redi at gcc dot gnu.org
@ 2020-09-08 12:17 ` amonakov at gcc dot gnu.org
  2020-09-08 12:53 ` dmitriy.ovdienko at gmail dot com
                   ` (22 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-08 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #20 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Round up to 64 bytes (typical cache line size).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (19 preceding siblings ...)
  2020-09-08 12:17 ` amonakov at gcc dot gnu.org
@ 2020-09-08 12:53 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 16:13 ` redi at gcc dot gnu.org
                   ` (21 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 12:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #21 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> This is only the second time 
> I've ever received any indication anybody is even using the 
> <memory_resource> header, so I've not wasted my time tuning it.

I used it to create an order book :). It helped me to improve latency
distribution.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (20 preceding siblings ...)
  2020-09-08 12:53 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 16:13 ` redi at gcc dot gnu.org
  2020-09-08 16:28 ` amonakov at gcc dot gnu.org
                   ` (20 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 16:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #22 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Created attachment 49199
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49199&action=edit
Patch to reduce monotonic_buffer_resource chunk sizes

(In reply to Alexander Monakov from comment #20)
> Round up to 64 bytes (typical cache line size).

This patch does that. The code is quite a bit simpler, but doesn't seem to
affect the benchmarks much.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (21 preceding siblings ...)
  2020-09-08 16:13 ` redi at gcc dot gnu.org
@ 2020-09-08 16:28 ` amonakov at gcc dot gnu.org
  2020-09-08 16:44 ` redi at gcc dot gnu.org
                   ` (19 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-08 16:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #23 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Are you benchmarking with bt_pmr_0thrd (attached in comment #3) with depth=18?
On earlier tests there are other effects in play too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (22 preceding siblings ...)
  2020-09-08 16:28 ` amonakov at gcc dot gnu.org
@ 2020-09-08 16:44 ` redi at gcc dot gnu.org
  2020-09-08 17:30 ` redi at gcc dot gnu.org
                   ` (18 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 16:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #24 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Yes, I'm using that one, but I was testing depth=20.

The new code is a lot better for depth=18, but worse (or not much different)
for greater depths.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (23 preceding siblings ...)
  2020-09-08 16:44 ` redi at gcc dot gnu.org
@ 2020-09-08 17:30 ` redi at gcc dot gnu.org
  2020-09-08 20:20 ` dmitriy.ovdienko at gmail dot com
                   ` (17 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-08 17:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #25 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Depth = 21, Pool size = 134,217,728 Bytes

|                   |            PMR |  PMR (round64) |        Malloc |
|-------------------|----------------|----------------|---------------|
| cache-references  |     84,661,444 |     83,302,747 |    81,967,273 |
| cache-misses      |     53,391,899 |     54,532,373 |    53,888,509 |
| cycles            |  3,264,817,828 |  3,140,191,961 | 2,089,490,361 |
| instructions      |  8,293,114,268 |  8,153,394,258 | 4,201,790,186 |
| branches          |  1,202,066,762 |  1,203,833,244 |   789,430,524 |
| branch-misses     |      1,423,441 |      1,444,175 |       904,456 |
| faults            |        655,464 |        655,463 |       655,443 |
| migrations        |              0 |              0 |             0 |
| time elapsed, sec |           1.49 |           1.51 |          1.20 |
| time (user, sec)  |           0.94 |           0.94 |          0.64 |
| time (sys, sec)   |           0.54 |           0.56 |          0.55 |

Depth = 20, Pool size = 67,108,864 Bytes

|                   |            PMR |  PMR (round64) |        Malloc |
|-------------------|----------------|----------------|---------------|
| cache-references  |     42,084,039 |     43,344,605 |    42,692,888 |
| cache-misses      |     26,543,933 |     26,816,142 |    27,127,806 |
| cycles            |  1,575,404,466 |  1,566,930,521 | 1,093,643,897 |
| instructions      |  4,078,822,557 |  4,071,588,794 | 2,396,116,643 |
| branches          |    600,978,058 |    601,691,880 |   411,951,856 |
| branch-misses     |        733,227 |        720,565 |       458,525 |
| faults            |        327,783 |        327,785 |       327,761 |
| migrations        |              0 |              0 |             0 |
| time elapsed, sec |           0.77 |           0.76 |          0.64 |
| time (user, sec)  |           0.47 |           0.47 |          0.32 |
| time (sys, sec)   |           0.29 |           0.29 |          0.32 |

Depth = 19, Pool size = 33,554,432 Bytes

|                   |            PMR |  PMR (round64) |        Malloc |
|-------------------|----------------|----------------|---------------|
| cache-references  |     21,497,688 |     21,080,716 |    23,434,487 |
| cache-misses      |     11,879,471 |     11,638,742 |    12,858,693 |
| cycles            |    774,408,878 |    779,596,013 |   638,645,510 |
| instructions      |  2,029,002,195 |  2,032,153,944 | 1,439,412,296 |
| branches          |    299,585,737 |    300,448,330 |   237,543,121 |
| branch-misses     |        369,666 |        368,317 |       245,442 |
| faults            |        163,947 |        163,944 |       163,923 |
| migrations        |              0 |              0 |             0 |
| time elapsed, sec |           0.38 |           0.38 |          0.35 |
| time (user, sec)  |           0.24 |           0.23 |          0.20 |
| time (sys, sec)   |           0.14 |           0.15 |          0.15 |

Depth = 18, Pool size = 16,777,216 Bytes

|                   |            PMR |  PMR (round64) |        Malloc |
|-------------------|----------------|----------------|---------------|
| cache-references  |     11,092,750 |     21,138,662 |    21,186,799 |
| cache-misses      |      4,421,478 |      9,129,431 |     9,187,107 |
| cycles            |    386,676,154 |    390,228,584 |   266,835,089 |
| instructions      |  1,001,405,722 |  1,002,323,697 |   506,837,026 |
| branches          |    150,187,230 |    148,167,343 |    96,520,750 |
| branch-misses     |        182,986 |        121,150 |       103,099 |
| faults            |         82,027 |          8,277 |         8,275 |
| migrations        |              0 |              0 |             0 |
| time elapsed, sec |           0.19 |           0.11 |          0.08 |
| time (user, sec)  |           0.12 |           0.10 |          0.07 |
| time (sys, sec)   |           0.06 |           0.01 |          0.01 |

Depth = 17, Pool size = 8,388,608 Bytes

|                   |            PMR |  PMR (round64) |        Malloc |
|-------------------|----------------|----------------|---------------|
| cache-references  |     10,348,099 |     10,397,668 |    10,640,737 |
| cache-misses      |      1,622,569 |      1,826,868 |     1,610,649 |
| cycles            |    188,747,289 |    187,914,047 |   135,829,450 |
| instructions      |    497,716,330 |    495,626,938 |   285,991,687 |
| branches          |     73,541,553 |     73,692,400 |    51,087,747 |
| branch-misses     |         55,310 |         55,865 |        48,357 |
| faults            |          4,181 |          4,184 |         4,180 |
| migrations        |              0 |              0 |             0 |
| time elapsed, sec |           0.05 |           0.05 |          0.04 |
| time (user, sec)  |           0.05 |           0.05 |          0.04 |
| time (sys, sec)   |           0.00 |           0.01 |          0.00 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (24 preceding siblings ...)
  2020-09-08 17:30 ` redi at gcc dot gnu.org
@ 2020-09-08 20:20 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 20:30 ` dmitriy.ovdienko at gmail dot com
                   ` (16 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 20:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #26 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49201
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49201&action=edit
Modified solution (thread per iteration)

Attached is a code similar to what Rust sample is doing (parallel
iterations-loop rather than depth-loop).

What I'd like to improve is to reuse allocated memory rather than allocate
every iteration. 

According to requirements to the task, I cannot implement my own memory arena.
So I have to find way how to use STL to achieve same effect.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (25 preceding siblings ...)
  2020-09-08 20:20 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 20:30 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 20:36 ` dmitriy.ovdienko at gmail dot com
                   ` (15 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 20:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #27 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Following are CPU counters I've got for improved C++ test vs Rust and original
C++ test by Danial Klimkin

| CPU counter      |           Rust |     C++ before |        C++ now |
|------------------|---------------:|---------------:|---------------:|
| cache-references |     29,949,104 |     46,639,647 |     60,495,704 |
| cache-misses     |     12,378,855 |     24,667,211 |     25,741,109 |
| cycles           | 24,700,616,015 | 20,066,358,482 | 23,714,062,385 |
| instructions     | 31,819,068,745 | 30,416,793,570 | 30,455,923,689 |
| branches         |  4,835,891,456 |  4,829,846,926 |  4,832,915,354 |
| branch-misses    |     10,768,008 |     11,313,719 |     11,643,211 |
| faults           |         82,346 |        234,891 |        239,866 |
| migrations       |              2 |              4 |             21 |
| Time-elapsed     |         2.0779 |         2.0884 |         2.0061 |
| Seconds-user     |         7.1599 |         5.5976 |         6.5808 |
| Seconds-sys      |         0.1803 |         0.3678 |         0.4595 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (26 preceding siblings ...)
  2020-09-08 20:30 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 20:36 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 20:41 ` dmitriy.ovdienko at gmail dot com
                   ` (14 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 20:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #28 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Added CPU counters for malloc-based allocator as a base point

| CPU counter      |           Rust |     C++ before |        C++ now |     C++
malloc |
|------------------|---------------:|---------------:|---------------:|---------------:|
| cache-references |     29,949,104 |     46,639,647 |     60,495,704 |    
56,695,473 |
| cache-misses     |     12,378,855 |     24,667,211 |     25,741,109 |    
17,430,917 |
| cycles           | 24,700,616,015 | 20,066,358,482 | 23,714,062,385 |
18,304,126,667 |
| instructions     | 31,819,068,745 | 30,416,793,570 | 30,455,923,689 |
20,827,116,950 |
| branches         |  4,835,891,456 |  4,829,846,926 |  4,832,915,354 | 
4,083,330,168 |
| branch-misses    |     10,768,008 |     11,313,719 |     11,643,211 |     
8,868,653 |
| faults           |         82,346 |        234,891 |        239,866 |        
92,955 |
| migrations       |              2 |              4 |             21 |        
    13 |
| Time-elapsed     |         2.0779 |         2.0884 |         2.0061 |        
1.6051 |
| Seconds-user     |         7.1599 |         5.5976 |         6.5808 |        
5.2582 |
| Seconds-sys      |         0.1803 |         0.3678 |         0.4595 |        
0.1872 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (27 preceding siblings ...)
  2020-09-08 20:36 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 20:41 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 21:01 ` dmitriy.ovdienko at gmail dot com
                   ` (13 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 20:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #29 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Table above isn't readable. Value for `cache-references` .. faults
 are divided  by 1000 

Sorry for flood. 

| CPU counter      |       Rust | C++ before |    C++ now | C++ malloc |
|------------------|-----------:|-----------:|-----------:|-----------:|
| cache-references |     29,949 |     46,639 |     60,495 |     56,695 |
| cache-misses     |     12,378 |     24,667 |     25,741 |     17,430 |
| cycles           | 24,700,616 | 20,066,358 | 23,714,062 | 18,304,126 |
| instructions     | 31,819,068 | 30,416,793 | 30,455,923 | 20,827,116 |
| branches         |  4,835,891 |  4,829,846 |  4,832,915 |  4,083,330 |
| branch-misses    |     10,768 |     11,313 |     11,643 |      8,868 |
| faults           |         82 |        234 |        239 |         92 |
| migrations       |          2 |          4 |         21 |         13 |
| Time-elapsed     |     2.0779 |     2.0884 |     2.0061 |     1.6051 |
| Seconds-user     |     7.1599 |     5.5976 |     6.5808 |     5.2582 |
| Seconds-sys      |     0.1803 |     0.3678 |     0.4595 |     0.1872 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (28 preceding siblings ...)
  2020-09-08 20:41 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 21:01 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 21:33 ` dmitriy.ovdienko at gmail dot com
                   ` (12 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 21:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #30 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> Dividing estimated size by 2 to counter the over-allocation effect:

Good idea... but it smells bad :)
What if someone change allocation algorithm...?

> Since the poolSize function actually returns sizeof(Node) more than 
> it needs, and sizeof(Node) > 11, the overallocation should be 
> avoidable by simply fixing poolSize to return the right value

Also good idea. I will apply that

> m_b_r really needs a "rewind()" member to mark all allocated 
> memory as unused again, without actually de-allocating it and 
> returning it to the upstream resource.

I'm not sure we can modify interface of the classes defined in std::pmr... Or
we can?

> If you don't need the additional features of std::m_b_r, 
> then of course you can beat its performance.

Additional logic comes in play when memory is not enough. If it is true, then
std::m_b_r should be also fast, no?

> Alternatively, you can use a pooling allocator

Unfortunately exactly for that test I cannot use custom allocators... 

> To request a power of two size from the upstream resource, 
> so that we don't e.g. ask for 4090 bytes and waste the end of a page.

In case if allocations of two blocks by 4k is faster than allocation of 1 block
of 8k, then I believe allocator has to allocate memory by 4k blocks only. There
is no sense to allocate huge blocks of memory. Allocators are used for small
objects. We do not need continuous memory regions.

Allocation of 2x4k vs 1x8k has to be tested...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (29 preceding siblings ...)
  2020-09-08 21:01 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 21:33 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 21:36 ` dmitriy.ovdienko at gmail dot com
                   ` (11 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 21:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #31 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Created attachment 49202
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49202&action=edit
Modified solution with 2-level memory pools

I believe I'm done with this task. Attached is a solution based on our
conversation:

1. I allocate memory by 4k blocks. As far as PMR adds 11, actuall allocated
block size will be 4096

2. I use `std::pmr::unsynchronized_pool_resource` as it does not release memory
in `deallocate()` memober function

3. I parallel `iterations` loop rather than `depth` loop

4. I do not use std::condition_variable + std::mutex as in fact it is veeeeery
slow. std::atomic_int is much faster for this purpose.

Following are results:


| CPU counter          |       Rust |    C++ now | C++ malloc |   C++ pool |
|----------------------|-----------:|-----------:|-----------:|-----------:|
| cache-references, 1k |     29,949 |     60,495 |     56,695 |     54,608 |
| cache-misses, 1k     |     12,378 |     25,741 |     17,430 |     17,963 |
| cycles, 1k           | 24,700,616 | 23,714,062 | 18,304,126 | 22,599,042 |
| instructions, 1k     | 31,819,068 | 30,455,923 | 20,827,116 | 29,420,285 |
| branches, 1k         |  4,835,891 |  4,832,915 |  4,083,330 |  4,607,526 |
| branch-misses, 1k    |     10,768 |     11,643 |      8,868 |      9,379 |
| faults, 1k           |         82 |        239 |         92 |         82 |
| migrations           |          2 |         21 |         13 |         12 |
| Time-elapsed         |     2.0779 |     2.0061 |     1.6051 |     1.9445 |
| Seconds-user         |     7.1599 |     6.5808 |     5.2582 |     6.5315 |
| Seconds-sys          |     0.1803 |     0.4595 |     0.1872 |     0.1838 |

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (30 preceding siblings ...)
  2020-09-08 21:33 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 21:36 ` dmitriy.ovdienko at gmail dot com
  2020-09-08 22:24 ` dmitriy.ovdienko at gmail dot com
                   ` (10 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 21:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #32 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
What bothers me is does why Rust generate less cpucache-references.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (31 preceding siblings ...)
  2020-09-08 21:36 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-08 22:24 ` dmitriy.ovdienko at gmail dot com
  2020-09-09  9:10 ` redi at gcc dot gnu.org
                   ` (9 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-08 22:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #33 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
@Jonathan Wakely

I have one idea to improve code of p_m_r

I expect that allocation are very rare operation. If true, it makes sense to
add __unlikelly to `if (!__p)` inside the `do_allocate` member function

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (32 preceding siblings ...)
  2020-09-08 22:24 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-09  9:10 ` redi at gcc dot gnu.org
  2020-09-09  9:49 ` redi at gcc dot gnu.org
                   ` (8 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-09  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #34 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Dmitriy Ovdienko from comment #26)
> According to requirements to the task, I cannot implement my own memory
> arena. So I have to find way how to use STL to achieve same effect.

I assume you can't just preallocate a buffer for the pool?

  std::unique_ptr<std::byte[]> bytes(new std::byte[poolSize(stretch_depth)]);
  // Alloc then dealloc stretchdepth tree.
  int count = 0;
  for(int i = 0; i < 20; ++i)
  {
    // num_nodes = 0;
    MemoryPool store (bytes.get(), poolSize(stretch_depth));

Here store.release() will reuse the original buffer supplied to the
constructor, so it never actually allocates.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (33 preceding siblings ...)
  2020-09-09  9:10 ` redi at gcc dot gnu.org
@ 2020-09-09  9:49 ` redi at gcc dot gnu.org
  2020-09-09 11:59 ` dmitriy.ovdienko at gmail dot com
                   ` (7 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-09  9:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #35 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Dmitriy Ovdienko from comment #33)
> @Jonathan Wakely
> 
> I have one idea to improve code of p_m_r
> 
> I expect that allocation are very rare operation. If true, it makes sense to
> add __unlikelly to `if (!__p)` inside the `do_allocate` member function

It doesn't seem to make much difference.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (34 preceding siblings ...)
  2020-09-09  9:49 ` redi at gcc dot gnu.org
@ 2020-09-09 11:59 ` dmitriy.ovdienko at gmail dot com
  2020-09-09 12:06 ` dmitriy.ovdienko at gmail dot com
                   ` (6 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-09 11:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #36 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> It doesn't seem to make much difference.

It is visible in the assembly. In case if you use __unlikelly, compiler moves
this code out of hot path minimizing the amount of instructions to decode.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (35 preceding siblings ...)
  2020-09-09 11:59 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-09 12:06 ` dmitriy.ovdienko at gmail dot com
  2020-09-10  8:13 ` dmitriy.ovdienko at gmail dot com
                   ` (5 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-09 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #37 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> I assume you can't just preallocate a buffer for the pool?

I dunno... here is a requirement:

 * When possible, use default GC; otherwise use per node allocation or use a
library memory pool.

 * As a practical matter, the myriad ways to tune GC will not be accepted.

 * As a practical matter, the myriad ways to custom allocate memory will not be
accepted.

 * Please don't implement your own custom "arena" or "memory pool" or "free
list" - they will not be accepted.


I've submitted the code yesterday. Let's see if they accept it. If they do, I
will submit the next version with what you @Jonathan suggest

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (36 preceding siblings ...)
  2020-09-09 12:06 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-10  8:13 ` dmitriy.ovdienko at gmail dot com
  2020-09-10 14:42 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-10  8:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #38 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
Wohoooooo! Accepted:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (37 preceding siblings ...)
  2020-09-10  8:13 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-10 14:42 ` cvs-commit at gcc dot gnu.org
  2020-09-10 14:46 ` redi at gcc dot gnu.org
                   ` (3 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-09-10 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #39 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:1e718ec51a223d65a09757b999202390871b778d

commit r11-3101-g1e718ec51a223d65a09757b999202390871b778d
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Thu Sep 10 15:39:15 2020 +0100

    libstdc++: Reduce monotonic_buffer_resource overallocation [PR 96942]

    The primary reason for this change is to reduce the size of buffers
    allocated by std::pmr::monotonic_buffer_resource. Previously, a new
    buffer would always add the size of the linked list node (11 bytes) and
    then round up to the next power of two. This results in a huge increase
    if the expected size of the next buffer is already a power of two. For
    example, if the resource is constructed with a desired initial size of
    4096 the first buffer it allocates will be std::bit_ceil(4096+11) which
    is 8192.  If the user has carefully selected the initial size to match
    their expected memory requirements then allocating double that amount
    wastes a lot of memory.

    After this patch the allocated size will be rounded up to a 64-byte
    boundary, instead of to a power of two. This means for an initial size
    of 4096 only 4160 bytes get allocated.

    Previously only the base-2 logarithm of the size was stored, which could
    be stored in a single 8-bit integer. Now that the size isn't always a
    power of two we need to use more bits to store it. As the size is always
    a multiple of 64 the low six bits are not needed, and so we can use the
    same approach that the pool resources already use of storing the base-2
    logarithm of the alignment in the low bits that are not used for the
    size. To avoid code duplication, a new aligned_size<N> helper class is
    introduced by this patch, which is then used by both the pool resources'
    big_block type and the monotonic_buffer_resource::_Chunk type.

    Originally the big_block type used two bit-fields to store the size and
    alignment in the space of a single size_t member. The aligned_size type
    uses a single size_t member and uses masks and bitwise operations to
    manipulate the size and alignment values. This results in better code
    than the old version, because the bit-fields weren't optimally ordered
    for little endian architectures, so the alignment was actually stored in
    the high bits, not the unused low bits, requiring additional shifts to
    calculate the values. Using bitwise operations directly avoids needing
    to reorder the bit-fields depending on the endianness.

    While adapting the _Chunk and big_block types to use aligned_size<N> I
    also added checks for size overflows (technically, unsigned wraparound).
    The memory resources now ensure that when they require an allocation
    that is too large to represent in size_t they will request SIZE_MAX
    bytes from the upstream resource, rather than requesting a small value
    that results from wrapround. The testsuite is enhanced to verify this.

    libstdc++-v3/ChangeLog:

            PR libstdc++/96942
            * include/std/memory_resource
(monotonic_buffer_resource::do_allocate):
            Use __builtin_expect when checking if a new buffer needs to be
            allocated from the upstream resource, and for checks for edge
            cases like zero sized buffers and allocations.
            * src/c++17/memory_resource.cc (aligned_size): New class template.
            (aligned_ceil): New helper function to round up to a given
            alignment.
            (monotonic_buffer_resource::chunk): Replace _M_size and _M_align
            with an aligned_size member. Remove _M_canary member. Change
_M_next
            to pointer instead of unaligned buffer.
            (monotonic_buffer_resource::chunk::allocate): Round up to multiple
            of 64 instead of to power of two. Check for size overflow. Remove
            redundant check for minimum required alignment.
            (monotonic_buffer_resource::chunk::release): Adjust for changes
            to data members.
            (monotonic_buffer_resource::_M_new_buffer): Use aligned_ceil.
            (big_block): Replace _M_size and _M_align with aligned_size
            member.
            (big_block::big_block): Check for size overflow.
            (big_block::size, big_block::align): Adjust to use aligned_size.
            (big_block::alloc_size): Use aligned_ceil.
            (munge_options): Use aligned_ceil.
            (__pool_resource::allocate): Use big_block::align for alignment.
            * testsuite/20_util/monotonic_buffer_resource/allocate.cc: Check
            upstream resource gets expected values for impossible sizes.
            * testsuite/20_util/unsynchronized_pool_resource/allocate.cc:
            Likewise. Adjust checks for expected alignment in existing test.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (38 preceding siblings ...)
  2020-09-10 14:42 ` cvs-commit at gcc dot gnu.org
@ 2020-09-10 14:46 ` redi at gcc dot gnu.org
  2020-09-10 17:53 ` dmitriy.ovdienko at gmail dot com
                   ` (2 subsequent siblings)
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-10 14:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #40 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Dmitriy Ovdienko from comment #38)
> Wohoooooo! Accepted:
> 
> https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/
> binarytrees.html

Nice.

I've added some __builtin_expect hints and also changed the buffer resource to
allocate multiples of 64 instead of powers of two. I don't think there's much
more that be done here right now.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (39 preceding siblings ...)
  2020-09-10 14:46 ` redi at gcc dot gnu.org
@ 2020-09-10 17:53 ` dmitriy.ovdienko at gmail dot com
  2020-09-10 17:57 ` dmitriy.ovdienko at gmail dot com
  2020-09-10 18:16 ` redi at gcc dot gnu.org
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-10 17:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #41 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
@Jonathan

Did you have chance to review why default-constructed M-B-R works faster than
another one constructed with the initial buffer size?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (40 preceding siblings ...)
  2020-09-10 17:53 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-10 17:57 ` dmitriy.ovdienko at gmail dot com
  2020-09-10 18:16 ` redi at gcc dot gnu.org
  42 siblings, 0 replies; 44+ messages in thread
From: dmitriy.ovdienko at gmail dot com @ 2020-09-10 17:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #42 from Dmitriy Ovdienko <dmitriy.ovdienko at gmail dot com> ---
> The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

Very good commit and comment. I hope this change, made for synthetic benchmark,
wont affect real production applications.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
  2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
                   ` (41 preceding siblings ...)
  2020-09-10 17:57 ` dmitriy.ovdienko at gmail dot com
@ 2020-09-10 18:16 ` redi at gcc dot gnu.org
  42 siblings, 0 replies; 44+ messages in thread
From: redi at gcc dot gnu.org @ 2020-09-10 18:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #43 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Dmitriy Ovdienko from comment #41)
> Did you have chance to review why default-constructed M-B-R works faster
> than another one constructed with the initial buffer size?

Is that explained by the kernel pre-faulting the extra pages that are
over-allocated, as amonakov said in comment 9? I haven't actually looked into
it.


(In reply to Dmitriy Ovdienko from comment #42)
> Very good commit and comment. I hope this change, made for synthetic
> benchmark, wont affect real production applications.

Well if you need a larger initial size you can get it by asking for it, instead
of getting it by accident :-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2020-09-10 18:16 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-05  9:42 [Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses dmitriy.ovdienko at gmail dot com
2020-09-05  9:42 ` [Bug libstdc++/96942] " dmitriy.ovdienko at gmail dot com
2020-09-05  9:43 ` dmitriy.ovdienko at gmail dot com
2020-09-07  7:21 ` dmitriy.ovdienko at gmail dot com
2020-09-07  7:24 ` dmitriy.ovdienko at gmail dot com
2020-09-07 10:49 ` amonakov at gcc dot gnu.org
2020-09-07 16:13 ` dmitriy.ovdienko at gmail dot com
2020-09-07 20:53 ` dmitriy.ovdienko at gmail dot com
2020-09-07 21:21 ` dmitriy.ovdienko at gmail dot com
2020-09-08  7:11 ` amonakov at gcc dot gnu.org
2020-09-08  9:39 ` dmitriy.ovdienko at gmail dot com
2020-09-08 11:19 ` redi at gcc dot gnu.org
2020-09-08 11:23 ` redi at gcc dot gnu.org
2020-09-08 11:32 ` redi at gcc dot gnu.org
2020-09-08 11:35 ` amonakov at gcc dot gnu.org
2020-09-08 11:48 ` redi at gcc dot gnu.org
2020-09-08 11:49 ` redi at gcc dot gnu.org
2020-09-08 11:50 ` redi at gcc dot gnu.org
2020-09-08 11:58 ` amonakov at gcc dot gnu.org
2020-09-08 12:03 ` redi at gcc dot gnu.org
2020-09-08 12:17 ` amonakov at gcc dot gnu.org
2020-09-08 12:53 ` dmitriy.ovdienko at gmail dot com
2020-09-08 16:13 ` redi at gcc dot gnu.org
2020-09-08 16:28 ` amonakov at gcc dot gnu.org
2020-09-08 16:44 ` redi at gcc dot gnu.org
2020-09-08 17:30 ` redi at gcc dot gnu.org
2020-09-08 20:20 ` dmitriy.ovdienko at gmail dot com
2020-09-08 20:30 ` dmitriy.ovdienko at gmail dot com
2020-09-08 20:36 ` dmitriy.ovdienko at gmail dot com
2020-09-08 20:41 ` dmitriy.ovdienko at gmail dot com
2020-09-08 21:01 ` dmitriy.ovdienko at gmail dot com
2020-09-08 21:33 ` dmitriy.ovdienko at gmail dot com
2020-09-08 21:36 ` dmitriy.ovdienko at gmail dot com
2020-09-08 22:24 ` dmitriy.ovdienko at gmail dot com
2020-09-09  9:10 ` redi at gcc dot gnu.org
2020-09-09  9:49 ` redi at gcc dot gnu.org
2020-09-09 11:59 ` dmitriy.ovdienko at gmail dot com
2020-09-09 12:06 ` dmitriy.ovdienko at gmail dot com
2020-09-10  8:13 ` dmitriy.ovdienko at gmail dot com
2020-09-10 14:42 ` cvs-commit at gcc dot gnu.org
2020-09-10 14:46 ` redi at gcc dot gnu.org
2020-09-10 17:53 ` dmitriy.ovdienko at gmail dot com
2020-09-10 17:57 ` dmitriy.ovdienko at gmail dot com
2020-09-10 18:16 ` redi at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).