[PATCH] Optimize to

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] Optimize to_chars
@ 2019-08-30 14:39 Antony Polukhin
  2019-08-30 16:35 ` Jonathan Wakely
  2019-08-30 19:41 ` Martin Sebor
  0 siblings, 2 replies; 10+ messages in thread
From: Antony Polukhin @ 2019-08-30 14:39 UTC (permalink / raw)
  To: libstdc++, gcc-patches List

[-- Attachment #1: Type: text/plain, Size: 2069 bytes --]

Bunch of micro optimizations for std::to_chars:
* For base == 8 replacing the lookup in __digits table with arithmetic
computations leads to a same CPU cycles for a loop (exchanges two
movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
saves 129 bytes of data and totally avoids a chance of cache misses on
__digits.
* For base == 16 replacing the lookup in __digits table with
arithmetic computations leads to a few additional instructions, but
totally avoids a chance of cache misses on __digits (- ~9 cache misses
for worst case) and saves 513 bytes of const data.
* Replacing __first[pos] and __first[pos - 1] with __first[1] and
__first[0] on final iterations saves ~2% of code size.
* Removing trailing '\0' from arrays of digits allows the linker to
merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
"0123456789abcdef" could share the same address). This improves data
locality and reduces binary sizes.
* Using __detail::__to_chars_len_2 instead of a generic
__detail::__to_chars_len makes the operation O(1) instead of O(N). It
also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
.

In sum: this significantly reduces the size of a binary (for about
4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
with latency (CPU cache misses) without changing the iterations count
and without adding costly instructions into the loops.

Changelog:
    * include/std/charconv (__detail::__to_chars_8,
    __detail::__to_chars_16): Replace array of precomputed digits
    with arithmetic operations to avoid CPU cache misses. Remove
    zero termination from array of digits to allow symbol merge with
    generic implementation of __detail::__to_chars. Replace final
    offsets with constants. Use __detail::__to_chars_len_2 instead
    of a generic __detail::__to_chars_len.
    * include/std/charconv (__detail::__to_chars): Remove
    zero termination from array of digits.
    * include/std/charconv (__detail::__to_chars_2): Leading digit
    is always '1'.

-- 
Best regards,
Antony Polukhin

[-- Attachment #2: charconv_patch.txt --]
[-- Type: text/plain, Size: 5042 bytes --]

diff --git a/libstdc++-v3/ChangeLog b/libstdc++-v3/ChangeLog
index eaa6f74..35706d0 100644
--- a/libstdc++-v3/ChangeLog
+++ b/libstdc++-v3/ChangeLog
@@ -1,3 +1,17 @@
+2019-08-30  Antony Polukhin  <antoshkka@gmail.com>
+
+	* include/std/charconv (__detail::__to_chars_8,
+	__detail::__to_chars_16): Replace array of precomputed digits
+	with arithmetic operations to avoid CPU cache misses. Remove
+	zero termination from array of digits to allow symbol merge with
+	generic implementation of __detail::__to_chars. Replace final
+	offsets with constants. Use __detail::__to_chars_len_2 instead
+	of a generic __detail::__to_chars_len.
+	* include/std/charconv (__detail::__to_chars): Remove
+	zero termination from array of digits.
+	* include/std/charconv (__detail::__to_chars_2): Leading digit
+	is always '1'.
+
 2019-08-29  Jonathan Wakely  <jwakely@redhat.com>
 
 	PR libstdc++/91067
diff --git a/libstdc++-v3/include/std/charconv b/libstdc++-v3/include/std/charconv
index 53aa63e..4e94c39 100644
--- a/libstdc++-v3/include/std/charconv
+++ b/libstdc++-v3/include/std/charconv
@@ -131,7 +131,7 @@ namespace __detail
 	    : 1u;
 	}
       else
-	return __to_chars_len(__value, 8);
+	return (__to_chars_len_2(__value) + 2) / 3;
     }
 
   // Generic implementation for arbitrary bases.
@@ -155,8 +155,12 @@ namespace __detail
 
       unsigned __pos = __len - 1;
 
-      static constexpr char __digits[]
-	= "0123456789abcdefghijklmnopqrstuvwxyz";
+      static constexpr char __digits[] = {
+	'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
+	'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
+	'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
+	'u', 'v', 'w', 'x', 'y', 'z'
+      };
 
       while (__val >= __base)
 	{
@@ -181,7 +185,7 @@ namespace __detail
 
       to_chars_result __res;
 
-      const unsigned __len = __to_chars_len(__val, 0x10);
+      const unsigned __len = (__to_chars_len_2(__val) + 3) / 4;
 
       if (__builtin_expect((__last - __first) < __len, 0))
 	{
@@ -190,32 +194,30 @@ namespace __detail
 	  return __res;
 	}
 
-      static constexpr char __digits[513] =
-	"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f"
-	"202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f"
-	"404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f"
-	"606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f"
-	"808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f"
-	"a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
-	"c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
-	"e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
+      static constexpr char __digits[] = {
+	'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
+	'a', 'b', 'c', 'd', 'e', 'f'
+      };
       unsigned __pos = __len - 1;
       while (__val >= 0x100)
 	{
-	  auto const __num = (__val % 0x100) * 2;
-	  __val /= 0x100;
-	  __first[__pos] = __digits[__num + 1];
+	  auto __num = __val & 0xF;
+	  __val >>= 4;
+	  __first[__pos] = __digits[__num];
+	  __num = __val & 0xF;
+	  __val >>= 4;
 	  __first[__pos - 1] = __digits[__num];
 	  __pos -= 2;
 	}
       if (__val >= 0x10)
 	{
-	  auto const __num = __val * 2;
-	  __first[__pos] = __digits[__num + 1];
-	  __first[__pos - 1] = __digits[__num];
+	  const auto __num = __val & 0xF;
+	  __val >>= 4;
+	  __first[1] = __digits[__num];
+	  __first[0] = __digits[__val];
 	}
       else
-	__first[__pos] = "0123456789abcdef"[__val];
+	__first[0] = __digits[__val];
       __res.ptr = __first + __len;
       __res.ec = {};
       return __res;
@@ -263,28 +265,26 @@ namespace __detail
 	  return __res;
 	}
 
-      static constexpr char __digits[129] =
-	"00010203040506071011121314151617"
-	"20212223242526273031323334353637"
-	"40414243444546475051525354555657"
-	"60616263646566677071727374757677";
       unsigned __pos = __len - 1;
       while (__val >= 0100)
 	{
-	  auto const __num = (__val % 0100) * 2;
-	  __val /= 0100;
-	  __first[__pos] = __digits[__num + 1];
-	  __first[__pos - 1] = __digits[__num];
+	  auto __num = __val & 7;
+	  __val >>= 3;
+	  __first[__pos] = '0' + __num;
+	  __num = __val & 7;
+	  __val >>= 3;
+	  __first[__pos - 1] = '0' + __num;
 	  __pos -= 2;
 	}
       if (__val >= 010)
 	{
-	  auto const __num = __val * 2;
-	  __first[__pos] = __digits[__num + 1];
-	  __first[__pos - 1] = __digits[__num];
+	  auto const __num = __val & 7;
+	  __val >>= 3;
+	  __first[1] = '0' + __num;
+	  __first[0] = '0' + __val;
 	}
       else
-	__first[__pos] = '0' + __val;
+	__first[0] = '0' + __val;
       __res.ptr = __first + __len;
       __res.ec = {};
       return __res;
@@ -315,7 +315,10 @@ namespace __detail
 	  __first[__pos--] = '0' + (__val & 1);
 	  __val >>= 1;
 	}
-      *__first = '0' + (__val & 1);
+      // First digit is always '1' because __to_chars_len_2 skips
+      // leading zero bits and std::to_chars handles zero values
+      // directly.
+      __first[0] = '1';
 
       __res.ptr = __first + __len;
       __res.ec = {};

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 14:39 [PATCH] Optimize to_chars Antony Polukhin
@ 2019-08-30 16:35 ` Jonathan Wakely
  2019-08-30 16:44   ` Jonathan Wakely
  2019-08-30 21:50   ` Antony Polukhin
  2019-08-30 19:41 ` Martin Sebor
  1 sibling, 2 replies; 10+ messages in thread
From: Jonathan Wakely @ 2019-08-30 16:35 UTC (permalink / raw)
  To: Antony Polukhin; +Cc: libstdc++, gcc-patches List

On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>Bunch of micro optimizations for std::to_chars:
>* For base == 8 replacing the lookup in __digits table with arithmetic
>computations leads to a same CPU cycles for a loop (exchanges two
>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>saves 129 bytes of data and totally avoids a chance of cache misses on
>__digits.
>* For base == 16 replacing the lookup in __digits table with
>arithmetic computations leads to a few additional instructions, but
>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>for worst case) and saves 513 bytes of const data.
>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>__first[0] on final iterations saves ~2% of code size.
>* Removing trailing '\0' from arrays of digits allows the linker to
>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>"0123456789abcdef" could share the same address). This improves data
>locality and reduces binary sizes.
>* Using __detail::__to_chars_len_2 instead of a generic
>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>.
>
>In sum: this significantly reduces the size of a binary (for about
>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>with latency (CPU cache misses) without changing the iterations count
>and without adding costly instructions into the loops.

This is great, thanks.

Have you tried comparing the improved code to libc++'s implementation?
I believe they use precomputed arrays of digits, but they use larger
arrays that allow 4 bytes to be written at once, which is considerably
faster (and those precomputed arrays libe in libc++.so not in the
header). Would we be better off keeping the precomputed arrays and
expanding them to do 4-byte writes?

Since we don't have a patch to do that, I think I'll commit yours. We
can always go back to precomputed arrays later if somebody does that
work.

My only comments are on the changelog:

>Changelog:
>    * include/std/charconv (__detail::__to_chars_8,
>    __detail::__to_chars_16): Replace array of precomputed digits

When the list of changed functions is split across lines it should be
like this:

    * include/std/charconv (__detail::__to_chars_8)
    (__detail::__to_chars_16): Replace array of precomputed digits

i.e close the parentheses before the line break, and reopen on the
next line.

>    with arithmetic operations to avoid CPU cache misses. Remove
>    zero termination from array of digits to allow symbol merge with
>    generic implementation of __detail::__to_chars. Replace final
>    offsets with constants. Use __detail::__to_chars_len_2 instead
>    of a generic __detail::__to_chars_len.
>    * include/std/charconv (__detail::__to_chars): Remove

Don't repeat the asterisk and filename for later changes in the same
file, i.e.

    (__detail::__to_chars): Remove zero termination from array of digits.
    (__detail::__to_chars_2): Leading digit is always '1'.

There's no changelog entry for the changes to __to_chars_len_8 and
__to_chars_len_16.

Also, please don't include the ChangeLog diff in the patch, because it
just makes it hard to apply the patch (the ChangeLog part will almost
always fail to apply because somebody else will have modified the
ChangeLog file since you created the patch, and so that hunk won't
apply. The ChangeLog text should be sent as a separate (plain text)
attachment, or just in the email body (as you did above).

I'll test this and commit it, thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 16:35 ` Jonathan Wakely
@ 2019-08-30 16:44   ` Jonathan Wakely
  2019-09-02 11:32     ` Jonathan Wakely
  2019-08-30 21:50   ` Antony Polukhin
  1 sibling, 1 reply; 10+ messages in thread
From: Jonathan Wakely @ 2019-08-30 16:44 UTC (permalink / raw)
  To: Antony Polukhin; +Cc: libstdc++, gcc-patches List

On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>Bunch of micro optimizations for std::to_chars:
>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>computations leads to a same CPU cycles for a loop (exchanges two
>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>__digits.
>>* For base == 16 replacing the lookup in __digits table with
>>arithmetic computations leads to a few additional instructions, but
>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>for worst case) and saves 513 bytes of const data.
>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>__first[0] on final iterations saves ~2% of code size.
>>* Removing trailing '\0' from arrays of digits allows the linker to
>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>"0123456789abcdef" could share the same address). This improves data
>>locality and reduces binary sizes.
>>* Using __detail::__to_chars_len_2 instead of a generic
>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>.
>>
>>In sum: this significantly reduces the size of a binary (for about
>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>with latency (CPU cache misses) without changing the iterations count
>>and without adding costly instructions into the loops.
>
>This is great, thanks.
>
>Have you tried comparing the improved code to libc++'s implementation?
>I believe they use precomputed arrays of digits, but they use larger
>arrays that allow 4 bytes to be written at once, which is considerably
>faster (and those precomputed arrays libe in libc++.so not in the
>header). Would we be better off keeping the precomputed arrays and
>expanding them to do 4-byte writes?
>
>Since we don't have a patch to do that, I think I'll commit yours. We
>can always go back to precomputed arrays later if somebody does that
>work.
>
>My only comments are on the changelog:
>
>>Changelog:
>>   * include/std/charconv (__detail::__to_chars_8,
>>   __detail::__to_chars_16): Replace array of precomputed digits
>
>When the list of changed functions is split across lines it should be
>like this:
>
>   * include/std/charconv (__detail::__to_chars_8)
>   (__detail::__to_chars_16): Replace array of precomputed digits
>
>i.e close the parentheses before the line break, and reopen on the
>next line.
>
>>   with arithmetic operations to avoid CPU cache misses. Remove
>>   zero termination from array of digits to allow symbol merge with
>>   generic implementation of __detail::__to_chars. Replace final
>>   offsets with constants. Use __detail::__to_chars_len_2 instead
>>   of a generic __detail::__to_chars_len.
>>   * include/std/charconv (__detail::__to_chars): Remove
>
>Don't repeat the asterisk and filename for later changes in the same
>file, i.e.
>
>   (__detail::__to_chars): Remove zero termination from array of digits.
>   (__detail::__to_chars_2): Leading digit is always '1'.
>
>There's no changelog entry for the changes to __to_chars_len_8 and
>__to_chars_len_16.

Oh, there's no separate __to_chars_len_16 function anyway, and you did
mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!

I think we might as well inline __to_chars_len_8 into __to_chars_8,
there's not much benefit to having a separate function. I'll do that
as a follow up patch.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 14:39 [PATCH] Optimize to_chars Antony Polukhin
  2019-08-30 16:35 ` Jonathan Wakely
@ 2019-08-30 19:41 ` Martin Sebor
  2019-08-30 21:11   ` Jonathan Wakely
  1 sibling, 1 reply; 10+ messages in thread
From: Martin Sebor @ 2019-08-30 19:41 UTC (permalink / raw)
  To: Antony Polukhin, libstdc++, gcc-patches List

On 8/30/19 8:27 AM, Antony Polukhin wrote:
> Bunch of micro optimizations for std::to_chars:
> * For base == 8 replacing the lookup in __digits table with arithmetic
> computations leads to a same CPU cycles for a loop (exchanges two
> movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
> saves 129 bytes of data and totally avoids a chance of cache misses on
> __digits.
> * For base == 16 replacing the lookup in __digits table with
> arithmetic computations leads to a few additional instructions, but
> totally avoids a chance of cache misses on __digits (- ~9 cache misses
> for worst case) and saves 513 bytes of const data.
> * Replacing __first[pos] and __first[pos - 1] with __first[1] and
> __first[0] on final iterations saves ~2% of code size.
> * Removing trailing '\0' from arrays of digits allows the linker to
> merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
> "0123456789abcdef" could share the same address). This improves data
> locality and reduces binary sizes.
> * Using __detail::__to_chars_len_2 instead of a generic
> __detail::__to_chars_len makes the operation O(1) instead of O(N). It
> also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
> .
> 
> In sum: this significantly reduces the size of a binary (for about
> 4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
> with latency (CPU cache misses) without changing the iterations count
> and without adding costly instructions into the loops.

Would it make sense to move some of this code into GCC as
a built-in so that it could also be used by GCC to expand
some strtol and sprintf calls?

Martin

> 
> Changelog:
>      * include/std/charconv (__detail::__to_chars_8,
>      __detail::__to_chars_16): Replace array of precomputed digits
>      with arithmetic operations to avoid CPU cache misses. Remove
>      zero termination from array of digits to allow symbol merge with
>      generic implementation of __detail::__to_chars. Replace final
>      offsets with constants. Use __detail::__to_chars_len_2 instead
>      of a generic __detail::__to_chars_len.
>      * include/std/charconv (__detail::__to_chars): Remove
>      zero termination from array of digits.
>      * include/std/charconv (__detail::__to_chars_2): Leading digit
>      is always '1'.
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 19:41 ` Martin Sebor
@ 2019-08-30 21:11   ` Jonathan Wakely
  0 siblings, 0 replies; 10+ messages in thread
From: Jonathan Wakely @ 2019-08-30 21:11 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Antony Polukhin, libstdc++, gcc-patches List

On 30/08/19 11:03 -0600, Martin Sebor wrote:
>On 8/30/19 8:27 AM, Antony Polukhin wrote:
>>Bunch of micro optimizations for std::to_chars:
>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>computations leads to a same CPU cycles for a loop (exchanges two
>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>__digits.
>>* For base == 16 replacing the lookup in __digits table with
>>arithmetic computations leads to a few additional instructions, but
>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>for worst case) and saves 513 bytes of const data.
>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>__first[0] on final iterations saves ~2% of code size.
>>* Removing trailing '\0' from arrays of digits allows the linker to
>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>"0123456789abcdef" could share the same address). This improves data
>>locality and reduces binary sizes.
>>* Using __detail::__to_chars_len_2 instead of a generic
>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>.
>>
>>In sum: this significantly reduces the size of a binary (for about
>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>with latency (CPU cache misses) without changing the iterations count
>>and without adding costly instructions into the loops.
>
>Would it make sense to move some of this code into GCC as
>a built-in so that it could also be used by GCC to expand
>some strtol and sprintf calls?

Makes sense, although we'd still need it in libstdc++ until Clang and
EDG implement the same built-in.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 16:35 ` Jonathan Wakely
  2019-08-30 16:44   ` Jonathan Wakely
@ 2019-08-30 21:50   ` Antony Polukhin
  2019-08-30 23:33     ` Jonathan Wakely
  1 sibling, 1 reply; 10+ messages in thread
From: Antony Polukhin @ 2019-08-30 21:50 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches List

пт, 30 авг. 2019 г. в 19:01, Jonathan Wakely <jwakely@redhat.com>:
<...>
> Have you tried comparing the improved code to libc++'s implementation?
> I believe they use precomputed arrays of digits, but they use larger
> arrays that allow 4 bytes to be written at once, which is considerably
> faster (and those precomputed arrays libe in libc++.so not in the
> header). Would we be better off keeping the precomputed arrays and
> expanding them to do 4-byte writes?

This would not do good for bases 2, 8 and 16. For power of two bases
there is no costly `mod` or `div` instructions, only bit operations.
By increasing the digits table size the cache misses become more
likely.

For base 10... A few decades ago it was definitely a good idea to have
a big array of digits. Nowadays compilers know how to replace `__val /
10` and `__val % 10` with much cheaper multiplications and bit magic.
This is not as cheap as bit operations for power of two bases, but
some cache misses could nullify the advantage.

I need to experiment with base 10. There are ways to compress the
digits table, but it requires benchmarking. Because of that this patch
does not touch the __detail::__to_chars_10.

>
> Since we don't have a patch to do that, I think I'll commit yours. We
> can always go back to precomputed arrays later if somebody does that
> work.
>
> My only comments are on the changelog:
>
> >Changelog:
> >    * include/std/charconv (__detail::__to_chars_8,
> >    __detail::__to_chars_16): Replace array of precomputed digits
>
> When the list of changed functions is split across lines it should be
> like this:
>
>     * include/std/charconv (__detail::__to_chars_8)
>     (__detail::__to_chars_16): Replace array of precomputed digits
>
> i.e close the parentheses before the line break, and reopen on the
> next line.
>
> >    with arithmetic operations to avoid CPU cache misses. Remove
> >    zero termination from array of digits to allow symbol merge with
> >    generic implementation of __detail::__to_chars. Replace final
> >    offsets with constants. Use __detail::__to_chars_len_2 instead
> >    of a generic __detail::__to_chars_len.
> >    * include/std/charconv (__detail::__to_chars): Remove
>
> Don't repeat the asterisk and filename for later changes in the same
> file, i.e.
>
>     (__detail::__to_chars): Remove zero termination from array of digits.
>     (__detail::__to_chars_2): Leading digit is always '1'.
>
> There's no changelog entry for the changes to __to_chars_len_8 and
> __to_chars_len_16.
>
> Also, please don't include the ChangeLog diff in the patch, because it
> just makes it hard to apply the patch (the ChangeLog part will almost
> always fail to apply because somebody else will have modified the
> ChangeLog file since you created the patch, and so that hunk won't
> apply. The ChangeLog text should be sent as a separate (plain text)
> attachment, or just in the email body (as you did above).

Thanks! I'll take it into account next time.

-- 
Best regards,
Antony Polukhin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 21:50   ` Antony Polukhin
@ 2019-08-30 23:33     ` Jonathan Wakely
  0 siblings, 0 replies; 10+ messages in thread
From: Jonathan Wakely @ 2019-08-30 23:33 UTC (permalink / raw)
  To: Antony Polukhin; +Cc: libstdc++, gcc-patches List

On 30/08/19 22:46 +0300, Antony Polukhin wrote:
>Ð¿Ñ‚, 30 Ð°Ð²Ð³. 2019 Ð³. Ð² 19:01, Jonathan Wakely <jwakely@redhat.com>:
><...>
>> Have you tried comparing the improved code to libc++'s implementation?
>> I believe they use precomputed arrays of digits, but they use larger
>> arrays that allow 4 bytes to be written at once, which is considerably
>> faster (and those precomputed arrays libe in libc++.so not in the
>> header). Would we be better off keeping the precomputed arrays and
>> expanding them to do 4-byte writes?
>
>This would not do good for bases 2, 8 and 16. For power of two bases
>there is no costly `mod` or `div` instructions, only bit operations.
>By increasing the digits table size the cache misses become more
>likely.

OK, thanks. I'll try benchmarking your improved code against the
libc++ version next week.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-08-30 16:44   ` Jonathan Wakely
@ 2019-09-02 11:32     ` Jonathan Wakely
  2019-09-08 13:45       ` Antony Polukhin
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Wakely @ 2019-09-02 11:32 UTC (permalink / raw)
  To: Antony Polukhin; +Cc: libstdc++, gcc-patches List

[-- Attachment #1: Type: text/plain, Size: 3780 bytes --]

On 30/08/19 17:08 +0100, Jonathan Wakely wrote:
>On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
>>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>>Bunch of micro optimizations for std::to_chars:
>>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>>computations leads to a same CPU cycles for a loop (exchanges two
>>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>>__digits.
>>>* For base == 16 replacing the lookup in __digits table with
>>>arithmetic computations leads to a few additional instructions, but
>>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>>for worst case) and saves 513 bytes of const data.
>>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>>__first[0] on final iterations saves ~2% of code size.
>>>* Removing trailing '\0' from arrays of digits allows the linker to
>>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>>"0123456789abcdef" could share the same address). This improves data
>>>locality and reduces binary sizes.
>>>* Using __detail::__to_chars_len_2 instead of a generic
>>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>>.
>>>
>>>In sum: this significantly reduces the size of a binary (for about
>>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>>with latency (CPU cache misses) without changing the iterations count
>>>and without adding costly instructions into the loops.
>>
>>This is great, thanks.
>>
>>Have you tried comparing the improved code to libc++'s implementation?
>>I believe they use precomputed arrays of digits, but they use larger
>>arrays that allow 4 bytes to be written at once, which is considerably
>>faster (and those precomputed arrays libe in libc++.so not in the
>>header). Would we be better off keeping the precomputed arrays and
>>expanding them to do 4-byte writes?
>>
>>Since we don't have a patch to do that, I think I'll commit yours. We
>>can always go back to precomputed arrays later if somebody does that
>>work.
>>
>>My only comments are on the changelog:
>>
>>>Changelog:
>>>  * include/std/charconv (__detail::__to_chars_8,
>>>  __detail::__to_chars_16): Replace array of precomputed digits
>>
>>When the list of changed functions is split across lines it should be
>>like this:
>>
>>  * include/std/charconv (__detail::__to_chars_8)
>>  (__detail::__to_chars_16): Replace array of precomputed digits
>>
>>i.e close the parentheses before the line break, and reopen on the
>>next line.
>>
>>>  with arithmetic operations to avoid CPU cache misses. Remove
>>>  zero termination from array of digits to allow symbol merge with
>>>  generic implementation of __detail::__to_chars. Replace final
>>>  offsets with constants. Use __detail::__to_chars_len_2 instead
>>>  of a generic __detail::__to_chars_len.
>>>  * include/std/charconv (__detail::__to_chars): Remove
>>
>>Don't repeat the asterisk and filename for later changes in the same
>>file, i.e.
>>
>>  (__detail::__to_chars): Remove zero termination from array of digits.
>>  (__detail::__to_chars_2): Leading digit is always '1'.
>>
>>There's no changelog entry for the changes to __to_chars_len_8 and
>>__to_chars_len_16.
>
>Oh, there's no separate __to_chars_len_16 function anyway, and you did
>mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!
>
>I think we might as well inline __to_chars_len_8 into __to_chars_8,
>there's not much benefit to having a separate function. I'll do that
>as a follow up patch.

I've committed this patch to simplify the code a bit.

Tested x86_64-linux, committed to trunk.


[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 3344 bytes --]

commit 8ea5798914e9cf5b823d245d85db4d76e0631d0e
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Fri Aug 30 17:27:48 2019 +0100

    Minor simplifications for std::to_chars implementation
    
            * include/std/charconv (__detail::__to_chars_2_len): Use std::log2p1.
            (__detail::__to_chars_8_len): Remove.
            (__detail::__to_chars_8): Inline length calculation here.
            (__detail::__from_chars_binary): Use numeric_limits instead of
            CHAR_BIT.

diff --git a/libstdc++-v3/include/std/charconv b/libstdc++-v3/include/std/charconv
index 4e94c39656d..ceefa3b6778 100644
--- a/libstdc++-v3/include/std/charconv
+++ b/libstdc++-v3/include/std/charconv
@@ -35,8 +35,9 @@
 
 #include <type_traits>
 #include <limits>
-#include <cctype>
-#include <bits/charconv.h> // for __to_chars_len, __to_chars_10_impl
+#include <bit>			// for __log2p1
+#include <cctype>		// for isdigit
+#include <bits/charconv.h>	// for __to_chars_len, __to_chars_10_impl
 #include <bits/error_constants.h> // for std::errc
 
 // Define when floating point is supported: #define __cpp_lib_to_chars 201611L
@@ -96,43 +97,7 @@ namespace __detail
   template<typename _Tp>
     constexpr unsigned
     __to_chars_len_2(_Tp __value) noexcept
-    {
-      static_assert(is_integral<_Tp>::value, "implementation bug");
-      static_assert(is_unsigned<_Tp>::value, "implementation bug");
-
-      constexpr size_t __nbits = __CHAR_BIT__ * sizeof(_Tp);
-
-      // N.B. __builtin_clzll is undefined if __value == 0, but std::to_chars
-      // handles zero values directly.
-
-      // For sizeof(_Tp) > 1 this is an order of magnitude faster than
-      // the generic __to_chars_len.
-      return __nbits
-	- (__builtin_clzll(__value)
-	    - ((__CHAR_BIT__ * sizeof(long long)) - __nbits));
-    }
-
-  template<typename _Tp>
-    constexpr unsigned
-    __to_chars_len_8(_Tp __value) noexcept
-    {
-      static_assert(is_integral<_Tp>::value, "implementation bug");
-      static_assert(is_unsigned<_Tp>::value, "implementation bug");
-
-      constexpr size_t __nbits = __CHAR_BIT__ * sizeof(_Tp);
-
-      if _GLIBCXX17_CONSTEXPR (__nbits <= 16)
-	{
-	  return __value > 077777u ? 6u
-	    : __value > 07777u ? 5u
-	    : __value > 0777u ? 4u
-	    : __value > 077u ? 3u
-	    : __value > 07u ? 2u
-	    : 1u;
-	}
-      else
-	return (__to_chars_len_2(__value) + 2) / 3;
-    }
+    { return std::__log2p1(__value); }
 
   // Generic implementation for arbitrary bases.
   template<typename _Tp>
@@ -255,8 +220,19 @@ namespace __detail
       static_assert(is_unsigned<_Tp>::value, "implementation bug");
 
       to_chars_result __res;
+      unsigned __len;
 
-      const unsigned __len = __to_chars_len_8(__val);
+      if _GLIBCXX17_CONSTEXPR (numeric_limits<_Tp>::digits <= 16)
+	{
+	  __len = __val > 077777u ? 6u
+	    : __val > 07777u ? 5u
+	    : __val > 0777u ? 4u
+	    : __val > 077u ? 3u
+	    : __val > 07u ? 2u
+	    : 1u;
+	}
+      else
+	__len = (__to_chars_len_2(__val) + 2) / 3;
 
       if (__builtin_expect((__last - __first) < __len, 0))
 	{
@@ -397,7 +373,7 @@ namespace __detail
 	  __i++;
 	}
       __first += __i;
-      return __i <= (sizeof(_Tp) * __CHAR_BIT__);
+      return __i <= numeric_limits<_Tp>::digits;
     }
 
   /// std::from_chars implementation for integers in bases 3 to 10.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-09-02 11:32     ` Jonathan Wakely
@ 2019-09-08 13:45       ` Antony Polukhin
  2019-09-09 11:13         ` Jonathan Wakely
  0 siblings, 1 reply; 10+ messages in thread
From: Antony Polukhin @ 2019-09-08 13:45 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches List

[-- Attachment #1: Type: text/plain, Size: 1672 bytes --]

We've already beaten this topic to death, so let's put a final nail in
the coffin:


__to_chars_10_impl is quite fast. According to the IACA the main loop
takes only 6.0 cycles, the whole function with one iteration takes
10.0 cycles. Replacing the __first[pos] and __first[pos - 1] with
__first[0] and __first[1] drops the function time to 7.53 cycles.

Changelog:

2019-09-08  Antony Polukhin  <antoshkka@gmail.com>

    * include/bits/charconv.h (__detail::__to_chars_10_impl): Replace
    final offsets with constants.


And that's the only optimization that improves all the usecases and
reduces code size on 3 instructions.

Different approaches for optimizing the loop were showing different
results depending on the workload. The most interesting result gives
the compressed table of binary coded decimals:

static constexpr unsigned char __binary_coded_decimals[50] =
{ 0x00, 0x02, 0x04, 0x06, 0x08,  0x10... 0x98 };
unsigned __pos = __len - 1;
while (__val >= 100)
{
  auto const addition = __val & 1;
  auto const __num = (__val % 100) >> 1;
  __val /= 100;
  auto const __bcd = __binary_coded_decimals[__num];
  __first[__pos] = '0' + (__bcd & 0xf) + addition;
  __first[__pos - 1] = '0' + (__bcd >> 4);
  __pos -= 2;
}

That approach shows the same results or even outperforms the existing
approach with __digits[201] = "0001020304..." in case of cold cache.
It also produces slightly smaller binaries. Unfortunately on a warmed
up cache it's slower than the existing approach. I don't think that
it's a worth change.

Attaching some of the benchmarks as a separate file (not for merge,
just something to experiment with).


-- 
Best regards,
Antony Polukhin

[-- Attachment #2: to_chars_10_impl_patch.txt --]
[-- Type: text/plain, Size: 572 bytes --]

diff --git a/libstdc++-v3/include/bits/charconv.h b/libstdc++-v3/include/bits/charconv.h
index 0911660..a5b6be5 100644
--- a/libstdc++-v3/include/bits/charconv.h
+++ b/libstdc++-v3/include/bits/charconv.h
@@ -92,11 +92,11 @@ namespace __detail
       if (__val >= 10)
 	{
 	  auto const __num = __val * 2;
-	  __first[__pos] = __digits[__num + 1];
-	  __first[__pos - 1] = __digits[__num];
+	  __first[1] = __digits[__num + 1];
+	  __first[0] = __digits[__num];
 	}
       else
-	__first[__pos] = '0' + __val;
+	__first[0] = '0' + __val;
     }
 
 } // namespace __detail

[-- Attachment #3: algorithm_x.hpp --]
[-- Type: text/x-c++hdr, Size: 9223 bytes --]

#include <type_traits>
#include <vector>
#include <benchmark/benchmark.h>


// Adjust those variables to test on different workloads
constexpr unsigned From = 130000;
constexpr unsigned To   = 130004;
constexpr unsigned Len  = 6;
// Bytes of CPU cache to reset
#define RANGES ->Args({16 * 1024 * 1024})



namespace std { namespace __detail {
  template<typename _Tp>
    int
    __to_chars_10_table_2_impl(char* __first, unsigned __len, _Tp __val) noexcept
    {
      static_assert(is_integral<_Tp>::value, "implementation bug");
      static_assert(is_unsigned<_Tp>::value, "implementation bug");

      static constexpr char __digits[201] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";
      unsigned __pos = __len - 1;
      while (__val >= 100)
    {
      auto const __num = (__val % 100) * 2;
      __val /= 100;
      __first[__pos] = __digits[__num + 1];
      __first[__pos - 1] = __digits[__num];
      __pos -= 2;
    }
      if (__val >= 10)
    {
      auto const __num = __val * 2;
      __first[1] = __digits[__num + 1];
      __first[0] = __digits[__num];
    }
      else
    __first[0] = '0' + __val;
    return 0;
    }

      template<typename _Tp>
    int
    __to_chars_10_bcd_impl(char* __first, unsigned __len, _Tp __val) noexcept
    {
      static_assert(is_integral<_Tp>::value, "implementation bug");
      static_assert(is_unsigned<_Tp>::value, "implementation bug");

      static constexpr unsigned char __binary_coded_decimals[100] =
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x10
  , 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20
  , 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x30
  , 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x40
  , 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x50
  , 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x60
  , 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x70
  , 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x80
  , 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x90
  , 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99};
      unsigned __pos = __len - 1;
      while (__val >= 100)
    {
      auto const __num = __val % 100;
      __val /= 100;
    auto const __bcd = __binary_coded_decimals[__num];
      __first[__pos] = '0' + (__bcd & 0xf);
      __first[__pos - 1] = '0' + (__bcd >> 4);
      __pos -= 2;
    }
      if (__val >= 10)
    {
    auto const __bcd = __binary_coded_decimals[__val];
      __first[1] = '0' + (__bcd & 0xf);
      __first[0] = '0' + (__bcd >> 4);
    }
      else
    __first[0] = '0' + __val;
    return 0;
    }

      template<typename _Tp>
    int
    __to_chars_10_bcd_compressed_impl(char* __first, unsigned __len, _Tp __val) noexcept
    {
      static_assert(is_integral<_Tp>::value, "implementation bug");
      static_assert(is_unsigned<_Tp>::value, "implementation bug");

      static constexpr unsigned char __binary_coded_decimals[50] =
    { 0x00, 0x02, 0x04, 0x06, 0x08,  0x10
        , 0x12, 0x14, 0x16, 0x18,  0x20
        , 0x22, 0x24, 0x26, 0x28,  0x30
        , 0x32, 0x34, 0x36, 0x38,  0x40
        , 0x42, 0x44, 0x46, 0x48,  0x50
        , 0x52, 0x54, 0x56, 0x58,  0x60
        , 0x62, 0x64, 0x66, 0x68,  0x70
        , 0x72, 0x74, 0x76, 0x78,  0x80
        , 0x82, 0x84, 0x86, 0x88,  0x90
        , 0x92, 0x94, 0x96, 0x98 };
      unsigned __pos = __len - 1;
      while (__val >= 100)
    {
    auto const addition = __val & 1;
      auto const __num = (__val % 100) >> 1;
      __val /= 100;
    auto const __bcd = __binary_coded_decimals[__num];
      __first[__pos] = '0' + (__bcd & 0xf) + addition;
      __first[__pos - 1] = '0' + (__bcd >> 4);
      __pos -= 2;
    }
      if (__val >= 10)
    {
    auto const __bcd = __binary_coded_decimals[__val >> 1];
      __first[1] = '0' + (__bcd & 0xf) + (__val & 1);
      __first[0] = '0' + (__bcd >> 4);
    }
      else
    __first[0] = '0' + __val;
    return 0;
    }

      template<typename _Tp>
    _GLIBCXX14_CONSTEXPR int
    __to_chars_10_naive_impl(char* __first, unsigned __len, _Tp __val) noexcept
    {
      static_assert(is_integral<_Tp>::value, "implementation bug");
      static_assert(is_unsigned<_Tp>::value, "implementation bug");

      unsigned __pos = __len - 1;
      while (__val >= 100)
    {
      auto __num = __val % 10;
      __val /= 10;
      __first[__pos] = '0' + __num;
      __num = __val % 10;
      __val /= 10;
      __first[__pos - 1] = '0' + __num;
      __pos -= 2;
    }
      if (__val >= 10)
    {
      auto const __num = __val % 10;
      __val /= 10;
      __first[1] = '0' + __num;
      __first[0] = '0' + __val;
    }
      else
    __first[0] = '0' + __val;
    return 0;
    }

      template<typename _Tp>
    _GLIBCXX14_CONSTEXPR int
    __to_chars_10_naive_unr_impl(char* __first, unsigned __len, _Tp __val) noexcept
    {
      static_assert(is_integral<_Tp>::value, "implementation bug");
      static_assert(is_unsigned<_Tp>::value, "implementation bug");

      while (__len >= 6)
    {
      unsigned __pos = __len - 1;
      auto __num = __val % 10;
      __val /= 10;
      __first[__pos] = '0' + __num;

      __num = __val % 10;
      __val /= 10;
      __first[__pos - 1] = '0' + __num;

    __num = __val % 10;
      __val /= 10;
      __first[__pos - 2] = '0' + __num;

      __num = __val % 10;
      __val /= 10;
      __first[__pos - 3] = '0' + __num;

    __num = __val % 10;
      __val /= 10;
      __first[__pos - 4] = '0' + __num;

      __num = __val % 10;
      __val /= 10;
      __first[__pos - 5] = '0' + __num;
    __len -= 6;
    }

  switch (__len) {
    case 5:
      __first[4] = '0' + __val % 10;
      __val /= 10;
    case 4:
      __first[3] = '0' + __val % 10;
      __val /= 10;
    case 3:
      __first[2] = '0' + __val % 10;
      __val /= 10;
    case 2:
      __first[1] = '0' + __val % 10;
      __val /= 10;
    case 1:
      __first[0] = '0' + __val;
  }
    return 0;
    }

}}



static void FlushCacheNonpausing(unsigned FlushCacheBaytes) {
    std::vector<unsigned> resetter;
    resetter.resize(FlushCacheBaytes / sizeof(unsigned), FlushCacheBaytes);

    unsigned trash = static_cast<unsigned>(FlushCacheBaytes * FlushCacheBaytes);
    for (unsigned& v : resetter) {
        trash += v;
        v += trash;
    }

    benchmark::DoNotOptimize(trash);
    benchmark::DoNotOptimize(resetter);
}

static void naive(benchmark::State& state) {
    const unsigned FlushCacheBaytes = state.range();
  for (auto _ : state) {
       if (FlushCacheBaytes) {
        state.PauseTiming();
        FlushCacheNonpausing(FlushCacheBaytes);
        state.ResumeTiming();
       }
     for(unsigned i = From; i != To; ++i)
     {
        char buffer[Len];
        benchmark::DoNotOptimize(std::__detail::__to_chars_10_naive_impl(buffer, Len, i));
        benchmark::DoNotOptimize(buffer);
     }

  }
}
BENCHMARK(naive) RANGES;

static void unrolled(benchmark::State& state) {
    const unsigned FlushCacheBaytes = state.range();
  for (auto _ : state) {
       if (FlushCacheBaytes) {
        state.PauseTiming();
        FlushCacheNonpausing(FlushCacheBaytes);
        state.ResumeTiming();
       }
     for(unsigned i = From; i != To; ++i)
     {
        char buffer[Len];
        benchmark::DoNotOptimize(std::__detail::__to_chars_10_naive_unr_impl(buffer, Len, i));
        benchmark::DoNotOptimize(buffer);
     }

  }
}
BENCHMARK(unrolled) RANGES;


static void table_2(benchmark::State& state) {
    const unsigned FlushCacheBaytes = state.range();
  for (auto _ : state) {
       if (FlushCacheBaytes) {
        state.PauseTiming();
        FlushCacheNonpausing(FlushCacheBaytes);
        state.ResumeTiming();
       }
     for(unsigned i = From; i != To; ++i)
     {
        char buffer[Len];
        benchmark::DoNotOptimize(std::__detail::__to_chars_10_table_2_impl(buffer, Len, i));
        benchmark::DoNotOptimize(buffer);
     }

  }
}
BENCHMARK(table_2) RANGES;


static void bcd(benchmark::State& state) {
    const unsigned FlushCacheBaytes = state.range();
  for (auto _ : state) {
       if (FlushCacheBaytes) {
        state.PauseTiming();
        FlushCacheNonpausing(FlushCacheBaytes);
        state.ResumeTiming();
       }
     for(unsigned i = From; i != To; ++i)
     {
        char buffer[Len];
        benchmark::DoNotOptimize(std::__detail::__to_chars_10_bcd_impl(buffer, Len, i));
        benchmark::DoNotOptimize(buffer);
     }

  }
}
BENCHMARK(bcd) RANGES;


static void compressed_bin_coded(benchmark::State& state) {
    const unsigned FlushCacheBaytes = state.range();
  for (auto _ : state) {
       if (FlushCacheBaytes) {
        state.PauseTiming();
        FlushCacheNonpausing(FlushCacheBaytes);
        state.ResumeTiming();
       }
     for(unsigned i = From; i != To; ++i)
     {
        char buffer[Len];
        benchmark::DoNotOptimize(std::__detail::__to_chars_10_bcd_compressed_impl(buffer, Len, i));
        benchmark::DoNotOptimize(buffer);
     }

  }
}
BENCHMARK(compressed_bin_coded) RANGES;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimize to_chars
  2019-09-08 13:45       ` Antony Polukhin
@ 2019-09-09 11:13         ` Jonathan Wakely
  0 siblings, 0 replies; 10+ messages in thread
From: Jonathan Wakely @ 2019-09-09 11:13 UTC (permalink / raw)
  To: Antony Polukhin; +Cc: libstdc++, gcc-patches List

On 08/09/19 16:44 +0300, Antony Polukhin wrote:
>We've already beaten this topic to death, so let's put a final nail in
>the coffin:
>
>
>__to_chars_10_impl is quite fast. According to the IACA the main loop
>takes only 6.0 cycles, the whole function with one iteration takes
>10.0 cycles. Replacing the __first[pos] and __first[pos - 1] with
>__first[0] and __first[1] drops the function time to 7.53 cycles.
>
>Changelog:
>
>2019-09-08  Antony Polukhin  <antoshkka@gmail.com>
>
>    * include/bits/charconv.h (__detail::__to_chars_10_impl): Replace
>    final offsets with constants.

Excellent, thanks for the patch and all the benchmarking!

I've committed this to trunk now.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-09-09 11:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 14:39 [PATCH] Optimize to_chars Antony Polukhin
2019-08-30 16:35 ` Jonathan Wakely
2019-08-30 16:44   ` Jonathan Wakely
2019-09-02 11:32     ` Jonathan Wakely
2019-09-08 13:45       ` Antony Polukhin
2019-09-09 11:13         ` Jonathan Wakely
2019-08-30 21:50   ` Antony Polukhin
2019-08-30 23:33     ` Jonathan Wakely
2019-08-30 19:41 ` Martin Sebor
2019-08-30 21:11   ` Jonathan Wakely

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).