On 30/08/19 17:08 +0100, Jonathan Wakely wrote:
>On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
>>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>>Bunch of micro optimizations for std::to_chars:
>>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>>computations leads to a same CPU cycles for a loop (exchanges two
>>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>>__digits.
>>>* For base == 16 replacing the lookup in __digits table with
>>>arithmetic computations leads to a few additional instructions, but
>>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>>for worst case) and saves 513 bytes of const data.
>>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>>__first[0] on final iterations saves ~2% of code size.
>>>* Removing trailing '\0' from arrays of digits allows the linker to
>>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>>"0123456789abcdef" could share the same address). This improves data
>>>locality and reduces binary sizes.
>>>* Using __detail::__to_chars_len_2 instead of a generic
>>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>>.
>>>
>>>In sum: this significantly reduces the size of a binary (for about
>>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>>with latency (CPU cache misses) without changing the iterations count
>>>and without adding costly instructions into the loops.
>>
>>This is great, thanks.
>>
>>Have you tried comparing the improved code to libc++'s implementation?
>>I believe they use precomputed arrays of digits, but they use larger
>>arrays that allow 4 bytes to be written at once, which is considerably
>>faster (and those precomputed arrays libe in libc++.so not in the
>>header). Would we be better off keeping the precomputed arrays and
>>expanding them to do 4-byte writes?
>>
>>Since we don't have a patch to do that, I think I'll commit yours. We
>>can always go back to precomputed arrays later if somebody does that
>>work.
>>
>>My only comments are on the changelog:
>>
>>>Changelog:
>>>  * include/std/charconv (__detail::__to_chars_8,
>>>  __detail::__to_chars_16): Replace array of precomputed digits
>>
>>When the list of changed functions is split across lines it should be
>>like this:
>>
>>  * include/std/charconv (__detail::__to_chars_8)
>>  (__detail::__to_chars_16): Replace array of precomputed digits
>>
>>i.e close the parentheses before the line break, and reopen on the
>>next line.
>>
>>>  with arithmetic operations to avoid CPU cache misses. Remove
>>>  zero termination from array of digits to allow symbol merge with
>>>  generic implementation of __detail::__to_chars. Replace final
>>>  offsets with constants. Use __detail::__to_chars_len_2 instead
>>>  of a generic __detail::__to_chars_len.
>>>  * include/std/charconv (__detail::__to_chars): Remove
>>
>>Don't repeat the asterisk and filename for later changes in the same
>>file, i.e.
>>
>>  (__detail::__to_chars): Remove zero termination from array of digits.
>>  (__detail::__to_chars_2): Leading digit is always '1'.
>>
>>There's no changelog entry for the changes to __to_chars_len_8 and
>>__to_chars_len_16.
>
>Oh, there's no separate __to_chars_len_16 function anyway, and you did
>mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!
>
>I think we might as well inline __to_chars_len_8 into __to_chars_8,
>there's not much benefit to having a separate function. I'll do that
>as a follow up patch.

I've committed this patch to simplify the code a bit.

Tested x86_64-linux, committed to trunk.