On 30/08/19 17:08 +0100, Jonathan Wakely wrote: >On 30/08/19 17:01 +0100, Jonathan Wakely wrote: >>On 30/08/19 17:27 +0300, Antony Polukhin wrote: >>>Bunch of micro optimizations for std::to_chars: >>>* For base == 8 replacing the lookup in __digits table with arithmetic >>>computations leads to a same CPU cycles for a loop (exchanges two >>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this >>>saves 129 bytes of data and totally avoids a chance of cache misses on >>>__digits. >>>* For base == 16 replacing the lookup in __digits table with >>>arithmetic computations leads to a few additional instructions, but >>>totally avoids a chance of cache misses on __digits (- ~9 cache misses >>>for worst case) and saves 513 bytes of const data. >>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and >>>__first[0] on final iterations saves ~2% of code size. >>>* Removing trailing '\0' from arrays of digits allows the linker to >>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and >>>"0123456789abcdef" could share the same address). This improves data >>>locality and reduces binary sizes. >>>* Using __detail::__to_chars_len_2 instead of a generic >>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It >>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG) >>>. >>> >>>In sum: this significantly reduces the size of a binary (for about >>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals >>>with latency (CPU cache misses) without changing the iterations count >>>and without adding costly instructions into the loops. >> >>This is great, thanks. >> >>Have you tried comparing the improved code to libc++'s implementation? >>I believe they use precomputed arrays of digits, but they use larger >>arrays that allow 4 bytes to be written at once, which is considerably >>faster (and those precomputed arrays libe in libc++.so not in the >>header). Would we be better off keeping the precomputed arrays and >>expanding them to do 4-byte writes? >> >>Since we don't have a patch to do that, I think I'll commit yours. We >>can always go back to precomputed arrays later if somebody does that >>work. >> >>My only comments are on the changelog: >> >>>Changelog: >>> * include/std/charconv (__detail::__to_chars_8, >>> __detail::__to_chars_16): Replace array of precomputed digits >> >>When the list of changed functions is split across lines it should be >>like this: >> >> * include/std/charconv (__detail::__to_chars_8) >> (__detail::__to_chars_16): Replace array of precomputed digits >> >>i.e close the parentheses before the line break, and reopen on the >>next line. >> >>> with arithmetic operations to avoid CPU cache misses. Remove >>> zero termination from array of digits to allow symbol merge with >>> generic implementation of __detail::__to_chars. Replace final >>> offsets with constants. Use __detail::__to_chars_len_2 instead >>> of a generic __detail::__to_chars_len. >>> * include/std/charconv (__detail::__to_chars): Remove >> >>Don't repeat the asterisk and filename for later changes in the same >>file, i.e. >> >> (__detail::__to_chars): Remove zero termination from array of digits. >> (__detail::__to_chars_2): Leading digit is always '1'. >> >>There's no changelog entry for the changes to __to_chars_len_8 and >>__to_chars_len_16. > >Oh, there's no separate __to_chars_len_16 function anyway, and you did >mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry! > >I think we might as well inline __to_chars_len_8 into __to_chars_8, >there's not much benefit to having a separate function. I'll do that >as a follow up patch. I've committed this patch to simplify the code a bit. Tested x86_64-linux, committed to trunk.