We've already beaten this topic to death, so let's put a final nail in the coffin:

__to_chars_10_impl is quite fast. According to IACA, the main loop takes only 6.0 cycles, and the whole function with one iteration takes 10.0 cycles. Replacing the final __first[__pos] and __first[__pos - 1] with __first[1] and __first[0] drops the function time to 7.53 cycles.

Changelog:

2019-09-08  Antony Polukhin

	* include/bits/charconv.h (__detail::__to_chars_10_impl): Replace
	final offsets with constants.

That's the only optimization that improves all the use cases, and it reduces code size by 3 instructions.

Different approaches to optimizing the loop showed different results depending on the workload. The most interesting result comes from a compressed table of binary-coded decimals. The table holds only the even values 0, 2, ..., 98, and the low bit of __val is added back to the last digit:

    static constexpr unsigned char __binary_coded_decimals[50] = { 0x00, 0x02, 0x04, 0x06, 0x08, 0x10... 0x98 };
    unsigned __pos = __len - 1;
    while (__val >= 100)
      {
        auto const addition = __val & 1;
        auto const __num = (__val % 100) >> 1;
        __val /= 100;
        auto const __bcd = __binary_coded_decimals[__num];
        __first[__pos] = '0' + (__bcd & 0xf) + addition;
        __first[__pos - 1] = '0' + (__bcd >> 4);
        __pos -= 2;
      }

With a cold cache, that approach matches or even outperforms the existing approach with __digits[201] = "0001020304...". It also produces slightly smaller binaries. Unfortunately, on a warmed-up cache it's slower than the existing approach, so I don't think the change is worth making.

Attaching some of the benchmarks as a separate file (not for merge, just something to experiment with).

--
Best regards,
Antony Polukhin
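
P.S. To experiment with the compressed-table loop outside libstdc++, a minimal self-contained sketch might look like the following. It is neither the libstdc++ code nor the attached benchmarks. All names here (BcdTable, make_bcd_table, digit_count, to_chars_10_bcd, to_string_bcd) are hypothetical, and the table is generated at compile time rather than written out by hand, on the assumption that it holds the BCD encodings of the even numbers 0 to 98.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the 50-entry table: entry i holds the BCD
// encoding of the even number 2*i (tens digit in the high nibble).
struct BcdTable { unsigned char data[50]; };

constexpr BcdTable make_bcd_table()
{
    BcdTable t{};
    for (unsigned i = 0; i < 50; ++i) {
        unsigned const n = 2 * i;
        t.data[i] = static_cast<unsigned char>(((n / 10) << 4) | (n % 10));
    }
    return t;
}

constexpr BcdTable bcd_table = make_bcd_table();

// Number of decimal digits in val (at least 1).
inline unsigned digit_count(unsigned long long val)
{
    unsigned n = 1;
    while (val >= 10) { val /= 10; ++n; }
    return n;
}

// Writes exactly len decimal digits of val into first[0..len).
// len must equal digit_count(val).
inline void to_chars_10_bcd(char* first, unsigned len, unsigned long long val)
{
    unsigned pos = len - 1;
    while (val >= 100) {
        unsigned const addition = static_cast<unsigned>(val & 1);
        unsigned const num = static_cast<unsigned>(val % 100) >> 1;
        val /= 100;
        unsigned const bcd = bcd_table.data[num];
        // The table entry is even, so adding the low bit never carries.
        first[pos] = static_cast<char>('0' + (bcd & 0xf) + addition);
        first[pos - 1] = static_cast<char>('0' + (bcd >> 4));
        pos -= 2;
    }
    if (val >= 10) {
        // Final two digits go to constant offsets 0 and 1.
        unsigned const bcd = bcd_table.data[val >> 1];
        first[1] = static_cast<char>('0' + (bcd & 0xf) + (val & 1));
        first[0] = static_cast<char>('0' + (bcd >> 4));
    } else {
        first[0] = static_cast<char>('0' + val);
    }
}

// Convenience wrapper for testing against std::to_string.
inline std::string to_string_bcd(unsigned long long val)
{
    std::string s(digit_count(val), '0');
    to_chars_10_bcd(&s[0], static_cast<unsigned>(s.size()), val);
    return s;
}
```

Comparing to_string_bcd(v) against std::to_string(v) over a range of values is a quick way to check the table. Timing it says nothing about the cold-cache versus warm-cache behaviour described above unless the benchmark controls for cache state.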