Andrew,

> This is NOT a win on thunderX at least for single precision because you have
> to do the divide and sqrt in the same time as it takes 5 multiples (estimate
> and step are multiplies in the thunderX pipeline). Doubles is 10 multiplies
> which is just the same as what the patch does (but it is really slightly less
> than 10, I rounded up). So in the end this is NOT a win at all for thunderX
> unless we do one less step for both single and double.

Yes, the expected benefit from rsqrt estimation is implementation-specific.
If one has a better initial rsqrte, or an application that can trade precision
for execution time, we could offer a command-line option to do only 2 steps
for double and 1 step for float, similar to -mrecip-precision for PowerPC.

What are your thoughts on that?

Best regards,
Benedikt
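
To make the step-count trade-off concrete, here is a minimal sketch in C of a reciprocal square root built from an initial estimate plus a configurable number of Newton-Raphson steps. It is an illustration only, not the code the patch generates. The names rsqrt_estimate, rsqrt_step, rsqrt_approx and rsqrt_residual are hypothetical, and the estimate is the well-known bit-level approximation used as a portable stand-in for a hardware rsqrte instruction, so its starting accuracy will differ from any real core.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: portable stand-in for a hardware rsqrt estimate.
   This is the classic bit-level approximation; a real rsqrte
   instruction returns a different (implementation-defined) estimate. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);      /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* rough guess at 1/sqrt(x) */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y^2 - x:
   y' = y * (3 - x*y*y) / 2.  Each step costs a few multiplies. */
static float rsqrt_step(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}

/* Hypothetical helper: approximate 1/sqrt(x) using `steps` refinements.
   Fewer steps means fewer multiplies and lower precision. */
float rsqrt_approx(float x, int steps)
{
    float y;

    assert(x > 0.0f && steps >= 0);
    y = rsqrt_estimate(x);
    for (int i = 0; i < steps; i++)
        y = rsqrt_step(x, y);
    return y;
}

/* Residual |x*y*y - 1|, which is zero for an exact result and
   about twice the relative error of y for a close approximation. */
float rsqrt_residual(float x, int steps)
{
    float y = rsqrt_approx(x, steps);
    float r = x * y * y - 1.0f;

    return r < 0.0f ? -r : r;
}
```

With this stand-in estimate, rsqrt_approx(2.0f, 1) is already well within 1% of 1/sqrt(2), and each further step roughly squares the relative error. Whether dropping a step is acceptable depends on how good the hardware's initial estimate is and on how much precision the application needs.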