Hi,

Integer output in libgfortran is done by passing values as the largest integer type available. Our gfc_itoa() function, which converts to decimal form, takes that type as well and performs a series of divisions by 10. On targets with a 128-bit integer type (which is most targets nowadays), that division is slow, because it is implemented in software and requires a call to a libgcc function.

We can speed this up in two easy ways:
- If the value fits into 64 bits, use a simple 64-bit itoa() function, which does the series of divisions by 10 in hardware. Most I/O will actually fall into that case in real life, unless you’re printing very big 128-bit integers.
- If the value does not fit into 64 bits, perform only one slow division, by 10^19, and use two calls to the 64-bit function to output each part (the low part needing zero-padding).
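
To make the two cases concrete, here is a minimal sketch in C. It is not the code from the patch: the names itoa64(), itoa64_pad19() and itoa128() are made up for illustration, it relies on the GCC/Clang unsigned __int128 extension, and it assumes the magnitude is at most 2^127 (as for a signed 128-bit integer), so that the quotient by 10^19 fits into 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 39 decimal digits for a 128-bit magnitude, plus the terminating NUL. */
#define ITOA_BUF_SIZE 40

/* 10^19, the largest power of ten that fits in an unsigned 64-bit integer. */
#define TEN_POW_19 UINT64_C(10000000000000000000)

/* Hypothetical helper: write the digits of n backwards from `end`,
   using only 64-bit (hardware) divisions.  Returns the first digit.  */
static char *
itoa64 (uint64_t n, char *end)
{
  char *p = end;
  do
    {
      *--p = (char) ('0' + n % 10);
      n /= 10;
    }
  while (n != 0);
  return p;
}

/* Hypothetical helper: same, but always emit exactly 19 digits,
   zero-padded on the left.  Used for the low part of a split value.  */
static char *
itoa64_pad19 (uint64_t n, char *end)
{
  char *p = end;
  for (int i = 0; i < 19; i++)
    {
      *--p = (char) ('0' + n % 10);
      n /= 10;
    }
  return p;
}

/* Convert a 128-bit magnitude to decimal.  The result is written at the
   end of `buffer`; the returned pointer is the start of the string.  */
static const char *
itoa128 (unsigned __int128 n, char *buffer, size_t len)
{
  assert (len >= ITOA_BUF_SIZE);
  char *p = buffer + len - 1;
  *p = '\0';

  /* Fast path: the value fits into 64 bits, no 128-bit division at all.  */
  if (n <= UINT64_MAX)
    return itoa64 ((uint64_t) n, p);

  /* Slow path: a single 128-bit division by 10^19.  The remainder is
     recovered with a multiplication, which is cheap compared to a
     second software division.  */
  unsigned __int128 hi = n / TEN_POW_19;
  uint64_t lo = (uint64_t) (n - hi * TEN_POW_19);

  /* Assumption of this sketch: n <= 2^127, so hi fits into 64 bits.  */
  assert (hi <= UINT64_MAX);

  p = itoa64_pad19 (lo, p);            /* low 19 digits, zero-padded */
  return itoa64 ((uint64_t) hi, p);    /* remaining high digits */
}
```

A caller would pass a buffer of at least ITOA_BUF_SIZE bytes and print the returned pointer, which points somewhere inside the buffer rather than at its start. A full unsigned 128-bit range would need one more split, since the quotient can then exceed 64 bits.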


What is the speed-up? It really depends on the exact nature of the I/O done. For the most common case, list-directed I/O with no special format, the patch does not speed up (or slow down!) things for values up to HUGE(KIND=4), but speeds things up for larger values. For very large 128-bit values, it can cut the I/O time in half.

I attach my own timing code to this email. Results before the patch (but with my previous itoa patch applied):

 Timing for INTEGER(KIND=1)
 Value 0, time:  0.191409990    
 Value HUGE(KIND=1), time:  0.173687011    
 Timing for INTEGER(KIND=4)
 Value 0, time:  0.171809018    
 Value 1049, time:  0.177439988    
 Value HUGE(KIND=4), time:  0.217984974    
 Timing for INTEGER(KIND=8)
 Value 0, time:  0.178072989    
 Value HUGE(KIND=4), time:  0.214841008    
 Value HUGE(KIND=8), time:  0.276726007    
 Timing for INTEGER(KIND=16)
 Value 0, time:  0.175235987    
 Value HUGE(KIND=4), time:  0.217689037    
 Value HUGE(KIND=8), time:  0.280257106    
 Value HUGE(KIND=16), time:  0.420036077    

Results after the patch:

 Timing for INTEGER(KIND=1)
 Value 0, time:  0.194633007    
 Value HUGE(KIND=1), time:  0.172436997    
 Timing for INTEGER(KIND=4)
 Value 0, time:  0.167517006    
 Value 1049, time:  0.176503003    
 Value HUGE(KIND=4), time:  0.172892988    
 Timing for INTEGER(KIND=8)
 Value 0, time:  0.171101034    
 Value HUGE(KIND=4), time:  0.174461007    
 Value HUGE(KIND=8), time:  0.180289030    
 Timing for INTEGER(KIND=16)
 Value 0, time:  0.175765991    
 Value HUGE(KIND=4), time:  0.181162953    
 Value HUGE(KIND=8), time:  0.186082959    
 Value HUGE(KIND=16), time:  0.207401991    

Times are CPU times in seconds, for one million integer writes into a buffer string. With the patch, we see that integer decimal output time is almost independent of the value written, meaning the I/O library overhead is dominant, not the decimal conversion. For this reason, I don’t think we really need a faster implementation of the 64-bit itoa, and can keep the current series-of-divisions-by-10 approach.

---------------

This patch applies on top of my previous itoa-related patch at https://gcc.gnu.org/pipermail/fortran/2021-December/057218.html

The patch has been bootstrapped and regtested on two 64-bit targets: aarch64-apple-darwin21 (development branch) and x86_64-pc-linux-gnu. I would like it to be tested on a 32-bit target without a 128-bit integer type. Does someone have access to one?

Once tested on a 32-bit target, OK to commit?

FX