Hi,

currently, byte-swapped unformatted IO can be quite slow compared to the same code with no byte swapping. There are two major reasons for this:

1) The byte swapping code path resorts to transferring data element by element, leading to a lot of overhead in the IO library.

2) The function used for the actual byte swapping, reverse_memcpy, while able to handle general element sizes, is not particularly fast, especially considering that many CPUs have fast byte swapping instructions (e.g. BSWAP on x86). To give access to these instructions, gcc provides the __builtin_bswap{16,32,64} builtins, which fall back to libgcc code on targets that lack hardware support.

The attached patch fixes these issues. For issue (1), the read path byte swaps the data in place after it has been read into the user buffer, while the write path uses a larger temporary buffer (since we are not allowed to modify the user-supplied data in this case). For issue (2), the patch uses __builtin_bswap{16,32,64} where appropriate, falling back to reverse_memcpy only for other sizes.
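To illustrate the idea, here is a minimal sketch of an array byte swapping helper along the lines described above. It is not the code from the patch: the names bswap_array_sketch and reverse_bytes, the argument order, and the use of memcpy for unaligned-safe loads and stores are assumptions made for this example. Only the __builtin_bswap{16,32,64} builtins are real gcc features.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fallback for element sizes without a builtin: reverse the
   bytes of one element.  dest and src must be either identical or
   non-overlapping.  */
static void
reverse_bytes (unsigned char *dest, const unsigned char *src, size_t size)
{
  size_t i;

  if (dest == src)
    {
      /* In place: swap bytes pairwise from both ends.  */
      for (i = 0; i < size / 2; i++)
        {
          unsigned char tmp = dest[i];
          dest[i] = dest[size - 1 - i];
          dest[size - 1 - i] = tmp;
        }
    }
  else
    for (i = 0; i < size; i++)
      dest[i] = src[size - 1 - i];
}

/* Byte swap nelems elements of size bytes each from src into dest.
   Passing dest == src swaps in place (as on the read path); passing a
   separate dest leaves src untouched (as on the write path).  */
static void
bswap_array_sketch (void *dest, const void *src, size_t size, size_t nelems)
{
  unsigned char *d = dest;
  const unsigned char *s = src;
  size_t i;

  switch (size)
    {
    case 1:
      /* Nothing to swap; just copy if the buffers differ.  */
      if (d != s)
        memcpy (d, s, nelems);
      break;
    case 2:
      for (i = 0; i < nelems; i++, d += 2, s += 2)
        {
          uint16_t v;
          memcpy (&v, s, sizeof v);   /* unaligned-safe load */
          v = __builtin_bswap16 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    case 4:
      for (i = 0; i < nelems; i++, d += 4, s += 4)
        {
          uint32_t v;
          memcpy (&v, s, sizeof v);
          v = __builtin_bswap32 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    case 8:
      for (i = 0; i < nelems; i++, d += 8, s += 8)
        {
          uint64_t v;
          memcpy (&v, s, sizeof v);
          v = __builtin_bswap64 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    default:
      /* Other sizes: generic byte reversal, one element at a time.  */
      for (i = 0; i < nelems; i++, d += size, s += size)
        reverse_bytes (d, s, size);
      break;
    }
}
```

In this sketch the read path would call bswap_array_sketch (buf, buf, size, nelems) on the user buffer after the read, while the write path would swap chunks of the user data into a temporary buffer and write that buffer out.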
With the attached test program run on a tmpfs filesystem to avoid doing actual disk IO, I get the following:

- With no byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s            Read MB/s
 ==========================================================
           4   52.723842817422202    72.721158943820441
           8   77.508296890856386    97.237815640377221
          16   110.26209495334321    143.80831184546381
          32   173.94872143231535    221.89704881197937
          64   282.19818562682684    373.77854583735541
         128   442.22084579742244    628.80041029142183
         256   636.69620860705299    966.37723642576316
         512   826.05968840738080    1380.8835166612221
        1024   987.18686465197561    1763.5990036057208
        2048   1047.6721544191710    2058.0875622043550
        4096   1115.5817147134801    2251.8731832850176
        8192   1191.5021150996590    2283.8893409728184
       16384   1417.6110909519391    2441.0530373866482
       32768   1570.4413479046018    2543.0836384048471
       65536   1673.0378706502966    2651.2182395008308
      131072   1697.4944246188445    2688.2398923155783
      262144   1669.6329862145872    2735.6611118973292
      524288   1594.4669935231552    2697.7208298823243

- Before patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s            Read MB/s
 ==========================================================
           4   50.572812893689793    68.858701306591627
           8   58.688513300690317    81.591733130441327
          16   73.551188480607820    96.638995590227665
          32   91.593767813989018    116.65817140076214
          64   107.41379323761915    128.32512066346368
         128   121.33499652432221    147.80777892360237
         256   128.99627771476628    155.91619889220266
         512   135.02742063670030    161.30042382365372
        1024   137.02276709585524    164.11267056940963
        2048   138.62774254302394    165.22456826188971
        4096   139.27695763341924    166.34707691429571
        8192   147.64584950575932    166.59526981475742
       16384   147.91235479266419    166.77890398940283
       32768   150.77029430529927    166.90834867503827
       65536   151.59474472614465    166.84075600288520
      131072   155.75202672623249    166.96550283835097
      262144   155.36506626794849    166.78075976148853
      524288   155.64305086921487    167.44468828946083

- After patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s            Read MB/s
 ==========================================================
           4   49.414771776821361    70.808060042286343
           8   72.918156402459772    93.234093684373946
          16   102.72461544178078    136.21700026949074
          32   160.57240200649090    205.97612602315186
          64   249.32082957447636    331.85515010907363
         128   385.71299236810387    522.06354804855266
         256   535.40608912076459    766.59668706247294
         512   669.47864120368524    1006.4275938227961
        1024   742.90538895500265    1187.9846039167674
        2048   789.71340557340523    1333.8411634622269
        4096   826.44253204731683    1395.5536995933605
        8192   832.93540316116662    1361.4621716558986
       16384   897.95081977010113    1469.0940087507722
       32768   961.18736308033317    1533.7736812111871
       65536   989.41384908496832    1564.7013916917260
      131072   1003.6113762068040    1597.4063253370084
      262144   980.03067664324396    1602.3188995993287
      524288   985.82645661078755    1568.9537807626730

Regtested on x86_64-unknown-linux-gnu, Ok for trunk?

2013-01-04  Janne Blomqvist

	* io/file_pos.c (unformatted_backspace): Use __builtin_bswapXX
	instead of reverse_memcpy.
	* io/io.h (reverse_memcpy): Remove prototype.
	* io/transfer.c (reverse_memcpy): Make static, move towards
	beginning of file.
	(bswap_array): New function.
	(unformatted_read): Use bswap_array to byte swap the data
	in-place.
	(unformatted_write): Use a larger temp buffer and bswap_array.
	(us_read): Use __builtin_bswapXX instead of reverse_memcpy.
	(write_us_marker): Likewise.

--
Janne Blomqvist