From: Janne Blomqvist <blomqvist.janne@gmail.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>,
Fortran List <fortran@gcc.gnu.org>
Subject: [Patch, libfortran] Improve performance of byte swapped IO
Date: Fri, 04 Jan 2013 22:15:00 -0000
Message-ID: <CAO9iq9HB5us6X3faKPt=ZaAcBz_KeVQBTy4ZDiZ6XfeTOVwohA@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 5653 bytes --]
Hi,
Currently, byte swapped unformatted IO can be quite slow compared to
the same code with no byte swapping. There are two major reasons for
this:
1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.
2) The function used for the actual byte swapping, reverse_memcpy,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPUs have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.
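
As a rough illustration of point (2), and not part of the patch itself, the sketch below puts a generic byte-reversing copy next to the builtin for a 32-bit value. The helper names reverse_copy, swap32_generic and swap32_builtin are made up for this example. Whether the builtin really ends up as a single instruction depends on the target and the compiler options.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic byte-reversing copy, the same idea as reverse_memcpy:
   works for any element size, but moves one byte per iteration.  */
static void
reverse_copy (void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = (const unsigned char *) src + n;

  for (size_t i = 0; i < n; i++)
    *d++ = *--s;
}

/* Hypothetical helper: swap a 32-bit value through the generic path.  */
static uint32_t
swap32_generic (uint32_t x)
{
  uint32_t r;
  reverse_copy (&r, &x, sizeof r);
  return r;
}

/* Hypothetical helper: swap a 32-bit value with the GCC builtin.  */
static uint32_t
swap32_builtin (uint32_t x)
{
  return __builtin_bswap32 (x);
}
```

Both helpers return the same value for any input; the difference is only in how much work is done per element.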
The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer
(since we are not allowed to modify the user supplied data in this
case). For issue (2), the patch uses __builtin_bswap{16,32,64} where
appropriate, only falling back to reverse_memcpy for other sizes.
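
The two strategies for issue (1) can be sketched as follows. This is a simplified stand-in restricted to 4-byte elements, not the code in the attached diff: struct out_buf, emit, swap32_array, write_swapped32 and read_swapped32_inplace are hypothetical names, and emit merely appends to memory where the library would call its buffered write routine. The 512-byte chunk size matches BSWAP_BUFSZ in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 512   /* same size as BSWAP_BUFSZ in the patch */

/* Hypothetical in-memory sink; the caller provides enough storage.  */
struct out_buf
{
  unsigned char *bytes;
  size_t used;
};

static void
emit (struct out_buf *out, const void *buf, size_t nbytes)
{
  memcpy (out->bytes + out->used, buf, nbytes);
  out->used += nbytes;
}

/* Byte swap n 32-bit elements from src into dest.  Going through
   memcpy avoids alignment and aliasing assumptions.  dest == src is
   allowed; partially overlapping ranges are not handled here.  */
static void
swap32_array (void *dest, const void *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t v;
      memcpy (&v, (const char *) src + 4 * i, sizeof v);
      v = __builtin_bswap32 (v);
      memcpy ((char *) dest + 4 * i, &v, sizeof v);
    }
}

/* Write path: the caller's data must stay untouched, so swap into a
   fixed-size temporary buffer, one chunk at a time.  */
static void
write_swapped32 (const uint32_t *data, size_t n, struct out_buf *out)
{
  unsigned char tmp[CHUNK_BYTES];
  const size_t per_chunk = CHUNK_BYTES / sizeof (uint32_t);

  while (n > 0)
    {
      size_t nc = n < per_chunk ? n : per_chunk;
      swap32_array (tmp, data, nc);
      emit (out, tmp, nc * sizeof (uint32_t));
      data += nc;
      n -= nc;
    }
}

/* Read path: the data already sits in the caller's buffer, so it can
   be swapped where it is, with no temporary at all.  */
static void
read_swapped32_inplace (uint32_t *data, size_t n)
{
  swap32_array (data, data, n);
}
```

The point of the chunking is that the number of calls into the IO layer drops from one per element to one per 512 bytes on the write side, and to a single block read on the read side.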
With the attached test program run on a tmpfs filesystem to avoid
doing actual disk IO, I get the following:
- With no byte swapping:
Unformatted sequential write/read performance test
Record size Write MB/s Read MB/s
==========================================================
4 52.723842817422202 72.721158943820441
8 77.508296890856386 97.237815640377221
16 110.26209495334321 143.80831184546381
32 173.94872143231535 221.89704881197937
64 282.19818562682684 373.77854583735541
128 442.22084579742244 628.80041029142183
256 636.69620860705299 966.37723642576316
512 826.05968840738080 1380.8835166612221
1024 987.18686465197561 1763.5990036057208
2048 1047.6721544191710 2058.0875622043550
4096 1115.5817147134801 2251.8731832850176
8192 1191.5021150996590 2283.8893409728184
16384 1417.6110909519391 2441.0530373866482
32768 1570.4413479046018 2543.0836384048471
65536 1673.0378706502966 2651.2182395008308
131072 1697.4944246188445 2688.2398923155783
262144 1669.6329862145872 2735.6611118973292
524288 1594.4669935231552 2697.7208298823243
- Before patch, with byte swapping:
Unformatted sequential write/read performance test
Record size Write MB/s Read MB/s
==========================================================
4 50.572812893689793 68.858701306591627
8 58.688513300690317 81.591733130441327
16 73.551188480607820 96.638995590227665
32 91.593767813989018 116.65817140076214
64 107.41379323761915 128.32512066346368
128 121.33499652432221 147.80777892360237
256 128.99627771476628 155.91619889220266
512 135.02742063670030 161.30042382365372
1024 137.02276709585524 164.11267056940963
2048 138.62774254302394 165.22456826188971
4096 139.27695763341924 166.34707691429571
8192 147.64584950575932 166.59526981475742
16384 147.91235479266419 166.77890398940283
32768 150.77029430529927 166.90834867503827
65536 151.59474472614465 166.84075600288520
131072 155.75202672623249 166.96550283835097
262144 155.36506626794849 166.78075976148853
524288 155.64305086921487 167.44468828946083
- After patch, with byte swapping:
Unformatted sequential write/read performance test
Record size Write MB/s Read MB/s
==========================================================
4 49.414771776821361 70.808060042286343
8 72.918156402459772 93.234093684373946
16 102.72461544178078 136.21700026949074
32 160.57240200649090 205.97612602315186
64 249.32082957447636 331.85515010907363
128 385.71299236810387 522.06354804855266
256 535.40608912076459 766.59668706247294
512 669.47864120368524 1006.4275938227961
1024 742.90538895500265 1187.9846039167674
2048 789.71340557340523 1333.8411634622269
4096 826.44253204731683 1395.5536995933605
8192 832.93540316116662 1361.4621716558986
16384 897.95081977010113 1469.0940087507722
32768 961.18736308033317 1533.7736812111871
65536 989.41384908496832 1564.7013916917260
131072 1003.6113762068040 1597.4063253370084
262144 980.03067664324396 1602.3188995993287
524288 985.82645661078755 1568.9537807626730
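
To put the tables in proportion: at the largest record size the byte swapped case goes from about 156 to about 986 MB/s for writes and from about 167 to about 1569 MB/s for reads, i.e. roughly 6x and 9x. The small helper below (an illustration only, with figures rounded from three rows of the tables above; struct row, rows, speedup and print_speedups are names invented here) computes the same ratios for other rows.

```c
#include <stddef.h>
#include <stdio.h>

/* One record size from the tables above; all speeds in MB/s.  */
struct row
{
  long record_bytes;
  double write_before, read_before;   /* byte swapping, before patch */
  double write_after, read_after;     /* byte swapping, after patch */
};

/* Figures rounded from the 4-, 4096- and 524288-byte rows.  */
static const struct row rows[] = {
  { 4,      50.57,  68.86,  49.41,   70.81 },
  { 4096,   139.28, 166.35, 826.44,  1395.55 },
  { 524288, 155.64, 167.44, 985.83,  1568.95 },
};

/* Throughput ratio: values above 1.0 mean the patched code is faster.  */
static double
speedup (double before, double after)
{
  return after / before;
}

/* Print the write and read speedup for each selected record size.  */
static void
print_speedups (void)
{
  for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
    printf ("%8ld bytes: write x%.2f, read x%.2f\n",
            rows[i].record_bytes,
            speedup (rows[i].write_before, rows[i].write_after),
            speedup (rows[i].read_before, rows[i].read_after));
}
```

For the smallest records the ratio stays close to 1, since per-record overhead dominates there regardless of how the swapping is done.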
Regtested on x86_64-unknown-linux-gnu, Ok for trunk?
2013-01-04 Janne Blomqvist <jb@gcc.gnu.org>
* io/file_pos.c (unformatted_backspace): Use __builtin_bswapXX
instead of reverse_memcpy.
* io/io.h (reverse_memcpy): Remove prototype.
* io/transfer.c (reverse_memcpy): Make static, move towards
beginning of file.
(bswap_array): New function.
(unformatted_read): Use bswap_array to byte swap the data
in-place.
(unformatted_write): Use a larger temp buffer and bswap_array.
(us_read): Use __builtin_bswapXX instead of reverse_memcpy.
(write_us_marker): Likewise.
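
For the record marker changes, the diff converts between the signed marker types and the unsigned types taken by the builtins with pointer casts such as *(GFC_INTEGER_4*)&u32. A possible alternative, shown here only as a sketch with made-up helper names (swap_marker32, swap_marker64) and with int32_t/int64_t standing in for GFC_INTEGER_4/GFC_INTEGER_8, is to use memcpy in both directions, which avoids type punning through pointer casts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte swap a signed 32-bit record marker.
   memcpy moves the bits between the signed and unsigned objects
   without any pointer cast.  */
static int32_t
swap_marker32 (int32_t marker)
{
  uint32_t u;

  memcpy (&u, &marker, sizeof u);
  u = __builtin_bswap32 (u);
  memcpy (&marker, &u, sizeof marker);
  return marker;
}

/* Same idea for a 64-bit record marker.  */
static int64_t
swap_marker64 (int64_t marker)
{
  uint64_t u;

  memcpy (&u, &marker, sizeof u);
  u = __builtin_bswap64 (u);
  memcpy (&marker, &u, sizeof marker);
  return marker;
}
```

Compilers typically optimize such fixed-size memcpy calls away, so this form should not cost anything over the cast; that is an expectation, not something measured here.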
--
Janne Blomqvist
[-- Attachment #2: us_perf2.f90 --]
[-- Type: application/octet-stream, Size: 1826 bytes --]
! Test performance of unformatted sequential with different sized records.
! Janne Blomqvist 2013
program us_perf
  implicit none
  integer, parameter :: d = 8
  integer, parameter :: i64 = selected_int_kind(18)
  integer :: ii
  real(d) :: wspeed, rspeed
  print *, 'Unformatted sequential write/read performance test'
  print *, 'Record size Write MB/s Read MB/s'
  print *, '=========================================================='
  ii = 1
  do
     call run_us_test (ii, wspeed, rspeed)
     print *, ii*4, wspeed, rspeed
     if (ii > 100000) then
        exit
     end if
     ii = ii * 2
  end do
contains
  subroutine run_us_test (n, ws, rs)
    integer, intent(in) :: n
    real(d), intent(out) :: ws, rs
    integer, allocatable :: data(:)
    real(d) :: t1, t2
    integer :: ii, loops
    integer, parameter :: nsize = 10000000 ! 10 MB
    ! Write nsize * log(n + 1) bytes, each record is n elements of 4 bytes each
    ! + two 4 byte record markers
    loops = nsize * log(n + 1._d) / (n*4._d + 8._d)
    allocate(data(n))
    data = 123
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='replace')
    call wtime(t1)
    do ii = 1, loops
       write (10) data
    end do
    call wtime(t2)
    close(10)
    ws = nsize * log(n+1._d) / 1024**2 / (t2-t1)
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='old')
    call wtime(t1)
    do ii = 1, loops
       read (10) data
    end do
    call wtime(t2)
    close(10, status='delete')
    deallocate(data)
    rs = nsize * log(n+1._d) / 1024**2 / (t2-t1)
  end subroutine run_us_test
  subroutine wtime(t)
    real(d) :: t
    integer(i64):: count, rate
    call system_clock(count, rate)
    t = real(count, d) / rate
  end subroutine wtime
end program us_perf
[-- Attachment #3: bswap.diff --]
[-- Type: application/octet-stream, Size: 7619 bytes --]
diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..bf2250a 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
}
else
{
+ uint32_t u32;
+ uint64_t u64;
switch (length)
{
case sizeof(GFC_INTEGER_4):
- reverse_memcpy (&m4, p, sizeof (m4));
+ memcpy (&u32, p, sizeof (u32));
+ u32 = __builtin_bswap32 (u32);
+ m4 = *(GFC_INTEGER_4*)&u32;
m = m4;
break;
case sizeof(GFC_INTEGER_8):
- reverse_memcpy (&m8, p, sizeof (m8));
+ memcpy (&u64, p, sizeof (u64));
+ u64 = __builtin_bswap64 (u64);
+ m8 = *(GFC_INTEGER_8*)&u64;
m = m8;
break;
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
extern void next_record (st_parameter_dt *, int);
internal_proto(next_record);
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
extern void st_wait (st_parameter_wait *);
export_proto(st_wait);
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..eb77df8a 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,91 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
}
-/* Master function for unformatted reads. */
+/* Reverse memcpy - used for byte swapping. */
static void
-unformatted_read (st_parameter_dt *dtp, bt type,
- void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
{
- if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
- || kind == 1)
+ char *d, *s;
+ size_t i;
+
+ d = (char *) dest;
+ s = (char *) src + n - 1;
+
+ /* Write with ascending order - this is likely faster
+ on modern architectures because of write combining. */
+ for (i=0; i<n; i++)
+ *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+ builtins if possible. dest and src can overlap. */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+ char buffer[16];
+ const char *ps;
+ char *pd;
+
+ switch (size)
{
- if (type == BT_CHARACTER)
- size *= GFC_SIZE_OF_CHAR_KIND(kind);
- read_block_direct (dtp, dest, size * nelems);
+ case 1:
+ break;
+ case 2:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+ break;
+ case 4:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+ break;
+ case 8:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+ break;
+ default:
+ ps = src;
+ pd = dest;
+ for (size_t i = 0; i < nelems; i++)
+ {
+ reverse_memcpy (buffer, ps, size);
+ memcpy (pd, buffer, size);
+ ps += size;
+ pd += size;
+ }
}
- else
- {
- char buffer[16];
- char *p;
- size_t i;
+}
- p = dest;
+/* Master function for unformatted reads. */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+ void *dest, int kind, size_t size, size_t nelems)
+{
+ if (type == BT_CHARACTER)
+ size *= GFC_SIZE_OF_CHAR_KIND(kind);
+ read_block_direct (dtp, dest, size * nelems);
+
+ if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+ && kind != 1)
+ {
/* Handle wide chracters. */
- if (type == BT_CHARACTER && kind != 1)
- {
- nelems *= size;
- size = kind;
- }
+ if (type == BT_CHARACTER)
+ {
+ nelems *= size;
+ size = kind;
+ }
/* Break up complex into its constituent reals. */
- if (type == BT_COMPLEX)
- {
- nelems *= 2;
- size /= 2;
- }
-
- /* By now, all complex variables have been split into their
- constituent reals. */
-
- for (i = 0; i < nelems; i++)
- {
- read_block_direct (dtp, buffer, size);
- reverse_memcpy (p, buffer, size);
- p += size;
- }
+ else if (type == BT_COMPLEX)
+ {
+ nelems *= 2;
+ size /= 2;
+ }
+ bswap_array (dest, dest, size, nelems);
}
}
@@ -945,9 +986,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
}
else
{
- char buffer[16];
+#define BSWAP_BUFSZ 512
+ char buffer[BSWAP_BUFSZ];
char *p;
- size_t i;
+ size_t nrem;
p = source;
@@ -968,12 +1010,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
/* By now, all complex variables have been split into their
constituent reals. */
- for (i = 0; i < nelems; i++)
+ nrem = nelems;
+ do
{
- reverse_memcpy(buffer, p, size);
- p += size;
- write_buf (dtp, buffer, size);
+ size_t nc;
+ if (size * nrem > BSWAP_BUFSZ)
+ nc = BSWAP_BUFSZ / size;
+ else
+ nc = nrem;
+
+ bswap_array (buffer, p, size, nc);
+ write_buf (dtp, buffer, size * nc);
+ p += size * nc;
+ nrem -= nc;
}
+ while (nrem > 0);
}
}
@@ -2153,15 +2204,22 @@ us_read (st_parameter_dt *dtp, int continued)
}
}
else
+ {
+ uint32_t u32;
+ uint64_t u64;
switch (nr)
{
case sizeof(GFC_INTEGER_4):
- reverse_memcpy (&i4, &i, sizeof (i4));
+ memcpy (&u32, &i, sizeof (u32));
+ u32 = __builtin_bswap32 (u32);
+ i4 = *(GFC_INTEGER_4*)&u32;
i = i4;
break;
case sizeof(GFC_INTEGER_8):
- reverse_memcpy (&i8, &i, sizeof (i8));
+ memcpy (&u64, &i, sizeof (u64));
+ u64 = __builtin_bswap64 (u64);
+ i8 = *(GFC_INTEGER_8*)&u64;
i = i8;
break;
@@ -2169,6 +2227,7 @@ us_read (st_parameter_dt *dtp, int continued)
runtime_error ("Illegal value for record marker");
break;
}
+ }
if (i >= 0)
{
@@ -3036,7 +3095,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
size_t len;
GFC_INTEGER_4 buf4;
GFC_INTEGER_8 buf8;
- char p[sizeof (GFC_INTEGER_8)];
if (compile_options.record_marker == 0)
len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3123,20 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
}
else
{
+ uint32_t u32;
+ uint64_t u64;
switch (len)
{
case sizeof (GFC_INTEGER_4):
buf4 = buf;
- reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
- return swrite (dtp->u.p.current_unit->s, p, len);
+ u32 = __builtin_bswap32 (*(uint32_t*)&buf4);
+ return swrite (dtp->u.p.current_unit->s, &u32, len);
break;
case sizeof (GFC_INTEGER_8):
buf8 = buf;
- reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
- return swrite (dtp->u.p.current_unit->s, p, len);
+ u64 = __builtin_bswap64 (*(uint64_t*)&buf8);
+ return swrite (dtp->u.p.current_unit->s, &u64, len);
break;
default:
@@ -3713,22 +3773,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
}
-/* Reverse memcpy - used for byte swapping. */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
- char *d, *s;
- size_t i;
-
- d = (char *) dest;
- s = (char *) src + n - 1;
-
- /* Write with ascending order - this is likely faster
- on modern architectures because of write combining. */
- for (i=0; i<n; i++)
- *(d++) = *(s--);
-}
-
/* Once upon a time, a poor innocent Fortran program was reading a
file, when suddenly it hit the end-of-file (EOF). Unfortunately