[Patch, libfortran] Improve performance of byte swapped IO

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Janne Blomqvist <blomqvist.janne@gmail.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>,
	Fortran List <fortran@gcc.gnu.org>
Subject: [Patch, libfortran] Improve performance of byte swapped IO
Date: Fri, 04 Jan 2013 22:15:00 -0000	[thread overview]
Message-ID: <CAO9iq9HB5us6X3faKPt=ZaAcBz_KeVQBTy4ZDiZ6XfeTOVwohA@mail.gmail.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 5653 bytes --]

Hi,

currently byte swapped unformatted IO can be quite slow compared to
the same code with no byte swapping. There are two major reasons for
this:

1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.

2) The function used for the actual byte swapping, reverse_memcpy ,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPU's have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.

The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer
(since we are not allowed to modify the user supplied data in this
case). For issue(2), the patch uses __builtin_bswap{16,32,64} where
appropriate, only falling back to reverse_memcpy for other sizes.

With the attached test program run on a tmpfs filesystem to avoid
doing actual disk IO, I get the following:

- With no byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   52.723842817422202        72.721158943820441
           8   77.508296890856386        97.237815640377221
          16   110.26209495334321        143.80831184546381
          32   173.94872143231535        221.89704881197937
          64   282.19818562682684        373.77854583735541
         128   442.22084579742244        628.80041029142183
         256   636.69620860705299        966.37723642576316
         512   826.05968840738080        1380.8835166612221
        1024   987.18686465197561        1763.5990036057208
        2048   1047.6721544191710        2058.0875622043550
        4096   1115.5817147134801        2251.8731832850176
        8192   1191.5021150996590        2283.8893409728184
       16384   1417.6110909519391        2441.0530373866482
       32768   1570.4413479046018        2543.0836384048471
       65536   1673.0378706502966        2651.2182395008308
      131072   1697.4944246188445        2688.2398923155783
      262144   1669.6329862145872        2735.6611118973292
      524288   1594.4669935231552        2697.7208298823243

- Before patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   50.572812893689793        68.858701306591627
           8   58.688513300690317        81.591733130441327
          16   73.551188480607820        96.638995590227665
          32   91.593767813989018        116.65817140076214
          64   107.41379323761915        128.32512066346368
         128   121.33499652432221        147.80777892360237
         256   128.99627771476628        155.91619889220266
         512   135.02742063670030        161.30042382365372
        1024   137.02276709585524        164.11267056940963
        2048   138.62774254302394        165.22456826188971
        4096   139.27695763341924        166.34707691429571
        8192   147.64584950575932        166.59526981475742
       16384   147.91235479266419        166.77890398940283
       32768   150.77029430529927        166.90834867503827
       65536   151.59474472614465        166.84075600288520
      131072   155.75202672623249        166.96550283835097
      262144   155.36506626794849        166.78075976148853
      524288   155.64305086921487        167.44468828946083

- After patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   49.414771776821361        70.808060042286343
           8   72.918156402459772        93.234093684373946
          16   102.72461544178078        136.21700026949074
          32   160.57240200649090        205.97612602315186
          64   249.32082957447636        331.85515010907363
         128   385.71299236810387        522.06354804855266
         256   535.40608912076459        766.59668706247294
         512   669.47864120368524        1006.4275938227961
        1024   742.90538895500265        1187.9846039167674
        2048   789.71340557340523        1333.8411634622269
        4096   826.44253204731683        1395.5536995933605
        8192   832.93540316116662        1361.4621716558986
       16384   897.95081977010113        1469.0940087507722
       32768   961.18736308033317        1533.7736812111871
       65536   989.41384908496832        1564.7013916917260
      131072   1003.6113762068040        1597.4063253370084
      262144   980.03067664324396        1602.3188995993287
      524288   985.82645661078755        1568.9537807626730

Regtested on x86_64-unknown-linux-gnu, Ok for trunk?

2013-01-04  Janne Blomqvist  <jb@gcc.gnu.org>

	* io/file_pos.c (unformatted_backspace): Use __builtin_bswapXX
	instead of reverse_memcpy.
	* io/io.h (reverse_memcpy): Remove prototype.
	* io/transfer.c (reverse_memcpy): Make static, move towards
	beginning of file.
	(bswap_array): New function.
	(unformatted_read): Use bswap_array to byte swap the data
	in-place.
	(unformatted_write): Use a larger temp buffer and bswap_array.
	(us_read): Use __builtin_bswapXX instead of reverse_memcpy.
	(write_us_marker): Likewise.


-- 
Janne Blomqvist

[-- Attachment #2: us_perf2.f90 --]
[-- Type: application/octet-stream, Size: 1826 bytes --]

! Test performance of unformatted sequential with different sized records.
! Janne Blomqvist 2013
program us_perf
  implicit none
  integer, parameter :: d = 8
  integer, parameter :: i64 = selected_int_kind(18)
  integer :: ii
  real(d) :: wspeed, rspeed

  print *, 'Unformatted sequential write/read performance test'
  print *, 'Record size           Write MB/s                 Read MB/s'
  print *, '=========================================================='
  ii = 1
  do
     call run_us_test (ii, wspeed, rspeed)
     print *, ii*4, wspeed, rspeed
     if (ii > 100000) then
        exit
     end if
     ii = ii * 2
  end do

contains
  subroutine run_us_test (n, ws, rs)
    integer, intent(in) :: n
    real(d), intent(out) :: ws, rs
    integer, allocatable :: data(:)
    real(d) :: t1, t2
    integer :: ii, loops
    integer, parameter :: nsize = 10000000 ! 10 MB

    ! Write nsize * log(n + 1) bytes,  each record is n elements of 4 bytes each
    ! + two 4 byte record markers
    loops = nsize * log(n + 1._d) / (n*4._d + 8._d)

    allocate(data(n))
    data = 123
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='replace')
    call wtime(t1)
    do ii = 1, loops
       write (10) data
    end do
    call wtime(t2)
    close(10)
    ws = nsize * log(n+1._d) / 1024**2 / (t2-t1)
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='old')
    call wtime(t1)
    do ii = 1, loops
       read (10) data
    end do
    call wtime(t2)
    close(10, status='delete')
    deallocate(data)
    rs = nsize * log(n+1._d) / 1024**2 / (t2-t1)
  end subroutine run_us_test

  subroutine wtime(t)
    real(d) :: t
    integer(i64):: count, rate
    call system_clock(count, rate)
    t = real(count, d) / rate
  end subroutine wtime
    
end program us_perf

[-- Attachment #3: bswap.diff --]
[-- Type: application/octet-stream, Size: 7619 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..bf2250a 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      m4 = *(GFC_INTEGER_4*)&u32;
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      m8 = *(GFC_INTEGER_8*)&u64;
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..eb77df8a 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,91 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  char buffer[16];
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    default:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  reverse_memcpy (buffer, ps, size);
+	  memcpy (pd, buffer, size);
+	  ps += size;
+	  pd += size;
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
 
-      p = dest;
 
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
+
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +986,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1010,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2204,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  i4 = *(GFC_INTEGER_4*)&u32;
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  i8 = *(GFC_INTEGER_8*)&u64;
 	  i = i8;
 	  break;
 
@@ -2169,6 +2227,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3095,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3123,20 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  u32 = __builtin_bswap32 (*(uint32_t*)&buf4);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  u64 = __builtin_bswap64 (*(uint64_t*)&buf8);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3773,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

next             reply	other threads:[~2013-01-04 22:15 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-04 22:15 Janne Blomqvist [this message]
2013-01-04 22:35 ` Andreas Schwab
2013-01-05 15:35   ` Richard Biener
2013-01-05 21:13     ` Janne Blomqvist
2013-01-06 11:33       ` Richard Biener
2013-01-11 20:41       ` Janne Blomqvist
2013-01-13 22:44         ` Janne Blomqvist
2013-01-18 22:30           ` Janne Blomqvist
2013-01-22 22:32             ` Thomas Koenig
2013-01-23 21:57               ` Janne Blomqvist

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAO9iq9HB5us6X3faKPt=ZaAcBz_KeVQBTy4ZDiZ6XfeTOVwohA@mail.gmail.com' \
    --to=blomqvist.janne@gmail.com \
    --cc=fortran@gcc.gnu.org \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).