public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [Patch, libfortran] Improve performance of byte swapped IO
@ 2013-01-04 22:15 Janne Blomqvist
  2013-01-04 22:35 ` Andreas Schwab
  0 siblings, 1 reply; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-04 22:15 UTC (permalink / raw)
  To: GCC Patches, Fortran List

[-- Attachment #1: Type: text/plain, Size: 5653 bytes --]

Hi,

currently byte swapped unformatted IO can be quite slow compared to
the same code with no byte swapping. There are two major reasons for
this:

1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.

2) The function used for the actual byte swapping, reverse_memcpy ,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPU's have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.

The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer
(since we are not allowed to modify the user supplied data in this
case). For issue(2), the patch uses __builtin_bswap{16,32,64} where
appropriate, only falling back to reverse_memcpy for other sizes.

With the attached test program run on a tmpfs filesystem to avoid
doing actual disk IO, I get the following:

- With no byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   52.723842817422202        72.721158943820441
           8   77.508296890856386        97.237815640377221
          16   110.26209495334321        143.80831184546381
          32   173.94872143231535        221.89704881197937
          64   282.19818562682684        373.77854583735541
         128   442.22084579742244        628.80041029142183
         256   636.69620860705299        966.37723642576316
         512   826.05968840738080        1380.8835166612221
        1024   987.18686465197561        1763.5990036057208
        2048   1047.6721544191710        2058.0875622043550
        4096   1115.5817147134801        2251.8731832850176
        8192   1191.5021150996590        2283.8893409728184
       16384   1417.6110909519391        2441.0530373866482
       32768   1570.4413479046018        2543.0836384048471
       65536   1673.0378706502966        2651.2182395008308
      131072   1697.4944246188445        2688.2398923155783
      262144   1669.6329862145872        2735.6611118973292
      524288   1594.4669935231552        2697.7208298823243

- Before patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   50.572812893689793        68.858701306591627
           8   58.688513300690317        81.591733130441327
          16   73.551188480607820        96.638995590227665
          32   91.593767813989018        116.65817140076214
          64   107.41379323761915        128.32512066346368
         128   121.33499652432221        147.80777892360237
         256   128.99627771476628        155.91619889220266
         512   135.02742063670030        161.30042382365372
        1024   137.02276709585524        164.11267056940963
        2048   138.62774254302394        165.22456826188971
        4096   139.27695763341924        166.34707691429571
        8192   147.64584950575932        166.59526981475742
       16384   147.91235479266419        166.77890398940283
       32768   150.77029430529927        166.90834867503827
       65536   151.59474472614465        166.84075600288520
      131072   155.75202672623249        166.96550283835097
      262144   155.36506626794849        166.78075976148853
      524288   155.64305086921487        167.44468828946083

- After patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   49.414771776821361        70.808060042286343
           8   72.918156402459772        93.234093684373946
          16   102.72461544178078        136.21700026949074
          32   160.57240200649090        205.97612602315186
          64   249.32082957447636        331.85515010907363
         128   385.71299236810387        522.06354804855266
         256   535.40608912076459        766.59668706247294
         512   669.47864120368524        1006.4275938227961
        1024   742.90538895500265        1187.9846039167674
        2048   789.71340557340523        1333.8411634622269
        4096   826.44253204731683        1395.5536995933605
        8192   832.93540316116662        1361.4621716558986
       16384   897.95081977010113        1469.0940087507722
       32768   961.18736308033317        1533.7736812111871
       65536   989.41384908496832        1564.7013916917260
      131072   1003.6113762068040        1597.4063253370084
      262144   980.03067664324396        1602.3188995993287
      524288   985.82645661078755        1568.9537807626730

Regtested on x86_64-unknown-linux-gnu, Ok for trunk?

2013-01-04  Janne Blomqvist  <jb@gcc.gnu.org>

	* io/file_pos.c (unformatted_backspace): Use __builtin_bswapXX
	instead of reverse_memcpy.
	* io/io.h (reverse_memcpy): Remove prototype.
	* io/transfer.c (reverse_memcpy): Make static, move towards
	beginning of file.
	(bswap_array): New function.
	(unformatted_read): Use bswap_array to byte swap the data
	in-place.
	(unformatted_write): Use a larger temp buffer and bswap_array.
	(us_read): Use __builtin_bswapXX instead of reverse_memcpy.
	(write_us_marker): Likewise.


-- 
Janne Blomqvist

[-- Attachment #2: us_perf2.f90 --]
[-- Type: application/octet-stream, Size: 1826 bytes --]

! Test performance of unformatted sequential with different sized records.
! Janne Blomqvist 2013
program us_perf
  implicit none
  integer, parameter :: d = 8
  integer, parameter :: i64 = selected_int_kind(18)
  integer :: ii
  real(d) :: wspeed, rspeed

  print *, 'Unformatted sequential write/read performance test'
  print *, 'Record size           Write MB/s                 Read MB/s'
  print *, '=========================================================='
  ii = 1
  do
     call run_us_test (ii, wspeed, rspeed)
     print *, ii*4, wspeed, rspeed
     if (ii > 100000) then
        exit
     end if
     ii = ii * 2
  end do

contains
  subroutine run_us_test (n, ws, rs)
    integer, intent(in) :: n
    real(d), intent(out) :: ws, rs
    integer, allocatable :: data(:)
    real(d) :: t1, t2
    integer :: ii, loops
    integer, parameter :: nsize = 10000000 ! 10 MB

    ! Write nsize * log(n + 1) bytes,  each record is n elements of 4 bytes each
    ! + two 4 byte record markers
    loops = nsize * log(n + 1._d) / (n*4._d + 8._d)

    allocate(data(n))
    data = 123
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='replace')
    call wtime(t1)
    do ii = 1, loops
       write (10) data
    end do
    call wtime(t2)
    close(10)
    ws = nsize * log(n+1._d) / 1024**2 / (t2-t1)
    open(10, file="usperf.dat", form='unformatted', access='sequential', status='old')
    call wtime(t1)
    do ii = 1, loops
       read (10) data
    end do
    call wtime(t2)
    close(10, status='delete')
    deallocate(data)
    rs = nsize * log(n+1._d) / 1024**2 / (t2-t1)
  end subroutine run_us_test

  subroutine wtime(t)
    real(d) :: t
    integer(i64):: count, rate
    call system_clock(count, rate)
    t = real(count, d) / rate
  end subroutine wtime
    
end program us_perf

[-- Attachment #3: bswap.diff --]
[-- Type: application/octet-stream, Size: 7619 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..bf2250a 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      m4 = *(GFC_INTEGER_4*)&u32;
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      m8 = *(GFC_INTEGER_8*)&u64;
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..eb77df8a 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,91 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  char buffer[16];
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    default:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  reverse_memcpy (buffer, ps, size);
+	  memcpy (pd, buffer, size);
+	  ps += size;
+	  pd += size;
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
 
-      p = dest;
 
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
+
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +986,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1010,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2204,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  i4 = *(GFC_INTEGER_4*)&u32;
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  i8 = *(GFC_INTEGER_8*)&u64;
 	  i = i8;
 	  break;
 
@@ -2169,6 +2227,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3095,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3123,20 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  u32 = __builtin_bswap32 (*(uint32_t*)&buf4);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  u64 = __builtin_bswap64 (*(uint64_t*)&buf8);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3773,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-04 22:15 [Patch, libfortran] Improve performance of byte swapped IO Janne Blomqvist
@ 2013-01-04 22:35 ` Andreas Schwab
  2013-01-05 15:35   ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Schwab @ 2013-01-04 22:35 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: GCC Patches, Fortran List

Janne Blomqvist <blomqvist.janne@gmail.com> writes:

> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
> index c8ecc3a..bf2250a 100644
> --- a/libgfortran/io/file_pos.c
> +++ b/libgfortran/io/file_pos.c
> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>  	}
>        else
>  	{
> +	  uint32_t u32;
> +	  uint64_t u64;
>  	  switch (length)
>  	    {
>  	    case sizeof(GFC_INTEGER_4):
> -	      reverse_memcpy (&m4, p, sizeof (m4));
> +	      memcpy (&u32, p, sizeof (u32));
> +	      u32 = __builtin_bswap32 (u32);
> +	      m4 = *(GFC_INTEGER_4*)&u32;

Isn't that an aliasing violation?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-04 22:35 ` Andreas Schwab
@ 2013-01-05 15:35   ` Richard Biener
  2013-01-05 21:13     ` Janne Blomqvist
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2013-01-05 15:35 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Janne Blomqvist, GCC Patches, Fortran List

On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>
>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>> index c8ecc3a..bf2250a 100644
>> --- a/libgfortran/io/file_pos.c
>> +++ b/libgfortran/io/file_pos.c
>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>       }
>>        else
>>       {
>> +       uint32_t u32;
>> +       uint64_t u64;
>>         switch (length)
>>           {
>>           case sizeof(GFC_INTEGER_4):
>> -           reverse_memcpy (&m4, p, sizeof (m4));
>> +           memcpy (&u32, p, sizeof (u32));
>> +           u32 = __builtin_bswap32 (u32);
>> +           m4 = *(GFC_INTEGER_4*)&u32;
>
> Isn't that an aliasing violation?

It looks like one.  Why not simply do

   m4 = (GFC_INTEGER_4) u32;

?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Richard.

>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-05 15:35   ` Richard Biener
@ 2013-01-05 21:13     ` Janne Blomqvist
  2013-01-06 11:33       ` Richard Biener
  2013-01-11 20:41       ` Janne Blomqvist
  0 siblings, 2 replies; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-05 21:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andreas Schwab, GCC Patches, Fortran List

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>
>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>> index c8ecc3a..bf2250a 100644
>>> --- a/libgfortran/io/file_pos.c
>>> +++ b/libgfortran/io/file_pos.c
>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>       }
>>>        else
>>>       {
>>> +       uint32_t u32;
>>> +       uint64_t u64;
>>>         switch (length)
>>>           {
>>>           case sizeof(GFC_INTEGER_4):
>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>> +           memcpy (&u32, p, sizeof (u32));
>>> +           u32 = __builtin_bswap32 (u32);
>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>
>> Isn't that an aliasing violation?
>
> It looks like one.  Why not simply do
>
>    m4 = (GFC_INTEGER_4) u32;
>
> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
the above, C99 6.3.1.3(3) says that if the unsigned value is outside
the range of the signed variable, the result is
implementation-defined. Though I suppose the sensible
"implementation-defined behavior" in this case on a two's complement
target is to just do a bitwise copy.

Anyway, to be really safe one could use memcpy instead; the compiler
optimizes small fixed size memcpy's just fine. Updated patch attached.


-- 
Janne Blomqvist

[-- Attachment #2: bswap2.diff --]
[-- Type: application/octet-stream, Size: 7690 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..22d19dd 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      memcpy (&m4, &u32, sizeof (m4));
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      memcpy (&m8, &u64, sizeof (m8));
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..5cefc38 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,91 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  char buffer[16];
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    default:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  reverse_memcpy (buffer, ps, size);
+	  memcpy (pd, buffer, size);
+	  ps += size;
+	  pd += size;
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
 
-      p = dest;
 
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
+
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +986,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1010,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2204,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  memcpy (&i4, &u32, sizeof (i4));
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  memcpy (&i8, &u64, sizeof (i8));
 	  i = i8;
 	  break;
 
@@ -2169,6 +2227,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3095,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3123,22 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u32, &buf4, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u64, &buf8, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3775,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-05 21:13     ` Janne Blomqvist
@ 2013-01-06 11:33       ` Richard Biener
  2013-01-11 20:41       ` Janne Blomqvist
  1 sibling, 0 replies; 10+ messages in thread
From: Richard Biener @ 2013-01-06 11:33 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Andreas Schwab, GCC Patches, Fortran List

On Sat, Jan 5, 2013 at 10:13 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>       }
>>>>        else
>>>>       {
>>>> +       uint32_t u32;
>>>> +       uint64_t u64;
>>>>         switch (length)
>>>>           {
>>>>           case sizeof(GFC_INTEGER_4):
>>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>>> +           memcpy (&u32, p, sizeof (u32));
>>>> +           u32 = __builtin_bswap32 (u32);
>>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one.  Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.

As libgfortran is a target library and thus always compiled by GCC you
can rely on GCCs documented implementation-defined behavior here
(which is to do bitwise re-interpretation).  No need to obfuscate the
code more than necessary.

Richard.

> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-05 21:13     ` Janne Blomqvist
  2013-01-06 11:33       ` Richard Biener
@ 2013-01-11 20:41       ` Janne Blomqvist
  2013-01-13 22:44         ` Janne Blomqvist
  1 sibling, 1 reply; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-11 20:41 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andreas Schwab, GCC Patches, Fortran List

[-- Attachment #1: Type: text/plain, Size: 4491 bytes --]

PING.

Slightly updated patch attached, which further improves the generic
size fallback that is used when the element size is not 2/4/8 bytes.
Changing the us_perf benchmark to use real(10), with the v2 patch the
performance is:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   59.028550429522085        86.019754350948787
           8   79.028327063130590        95.803502000733374
          16   99.980457395413296        138.68367462874946
          32   122.56886206338788        180.05609910155042
          64   152.00478266944486        212.69931319407567
         128   197.74137934940202        235.19728791956828
         256   155.36245780017779        244.60578379215929
         512   157.13385845966246        245.07467397691480
        1024   177.26553799130201        260.44908357795623
        2048   208.22852888945587        260.21587143113527
        4096   222.88410474980634        262.66162209490591
        8192   226.71167580652920        265.81191407123663
       16384   206.51818241747065        263.59395165591724
       32768   230.18707026455866        265.88990325026526
       65536   229.19783089391504        268.04485112932684
      131072   231.12215662044449        267.40543904427710
      262144   230.72012123598142        267.60086931504122
      524288   230.48959460456055        268.78750211303725

With the new v3 patch I get

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   59.779061121239941        92.777125264010024
           8   92.727504266051341        126.64775563782673
          16   128.94793911163904        184.69194300482837
          32   169.78916283536847        267.06752001266767
          64   209.50296476919556        341.60515130910238
         128   236.36709738360679        416.73212655882151
         256   251.79029695383340        465.46804746749740
         512   259.62269939828633        500.87346060356265
        1024   265.08842337586458        508.95530627428275
        2048   268.71795530051884        532.12211365683640
        4096   280.86546884821030        546.88907054369884
        8192   286.96049684823578        569.60958187426183
       16384   292.04368984868103        608.11503416324865
       32768   292.96677387959392        629.80651297065833
       65536   291.69098580137114        624.27103478079641
      131072   292.75666234956418        605.99766136491496
      262144   291.35520038228975        611.59061455535834
      524288   292.15446100501691        623.76232623081580


On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>       }
>>>>        else
>>>>       {
>>>> +       uint32_t u32;
>>>> +       uint64_t u64;
>>>>         switch (length)
>>>>           {
>>>>           case sizeof(GFC_INTEGER_4):
>>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>>> +           memcpy (&u32, p, sizeof (u32));
>>>> +           u32 = __builtin_bswap32 (u32);
>>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one.  Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.
>
> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist

[-- Attachment #2: bswap3.diff --]
[-- Type: application/octet-stream, Size: 8093 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..22d19dd 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      memcpy (&m4, &u32, sizeof (m4));
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      memcpy (&m8, &u64, sizeof (m8));
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..d812fc1 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,111 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap completely, or then
+   they must point to separate objects; partial overlaps are not
+   allowed.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    default:
+      pd = dest;
+      if (dest != src)
+	{
+	  ps = src;
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      reverse_memcpy (pd, ps, size);
+	      ps += size;
+	      pd += size;
+	    }
+	}
+      else
+	{
+	  /* In-place byte swap.  */
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      char tmp, *low = pd, *high = pd + size - 1;
+	      for (size_t j = 0; j < size/2; j++)
+		{
+		  tmp = *low;
+		  *low = *high;
+		  *high = tmp;
+		  low++;
+		  high--;
+		}
+	      pd += size;
+	    }
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
 
-      p = dest;
 
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
+
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +1006,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1030,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2224,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  memcpy (&i4, &u32, sizeof (i4));
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  memcpy (&i8, &u64, sizeof (i8));
 	  i = i8;
 	  break;
 
@@ -2169,6 +2247,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3115,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3143,22 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u32, &buf4, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u64, &buf8, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3795,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-11 20:41       ` Janne Blomqvist
@ 2013-01-13 22:44         ` Janne Blomqvist
  2013-01-18 22:30           ` Janne Blomqvist
  0 siblings, 1 reply; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-13 22:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andreas Schwab, GCC Patches, Fortran List

[-- Attachment #1: Type: text/plain, Size: 7582 bytes --]

PING**1.2

Yet another slightly updated patch attached. Compared to the previous
version, now with specializations for size 12 and 16 as well. For the
real(10) benchmark, with the previous v3 patch (please disregard the
absolute values in the post quoted below, there were wrong due to a
bug):

  Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   80.578833140738340        127.33074266188656
           8   137.61682156650559        184.49033790407984
          16   202.72871312800621        275.98801561061816
          32   275.33538767460863        413.43956672052303
          64   341.04488670485119        555.13744525826564
         128   384.77917051919820        671.44655208024699
         256   410.97208129045833        763.97660513918527
         512   425.76619227779878        826.41086693364593
        1024   430.77035999730009        840.30757120448550
        2048   438.30318459339475        885.50033810296600
        4096   455.79422809097599        919.78265920652086
        8192   465.74499205886326        959.06963983370918
       16384   472.48133493971142        991.11244162081744
       32768   471.00024619567603        1015.7428144049615
       65536   474.91235280949985        1021.2150519080892
      131072   475.18664487440901        1006.3701982554830
      262144   478.00435092846868        985.17141300594039
      524288   476.72837201590363        991.74226579987987

With the new v4 patch:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   87.353141847504133        145.09410391177835
           8   166.95093628370549        223.60877830048437
          16   272.20937208187746        364.91673986840277
          32   415.26016354252715        599.41744252952310
          64   592.97676703528009        900.53345964312450
         128   748.27218547147686        1189.7131837787238
         256   874.83098506714384        1561.3649529261234
         512   935.69494481144284        1823.1760143164879
        1024   983.51689491813215        1931.8773088107300
        2048   1009.5491761651396        1971.6978586130062
        4096   1115.5862027658552        2119.4151169997808
        8192   1172.9400229568287        2184.1403983641089
       16384   1222.6659284153168        2258.5490449229878
       32768   1242.2417626697293        2251.8159046253918
       65536   1227.9967555594396        2313.4106672387143
      131072   1204.4295656544052        2129.1309150039478
      262144   1135.7905614378458        2154.7146453789856
      524288   1075.5769074402640        2170.5151501933169


On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> PING.
>
> Slightly updated patch attached, which further improves the generic
> size fallback that is used when the element size is not 2/4/8 bytes.
> Changing the us_perf benchmark to use real(10), with the v2 patch the
> performance is:
>
>  Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   59.028550429522085        86.019754350948787
>            8   79.028327063130590        95.803502000733374
>           16   99.980457395413296        138.68367462874946
>           32   122.56886206338788        180.05609910155042
>           64   152.00478266944486        212.69931319407567
>          128   197.74137934940202        235.19728791956828
>          256   155.36245780017779        244.60578379215929
>          512   157.13385845966246        245.07467397691480
>         1024   177.26553799130201        260.44908357795623
>         2048   208.22852888945587        260.21587143113527
>         4096   222.88410474980634        262.66162209490591
>         8192   226.71167580652920        265.81191407123663
>        16384   206.51818241747065        263.59395165591724
>        32768   230.18707026455866        265.88990325026526
>        65536   229.19783089391504        268.04485112932684
>       131072   231.12215662044449        267.40543904427710
>       262144   230.72012123598142        267.60086931504122
>       524288   230.48959460456055        268.78750211303725
>
> With the new v3 patch I get
>
>  Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   59.779061121239941        92.777125264010024
>            8   92.727504266051341        126.64775563782673
>           16   128.94793911163904        184.69194300482837
>           32   169.78916283536847        267.06752001266767
>           64   209.50296476919556        341.60515130910238
>          128   236.36709738360679        416.73212655882151
>          256   251.79029695383340        465.46804746749740
>          512   259.62269939828633        500.87346060356265
>         1024   265.08842337586458        508.95530627428275
>         2048   268.71795530051884        532.12211365683640
>         4096   280.86546884821030        546.88907054369884
>         8192   286.96049684823578        569.60958187426183
>        16384   292.04368984868103        608.11503416324865
>        32768   292.96677387959392        629.80651297065833
>        65536   291.69098580137114        624.27103478079641
>       131072   292.75666234956418        605.99766136491496
>       262144   291.35520038228975        611.59061455535834
>       524288   292.15446100501691        623.76232623081580
>
>
> On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
> <blomqvist.janne@gmail.com> wrote:
>> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>>
>>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>>> index c8ecc3a..bf2250a 100644
>>>>> --- a/libgfortran/io/file_pos.c
>>>>> +++ b/libgfortran/io/file_pos.c
>>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>>       }
>>>>>        else
>>>>>       {
>>>>> +       uint32_t u32;
>>>>> +       uint64_t u64;
>>>>>         switch (length)
>>>>>           {
>>>>>           case sizeof(GFC_INTEGER_4):
>>>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>>>> +           memcpy (&u32, p, sizeof (u32));
>>>>> +           u32 = __builtin_bswap32 (u32);
>>>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>>>
>>>> Isn't that an aliasing violation?
>>>
>>> It looks like one.  Why not simply do
>>>
>>>    m4 = (GFC_INTEGER_4) u32;
>>>
>>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>>
>> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
>> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
>> the range of the signed variable, the result is
>> implementation-defined. Though I suppose the sensible
>> "implementation-defined behavior" in this case on a two's complement
>> target is to just do a bitwise copy.
>>
>> Anyway, to be really safe one could use memcpy instead; the compiler
>> optimizes small fixed size memcpy's just fine. Updated patch attached.
>>
>>
>> --
>> Janne Blomqvist
>
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist

[-- Attachment #2: bswap4.diff --]
[-- Type: application/octet-stream, Size: 8772 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..22d19dd 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      memcpy (&m4, &u32, sizeof (m4));
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      memcpy (&m8, &u64, sizeof (m8));
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..020b6f0 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,138 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap completely, or then
+   they must point to separate objects; partial overlaps are not
+   allowed.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    case 12:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  uint32_t tmp;
+	  memcpy (&tmp, ps, 4);
+	  *(uint32_t*)pd = __builtin_bswap32 (*(uint32_t*)(ps + 8));
+	  *(uint32_t*)(pd + 4) = __builtin_bswap32 (*(uint32_t*)(ps + 4));
+	  *(uint32_t*)(pd + 8) = __builtin_bswap32 (tmp);
+	  ps += size;
+	  pd += size;
+	}
+      break;
+    case 16:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  uint64_t tmp;
+	  memcpy (&tmp, ps, 8);
+	  *(uint64_t*)pd = __builtin_bswap64 (*(uint64_t*)(ps + 8));
+	  *(uint64_t*)(pd + 8) = __builtin_bswap64 (tmp);
+	  ps += size;
+	  pd += size;
+	}
+      break;
+    default:
+      pd = dest;
+      if (dest != src)
+	{
+	  ps = src;
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      reverse_memcpy (pd, ps, size);
+	      ps += size;
+	      pd += size;
+	    }
+	}
+      else
+	{
+	  /* In-place byte swap.  */
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      char tmp, *low = pd, *high = pd + size - 1;
+	      for (size_t j = 0; j < size/2; j++)
+		{
+		  tmp = *low;
+		  *low = *high;
+		  *high = tmp;
+		  low++;
+		  high--;
+		}
+	      pd += size;
+	    }
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
+
 
-      p = dest;
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
 
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +1033,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1057,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2251,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  memcpy (&i4, &u32, sizeof (i4));
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  memcpy (&i8, &u64, sizeof (i8));
 	  i = i8;
 	  break;
 
@@ -2169,6 +2274,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3142,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3170,22 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u32, &buf4, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u64, &buf8, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3822,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-13 22:44         ` Janne Blomqvist
@ 2013-01-18 22:30           ` Janne Blomqvist
  2013-01-22 22:32             ` Thomas Koenig
  0 siblings, 1 reply; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-18 22:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andreas Schwab, GCC Patches, Fortran List

PING**2

On Mon, Jan 14, 2013 at 12:44 AM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> PING**1.2
>
> Yet another slightly updated patch attached. Compared to the previous
> version, now with specializations for size 12 and 16 as well. For the
> real(10) benchmark, with the previous v3 patch (please disregard the
> absolute values in the post quoted below, there were wrong due to a
> bug):
>
>   Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   80.578833140738340        127.33074266188656
>            8   137.61682156650559        184.49033790407984
>           16   202.72871312800621        275.98801561061816
>           32   275.33538767460863        413.43956672052303
>           64   341.04488670485119        555.13744525826564
>          128   384.77917051919820        671.44655208024699
>          256   410.97208129045833        763.97660513918527
>          512   425.76619227779878        826.41086693364593
>         1024   430.77035999730009        840.30757120448550
>         2048   438.30318459339475        885.50033810296600
>         4096   455.79422809097599        919.78265920652086
>         8192   465.74499205886326        959.06963983370918
>        16384   472.48133493971142        991.11244162081744
>        32768   471.00024619567603        1015.7428144049615
>        65536   474.91235280949985        1021.2150519080892
>       131072   475.18664487440901        1006.3701982554830
>       262144   478.00435092846868        985.17141300594039
>       524288   476.72837201590363        991.74226579987987
>
> With the new v4 patch:
>
>  Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   87.353141847504133        145.09410391177835
>            8   166.95093628370549        223.60877830048437
>           16   272.20937208187746        364.91673986840277
>           32   415.26016354252715        599.41744252952310
>           64   592.97676703528009        900.53345964312450
>          128   748.27218547147686        1189.7131837787238
>          256   874.83098506714384        1561.3649529261234
>          512   935.69494481144284        1823.1760143164879
>         1024   983.51689491813215        1931.8773088107300
>         2048   1009.5491761651396        1971.6978586130062
>         4096   1115.5862027658552        2119.4151169997808
>         8192   1172.9400229568287        2184.1403983641089
>        16384   1222.6659284153168        2258.5490449229878
>        32768   1242.2417626697293        2251.8159046253918
>        65536   1227.9967555594396        2313.4106672387143
>       131072   1204.4295656544052        2129.1309150039478
>       262144   1135.7905614378458        2154.7146453789856
>       524288   1075.5769074402640        2170.5151501933169
>
>
> On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist
> <blomqvist.janne@gmail.com> wrote:
>> PING.
>>
>> Slightly updated patch attached, which further improves the generic
>> size fallback that is used when the element size is not 2/4/8 bytes.
>> Changing the us_perf benchmark to use real(10), with the v2 patch the
>> performance is:
>>
>>  Unformatted sequential write/read performance test
>>  Record size           Write MB/s                 Read MB/s
>>  ==========================================================
>>            4   59.028550429522085        86.019754350948787
>>            8   79.028327063130590        95.803502000733374
>>           16   99.980457395413296        138.68367462874946
>>           32   122.56886206338788        180.05609910155042
>>           64   152.00478266944486        212.69931319407567
>>          128   197.74137934940202        235.19728791956828
>>          256   155.36245780017779        244.60578379215929
>>          512   157.13385845966246        245.07467397691480
>>         1024   177.26553799130201        260.44908357795623
>>         2048   208.22852888945587        260.21587143113527
>>         4096   222.88410474980634        262.66162209490591
>>         8192   226.71167580652920        265.81191407123663
>>        16384   206.51818241747065        263.59395165591724
>>        32768   230.18707026455866        265.88990325026526
>>        65536   229.19783089391504        268.04485112932684
>>       131072   231.12215662044449        267.40543904427710
>>       262144   230.72012123598142        267.60086931504122
>>       524288   230.48959460456055        268.78750211303725
>>
>> With the new v3 patch I get
>>
>>  Unformatted sequential write/read performance test
>>  Record size           Write MB/s                 Read MB/s
>>  ==========================================================
>>            4   59.779061121239941        92.777125264010024
>>            8   92.727504266051341        126.64775563782673
>>           16   128.94793911163904        184.69194300482837
>>           32   169.78916283536847        267.06752001266767
>>           64   209.50296476919556        341.60515130910238
>>          128   236.36709738360679        416.73212655882151
>>          256   251.79029695383340        465.46804746749740
>>          512   259.62269939828633        500.87346060356265
>>         1024   265.08842337586458        508.95530627428275
>>         2048   268.71795530051884        532.12211365683640
>>         4096   280.86546884821030        546.88907054369884
>>         8192   286.96049684823578        569.60958187426183
>>        16384   292.04368984868103        608.11503416324865
>>        32768   292.96677387959392        629.80651297065833
>>        65536   291.69098580137114        624.27103478079641
>>       131072   292.75666234956418        605.99766136491496
>>       262144   291.35520038228975        611.59061455535834
>>       524288   292.15446100501691        623.76232623081580
>>
>>
>> On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
>> <blomqvist.janne@gmail.com> wrote:
>>> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>>>
>>>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>>>> index c8ecc3a..bf2250a 100644
>>>>>> --- a/libgfortran/io/file_pos.c
>>>>>> +++ b/libgfortran/io/file_pos.c
>>>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>>>       }
>>>>>>        else
>>>>>>       {
>>>>>> +       uint32_t u32;
>>>>>> +       uint64_t u64;
>>>>>>         switch (length)
>>>>>>           {
>>>>>>           case sizeof(GFC_INTEGER_4):
>>>>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>>>>> +           memcpy (&u32, p, sizeof (u32));
>>>>>> +           u32 = __builtin_bswap32 (u32);
>>>>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>>>>
>>>>> Isn't that an aliasing violation?
>>>>
>>>> It looks like one.  Why not simply do
>>>>
>>>>    m4 = (GFC_INTEGER_4) u32;
>>>>
>>>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>>>
>>> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
>>> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
>>> the range of the signed variable, the result is
>>> implementation-defined. Though I suppose the sensible
>>> "implementation-defined behavior" in this case on a two's complement
>>> target is to just do a bitwise copy.
>>>
>>> Anyway, to be really safe one could use memcpy instead; the compiler
>>> optimizes small fixed size memcpy's just fine. Updated patch attached.
>>>
>>>
>>> --
>>> Janne Blomqvist
>>
>>
>>
>> --
>> Janne Blomqvist
>
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-18 22:30           ` Janne Blomqvist
@ 2013-01-22 22:32             ` Thomas Koenig
  2013-01-23 21:57               ` Janne Blomqvist
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Koenig @ 2013-01-22 22:32 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Richard Biener, Andreas Schwab, GCC Patches, Fortran List

Hi Janne,

> PING**2

this is OK.  Thanks a lot for the work you put into this!

	Thomas


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch, libfortran] Improve performance of byte swapped IO
  2013-01-22 22:32             ` Thomas Koenig
@ 2013-01-23 21:57               ` Janne Blomqvist
  0 siblings, 0 replies; 10+ messages in thread
From: Janne Blomqvist @ 2013-01-23 21:57 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: Richard Biener, Andreas Schwab, GCC Patches, Fortran List

On Wed, Jan 23, 2013 at 12:32 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hi Janne,
>
>> PING**2
>
>
> this is OK.  Thanks a lot for the work you put into this!

Thanks for the review; committed as r195413.

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-01-23 21:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-04 22:15 [Patch, libfortran] Improve performance of byte swapped IO Janne Blomqvist
2013-01-04 22:35 ` Andreas Schwab
2013-01-05 15:35   ` Richard Biener
2013-01-05 21:13     ` Janne Blomqvist
2013-01-06 11:33       ` Richard Biener
2013-01-11 20:41       ` Janne Blomqvist
2013-01-13 22:44         ` Janne Blomqvist
2013-01-18 22:30           ` Janne Blomqvist
2013-01-22 22:32             ` Thomas Koenig
2013-01-23 21:57               ` Janne Blomqvist

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).