public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Janne Blomqvist <blomqvist.janne@gmail.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
		Fortran List <fortran@gcc.gnu.org>
Subject: Re: [Patch, libfortran] Improve performance of byte swapped IO
Date: Sun, 13 Jan 2013 22:44:00 -0000	[thread overview]
Message-ID: <CAO9iq9FLjgPtW4_e0CEdU48r-baQ7Uvx6P8OhWMcWiAmqtcZyw@mail.gmail.com> (raw)
In-Reply-To: <CAO9iq9GvGxrjzyByskWoua5JLJbJrcPeW9tHpeL3wWKs91cNdg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7582 bytes --]

PING**1.2

Yet another slightly updated patch attached. Compared to the previous
version, now with specializations for size 12 and 16 as well. For the
real(10) benchmark, with the previous v3 patch (please disregard the
absolute values in the post quoted below, there were wrong due to a
bug):

  Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   80.578833140738340        127.33074266188656
           8   137.61682156650559        184.49033790407984
          16   202.72871312800621        275.98801561061816
          32   275.33538767460863        413.43956672052303
          64   341.04488670485119        555.13744525826564
         128   384.77917051919820        671.44655208024699
         256   410.97208129045833        763.97660513918527
         512   425.76619227779878        826.41086693364593
        1024   430.77035999730009        840.30757120448550
        2048   438.30318459339475        885.50033810296600
        4096   455.79422809097599        919.78265920652086
        8192   465.74499205886326        959.06963983370918
       16384   472.48133493971142        991.11244162081744
       32768   471.00024619567603        1015.7428144049615
       65536   474.91235280949985        1021.2150519080892
      131072   475.18664487440901        1006.3701982554830
      262144   478.00435092846868        985.17141300594039
      524288   476.72837201590363        991.74226579987987

With the new v4 patch:

 Unformatted sequential write/read performance test
 Record size           Write MB/s                 Read MB/s
 ==========================================================
           4   87.353141847504133        145.09410391177835
           8   166.95093628370549        223.60877830048437
          16   272.20937208187746        364.91673986840277
          32   415.26016354252715        599.41744252952310
          64   592.97676703528009        900.53345964312450
         128   748.27218547147686        1189.7131837787238
         256   874.83098506714384        1561.3649529261234
         512   935.69494481144284        1823.1760143164879
        1024   983.51689491813215        1931.8773088107300
        2048   1009.5491761651396        1971.6978586130062
        4096   1115.5862027658552        2119.4151169997808
        8192   1172.9400229568287        2184.1403983641089
       16384   1222.6659284153168        2258.5490449229878
       32768   1242.2417626697293        2251.8159046253918
       65536   1227.9967555594396        2313.4106672387143
      131072   1204.4295656544052        2129.1309150039478
      262144   1135.7905614378458        2154.7146453789856
      524288   1075.5769074402640        2170.5151501933169


On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> PING.
>
> Slightly updated patch attached, which further improves the generic
> size fallback that is used when the element size is not 2/4/8 bytes.
> Changing the us_perf benchmark to use real(10), with the v2 patch the
> performance is:
>
>  Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   59.028550429522085        86.019754350948787
>            8   79.028327063130590        95.803502000733374
>           16   99.980457395413296        138.68367462874946
>           32   122.56886206338788        180.05609910155042
>           64   152.00478266944486        212.69931319407567
>          128   197.74137934940202        235.19728791956828
>          256   155.36245780017779        244.60578379215929
>          512   157.13385845966246        245.07467397691480
>         1024   177.26553799130201        260.44908357795623
>         2048   208.22852888945587        260.21587143113527
>         4096   222.88410474980634        262.66162209490591
>         8192   226.71167580652920        265.81191407123663
>        16384   206.51818241747065        263.59395165591724
>        32768   230.18707026455866        265.88990325026526
>        65536   229.19783089391504        268.04485112932684
>       131072   231.12215662044449        267.40543904427710
>       262144   230.72012123598142        267.60086931504122
>       524288   230.48959460456055        268.78750211303725
>
> With the new v3 patch I get
>
>  Unformatted sequential write/read performance test
>  Record size           Write MB/s                 Read MB/s
>  ==========================================================
>            4   59.779061121239941        92.777125264010024
>            8   92.727504266051341        126.64775563782673
>           16   128.94793911163904        184.69194300482837
>           32   169.78916283536847        267.06752001266767
>           64   209.50296476919556        341.60515130910238
>          128   236.36709738360679        416.73212655882151
>          256   251.79029695383340        465.46804746749740
>          512   259.62269939828633        500.87346060356265
>         1024   265.08842337586458        508.95530627428275
>         2048   268.71795530051884        532.12211365683640
>         4096   280.86546884821030        546.88907054369884
>         8192   286.96049684823578        569.60958187426183
>        16384   292.04368984868103        608.11503416324865
>        32768   292.96677387959392        629.80651297065833
>        65536   291.69098580137114        624.27103478079641
>       131072   292.75666234956418        605.99766136491496
>       262144   291.35520038228975        611.59061455535834
>       524288   292.15446100501691        623.76232623081580
>
>
> On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
> <blomqvist.janne@gmail.com> wrote:
>> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>>
>>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>>> index c8ecc3a..bf2250a 100644
>>>>> --- a/libgfortran/io/file_pos.c
>>>>> +++ b/libgfortran/io/file_pos.c
>>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>>       }
>>>>>        else
>>>>>       {
>>>>> +       uint32_t u32;
>>>>> +       uint64_t u64;
>>>>>         switch (length)
>>>>>           {
>>>>>           case sizeof(GFC_INTEGER_4):
>>>>> -           reverse_memcpy (&m4, p, sizeof (m4));
>>>>> +           memcpy (&u32, p, sizeof (u32));
>>>>> +           u32 = __builtin_bswap32 (u32);
>>>>> +           m4 = *(GFC_INTEGER_4*)&u32;
>>>>
>>>> Isn't that an aliasing violation?
>>>
>>> It looks like one.  Why not simply do
>>>
>>>    m4 = (GFC_INTEGER_4) u32;
>>>
>>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>>
>> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
>> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
>> the range of the signed variable, the result is
>> implementation-defined. Though I suppose the sensible
>> "implementation-defined behavior" in this case on a two's complement
>> target is to just do a bitwise copy.
>>
>> Anyway, to be really safe one could use memcpy instead; the compiler
>> optimizes small fixed size memcpy's just fine. Updated patch attached.
>>
>>
>> --
>> Janne Blomqvist
>
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist

[-- Attachment #2: bswap4.diff --]
[-- Type: application/octet-stream, Size: 8772 bytes --]

diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..22d19dd 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
 	}
       else
 	{
+	  uint32_t u32;
+	  uint64_t u64;
 	  switch (length)
 	    {
 	    case sizeof(GFC_INTEGER_4):
-	      reverse_memcpy (&m4, p, sizeof (m4));
+	      memcpy (&u32, p, sizeof (u32));
+	      u32 = __builtin_bswap32 (u32);
+	      memcpy (&m4, &u32, sizeof (m4));
 	      m = m4;
 	      break;
 
 	    case sizeof(GFC_INTEGER_8):
-	      reverse_memcpy (&m8, p, sizeof (m8));
+	      memcpy (&u64, p, sizeof (u64));
+	      u64 = __builtin_bswap64 (u64);
+	      memcpy (&m8, &u64, sizeof (m8));
 	      m = m8;
 	      break;
 
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
 extern void next_record (st_parameter_dt *, int);
 internal_proto(next_record);
 
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
 extern void st_wait (st_parameter_wait *);
 export_proto(st_wait);
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..020b6f0 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,138 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
 }
 
 
-/* Master function for unformatted reads.  */
+/* Reverse memcpy - used for byte swapping.  */
 
 static void
-unformatted_read (st_parameter_dt *dtp, bt type,
-		  void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
 {
-  if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
-      || kind == 1)
+  char *d, *s;
+  size_t i;
+
+  d = (char *) dest;
+  s = (char *) src + n - 1;
+
+  /* Write with ascending order - this is likely faster
+     on modern architectures because of write combining.  */
+  for (i=0; i<n; i++)
+      *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+   builtins if possible. dest and src can overlap completely, or then
+   they must point to separate objects; partial overlaps are not
+   allowed.  */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+  const char *ps; 
+  char *pd;
+
+  switch (size)
     {
-      if (type == BT_CHARACTER)
-	size *= GFC_SIZE_OF_CHAR_KIND(kind);
-      read_block_direct (dtp, dest, size * nelems);
+    case 1:
+      break;
+    case 2:
+      for (size_t i = 0; i < nelems; i++)
+	((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+      break;
+    case 4:
+      for (size_t i = 0; i < nelems; i++)
+	((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+      break;
+    case 8:
+      for (size_t i = 0; i < nelems; i++)
+	((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+      break;
+    case 12:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  uint32_t tmp;
+	  memcpy (&tmp, ps, 4);
+	  *(uint32_t*)pd = __builtin_bswap32 (*(uint32_t*)(ps + 8));
+	  *(uint32_t*)(pd + 4) = __builtin_bswap32 (*(uint32_t*)(ps + 4));
+	  *(uint32_t*)(pd + 8) = __builtin_bswap32 (tmp);
+	  ps += size;
+	  pd += size;
+	}
+      break;
+    case 16:
+      ps = src;
+      pd = dest;
+      for (size_t i = 0; i < nelems; i++)
+	{
+	  uint64_t tmp;
+	  memcpy (&tmp, ps, 8);
+	  *(uint64_t*)pd = __builtin_bswap64 (*(uint64_t*)(ps + 8));
+	  *(uint64_t*)(pd + 8) = __builtin_bswap64 (tmp);
+	  ps += size;
+	  pd += size;
+	}
+      break;
+    default:
+      pd = dest;
+      if (dest != src)
+	{
+	  ps = src;
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      reverse_memcpy (pd, ps, size);
+	      ps += size;
+	      pd += size;
+	    }
+	}
+      else
+	{
+	  /* In-place byte swap.  */
+	  for (size_t i = 0; i < nelems; i++)
+	    {
+	      char tmp, *low = pd, *high = pd + size - 1;
+	      for (size_t j = 0; j < size/2; j++)
+		{
+		  tmp = *low;
+		  *low = *high;
+		  *high = tmp;
+		  low++;
+		  high--;
+		}
+	      pd += size;
+	    }
+	}
     }
-  else
-    {
-      char buffer[16];
-      char *p;
-      size_t i;
+}
+
 
-      p = dest;
+/* Master function for unformatted reads.  */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+		  void *dest, int kind, size_t size, size_t nelems)
+{
+  if (type == BT_CHARACTER)
+    size *= GFC_SIZE_OF_CHAR_KIND(kind);
+  read_block_direct (dtp, dest, size * nelems);
 
+  if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+      && kind != 1)
+    {
       /* Handle wide chracters.  */
-      if (type == BT_CHARACTER && kind != 1)
-	{
-	  nelems *= size;
-	  size = kind;
-	}
+      if (type == BT_CHARACTER)
+  	{
+  	  nelems *= size;
+  	  size = kind;
+  	}
 
       /* Break up complex into its constituent reals.  */
-      if (type == BT_COMPLEX)
-	{
-	  nelems *= 2;
-	  size /= 2;
-	}
-      
-      /* By now, all complex variables have been split into their
-	 constituent reals.  */
-      
-      for (i = 0; i < nelems; i++)
-	{
- 	  read_block_direct (dtp, buffer, size);
- 	  reverse_memcpy (p, buffer, size);
- 	  p += size;
- 	}
+      else if (type == BT_COMPLEX)
+  	{
+  	  nelems *= 2;
+  	  size /= 2;
+  	}
+      bswap_array (dest, dest, size, nelems);
     }
 }
 
@@ -945,9 +1033,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
     }
   else
     {
-      char buffer[16];
+#define BSWAP_BUFSZ 512
+      char buffer[BSWAP_BUFSZ];
       char *p;
-      size_t i;
+      size_t nrem;
 
       p = source;
 
@@ -968,12 +1057,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
       /* By now, all complex variables have been split into their
 	 constituent reals.  */
 
-      for (i = 0; i < nelems; i++)
+      nrem = nelems;
+      do
 	{
-	  reverse_memcpy(buffer, p, size);
- 	  p += size;
-	  write_buf (dtp, buffer, size);
+	  size_t nc;
+	  if (size * nrem > BSWAP_BUFSZ)
+	    nc = BSWAP_BUFSZ / size;
+	  else
+	    nc = nrem;
+
+	  bswap_array (buffer, p, size, nc);
+	  write_buf (dtp, buffer, size * nc);
+	  p += size * nc;
+	  nrem -= nc;
 	}
+      while (nrem > 0);
     }
 }
 
@@ -2153,15 +2251,22 @@ us_read (st_parameter_dt *dtp, int continued)
 	}
     }
   else
+    {
+      uint32_t u32;
+      uint64_t u64;
       switch (nr)
 	{
 	case sizeof(GFC_INTEGER_4):
-	  reverse_memcpy (&i4, &i, sizeof (i4));
+	  memcpy (&u32, &i, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  memcpy (&i4, &u32, sizeof (i4));
 	  i = i4;
 	  break;
 
 	case sizeof(GFC_INTEGER_8):
-	  reverse_memcpy (&i8, &i, sizeof (i8));
+	  memcpy (&u64, &i, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  memcpy (&i8, &u64, sizeof (i8));
 	  i = i8;
 	  break;
 
@@ -2169,6 +2274,7 @@ us_read (st_parameter_dt *dtp, int continued)
 	  runtime_error ("Illegal value for record marker");
 	  break;
 	}
+    }
 
   if (i >= 0)
     {
@@ -3036,7 +3142,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
   size_t len;
   GFC_INTEGER_4 buf4;
   GFC_INTEGER_8 buf8;
-  char p[sizeof (GFC_INTEGER_8)];
 
   if (compile_options.record_marker == 0)
     len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3170,22 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
     }
   else
     {
+      uint32_t u32;
+      uint64_t u64;
       switch (len)
 	{
 	case sizeof (GFC_INTEGER_4):
 	  buf4 = buf;
-	  reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u32, &buf4, sizeof (u32));
+	  u32 = __builtin_bswap32 (u32);
+	  return swrite (dtp->u.p.current_unit->s, &u32, len);
 	  break;
 
 	case sizeof (GFC_INTEGER_8):
 	  buf8 = buf;
-	  reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
-	  return swrite (dtp->u.p.current_unit->s, p, len);
+	  memcpy (&u64, &buf8, sizeof (u64));
+	  u64 = __builtin_bswap64 (u64);
+	  return swrite (dtp->u.p.current_unit->s, &u64, len);
 	  break;
 
 	default:
@@ -3713,22 +3822,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
   GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
 }
 
-/* Reverse memcpy - used for byte swapping.  */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
-  char *d, *s;
-  size_t i;
-
-  d = (char *) dest;
-  s = (char *) src + n - 1;
-
-  /* Write with ascending order - this is likely faster
-     on modern architectures because of write combining.  */
-  for (i=0; i<n; i++)
-      *(d++) = *(s--);
-}
-
 
 /* Once upon a time, a poor innocent Fortran program was reading a
    file, when suddenly it hit the end-of-file (EOF).  Unfortunately

  reply	other threads:[~2013-01-13 22:44 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-04 22:15 Janne Blomqvist
2013-01-04 22:35 ` Andreas Schwab
2013-01-05 15:35   ` Richard Biener
2013-01-05 21:13     ` Janne Blomqvist
2013-01-06 11:33       ` Richard Biener
2013-01-11 20:41       ` Janne Blomqvist
2013-01-13 22:44         ` Janne Blomqvist [this message]
2013-01-18 22:30           ` Janne Blomqvist
2013-01-22 22:32             ` Thomas Koenig
2013-01-23 21:57               ` Janne Blomqvist

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAO9iq9FLjgPtW4_e0CEdU48r-baQ7Uvx6P8OhWMcWiAmqtcZyw@mail.gmail.com \
    --to=blomqvist.janne@gmail.com \
    --cc=fortran@gcc.gnu.org \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=schwab@linux-m68k.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).