From: Janne Blomqvist <blomqvist.janne@gmail.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>,
GCC Patches <gcc-patches@gcc.gnu.org>,
Fortran List <fortran@gcc.gnu.org>
Subject: Re: [Patch, libfortran] Improve performance of byte swapped IO
Date: Sun, 13 Jan 2013 22:44:00 -0000 [thread overview]
Message-ID: <CAO9iq9FLjgPtW4_e0CEdU48r-baQ7Uvx6P8OhWMcWiAmqtcZyw@mail.gmail.com> (raw)
In-Reply-To: <CAO9iq9GvGxrjzyByskWoua5JLJbJrcPeW9tHpeL3wWKs91cNdg@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 7582 bytes --]
PING**1.2
Yet another slightly updated patch attached. Compared to the previous
version, now with specializations for size 12 and 16 as well. For the
real(10) benchmark, with the previous v3 patch (please disregard the
absolute values in the post quoted below, there were wrong due to a
bug):
Unformatted sequential write/read performance test
Record size Write MB/s Read MB/s
==========================================================
4 80.578833140738340 127.33074266188656
8 137.61682156650559 184.49033790407984
16 202.72871312800621 275.98801561061816
32 275.33538767460863 413.43956672052303
64 341.04488670485119 555.13744525826564
128 384.77917051919820 671.44655208024699
256 410.97208129045833 763.97660513918527
512 425.76619227779878 826.41086693364593
1024 430.77035999730009 840.30757120448550
2048 438.30318459339475 885.50033810296600
4096 455.79422809097599 919.78265920652086
8192 465.74499205886326 959.06963983370918
16384 472.48133493971142 991.11244162081744
32768 471.00024619567603 1015.7428144049615
65536 474.91235280949985 1021.2150519080892
131072 475.18664487440901 1006.3701982554830
262144 478.00435092846868 985.17141300594039
524288 476.72837201590363 991.74226579987987
With the new v4 patch:
Unformatted sequential write/read performance test
Record size Write MB/s Read MB/s
==========================================================
4 87.353141847504133 145.09410391177835
8 166.95093628370549 223.60877830048437
16 272.20937208187746 364.91673986840277
32 415.26016354252715 599.41744252952310
64 592.97676703528009 900.53345964312450
128 748.27218547147686 1189.7131837787238
256 874.83098506714384 1561.3649529261234
512 935.69494481144284 1823.1760143164879
1024 983.51689491813215 1931.8773088107300
2048 1009.5491761651396 1971.6978586130062
4096 1115.5862027658552 2119.4151169997808
8192 1172.9400229568287 2184.1403983641089
16384 1222.6659284153168 2258.5490449229878
32768 1242.2417626697293 2251.8159046253918
65536 1227.9967555594396 2313.4106672387143
131072 1204.4295656544052 2129.1309150039478
262144 1135.7905614378458 2154.7146453789856
524288 1075.5769074402640 2170.5151501933169
On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> PING.
>
> Slightly updated patch attached, which further improves the generic
> size fallback that is used when the element size is not 2/4/8 bytes.
> Changing the us_perf benchmark to use real(10), with the v2 patch the
> performance is:
>
> Unformatted sequential write/read performance test
> Record size Write MB/s Read MB/s
> ==========================================================
> 4 59.028550429522085 86.019754350948787
> 8 79.028327063130590 95.803502000733374
> 16 99.980457395413296 138.68367462874946
> 32 122.56886206338788 180.05609910155042
> 64 152.00478266944486 212.69931319407567
> 128 197.74137934940202 235.19728791956828
> 256 155.36245780017779 244.60578379215929
> 512 157.13385845966246 245.07467397691480
> 1024 177.26553799130201 260.44908357795623
> 2048 208.22852888945587 260.21587143113527
> 4096 222.88410474980634 262.66162209490591
> 8192 226.71167580652920 265.81191407123663
> 16384 206.51818241747065 263.59395165591724
> 32768 230.18707026455866 265.88990325026526
> 65536 229.19783089391504 268.04485112932684
> 131072 231.12215662044449 267.40543904427710
> 262144 230.72012123598142 267.60086931504122
> 524288 230.48959460456055 268.78750211303725
>
> With the new v3 patch I get
>
> Unformatted sequential write/read performance test
> Record size Write MB/s Read MB/s
> ==========================================================
> 4 59.779061121239941 92.777125264010024
> 8 92.727504266051341 126.64775563782673
> 16 128.94793911163904 184.69194300482837
> 32 169.78916283536847 267.06752001266767
> 64 209.50296476919556 341.60515130910238
> 128 236.36709738360679 416.73212655882151
> 256 251.79029695383340 465.46804746749740
> 512 259.62269939828633 500.87346060356265
> 1024 265.08842337586458 508.95530627428275
> 2048 268.71795530051884 532.12211365683640
> 4096 280.86546884821030 546.88907054369884
> 8192 286.96049684823578 569.60958187426183
> 16384 292.04368984868103 608.11503416324865
> 32768 292.96677387959392 629.80651297065833
> 65536 291.69098580137114 624.27103478079641
> 131072 292.75666234956418 605.99766136491496
> 262144 291.35520038228975 611.59061455535834
> 524288 292.15446100501691 623.76232623081580
>
>
> On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
> <blomqvist.janne@gmail.com> wrote:
>> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>>>> Janne Blomqvist <blomqvist.janne@gmail.com> writes:
>>>>
>>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>>> index c8ecc3a..bf2250a 100644
>>>>> --- a/libgfortran/io/file_pos.c
>>>>> +++ b/libgfortran/io/file_pos.c
>>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>> }
>>>>> else
>>>>> {
>>>>> + uint32_t u32;
>>>>> + uint64_t u64;
>>>>> switch (length)
>>>>> {
>>>>> case sizeof(GFC_INTEGER_4):
>>>>> - reverse_memcpy (&m4, p, sizeof (m4));
>>>>> + memcpy (&u32, p, sizeof (u32));
>>>>> + u32 = __builtin_bswap32 (u32);
>>>>> + m4 = *(GFC_INTEGER_4*)&u32;
>>>>
>>>> Isn't that an aliasing violation?
>>>
>>> It looks like one. Why not simply do
>>>
>>> m4 = (GFC_INTEGER_4) u32;
>>>
>>> ? I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>>
>> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
>> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
>> the range of the signed variable, the result is
>> implementation-defined. Though I suppose the sensible
>> "implementation-defined behavior" in this case on a two's complement
>> target is to just do a bitwise copy.
>>
>> Anyway, to be really safe one could use memcpy instead; the compiler
>> optimizes small fixed size memcpy's just fine. Updated patch attached.
>>
>>
>> --
>> Janne Blomqvist
>
>
>
> --
> Janne Blomqvist
--
Janne Blomqvist
[-- Attachment #2: bswap4.diff --]
[-- Type: application/octet-stream, Size: 8772 bytes --]
diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index c8ecc3a..22d19dd 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
}
else
{
+ uint32_t u32;
+ uint64_t u64;
switch (length)
{
case sizeof(GFC_INTEGER_4):
- reverse_memcpy (&m4, p, sizeof (m4));
+ memcpy (&u32, p, sizeof (u32));
+ u32 = __builtin_bswap32 (u32);
+ memcpy (&m4, &u32, sizeof (m4));
m = m4;
break;
case sizeof(GFC_INTEGER_8):
- reverse_memcpy (&m8, p, sizeof (m8));
+ memcpy (&u64, p, sizeof (u64));
+ u64 = __builtin_bswap64 (u64);
+ memcpy (&m8, &u64, sizeof (m8));
m = m8;
break;
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 43aeafd..f17de19 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -649,9 +649,6 @@ internal_proto(init_loop_spec);
extern void next_record (st_parameter_dt *, int);
internal_proto(next_record);
-extern void reverse_memcpy (void *, const void *, size_t);
-internal_proto (reverse_memcpy);
-
extern void st_wait (st_parameter_wait *);
export_proto(st_wait);
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 6dda1df..020b6f0 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -878,50 +878,138 @@ write_buf (st_parameter_dt *dtp, void *buf, size_t nbytes)
}
-/* Master function for unformatted reads. */
+/* Reverse memcpy - used for byte swapping. */
static void
-unformatted_read (st_parameter_dt *dtp, bt type,
- void *dest, int kind, size_t size, size_t nelems)
+reverse_memcpy (void *dest, const void *src, size_t n)
{
- if (likely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_NATIVE)
- || kind == 1)
+ char *d, *s;
+ size_t i;
+
+ d = (char *) dest;
+ s = (char *) src + n - 1;
+
+ /* Write with ascending order - this is likely faster
+ on modern architectures because of write combining. */
+ for (i=0; i<n; i++)
+ *(d++) = *(s--);
+}
+
+
+/* Utility function for byteswapping an array, using the bswap
+ builtins if possible. dest and src can overlap completely, or then
+ they must point to separate objects; partial overlaps are not
+ allowed. */
+
+static void
+bswap_array (void *dest, const void *src, size_t size, size_t nelems)
+{
+ const char *ps;
+ char *pd;
+
+ switch (size)
{
- if (type == BT_CHARACTER)
- size *= GFC_SIZE_OF_CHAR_KIND(kind);
- read_block_direct (dtp, dest, size * nelems);
+ case 1:
+ break;
+ case 2:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint16_t*)dest)[i] = __builtin_bswap16 (((uint16_t*)src)[i]);
+ break;
+ case 4:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint32_t*)dest)[i] = __builtin_bswap32 (((uint32_t*)src)[i]);
+ break;
+ case 8:
+ for (size_t i = 0; i < nelems; i++)
+ ((uint64_t*)dest)[i] = __builtin_bswap64 (((uint64_t*)src)[i]);
+ break;
+ case 12:
+ ps = src;
+ pd = dest;
+ for (size_t i = 0; i < nelems; i++)
+ {
+ uint32_t tmp;
+ memcpy (&tmp, ps, 4);
+ *(uint32_t*)pd = __builtin_bswap32 (*(uint32_t*)(ps + 8));
+ *(uint32_t*)(pd + 4) = __builtin_bswap32 (*(uint32_t*)(ps + 4));
+ *(uint32_t*)(pd + 8) = __builtin_bswap32 (tmp);
+ ps += size;
+ pd += size;
+ }
+ break;
+ case 16:
+ ps = src;
+ pd = dest;
+ for (size_t i = 0; i < nelems; i++)
+ {
+ uint64_t tmp;
+ memcpy (&tmp, ps, 8);
+ *(uint64_t*)pd = __builtin_bswap64 (*(uint64_t*)(ps + 8));
+ *(uint64_t*)(pd + 8) = __builtin_bswap64 (tmp);
+ ps += size;
+ pd += size;
+ }
+ break;
+ default:
+ pd = dest;
+ if (dest != src)
+ {
+ ps = src;
+ for (size_t i = 0; i < nelems; i++)
+ {
+ reverse_memcpy (pd, ps, size);
+ ps += size;
+ pd += size;
+ }
+ }
+ else
+ {
+ /* In-place byte swap. */
+ for (size_t i = 0; i < nelems; i++)
+ {
+ char tmp, *low = pd, *high = pd + size - 1;
+ for (size_t j = 0; j < size/2; j++)
+ {
+ tmp = *low;
+ *low = *high;
+ *high = tmp;
+ low++;
+ high--;
+ }
+ pd += size;
+ }
+ }
}
- else
- {
- char buffer[16];
- char *p;
- size_t i;
+}
+
- p = dest;
+/* Master function for unformatted reads. */
+
+static void
+unformatted_read (st_parameter_dt *dtp, bt type,
+ void *dest, int kind, size_t size, size_t nelems)
+{
+ if (type == BT_CHARACTER)
+ size *= GFC_SIZE_OF_CHAR_KIND(kind);
+ read_block_direct (dtp, dest, size * nelems);
+ if (unlikely (dtp->u.p.current_unit->flags.convert == GFC_CONVERT_SWAP)
+ && kind != 1)
+ {
/* Handle wide chracters. */
- if (type == BT_CHARACTER && kind != 1)
- {
- nelems *= size;
- size = kind;
- }
+ if (type == BT_CHARACTER)
+ {
+ nelems *= size;
+ size = kind;
+ }
/* Break up complex into its constituent reals. */
- if (type == BT_COMPLEX)
- {
- nelems *= 2;
- size /= 2;
- }
-
- /* By now, all complex variables have been split into their
- constituent reals. */
-
- for (i = 0; i < nelems; i++)
- {
- read_block_direct (dtp, buffer, size);
- reverse_memcpy (p, buffer, size);
- p += size;
- }
+ else if (type == BT_COMPLEX)
+ {
+ nelems *= 2;
+ size /= 2;
+ }
+ bswap_array (dest, dest, size, nelems);
}
}
@@ -945,9 +1033,10 @@ unformatted_write (st_parameter_dt *dtp, bt type,
}
else
{
- char buffer[16];
+#define BSWAP_BUFSZ 512
+ char buffer[BSWAP_BUFSZ];
char *p;
- size_t i;
+ size_t nrem;
p = source;
@@ -968,12 +1057,21 @@ unformatted_write (st_parameter_dt *dtp, bt type,
/* By now, all complex variables have been split into their
constituent reals. */
- for (i = 0; i < nelems; i++)
+ nrem = nelems;
+ do
{
- reverse_memcpy(buffer, p, size);
- p += size;
- write_buf (dtp, buffer, size);
+ size_t nc;
+ if (size * nrem > BSWAP_BUFSZ)
+ nc = BSWAP_BUFSZ / size;
+ else
+ nc = nrem;
+
+ bswap_array (buffer, p, size, nc);
+ write_buf (dtp, buffer, size * nc);
+ p += size * nc;
+ nrem -= nc;
}
+ while (nrem > 0);
}
}
@@ -2153,15 +2251,22 @@ us_read (st_parameter_dt *dtp, int continued)
}
}
else
+ {
+ uint32_t u32;
+ uint64_t u64;
switch (nr)
{
case sizeof(GFC_INTEGER_4):
- reverse_memcpy (&i4, &i, sizeof (i4));
+ memcpy (&u32, &i, sizeof (u32));
+ u32 = __builtin_bswap32 (u32);
+ memcpy (&i4, &u32, sizeof (i4));
i = i4;
break;
case sizeof(GFC_INTEGER_8):
- reverse_memcpy (&i8, &i, sizeof (i8));
+ memcpy (&u64, &i, sizeof (u64));
+ u64 = __builtin_bswap64 (u64);
+ memcpy (&i8, &u64, sizeof (i8));
i = i8;
break;
@@ -2169,6 +2274,7 @@ us_read (st_parameter_dt *dtp, int continued)
runtime_error ("Illegal value for record marker");
break;
}
+ }
if (i >= 0)
{
@@ -3036,7 +3142,6 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
size_t len;
GFC_INTEGER_4 buf4;
GFC_INTEGER_8 buf8;
- char p[sizeof (GFC_INTEGER_8)];
if (compile_options.record_marker == 0)
len = sizeof (GFC_INTEGER_4);
@@ -3065,18 +3170,22 @@ write_us_marker (st_parameter_dt *dtp, const gfc_offset buf)
}
else
{
+ uint32_t u32;
+ uint64_t u64;
switch (len)
{
case sizeof (GFC_INTEGER_4):
buf4 = buf;
- reverse_memcpy (p, &buf4, sizeof (GFC_INTEGER_4));
- return swrite (dtp->u.p.current_unit->s, p, len);
+ memcpy (&u32, &buf4, sizeof (u32));
+ u32 = __builtin_bswap32 (u32);
+ return swrite (dtp->u.p.current_unit->s, &u32, len);
break;
case sizeof (GFC_INTEGER_8):
buf8 = buf;
- reverse_memcpy (p, &buf8, sizeof (GFC_INTEGER_8));
- return swrite (dtp->u.p.current_unit->s, p, len);
+ memcpy (&u64, &buf8, sizeof (u64));
+ u64 = __builtin_bswap64 (u64);
+ return swrite (dtp->u.p.current_unit->s, &u64, len);
break;
default:
@@ -3713,22 +3822,6 @@ st_set_nml_var_dim (st_parameter_dt *dtp, GFC_INTEGER_4 n_dim,
GFC_DIMENSION_SET(nml->dim[n],lbound,ubound,stride);
}
-/* Reverse memcpy - used for byte swapping. */
-
-void reverse_memcpy (void *dest, const void *src, size_t n)
-{
- char *d, *s;
- size_t i;
-
- d = (char *) dest;
- s = (char *) src + n - 1;
-
- /* Write with ascending order - this is likely faster
- on modern architectures because of write combining. */
- for (i=0; i<n; i++)
- *(d++) = *(s--);
-}
-
/* Once upon a time, a poor innocent Fortran program was reading a
file, when suddenly it hit the end-of-file (EOF). Unfortunately
next prev parent reply other threads:[~2013-01-13 22:44 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-04 22:15 Janne Blomqvist
2013-01-04 22:35 ` Andreas Schwab
2013-01-05 15:35 ` Richard Biener
2013-01-05 21:13 ` Janne Blomqvist
2013-01-06 11:33 ` Richard Biener
2013-01-11 20:41 ` Janne Blomqvist
2013-01-13 22:44 ` Janne Blomqvist [this message]
2013-01-18 22:30 ` Janne Blomqvist
2013-01-22 22:32 ` Thomas Koenig
2013-01-23 21:57 ` Janne Blomqvist
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAO9iq9FLjgPtW4_e0CEdU48r-baQ7Uvx6P8OhWMcWiAmqtcZyw@mail.gmail.com \
--to=blomqvist.janne@gmail.com \
--cc=fortran@gcc.gnu.org \
--cc=gcc-patches@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
--cc=schwab@linux-m68k.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).