* Re: generating unaligned vector load instructions?
@ 2013-09-18 23:01 Norbert Lange
2013-09-19 1:27 ` Tim Prince
0 siblings, 1 reply; 5+ messages in thread
From: Norbert Lange @ 2013-09-18 23:01 UTC (permalink / raw)
To: gcc-help
Hello Tim,
can you specify which versions, maybe post the command line, or try
compiling for 32-bit (the -m32 switch)?
Also I don't understand the comment about splitting - to avoid
misunderstanding: the generated code segfaults on my Athlon X2, so it's
not a question of optimal code, but of working code at all.
I'm unable to generate the right instruction, and I don't exactly know
why it should differ between versions (... except bugs, of course ...).
I just want to know the right way to force unaligned loads, without
inline assembly.
Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
can't multiply a vector with a scalar on older versions.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: generating unaligned vector load instructions?
2013-09-18 23:01 generating unaligned vector load instructions? Norbert Lange
@ 2013-09-19 1:27 ` Tim Prince
2013-09-19 7:25 ` Norbert Lange
0 siblings, 1 reply; 5+ messages in thread
From: Tim Prince @ 2013-09-19 1:27 UTC (permalink / raw)
To: gcc-help
On 9/18/2013 7:01 PM, Norbert Lange wrote:
> Hello Tim,
>
> can you specify which versions, maybe post the command line, or try
> compiling for 32-bit (the -m32 switch)?
> Also I don't understand the comment about splitting - to avoid
> misunderstanding: the generated code segfaults on my Athlon X2, so it's
> not a question of optimal code, but of working code at all.
>
> I'm unable to generate the right instruction, and I don't exactly know
> why it should differ between versions (... except bugs, of course ...).
> I just want to know the right way to force unaligned loads,
> without inline assembly.
>
> Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
> can't multiply a vector with a scalar on older versions.
I wasn't even certain which of my gcc installations had 32-bit
counterparts, but Red Hat 4.4.6 appeared to accept your code for -m64
but reject it for -m32. Intel icc, which shares a lot of infrastructure
with current gcc, rejected your code. Many people here advocate options
such as -pedantic -Wall to increase the number of warnings, so you will
get those warnings even where gcc accepts your code.
I thought the X2 could accept nearly all normal SSE2 code (the original
Turion didn't), but I guess you want to test its limits. Now that you've
revealed your actual target, someone might suggest a more appropriate
-march option. Did you read about the errata for this instruction on your
chip? http://support.amd.com/us/Processor_TechDocs/25759.pdf
Splitting unaligned 128-bit moves into separate 64-bit moves was a
common tactic to improve performance on CPUs prior to AMD Barcelona and
Intel Nehalem (not to mention avoiding bugs in hardware
implementations). It probably didn't hurt to split the instruction
explicitly on a CPU where the hardware would split it anyway (I thought
this might be true of the X2). Even on Intel Westmere there were
situations where splitting could improve performance. So gcc can't be
faulted for making that translation when you didn't tell it to compile
for a more recent CPU, or when you specified a target known to have
problems with certain instructions.
--
Tim Prince
* Re: generating unaligned vector load instructions?
2013-09-19 1:27 ` Tim Prince
@ 2013-09-19 7:25 ` Norbert Lange
0 siblings, 0 replies; 5+ messages in thread
From: Norbert Lange @ 2013-09-19 7:25 UTC (permalink / raw)
To: gcc-help
[-- Attachment #1: Type: text/plain, Size: 3579 bytes --]
Am 19.09.2013, 03:27 Uhr, schrieb Tim Prince <n8tm@aol.com>:
> On 9/18/2013 7:01 PM, Norbert Lange wrote:
>> Hello Tim,
>>
>> can you specify which versions, maybe post the command line, or try
>> compiling for 32-bit (the -m32 switch)?
>> Also I don't understand the comment about splitting - to avoid
>> misunderstanding: the generated code segfaults on my Athlon X2, so it's
>> not a question of optimal code, but of working code at all.
>>
>> I'm unable to generate the right instruction, and I don't exactly know
>> why it should differ between versions (... except bugs, of course ...).
>> I just want to know the right way to force unaligned loads, without
>> inline assembly.
>>
>> Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
>> can't multiply a vector with a scalar on older versions.
> I wasn't even certain which of my gcc installations had 32-bit
> counterparts, but Red Hat 4.4.6 appeared to accept your code for -m64
> but reject it for -m32. Intel icc, which shares a lot of infrastructure
> with current gcc, rejected your code. Many people here advocate options
> such as -pedantic -Wall to increase the number of warnings, so you will
> get those warnings even where gcc accepts your code.
> I thought the X2 could accept nearly all normal SSE2 code (the original
> Turion didn't), but I guess you want to test its limits. Now that you've
> revealed your actual target, someone might suggest a more appropriate
> -march option. Did you read about the errata for this instruction on your
> chip? http://support.amd.com/us/Processor_TechDocs/25759.pdf
> Splitting unaligned 128-bit moves into separate 64-bit moves was a
> common tactic to improve performance on CPUs prior to AMD Barcelona and
> Intel Nehalem (not to mention avoiding bugs in hardware
> implementations). It probably didn't hurt to split the instruction
> explicitly on a CPU where the hardware would split it anyway (I thought
> this might be true of the X2). Even on Intel Westmere there were
> situations where splitting could improve performance. So gcc can't be
> faulted for making that translation when you didn't tell it to compile
> for a more recent CPU, or when you specified a target known to have
> problems with certain instructions.
>
Thanks for your time and help, but I believe you miss the main point.
The code in question generates an aligned load instruction, "movdqa",
which will cause an alignment fault on ALL CPUs (unless the data happens
to sit on a 16-byte boundary, but that is pure luck, since its alignment
is 4). "movdqu" is the one that should be generated, and it works fine
if I use inline assembly for the load - but that's precisely what I
don't want. It simply produces wrong code (and consistently, no matter
what I put into -march) - this is not about tuning.
The idea was to use the vector extension and let gcc output the optimal
scalar or vector code.
Well, I added a new version with a main routine, so this should allow
running the code. With -msse2 the binary segfaults with the unaligned
pointer, no matter what I do.
Some other funny bits:
* compiling for arm correctly generates unaligned byte loads with this
code (it doesn't have a vector ISA for ints), so it might be the x86
backend that loses the unaligned property somewhere
* memcpy seems to be able to generate the "movdqu" instruction, but it's
very fragile... using a pointer to the packed struct generates the
single "movdqu" instruction, while correctly using a pointer to the
member generates a scalar inline memcpy
[-- Attachment #2: testvecs.c --]
[-- Type: application/octet-stream, Size: 1888 bytes --]
#include <stdint.h>
#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U
#define PRIME32_3 3266489917U
#define PRIME32_4 668265263U
#define PRIME32_5 374761393U
typedef uint32_t v4si __attribute__ ((vector_size (16)));
#pragma pack(push,1)
typedef struct _SVecPacked { v4si v; } SVecPacked;
#pragma pack(pop)
static inline uint32_t rol32(uint32_t val, unsigned shift)
{
    return (val << shift) | (val >> (32 - shift));
}
unsigned calcVector(const void* const input, const int len, uint32_t seed)
{
    const SVecPacked* const limit = (const SVecPacked *)((const uint8_t*)input + (len & ~0xF));
    static const v4si s_Init = {PRIME32_1 + PRIME32_2, PRIME32_2, 0, -PRIME32_1};
    v4si v = s_Init + seed;
    const SVecPacked *p = (const SVecPacked *)input;
    do
    {
#if 0 && defined(__SSE__) && !defined(__AVX__)
        v4si cval;
#if defined(__SSE2__)
        asm ("movdqu (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#else
        asm ("movups (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#endif
#else
        const v4si cval = (*p).v;
        // v4si cval;
        // __builtin_memcpy(&cval, p, 16);
#endif
        v = (((v + cval * PRIME32_2) << 13) | ((v + cval * PRIME32_2) >> (32 - 13))) * PRIME32_1;
        ++p;
    } while (p != limit);
    return rol32(v[0], 1) + rol32(v[1], 7) + rol32(v[2], 12) + rol32(v[3], 18);
}
#include <stdio.h>
int main()
{
    static const uint8_t s_Data[48];
    static volatile unsigned res;
    /* mask must be ~15, not ~16, to round down to a 16-byte boundary */
    uintptr_t ptrAligned = (uintptr_t)s_Data & ~(uintptr_t)15;
    printf("Data resides at %016llx\n", (unsigned long long)(uintptr_t)s_Data);
    printf("Calling with aligned pointer %016llx\n", (unsigned long long)ptrAligned);
    res = calcVector((const void *)ptrAligned, 16, 1);
    printf("Calling with unaligned pointer %016llx\n", (unsigned long long)(ptrAligned + 4));
    res = calcVector((const void *)(ptrAligned + 4), 16, 1);
    printf("Done\n");
    return 0;
}
* Re: generating unaligned vector load instructions?
2013-09-18 16:14 Norbert Lange
@ 2013-09-18 18:28 ` Tim Prince
0 siblings, 0 replies; 5+ messages in thread
From: Tim Prince @ 2013-09-18 18:28 UTC (permalink / raw)
To: gcc-help
On 9/18/2013 12:14 PM, Norbert Lange wrote:
> Hi,
>
> I wonder how one could get the compiler to generate the "movdqu"
> instruction, since the vector extensions always seem to assume that
> everything will be aligned to 16 byte.
> I tried using a packed struct and this didn't help much. Of course one
> can always resort to inline assembly, but this should not be necessary.
>
> Compile with:
> gcc -O2 -S -msse2 testvecs.c
>
> --------------------------
I do see a movdqu, over a range of (64-bit) gcc versions from 4.4.6 to
4.9. Some of the compilers complain about mixed-data-type arithmetic on
lines 29 and 42.
I don't know whether it applies here, but splitting an unaligned memory
move is likely to be the right thing on platforms up through Intel
Westmere, so you would want to specify -march=native to optimize for
newer ones.
--
Tim Prince
* generating unaligned vector load instructions?
@ 2013-09-18 16:14 Norbert Lange
2013-09-18 18:28 ` Tim Prince
0 siblings, 1 reply; 5+ messages in thread
From: Norbert Lange @ 2013-09-18 16:14 UTC (permalink / raw)
To: gcc-help
[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]
Hi,
I wonder how one could get the compiler to generate the "movdqu"
instruction, since the vector extensions always seem to assume that
everything will be aligned to 16 bytes.
I tried using a packed struct and this didn't help much. Of course one
can always resort to inline assembly, but this should not be necessary.
Compile with:
gcc -O2 -S -msse2 testvecs.c
--------------------------
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5'
--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.7 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--enable-gnu-unique-object --enable-plugin --enable-objc-gc
--enable-targets=all --with-arch-32=i586 --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)
[-- Attachment #2: testvecs.c --]
[-- Type: application/octet-stream, Size: 1360 bytes --]
#include <stdint.h>
#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U
#define PRIME32_3 3266489917U
#define PRIME32_4 668265263U
#define PRIME32_5 374761393U
typedef uint32_t T __attribute__((aligned (2)));
typedef T v4si __attribute__ ((vector_size (16)));
#pragma pack(push,1)
typedef struct _SVecPacked { v4si v; } SVecPacked;
#pragma pack(pop)
static inline uint32_t rol32(uint32_t val, unsigned shift)
{
    return (val << shift) | (val >> (32 - shift));
}
unsigned calcVector(const void* const input, const int len, uint32_t seed)
{
    typedef uint32_t v4si __attribute__ ((vector_size (16)));
    const uint8_t* const limit = (const uint8_t*)input + (len & ~0xF);
    static const v4si s_Init = {PRIME32_1 + PRIME32_2, PRIME32_2, 0, -PRIME32_1};
    v4si v = s_Init + seed;
    const uint8_t* p = (const uint8_t*)input;
    do
    {
#if defined(__SSE__) || 0
        v4si cval;
        asm ("movdqu (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#else
        const v4si cval = (*(const SVecPacked *)p).v;
#endif
        p += 16;
        v = (((v + cval * PRIME32_2) << 13) | ((v + cval * PRIME32_2) >> (32 - 13))) * PRIME32_1;
    } while (p != limit);
    const union
    {
        v4si _vec;
        uint32_t _arr[4];
    } castHelp = {v};
    return rol32(castHelp._arr[0], 1) + rol32(castHelp._arr[1], 7) + rol32(castHelp._arr[2], 12) + rol32(castHelp._arr[3], 18);
}