* Re: generating unaligned vector load instructions?
@ 2013-09-18 23:01 Norbert Lange
2013-09-19 1:27 ` Tim Prince
0 siblings, 1 reply; 5+ messages in thread
From: Norbert Lange @ 2013-09-18 23:01 UTC (permalink / raw)
To: gcc-help
Hello Tim,
can you specify which versions, maybe post the command line, or try
compiling for 32-bit (the -m32 switch)?
Also I don't understand the comment about splitting - to avoid
misunderstanding: the generated code segfaults on my Athlon X2, so it's
not a question of optimal code, but of working code at all.
I'm unable to generate the right instruction, and I don't exactly know
why it should differ between versions (... except bugs, of course ...).
I just want to know the right way to force unaligned loads, without
inline assembly.
Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
can't multiply a vector with a scalar on older versions.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: generating unaligned vector load instructions?
2013-09-18 23:01 generating unaligned vector load instructions? Norbert Lange
@ 2013-09-19 1:27 ` Tim Prince
2013-09-19 7:25 ` Norbert Lange
0 siblings, 1 reply; 5+ messages in thread
From: Tim Prince @ 2013-09-19 1:27 UTC (permalink / raw)
To: gcc-help
On 9/18/2013 7:01 PM, Norbert Lange wrote:
> Hello Tim,
>
> can you specify which versions, maybe post the command line, or try
> compiling for 32-bit (the -m32 switch)?
> Also I don't understand the comment about splitting - to avoid
> misunderstanding: the generated code segfaults on my Athlon X2, so it's
> not a question of optimal code, but of working code at all.
>
> I'm unable to generate the right instruction, and I don't exactly know
> why it should differ between versions (... except bugs, of course ...).
> I just want to know the right way to force unaligned loads,
> without inline assembly.
>
> Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
> can't multiply a vector with a scalar on older versions.
I wasn't even certain which of my gcc installations had 32-bit
counterparts, but Red Hat 4.4.6 appeared to accept your code for -m64
but reject it for -m32. Intel icc, which shares a lot of infrastructure
with current gcc, rejected your code. Many people here advocate options
such as -pedantic -Wall to increase the number of warnings, so you will
get those warnings even where gcc accepts your code.
I thought the X2 could accept nearly all normal SSE2 code (the original
Turion didn't), but I guess you want to test its limits. Now that you've
revealed your actual target, someone might suggest a more appropriate
-march option. Did you read about the errata for this instruction on your
chip? http://support.amd.com/us/Processor_TechDocs/25759.pdf
Splitting unaligned 128-bit moves into separate 64-bit moves was a
common tactic to improve performance on CPUs prior to AMD Barcelona and
Intel Nehalem (not to mention avoiding bugs in hardware
implementations). It probably didn't hurt to split the instruction
explicitly on a CPU where the hardware would split it anyway (I thought
this might be true of the X2). Even on Intel Westmere there were
situations where splitting could improve performance. So gcc can't be
faulted for making that translation when you didn't tell it to compile
for a more recent CPU, or when you specified a target known to have
problems with certain instructions.
--
Tim Prince
* Re: generating unaligned vector load instructions?
2013-09-19 1:27 ` Tim Prince
@ 2013-09-19 7:25 ` Norbert Lange
0 siblings, 0 replies; 5+ messages in thread
From: Norbert Lange @ 2013-09-19 7:25 UTC (permalink / raw)
To: gcc-help
[-- Attachment #1: Type: text/plain, Size: 3579 bytes --]
Am 19.09.2013, 03:27 Uhr, schrieb Tim Prince <n8tm@aol.com>:
> On 9/18/2013 7:01 PM, Norbert Lange wrote:
>> Hello Tim,
>>
>> can you specify which versions, maybe post the command line, or try
>> compiling for 32-bit (the -m32 switch)?
>> Also I don't understand the comment about splitting - to avoid
>> misunderstanding: the generated code segfaults on my Athlon X2, so it's
>> not a question of optimal code, but of working code at all.
>>
>> I'm unable to generate the right instruction, and I don't exactly know
>> why it should differ between versions (... except bugs, of course ...).
>> I just want to know the right way to force unaligned loads, without
>> inline assembly.
>>
>> Btw: the code doesn't compile on gcc < 4.7, as I just realised - you
>> can't multiply a vector with a scalar on older versions.
> I wasn't even certain which of my gcc installations had 32-bit
> counterparts, but Red Hat 4.4.6 appeared to accept your code for -m64
> but reject it for -m32. Intel icc, which shares a lot of infrastructure
> with current gcc, rejected your code. Many people here advocate options
> such as -pedantic -Wall to increase the number of warnings, so you will
> get those warnings even where gcc accepts your code.
> I thought the X2 could accept nearly all normal SSE2 code (the original
> Turion didn't), but I guess you want to test its limits. Now that you've
> revealed your actual target, someone might suggest a more appropriate
> -march option. Did you read about the errata for this instruction on your
> chip? http://support.amd.com/us/Processor_TechDocs/25759.pdf
> Splitting unaligned 128-bit moves into separate 64-bit moves was a
> common tactic to improve performance on CPUs prior to AMD Barcelona and
> Intel Nehalem (not to mention avoiding bugs in hardware
> implementations). It probably didn't hurt to split the instruction
> explicitly on a CPU where the hardware would split it anyway (I thought
> this might be true of the X2). Even on Intel Westmere there were
> situations where splitting could improve performance. So gcc can't be
> faulted for making that translation when you didn't tell it to compile
> for a more recent CPU, or when you specified a target known to have
> problems with certain instructions.
>
Thanks for your time and help, but I believe you miss the main point.
The code in question generates an aligned load instruction, "movdqa",
which will cause an alignment fault on ALL CPUs (unless the data happens
to sit on a 16-byte boundary, but that is pure luck, since its alignment
is 4). "movdqu" is the one that should be generated, and it works fine
if I use inline assembly for the load - but that's precisely what I
don't want. It simply produces wrong code (and consistently, no matter
what I put into -march) - this is not about tuning.
The idea was to use the vector extension and let gcc output the optimal
scalar or vector code.
Well, I added a new version with a main routine, so this should allow
running the code. With -msse2 the binary segfaults with the unaligned
pointer, no matter what I do.
Some other funny bits:
* compiling for arm correctly generates unaligned byte loads with this
code (it doesn't have a vector ISA for ints), so it might be the x86
backend that loses the unaligned property somewhere
* memcpy seems to be able to generate the "movdqu" instruction, but it's
very fragile... using a pointer to the packed struct generates the
single "movdqu" instruction, while correctly using a pointer to the
member generates a scalar inline memcpy
[-- Attachment #2: testvecs.c --]
[-- Type: application/octet-stream, Size: 1888 bytes --]
#include <stdint.h>
#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U
#define PRIME32_3 3266489917U
#define PRIME32_4 668265263U
#define PRIME32_5 374761393U
typedef uint32_t v4si __attribute__ ((vector_size (16)));
#pragma pack(push,1)
typedef struct _SVecPacked { v4si v; } SVecPacked;
#pragma pack(pop)
static inline uint32_t rol32(uint32_t val, unsigned shift)
{
    return (val << shift) | (val >> (32 - shift));
}
unsigned calcVector(const void* const input, const int len, uint32_t seed)
{
    const SVecPacked* const limit = (const SVecPacked *)((const uint8_t*)input + (len & ~0xF));
    static const v4si s_Init = {PRIME32_1 + PRIME32_2, PRIME32_2, 0, -PRIME32_1};
    v4si v = s_Init + seed;
    const SVecPacked *p = (const SVecPacked *)input;
    do
    {
#if 0 && defined(__SSE__) && !defined(__AVX__)
        v4si cval;
#if defined(__SSE2__)
        asm ("movdqu (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#else
        asm ("movups (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#endif
#else
        const v4si cval = (*p).v;
        // v4si cval;
        // __builtin_memcpy(&cval, p, 16);
#endif
        v = (((v + cval * PRIME32_2) << 13) | ((v + cval * PRIME32_2) >> (32 - 13))) * PRIME32_1;
        ++p;
    } while (p != limit);
    return rol32(v[0], 1) + rol32(v[1], 7) + rol32(v[2], 12) + rol32(v[3], 18);
}
#include <stdio.h>
int main()
{
    static const uint8_t s_Data[48];
    static volatile unsigned res;
    /* mask must be ~15, not ~16, to round down to a 16-byte boundary */
    uintptr_t ptrAligned = (uintptr_t)s_Data & ~(uintptr_t)15;
    printf("Data resides at %016llx\n", (unsigned long long)(uintptr_t)s_Data);
    printf("Calling with aligned pointer %016llx\n", (unsigned long long)ptrAligned);
    res = calcVector((const void *)ptrAligned, 16, 1);
    printf("Calling with unaligned pointer %016llx\n", (unsigned long long)(ptrAligned + 4));
    res = calcVector((const void *)(ptrAligned + 4), 16, 1);
    printf("Done\n");
    return 0;
}
* Re: generating unaligned vector load instructions?
2013-09-18 16:14 Norbert Lange
@ 2013-09-18 18:28 ` Tim Prince
0 siblings, 0 replies; 5+ messages in thread
From: Tim Prince @ 2013-09-18 18:28 UTC (permalink / raw)
To: gcc-help
On 9/18/2013 12:14 PM, Norbert Lange wrote:
> Hi,
>
> I wonder how one could get the compiler to generate the "movdqu"
> instruction, since the vector extensions always seem to assume that
> everything will be aligned to 16 byte.
> I tried using a packed struct and this didn't help much. Of course one
> can always resort to inline assembly, but this should not be necessary.
>
> Compile with:
> gcc -O2 -S -msse2 testvecs.c
>
> --------------------------
I do see a movdqu, over a range of (64-bit) gcc versions from 4.4.6 to
4.9. Some of the compilers complain about mixed-data-type arithmetic on
lines 29 and 42.
I don't know whether it applies here, but splitting an unaligned memory
move is likely to be the right thing on platforms up through Intel
Westmere, so you would want to specify -march=native to optimize for
newer ones.
--
Tim Prince
* generating unaligned vector load instructions?
@ 2013-09-18 16:14 Norbert Lange
2013-09-18 18:28 ` Tim Prince
0 siblings, 1 reply; 5+ messages in thread
From: Norbert Lange @ 2013-09-18 16:14 UTC (permalink / raw)
To: gcc-help
[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]
Hi,
I wonder how one could get the compiler to generate the "movdqu"
instruction, since the vector extensions always seem to assume that
everything will be aligned to 16 bytes.
I tried using a packed struct and this didn't help much. Of course one
can always resort to inline assembly, but this should not be necessary.
Compile with:
gcc -O2 -S -msse2 testvecs.c
--------------------------
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5'
--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.7 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--enable-gnu-unique-object --enable-plugin --enable-objc-gc
--enable-targets=all --with-arch-32=i586 --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)
[-- Attachment #2: testvecs.c --]
[-- Type: application/octet-stream, Size: 1360 bytes --]
#include <stdint.h>
#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U
#define PRIME32_3 3266489917U
#define PRIME32_4 668265263U
#define PRIME32_5 374761393U
typedef uint32_t T __attribute__((aligned (2)));
typedef T v4si __attribute__ ((vector_size (16)));
#pragma pack(push,1)
typedef struct _SVecPacked { v4si v; } SVecPacked;
#pragma pack(pop)
static inline uint32_t rol32(uint32_t val, unsigned shift)
{
    return (val << shift) | (val >> (32 - shift));
}
unsigned calcVector(const void* const input, const int len, uint32_t seed)
{
    typedef uint32_t v4si __attribute__ ((vector_size (16)));
    const uint8_t* const limit = (const uint8_t*)input + (len & ~0xF);
    static const v4si s_Init = {PRIME32_1 + PRIME32_2, PRIME32_2, 0, -PRIME32_1};
    v4si v = s_Init + seed;
    const uint8_t* p = (const uint8_t*)input;
    do
    {
#if defined(__SSE__) || 0
        v4si cval;
        asm ("movdqu (%1), %0\r\n"
             : "=x"(cval) : "r"(p), "m"(*(const v4si *)p));
#else
        const v4si cval = (*(const SVecPacked *)p).v;
#endif
        p += 16;
        v = (((v + cval * PRIME32_2) << 13) | ((v + cval * PRIME32_2) >> (32 - 13))) * PRIME32_1;
    } while (p != limit);
    const union
    {
        v4si _vec;
        uint32_t _arr[4];
    } castHelp = {v};
    return rol32(castHelp._arr[0], 1) + rol32(castHelp._arr[1], 7) + rol32(castHelp._arr[2], 12) + rol32(castHelp._arr[3], 18);
}