public inbox for gcc-bugs@sourceware.org
* [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
@ 2015-08-26 21:42 yann.collet.73 at gmail dot com
  2015-08-27  7:39 ` [Bug target/67366] " rguenth at gcc dot gnu.org
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-08-26 21:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

            Bug ID: 67366
           Summary: Poor assembly generation for unaligned memory accesses
                    on ARM v6 & v7 cpus
           Product: gcc
           Version: 4.8.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yann.collet.73 at gmail dot com
  Target Milestone: ---

Accessing unaligned memory locations used to be forbidden on ARM CPUs, but
since ARMv6 (many years ago now) such accesses are supported in hardware.

However, GCC 4.5, 4.6, 4.7 and 4.8 seem to generate sub-optimal code for such
accesses on these targets.

In theory, it's illegal to use a direct dereference such as:

u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned.

There are two work-arounds that I know of.
The first is to use the `packed` attribute, which is not portable
(compiler-specific).
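
As an illustration only, here is a minimal sketch of that first work-around
using GCC's attribute syntax (the wrapper type name is made up, and other
compilers spell the attribute differently):

typedef struct { u32 v; } __attribute__((packed)) unalign32;       /* hypothetical wrapper type */
u32 read32(const void* ptr) { return ((const unalign32*)ptr)->v; } /* compiler knows the field may be misaligned */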

The second, and better, work-around is to use memcpy():

u32 read32(const void* ptr) { u32 v; memcpy(&v, ptr, sizeof(v)); return v; }

This version is portable and safe.
It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64,
being reduced to an optimal assembly sequence (a single load instruction).

Unfortunately, GCC 4.5, 4.6, 4.7 and 4.8 generate suboptimal assembly for this
function on ARMv6 or ARMv7:

read32(void const*):
        ldr     r0, [r0]        @ unaligned
        sub     sp, sp, #8
        str     r0, [sp, #4]    @ unaligned
        ldr     r0, [sp, #4]
        add     sp, sp, #8
        bx      lr

This is in stark contrast with clang, which generates much more efficient
assembly:

read32(void const*):                           @ @read32(void const*)
        ldr     r0, [r0]
        bx      lr

(The assembly can be generated and displayed using a simple online tool:
https://goo.gl/7FWDB8)

It's not that gcc is unaware of the CPU's unaligned memory access capability,
since it does use it (`ldr r0, [r0]`), but it then wastes time on useless
operations involving a discardable temporary variable, storing the data onto
the stack just to read it back again.


Inlining does not save the day. -O3 helps reduce the impact, but it remains
large.

In a recent exercise comparing efficient vs. inefficient unaligned memory
access on ARMv6 and ARMv7, the measured difference was very large: up to 6x
faster at -O2 settings. See:
http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html

That is far too large a difference to be ignored.
As a consequence, to preserve performance, source code must try a number of
alternatives depending on the target and compiler, if not the compiler version.
In some circumstances (gcc targeting ARMv6, or gcc <= 4.5), it's even necessary
to write the illegal code (see the 1st version above) to reach optimal
performance on those targets.
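
To make that concrete, here is a sketch of the kind of compile-time dispatch
this pushes into portable source code (the macro names and version tests below
are hypothetical, not taken from any particular project):

#include <string.h>
typedef unsigned int u32;

/* Hypothetical selection logic; real projects key this on their own detection macros. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_6__)
  /* gcc targeting ARMv6: resort to the technically illegal direct dereference. */
  #define READ32(p) (*(const u32*)(p))
#else
  /* Everywhere else: the portable memcpy() version. */
  static u32 read32_portable(const void* p) { u32 v; memcpy(&v, p, sizeof(v)); return v; }
  #define READ32(p) read32_portable(p)
#endif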

This looks like a waste of energy, and a recipe for bugs, especially compared
to clang, which generates clean code in all circumstances for all targets.


Considering the huge performance difference such an improvement could make, is
this something the gcc team would like to look into?


Regards



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
@ 2015-08-27  7:39 ` rguenth at gcc dot gnu.org
  2015-08-27  9:36 ` rearnsha at gcc dot gnu.org
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-08-27  7:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-08-27
                 CC|                            |rguenth at gcc dot gnu.org
            Version|4.8.2                       |6.0
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think this boils down to the fact that memcpy expansion is done too late and
that (with more recent GCC) the "inlining" done on the GIMPLE level is
restricted to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
unconditionally.

I guess arm goes through the movmisalign optab for unaligned accesses and the
GIMPLE inlining could be enabled as well if movmisalign is supported.

Note that GCC 4.8 is no longer supported and enhancements will go to GCC 6 at
most.

With current trunk and -O2 -march=armv6 I get for

typedef unsigned int u32;
u32 read32(const void* ptr) { u32 v; __builtin_memcpy(&v, ptr, sizeof(v));
return v; }

read32:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r0, [r0]        @ unaligned
        sub     sp, sp, #8
        add     sp, sp, #8
        @ sp needed
        bx      lr

so apart from the stack slot not getting removed this has already improved,
but the issue I mention still happens as we expand from

read32 (const void * ptr)
{
  u32 v;
  u32 _4;

  <bb 2>:
  __builtin_memcpy (&v, ptr_2(D), 4);
  _4 = v;
  v ={v} {CLOBBER};
  return _4;

so partially confirmed, for the GIMPLE memory op "inlining" issue.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
  2015-08-27  7:39 ` [Bug target/67366] " rguenth at gcc dot gnu.org
@ 2015-08-27  9:36 ` rearnsha at gcc dot gnu.org
  2015-08-27 10:21 ` rguenther at suse dot de
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2015-08-27  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> I think this boils down to the fact that memcpy expansion is done too late
> and
> that (with more recent GCC) the "inlining" done on the GIMPLE level is
> restricted
> to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> unconditionally.
> 

Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
work with misaligned addresses (ldm, for example).  The only way to handle
misaligned copies is through the movmisalign API.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
  2015-08-27  7:39 ` [Bug target/67366] " rguenth at gcc dot gnu.org
  2015-08-27  9:36 ` rearnsha at gcc dot gnu.org
@ 2015-08-27 10:21 ` rguenther at suse dot de
  2015-08-27 10:42 ` ramana at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenther at suse dot de @ 2015-08-27 10:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> > I think this boils down to the fact that memcpy expansion is done too late
> > and
> > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > restricted
> > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > unconditionally.
> > 
> 
> Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> work with misaligned addresses (ldm, for example).  The only way to handle
> misaligned copies is through the movmisalign API.

Are the movmisalign handled ones reasonably efficient?  That is, more
efficient than memcpy/memmove?  Then we should experiment with

Index: gcc/gimple-fold.c
===================================================================
--- gcc/gimple-fold.c   (revision 227252)
+++ gcc/gimple-fold.c   (working copy)
@@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
                  /* If the destination pointer is not aligned we must be able
                     to emit an unaligned store.  */
                  && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
-                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), dest_align)))
+                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), dest_align)
+                     || (optab_handler (movmisalign_optab, TYPE_MODE (type))
+                         != CODE_FOR_nothing)))
                {
                  tree srctype = type;
                  tree desttype = type;
@@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
                    srcmem = tem;
                  else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE (type))
                           && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
-                                                    src_align))
+                                                    src_align)
+                          && (optab_handler (movmisalign_optab,
+                                             TYPE_MODE (type))
+                              == CODE_FOR_nothing))
                    srcmem = NULL_TREE;
                  if (srcmem)
                    {



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (2 preceding siblings ...)
  2015-08-27 10:21 ` rguenther at suse dot de
@ 2015-08-27 10:42 ` ramana at gcc dot gnu.org
  2015-08-27 10:47 ` ramana at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-08-27 10:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to Richard Earnshaw from comment #2)
> (In reply to Richard Biener from comment #1)
> > I think this boils down to the fact that memcpy expansion is done too late
> > and
> > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > restricted
> > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > unconditionally.
> > 
> 
> Yep, we have to define STRICT_ALIGNMENT to 1 because not all load
> instructions work with misaligned addresses (ldm, for example).  The only
> way to handle misaligned copies is through the movmisalign API.

Hmmm, actually that's not completely true - for SImode and HImode objects there
don't appear to be movmisalign interfaces. For things like memcpy this is done
through the memcpy interface, where we end up generating copies using
unaligned_loadsi and unaligned_storesi.

We may need to experiment a bit with the movmisalign interface too.


Ramana



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (3 preceding siblings ...)
  2015-08-27 10:42 ` ramana at gcc dot gnu.org
@ 2015-08-27 10:47 ` ramana at gcc dot gnu.org
  2015-08-27 11:08 ` ramana at gcc dot gnu.org
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-08-27 10:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #5 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with

What do you mean by more efficient here?  They'll end up calling down to the
same unspec block if we implemented them - memcpy / memmove go through the
backend block_move interface.

Will try the patch with a hacked up movmisalign pattern in the backend.

> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (4 preceding siblings ...)
  2015-08-27 10:47 ` ramana at gcc dot gnu.org
@ 2015-08-27 11:08 ` ramana at gcc dot gnu.org
  2015-08-27 11:13 ` rguenther at suse dot de
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-08-27 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with

Minor nit - missing include of optabs.h.  Fixing that and adding a
movmisalignsi pattern in the backend that just generates either an unaligned
loadsi / storesi insn produces the following for me for the above-mentioned
testcase.


read32:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r0, [r0]        @ unaligned
        bx      lr




I'm on holiday from this evening so don't really want to push something today
... 





> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (5 preceding siblings ...)
  2015-08-27 11:08 ` ramana at gcc dot gnu.org
@ 2015-08-27 11:13 ` rguenther at suse dot de
  2015-08-27 11:17 ` ramana at gcc dot gnu.org
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenther at suse dot de @ 2015-08-27 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, ramana at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #3)
> > On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > > 
> > > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > > (In reply to Richard Biener from comment #1)
> > > > I think this boils down to the fact that memcpy expansion is done too late
> > > > and
> > > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > > restricted
> > > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > > unconditionally.
> > > > 
> > > 
> > > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > > work with misaligned addresses (ldm, for example).  The only way to handle
> > > misaligned copies is through the movmisalign API.
> > 
> > Are the movmisalign handled ones reasonably efficient?  That is, more
> > efficient than memcpy/memmove?  Then we should experiment with
> 
> minor nit - missing include of optabs.h - fixing that and adding a
> movmisalignsi pattern in the backend that just generates either an unaligned /
> storesi insn generates the following for me for the above mentioned testcase.
> 
> 
> read32:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r0, [r0]        @ unaligned
>         bx      lr
> 
> 
> 
> 
> I'm on holiday from this evening so don't really want to push something today
> ... 

Sure ;)  When adding the GIMPLE folding I was just careful here as I
don't really have a STRICT_ALIGNMENT machine with movmisalign handling
available.  Thus full testing is appreciated (might turn up some
testcases that need adjustment).  There are more STRICT_ALIGN
guarded cases below in the function, eventually they can be modified
as well (at which point splitting out the alignment check to a separate
function makes sense).



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (6 preceding siblings ...)
  2015-08-27 11:13 ` rguenther at suse dot de
@ 2015-08-27 11:17 ` ramana at gcc dot gnu.org
  2015-08-27 14:31 ` rearnsha at gcc dot gnu.org
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-08-27 11:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #8 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #7)
> On Thu, 27 Aug 2015, ramana at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #3)
> > > On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > > > 
> > > > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > > > (In reply to Richard Biener from comment #1)
> > > > > I think this boils down to the fact that memcpy expansion is done too late
> > > > > and
> > > > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > > > restricted
> > > > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > > > unconditionally.
> > > > > 
> > > > 
> > > > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > > > work with misaligned addresses (ldm, for example).  The only way to handle
> > > > misaligned copies is through the movmisalign API.
> > > 
> > > Are the movmisalign handled ones reasonably efficient?  That is, more
> > > efficient than memcpy/memmove?  Then we should experiment with
> > 
> > minor nit - missing include of optabs.h - fixing that and adding a
> > movmisalignsi pattern in the backend that just generates either an unaligned /
> > storesi insn generates the following for me for the above mentioned testcase.
> > 
> > 
> > read32:
> >         @ args = 0, pretend = 0, frame = 0
> >         @ frame_needed = 0, uses_anonymous_args = 0
> >         @ link register save eliminated.
> >         ldr     r0, [r0]        @ unaligned
> >         bx      lr
> > 
> > 
> > 
> > 
> > I'm on holiday from this evening so don't really want to push something today
> > ... 
> 
> Sure ;)  When adding the GIMPLE folding I was just careful here as I
> don't really have a STRICT_ALIGNMENT machine with movmisalign handling
> available.  Thus full testing is appreciated (might turn up some
> testcases that need adjustment).  There are more STRICT_ALIGN
> guarded cases below in the function, eventually they can be modified
> as well (at which point splitting out the alignment check to a separate
> function makes sense).

This was the backend hack I was playing with. It needs extension to HImode
values, cleaning up, regression testing, and a small amount of benchmarking to
check that we're doing the right thing.



diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 288bbb9..eaff494 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -11423,6 +11423,27 @@
   }"
 )

+(define_expand "movmisalignsi"
+  [(match_operand:SI 0 "general_operand")
+   (match_operand:SI 1 "general_operand")]
+  "unaligned_access"
+{
+  /* This pattern is not permitted to fail during expansion: if both arguments
+     are non-registers (e.g. memory := constant, which can be created by the
+     auto-vectorizer), force operand 1 into a register.  */
+  if (!s_register_operand (operands[0], SImode)
+      && !s_register_operand (operands[1], SImode))
+    operands[1] = force_reg (SImode, operands[1]);
+
+  if (MEM_P (operands[1]))
+    emit_insn (gen_unaligned_loadsi (operands[0], operands[1]));
+  else
+    emit_insn (gen_unaligned_storesi (operands[0], operands[1]));
+
+  DONE;
+})
+
+
 ;; Vector bits common to IWMMXT and Neon
 (include "vec-common.md")
 ;; Load the Intel Wireless Multimedia Extension patterns



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (7 preceding siblings ...)
  2015-08-27 11:17 ` ramana at gcc dot gnu.org
@ 2015-08-27 14:31 ` rearnsha at gcc dot gnu.org
  2015-08-27 14:36 ` rguenther at suse dot de
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2015-08-27 14:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #9 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
Does that really do the right thing?  That is, does force_reg understand a
misaligned memory op?

Also, what if one memory operand is aligned but the other is not?  Don't we
want the right combination of aligned/misaligned instructions?



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (8 preceding siblings ...)
  2015-08-27 14:31 ` rearnsha at gcc dot gnu.org
@ 2015-08-27 14:36 ` rguenther at suse dot de
  2015-08-27 14:45 ` ramana at gcc dot gnu.org
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenther at suse dot de @ 2015-08-27 14:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #9 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> Does that really do the right thing?  That is, does force_reg understand a
> misaligned memory op?
> 
> Also, what if one memory operand is aligned, but the other not?  Don't we want
> to have the right combination of aligned/misaligned instructions?

I think you never get two MEMs, you at most get a constant on the
RHS of a store.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (9 preceding siblings ...)
  2015-08-27 14:36 ` rguenther at suse dot de
@ 2015-08-27 14:45 ` ramana at gcc dot gnu.org
  2015-09-09 15:28 ` ramana at gcc dot gnu.org
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-08-27 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #11 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---

(In reply to rguenther@suse.de from comment #10)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #9 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > Does that really do the right thing?  That is, does force_reg understand a
> > misaligned memory op?

In practice I think yes.

> > 
> > Also, what if one memory operand is aligned, but the other not?  Don't we want
> > to have the right combination of aligned/misaligned instructions?
> 
> I think you never get two MEMs, you at most get a constant on the
> RHS of a store.

Yep, that's my understanding too. movmisalign<mode> has stricter rules compared
to mov<mode>, so we have to handle some of these carefully. The logic in here
is essentially from neon.md's movmisalign<mode>, so some of it may not be
relevant here in the scalar case, but it's probably better to be a bit
defensive here.

I tried playing with the HImode case, but the pain there was that HImode
movmisalign expects a load into a HImode register, and I hadn't got my head
around that given other things I needed to finish up before leaving.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (10 preceding siblings ...)
  2015-08-27 14:45 ` ramana at gcc dot gnu.org
@ 2015-09-09 15:28 ` ramana at gcc dot gnu.org
  2015-10-09 11:08 ` ramana at gcc dot gnu.org
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-09-09 15:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #12 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with
> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {

This, plus the backend changes to deal with unaligned HImode and SImode values,
tested OK on armhf with only 2 extra failures in strlenopt-8.c. Prima facie
they appear to be testisms, but it will be fun to handle this across all
architecture levels for the arm target, as unaligned access depends on the
architecture level.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (11 preceding siblings ...)
  2015-09-09 15:28 ` ramana at gcc dot gnu.org
@ 2015-10-09 11:08 ` ramana at gcc dot gnu.org
  2015-10-11 10:33 ` fredrik.hederstierna@securitas-direct.com
  2015-10-13  9:16 ` ramana at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-10-09 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #15 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
Author: ramana
Revision: 228644
Modified property: svn:log

Modified: svn:log at Fri Oct  9 11:08:05 2015
------------------------------------------------------------------------------
--- svn:log (original)
+++ svn:log Fri Oct  9 11:08:05 2015
@@ -1,45 +1,43 @@
-[AArch64] Handle literal pools for functions > 1 MiB in size.
-    
+[PATCH PR target/67366 2/2] [gimple-fold.c] Support movmisalign optabs in gimple-fold.c

-This patch fixes the issue in PR63304 where we have
-functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
-instructions to address the literal pools under the use of a command line
-option. I would like to turn this on by default on trunk but keep this
-disabled by default for the release branches in order to get some
-serious testing for this feature while it bakes on trunk.
+This patch by Richard allows for movmisalign optabs to be supported
+in gimple-fold.c. This caused a bit of pain in the testsuite with strlenopt-8.c
+in conjunction with the ARM support for movmisalign_optabs as the test
+was coded up to do different things depending on whether the target
+supported misaligned access or not. However now with unaligned access
+being allowed for different levels of the architecture in the arm backend,
+the concept of the helper function non_strict_align mapping identically
+to the definition of STRICT_ALIGNMENT disappears.

-As a follow-up I would like to try and see if estimate_num_insns or
-something else can give us a heuristic to turn this on for "large" functions.
-After all the number of incidences of this are quite low in real life,
-so may be we should look to restrict this use as much as possible on the
-grounds that this code generation implies an extra integer register for
-addressing for every floating point and vector constant and I don't think
-that's great in code that already may have high register pressure.
+Adjusted thusly for ARM. The testsuite/lib changes were tested with an
+arm-none-eabi multilib that included architecture variants that did not
+support unaligned access and architecture variants that did.

-Tested on aarch64-none-elf with no regressions. A previous
-version was bootstrapped and regression tested.
+The testing matrix for this patch was:

-Applied to trunk.
+1. x86_64 bootstrap and regression test - no regressions.
+2. armhf bootstrap and regression test - no regressions.
+3. arm-none-eabi cross build and regression test for

-regards
-Ramana
+{-marm/-march=armv7-a/-mfpu=vfpv3-d16/-mfloat-abi=softfp}
+{-mthumb/-march=armv8-a/-mfpu=crypto-neon-fp-armv8/-mfloat-abi=hard}
+{-marm/-mcpu=arm7tdmi/-mfloat-abi=soft}
+{-mthumb/-mcpu=arm7tdmi/-mfloat-abi=soft}

-2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+with no regressions.

-       PR target/63304
-       * config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
-       nopcrelative_literal_loads.
-       (aarch64_classify_address): Likewise.
-       (aarch64_constant_pool_reload_icode): Define.
-       (aarch64_secondary_reload): Handle secondary reloads for
-       literal pools.
-       (aarch64_override_options): Handle nopcrelative_literal_loads.
-       (aarch64_classify_symbol): Handle nopcrelative_literal_loads.
-       * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
-       Define.
-       (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
-       * config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
-       * config/aarch64/predicates.md (aarch64_constant_pool_symref): New
-       predicate.
-       * doc/invoke.texi (mpc-relative-literal-loads): Document.
+Ok to apply ?

+2015-10-09  Richard Biener  <rguenth@suse.de>
+
+       PR target/67366
+       * gimple-fold.c (optabs-query.h): Include
+       (gimple_fold_builtin_memory_op): Allow unaligned stores
+       when movmisalign_optabs are available.
+
+2015-10-09  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+
+       PR target/67366
+       * lib/target-supports.exp (check_effective_target_non_strict_align):
+       Adjust for arm*-*-*.
+       * gcc.target/arm/pr67366.c: New test.



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (12 preceding siblings ...)
  2015-10-09 11:08 ` ramana at gcc dot gnu.org
@ 2015-10-11 10:33 ` fredrik.hederstierna@securitas-direct.com
  2015-10-13  9:16 ` ramana at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: fredrik.hederstierna@securitas-direct.com @ 2015-10-11 10:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

Fredrik Hederstierna <fredrik.hederstierna@securitas-direct.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fredrik.hederstierna@securi
                   |                            |tas-direct.com

--- Comment #16 from Fredrik Hederstierna <fredrik.hederstierna@securitas-direct.com> ---
Could this fix also possibly improve:
Bug 67507 - "Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM
thumb1", which also seems related to alignment on ARM targets?
Thanks, Fredrik



* [Bug target/67366] Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
  2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
                   ` (13 preceding siblings ...)
  2015-10-11 10:33 ` fredrik.hederstierna@securitas-direct.com
@ 2015-10-13  9:16 ` ramana at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-10-13  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

--- Comment #17 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to Fredrik Hederstierna from comment #16)
> Could this fix also possibly improve:
> Bug 67507 - "Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM
> thumb1", which also seems related to alignment on ARM targets?
> Thanks, Fredrik

No I don't think so ...



end of thread

Thread overview: 16+ messages
2015-08-26 21:42 [Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus yann.collet.73 at gmail dot com
2015-08-27  7:39 ` [Bug target/67366] " rguenth at gcc dot gnu.org
2015-08-27  9:36 ` rearnsha at gcc dot gnu.org
2015-08-27 10:21 ` rguenther at suse dot de
2015-08-27 10:42 ` ramana at gcc dot gnu.org
2015-08-27 10:47 ` ramana at gcc dot gnu.org
2015-08-27 11:08 ` ramana at gcc dot gnu.org
2015-08-27 11:13 ` rguenther at suse dot de
2015-08-27 11:17 ` ramana at gcc dot gnu.org
2015-08-27 14:31 ` rearnsha at gcc dot gnu.org
2015-08-27 14:36 ` rguenther at suse dot de
2015-08-27 14:45 ` ramana at gcc dot gnu.org
2015-09-09 15:28 ` ramana at gcc dot gnu.org
2015-10-09 11:08 ` ramana at gcc dot gnu.org
2015-10-11 10:33 ` fredrik.hederstierna@securitas-direct.com
2015-10-13  9:16 ` ramana at gcc dot gnu.org
