[Bug libc/28432] New: Aarch64 memcpy used on device-memory

public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug libc/28432] New: Aarch64 memcpy used on device-memory
@ 2021-10-07 12:39 jon@solid-run.com
  2021-10-07 12:59 ` [Bug libc/28432] " adhemerval.zanella at linaro dot org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: jon@solid-run.com @ 2021-10-07 12:39 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

            Bug ID: 28432
           Summary: Aarch64 memcpy used on device-memory
           Product: glibc
           Version: 2.32
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: jon@solid-run.com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Created attachment 13713
  --> https://sourceware.org/bugzilla/attachment.cgi?id=13713&action=edit
patch to memcpy memmove for better device-memory compatibility

It was first reported 4 years ago with the Macchiatobin writing to a memory
mapped framebuffer of a PCIe device.  The error was narrowed down to
overlapping stps causing device memory to be 0'd out or not written at all. 
There were many discussions on if it was valid to use mem* functions on device
memory mapped as uncached / writecombined.  Recently I tracked down a rendering
problem on the HoneyComb LX2K to a similar failure.  Since between the 3 SOCs,
the only similarity is the Cortex-A72 cores (They all have different
combinations of CCN's and PCIe IP) I started looking a bit more into possible
causes.  I came across this documentation regarding how the Cortex-A72 does ACE
transfers, https://developer.arm.com/documentation/100095/0001/way1381846851421

because I had already narrowed down the failure to memcpy's of 97-110 size
unaligned copies I realized that it was always the last 2 stp's of the copy96
routine. Since the ordering should not matter, I instead moved the backwards
copy to happen first which would then allow from what I understand of the
document above the 4 forward progressing stp's could be sent as a single
4x128bit WRAP write.

This does fix the issue I was trying to solve in both my specific test case as
well as a few real world rendering failures.  Since we are only re-ordering
stp's there should be no functional or performance regressions, and all the
glibc test's do pass.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
@ 2021-10-07 12:59 ` adhemerval.zanella at linaro dot org
  2021-10-13 12:36 ` wdijkstr at arm dot com
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2021-10-07 12:59 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adhemerval.zanella at linaro dot o
                   |                            |rg

--- Comment #1 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
Could you please send the patch to libc-alpha? If may check the contributor
maillist [1].  Patch discussion are done on the maillist and usually patches
attached on bug reports are not following up without a proper submission to
maillist.

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
  2021-10-07 12:59 ` [Bug libc/28432] " adhemerval.zanella at linaro dot org
@ 2021-10-13 12:36 ` wdijkstr at arm dot com
  2021-10-13 16:36 ` jon@solid-run.com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: wdijkstr at arm dot com @ 2021-10-13 12:36 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

Wilco <wdijkstr at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco <wdijkstr at arm dot com> ---
(In reply to Jon Nettleton from comment #0)

> This does fix the issue I was trying to solve in both my specific test case
> as well as a few real world rendering failures.  Since we are only
> re-ordering stp's there should be no functional or performance regressions,
> and all the glibc test's do pass.

It's not clear what the underlying issue is, but I don't believe reordering
stores would be sufficient - it might depend on alignment, state of the
writebuffer, cache miss/hit and many other factors that software can't control.

So my first question is whether this works around the issue 100%, ie. for all
possible combinations of src and dst alignment, overlap (for memmove) and
sizes?

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
  2021-10-07 12:59 ` [Bug libc/28432] " adhemerval.zanella at linaro dot org
  2021-10-13 12:36 ` wdijkstr at arm dot com
@ 2021-10-13 16:36 ` jon@solid-run.com
  2021-10-13 18:28 ` nsz at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jon@solid-run.com @ 2021-10-13 16:36 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

--- Comment #3 from Jon Nettleton <jon@solid-run.com> ---
(In reply to Wilco from comment #2)
> (In reply to Jon Nettleton from comment #0)
> 
> > This does fix the issue I was trying to solve in both my specific test case
> > as well as a few real world rendering failures.  Since we are only
> > re-ordering stp's there should be no functional or performance regressions,
> > and all the glibc test's do pass.
> 
> It's not clear what the underlying issue is, but I don't believe reordering
> stores would be sufficient - it might depend on alignment, state of the
> writebuffer, cache miss/hit and many other factors that software can't
> control.
> 
> So my first question is whether this works around the issue 100%, ie. for
> all possible combinations of src and dst alignment, overlap (for memmove)
> and sizes?

This fixes a singular issue that we were seeing and was reproducible with this
modified version of a previous test program that was done when the problem was
first detected on the Macchiatobin. 
https://gist.github.com/jnettlet/80f8d09d01c0dc0ffc0122f36ed78de6

The issue addressed here is purely overlapping writes to device memory of sizes
97-110 would have bytes starting or immediately after 64 bytes that would be
either zero'd or unchanged.

This behaviour could be successfully worked around by simple breaking up
memcpy's of that size range into smaller copies. 
https://gist.github.com/jnettlet/80f8d09d01c0dc0ffc0122f36ed78de6#file-fb_test-c-L98

This does not solve all GPU rendering issues, I have another patch for that but
this does resolve the issues in my test program, as well as rendering errors
that were seen in glmark2 -b terrain as well as the game minetest that still
existed with my mesa patch.  Because these issues still existing with my
patched mesa I believe this is a separate bug.

This is another developer that reported the issues to me testing my glibc with
this patch included. https://twitter.com/sahajsarup/status/1445813643744985093

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (2 preceding siblings ...)
  2021-10-13 16:36 ` jon@solid-run.com
@ 2021-10-13 18:28 ` nsz at gcc dot gnu.org
  2021-10-14  4:15 ` jon@solid-run.com
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: nsz at gcc dot gnu.org @ 2021-10-13 18:28 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

Szabolcs Nagy <nsz at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nsz at gcc dot gnu.org

--- Comment #4 from Szabolcs Nagy <nsz at gcc dot gnu.org> ---
the issue may be how the memory was mapped: it is invalid
to map with normal memory attributes if overlapping writes
are not supported the way hw expects normal memory to work.
(i'm unfortunately not an expert on this and i don't know
what's going on in the drivers, but any further information
would be helpful so others can understand the issue better)

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (3 preceding siblings ...)
  2021-10-13 18:28 ` nsz at gcc dot gnu.org
@ 2021-10-14  4:15 ` jon@solid-run.com
  2021-10-14  8:53 ` david at qore dot org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jon@solid-run.com @ 2021-10-14  4:15 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

--- Comment #5 from Jon Nettleton <jon@solid-run.com> ---
(In reply to Szabolcs Nagy from comment #4)
> the issue may be how the memory was mapped: it is invalid
> to map with normal memory attributes if overlapping writes
> are not supported the way hw expects normal memory to work.
> (i'm unfortunately not an expert on this and i don't know
> what's going on in the drivers, but any further information
> would be helpful so others can understand the issue better)

This is not a new issue.  It was first discussed in 2018 on this very long
kernel mailing list thread. 
https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/

It was originally written off as a PCIe implementation problem.  Well I have
now seen the issue personally on three different combinations of Cortex-a72,
CCN's, and PCIe IP, from multiple manufacturer's. The only common part of all
the implementations is the Arm implementation of the Cortex-A72.  I will note,
but can not personally verify that most likely the Kunpeng SOC also exhibits
this behavior because they also have similar patches to the ones I have
written.

Someone could spend another 4 years trying to find the exact cause of this
behavior, since it has just sat around for the last 4 years and ignored my
guess is that this is not going to happen. Half of the engineers claim that
mapping device memory as uncached on Arm is the same as write-combine and
another half that claim that device memory can never be treated as normal
memory.  Truth be told unless we are going to rework the entire Linux graphics
stack we need some sort of compatibility with how the software currently works.

If we don't want to patch the main aarch64 memcpy then I will be happy to
submit a ifunc implementation.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (4 preceding siblings ...)
  2021-10-14  4:15 ` jon@solid-run.com
@ 2021-10-14  8:53 ` david at qore dot org
  2021-10-25 16:46 ` nsz at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: david at qore dot org @ 2021-10-14  8:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

David Nichols <david at qore dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |david at qore dot org

--- Comment #6 from David Nichols <david at qore dot org> ---
This patch fixes real issues on the platforms listed above, without it users
need to run a custom glibc, which is a pain.

The patch has no performance penalty and no impact on other platforms - please
take into account the benefit to the impacted user base and the risk of
regressions (none) when considering this patch for merging.

In the absolute worst case it does no harm, and at this point it looks like
finding out the root cause in the HW is not going to happen in the near future.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (5 preceding siblings ...)
  2021-10-14  8:53 ` david at qore dot org
@ 2021-10-25 16:46 ` nsz at gcc dot gnu.org
  2021-10-26  7:01 ` jon@solid-run.com
  2021-11-29 16:18 ` sam at gentoo dot org
  8 siblings, 0 replies; 10+ messages in thread
From: nsz at gcc dot gnu.org @ 2021-10-25 16:46 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

--- Comment #7 from Szabolcs Nagy <nsz at gcc dot gnu.org> ---
for the patch to be accepted,
- it has to be posted to libc-alpha
- we need reasonable confidence that it fixes an issue
- there should be no relevant performance regression

looking at
https://lore.kernel.org/lkml/CAK8P3a1CdL5ZVvWPMriOknTCgkfbR4F-a6dRZ7ap9+NiD8H5cg@mail.gmail.com/
and followup mail i'd expect this to be a hw issue
somewhere between the core and the pci framebuffer.

i think we need to document what hw was this bug
observed on (all components involved).

(ideally it would be demonstrated which hw is at fault
e.g. via traces of intermediate protocols or trying
different combinations of hw components.)

(note that glibc is not the only runtime library with
this memcpy code, and the compiler may also generate
similar code patterns)

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (6 preceding siblings ...)
  2021-10-25 16:46 ` nsz at gcc dot gnu.org
@ 2021-10-26  7:01 ` jon@solid-run.com
  2021-11-29 16:18 ` sam at gentoo dot org
  8 siblings, 0 replies; 10+ messages in thread
From: jon@solid-run.com @ 2021-10-26  7:01 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

--- Comment #8 from Jon Nettleton <jon@solid-run.com> ---
(In reply to Szabolcs Nagy from comment #7)
> for the patch to be accepted,
> - it has to be posted to libc-alpha

Yes we will get there.

> - we need reasonable confidence that it fixes an issue

Well we have one test program that was born from the original reported issue 4
years ago, and 2 real world programs with multiple confirmations that it fixes
the issue

> - there should be no relevant performance regression

Yes I agree, however I don't see how change the store order is going to effect
performance in this case. Was the ordering questioned when this implementation
of memcpy was submitted?

> 
> looking at
> https://lore.kernel.org/lkml/CAK8P3a1CdL5ZVvWPMriOknTCgkfbR4F-
> a6dRZ7ap9+NiD8H5cg@mail.gmail.com/
> and followup mail i'd expect this to be a hw issue
> somewhere between the core and the pci framebuffer.

> 
> i think we need to document what hw was this bug
> observed on (all components involved).

That has been the assumption but there are really only a handful of systems
that can be used to test and validate this. Ultimately it was originally blamed
on the Synopsis PCIe IP, however with the LX2160a V1 silicon we have reproduced
the issue on a completely different PCIe IP. Additionally the Macchiatobin with
the Armada 8040 and CN913x based systems have completely different CCN
implementations, but similar PCIe IP as the LX2160 but manufactured
differently. Really the only similarities are the Cortex-A72 cores and how ACE
talks to the coherency fabric which turns connects to the PCIe over AXI.

> 
> (ideally it would be demonstrated which hw is at fault
> e.g. via traces of intermediate protocols or trying
> different combinations of hw components.)

The only combination I have not tried are different Cortex-A* cores, just
because there aren't many other hardware options to test against. 

> 
> (note that glibc is not the only runtime library with
> this memcpy code, and the compiler may also generate
> similar code patterns)

Yes I understand this, and I already have a patch for the kernel. The kernel is
slightly different since it does have a specific memcpy_io implementation. I do
understand the compiler can also generate similar code patterns however just
changing glibc's memcpy has resolved all the known issues and bugs reported to
me on our platforms.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug libc/28432] Aarch64 memcpy used on device-memory
  2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
                   ` (7 preceding siblings ...)
  2021-10-26  7:01 ` jon@solid-run.com
@ 2021-11-29 16:18 ` sam at gentoo dot org
  8 siblings, 0 replies; 10+ messages in thread
From: sam at gentoo dot org @ 2021-11-29 16:18 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=28432

Sam James <sam at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sam at gentoo dot org

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-11-29 16:18 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-07 12:39 [Bug libc/28432] New: Aarch64 memcpy used on device-memory jon@solid-run.com
2021-10-07 12:59 ` [Bug libc/28432] " adhemerval.zanella at linaro dot org
2021-10-13 12:36 ` wdijkstr at arm dot com
2021-10-13 16:36 ` jon@solid-run.com
2021-10-13 18:28 ` nsz at gcc dot gnu.org
2021-10-14  4:15 ` jon@solid-run.com
2021-10-14  8:53 ` david at qore dot org
2021-10-25 16:46 ` nsz at gcc dot gnu.org
2021-10-26  7:01 ` jon@solid-run.com
2021-11-29 16:18 ` sam at gentoo dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).