public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Christophe Lyon <christophe.lyon@linaro.org>
To: Daniel Engel <libgcc@danielengel.com>
Cc: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>,
	gcc Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
Date: Sat, 9 Jan 2021 14:09:25 +0100	[thread overview]
Message-ID: <CAKdteOYtmZ=-DJkouURmdhzyJmSCoiQ7d236STM_Teq6kzXwtQ@mail.gmail.com> (raw)
In-Reply-To: <17c601ba-60e4-446c-b7b4-b1674d8500e5@www.fastmail.com>

On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
>
> On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > On 07/01/2021 00:59, Daniel Engel wrote:
> > > --snip--
> > >
> > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > >
> > >>
> > >> Thanks for working on this, Daniel.
> > >>
> > >> This is clearly stage1 material, so we've got time for a couple of
> > >> iterations to sort things out.
> > >
> > > I appreciate your feedback.  I had been hoping that with no regressions
> > > this might still be eligible for stage2.  Christophe never indicated
> > > either way. but the fact that he was looking at it seemed positive.
> > > I thought I would be a couple of weeks faster with this last
> > > iteration, but holidays got in the way.
> >
> > GCC doesn't have a stage 2 any more (historical wart).  We were in
> > (late) stage3 when this was first posted, and because of the significant
> > impact this might have on not just CM0 but other targets as well, I
> > don't think it's something we should try to squeeze in at the last
> > minute.  We're now in stage 4, so that is doubly the case.
>
> Of course I meant stage3.  Oops.  I actually thought stage 3 would
> continue through next week based on the average historical dates.

I expected stage4 to start on Jan 1st :-)

> It would have been nice to get this feedback when I emailed you a
> preview version of this patch (2020-Nov-11).  Christophe's logs have
> been very helpful on the technical integration, but it's proving a chore
> to go back and re-create some of the intermediate chunks.
>
> Regardless, I still have free time for at least a little while longer to
> work on this, so I'll push forward with any further feedback you are
> willing to give me.  I have failed to free up any time during the last 2
> years to actually work on this during stage1, and I have no guarantee
> this coming year will be different.
>
> >
> > Christophe is a very valuable member of our community, but he's not a
> > port maintainer and thus cannot really rule on what can go into the
> > tools, or when.
> >
> > >
> > > I actually think your comments below could all be addressable within a
> > > couple of days.  But, I'm not accounting for the review process.
> > >
> > >> Firstly, the patch is very large, but contains a large number of
> > >> distinct changes, so it would really benefit from being broken down into
> > >> a number of distinct patches.  This will make reviewing the individual
> > >> changes much more straight-forward.

And if you can generate the patch with git, that will help: the reason for the
previous build errors was because I had to "reformat" your patch before
submitting it for validation, and I messed things up.

> > >
> > > I have no context for "large" or "small" with respect to gcc.  This
> > > patch comprises about 30% of a previously-monolithic library that's
> > > been shipping since ~2016 (the rest is libm material).  Other than
> > > (1) the aforementioned change to div0(), (2) a nascent adaptation
> > > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> > > the bitwise functions, the library remains pretty much as it was
> > > originally released.
> >
> > Large, like many other terms is relative.  For assembler file changes,
> > which this is primarily, the overall size can be much smaller and still
> > be considered 'large'.
> >
> > >
> > > The driving force in the development of this library was small size,
> > > which of course was never possible with the softfp routines.  It's not
> > > half-slow, either, for the limitations of the M0 architecture.   And,
> > > it's IEEE compliant.  But, that means that most of the functions are
> > > highly interconnected.  So, some of it can be broken up as you outline
> > > below, but that last patch is still worth more than half of the total.
> >
> > Nevertheless, having the floating-point code separated out will make
> > reviewing more straight forward.  I'll likely need to ask one of our FP
> > experts to have a specific look at that part and that will be easier if
> > it is disentangled from the other changes.
> > >
> > > I also have ~70k lines of test vectors that seem mostly redundant, but
> > > not completely.  I haven't decided what to do here.  For example, I have
> > > coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> > > with this code it will be in a separate thread.
> >
> > Publishing the test code, even if it isn't integrated into the GCC
> > testsuite would be useful.  Perhaps someone else could then help with that.
>
> Very brute force stuff, not production quality:
>     <http://danielengel.com/cm0_test_vectors.tgz> (160 kb)
>
> > >> I'd suggest:
> > >>
> > >> 1) Some basic makefile cleanups to ease initial integration - in
> > >> particular where we have things like
> > >>
> > >> LIB1FUNCS += <long list of functions>
> > >>
> > >> that this be rewritten with one function per line (and sorted
> > >> alphabetically) - then we can see which functions are being changed in
> > >> subsequent patches.  It makes the Makefile fragments longer, but the
> > >> improvement in clarity for makes this worthwhile.
> > >
> > > I know next to nothing about Makefiles, particularly ones as complex as
> > > GCC's.  I was just trying to work with the existing style to avoid
> > > breaking something.  However, I can certainly adopt this suggestion.
> > >
> > >> 2) The changes for the existing integer functions - preferably one
> > >> function per patch.
> > >>
> > >> 3) The new integer functions that you're adding
> > >
> > > These wouldn't be too hard to do, but what are the expectations for
> > > testing?  A clean build of GCC takes about 6 hours in my VM, and
> > > regression testing takes about 4 hours per architecture.  You would want
> > > a full regression report for each incremental patch?  I have no idea how
> > > to target regression tests that apply to particular runtime functions
> > > without the risk of missing something.
> > >
> >
> > Most of this can be tested in a cross-compile environment using qemu as
> > a model.  A cross build shouldn't take that long (especially if you
> > restrict the compiler to just C and C++ - other languages are
> > vanishingly unlikely to pick up errors in the parts of the compiler
> > you're changing).  But build checks will be mostly sufficient for most
> > of the intermediate patches.
> >
> > >> 4) The floating-point support.
> > >>
> > >> Some more general observations:
> > >>
> > >> - where functions are already in lib1funcs.asm, please leave them there.
> > >
> > > I guess I have a different vision here.  I have had a really hard time
> > > following all of the nested #ifdefs in lib1funcs, so I thought it would
> > > be helpful to begin breaking it up into logical units.
> >
> > Agreed, it's not easy.  But the restructuring, if any, should be done
> > separately from other changes, not mixed up with them.
> >
> > >
> > > The functions removed were all functions for which I had THUMB1
> > > sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> > > __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> > > the same number of instructions as the previous ARM/THUMB2 version.
> > >
> > > You will find all of the previous ARM versions of these functions merged
> > > into the new files (with attribution) and the same preprocessor
> > > selection path.  So no architecture variant should be any worse off than
> > > before this patch, and some beyond v6m should benefit.
> > >
> > > In the future, I think that my versions of __divsi3 and __divdi3 will
> > > prove faster/smaller than the existing THUMB2 versions.  I know that my
> > > float routines are less than half the compiled size of THUMB2 versions
> > > in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> > > differences so I have left all this work for future patches. (It's also
> > > quite likely that my version can be further-refined with a few judicious
> > > uses of THUMB2 alternatives.)
> > >
> > > My long-term vision would be use lib1funcs as an architectural wrapper
> > > distinct from the implementation code.
> > >
> > >> - lets avoid having the cm0 subdirectory - in particular we should do
> > >> this when there is existing code for other targets in the same source
> > >> files.  It's OK to have any new files in the main 'arm' directory of the
> > >> source tree - just name the files appropriately if really needed.
> > >
> > > Fair point on the name.  In v1 of this patch, all these files were all
> > > preprocessor-selected for v6m only.  However, as I've stumbled through
> > > the finer points of integration, that line has blurred.  Name aside,
> > > the subdirectory does still represent a standalone library.   I think
> > > I've managed to add enough integration hooks that it works well in
> > > a libgcc context, but it still has a very distinct implementation style.
> > >
> > > I don't have a strong opinion on this, just preference.  But, keeping
> > > the subdirectory with a neutral name will probably make maintenance
> > > easier in the short term.  I would suggest "lib0" (since it caters to
> > > the lowest common denominator) or "eabi" (since that was the original
> > > target).  There are precedents in other architectures (e.g. avr).
> >
> > The issue here is that the selection of code from the various
> > subdirectories is not consistent.  In some cases we might be pulling in
> > a thumb1 implementation into a thumb2 environment, so having the code in
> > a directory that doesn't reflect this makes maintaining the code harder.
> >  I don't mind too much if some new files are introduced and their names
> > reflect both their function and the architecture they support - eg
> > t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
>
> You didn't say that a neutral directory name is off the table.
> I will propose something other than 'cm0'.
>
> > >
> > >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> > >> better term, and that is used widely elsewhere in the compiler.  So if
> > >> you really need a term just use THUMB1, or even T1.
> > >
> > > Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> > > of this is probably THUMB1 clean, but it wasn't a design requirement.
> >
> > It's particularly the Thumb1 issue, just more the name is for a specific
> > CPU which might cause confusion later.  v6m would be preferable to that
> > if there really is a dependency on the instructions that are not in the
> > original Thumb1 ISA.
>
> I will remove the CM0 prefix and use/extend the standard macro names.
>
> >
> > >
> > > The CM0_FUNC_START exists so that I can specify subsections of ".text"
> > > for each function.  This was a fairly fundamental design decision that
> > > allowed me to make a number of branch optimizations between functions.
> > > The other macros are just duplicates for naming symmetry.
> >
> > This is something we'll have to get to during the main review of the
> > code - we used to have support for PE-COFF object files.  That might now
> > be obsolete, wince support is certainly deprecated - but we can't assume
> > that ELF is the only object format we'll ever have to support.
> >
> > >
> > > The existing  FUNC_START macro inserts extra conflicting ".text"
> > > directives that would break the build.  Of course, the prefix was
> > > arbitrary; I just took CM0 from the library name.  But, there's nothing
> > > architecturally significant about this macro at all, so THUMB1 and T1
> > > seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> > > two parameters? For example:
> > >
> > >     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> > >
> > > Instead of:
> > >
> > >     .section .text.sorted.libgcc.clz2.clzdi2,"x"
> > >     CM0_FUNC_START clzdi2
> > >
> > >> - For the 64-bit shift functions, I expect the existing code to be
> > >> preferable whenever possible - I think it can even be tweaked to build
> > >> for thumb2 by inserting a suitable IT instruction.  So your code should
> > >> only be used when
> > >>
> > >>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> > >
> > > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> > > (The name doesn't seem quite right for Cortex M0, since it does support
> > > some 32 bit instructions, but that's beside the point.)
> >
> > The terms Thumb1 and Thumb2 predate the arm-v8m architecture
> > specifications - even then the term Thumb1 was interpreted as "mostly
> > 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
> > 16/32-bit spilt has become more blurred and that will likely continue in
> > future since the 16-bit encoding space is pretty full.
> >
> > >
> > > The current lib1funcs ARM code path still exists, as described above. My
> > > THUMB1 implementations were 1 - 3 instructions shorter than the current
> > > versions, which is why I took the time to merge the files.
> > >
> > > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> > > see an advantage to eliminating the branch unless these functions were
> > > written with cryptographic side channel attacks in mind.
> >
> > On high performance cores branches are predicted - if the branch is
> > predictable then the common path will be taken and the unneeded
> > instructions will never be used.  But library functions like this tend
> > to have very unpredictable values used for calling them, so it's much
> > less likely that the hardware will predict the right path - at this
> > point conditional instructions tend to win (especially if there aren't
> > very many of them) because the cost (on average) of not executing the
> > unneeded instructions is much lower than the cost (on average) of
> > unwinding the processor state to execute the other arm of the
> > conditional branch.
>
> Got it.  I have been counting branches as 3 cycles of fixed cost, and
> ignoring penalties if a branch skips at least 2 instructions.
>
> Going forward, I will add 'IT<c>' compile options for any new code with
> scope beyond v6m.
>
> > >> - most, if not all, of your LSYM symbols should not be needed after
> > >> assembly, so should start with a captial 'L' (and no leading
> > >> underscores), the assembler will then automatically discard any that are
> > >> not needed for relocations.
> > >
> > > You don't want debugging symbols for libgcc internals :) ?  I sort of
> > > understand that, but those symbols have been useful to me in the past.
> > > The "." by itself seems to keep visibility local, so the extra symbols
> > > won't cause linker issuess. Would you object to a macro variant (e.g.
> > > LLSYM) that prepends the "L" but is easier to disable?
> >
> > It is a matter of taste, but I really prefer the local symbols to
> > disappear entirely once the file is compiled - it makes things like
> > backtrace gdb show the proper call heirarchy.
>
> Hearing no objection to LLSYM, I'll implement that for debugging.
> The released version will have ".L" symbols stripped.
>
> > >> - you'll need to write suitable commit messages for each patch, which
> > >> also contain a suitable ChangeLog style entry.
> > >
> > > OK.
> > >
> > >> - finally, your popcount implementations have data in the code segment.
> > >>  That's going to cause problems when we have compilation options such as
> > >> -mpure-code.
> > >
> > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > If this matters, you'll need to point in the right direction for the
> > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> >
> > That might be a bug in the clz implementations - Christophe: Any thoughts?
>
> __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
Thanks, I'll have a closer look at why I didn't see problems.

> The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> appears to be valid only when the 'movt' instruction is available, which
> means that the 'clz' instruction will also be available, so no array loads.
No, -mpure-code is also supported with v6m.

> Is the -mpure-code state detectable as a preprocessor flag?  While
No.

> 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> should be the default just because a processor supports Thumb-2.
>
> Do users wanting to use -mpure-code recompile the toolchain to avoid
> constant data in compiled C functions?  I don't think this is the
> default for the typical toolchain scripts.
No, users of -mpure-code do not recompile the toolchain.

> > >> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> > >> this based on the diff style and directory naming in your patches), you
> > >> switch to using a git tree, then you'll be able to use tools such as
> > >> rebasing and the git posting tools to send the patch series for
> > >> subsequent review.
> > >
> > > Your assumption is correct.  I didn't think that I would have to get so
> > > deep into the gcc development process for this contribution.  I used
> > > this library as a bare metal alternative for libgcc/libm in the product
> > > for years, so I thought it would just drop in.  But, the libgcc compile
> > > mechanics have proved much more 'interesting'. I'm assuming this
> > > architecture was created years before the introduction of -ffunction-
> > > sections...
> > >
> >
> > I don't think I've time to write a history lesson, even if you wanted
> > it.  Suffice to say, this does date back to the days of a.out format
> > object files (with 4 relocation types, STABS debugging, and one code,
> > one data and one bss section).
> >
> > >>
> > >> Richard.
> > >>
> > >
> > > Thanks again,
> > > Daniel
> > >
> >
> > R.
> >
>
> To reiterate what I said above, I intend to push forward and incorporate
> your current recommendations plus any further feedback I may get.  I
> expect you to say that this doesn't merit inclusion yet, but I'd rather
> spend the time while I have it.
>
> I'll post a patch series for review within the next day or so.

Here are the results of the validation of your latest version (20210105):
https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20210105.patch/report-build-info.html

"BIG-REGR" just means the regression report is large enough that it's
provided in compressed form to avoid overloading the browser.

So it really seems your patch introduces regressions in arm*linux* configs.
For the 2 arm-none-eabi configs which show regressions (cortex-m0 and
cortex-m3), the logs seem to indicate some tests timed out, and it's
possible the server used was overloaded.
The same applies to the 3 aarch64*elf cases, where the regressions
seem only caused by timed out; there's no reason your patch would have
an impact on aarch64.
(there 5 configs were tested on the same machine, so overload is indeed likely).

I didn't check why all the ubsan tests now seem to fail, they are in
the "unstable" category because in the past some of them had some
randomness.
I do not see such noise in trunk validation though.

Thanks,

Christophe

>
> Thanks again,
> Daniel

  reply	other threads:[~2021-01-09 13:09 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-12 23:04 [PATCH] " Daniel Engel
2020-11-26  9:14 ` Christophe Lyon
2020-12-02  3:32   ` Daniel Engel
2020-12-16 17:15     ` Christophe Lyon
2021-01-06 11:20       ` [PATCH v3] " Daniel Engel
2021-01-06 17:05         ` Richard Earnshaw
2021-01-07  0:59           ` Daniel Engel
2021-01-07 12:56             ` Richard Earnshaw
2021-01-07 13:27               ` Christophe Lyon
2021-01-07 16:44                 ` Richard Earnshaw
2021-01-09 12:28               ` Daniel Engel
2021-01-09 13:09                 ` Christophe Lyon [this message]
2021-01-09 18:04                   ` Daniel Engel
2021-01-11 14:49                     ` Richard Earnshaw
2021-01-09 18:48                   ` Daniel Engel
2021-01-11 16:07                   ` Christophe Lyon
2021-01-11 16:18                     ` Daniel Engel
2021-01-11 16:39                       ` Christophe Lyon
2021-01-15 11:40                         ` Daniel Engel
2021-01-15 12:30                           ` Christophe Lyon
2021-01-16 16:14                             ` Daniel Engel
2021-01-21 10:29                               ` Christophe Lyon
2021-01-21 20:35                                 ` Daniel Engel
2021-01-22 18:28                                   ` Christophe Lyon
2021-01-25 17:48                                     ` Christophe Lyon
2021-01-25 23:36                                       ` Daniel Engel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKdteOYtmZ=-DJkouURmdhzyJmSCoiQ7d236STM_Teq6kzXwtQ@mail.gmail.com' \
    --to=christophe.lyon@linaro.org \
    --cc=Richard.Earnshaw@foss.arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=libgcc@danielengel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).