Here's an updated version of my ldm/stm peepholes patch for current
trunk.  The main goal of this is to enable ldm/stm generation for Thumb
by using define_peephole2 and peep2_find_free_reg rather than
define_peephole; there are one or two new peepholes to recognize
additional opportunities.
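
For readers less familiar with ldm, here is a rough illustration of the
kind of condition such a peephole has to establish.  This is a toy
sketch in plain C, not code from the patch: struct word_load and
can_merge_into_ldm are names made up for this example, and it assumes
the loads are already sorted by offset.  It only encodes the basic
architectural constraint that ldm loads ascending register numbers from
ascending consecutive word addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one single-word load:
   "ldr r<reg>, [r<base>, #<offset>]".  */
struct word_load {
    int reg;     /* destination register number */
    int base;    /* base address register number */
    int offset;  /* byte offset from the base register */
};

/* Return true if the N loads, assumed sorted by offset, could be
   expressed as a single ldm.  Deliberately conservative: any sequence
   that loads into the base register is rejected.  */
bool can_merge_into_ldm(const struct word_load *loads, size_t n)
{
    assert(loads != NULL || n == 0);

    if (n < 2)
        return false;   /* nothing to combine */

    for (size_t i = 0; i < n; i++) {
        /* All loads must use the same base register.  */
        if (loads[i].base != loads[0].base)
            return false;
        /* Conservatively refuse to overwrite the base register.  */
        if (loads[i].reg == loads[0].base)
            return false;
        if (i > 0) {
            /* Addresses must be consecutive words.  */
            if (loads[i].offset != loads[i - 1].offset + 4)
                return false;
            /* Register numbers must ascend with the addresses.  */
            if (loads[i].reg <= loads[i - 1].reg)
                return false;
        }
    }
    return true;
}
```

When the register order does not fit, a define_peephole2 pattern can in
principle ask peep2_find_free_reg for a scratch register, which is not
possible with the old define_peephole form; the sketch above does not
model that.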

I've rerun Cortex-A9 SPEC2000 benchmarks on our 4.4-based tree, where it
still causes a tiny performance improvement.  Please disregard the
previous set of benchmark results (for limiting this to 3/4- or
4-operation sequences only); I think those results were invalid.  I've
retested these, and limiting the transformation in such a way seems to
cause performance drops.

Previously there were requests to modify performance tuning, but there
were no answers to my questions about how exactly I should go about it,
and no information has been forthcoming about actual processor behaviour
which could be used to implement meaningful tuning.  As requested, I ran
some benchmarks and posted results, which were also ignored
(fortunately, see above).  Since the patch isn't primarily intended to
change code generation significantly on ARM/Thumb-2 code anyway (and
given the performance results mentioned above), I feel it is
unreasonable to hold this up any further.  Additional improvements may
be possible on top of it, but IMO it's a self-contained improvement
as-is.  An earlier patch already introduced the
multiple_operation_profitable_p function which can be used for tuning.

Tested with my usual arm-linux/qemu configuration.  Earlier, I posted a
fix for the PR44404 problem which showed up.

Ok?


Bernd