public inbox for gcc@gcc.gnu.org
* modifying the ARM generation behavior?
@ 2001-09-22  9:31 Josh Fryman
  2001-09-22 12:57 ` Ray Lehtiniemi
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Josh Fryman @ 2001-09-22  9:31 UTC (permalink / raw)
  To: gcc

hi all,

i've got a problem with the way the ARM code is being generated. i'm
wondering if there is any way around this problem that already exists,
or if i'm going to have the privilege of writing some hacks... i 
asked on gcc-help first, but since that struck out, it's on to this
dev list.

here's the problem.  gcc generates code with "data" values interspersed 
in the text segment.  that is, it might generate something like this:

      in C
      ----
      void foo( void )
      {
         int myvar=42;
         printf("myvar=%d\n",myvar);
      }

      in ARM-ASM  (edited to make shorter)
      ----------
      .section        .rodata
      .LC0:
              .ascii  "myvar=%d\012\000"
      .text
      foo:
         .....
              ldr     r0, .L4
              ldr     r1, [fp, #-16]
              bl      printf
              b       .L3
      .L4:
              .word   .LC0
              .word   myvar
      .L3:      
              ldmea   fp, {fp, sp, pc}

note how gcc has "stuck" into the middle of the instruction stream 
some data values at .L4. it would make my life much easier (for research 
purposes) if i could move these little random scattered data segments 
into the main data segment or an alternate data segment...

i realize that the ARM doesn't support a decent load-immediate size 
(only 12 bits signed) for addresses or data, and that was probably
why this approach was taken.  however ... i'd like to make a fundamental
change.

with a few registers already pinned for other uses, like lr, sp, etc,
i'd like to reserve "another" register for being a pointer to a 
special data segment of these values - say r11.  then, at the very 
beginning of the program, r11 gets loaded with a pointer to the data
segment containing all these address offsets, and we no longer have to
mix data into the instruction stream.  this is almost what happens now
with "r3" throughout the program.  it spends most of its life as a 
pointer to a block of these variable addresses...

we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
the code generation.  this would make for more efficient code.

i'd be happy to field any queries on more specifics or suggestions on
existing ways to get around this...

thanks for your time,

josh fryman

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: modifying the ARM generation behavior?
  2001-09-22  9:31 modifying the ARM generation behavior? Josh Fryman
@ 2001-09-22 12:57 ` Ray Lehtiniemi
  2001-09-24  3:44 ` Nick Clifton
  2001-09-24  4:00 ` Richard Earnshaw
  2 siblings, 0 replies; 6+ messages in thread
From: Ray Lehtiniemi @ 2001-09-22 12:57 UTC (permalink / raw)
  To: Josh Fryman; +Cc: gcc

hi josh


On Sat, Sep 22, 2001 at 12:30:47PM -0400, Josh Fryman wrote:
> 
> hi all,
> 
> i've got a problem with the way the ARM code is being generated. i'm
> wondering if there is any way around this problem that already exists,
> or if i'm going to have the privilege of writing some hacks... i 
> asked on gcc-help first, but since that struck out, it's on to this
> dev list.

[snip]

> i'd be happy to field any queries on more specifics or suggestions on
> existing ways to get around this...


i noticed this problem a few months ago and exchanged a few emails with philip
blundell

  http://gcc.gnu.org/ml/gcc/2001-06/msg00910.html


my time constraints and compiler-hacking skills are not up to the task of doing
this myself.  i'd be happy to see this patch appear, though, so please let me
know if there's anything i can do to help.


thanks

-- 
---------------------------------------------------------------------------
    Ray Lehtiniemi <rayl@mail.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: modifying the ARM generation behavior?
  2001-09-22  9:31 modifying the ARM generation behavior? Josh Fryman
  2001-09-22 12:57 ` Ray Lehtiniemi
@ 2001-09-24  3:44 ` Nick Clifton
  2001-09-24  4:00 ` Richard Earnshaw
  2 siblings, 0 replies; 6+ messages in thread
From: Nick Clifton @ 2001-09-24  3:44 UTC (permalink / raw)
  To: Josh Fryman; +Cc: gcc

Hi Josh,

>       in C
>       ----
>       void foo( void )
>       {
>          int myvar=42;
>          printf("myvar=%d\n",myvar);
>       }
> 
>       in ARM-ASM  (edited to make shorter)
>       ----------
>       .section        .rodata
>       .LC0:
>               .ascii  "myvar=%d\012\000"
>       .text
>       foo:
>          .....
>               ldr     r0, .L4
>               ldr     r1, [fp, #-16]
>               bl      printf
>               b       .L3
>       .L4:
>               .word   .LC0
>               .word   myvar
>       .L3:      
>               ldmea   fp, {fp, sp, pc}
> 
> note how gcc has "stuck" into the middle of the instruction stream 
> some data values at .L4.

This is a little unfair.  GCC will normally put these constants at the
end of the function, not in the middle of the instruction stream.
With the current CVS sources, for example, your test case compiles as:

      ....
	ldr	r0, .L2
	mov	r1, #42
	ldr	lr, [sp], #4
	b	printf
.L2:
	.word	.LC0

(This is with -O3 and -fomit-frame-pointer, so the redundant variable
myvar has been eliminated and the call to printf has been turned into
a tailcall).  Even at -O0 the constants are still dumped at the end of
the function and the branch to .L3 is eliminated.  It is true, of
course, that if the function is very big then the constant pool may
have to be dumped inside it, and branches around the pool generated,
but this is a rare occurrence.

> it would make my life much easier (for research purposes) if i could
> move these little random scattered data segments into the main data
> segment or an alternate data segment... 

I am intrigued - how would this help your research ?  Are you
investigating ARMs with Harvard architectures ?

> with a few registers already pinned for other uses, like lr, sp, etc,
> i'd like to reserve "another" register for being a pointer to a 
> special data segment of these values - say r11.

r11 is already in use.  It is the frame pointer.  In fact the ARM is
rather short of "free registers".  r9 to r15 have already been
assigned.  r0-r3 are the argument registers, which only leaves
r4-r8.  The ARM ABI document (the ATPCS) specifies these as variable
registers, so reserving one as a global register would contravene the
specification.  This is not necessarily a huge problem, but you should
be aware of the fact.  It will also mean that you will need to make
sure that you mark the binaries with this feature so that they can be
distinguished from "ordinary" binaries.  (You may also be interested
to know that the next generation of the ARM ABI is being developed.
See http://www.armdevzone.com/ for more information).

> then, at the very beginning of the program, r11 gets loaded with a
> pointer to the data segment containing all these address offsets,
> and we no longer have to mix data into the instruction stream.

What happens if there is too much data to fit into the area pointed to
by r11 ?  (or whichever register is used).  Since this may only be
discovered at link time, it is too late to recompile the objects to
use the old system...

What about shared libraries ?  Would r11 be loaded with a different
value whenever a shared library function is called, or would the
shared libraries' data have to be merged into the application's own data? 

Just some things to think about... :-)

Cheers
        Nick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: modifying the ARM generation behavior?
  2001-09-22  9:31 modifying the ARM generation behavior? Josh Fryman
  2001-09-22 12:57 ` Ray Lehtiniemi
  2001-09-24  3:44 ` Nick Clifton
@ 2001-09-24  4:00 ` Richard Earnshaw
  2001-09-24  5:50   ` Josh Fryman
  2 siblings, 1 reply; 6+ messages in thread
From: Richard Earnshaw @ 2001-09-24  4:00 UTC (permalink / raw)
  To: Josh Fryman; +Cc: gcc, Richard.Earnshaw

> here's the problem.  gcc generates code with "data" values interspersed 
> in the text segment.  that is, it might generate something like this:
> 
>       in C
>       ----
>       void foo( void )
>       {
>          int myvar=42;
>          printf("myvar=%d\n",myvar);
>       }
> 
>       in ARM-ASM  (edited to make shorter)
>       ----------
>       .section        .rodata
>       .LC0:
>               .ascii  "myvar=%d\012\000"
>       .text
>       foo:
>          .....
>               ldr     r0, .L4
>               ldr     r1, [fp, #-16]
>               bl      printf
>               b       .L3
>       .L4:
>               .word   .LC0
>               .word   myvar
>       .L3:      
>               ldmea   fp, {fp, sp, pc}

Hm, which compiler release are you using?  The latest release should at 
least move that section outside of the code-stream for this example.  
Something like

      .text
      foo:
         .....
              ldr     r0, .L4
              ldr     r1, [fp, #-16]
              bl      printf
              ldmea   fp, {fp, sp, pc}
      .L4:
              .word   .LC0
              .word   myvar

(in fact, this particular example will now tail-call if the optimizer is 
on).

> 
> note how gcc has "stuck" into the middle of the instruction stream 
> some data values at .L4. it would make my life much easier (for research 
> purposes) if i could move these little random scattered data segments 
> into the main data segment or an alternate data segment...
> 
> i realize that the ARM doesn't support a decent load-immediate size 
> (only 12 bits signed) for addresses or data, and that was probably
> why this approach was taken.  however ... i'd like to make a fundamental
> change.

Hm, what you are describing is a position-independent data model (in ARM's 
ATPCS parlance, RWPI -- read-write position independent), but taken to the 
extreme that even constants are pushed into the global data tables.  
Analysis has shown that this is typically 3-4% less efficient than the 
current model used (see the ATPCS document on ARM's web pages).  Note that 
in order to make this work you would also need support of the linker, 
since you wouldn't know the offsets from your base register until link 
time.  Further, any moderately large program is going to exceed the 4k 
offset range of your base register, meaning that you will either need to 
create one base value per module (= more code at the start of each module 
to set the base register up) or you will have to compile on the assumption 
that a single ldr can't load a constant, something like

	add	Rtmp, Rbase, #OFFSET_HIGH(offset)
	ldr	Rx, [Rtmp, #OFFSET_LOW(offset)]

For really large programs you might even need two add instructions to get 
all the data.  In either case the linker would then have to be able to fix 
up such sequences once the offset was finally known.

> with a few registers already pinned for other uses, like lr, sp, etc,
> i'd like to reserve "another" register for being a pointer to a 
> special data segment of these values - say r11.  then, at the very 
> beginning of the program, r11 gets loaded with a pointer to the data
> segment containing all these address offsets, and we no longer have to
> mix data into the instruction stream.  this is almost what happens now
> with "r3" throughout the program.  it spends most of its life as a 
> pointer to a block of these variable addresses...

Hm, so on the ARM we currently have 16 registers (well, 15 really, since 
one is the PC).  Of these 5 are call clobbered (r0-r3,ip) and one more 
(lr) is effectively call-clobbered since it holds the return address.  
That leaves 9 registers that are call-saved.  But of these we have 3 
(sometimes 4) that already have designated fixed uses (sp is the stack, fp 
(r11) is needed for a frame pointer and r10 is used as the pic register -- 
on some compilation models, r9 is the pic register and r10 is stack-limit 
register).  That leaves 6, sometimes 5, registers that are call-saved for 
normal use.  You can't use r11 since it is already used, so you would have 
to use r9 (or for some compilations r8), that would use up 15-20% of the 
remaining call-saved registers -- that's likely to have a significant 
effect on the efficiency of the rest of your code, since the compiler will 
now have to spill more often.

> we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
> the code generation.  this would make for more efficient code.

Please show me a real example where we get dozens of such accesses that 
would be avoided by your model; the existing model makes use of the PC as 
an effective base register, you would lose that benefit with your 
approach.

> i'd be happy to field any queries on more specifics or suggestions on
> existing ways to get around this...
> 

I think it probable that code compiled the way you suggest could be made 
to work, but I very much doubt that it would be more efficient.

R.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: modifying the ARM generation behavior?
  2001-09-24  4:00 ` Richard Earnshaw
@ 2001-09-24  5:50   ` Josh Fryman
  2001-09-24 10:16     ` Richard Earnshaw
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Fryman @ 2001-09-24  5:50 UTC (permalink / raw)
  To: Richard.Earnshaw; +Cc: gcc, nickc

hi,

thanks for the replies ;) i'm combining replies to avoid redundancy, 
hope you don't mind.

let me start by saying my current comparison is coming from gcc-2.95.2,
vanilla, and is targeted to the CRL Skiff board (SA-110).  i note now that
you may just tell me to upgrade my compiler version to 3.0.0 or 3.0.1, 
and if you think that will fix part of this in some way, i'll be happy to
give that a go... i don't think it will work, personally, but let me
explain why.

my knowledge of ARM is limited to what i've been picking up en route to
what i'm doing research-wise, but that is the source of the problem.  so
to give you a better understanding of what i want to do, i've put a bit
about what exactly i'm doing at the end of this email.  sorry, this wound
up being a bit windy :(

Nick:
> r11 is already in use.  It is the frame pointer.  In fact the ARM is

i was suggesting r11 as just an exercise to think about, sorry.  i 
didn't mean it literally.  you were both right to jump on me for my 
choice - bad thinking on my part...

Richard:
> Hm, which compiler release are you using?

2.95.2, as stated above.  moving it out of the function body and to the
function end doesn't change my underlying problem.

Nick:
> This is a little unfair.  GCC will normally put these contants at the
> end of the function, not in the middle of the instruction stream.

as discussed below, it may partly be a compiler issue as well as the 
optimization level... so "unfair" may be "unfair for current gcc" but
the toolchain CRL gave me was 2.95.2 based, and i've been hesitant to
replace it... i don't know how many other things i'll have to replace
in the process :(

Richard:
> > Hm, what you are describing is a position-independent data model (in ARM's
> ATPCS parlance, RWPI -- read-write position independent), but taken to the
> extreme that even constants are pushed into the global data tables.

i'll have to read up about this... and the % efficiency drop you mention.
i'd like to know if that drop is seen with *all* ARM compilers, or just 
the ARM compiler.  i note that the ARM compiler does not generate code like
gcc does ... as is true on all architectures i've dealt with, gcc is a good
compiler in general, but if you want completely optimized code, you have
to use the platform specific compiler (intel c, sun devwkshp c, etc).  

Nick:
> to know that the next generation of the ARM ABI is being developed.
> See http://www.armdevzone.com/ for more information).

ooo, more goodies to look through.  thanks, i'll spend some time browsing.

Nick:
> What happens if there is too much data to fit into the area pointed to
> by r11 ?  (or whichever register is used).  Since this may only be

Richard:
> time.  Further, any moderately large program is going to exceed the 4k
> offset range of your base register, meaning that you will either need to
> create one base value per module (= more code at the start of each module
> to set the base register up) or you will have to compile on the assumption
> that a single ldr can't load a constant, something like

you both jumped on this one rather pointedly ;)  it deserves it.  i'm not
sure how silly the idea is.  but ... to answer ...

not necessarily.  i won't put it past some people to have a buttload of 
files in their projects, so the smart decision would be to have the register
be an indirect index itself.  think of something like the idea that you have
1K of 4-byte addresses to "data tables" ... these addresses are in turn
addresses to function- or file-specific "data tables" ... and then there you
are.  you have one setup of the register at program start, and then each
function would have two loads at the top to get the right table into the
register value ... this doesn't seem much of a penalty to me.  makes more
overhead in the data segment... but that could be massaged a bit to reduce
inefficiency.

Richard:
> [re: alternate address load model]
>
>         add     Rtmp, Rbase, #OFFSET_HIGH(offset)
>         ldr     Rx, [Rtmp, #OFFSET_LOW(offset)]

this is sort of how Sparc (and other) systems work. they do have a different
instruction pattern for it, but it does the two-stage load... and it's very
easy to catch in the code by a parser like mine.  i know exactly what it's
doing... in the sparc.  because it doesn't use register offset addressing.
makes me wish very fervently that the ARM had an instruction like the "bl"
or "b" -- something like "ldrhi r3, <high-imm-16bit>", and the "ldrlo"
follow-on.  

Nick:
> What about shared libraries ?  Would r11 be loaded with a different
> value whenever a shared library function is called, or would the
> share'd libraries data have to merged into the application's own data?

uhhh... good question.  i'll have to think about the shared libraries
aspect.  i don't see it as a major obstacle given the above multiple-level
pointer trick, but it might make the fixup a little dicey.  i was assuming
a function would always know what offset path to follow, and that would
be inserted properly by the compiler and the values stuffed in by the linker... 
shared libraries are a different problem.  i'm not using them, so i didn't
think about it.  

Richard:
> normal use.  You can't use r11 since it is already used, so you would have
> to use r9 (or for some compilations r8), that would use up 15-20% of the
> remaining call-saved registers -- that's likely to have a significant
> effect on the efficiency of the rest of your code, since the compiler will
> now have to spill more often.

maybe, i don't think so.  (ignoring which particular rN is being used.)  in
the test code i've generated (prior to using -ffixed-r8), i've looked through
the assembly output quite a bit.  i have yet to see (this is working through
adpcm codecs, mpeg codecs, jpeg codecs, and some custom test apps) anything
use *more* than r0-r5.  i have never seen an r6, r7, r8 reference *anywhere*.

to be honest, i find that kind of odd and don't understand why.  maybe there's
some hidden penalty in using r6-r8 the compiler knows about that i don't? but
given this, i see no problem in taking off another register that doesn't seem
to be used anyway.  (note, i don't use -O2, i use -O1 - could be an 
optimization issue...)

Richard:
> > we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
> > the code generation.  this would make for more efficient code.
> 
> Please show me a real example where we get dozens of such accesses that
> would be avoided by your model; the existing model makes use of the PC as
> an effective base register, you would loose that benefit with your
> approach.

yeah, there are pros/cons.  i don't really know how to solve this particular
problem.  but, you wanted a real code example, here ya go:

here's a fairly simple function i was testing through the system, and noted
the behavior on first:

test.c:
---------
#include <stdio.h>

extern int s1( int );
extern int s2( int );
extern int s3( int );

extern int g1;

int debug( void )
{
   int g;

   printf("debugging s1/2/3...\n");
   for (g=0; g<10; g++)
      printf("s1(%d) = %d, s2(%d)=%d, s3(%d)=%d\n", g, s1(g), g, s2(g), g, s3(g) );
   printf("end debug...\n");

   g1 *= s3(g);
   printf("g1 is now %d\n", g1);

   g = s1(10) + s2(20);
   return g;
}

here's a snippet of the asm output from gcc ...

        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s1
        mov     r4, r0
        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s2
        mov     r5, r0
        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s3
        mov     r1, r0
        ldr     r2, .L7
        ldr     r3, .L7
        str     r5, [sp, #0]

sure looks like a lot of "ldr r3, .L7" to me ;)  however, i note that i'm using
a very long string of flags to gcc, as well as an older version.  when i went back
and undid many of the flags, and put the optimization level at O2 (i use O1 at
present, O2 has some side effects i haven't figured out how to deal with yet)
i *do* get output that looks quite different:

.LM5:
        ldr     r0, [sp, #12]
        bl      s1
        mov     r4, r0
        ldr     r0, [sp, #12]
        bl      s2
        mov     r5, r0
        ldr     r0, [sp, #12]
        bl      s3
        mov     r3, r0
        str     r5, [sp, #0]
        ldr     r2, [sp, #12]

so i'm not sure if what i'm seeing is necessarily normal behavior.  kind of hard
to tell.

> I think it probable that code compiled the way you suggest could be made
> to work, but I very much doubt that it would be more efficient.

i'd be willing to settle for neutral.  a very small performance loss i could
accept - if we can make this work, we'd do an actual implementation that had
hardware support for our check-routines and better granularity on page controls.
then we might actually have a viable system...

then again, we may suck eggs.  it's always hard to tell this early on which
way it will go ;)

any other thoughts on the situation?

thanks for the feedback!

-josh fryman

[ begin research description : ignore rest of email if uninterested ]

for a research project we're looking at a different model of CPU design
that would have *no* caches.  think of a remote sensor device with a 
very small (~4-8K) on-chip memory footprint and that's it - plus some way
to interface a sensor array.  (sensor = uninteresting black box here.)
so essentially all the space that would be cache is now the RAM we have
available.  we want to dynamically page in/out code and data from this
space, for our little sensor uP is connected to a backend powerful server
via some link (serial, IR, ethernet, wireless, whatever).  so the 
uP will run a little set of "stubs" that will obtain code snippets from
the server, run them, and intercept ld/st and b/bl situations ... to
remap the instruction to (a) the proper address if resident; or (b) to
fetch the proper chunk needed and then remap - this may involve shipping
code/data back to the server.  the concept is that assuming our memory is
sufficiently large for our "hot" code (adpcm coding, whatever) that 
eventually we reach steady-state and no longer talk to the server except
to send "cooked" sensor data back.

this is the final goal.  the current implementation is a test prototype
running under linux... the app and server run on the same skiff board
(or not) and communicate via generic socket read/write ops.  i have a
client which receives function-sized chunks and executes them, asking
the server periodically for more code... for now, we ignore the data
segment paging by just allocating enough memory and not trying to manage
it.  managing the code is difficult enough.

the problem we immediately run into is that for sub-function-sized 
page granularity, it's very difficult in a "nice" way to catch the 
memory references that are loading data from the instruction stream.
(from those little tables in the text segment i'm complaining about ;)

i'm explicitly compiling the "app" to run on our client such that there is
a big separation between I- and D- sections -- I at 0x021... and D at
0x022... if we were running in a real embedded environment (no linux),
i could probably just set the MMU up to catch these for me, assuming
i could get it to recognize the small page size... 

but here, when the server is parsing the code chunk to send to the 
client, it replaces "bl myfunc" with "bl bl_intercept" ... the intercept
will do the negotiation with the server.  for me to use finer grain chunks 
and break up the function (say, on any "b offs" or "bl func") then i need
to move the code chunks around in memory to be non-contiguous.  the
problem here is that now i need to have more knowledge of the code 
and be on the lookout for "ld r3, <some ofs addr in I-space>" and 
replace that with something like "bl ld_intercept" where i go and
do all the address work elsewhere for the load.  in all probability,
i'd wind up exceeding the limited offset range of the ldr instruction
to just remap it...

you may think "so what?" - i'm taking a huge performance hit with
the bl-intercept routine, so what difference does the ld-intercept
make?  the reason is that as we page code in to us, we self-modify
those "bl b_intercept" to actually become "bl <new-real-address>" ...
and when we page code out, we replace any call sites to the now-
removed code with "bl b_intercept" so we can reload the code as 
needed.  so in essence, we take the hit once, and then never again,
when we reach that "hot code" steady state...

the problem is i can't see a way to remove the ld_intercept *ever*
because i may always exceed the offset space of the ldr-instr,
not to mention the extra complexity in server-side book-keeping.

if i could just stuff all those tables at a fixed address in memory
that i can keep track of in some way, that would make my life much
easier.  (maybe by defining a new segment ".funcvars" and sticking
them there ... by whatever means i can make it work...)

hope this info helps paint the broader picture...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: modifying the ARM generation behavior?
  2001-09-24  5:50   ` Josh Fryman
@ 2001-09-24 10:16     ` Richard Earnshaw
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Earnshaw @ 2001-09-24 10:16 UTC (permalink / raw)
  To: Josh Fryman; +Cc: Richard.Earnshaw, gcc, nickc

> Richard:
> > Hm, which compiler release are you using?
> 
> 2.95.2, as stated above.  moving it out of the function body and to the
> function end doesn't change my underlying problem.

Indeed, though it does help to explain some of the oddities.

> Richard:
> > Hm, what you are describing is a position-independent data model (in ARM's
> > ATPCS parlance, RWPI -- read-write position independent), but taken to the
> > extreme that even constants are pushed into the global data tables.
> 
> i'll have to read up about this... and the % efficiency drop you mention.
> i'd like to know if that drop is seen with *all* ARM compilers, or just 
> the ARM compiler.  i note that the ARM compiler does not generate code like
> gcc does ... as is true on all architectures i've dealt with, gcc is a good
> compiler in general, but if you want completely optimized code, you have
> to use the platform specific compiler (intel c, sun devwkshp c, etc).  

I'll go into it further below, but I remain to be convinced you are 
enabling the optimizer.  The ARM compiler very definitely puts constants 
that can't be synthesized into the code segment.  Where possible these 
will be placed between functions, but occasionally, when a function is 
large, then they will get placed at convenient points in the function 
body.  Where possible this will be after a natural branch instruction; but 
very occasionally even this isn't possible and the compiler will have to 
insert a jump around the data table.

> Nick:
> > What happens if there is too much data to fit into the area pointed to
> > by r11 ?  (or whichever register is used).  Since this may only be
> 
> Richard:
> > time.  Further, any moderately large program is going to exceed the 4k
> > offset range of your base register, meaning that you will either need to
> > create one base value per module (= more code at the start of each module
> > to set the base register up) or you will have to compile on the assumption
> > that a single ldr can't load a constant, something like
> 
> you both jumped on this one rather pointedly ;)  it deserves it.  i'm not
> sure how silly the idea is.  but ... to answer ...
> 
> not necessarily.  i won't put it past some people to have a buttload of 
> files in their projects, so the smart decision would be to have the register
> be an indirect index itself.  think of something like the idea that you have
> 1K of 4-byte addresses to "data tables" ... these addresses are in turn
> addresses to function- or file-specific "data tables" ... and then there you
> are.  you have one setup of the register at program start, and then each
> function would have two loads at the top to get the right table into the
> register value ... this doesn't seem much of a penalty to me.  makes more
> overhead in the data segment... but that could be massaged a bit to reduce
> inefficiency.

Again, this is similar to the RWPI (or even the ROPI) shared library model 
which allows for multiple tables; though that model has a more efficient 
way of handling multiple tables that normally avoids more than one 
additional indirection per function.

> 
> Richard:
> > [re: alternate address load model]
> >
> >         add     Rtmp, Rbase, #OFFSET_HIGH(offset)
> >         ldr     Rx, [Rtmp, #OFFSET_LOW(offset)]
> 
> this is sort of how Sparc (and other) systems work. they do have a different
> instruction pattern for it, but it does the two-stage load... and it's very
> easy to catch in the code by a parser like mine.  i know exactly what it's
> doing... in the sparc.  because it doesn't use register offset addressing.
> makes me wish very fervently that the ARM had an instruction like the "bl"
> or "b" -- something like "ldrhi r3, <high-imm-16bit>", and the "ldrlo"
> follow-on.  

Yes, but the SPARC can access the full 32-bit address space with that 
model.  On the ARM that still only buys you 20 bits of offset (probably 
enough for most cases from a single pointer, but still 12 bits short of 
the full range).

> Richard:
> > normal use.  You can't use r11 since it is already used, so you would have
> > to use r9 (or for some compilations r8), that would use up 15-20% of the
> > remaining call-saved registers -- that's likely to have a significant
> > effect on the efficiency of the rest of your code, since the compiler will
> > now have to spill more often.
> 
> maybe, i don't think so.  (ignoring which particular rN is being used.)  in
> the test code i've generated (prior to using -ffixed-r8), i've looked through
> the assembly output quite a bit.  i have yet to see (this is working through
> adpcm codecs, mpeg codecs, jpeg codecs, and some custom test apps) anything
> use *more* than r0-r5.  i have never seen an r6, r7, r8 reference *anywhere*.

I'm not convinced you are turning the optimizer on (or you have *very* 
small functions).

> here's a fairly simple function i was testing through the system, and noted
> the behavior on first:
> 
> test.c:
> ---------
> #include <stdio.h>
> 
> extern int s1( int );
> extern int s2( int );
> extern int s3( int );
> 
> extern int g1;
> 
> int debug( void )
> {
>    int g;
> 
>    printf("debugging s1/2/3...\n");
>    for (g=0; g<10; g++)
>       printf("s1(%d) = %d, s2(%d)=%d, s3(%d)=%d\n", g, s1(g), g, s2(g), g, s3(g) );
>    printf("end debug...\n");
> 
>    g1 *= s3(g);
>    printf("g1 is now %d\n", g1);
> 
>    g = s1(10) + s2(20);
>    return g;
> }
> 

I cannot get the compiler to generate the following output that you have:

> here's a snippet of the asm output from gcc ...
> 
>         ldr     r3, .L7
>         ldr     r0, [r3, #0]
>         bl      s1
>         mov     r4, r0
>         ldr     r3, .L7

> sure looks like a lot of "ldr r3, .L7" to me ;)  however, i note that i'm using
> a very long string of flags to gcc, as well as an older version.  

This doesn't make sense for your source code.  The variable passed to s1 
is "g", a local variable.  The only places this can exist are in a 
register, or on the stack.  In no case can it then be referenced by 
looking it up through a constant data pointer -- there's no way the 
compiler could know where on the stack it would be at compile time.  Are 
you sure this example was compiled from your posted code?

> when i went back
> and undid many of the flags, and put the optimization level at O2 (i use O1 at
> present, O2 has some side effects i haven't figured out how to deal with yet)
> i *do* get output that looks different:
> 
> .LM5:
>         ldr     r0, [sp, #12]
>         bl      s1
>         mov     r4, r0
>         ldr     r0, [sp, #12]
>         bl      s2
>         mov     r5, r0
>         ldr     r0, [sp, #12]
>         bl      s3
>         mov     r3, r0
>         str     r5, [sp, #0]
>         ldr     r2, [sp, #12]

I can get this sort of output if I use -O0 -fomit-frame-pointer, but in no 
other way.

Compiling with ANY level of optimization on gives

.L6:
        mov     r0, r6
        bl      s1
        mov     r5, r0
        mov     r0, r6
        bl      s2
        mov     r4, r0
        mov     r0, r6
        bl      s3

If you've got a long list of flags that are being passed to the compiler, 
please check them carefully to ensure that a flag later on the command 
line isn't turning the optimizer off again.

> [ begin research description : ignore rest of email if uninterested ]
> 
> for a research project we're looking at a different model of CPU design
> that would have *no* caches.  think of a remote sensor device with a 
> very small (~4-8K) on-chip memory footprint and that's it - plus some way
> to interface a sensor array.  (sensor = uninteresting black box here.)
> so essentially all the space that would be cache is now the RAM we have
> available.  we want to dynamically page in/out code and data from this
> space, for our little sensor uP is connected to a backend powerful server
> via some link (serial, IR, ethernet, wireless, whatever).  so the 
> uP will run a little set of "stubs" that will obtain code snippets from
> the server, run them, and intercept ld/st and b/bl situations ... to
> remap the instruction to (a) the proper address if resident; or (b) to
> fetch the proper chunk needed and then remap - this may involve shipping
> code/data back to the server.  the concept is that assuming our memory is
> sufficiently large for our "hot" code (adpcm coding, whatever) that 
> eventually we reach steady-state and no longer talk to the server except
> to send "cooked" sensor data back.

This doesn't really sound any different from a demand-paged virtual memory 
system, except, perhaps, that you are trying to change the code directly, 
rather than having additional hardware in the CPU to manage that for you.

> but here, when the server is parsing the code chunk to send to the 
> client, it replaces "bl myfunc" with "bl bl_intercept" ... the intercept
> will do the negotiation with the server.  for me to use finer grain chunks 
> and break up the function (say, on any "b offs" or "bl func") then i need
> to move the code chunks around in memory to be non-contiguous.  the
> problem here is that now i need to have more knowledge of the code 
> and be on the lookout for "ld r3, <some ofs addr in I-space>" and 
> replace that with something like "bl ld_intercept" where i go and
> do all the address work elsewhere for the load.  in all probability,
> i'd wind up exceeding the limited offset range of the ldr instruction
> to just remap it...
> 
> you may think "so what?" - i'm taking a huge performance hit with
> the bl-intercept routine, so what difference does the ld-intercept
> make?  the reason is that as we page code in to us, we self-modify
> those "bl b_intercept" to actually become "bl <new-real-address>" ...
> and when we page code out, we replace any call sites to the now-
> removed code with "bl b_intercept" so we can reload the code as 
> needed.  so in essence, we take the hit once, and then never again,
> when we reach that "hot code" steady state...

Ok, so presumably your server has to remember what the original address 
was when fixing up the bl (so that when executed it can repair the 
damage).  Why can't you extend this to the load/store and replace them 
with something like
	ldr rd, [r0, -r0]

This will always cause a load/store to address zero, and it would be easy 
to make either your memory system or MMU fault such an access.  Then you 
could catch that with a segmentation fault handler and put in the correct 
address before resuming execution.

R.



Thread overview: 6+ messages
2001-09-22  9:31 modifying the ARM generation behavior? Josh Fryman
2001-09-22 12:57 ` Ray Lehtiniemi
2001-09-24  3:44 ` Nick Clifton
2001-09-24  4:00 ` Richard Earnshaw
2001-09-24  5:50   ` Josh Fryman
2001-09-24 10:16     ` Richard Earnshaw
