* modifying the ARM generation behavior?
@ 2001-09-22 9:31 Josh Fryman
2001-09-22 12:57 ` Ray Lehtiniemi
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Josh Fryman @ 2001-09-22 9:31 UTC (permalink / raw)
To: gcc
hi all,
i've got a problem with the way the ARM code is being generated. i'm
wondering if there is any way around this problem that already exists,
or if i'm going to have the privilege of writing some hacks... i
asked on gcc-help first, but since that struck out, it's on to this
dev list.
here's the problem. gcc generates code with "data" values interspersed
in the text segment. that is, it might generate something like this:
in C
----
void foo( void )
{
int myvar=42;
printf("myvar=%d\n",myvar);
}
in ARM-ASM (edited to make shorter)
----------
.section .rodata
.LC0:
.ascii "myvar=%d\012\000"
.text
foo:
.....
ldr r0, .L4
ldr r1, [fp, #-16]
bl printf
b .L3
.L4:
.word .LC0
.word myvar
.L3:
ldmea fp, {fp, sp, pc}
note how gcc has "stuck" into the middle of the instruction stream
some data values at .L4. it would make my life much easier (for research
purposes) if i could move these little random scattered data segments
into the main data segment or an alternate data segment...
i realize that the ARM doesn't support a decent load-immediate size
(only 12 bits signed) for addresses or data, and that was probably
why this approach was taken. however ... i'd like to make a fundamental
change.
with a few registers already pinned for other uses, like lr, sp, etc,
i'd like to reserve "another" register for being a pointer to a
special data segment of these values - say r11. then, at the very
beginning of the program, r11 gets loaded with a pointer to the data
segment containing all these address offsets, and we no longer have to
mix data into the instruction stream. this is almost what happens now
with "r3" throughout the program. it spends most of its life as a
pointer to a block of these variable addresses...
we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
the code generation. this would make for more efficient code.
i'd be happy to field any queries on more specifics or suggestions on
existing ways to get around this...
thanks for your time,
josh fryman
* Re: modifying the ARM generation behavior?
2001-09-22 9:31 modifying the ARM generation behavior? Josh Fryman
@ 2001-09-22 12:57 ` Ray Lehtiniemi
2001-09-24 3:44 ` Nick Clifton
2001-09-24 4:00 ` Richard Earnshaw
2 siblings, 0 replies; 6+ messages in thread
From: Ray Lehtiniemi @ 2001-09-22 12:57 UTC (permalink / raw)
To: Josh Fryman; +Cc: gcc
hi josh
On Sat, Sep 22, 2001 at 12:30:47PM -0400, Josh Fryman wrote:
>
> hi all,
>
> i've got a problem with the way the ARM code is being generated. i'm
> wondering if there is any way around this problem that already exists,
> or if i'm going to have the privilege of writing some hacks... i
> asked on gcc-help first, but since that struck out, it's on to this
> dev list.
[snip]
> i'd be happy to field any queries on more specifics or suggestions on
> existing ways to get around this...
i noticed this problem a few months ago and exchanged a few emails with philip
blundell
http://gcc.gnu.org/ml/gcc/2001-06/msg00910.html
my time constraints and compiler-hacking skills are not up to the task of doing
this myself. i'd be happy to see this patch appear, though, so please let me
know if there's anything i can do to help.
thanks
--
---------------------------------------------------------------------------
Ray Lehtiniemi <rayl@mail.com>
* Re: modifying the ARM generation behavior?
2001-09-22 9:31 modifying the ARM generation behavior? Josh Fryman
2001-09-22 12:57 ` Ray Lehtiniemi
@ 2001-09-24 3:44 ` Nick Clifton
2001-09-24 4:00 ` Richard Earnshaw
2 siblings, 0 replies; 6+ messages in thread
From: Nick Clifton @ 2001-09-24 3:44 UTC (permalink / raw)
To: Josh Fryman; +Cc: gcc
Hi Josh,
> in C
> ----
> void foo( void )
> {
> int myvar=42;
> printf("myvar=%d\n",myvar);
> }
>
> in ARM-ASM (edited to make shorter)
> ----------
> .section .rodata
> .LC0:
> .ascii "myvar=%d\012\000"
> .text
> foo:
> .....
> ldr r0, .L4
> ldr r1, [fp, #-16]
> bl printf
> b .L3
> .L4:
> .word .LC0
> .word myvar
> .L3:
> ldmea fp, {fp, sp, pc}
>
> note how gcc has "stuck" into the middle of the instruction stream
> some data values at .L4.
This is a little unfair. GCC will normally put these constants at the
end of the function, not in the middle of the instruction stream.
With the current CVS sources for example, your test case compiles as:
....
ldr r0, .L2
mov r1, #42
ldr lr, [sp], #4
b printf
.L2:
.word .LC0
(This is with -O3 and -fomit-frame-pointer, so the redundant variable
myvar has been eliminated and the call to printf has been turned into
a tailcall). Even at -O0 the constants are still dumped at the end of
the function and the branch to .L3 is eliminated. It is true, of
course, that if the function is very big then the constant pool may
have to be dumped inside it, and branches around the pool generated,
but this is a rare occurrence.
> it would make my life much easier (for research purposes) if i could
> move these little random scattered data segments into the main data
> segment or an alternate data segment...
I am intrigued - how would this help your research ? Are you
investigating ARMs with Harvard architectures ?
> with a few registers already pinned for other uses, like lr, sp, etc,
> i'd like to reserve "another" register for being a pointer to a
> special data segment of these values - say r11.
r11 is already in use. It is the frame pointer. In fact the ARM is
rather short of "free registers". r9 to r15 have already been
assigned. r0 - r3 are the argument registers which only leaves r4 -
r8. The ARM ABI document (the ATPCS) specifies these as variable
registers, so reserving one as a global register would contravene the
specification. This is not necessarily a huge problem, but you should
be aware of the fact. It will also mean that you will need to make
sure that you mark the binaries with this feature so that they can be
distinguished from "ordinary" binaries. (You may also be interested
to know that the next generation of the ARM ABI is being developed.
See http://www.armdevzone.com/ for more information).
> then, at the very beginning of the program, r11 gets loaded with a
> pointer to the data segment containing all these address offsets,
> and we no longer have to mix data into the instruction stream.
What happens if there is too much data to fit into the area pointed to
by r11 ? (or whichever register is used). Since this may only be
discovered at link time, it is too late to recompile the objects to
use the old system...
What about shared libraries ? Would r11 be loaded with a different
value whenever a shared library function is called, or would the
shared libraries' data have to be merged into the application's own data?
Just some things to think about... :-)
Cheers
Nick
* Re: modifying the ARM generation behavior?
2001-09-22 9:31 modifying the ARM generation behavior? Josh Fryman
2001-09-22 12:57 ` Ray Lehtiniemi
2001-09-24 3:44 ` Nick Clifton
@ 2001-09-24 4:00 ` Richard Earnshaw
2001-09-24 5:50 ` Josh Fryman
2 siblings, 1 reply; 6+ messages in thread
From: Richard Earnshaw @ 2001-09-24 4:00 UTC (permalink / raw)
To: Josh Fryman; +Cc: gcc, Richard.Earnshaw
> here's the problem. gcc generates code with "data" values interspersed
> in the text segment. that is, it might generate something like this:
>
> in C
> ----
> void foo( void )
> {
> int myvar=42;
> printf("myvar=%d\n",myvar);
> }
>
> in ARM-ASM (edited to make shorter)
> ----------
> .section .rodata
> .LC0:
> .ascii "myvar=%d\012\000"
> .text
> foo:
> .....
> ldr r0, .L4
> ldr r1, [fp, #-16]
> bl printf
> b .L3
> .L4:
> .word .LC0
> .word myvar
> .L3:
> ldmea fp, {fp, sp, pc}
Hm, which compiler release are you using? The latest release should at
least move that section outside of the code-stream for this example.
Something like
.text
foo:
.....
ldr r0, .L4
ldr r1, [fp, #-16]
bl printf
ldmea fp, {fp, sp, pc}
.L4:
.word .LC0
.word myvar
(in fact, this particular example will now tail-call if the optimizer is
on).
>
> note how gcc has "stuck" into the middle of the instruction stream
> some data values at .L4. it would make my life much easier (for research
> purposes) if i could move these little random scattered data segments
> into the main data segment or an alternate data segment...
>
> i realize that the ARM doesn't support a decent load-immediate size
> (only 12 bits signed) for addresses or data, and that was probably
> why this approach was taken. however ... i'd like to make a fundamental
> change.
Hm, what you are describing is a position-independent data model (in ARM's
ATPCS parlance, RWPI -- read-write position independent), but taken to the
extreme that even constants are pushed into the global data tables.
Analysis has shown that this is typically 3-4% less efficient than the
current model used (see the ATPCS document on ARM's web pages). Note that
in order to make this work you would also need support of the linker,
since you wouldn't know the offsets from your base register until link
time. Further, any moderately large program is going to exceed the 4k
offset range of your base register, meaning that you will either need to
create one base value per module (= more code at the start of each module
to set the base register up) or you will have to compile on the assumption
that a single ldr can't load a constant, something like
add Rtmp, Rbase, #OFFSET_HIGH(offset)
ldr Rx, [Rtmp, #OFFSET_LOW(offset)]
For really large programs you might even need two add instructions to get
all the data. In either case the linker would then have to be able to fix
up such sequences once the offset was finally known.
> with a few registers already pinned for other uses, like lr, sp, etc,
> i'd like to reserve "another" register for being a pointer to a
> special data segment of these values - say r11. then, at the very
> beginning of the program, r11 gets loaded with a pointer to the data
> segment containing all these address offsets, and we no longer have to
> mix data into the instruction stream. this is almost what happens now
> with "r3" throughout the program. it spends most of its life as a
> pointer to a block of these variable addresses...
Hm, so on the ARM we currently have 16 registers (well, 15 really, since
one is the PC). Of these, 5 are call-clobbered (r0-r3, ip) and one more
(lr) is effectively call-clobbered since it holds the return address.
That leaves 9 registers that are call-saved. But of these we have 3
(sometimes 4) that already have designated fixed uses (sp is the stack, fp
(r11) is needed for a frame pointer and r10 is used as the pic register --
on some compilation models, r9 is the pic register and r10 is stack-limit
register). That leaves 6, sometimes 5, registers that are call-saved for
normal use. You can't use r11 since it is already used, so you would have
to use r9 (or for some compilations r8), that would use up 15-20% of the
remaining call-saved registers -- that's likely to have a significant
effect on the efficiency of the rest of your code, since the compiler will
now have to spill more often.
> we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
> the code generation. this would make for more efficient code.
Please show me a real example where we get dozens of such accesses that
would be avoided by your model; the existing model makes use of the PC as
an effective base register, you would lose that benefit with your
approach.
> i'd be happy to field any queries on more specifics or suggestsions on
> existing ways to get around this...
>
I think it probable that code compiled the way you suggest could be made
to work, but I very much doubt that it would be more efficient.
R.
* Re: modifying the ARM generation behavior?
2001-09-24 4:00 ` Richard Earnshaw
@ 2001-09-24 5:50 ` Josh Fryman
2001-09-24 10:16 ` Richard Earnshaw
0 siblings, 1 reply; 6+ messages in thread
From: Josh Fryman @ 2001-09-24 5:50 UTC (permalink / raw)
To: Richard.Earnshaw; +Cc: gcc, nickc
hi,
thanks for the replies ;) i'm combining replies to avoid redundancy,
hope you don't mind.
let me start by saying my current comparison is coming from gcc-2.95.2,
vanilla, and is targeted to the CRL Skiff board (SA-110). i note now that
you may just tell me to upgrade my compiler version to 3.0.0 or 3.0.1,
and if you think that will fix part of this in some way, i'll be happy to
give that a go... i don't think it will work, personally, but let me
explain why.
my knowledge of ARM is limited to what i've been picking up en route to
what i'm doing research-wise, but that is the source of the problem. so
to give you a better understanding of what i want to do, i've put a bit
about what exactly i'm doing at the end of this email. sorry, this wound
up being a bit windy :(
Nick:
> r11 is already in use. It is the frame pointer. In fact the ARM is
i was suggesting r11 as just an exercise to think about, sorry. i
didn't mean it literally. you were both right to jump on me for my
choice - bad thinking on my part...
Richard:
> Hm, which compiler release are you using?
2.95.2, as stated above. moving it out of the function body and to the
function end doesn't change my underlying problem.
Nick:
> This is a little unfair. GCC will normally put these contants at the
> end of the function, not in the middle of the instruction stream.
as discussed below, it may partly be a compiler issue as well as the
optimization level... so "unfair" may be "unfair for current gcc" but
the toolchain CRL gave me was 2.95.2 based, and i've been hesitant to
replace it... i don't know how many other things i'll have to replace
in the process :(
Richard:
> Hm, what you are describing is a position-independent data model (in ARM's
> ATPCS parlance, RWPI -- read-write position independent), but taken to the
> extreme that even constants are pushed into the global data tables.
i'll have to read up about this... and the % efficiency drop you mention.
i'd like to know if that drop is seen with *all* ARM compilers, or just
the ARM compiler. i note that the ARM compiler does not generate code like
gcc does ... as is true on all architectures i've dealt with, gcc is a good
compiler in general, but if you want completely optimized code, you have
to use the platform specific compiler (intel c, sun devwkshp c, etc).
Nick:
> to know that the next generation of the ARM ABI is being developed.
> See http://www.armdevzone.com/ for more information).
ooo, more goodies to look through. thanks, i'll spend some time browsing.
Nick:
> What happens if there is too much data to fit into the area pointed to
> by r11 ? (or whichever register is used). Since this may only be
Richard:
> time. Further, any moderately large program is going to exceed the 4k
> offset range of your base register, meaning that you will either need to
> create one base value per module (= more code at the start of each module
> to set the base register up) or you will have to compile on the assumption
> that a single ldr can't load a constant, something like
you both jumped on this one rather pointedly ;) it deserves it. i'm not
sure how silly the idea is. but ... to answer ...
not necessarily. i won't put it past some people to have a buttload of
files in their projects, so the smart decision would be to have the register
be an indirect index itself. think of something like the idea that you have
1K of 4-byte addresses to "data tables" ... these addresses are in turn
addresses to function- or file-specific "data tables" ... and then there you
are. you have one setup of the register at program start, and then each
function would have two loads at the top to get the right table into the
register value ... this doesn't seem much of a penalty to me. makes more
overhead in the data segment... but that could be massaged a bit to reduce
inefficiency.
Richard:
> [re: alternate address load model]
>
> add Rtmp, Rbase, #OFFSET_HIGH(offset)
> ldr Rx, [Rtmp, #OFFSET_LOW(offset)]
this is sort of how Sparc (and other) systems work. they do have a different
instruction pattern for it, but it does the two-stage load... and it's very
easy to catch in the code by a parser like mine. i know exactly what it's
doing... in the sparc. because it doesn't use register offset addressing.
makes me wish very fervently that the ARM had an instruction like the "bl"
or "b" -- something like "ldrhi r3, <high-imm-16bit>", and the "ldrlo"
follow-on.
Nick:
> What about shared libraries ? Would r11 be loaded with a different
> value whenever a shared library function is called, or would the
> shared libraries' data have to be merged into the application's own data?
uhhh... good question. i'll have to think about the shared libraries
aspect. i don't see it as a major obstacle given the above multiple-level
pointer trick, but it might make the fixup a little dicey. i was assuming
a function would always know what offset path to follow, and that would
be inserted properly by the compiler and values stuffed by the linker...
shared libraries are a different problem. i'm not using them, so i didn't
think about it.
Richard:
> normal use. You can't use r11 since it is already used, so you would have
> to use r9 (or for some compilations r8), that would use up 15-20% of the
> remaining call-saved registers -- that's likely to have a significant
> effect on the efficiency of the rest of your code, since the compiler will
> now have to spill more often.
maybe, i don't think so. (ignoring which particular rN is being used.) in
the test code i've generated (prior to using -ffixed-r8), i've looked through
the assembly output quite a bit. i have yet to see (this is working through
adpcm codecs, mpeg codecs, jpeg codecs, and some custom test apps) anything
use *more* than r0-r5. i have never seen an r6, r7, r8 reference *anywhere*.
to be honest, i find that kind of odd and don't understand why. maybe there's
some hidden penalty in using r6-r8 the compiler knows about that i don't? but
given this, i see no problem in taking off another register that doesn't seem
to be used anyway. (note, i don't use -O2, i use -O1 - could be an
optimization issue...)
Richard:
> > we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
> > the code generation. this would make for more efficient code.
>
> Please show me a real example where we get dozens of such accesses that
> would be avoided by your model; the existing model makes use of the PC as
> an effective base register, you would loose that benefit with your
> approach.
yeah, there are pros/cons. i don't really know how to solve this particular
problem. but, you wanted a real code example, here ya go:
here's a fairly simple function i was testing through the system, and noted
the behavior on first:
test.c:
---------
#include <stdio.h>
extern int s1( int );
extern int s2( int );
extern int s3( int );
extern int g1;
int debug( void )
{
int g;
printf("debugging s1/2/3...\n");
for (g=0; g<10; g++)
printf("s1(%d) = %d, s2(%d)=%d, s3(%d)=%d\n", g, s1(g), g, s2(g), g, s3(g) );
printf("end debug...\n");
g1 *= s3(g);
printf("g1 is now %d\n", g1);
g = s1(10) + s2(20);
return g;
}
here's a snippet of the asm output from gcc ...
ldr r3, .L7
ldr r0, [r3, #0]
bl s1
mov r4, r0
ldr r3, .L7
ldr r0, [r3, #0]
bl s2
mov r5, r0
ldr r3, .L7
ldr r0, [r3, #0]
bl s3
mov r1, r0
ldr r2, .L7
ldr r3, .L7
str r5, [sp, #0]
sure looks like a lot of "ldr r3, .L7" to me ;) however, i note that i'm using
a very long string of flags to gcc, as well as an older version. when i went back
and undid many of the flags, and put the optimization level at O2 (i use O1 at
present, O2 has some side effects i haven't figured out how to deal with yet)
i *do* get output that looks different:
.LM5:
ldr r0, [sp, #12]
bl s1
mov r4, r0
ldr r0, [sp, #12]
bl s2
mov r5, r0
ldr r0, [sp, #12]
bl s3
mov r3, r0
str r5, [sp, #0]
ldr r2, [sp, #12]
so i'm not sure if what i'm seeing is necessarily normal behavior. kind of hard
to tell.
> I think it probable that code compiled the way you suggest could be made
> to work, but I very much doubt that it would be more efficient.
i'd be willing to settle for neutral. a very small performance loss i could
accept - if we can make this work, we'd do an actual implementation that had
hardware support for our check-routines and better granularity on page controls.
then we might actually have a viable system...
then again, we may suck eggs. it's always hard to tell this early on which
way it will go ;)
any other thoughts on the situation?
thanks for the feedback!
-josh fryman
[ begin research description : ignore rest of email if uninterested ]
for a research project we're looking at a different model of CPU design
that would have *no* caches. think of a remote sensor device with a
very small (~4-8K) on-chip memory footprint and that's it - plus some way
to interface a sensor array. (sensor = uninteresting black box here.)
so essentially all the space that would be cache is now the RAM we have
available. we want to dynamically page in/out code and data from this
space, for our little sensor uP is connected to a backend powerful server
via some link (serial, IR, ethernet, wireless, whatever). so the
uP will run a little set of "stubs" that will obtain code snippets from
the server, run them, and intercept ld/st and b/bl situations ... to
remap the instruction to (a) the proper address if resident; or (b) to
fetch the proper chunk needed and then remap - this may involve shipping
code/data back to the server. the concept is that assuming our memory is
sufficiently large for our "hot" code (adpcm coding, whatever) that
eventually we reach steady-state and no longer talk to the server except
to send "cooked" sensor data back.
this is the final goal. the current implementation is a test prototype
running under linux... the app and server run on the same skiff board
(or not) and communicate via generic socket read/write ops. i have a
client which receives function-sized chunks and executes them, asking
the server periodically for more code... for now, we ignore the data
segment paging by just allocating enough memory and not trying to manage
it. managing the code is difficult enough.
the problem we immediately run into is that for sub-function-sized
page granularity, it's very difficult in a "nice" way to catch the
memory references that are loading data from the instruction stream.
(from those little tables in the text segment i'm complaining about ;)
i'm explicitly compiling the "app" to run on our client such that there is
a big separation between I- and D- sections -- I at 0x021... and D at
0x022... if we were running in a real embedded environment (no linux),
i could probably just set the MMU up to catch these for me, assuming
i could get it to recognize the small page size...
but here, when the server is parsing the code chunk to send to the
client, it replaces "bl myfunc" with "bl bl_intercept" ... the intercept
will do the negotiation with the server. for me to use finer grain chunks
and break up the function (say, on any "b offs" or "bl func") then i need
to move the code chunks around in memory to be non-contiguous. the
problem here is that now i need to have more knowledge of the code
and be on the lookout for "ld r3, <some ofs addr in I-space>" and
replace that with something like "bl ld_intercept" where i go and
do all the address work elsewhere for the load. in all probability,
i'd wind up exceeding the limited offset range of the ldr instruction
to just remap it...
you may think "so what?" - i'm taking a huge performance hit with
the bl-intercept routine, so what difference does the ld-intercept
make? the reason is that as we page code in to us, we self-modify
those "bl b_intercept" to actually become "bl <new-real-address>" ...
and when we page code out, we replace any call sites to the now-
removed code with "bl b_intercept" so we can reload the code as
needed. so in essence, we take the hit once, and then never again,
when we reach that "hot code" steady state...
the problem is i can't see a way to remove the ld_intercept *ever*
because i may always exceed the offset space of the ldr-instr,
not to mention the extra complexity in server-side book-keeping.
if i could just stuff all those tables at a fixed address in memory
that i can keep track of in some way, that would make my life much
easier. (maybe by defining a new segment ".funcvars" and sticking
them there ... by whatever means i can make it work...)
hope this info helps paint the broader picture...
* Re: modifying the ARM generation behavior?
2001-09-24 5:50 ` Josh Fryman
@ 2001-09-24 10:16 ` Richard Earnshaw
0 siblings, 0 replies; 6+ messages in thread
From: Richard Earnshaw @ 2001-09-24 10:16 UTC (permalink / raw)
To: Josh Fryman; +Cc: Richard.Earnshaw, gcc, nickc
> Richard:
> > Hm, which compiler release are you using?
>
> 2.95.2, as stated above. moving it out of the function body and to the
> function end doesn't change my underlying problem.
Indeed, though it does help to explain some of the oddities.
> Richard:
> > Hm, what you are describing is a position-independent data model (in ARM's
> > ATPCS parlance, RWPI -- read-write position independent), but taken to the
> > extreme that even constants are pushed into the global data tables.
>
> i'll have to read up about this... and the % efficiency drop you mention.
> i'd like to know if that drop is seen with *all* ARM compilers, or just
> the ARM compiler. i note that the ARM compiler does not generate code like
> gcc does ... as is true on all architectures i've dealt with, gcc is a good
> compiler in general, but if you want completely optimized code, you have
> to use the platform specific compiler (intel c, sun devwkshp c, etc).
I'll go into it further below, but I remain to be convinced you are
enabling the optimizer. The ARM compiler very definitely puts constants
that can't be synthesized into the code segment. Where possible these
will be placed between functions, but occasionally, when a function is
large, then they will get placed at convenient points in the function
body. Where possible this will be after a natural branch instruction; but
very occasionally even this isn't possible and the compiler will have to
insert a jump around the data table.
> Nick:
> > What happens if there is too much data to fit into the area pointed to
> > by r11 ? (or whichever register is used). Since this may only be
>
> Richard:
> > time. Further, any moderately large program is going to exceed the 4k
> > offset range of your base register, meaning that you will either need to
> > create one base value per module (= more code at the start of each module
> > to set the base register up) or you will have to compile on the assumption
> > that a single ldr can't load a constant, something like
>
> you both jumped on this one rather pointedly ;) it deserves it. i'm not
> sure how silly the idea is. but ... to answer ...
>
> not necessarily. i won't put it past some people to have a buttload of
> files in their projects, so the smart decision would be to have the register
> be an indirect index itself. think of something like the idea that you have
> 1K of 4-byte addresses to "data tables" ... these addresses are in turn
> addresses to function- or file-specific "data tables" ... and then there you
> are. you have one setup of the register at program start, and then each
> function would have two loads at the top to get the right table into the
> register value ... this doesn't seem much of a penalty to me. makes more
> overhead in the data segment... but that could be massaged a bit to reduce
> inefficiency.
Again, this is similar to the RWPI (or even the ROPI) shared library model
which allows for multiple tables; though that model has a more efficient
way of handling multiple tables that normally avoids more than one
additional indirection per function.
>
> Richard:
> > [re: alternate address load model]
> >
> > add Rtmp, Rbase, #OFFSET_HIGH(offset)
> > ldr Rx, [Rtmp, #OFFSET_LOW(offset)]
>
> this is sort of how Sparc (and other) systems work. they do have a different
> instruction pattern for it, but it does the two-stage load... and it's very
> easy to catch in the code by a parser like mine. i know exactly what it's
> doing... in the sparc. because it doesn't use register offset addressing.
> makes me wish very fervently that the ARM had an instruction like the "bl"
> or "b" -- something like "ldrhi r3, <high-imm-16bit>", and the "ldrlo"
> follow-on.
Yes, but the SPARC can access the full 32-bit address space with that
model. On the ARM that still only buys you 20-bits of offset (probably
enough for most cases from a single pointer, but still 12 bits short of
the full range).
> Richard:
> > normal use. You can't use r11 since it is already used, so you would have
> > to use r9 (or for some compilations r8), that would use up 15-20% of the
> > remaining call-saved registers -- that's likely to have a significant
> > effect on the efficiency of the rest of your code, since the compiler will
> > now have to spill more often.
>
> maybe, i don't think so. (ignoring which particular rN is being used.) in
> the test code i've generated (prior to using -ffixed-r8), i've looked through
> the assembly output quite a bit. i have yet to see (this is working through
> adpcm codecs, mpeg codecs, jpeg codecs, and some custom test apps) anything
> use *more* than r0-r5. i have never seen an r6, r7, r8 reference *anywhere*.
I'm not convinced you are turning the optimizer on (or you have *very*
small functions).
> here's a fairly simple function i was testing through the system, and noted
> the behavior on first:
>
> test.c:
> ---------
> #include <stdio.h>
>
> extern int s1( int );
> extern int s2( int );
> extern int s3( int );
>
> extern int g1;
>
> int debug( void )
> {
> int g;
>
> printf("debugging s1/2/3...\n");
> for (g=0; g<10; g++)
> printf("s1(%d) = %d, s2(%d)=%d, s3(%d)=%d\n", g, s1(g), g, s2(g), g, s3(g) );
> printf("end debug...\n");
>
> g1 *= s3(g);
> printf("g1 is now %d\n", g1);
>
> g = s1(10) + s2(20);
> return g;
> }
>
I cannot get the compiler to generate the following output that you have:
> here's a snippet of the asm output from gcc ...
>
> ldr r3, .L7
> ldr r0, [r3, #0]
> bl s1
> mov r4, r0
> ldr r3, .L7
> sure looks like a lot of "ldr r3, .L7" to me ;) however, i note that i'm using
> a very long string of flags to gcc, as well as an older version.
This doesn't make sense for your source code. The variable passed to s1
is "g", a local variable. The only places this can exist are in a
register, or on the stack. In no case can it then be referenced by
looking it up through a constant data pointer -- there's no way the
compiler could know where on the stack it would be at compile time. Are
you sure this example was compiled from your posted code?
>when i went back
> and undid many of the flags, and put the optimization level at O2 (i use O1 at
> present, O2 has some side effects i haven't figured out how to deal with yet)
> i *do* get output that looks different:
>
> .LM5:
> ldr r0, [sp, #12]
> bl s1
> mov r4, r0
> ldr r0, [sp, #12]
> bl s2
> mov r5, r0
> ldr r0, [sp, #12]
> bl s3
> mov r3, r0
> str r5, [sp, #0]
> ldr r2, [sp, #12]
I can get this sort of output if I use -O0 -fomit-frame-pointer, but in no
other way.
Compiling with ANY level of optimization on gives
.L6:
mov r0, r6
bl s1
mov r5, r0
mov r0, r6
bl s2
mov r4, r0
mov r0, r6
bl s3
If you've got a long list of flags that are being passed to the compiler,
please check them carefully to ensure that a flag later on the command
line isn't turning the optimizer off again.
> [ begin research description : ignore rest of email if uninterested ]
>
> for a research project we're looking at a different model of CPU design
> that would have *no* caches. think of a remote sensor device with a
> very small (~4-8K) on-chip memory footprint and that's it - plus some way
> to interface a sensor array. (sensor = uninteresting black box here.)
> so essentially all the space that would be cache is now the RAM we have
> available. we want to dynamically page in/out code and data from this
> space, for our little sensor uP is connected to a backend powerful server
> via some link (serial, IR, ethernet, wireless, whatever). so the
> uP will run a little set of "stubs" that will obtain code snippets from
> the server, run them, and intercept ld/st and b/bl situations ... to
> remap the instruction to (a) the proper address if resident; or (b) to
> fetch the proper chunk needed and then remap - this may involve shipping
> code/data back to the server. the concept is that assuming our memory is
> sufficiently large for our "hot" code (adpcm coding, whatever) that
> eventually we reach steady-state and no longer talk to the server except
> to send "cooked" sensor data back.
This doesn't really sound any different from a demand-paged virtual memory
system, except, perhaps that you are trying to change the code directly,
rather than having additional hardware in the CPU to manage that for you.
> but here, when the server is parsing the code chunk to send to the
> client, it replaces "bl myfunc" with "bl bl_intercept" ... the intercept
> will do the negotiation with the server. for me to use finer grain chunks
> and break up the function (say, on any "b offs" or "bl func") then i need
> to move the code chunks around in memory to be non-contiguous. the
> problem here is that now i need to have more knowledge of the code
> and be on the lookout for "ld r3, <some ofs addr in I-space>" and
> replace that with something like "bl ld_intercept" where i go and
> do all the address work elsewhere for the load. in all probability,
> i'd wind up exceeding the limited offset range of the ldr instruction
> to just remap it...
>
> you may think "so what?" - i'm taking a huge performance hit with
> the bl-intercept routine, so what difference does the ld-intercept
> make? the reason is that as we page code in to us, we self-modify
> those "bl b_intercept" to actually become "bl <new-real-address>" ...
> and when we page code out, we replace any call sites to the now-
> removed code with "bl b_intercept" so we can reload the code as
> needed. so in essence, we take the hit once, and then never again,
> when we reach that "hot code" steady state...
Ok, so presumably your server has to remember what the original address
was when fixing up the bl (so that when executed it can repair the
damage). Why can't you extend this to the load/store and replace them
with something like
ldr rd, [r0, -r0]
This will always cause a load/store to address zero, and it would be easy
to make either your memory system or MMU fault such an access. Then you
could catch that with a segmentation fault handler and put in the correct
address before resuming execution.
R.