Help needed: Optimization of bytecode interpreter for ARM paltform

public inbox for gcc-help@gcc.gnu.org

* Help needed: Optimization of bytecode interpreter for ARM paltform
       [not found] <E4A374257A3CD1438C6EAE5FCD9A1EF502DD8058@idbexc02.americas.cpqcorp.net>
@ 2006-12-08 15:30 ` de Brebisson, Cyrille (Calculator Division)
  2006-12-08 16:43   ` Andrew Haley
  0 siblings, 1 reply; 9+ messages in thread
From: de Brebisson, Cyrille (Calculator Division) @ 2006-12-08 15:30 UTC
  To: gcc-help; +Cc: de Brebisson, Cyrille (Calculator Division)

Hello,

I hope this is the right place to ask this question; if not, please accept my apologies and redirect me where needed.

I am trying to write a fast bytecode interpreter, but the compiler's optimizer just 'does not get it' and generates bad code (it does not realize that there are jumps everywhere, and optimizes the code out)...

Here is a simplified version of the code:

static int rom[]= 
  { 0, 1, 2, 3, 4, 5, 6, 7, 8, 
    9, 10, 11, 12, 13, 14, }; // the 'program'

void execute()

{

  const void * const jumps[] = 
    { &&ins000, &&ins001, &&ins002, &&ins003, 
      &&ins004, &&ins005, &&ins006, &&ins007 }; // table of jumps

  register int carry asm ("r0");
  register int instr asm("r1"); // currently executed instruction
  register int *pc asm ("r4"); // program counter, points to next instr.
  register const void * const * jm asm ("r5") = jumps; //pointer jump table

int a=0, b=0; // virtual machine registers

// this macro does a fast carry=0; goto *jumps[*pc++]; 
#define next asm ("ldrh %2, [%0], #2\n\t" \
                   "mov %1, #0\n\t" \
                   "ldr pc, [%4, %2, asl #2]" : 
                   "=r" (pc), 
                   "=r" (carry), 
                   "=r" (instr): 
                   "0" (pc), 
                   "r" (jm)) 

// this macro does a fast goto *jumps[*pc++]; 
#define nextnocarry asm ("ldrh %1, [%0], #2\n\t"\
                         "ldr pc, [%3, %1, asl #2]" : 
                         "=r" (pc), 
                         "=r" (instr) : 
                         "0" (pc), 
                         "r" (jm))

pc = &rom[0]; next; // initialize PC and jump to first instruction...

// instruction execution..
ins000: a= 0; next;
ins001: b= 0; next;
ins002: a++; carry= a==0; nextnocarry;
ins003: b++; carry= b==0; nextnocarry;
ins004: pc= pc-a; next;
ins005: if (carry) pc+= b; next;
ins006: a--; carry= a==0; nextnocarry;
ins007: b--; carry= b==0; nextnocarry;

}

arm-elf-gcc -O1 -S ex.c compiles this whole thing into absolutely NOTHING! (Well, a bx lr to be precise: a return!)
I am using version "arm-elf-gcc (GCC) 4.0.2"

Can anyone help me with this?

Note: if I replace the asm parts with the C equivalent (goto *jumps[*pc++];), it generates:
      ldrh  r1, [r4], #2
      ldr   r8, .L2691+4
      ldr   fp, [r8, r1, asl #2]
      mov   r0, #0
      mov   pc, fp      @ indirect register jump
5 instructions instead of 3, because it
  1: does not keep jm in a register;
  2: loads the label address into a temporary register instead of directly into pc, which is not only slower but also wastes a lot of memory (and I am very memory-limited on this system).

thanks, cyrille 
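For comparison, a minimal sketch of the pure-C form of this kind of threaded dispatch: every handler ends in a computed goto (GCC's labels-as-values extension), so the compiler can see all the control-flow edges. The opcode set, `run()` and the sample program are hypothetical and not taken from Cyrille's code; opcodes are 16-bit to match the `ldrh` fetch. Whether GCC turns each dispatch into a single `ldr pc, [...]` depends on the GCC version, the target and register pressure, so check the generated assembly.

```c
#include <stdint.h>

/* Hypothetical opcode set for this sketch only. */
enum { OP_HALT, OP_CLRA, OP_INCA, OP_SETB, OP_LOOP };

/* Runs a 16-bit bytecode program and returns VM register a. */
int run(const uint16_t *pc)
{
    /* GCC "labels as values": one entry per opcode, in enum order. */
    static const void *const dispatch[] = {
        &&op_halt, &&op_clra, &&op_inca, &&op_setb, &&op_loop
    };
    int a = 0, b = 0;

/* Dispatch is a plain C computed goto, so GCC sees every edge. */
#define NEXT goto *dispatch[*pc++]

    NEXT;

op_halt:
    return a;
op_clra:
    a = 0;
    NEXT;
op_inca:
    a++;
    NEXT;
op_setb:                  /* b = immediate operand */
    b = *pc++;
    NEXT;
op_loop:                  /* b--; if b != 0, move pc back by the operand,
                             counted from the word after the operand */
    {
        uint16_t back = *pc++;
        if (--b != 0)
            pc -= back;
    }
    NEXT;
#undef NEXT
}
```

For example, `run()` on `{ OP_SETB, 5, OP_CLRA, OP_INCA, OP_INCA, OP_LOOP, 4, OP_HALT }` adds 2 to `a` five times and returns 10.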


* Re: Help needed: Optimization of bytecode interpreter for ARM paltform
  2006-12-08 15:30 ` Help needed: Optimization of bytecode interpreter for ARM paltform de Brebisson, Cyrille (Calculator Division)
@ 2006-12-08 16:43   ` Andrew Haley
  2006-12-08 17:05     ` de Brebisson, Cyrille (Calculator Division)
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Haley @ 2006-12-08 16:43 UTC
  To: de Brebisson, Cyrille (Calculator Division); +Cc: gcc-help

de Brebisson, Cyrille (Calculator Division) writes:
 > [...]
 > // this macro does a fast carry=0; goto *jumps[*pc++]; 
 > #define next asm ("ldrh %2, [%0], #2\n\t" \
 >                    "mov %1, #0\n\t" \
 >                    "ldr pc, [%4, %2, asl #2]" : 
 >                    "=r" (pc), 
 >                    "=r" (carry), 
 >                    "=r" (instr): 
 >                    "0" (pc), 
 >                    "r" (jm)) 
 > 
 > // this macro does a fast goto *jumps[*pc++]; 
 > #define nextnocarry asm ("ldrh %1, [%0], #2\n\t"\
 >                          "ldr pc, [%3, %1, asl #2]" : 
 >                          "=r" (pc), 
 >                          "=r" (instr) : 
 >                          "0" (pc), 
 >                          "r" (jm))

This is the crucial mistake: you can't jump out of an inline asm.

Andrew.


* RE: Help needed: Optimization of bytecode interpreter for ARM paltform
  2006-12-08 16:43   ` Andrew Haley
@ 2006-12-08 17:05     ` de Brebisson, Cyrille (Calculator Division)
  2006-12-08 17:22       ` Andrew Haley
  0 siblings, 1 reply; 9+ messages in thread
From: de Brebisson, Cyrille (Calculator Division) @ 2006-12-08 17:05 UTC
  To: Andrew Haley; +Cc: gcc-help

Hello,

[snip] trying to re-code goto *jump[*progc++] using inline assembly.
I used inline assembly to do:
ldrh instr, [progc], #2       // note that in most cases, there is an
                              // extra instruction here that hides the
                              // wait state caused by using register
                              // instr in the next instruction
ldr pc, [jump, instr, asl #2]

because the compiler generates the highly unoptimized (and too large for
the memory in my device)
	ldrh	r1, [r4], #2
	ldr	r8, .L2691+4
	ldr	fp, [r8, r1, asl #2]
	mov	pc, fp	@ indirect register jump
[/snip]

>This is the crucial mistake: you can't jump out of an inline asm.

So, how can I optimize my code? Is there a way to force the compiler to
1: put a variable in a register? As the asm ("register"); constraint
does not seem to do a lot of forcing
2: get the compiler to condense the last 2 instructions in 1?

Thanks, cyrille


* RE: Help needed: Optimization of bytecode interpreter for ARM paltform
  2006-12-08 17:05     ` de Brebisson, Cyrille (Calculator Division)
@ 2006-12-08 17:22       ` Andrew Haley
  2006-12-08 18:12         ` Richard Earnshaw
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Haley @ 2006-12-08 17:22 UTC
  To: de Brebisson, Cyrille (Calculator Division); +Cc: gcc-help, Richard Earnshaw

de Brebisson, Cyrille (Calculator Division) writes:

 > [snip] trying to re-code goto *jump[*progc++] using inline assembly.
 > I used inline assembly to do:
 > ldrh instr, [progc], #2       // note that in most cases, there is an
 >                               // extra instruction here that hides the
 >                               // wait state caused by using register
 >                               // instr in the next instruction
 > ldr pc, [jump, instr, asl #2]
 > 
 > because the compiler generates the highly unoptimized (and too large for
 > the memory in my device)
 > 	ldrh	r1, [r4], #2
 > 	ldr	r8, .L2691+4
 > 	ldr	fp, [r8, r1, asl #2]
 > 	mov	pc, fp	@ indirect register jump
 > [/snip]
 > 
 > >This is the crucial mistake: you can't jump out of an inline asm.
 > 
 > So, how can I optimize my code? Is there a way to force the compiler to
 > 1: put a variable in a register? As the asm ("register"); constraint
 > does not seem to do a lot of forcing

Definitely: if declaring a global register variable doesn't work,
that's a bug.  What exactly did you try?

 > 2: get the compiler to condense the last 2 instructions in 1?

I'm not sure why gcc generates that sequence.  Forwarding to Richard
Earnshaw for comment.

Andrew.
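A minimal sketch of the file-scope (global) register variable Andrew refers to, applied to the jump-table pointer. The names `jump_base` and `count_ops()` and the choice of r5 are illustrative assumptions; r5 is callee-saved under the ARM procedure call standard, but code in other files that does not see the declaration may still use r5 unless it is built with `-ffixed-r5`. As far as I know, Clang supports global register variables only in limited cases (mainly the stack pointer), so the sketch falls back to an ordinary static on other compilers and targets.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__GNUC__) && !defined(__clang__)
/* GCC on ARM: keep the jump-table base permanently in r5 (callee-saved).
   A global register variable must precede every function definition. */
register const void *const *jump_base __asm__("r5");
#else
/* Fallback for other compilers/targets: an ordinary static variable. */
static const void *const *jump_base;
#endif

/* Hypothetical two-opcode program: 1 = no-op (counted), 0 = end. */
int count_ops(const uint16_t *pc)
{
    static const void *const table[] = { &&op_end, &&op_nop };
    int n = 0;

    jump_base = table;          /* set the base once, before dispatching */
    goto *jump_base[*pc++];

op_nop:
    n++;
    goto *jump_base[*pc++];
op_end:
    return n;
}
```

For example, `count_ops()` on `{ 1, 1, 1, 0 }` executes three no-op instructions and returns 3.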



* RE: Help needed: Optimization of bytecode interpreter for ARM  paltform
  2006-12-08 17:22       ` Andrew Haley
@ 2006-12-08 18:12         ` Richard Earnshaw
  2006-12-08 19:30           ` Help needed: Optimization of bytecode interpreter for ARMpaltform de Brebisson, Cyrille (Calculator Division)
  2006-12-11 22:23           ` Syntax for inline asm for 64 bit variable on 32 bit architecture de Brebisson, Cyrille (Calculator Division)
  0 siblings, 2 replies; 9+ messages in thread
From: Richard Earnshaw @ 2006-12-08 18:12 UTC
  To: Andrew Haley; +Cc: de Brebisson, Cyrille (Calculator Division), gcc-help

On Fri, 2006-12-08 at 17:21 +0000, Andrew Haley wrote:
> de Brebisson, Cyrille (Calculator Division) writes:
> 
>  > [snip] trying to re-code goto *jump[*progc++] using inline assembly.
>  > I used inline assembly to do:
>  > ldrh instr, [progc], #2       // note that in most cases, there is an
>  >                               // extra instruction here that hides the
>  >                               // wait state caused by using register
>  >                               // instr in the next instruction
>  > ldr pc, [jump, instr, asl #2]
>  > 
>  > because the compiler generates the highly unoptimized (and too large for
>  > the memory in my device)
>  > 	ldrh	r1, [r4], #2
>  > 	ldr	r8, .L2691+4
>  > 	ldr	fp, [r8, r1, asl #2]
>  > 	mov	pc, fp	@ indirect register jump
>  > [/snip]
>  > 
>  > >This is the crucial mistake: you can't jump out of an inline asm.
>  > 
>  > So, how can I optimize my code? Is there a way to force the compiler to
>  > 1: put a variable in a register? As the asm ("register"); constraint
>  > does not seem to do a lot of forcing
> 
> Definitely: if declaring a global register variable doesn't work,
> that's a bug.  What exactly did you try?
> 
>  > 2: get the compiler to condense the last 2 instructions in 1?
> 
> I'm not sure why gcc generates that sequence.  Forwarding to Richard
> Earnshaw for comment.

First of all, you don't mention which version of the compiler you are
using, so it's hard to know precisely why you get the code you do.
GCC-4.1 is used in my example below.

Trying to second-guess the compiler is rarely profitable, but it's not
clear to me why the address of the jump table is not being hoisted out
of the loop.  There is a hack that will effectively force this in this
instance.  By adding an offset loaded from a global variable (or passed
in as an additional parameter that is always zero), we force the address
calculation into a local variable that the compiler can't (easily)
optimize away.  For the following test-case:

extern void foo(void), bar(void), wibble(void), wombat(void); /* defined elsewhere */

int offset = 0;

void runprog(unsigned short *prog, int count)
{
    __label__ code0, code1, code2, code3;
    static const void* const jump[4] = 
	{
	    &&code0, &&code1, &&code2, &&code3
	};
    const void* const* interp = jump+offset;
    
    while (count--)
	{
	    goto *interp[*prog++];
    code0:
	    foo();
	    continue;
    code1:
	    bar();
	    continue;
    code2:
	    wibble();
	    continue;
    code3:
	    wombat();
	    break;
	}
}

The critical part of the loop then compiles to:

        ldrh    r3, [r5], #2
        ldr     pc, [r6, r3, asl #2]    @ indirect memory jump

which looks fine to me.  Note, however, that if your 'switch' statement
is large, then you'll quite probably get spilling of variables.  The
value of interp is highly likely to be a candidate here because it's used
exactly once per iteration, so you'll then be back to where you started.

I'm somewhat confused as to why you haven't just used a switch table for
this, though.  The equivalent code:

extern void foo(void), bar(void), wibble(void), wombat(void); /* defined elsewhere */

void runprog(unsigned short *prog, int count)
{
    while (count--)
	{
	    switch(*prog++)
		{
		case 0:
		    foo();
		    continue;
		case 1:
		    bar();
		    continue;
		case 2:
		    wibble();
		    continue;
		case 3:
		    wombat();
		    goto done;
		}
	}
 done:
    ;
}

is much easier to understand and much more amenable to the standard
optimizer framework.

R.


* RE: Help needed: Optimization of bytecode interpreter for ARMpaltform
  2006-12-08 18:12         ` Richard Earnshaw
@ 2006-12-08 19:30           ` de Brebisson, Cyrille (Calculator Division)
  2006-12-09  8:23             ` Andrew Haley
  2006-12-09 16:54             ` Daniel Berlin
  2006-12-11 22:23           ` Syntax for inline asm for 64 bit variable on 32 bit architecture de Brebisson, Cyrille (Calculator Division)
  1 sibling, 2 replies; 9+ messages in thread
From: de Brebisson, Cyrille (Calculator Division) @ 2006-12-08 19:30 UTC
  To: Richard Earnshaw, Andrew Haley; +Cc: gcc-help

Hello,

Lots of questions are being asked that could probably be better answered
if I try to explain what I am doing and why I am doing it (believe it or
not, there is method in my madness!)

BTW, I am using arm-elf-gcc (GCC) 4.0.2. That was in my original post,
but got deleted afterward; sorry about that.

So, let me try to answer the main question.

I need to create an interpreter/simulator for an old (weird) CPU (with
1024 instructions) that will reside in a small embedded system.
Therefore, memory (executable size) is an issue as well as speed.

My first version was, of course, a large switch statement:
switch (*pc++)
{
  case xxx: execute; continue;
}

But this costs 1 extra jump (3 cycles) per loop iteration, plus an extra
test on each iteration (is the value being switched on too large?).
All together, the 'loop and execute the next instruction' code took over
12 cycles. For comparison, the most used instructions when executing the
bytecode (i.e. the virtual code) are jumps, which take 2 cycles to
emulate, so the overhead of the switch/loop is extremely significant!
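On the range test: one possible way to remove it, in GCC releases newer than 4.0.2, is to mask the opcode to the table size and mark the `default` case unreachable with `__builtin_unreachable()` (available since GCC 4.5), so the compiler can prove the index is always in range. The 2-bit opcode set and `run_switch()` are hypothetical, and whether the bounds check actually disappears depends on the compiler version, so inspect the assembly. For 1024 opcodes the mask would be `& 1023` with all 1024 cases present.

```c
#include <stdint.h>

/* Hypothetical 2-bit opcode VM: runs `steps` instructions, returns a. */
int run_switch(const uint16_t *pc, int steps)
{
    int a = 0;

    while (steps-- > 0) {
        /* The mask guarantees the value is 0..3, and all four cases exist,
           so the default really can never be reached. */
        switch (*pc++ & 3) {
        case 0: a = 0;  break;
        case 1: a++;    break;
        case 2: a--;    break;
        case 3: a *= 2; break;
        default: __builtin_unreachable();   /* tells GCC: no range check needed */
        }
    }
    return a;
}
```

For example, `{ 1, 1, 3, 1 }` run for 4 steps gives 0 → 1 → 2 → 4 → 5.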

So I tried to use a table of jump locations (code in previous messages)
to replace the poor code generated by the switch with 2 instructions
(5 cycles: ldrh instr, [instr_pc], #2 followed by
ldr pc, [jump_table, instr, asl #2]), which should provide a 60% speed
increase on the most executed instructions! After all, this is exactly
why tables of labels were introduced in GCC (see the help files!).

But the compiler is fighting me, not liking the jump in the inline code
(it basically does not see the jumps and optimizes out all the code)!

Replacing the assembly with goto *jump[pc]; does help a bit, but the
generated code is not optimal and makes the whole program too large to
fit in memory, because it loads the jump address into a register first
and then moves it into PC instead of loading it directly into PC.

The 2nd problem is that one of the most used variables (the jump
table address) is moved onto the stack instead of being kept in a
register, probably because the optimizer uses the register for some
other local optimizations (knowing my luck, for instructions that are
pretty much never emulated) at the expense of a much more effective
global optimization.

The best way for me to solve this problem (which comes from the fact
that I am writing definitely non-standard code) would be a way to tell
the compiler where I want global optimization turned on or off...
Then I could let the compiler optimize local things, but turn it off
for the main loops (where I write my own code).
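Later GCC releases (4.4 onward, so not the 4.0.2 used here) added per-function control along these lines: `__attribute__((optimize(...)))` and `#pragma GCC optimize`. The granularity is a whole function, not a region inside one, so it cannot split a single computed-goto loop; the GCC manual also cautions that the attribute is mainly intended for debugging. A minimal sketch, with hypothetical handler names, that builds a rarely used handler for size and a hot one for speed; the macros expand to nothing on compilers other than GCC.

```c
/* Per-function optimization levels (GCC >= 4.4); no-ops elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_SIZE  __attribute__((optimize("Os")))
#define OPT_SPEED __attribute__((optimize("O2")))
#else
#define OPT_SIZE
#define OPT_SPEED
#endif

/* Rarely emulated instruction: favour code size to save memory. */
OPT_SIZE int rare_op(int a, int b)
{
    return a * b + (a ^ b);
}

/* Frequently emulated instruction: favour speed. */
OPT_SPEED int hot_op(int a)
{
    return a + 1;
}
```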

So, is there any hope for me?

If needed, I can provide the full code (to keep things simple, I have
only put an example showing the problem in my messages; the real code
is 1000 lines long). The main difference is that in the real code the
jump table address is put on the stack, while in the example it is not...

Thanks for your help, Cyrille


* RE: Help needed: Optimization of bytecode interpreter for ARMpaltform
  2006-12-08 19:30           ` Help needed: Optimization of bytecode interpreter for ARMpaltform de Brebisson, Cyrille (Calculator Division)
@ 2006-12-09  8:23             ` Andrew Haley
  2006-12-09 16:54             ` Daniel Berlin
  1 sibling, 0 replies; 9+ messages in thread
From: Andrew Haley @ 2006-12-09  8:23 UTC
  To: de Brebisson, Cyrille (Calculator Division); +Cc: Richard Earnshaw, gcc-help

de Brebisson, Cyrille (Calculator Division) writes:

 > Lots of questions are being asked that could probably be better answered
 > if I try to explain what I am doing and why I am doing it (believe it or
 > not, there is method in my madness!)
 > 
 > BTW, I am using arm-elf-gcc (GCC) 4.0.2, that was in my original post,
 > but got deleted afterward, sorry about that
 > 
 > So, let me try to answer the main question.
 > 
 > I need to create an interpreter/simulator for an old (weird) CPU (with
 > 1024 instructions)

That would be Saturn, I suppose.

 > that will reside in a small embedded system.

 > The 2nd problem is that one of the most used variables (the
 > jump table address) is moved onto the stack instead of being kept in
 > a register, probably because the optimizer uses the register for
 > some other local optimizations (knowing my luck, for instructions
 > that are pretty much never emulated) at the expense of a much more
 > effective global optimization.

Can you please provide us with the declaration of the global register
variable that you tried?  Really, that part of gcc should work, and if
it doesn't we want to know!

Andrew.


* Re: Help needed: Optimization of bytecode interpreter for ARMpaltform
  2006-12-08 19:30           ` Help needed: Optimization of bytecode interpreter for ARMpaltform de Brebisson, Cyrille (Calculator Division)
  2006-12-09  8:23             ` Andrew Haley
@ 2006-12-09 16:54             ` Daniel Berlin
  1 sibling, 0 replies; 9+ messages in thread
From: Daniel Berlin @ 2006-12-09 16:54 UTC
  To: de Brebisson, Cyrille (Calculator Division)
  Cc: Richard Earnshaw, Andrew Haley, gcc-help

>
> But the compiler is fighting me, not liking the jump in the inline code
> (it basically does not see the jumps and optimizes out all the code)!


We do not allow you to change control flow using inline assembly.

It's not a matter of "not liking the jump". We simply don't allow
control flow changes through inline assembly.

This is relatively common among other compilers, AFAIK.
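Later GCC releases (4.5 onward) added `asm goto`, which is the supported way for inline assembly to transfer control: the template may jump only to the C labels listed after the clobbers, so the compiler knows about those edges. A minimal sketch with a hypothetical `is_zero()`; only x86-64 and 32-bit ARM templates are written out, and other targets use plain C. It does not make a table-driven indirect jump any easier, since every possible target would still have to be listed.

```c
/* Returns 1 if x == 0, using asm goto to branch to a C label. */
int is_zero(int x)
{
#if defined(__x86_64__)
    __asm__ goto ("testl %0, %0\n\t"
                  "jz %l1"               /* %l1: first label (after 1 input) */
                  : /* no outputs */
                  : "r" (x)
                  : "cc"
                  : zero);
#elif defined(__arm__)
    __asm__ goto ("cmp %0, #0\n\t"
                  "beq %l1"
                  : /* no outputs */
                  : "r" (x)
                  : "cc"
                  : zero);
#else
    if (x == 0)
        goto zero;                       /* portable fallback */
#endif
    return 0;
zero:
    return 1;
}
```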


* Syntax for inline asm for 64 bit variable on 32 bit architecture
  2006-12-08 18:12         ` Richard Earnshaw
  2006-12-08 19:30           ` Help needed: Optimization of bytecode interpreter for ARMpaltform de Brebisson, Cyrille (Calculator Division)
@ 2006-12-11 22:23           ` de Brebisson, Cyrille (Calculator Division)
  1 sibling, 0 replies; 9+ messages in thread
From: de Brebisson, Cyrille (Calculator Division) @ 2006-12-11 22:23 UTC
  To: gcc-help

Hello,

What is the syntax for using 64-bit variables in inline asm on a 32-bit
architecture? How do you know which register pair is used?
For example, how would you code a += b; in this case?
long long a = 0, b = 1;
asm ("add %0, %1 \n\t adc %2, %3" : "+r" (a) : "r" (b));
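One answer, assuming GCC targeting 32-bit ARM in ARM or Thumb-2 state: GCC's ARM backend provides the operand modifiers `%Q` and `%R`, which print the registers holding the low and high 32-bit halves of a 64-bit operand. A minimal sketch with a hypothetical `add64()`; non-ARM hosts fall back to plain C.

```c
/* a += b on 64-bit values; inline asm on 32-bit ARM, plain C elsewhere. */
long long add64(long long a, long long b)
{
#if defined(__arm__)
    /* %Qn = register with the low word of operand n,
       %Rn = register with the high word of operand n. */
    __asm__ ("adds %Q0, %Q0, %Q1\n\t"   /* low words, sets the carry flag */
             "adc  %R0, %R0, %R1"       /* high words plus carry          */
             : "+r" (a)
             : "r" (b)
             : "cc");
#else
    a += b;   /* fallback so the sketch builds on non-ARM hosts */
#endif
    return a;
}
```

For example, `add64(0xFFFFFFFF, 1)` must carry into the high word and returns `0x100000000`.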

Thanks, cyrille
