Scheduling x86 dispatch windows

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Scheduling x86 dispatch windows
@ 2010-06-10 17:24 reza yazdani
  2010-06-10 19:52 ` Quentin Neill
  0 siblings, 1 reply; 28+ messages in thread
From: reza yazdani @ 2010-06-10 17:24 UTC (permalink / raw)
  To: gcc

Hi,

We are in the process of adding a feature to GCC to take advantage of a new hardware feature in the latest AMD micro processor. This feature requires a certain mix, ordering and alignments in instruction sequences to obtain the expected hardware performance.

I am asking the community to review this high level implementation design and give me direction or advice. 

The new hardware issues two windows of the size N bytes of instructions in every cycle. It goes into accelerate mode if the windows have the right combination of instructions or alignments. Our goal is to maximize the IPC by proper instruction scheduling and alignments. 

Here is a summary of the most important requirements:

a) Maximum of N instructions per window.

b) An instruction may cross the first window.

c) Each window can have maximum of x memory loads and y memory stores . 

d) The total number of immediate constants in the instructions of a window should not exceed k.

e) The first window must be aligned on 16 byte boundary.

f) A Window set terminates when a branch exists in a window.

g) The number of allowed prefixes varies for instructions.

h) A window set needs to be padded by prefixes in instructions or terminated by nops to ensure adherence to the rules.

We have the following implementation plan for GCC:

1) Modify the Haifa scheduler to make the desired arrangement of instructions for the two dispatch windows. The scheduler is called once before and once after register allocation as usual. In both cases it performs dispatch scheduling along with its normal job of instruction scheduling. 

The advantage of doing it before register allocation is avoiding extra dependencies caused by register allocation which may become an obstacle to movement of instructions.  The advantage of doing it after register allocation is a consideration for spilling code which may be generated by the register allocator.

The algorithm we use is:

a) Considering the current dispatch window set, choose the first instruction from ready queue that does not violate dispatch rules.

b) When an instruction is selected and scheduled, inform the dispatcher code about the instruction. This step keeps track of the instruction content of windows for future evaluation. It also manages the window set by closing and opening new virtual dispatch windows.

2) Insertion of alignment code. 

In x86 alignment is done by inserting prefixes or by generating nops. As the object code is generated by the assembler in GCC, some information such as sizes of branches are unknown until assembly or link time. To do alignments related to dispatch correctly in GCC, we need to iteratively compute prefixes and branch sizes until its convergence. This pass currently does not exist in GCC, but it exists in the assembler. 

There are two possible approaches to solve alignment problem.

a)  Let the assembler performs the alignments and padding needed to adhere with the new machine dispatching rules and avoid an extra pass in GCC.

b)  Add a new pass to mimic what assembler does before generating the assembly listing in GCC and insert the required alignments.

I appreciate your comments on the proposed implementation procedure and the choices a or b above.

Reza Yazdani

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 17:24 Scheduling x86 dispatch windows reza yazdani
@ 2010-06-10 19:52 ` Quentin Neill
  2010-06-10 20:12   ` H.J. Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Quentin Neill @ 2010-06-10 19:52 UTC (permalink / raw)
  To: binutils; +Cc: gcc

Cross-posting Reza's call for feedback to the binutils list since it
is relevant -
see the last few paragraphs regarding how to "solve the alignment problem".

Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402

Not sure if followups should occur on one list or both.
--
Quentin Neill


On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani <yazdani_reza@yahoo.com> wrote:
> Hi,
>
> We are in the process of adding a feature to GCC to take advantage
> of a new hardware feature in the latest AMD micro processor. This
> feature requires a certain mix, ordering and alignments in
> instruction sequences to obtain the expected hardware performance.
>
> I am asking the community to review this high level implementation
> design and give me direction or advice.
>
> The new hardware issues two windows of the size N bytes of
> instructions in every cycle. It goes into accelerate mode if the
> windows have the right combination of instructions or alignments. Our
> goal is to maximize the IPC by proper instruction scheduling and
> alignments.
>
> Here is a summary of the most important requirements:
>
> a) Maximum of N instructions per window.
> b) An instruction may cross the first window.
> c) Each window can have maximum of x memory loads and y memory
>    stores .
> d) The total number of immediate constants in the instructions
>    of a window should not exceed k.
> e) The first window must be aligned on 16 byte boundary.
> f) A Window set terminates when a branch exists in a window.
> g) The number of allowed prefixes varies for instructions.
> h) A window set needs to be padded by prefixes in instructions
>    or terminated by nops to ensure adherence to the rules.
>
> We have the following implementation plan for GCC:
>
> 1) Modify the Haifa scheduler to make the desired arrangement of
>    instructions for the two dispatch windows. The scheduler is called
>    once before and once after register allocation as usual. In both
>    cases it performs dispatch scheduling along with its normal job of
>    instruction scheduling.
>
> The advantage of doing it before register allocation is avoiding
> extra dependencies caused by register allocation which may become
> an obstacle to movement of instructions.  The advantage of doing
> it after register allocation is a consideration for spilling code
> which may be generated by the register allocator.
>
> The algorithm we use is:
>
> a) Considering the current dispatch window set, choose the first
>    instruction from ready queue that does not violate dispatch rules.
> b) When an instruction is selected and scheduled, inform the
>    dispatcher code about the instruction. This step keeps track of the
>    instruction content of windows for future evaluation. It also manages
>    the window set by closing and opening new virtual dispatch windows.
>
> 2) Insertion of alignment code.
>
> In x86 alignment is done by inserting prefixes or by generating
> nops. As the object code is generated by the assembler in GCC, some
> information such as sizes of branches are unknown until assembly or
> link time. To do alignments related to dispatch correctly in GCC,
> we need to iteratively compute prefixes and branch sizes until
> its convergence. This pass currently does not exist in GCC, but it
> exists in the assembler.
>
> There are two possible approaches to solve alignment problem.
>
> a)  Let the assembler performs the alignments and padding needed
>     to adhere with the new machine dispatching rules and avoid an extra
>     pass in GCC.
> b)  Add a new pass to mimic what assembler does before generating
>     the assembly listing in GCC and insert the required alignments.
>
> I appreciate your comments on the proposed implementation procedure
> and the choices a or b above.
>
> Reza Yazdani

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 19:52 ` Quentin Neill
@ 2010-06-10 20:12   ` H.J. Lu
  2010-06-10 20:25     ` Jeff Law
  0 siblings, 1 reply; 28+ messages in thread
From: H.J. Lu @ 2010-06-10 20:12 UTC (permalink / raw)
  To: Quentin Neill; +Cc: binutils, gcc

On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> Cross-posting Reza's call for feedback to the binutils list since it
> is relevant -
> see the last few paragraphs regarding how to "solve the alignment problem".
>
> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>
> Not sure if followups should occur on one list or both.
> --
> Quentin Neill
>
>
> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani <yazdani_reza@yahoo.com> wrote:
>> Hi,
>>
>> We are in the process of adding a feature to GCC to take advantage
>> of a new hardware feature in the latest AMD micro processor. This
>> feature requires a certain mix, ordering and alignments in
>> instruction sequences to obtain the expected hardware performance.
>>
>> I am asking the community to review this high level implementation
>> design and give me direction or advice.
>>
>> The new hardware issues two windows of the size N bytes of
>> instructions in every cycle. It goes into accelerate mode if the
>> windows have the right combination of instructions or alignments. Our
>> goal is to maximize the IPC by proper instruction scheduling and
>> alignments.
>>
>> Here is a summary of the most important requirements:
>>
>> a) Maximum of N instructions per window.
>> b) An instruction may cross the first window.
>> c) Each window can have maximum of x memory loads and y memory
>>    stores .
>> d) The total number of immediate constants in the instructions
>>    of a window should not exceed k.
>> e) The first window must be aligned on 16 byte boundary.
>> f) A Window set terminates when a branch exists in a window.
>> g) The number of allowed prefixes varies for instructions.
>> h) A window set needs to be padded by prefixes in instructions
>>    or terminated by nops to ensure adherence to the rules.
>>
>> We have the following implementation plan for GCC:
>>
>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>    instructions for the two dispatch windows. The scheduler is called
>>    once before and once after register allocation as usual. In both
>>    cases it performs dispatch scheduling along with its normal job of
>>    instruction scheduling.
>>
>> The advantage of doing it before register allocation is avoiding
>> extra dependencies caused by register allocation which may become
>> an obstacle to movement of instructions.  The advantage of doing
>> it after register allocation is a consideration for spilling code
>> which may be generated by the register allocator.
>>
>> The algorithm we use is:
>>
>> a) Considering the current dispatch window set, choose the first
>>    instruction from ready queue that does not violate dispatch rules.
>> b) When an instruction is selected and scheduled, inform the
>>    dispatcher code about the instruction. This step keeps track of the
>>    instruction content of windows for future evaluation. It also manages
>>    the window set by closing and opening new virtual dispatch windows.
>>
>> 2) Insertion of alignment code.
>>
>> In x86 alignment is done by inserting prefixes or by generating
>> nops. As the object code is generated by the assembler in GCC, some
>> information such as sizes of branches are unknown until assembly or
>> link time. To do alignments related to dispatch correctly in GCC,
>> we need to iteratively compute prefixes and branch sizes until
>> its convergence. This pass currently does not exist in GCC, but it
>> exists in the assembler.
>>
>> There are two possible approaches to solve alignment problem.
>>
>> a)  Let the assembler performs the alignments and padding needed
>>     to adhere with the new machine dispatching rules and avoid an extra
>>     pass in GCC.
>> b)  Add a new pass to mimic what assembler does before generating
>>     the assembly listing in GCC and insert the required alignments.
>>
>> I appreciate your comments on the proposed implementation procedure
>> and the choices a or b above.

I don't this should be done in assembler. Assembler should just assemble
the assembly input.

-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 20:12   ` H.J. Lu
@ 2010-06-10 20:25     ` Jeff Law
  2010-06-10 20:38       ` Joern Rennecke
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jeff Law @ 2010-06-10 20:25 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Quentin Neill, binutils, gcc

On 06/10/10 13:52, H.J. Lu wrote:
> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
> <quentin.neill.gnu@gmail.com>  wrote:
>    
>> Cross-posting Reza's call for feedback to the binutils list since it
>> is relevant -
>> see the last few paragraphs regarding how to "solve the alignment problem".
>>
>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>
>> Not sure if followups should occur on one list or both.
>> --
>> Quentin Neill
>>
>>
>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>  wrote:
>>      
>>> Hi,
>>>
>>> We are in the process of adding a feature to GCC to take advantage
>>> of a new hardware feature in the latest AMD micro processor. This
>>> feature requires a certain mix, ordering and alignments in
>>> instruction sequences to obtain the expected hardware performance.
>>>
>>> I am asking the community to review this high level implementation
>>> design and give me direction or advice.
>>>
>>> The new hardware issues two windows of the size N bytes of
>>> instructions in every cycle. It goes into accelerate mode if the
>>> windows have the right combination of instructions or alignments. Our
>>> goal is to maximize the IPC by proper instruction scheduling and
>>> alignments.
>>>
>>> Here is a summary of the most important requirements:
>>>
>>> a) Maximum of N instructions per window.
>>> b) An instruction may cross the first window.
>>> c) Each window can have maximum of x memory loads and y memory
>>>     stores .
>>> d) The total number of immediate constants in the instructions
>>>     of a window should not exceed k.
>>> e) The first window must be aligned on 16 byte boundary.
>>> f) A Window set terminates when a branch exists in a window.
>>> g) The number of allowed prefixes varies for instructions.
>>> h) A window set needs to be padded by prefixes in instructions
>>>     or terminated by nops to ensure adherence to the rules.
>>>
>>> We have the following implementation plan for GCC:
>>>
>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>     instructions for the two dispatch windows. The scheduler is called
>>>     once before and once after register allocation as usual. In both
>>>     cases it performs dispatch scheduling along with its normal job of
>>>     instruction scheduling.
>>>
>>> The advantage of doing it before register allocation is avoiding
>>> extra dependencies caused by register allocation which may become
>>> an obstacle to movement of instructions.  The advantage of doing
>>> it after register allocation is a consideration for spilling code
>>> which may be generated by the register allocator.
>>>
>>> The algorithm we use is:
>>>
>>> a) Considering the current dispatch window set, choose the first
>>>     instruction from ready queue that does not violate dispatch rules.
>>> b) When an instruction is selected and scheduled, inform the
>>>     dispatcher code about the instruction. This step keeps track of the
>>>     instruction content of windows for future evaluation. It also manages
>>>     the window set by closing and opening new virtual dispatch windows.
>>>
>>> 2) Insertion of alignment code.
>>>
>>> In x86 alignment is done by inserting prefixes or by generating
>>> nops. As the object code is generated by the assembler in GCC, some
>>> information such as sizes of branches are unknown until assembly or
>>> link time. To do alignments related to dispatch correctly in GCC,
>>> we need to iteratively compute prefixes and branch sizes until
>>> its convergence. This pass currently does not exist in GCC, but it
>>> exists in the assembler.
>>>
>>> There are two possible approaches to solve alignment problem.
>>>
>>> a)  Let the assembler performs the alignments and padding needed
>>>      to adhere with the new machine dispatching rules and avoid an extra
>>>      pass in GCC.
>>> b)  Add a new pass to mimic what assembler does before generating
>>>      the assembly listing in GCC and insert the required alignments.
>>>
>>> I appreciate your comments on the proposed implementation procedure
>>> and the choices a or b above.
>>>        
> I don't this should be done in assembler. Assembler should just assemble
> the assembly input.
>    
That adds quite a bit of complication to the compiler though -- getting 
the instruction lengths right (and thus proper packing & alignment) can 
be extremely difficult.  I did some experiments with this on a target 
with *fixed* instruction lengths a while back and even though the port 
tried hard to get lengths right, it would routinely miss something.  
Ultimately I decided that it forcing the compiler to know instruction 
lengths with a very high degree of accuracy wasn't a sane thing to 
do.    Dealing with variable instruction lengths just adds yet another 
complexity to the situation.  Then add the complication of needing to 
add specific prefixes or nops and it just gets downright ugly.

I'd probably approach this by having the compiler emit a directive which 
states what the desired alignment at a particular point should be, then 
allow the assembler to select the best method to get the desired alignment.

jeff



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 20:25     ` Jeff Law
@ 2010-06-10 20:38       ` Joern Rennecke
  2010-06-10 21:54       ` Quentin Neill
  2010-06-11  0:44       ` Daniel Jacobowitz
  2 siblings, 0 replies; 28+ messages in thread
From: Joern Rennecke @ 2010-06-10 20:38 UTC (permalink / raw)
  To: Jeff Law; +Cc: H.J. Lu, Quentin Neill, binutils, gcc

Quoting Jeff Law <law@redhat.com>:

> That adds quite a bit of complication to the compiler though -- getting
> the instruction lengths right (and thus proper packing & alignment) can
> be extremely difficult.  I did some experiments with this on a target
> with *fixed* instruction lengths a while back and even though the port
> tried hard to get lengths right, it would routinely miss something.
> Ultimately I decided that it forcing the compiler to know instruction
> lengths with a very high degree of accuracy wasn't a sane thing to do.
>   Dealing with variable instruction lengths just adds yet another
> complexity to the situation.  Then add the complication of needing to
> add specific prefixes or nops and it just gets downright ugly.

I did add alignment-aware & exact branch shortening to the ARCompact port,
but ultimately the added complexity due to this was also a factor why the
port couldn't go into mainline without an active maintainer.
The code is available on branches.
See PR target/39303.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 20:25     ` Jeff Law
  2010-06-10 20:38       ` Joern Rennecke
@ 2010-06-10 21:54       ` Quentin Neill
  2010-06-10 22:09         ` H.J. Lu
  2010-06-11  0:44       ` Daniel Jacobowitz
  2 siblings, 1 reply; 28+ messages in thread
From: Quentin Neill @ 2010-06-10 21:54 UTC (permalink / raw)
  To: Jeff Law; +Cc: H.J. Lu, binutils, gcc

On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
> On 06/10/10 13:52, H.J. Lu wrote:
>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>> <quentin.neill.gnu@gmail.com>  wrote:
>>> Cross-posting Reza's call for feedback to the binutils list since it
>>> is relevant - s ee the last few paragraphs regarding how to
>>> "solve the alignment problem".
>>>
>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>
>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>  wrote:
>>>> Hi,
>>>>
>>>> We are in the process of adding a feature to GCC to take advantage
>>>> of a new hardware feature in the latest AMD micro processor. This
>>>> feature requires a certain mix, ordering and alignments in
>>>> instruction sequences to obtain the expected hardware performance.
>>>>
>>>> I am asking the community to review this high level implementation
>>>> design and give me direction or advice.
>>>>
>>>> The new hardware issues two windows of the size N bytes of
>>>> instructions in every cycle. It goes into accelerate mode if the
>>>> windows have the right combination of instructions or alignments. Our
>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>> alignments.
>>>>
>>>> Here is a summary of the most important requirements:
>>>>
>>>> a) Maximum of N instructions per window.
>>>> b) An instruction may cross the first window.
>>>> c) Each window can have maximum of x memory loads and y memory
>>>>    stores .
>>>> d) The total number of immediate constants in the instructions
>>>>    of a window should not exceed k.
>>>> e) The first window must be aligned on 16 byte boundary.
>>>> f) A Window set terminates when a branch exists in a window.
>>>> g) The number of allowed prefixes varies for instructions.
>>>> h) A window set needs to be padded by prefixes in instructions
>>>>    or terminated by nops to ensure adherence to the rules.
>>>>
>>>> We have the following implementation plan for GCC:
>>>>
>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>    once before and once after register allocation as usual. In both
>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>    instruction scheduling.
>>>>
>>>> The advantage of doing it before register allocation is avoiding
>>>> extra dependencies caused by register allocation which may become
>>>> an obstacle to movement of instructions.  The advantage of doing
>>>> it after register allocation is a consideration for spilling code
>>>> which may be generated by the register allocator.
>>>>
>>>> The algorithm we use is:
>>>>
>>>> a) Considering the current dispatch window set, choose the first
>>>>    instruction from ready queue that does not violate dispatch rules.
>>>> b) When an instruction is selected and scheduled, inform the
>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>    instruction content of windows for future evaluation. It also manages
>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>
>>>> 2) Insertion of alignment code.
>>>>
>>>> In x86 alignment is done by inserting prefixes or by generating
>>>> nops. As the object code is generated by the assembler in GCC, some
>>>> information such as sizes of branches are unknown until assembly or
>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>> we need to iteratively compute prefixes and branch sizes until
>>>> its convergence. This pass currently does not exist in GCC, but it
>>>> exists in the assembler.
>>>>
>>>> There are two possible approaches to solve alignment problem.
>>>>
>>>> a)  Let the assembler performs the alignments and padding needed
>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>     pass in GCC.
>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>     the assembly listing in GCC and insert the required alignments.
>>>>
>>>> I appreciate your comments on the proposed implementation procedure
>>>> and the choices a or b above.
>>>>
>>
>> I don't this should be done in assembler. Assembler should just assemble
>> the assembly input.
>
> That adds quite a bit of complication to the compiler though -- getting the
> instruction lengths right (and thus proper packing & alignment) can be
> extremely difficult.  I did some experiments with this on a target with
> *fixed* instruction lengths a while back and even though the port tried hard
> to get lengths right, it would routinely miss something.  Ultimately I
> decided that it forcing the compiler to know instruction lengths with a very
> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
> instruction lengths just adds yet another complexity to the situation.  Then
> add the complication of needing to add specific prefixes or nops and it just
> gets downright ugly.
>
> I'd probably approach this by having the compiler emit a directive which
> states what the desired alignment at a particular point should be, then
> allow the assembler to select the best method to get the desired alignment.

Jeff,

This is exactly part of our binutils side of the proposal, which I'll
outline now

1. Allow multiple prefixes for ADDR and DS (and possibly others)
a) multiple prefixes are benign in certain modes and are thus chosen for padding
b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
more explicit (and they don't trigger a call to
md_flush_pending_output)

2. Add new pseudo-op to delineate alignment boundaries.  This is
needed to signal any dispatch engine (below) to pad.  Here are my top
two candidates, any feedback is appreciated:
a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
b) ".padalign" which calla a new "md_pad_align()"

3. Add dispatch optimization infrastructure which
a) is guarded by -mtune flag (and possibly other -f style flags)
b) tracks assembled instruction attributes and their fragments
c) can pad (insert benign prefixes) into previously assembled fragments
d) maintains dispatch engine state (according to some subset of Reza's rules)

Discussion:

The flags in 3a) should guard against these changes affecting current behavior.

The assembly tracking in 3b) is for bookkeeping only; the padding in
3c) would only occur when a compiler uses the pseudo-op in 2) or when
the dispatch engine in 3d) signals.

For compilers that know exactly how to pad for the new processor, the
ability to
pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.

For assembly programs and/or compilers that don't choose to do any
dispatch optimization, it's anticipated that the engine in 3d) would
be useful for optimizing for -mtune=bdver1

I'll post patches for these soon.
-- 
Quentin Neill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 21:54       ` Quentin Neill
@ 2010-06-10 22:09         ` H.J. Lu
  2010-06-10 22:40           ` Quentin Neill
  0 siblings, 1 reply; 28+ messages in thread
From: H.J. Lu @ 2010-06-10 22:09 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, binutils, gcc

On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
>> On 06/10/10 13:52, H.J. Lu wrote:
>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>> <quentin.neill.gnu@gmail.com>  wrote:
>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>> is relevant - s ee the last few paragraphs regarding how to
>>>> "solve the alignment problem".
>>>>
>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>
>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>>  wrote:
>>>>> Hi,
>>>>>
>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>> feature requires a certain mix, ordering and alignments in
>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>
>>>>> I am asking the community to review this high level implementation
>>>>> design and give me direction or advice.
>>>>>
>>>>> The new hardware issues two windows of the size N bytes of
>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>> windows have the right combination of instructions or alignments. Our
>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>> alignments.
>>>>>
>>>>> Here is a summary of the most important requirements:
>>>>>
>>>>> a) Maximum of N instructions per window.
>>>>> b) An instruction may cross the first window.
>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>    stores .
>>>>> d) The total number of immediate constants in the instructions
>>>>>    of a window should not exceed k.
>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>> f) A Window set terminates when a branch exists in a window.
>>>>> g) The number of allowed prefixes varies for instructions.
>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>
>>>>> We have the following implementation plan for GCC:
>>>>>
>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>    once before and once after register allocation as usual. In both
>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>    instruction scheduling.
>>>>>
>>>>> The advantage of doing it before register allocation is avoiding
>>>>> extra dependencies caused by register allocation which may become
>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>> it after register allocation is a consideration for spilling code
>>>>> which may be generated by the register allocator.
>>>>>
>>>>> The algorithm we use is:
>>>>>
>>>>> a) Considering the current dispatch window set, choose the first
>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>
>>>>> 2) Insertion of alignment code.
>>>>>
>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>> information such as sizes of branches are unknown until assembly or
>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>> exists in the assembler.
>>>>>
>>>>> There are two possible approaches to solve alignment problem.
>>>>>
>>>>> a)  Let the assembler performs the alignments and padding needed
>>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>>     pass in GCC.
>>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>
>>>>> I appreciate your comments on the proposed implementation procedure
>>>>> and the choices a or b above.
>>>>>
>>>
>>> I don't this should be done in assembler. Assembler should just assemble
>>> the assembly input.
>>
>> That adds quite a bit of complication to the compiler though -- getting the
>> instruction lengths right (and thus proper packing & alignment) can be
>> extremely difficult.  I did some experiments with this on a target with
>> *fixed* instruction lengths a while back and even though the port tried hard
>> to get lengths right, it would routinely miss something.  Ultimately I
>> decided that it forcing the compiler to know instruction lengths with a very
>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>> instruction lengths just adds yet another complexity to the situation.  Then
>> add the complication of needing to add specific prefixes or nops and it just
>> gets downright ugly.
>>
>> I'd probably approach this by having the compiler emit a directive which
>> states what the desired alignment at a particular point should be, then
>> allow the assembler to select the best method to get the desired alignment.
>
> Jeff,
>
> This is exactly part of our binutils side of the proposal, which I'll
> outline now
>
> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
> a) multiple prefixes are benign in certain modes and are thus chosen for padding
> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
> more explicit (and they don't trigger a call to
> md_flush_pending_output)
>
> 2. Add new pseudo-op to delineate alignment boundaries.  This is
> needed to signal any dispatch engine (below) to pad.  Here are my top
> two candidates, any feedback is appreciated:
> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
> b) ".padalign" which calla a new "md_pad_align()"
>
> 3. Add dispatch optimization infrastructure which
> a) is guarded by -mtune flag (and possibly other -f style flags)
> b) tracks assembled instruction attributes and their fragments
> c) can pad (insert benign prefixes) into previously assembled fragments
> d) maintains dispatch engine state (according to some subset of Reza's rules)
>
> Discussion:
>
> The flags in 3a) should guard against these changes affecting current behavior.
>
> The assembly tracking in 3b) is for bookkeeping only; the padding in
> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
> the dispatch engine in 3d) signals.
>
> For compilers that know exactly how to pad for the new processor, the
> ability to
> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.
>
> For assembly programs and/or compilers that don't choose to do any
> dispatch optimization, it's anticipated that the engine in 3d) would
> be useful for optimizing for -mtune=bdver1
>
> I'll post patches for these soon.

Can you do it with directives only?


-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 22:09         ` H.J. Lu
@ 2010-06-10 22:40           ` Quentin Neill
  2010-06-10 22:48             ` H.J. Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Quentin Neill @ 2010-06-10 22:40 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, binutils, gcc

On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
> <quentin.neill.gnu@gmail.com> wrote:
>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>> <quentin.neill.gnu@gmail.com>  wrote:
>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>> is relevant - s ee the last few paragraphs regarding how to
>>>>> "solve the alignment problem".
>>>>>
>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>
>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>>>  wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>
>>>>>> I am asking the community to review this high level implementation
>>>>>> design and give me direction or advice.
>>>>>>
>>>>>> The new hardware issues two windows of the size N bytes of
>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>> alignments.
>>>>>>
>>>>>> Here is a summary of the most important requirements:
>>>>>>
>>>>>> a) Maximum of N instructions per window.
>>>>>> b) An instruction may cross the first window.
>>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>>    stores .
>>>>>> d) The total number of immediate constants in the instructions
>>>>>>    of a window should not exceed k.
>>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>>> f) A Window set terminates when a branch exists in a window.
>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>>
>>>>>> We have the following implementation plan for GCC:
>>>>>>
>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>>    once before and once after register allocation as usual. In both
>>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>>    instruction scheduling.
>>>>>>
>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>> extra dependencies caused by register allocation which may become
>>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>>> it after register allocation is a consideration for spilling code
>>>>>> which may be generated by the register allocator.
>>>>>>
>>>>>> The algorithm we use is:
>>>>>>
>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>>
>>>>>> 2) Insertion of alignment code.
>>>>>>
>>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>>> information such as sizes of branches are unknown until assembly or
>>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>>> exists in the assembler.
>>>>>>
>>>>>> There are two possible approaches to solve alignment problem.
>>>>>>
>>>>>> a)  Let the assembler performs the alignments and padding needed
>>>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>>>     pass in GCC.
>>>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>>
>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>> and the choices a or b above.
>>>>>>
>>>>
>>>> I don't this should be done in assembler. Assembler should just assemble
>>>> the assembly input.
>>>
>>> That adds quite a bit of complication to the compiler though -- getting the
>>> instruction lengths right (and thus proper packing & alignment) can be
>>> extremely difficult.  I did some experiments with this on a target with
>>> *fixed* instruction lengths a while back and even though the port tried hard
>>> to get lengths right, it would routinely miss something.  Ultimately I
>>> decided that it forcing the compiler to know instruction lengths with a very
>>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>>> instruction lengths just adds yet another complexity to the situation.  Then
>>> add the complication of needing to add specific prefixes or nops and it just
>>> gets downright ugly.
>>>
>>> I'd probably approach this by having the compiler emit a directive which
>>> states what the desired alignment at a particular point should be, then
>>> allow the assembler to select the best method to get the desired alignment.
>>
>> Jeff,
>>
>> This is exactly part of our binutils side of the proposal, which I'll
>> outline now
>>
>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>> a) multiple prefixes are benign in certain modes and are thus chosen for padding
>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>> more explicit (and they don't trigger a call to
>> md_flush_pending_output)
>>
>> 2. Add new pseudo-op to delineate alignment boundaries.  This is
>> needed to signal any dispatch engine (below) to pad.  Here are my top
>> two candidates, any feedback is appreciated:
>> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
>> b) ".padalign" which calla a new "md_pad_align()"
>>
>> 3. Add dispatch optimization infrastructure which
>> a) is guarded by -mtune flag (and possibly other -f style flags)
>> b) tracks assembled instruction attributes and their fragments
>> c) can pad (insert benign prefixes) into previously assembled fragments
>> d) maintains dispatch engine state (according to some subset of Reza's rules)
>>
>> Discussion:
>>
>> The flags in 3a) should guard against these changes affecting current behavior.
>>
>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>> the dispatch engine in 3d) signals.
>>
>> For compilers that know exactly how to pad for the new processor, the
>> ability to
>> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.
>>
>> For assembly programs and/or compilers that don't choose to do any
>> dispatch optimization, it's anticipated that the engine in 3d) would
>> be useful for optimizing for -mtune=bdver1
>>
>> I'll post patches for these soon.
>
> Can you do it with directives only?

In theory, if the compiler knows all sizes and offsets, yes (given
some way to add multiple prefixes).

However in practice, no.

To get  GCC to know all would require replicating most assembler
functionality in  GCC, including parsing, assembling, and sizing
(parts of output_insn() and its child output_*() functions).  We
considered exposing one-line assembly as a library but you have to
provide (or reuse) the segment/frchain/fragment context, and I don't
think introducing a GCC->binutils dependency (other than runtime)
would be easy to introduce into the community.

This wouldn't cover the assembly language case either.

And remember, even if you have all the directives (and the
programmer/compiler knows all), the assembler must remember potential
padding locations until the decision (and knowledge about how) to pad
arrives.

-- 
Quentin Neill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 22:40           ` Quentin Neill
@ 2010-06-10 22:48             ` H.J. Lu
  2010-06-11 23:36               ` Quentin Neill
  0 siblings, 1 reply; 28+ messages in thread
From: H.J. Lu @ 2010-06-10 22:48 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, binutils, gcc

On Thu, Jun 10, 2010 at 3:09 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
>> <quentin.neill.gnu@gmail.com> wrote:
>>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
>>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>>> <quentin.neill.gnu@gmail.com>  wrote:
>>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>>> is relevant - s ee the last few paragraphs regarding how to
>>>>>> "solve the alignment problem".
>>>>>>
>>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>>>>  wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>>
>>>>>>> I am asking the community to review this high level implementation
>>>>>>> design and give me direction or advice.
>>>>>>>
>>>>>>> The new hardware issues two windows of the size N bytes of
>>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>>> alignments.
>>>>>>>
>>>>>>> Here is a summary of the most important requirements:
>>>>>>>
>>>>>>> a) Maximum of N instructions per window.
>>>>>>> b) An instruction may cross the first window.
>>>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>>>    stores .
>>>>>>> d) The total number of immediate constants in the instructions
>>>>>>>    of a window should not exceed k.
>>>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>>>> f) A Window set terminates when a branch exists in a window.
>>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>>>
>>>>>>> We have the following implementation plan for GCC:
>>>>>>>
>>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>>>    once before and once after register allocation as usual. In both
>>>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>>>    instruction scheduling.
>>>>>>>
>>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>>> extra dependencies caused by register allocation which may become
>>>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>>>> it after register allocation is a consideration for spilling code
>>>>>>> which may be generated by the register allocator.
>>>>>>>
>>>>>>> The algorithm we use is:
>>>>>>>
>>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>>>
>>>>>>> 2) Insertion of alignment code.
>>>>>>>
>>>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>>>> information such as sizes of branches are unknown until assembly or
>>>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>>>> exists in the assembler.
>>>>>>>
>>>>>>> There are two possible approaches to solve alignment problem.
>>>>>>>
>>>>>>> a)  Let the assembler performs the alignments and padding needed
>>>>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>>>>     pass in GCC.
>>>>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>>>
>>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>>> and the choices a or b above.
>>>>>>>
>>>>>
>>>>> I don't this should be done in assembler. Assembler should just assemble
>>>>> the assembly input.
>>>>
>>>> That adds quite a bit of complication to the compiler though -- getting the
>>>> instruction lengths right (and thus proper packing & alignment) can be
>>>> extremely difficult.  I did some experiments with this on a target with
>>>> *fixed* instruction lengths a while back and even though the port tried hard
>>>> to get lengths right, it would routinely miss something.  Ultimately I
>>>> decided that it forcing the compiler to know instruction lengths with a very
>>>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>>>> instruction lengths just adds yet another complexity to the situation.  Then
>>>> add the complication of needing to add specific prefixes or nops and it just
>>>> gets downright ugly.
>>>>
>>>> I'd probably approach this by having the compiler emit a directive which
>>>> states what the desired alignment at a particular point should be, then
>>>> allow the assembler to select the best method to get the desired alignment.
>>>
>>> Jeff,
>>>
>>> This is exactly part of our binutils side of the proposal, which I'll
>>> outline now
>>>
>>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>>> a) multiple prefixes are benign in certain modes and are thus chosen for padding
>>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>>> more explicit (and they don't trigger a call to
>>> md_flush_pending_output)
>>>
>>> 2. Add new pseudo-op to delineate alignment boundaries.  This is
>>> needed to signal any dispatch engine (below) to pad.  Here are my top
>>> two candidates, any feedback is appreciated:
>>> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
>>> b) ".padalign" which calla a new "md_pad_align()"
>>>
>>> 3. Add dispatch optimization infrastructure which
>>> a) is guarded by -mtune flag (and possibly other -f style flags)
>>> b) tracks assembled instruction attributes and their fragments
>>> c) can pad (insert benign prefixes) into previously assembled fragments
>>> d) maintains dispatch engine state (according to some subset of Reza's rules)
>>>
>>> Discussion:
>>>
>>> The flags in 3a) should guard against these changes affecting current behavior.
>>>
>>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>>> the dispatch engine in 3d) signals.
>>>
>>> For compilers that know exactly how to pad for the new processor, the
>>> ability to
>>> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.
>>>
>>> For assembly programs and/or compilers that don't choose to do any
>>> dispatch optimization, it's anticipated that the engine in 3d) would
>>> be useful for optimizing for -mtune=bdver1
>>>
>>> I'll post patches for these soon.
>>
>> Can you do it with directives only?
>
> In theory, if the compiler knows all sizes and offsets, yes (given
> some way to add multiple prefixes).
>
> However in practice, no.
>
> To get  GCC to know all would require replicating most assembler
> functionality in  GCC, including parsing, assembling, and sizing
> (parts of output_insn() and its child output_*() functions).  We
> considered exposing one-line assembly as a library but you have to
> provide (or reuse) the segment/frchain/fragment context, and I don't
> think introducing a GCC->binutils dependency (other than runtime)
> would be easy to introduce into the community.
>
> This wouldn't cover the assembly language case either.
>
> And remember, even if you have all the directives (and the
> programmer/compiler knows all), the assembler must remember potential
> padding locations until the decision (and knowledge about how) to pad
> arrives.
>

x86 assembler isn't an optimizing assembler. -mtune only does
instruction selection.  What you are proposing sounds like an optimizing
assembler to me. Are we going to support scheduling, macro, ...?


-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 20:25     ` Jeff Law
  2010-06-10 20:38       ` Joern Rennecke
  2010-06-10 21:54       ` Quentin Neill
@ 2010-06-11  0:44       ` Daniel Jacobowitz
  2010-06-11  5:58         ` Quentin Neill
  2 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2010-06-11  0:44 UTC (permalink / raw)
  To: Jeff Law; +Cc: H.J. Lu, Quentin Neill, binutils, gcc

On Thu, Jun 10, 2010 at 02:03:03PM -0600, Jeff Law wrote:
> That adds quite a bit of complication to the compiler though --
> getting the instruction lengths right (and thus proper packing &
> alignment) can be extremely difficult.  I did some experiments with
> this on a target with *fixed* instruction lengths a while back and
> even though the port tried hard to get lengths right, it would
> routinely miss something.  Ultimately I decided that it forcing the
> compiler to know instruction lengths with a very high degree of
> accuracy wasn't a sane thing to do.

FWIW, my opinion (and I think Jakub has expressed a similar opinion
and/or tool in the past) is that there is a sane way to do this: put
assertions in the assembler output and have the assembler validate
them.

On the other hand, I'm not going to argue that it's a lot of work.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11  0:44       ` Daniel Jacobowitz
@ 2010-06-11  5:58         ` Quentin Neill
  2010-06-11 16:46           ` Daniel Jacobowitz
  0 siblings, 1 reply; 28+ messages in thread
From: Quentin Neill @ 2010-06-11  5:58 UTC (permalink / raw)
  To: Jeff Law, H.J. Lu, Quentin Neill, binutils, gcc

On Thu, Jun 10, 2010 at 5:40 PM, Daniel Jacobowitz <dan@codesourcery.com> wrote:
> On Thu, Jun 10, 2010 at 02:03:03PM -0600, Jeff Law wrote:
>> That adds quite a bit of complication to the compiler though --
>> getting the instruction lengths right (and thus proper packing &
>> alignment) can be extremely difficult.  I did some experiments with
>> this on a target with *fixed* instruction lengths a while back and
>> even though the port tried hard to get lengths right, it would
>> routinely miss something.  Ultimately I decided that it forcing the
>> compiler to know instruction lengths with a very high degree of
>> accuracy wasn't a sane thing to do.
>
> FWIW, my opinion (and I think Jakub has expressed a similar opinion
> and/or tool in the past) is that there is a sane way to do this: put
> assertions in the assembler output and have the assembler validate
> them.
>
> On the other hand, I'm not going to argue that it's a lot of work.
> --
> Daniel Jacobowitz
> CodeSourcery

When you say "put assertions in the assembler output" I understood it
to mean "in the assembly source code output by the compiler", not "the
output produced by the assembler".

Does this qualify as a form of what you are suggesting?  Because this
is exactly what is being proposed:

.balign 8                  # start window
    insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
    insn2 op, op        # AA BB CC
.padalign 8              # window boundary
    insn4 op
    . . .

-- 
Quentin Neill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11  5:58         ` Quentin Neill
@ 2010-06-11 16:46           ` Daniel Jacobowitz
  2010-06-11 19:21             ` Quentin Neill
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2010-06-11 16:46 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, H.J. Lu, binutils, gcc

On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
> > On the other hand, I'm not going to argue that it's a lot of work.

Missing "not" !

> When you say "put assertions in the assembler output" I understood it
> to mean "in the assembly source code output by the compiler", not "the
> output produced by the assembler".

Yes.

> Does this qualify as a form of what you are suggesting?  Because this
> is exactly what is being proposed:
> 
> .balign 8                  # start window
>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>     insn2 op, op        # AA BB CC
> .padalign 8              # window boundary
>     insn4 op
>     . . .

No, this is quite different.  These are directives that tell the
assembler to make changes.  I'm talking about assertions, not
directives.  Something like this:

  mov r0, r1 @ [length 2]
  add ip, lr, ip @ [length 4]
  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'

GCC can output length information, but it is never exact, and it is
not in a form recognized by the assembler.

On x86, I have no idea how this would work.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11 16:46           ` Daniel Jacobowitz
@ 2010-06-11 19:21             ` Quentin Neill
  2010-06-11 19:41               ` H.J. Lu
                                 ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Quentin Neill @ 2010-06-11 19:21 UTC (permalink / raw)
  To: Quentin Neill, Jeff Law, H.J. Lu, binutils, gcc

On Fri, Jun 11, 2010 at 10:58 AM, Daniel Jacobowitz
<dan@codesourcery.com> wrote:
> On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
[snip]
>> Does this qualify as a form of what you are suggesting?  Because this
>> is exactly what is being proposed:
>>
>> .balign 8                  # start window
>>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>>     insn2 op, op        # AA BB CC
>> .padalign 8              # window boundary
>>     insn4 op
>>     . . .
>
> No, this is quite different.  These are directives that tell the
> assembler to make changes.  I'm talking about assertions, not
> directives.  Something like this:
>
>  mov r0, r1 @ [length 2]
>  add ip, lr, ip @ [length 4]
>  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'
>
> GCC can output length information, but it is never exact, and it is
> not in a form recognized by the assembler.
>
> On x86, I have no idea how this would work.
>
> --
> Daniel Jacobowitz
> CodeSourcery
>

I see.

Currently GCC doesn't compute the current encoding offset (doesn't
know mnemonic/opcode lengths), nor does it perform a relaxation pass
(to resolve forward displacement/branch offsets).   Without these it
so cannot accurately formulate such assertions.

Our proposal is to let the assembler itself (knowing best the details
of the encoding stream, offsets, and the processor) aligns
instructions, with hints about the structure (block starts, ends,
instruction sets) using macros/assertions/tokens if needed.

Another option would be to expose some subset of the assembler
functionality as a plugin to GCC (similar to how gold is used) to
extract the instruction sizes.   Any comments on that approach?

-- 
Quentin Neill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11 19:21             ` Quentin Neill
@ 2010-06-11 19:41               ` H.J. Lu
  2010-06-11 19:50               ` Jakub Jelinek
  2010-06-12 11:25               ` Andi Kleen
  2 siblings, 0 replies; 28+ messages in thread
From: H.J. Lu @ 2010-06-11 19:41 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, binutils, gcc

On Fri, Jun 11, 2010 at 12:09 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Fri, Jun 11, 2010 at 10:58 AM, Daniel Jacobowitz
> <dan@codesourcery.com> wrote:
>> On Thu, Jun 10, 2010 at 09:48:24PM -0500, Quentin Neill wrote:
> [snip]
>>> Does this qualify as a form of what you are suggesting?  Because this
>>> is exactly what is being proposed:
>>>
>>> .balign 8                  # start window
>>>     insn op, op          # 67 67 XX YY ZZ  - padded with 2 prefixes to make 8
>>>     insn2 op, op        # AA BB CC
>>> .padalign 8              # window boundary
>>>     insn4 op
>>>     . . .
>>
>> No, this is quite different.  These are directives that tell the
>> assembler to make changes.  I'm talking about assertions, not
>> directives.  Something like this:
>>
>>  mov r0, r1 @ [length 2]
>>  add ip, lr, ip @ [length 4]
>>  mov r0, r1 @ [length 4] <-- assembler error 'insn has length 2'
>>
>> GCC can output length information, but it is never exact, and it is
>> not in a form recognized by the assembler.
>>
>> On x86, I have no idea how this would work.
>>
>> --
>> Daniel Jacobowitz
>> CodeSourcery
>>
>
> I see.
>
> Currently GCC doesn't compute the current encoding offset (doesn't
> know mnemonic/opcode lengths), nor does it perform a relaxation pass
> (to resolve forward displacement/branch offsets).   Without these it
> so cannot accurately formulate such assertions.
>
> Our proposal is to let the assembler itself (knowing best the details
> of the encoding stream, offsets, and the processor) aligns
> instructions, with hints about the structure (block starts, ends,
> instruction sets) using macros/assertions/tokens if needed.
>
> Another option would be to expose some subset of the assembler
> functionality as a plugin to GCC (similar to how gold is used) to
> extract the instruction sizes.   Any comments on that approach?
>

I would suggest generating object code directly, totally bypassing
assembler. Many compilers do it. But it is a HUGE effort.


-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11 19:21             ` Quentin Neill
  2010-06-11 19:41               ` H.J. Lu
@ 2010-06-11 19:50               ` Jakub Jelinek
  2010-06-12 11:25               ` Andi Kleen
  2 siblings, 0 replies; 28+ messages in thread
From: Jakub Jelinek @ 2010-06-11 19:50 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, H.J. Lu, binutils, gcc

On Fri, Jun 11, 2010 at 02:09:33PM -0500, Quentin Neill wrote:
> Currently GCC doesn't compute the current encoding offset (doesn't
> know mnemonic/opcode lengths), 

That's not true, gcc for i?86/x86_64 actually calculates the length and for
most of the commonly used insns correctly, I've spent some time fixing
various bugs in it a year ago, see
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01808.html
and the thread around it.

Many of the remaining few issues (haven't tested bdver ISA additions for
lengths) are fixable too, of course there is always inline asm where
GCC can't know.

	Jakub

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-10 22:48             ` H.J. Lu
@ 2010-06-11 23:36               ` Quentin Neill
  2010-06-12 18:54                 ` H.J. Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Quentin Neill @ 2010-06-11 23:36 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, binutils, gcc

On Thu, Jun 10, 2010 at 5:23 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [snip]
> x86 assembler isn't an optimizing assembler. -mtune only does
> instruction selection.  What you are proposing sounds like an optimizing
> assembler to me. Are we going to support scheduling, macro, ...?
> --
> H.J.

Just to clarify, we are not doing scheduling or macros.   The
assembler already supported alignment and padding using .align and
friends, which can be from the compiler and from hand-written
assembly.

Now we are seeing more complex alignment rules that are not as simple
as it used to be for the older hardware.  It will be almost impossible
for an assembly programmer to insert the right directives, not to
mention any change might invalidate previous alignments.   Assembly
programmers will be out of luck (that is, unless the compiler becomes
the assembler).

The essence is we want to insert prefixes (as well as nops) according
to certain rules known at encoding time.  The mechanism implementing
these rules can be abstracted (table driven?) and could be applicable
to any hardware having similar features.

As gcc does not currently encode and/or generate object code, we are
wary of introducing such assembler functionality and want to avoid if
possible, instead leveraging the existing binutils infrastructure.

-- 
Quentin Neill (with some input from Reza Yazdani)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11 19:21             ` Quentin Neill
  2010-06-11 19:41               ` H.J. Lu
  2010-06-11 19:50               ` Jakub Jelinek
@ 2010-06-12 11:25               ` Andi Kleen
  2010-06-12 22:45                 ` Ian Lance Taylor
  2 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2010-06-12 11:25 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, H.J. Lu, binutils, gcc

Quentin Neill <quentin.neill.gnu@gmail.com> writes:
>
> Another option would be to expose some subset of the assembler
> functionality as a plugin to GCC (similar to how gold is used) to
> extract the instruction sizes.   Any comments on that approach?

AFAIK gcc already does keep track of instruction lengths
(e.g. for LOOP), but it may not be fully reliable.

But if you need more why can't you just link the whole assembler
into gcc? That would hopefully speed up compilation too
(e.g. over time the text generation of instructions could
be bypassed)

I don't know how hard it would be, but it would seem like the
right thing to do.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-11 23:36               ` Quentin Neill
@ 2010-06-12 18:54                 ` H.J. Lu
  2010-06-13 21:54                   ` H.J. Lu
  0 siblings, 1 reply; 28+ messages in thread
From: H.J. Lu @ 2010-06-12 18:54 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, binutils, gcc

On Fri, Jun 11, 2010 at 3:42 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 5:23 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> [snip]
>> x86 assembler isn't an optimizing assembler. -mtune only does
>> instruction selection.  What you are proposing sounds like an optimizing
>> assembler to me. Are we going to support scheduling, macro, ...?
>> --
>> H.J.
>
> Just to clarify, we are not doing scheduling or macros.   The
> assembler already supported alignment and padding using .align and
> friends, which can be from the compiler and from hand-written
> assembly.
>
> Now we are seeing more complex alignment rules that are not as simple
> as it used to be for the older hardware.  It will be almost impossible
> for an assembly programmer to insert the right directives, not to
> mention any change might invalidate previous alignments.   Assembly
> programmers will be out of luck (that is, unless the compiler becomes
> the assembler).

If you can find a way to help assembly programmers via new directives,
it is great.  GNU x86 assembler should just translate assembly code
into binary code. The output of "objdump -d" should be identical
to the input assembly.

We shouldn't turn GNU x86 assembler into an optimizing assembler.
Next people may ask assembler to remove redundant instructions, ...

Right now, when something goes wrong, people don't have to debug
assembler since it is very unlikely that the problem is in assembler.
When assembler starts to make changes to assembly input, we have
another place where a bug may be introduced.

>
> The essence is we want to insert prefixes (as well as nops) according
> to certain rules known at encoding time.  The mechanism implementing
> these rules can be abstracted (table driven?) and could be applicable
> to any hardware having similar features.

Can you implement them with new directives/pseudo instructions?

BTW, GCC should know the instruction length. If not, it is a GCC bug.

-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-12 11:25               ` Andi Kleen
@ 2010-06-12 22:45                 ` Ian Lance Taylor
  2010-06-13 12:35                   ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Lance Taylor @ 2010-06-12 22:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Quentin Neill, Jeff Law, H.J. Lu, binutils, gcc

Andi Kleen <andi@firstfloor.org> writes:

> But if you need more why can't you just link the whole assembler
> into gcc? That would hopefully speed up compilation too
> (e.g. over time the text generation of instructions could
> be bypassed)

It would help compilation time a little bit, but generating the
assembly code and running the entire assembler is a fairly small
percentage of the overall compilation time--e.g., 3%.  It's worth
doing a fair amount of work to speed up compilation by 3%, but linking
the assembler into gcc would be an enormous amount of work.  I would
certainly support somebody who wants to tackle it, but I don't think
it's very high up the priority list for the overall project.

Ian

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-12 22:45                 ` Ian Lance Taylor
@ 2010-06-13 12:35                   ` Andi Kleen
  2010-06-13 13:09                     ` Joern Rennecke
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2010-06-13 12:35 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Andi Kleen, Quentin Neill, Jeff Law, H.J. Lu, binutils, gcc

> It would help compilation time a little bit, but generating the
> assembly code and running the entire assembler is a fairly small
> percentage of the overall compilation time--e.g., 3%.  It's worth
> doing a fair amount of work to speed up compilation by 3%, but linking
> the assembler into gcc would be an enormous amount of work.  I would

Curious -- why do you think it would be that much work?

I admit I haven't looked into gas code, but naively it can't
be all that difficult to e.g. run gas as a thread and 
pass the text input through some shared memory buffer?

That would likely not speed up thinks too much, but
it could be a starting for more short cuts.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 12:35                   ` Andi Kleen
@ 2010-06-13 13:09                     ` Joern Rennecke
  2010-06-13 14:36                       ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Joern Rennecke @ 2010-06-13 13:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ian Lance Taylor, Quentin Neill, Jeff Law, H.J. Lu, binutils, gcc

Quoting Andi Kleen <andi@firstfloor.org>:
> I admit I haven't looked into gas code, but naively it can't
> be all that difficult to e.g. run gas as a thread and
> pass the text input through some shared memory buffer?

If you are generating text anyway, there should be little difference to
the existing -pipe option - at least on operating systems that can handle
processes efficiently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 13:09                     ` Joern Rennecke
@ 2010-06-13 14:36                       ` Andi Kleen
  2010-06-13 15:02                         ` Joern Rennecke
  2010-06-13 16:28                         ` Frank Ch. Eigler
  0 siblings, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2010-06-13 14:36 UTC (permalink / raw)
  To: Joern Rennecke
  Cc: Andi Kleen, Ian Lance Taylor, Quentin Neill, Jeff Law, H.J. Lu,
	binutils, gcc

On Sun, Jun 13, 2010 at 07:14:03AM -0400, Joern Rennecke wrote:
> Quoting Andi Kleen <andi@firstfloor.org>:
>> I admit I haven't looked into gas code, but naively it can't
>> be all that difficult to e.g. run gas as a thread and
>> pass the text input through some shared memory buffer?
>
> If you are generating text anyway, there should be little difference to
> the existing -pipe option - at least on operating systems that can handle
> processes efficiently.

Yes but you can't easily pass data back, like accurate instruction lengths.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 14:36                       ` Andi Kleen
@ 2010-06-13 15:02                         ` Joern Rennecke
  2010-06-13 19:56                           ` Chris Lattner
  2010-06-13 16:28                         ` Frank Ch. Eigler
  1 sibling, 1 reply; 28+ messages in thread
From: Joern Rennecke @ 2010-06-13 15:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ian Lance Taylor, Quentin Neill, Jeff Law, H.J. Lu, binutils, gcc

Quoting Andi Kleen <andi@firstfloor.org>:

> On Sun, Jun 13, 2010 at 07:14:03AM -0400, Joern Rennecke wrote:
>> Quoting Andi Kleen <andi@firstfloor.org>:
>>> I admit I haven't looked into gas code, but naively it can't
>>> be all that difficult to e.g. run gas as a thread and
>>> pass the text input through some shared memory buffer?
>>
>> If you are generating text anyway, there should be little difference to
>> the existing -pipe option - at least on operating systems that can handle
>> processes efficiently.
>
> Yes but you can't easily pass data back, like accurate instruction lengths.

We were just talking about saving time by reducing the work of the
assembler, which Ian said takes about 3% of the compilation time.
So in that context, -pipe should be similar to text buffer thread
communication.

If you want to make exact length information available to the scheduler,
text buffers are not all that easy to use - the assembly output of the
compiler is so far only done in final, and things like register allocation
(for sched1), machine_dependent_reorg, peepholes, final_{pre_}scan_insn,
assembler dialects, and functional output templates that rely on some data
to be computed by prologue code or similar are in the way of getting a
string suitable for the assembler during scheduling.
And if you have to audit and/or patch every pattern of your port to verify
that it'll work in a different context, you might as well fix the length
information.

An even if you have a suitable text for the assembler, to link the compiler
with the assembler requires to merge to two complex build systems, and
resolve symbol name clash issues.
Using socketpair / fork / exec for the interface would restrict portability
to the new feature to host operating systems that implement these system
calls, but you wouldn't have to worry about having to merge the compiler
with the assembler.
Or if you have a bit more (developer & CPU) time, you can use a client/server
interface with AF_UNIX / AF_INET sockets.  Or you could go for SYSV shared
memory.
All of these Inter Process Communication mechanisms with nice autoconf and
configure option scripting together will likely be only a fraction of the
hassle of trying to merge gcc and gas and to push the patches for that
monster upstream.
For starters, gas and gcc have different maintainers, asynchronous  
release schedules, and reside in different repositories with different  
version
control systems.

Using the opcodes directory directly could be slightly saner, if it
were not for the fact that there are about as many different interfaces as
there are different ports.
Alhough the cgen ports have some things in common, there is always some
port-specific glue.  And there are still plenty of non-cgen ports, too.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 14:36                       ` Andi Kleen
  2010-06-13 15:02                         ` Joern Rennecke
@ 2010-06-13 16:28                         ` Frank Ch. Eigler
  1 sibling, 0 replies; 28+ messages in thread
From: Frank Ch. Eigler @ 2010-06-13 16:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joern Rennecke, Ian Lance Taylor, Quentin Neill, Jeff Law,
	H.J. Lu, binutils, gcc

Andi Kleen <andi@firstfloor.org> writes:

> [...]
> Yes but you can't easily pass data back, like accurate instruction lengths.

Wouldn't it be too late by then?  Or are you imagining having the
compiler pass trial data to the assembler to create a feedback loop?

- FChE

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 15:02                         ` Joern Rennecke
@ 2010-06-13 19:56                           ` Chris Lattner
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Lattner @ 2010-06-13 19:56 UTC (permalink / raw)
  To: Joern Rennecke
  Cc: Andi Kleen, Ian Lance Taylor, Quentin Neill, Jeff Law, H.J. Lu,
	gcc Mailing List

On Jun 13, 2010, at 7:35 AM, Joern Rennecke wrote:
> An even if you have a suitable text for the assembler, to link the compiler
> with the assembler requires to merge to two complex build systems, and
> resolve symbol name clash issues.

Not trying to be inflammatory, but if you guys are really serious about doing this, you could consider using the LLVM integrated assembler.  It is built as a library and is easily separable from the code generator (codegen depends on the assembler, not the other way around).  It would be much much simpler to integrate than the binutils assembler.

Some information is here:
http://blog.llvm.org/2010/04/intro-to-llvm-mc-project.html

It currently supports darwin x86 quite well.  ELF/PECOFF and ARM support are in active development.  It sped up clang by ~10% at -O0 -g.

-Chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-12 18:54                 ` H.J. Lu
@ 2010-06-13 21:54                   ` H.J. Lu
  2010-06-14 13:37                     ` Michael Matz
  0 siblings, 1 reply; 28+ messages in thread
From: H.J. Lu @ 2010-06-13 21:54 UTC (permalink / raw)
  To: Quentin Neill; +Cc: Jeff Law, binutils, gcc

On Sat, Jun 12, 2010 at 8:15 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 11, 2010 at 3:42 PM, Quentin Neill
> <quentin.neill.gnu@gmail.com> wrote:
>> On Thu, Jun 10, 2010 at 5:23 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> [snip]
>>> x86 assembler isn't an optimizing assembler. -mtune only does
>>> instruction selection.  What you are proposing sounds like an optimizing
>>> assembler to me. Are we going to support scheduling, macro, ...?
>>> --
>>> H.J.
>>
>> Just to clarify, we are not doing scheduling or macros.   The
>> assembler already supported alignment and padding using .align and
>> friends, which can be from the compiler and from hand-written
>> assembly.
>>
>> Now we are seeing more complex alignment rules that are not as simple
>> as it used to be for the older hardware.  It will be almost impossible
>> for an assembly programmer to insert the right directives, not to
>> mention any change might invalidate previous alignments.   Assembly
>> programmers will be out of luck (that is, unless the compiler becomes
>> the assembler).
>
> If you can find a way to help assembly programmers via new directives,
> it is great.  GNU x86 assembler should just translate assembly code
> into binary code. The output of "objdump -d" should be identical
> to the input assembly.
>
> We shouldn't turn GNU x86 assembler into an optimizing assembler.
> Next people may ask assembler to remove redundant instructions, ...
>
> Right now, when something goes wrong, people don't have to debug
> assembler since it is very unlikely that the problem is in assembler.
> When assembler starts to make changes to assembly input, we have
> another place where a bug may be introduced.
>
>>
>> The essence is we want to insert prefixes (as well as nops) according
>> to certain rules known at encoding time.  The mechanism implementing
>> these rules can be abstracted (table driven?) and could be applicable
>> to any hardware having similar features.
>
> Can you implement them with new directives/pseudo instructions?
>

I think you should take a look at

http://code.google.com/p/mao/


-- 
H.J.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-13 21:54                   ` H.J. Lu
@ 2010-06-14 13:37                     ` Michael Matz
  2010-06-14 16:06                       ` Jakub Jelinek
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Matz @ 2010-06-14 13:37 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Quentin Neill, Jeff Law, binutils, gcc

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2969 bytes --]

Hi,

On Sun, 13 Jun 2010, H.J. Lu wrote:

> > We shouldn't turn GNU x86 assembler into an optimizing assembler. Next 
> > people may ask assembler to remove redundant instructions, ...

Well, but currently nobody is asking for such thing, right?

> > Right now, when something goes wrong, people don't have to debug 
> > assembler since it is very unlikely that the problem is in assembler. 
> > When assembler starts to make changes to assembly input, we have 
> > another place where a bug may be introduced.

But that's the case also right now.  align directives are one example for 
the assembler not emitting a one-to-one mapping, jump instructions are 
another.

> >> The essence is we want to insert prefixes (as well as nops) according 
> >> to certain rules known at encoding time. Â The mechanism implementing 
> >> these rules can be abstracted (table driven?) and could be applicable 
> >> to any hardware having similar features.
> >
> > Can you implement them with new directives/pseudo instructions?
> >
> 
> I think you should take a look at
> 
> http://code.google.com/p/mao/

I find the direction this discussion takes a bit bizarre.  Quentins 
suggestions were grounded in the way GCC works.  It emits text, and 
expects the assembler to transform this into binary blobs.  Changing this 
fundmental property is so much work that it isn't sensible to suggest that 
as alternative to the proposal.

Also GCC prefers to use GNU as.  Suggesting to use a different as' also 
doesn't read realistic (or even desirable) in my book, at least not on 
platforms where other as' aren't supported right now.  Neither does a 
post-processing tool seem desirable, as we want to generate fast code by 
default.

Therefore the only two realistic (IMO) possibilities are to either change 
GNU as to ensure the hw constraints are observed, or to do the same change 
in GCC.

Doing the change in GNU as has the advantage that all insn lengths are 
available without any work, i.e. it will handle e.g. inline asm; and that 
relaxation also is implemented just fine (it exists already in order to 
decide which jump form it's going to use); it has a high chance of always 
emitting the correct sequences.  It has the disadvantage that GNU as would 
emit (no-op) prefixes that the asm author didn't write.

Doing the change in GCC has the advantage that it would know about this 
change in instruction size (and therefore also could calculate sizes of 
jumps more correctly).  It has at least the disadvantage to need to do the 
tedious job of ensuring all insn lengths are correct, which by necessicity 
won't be done for inline asm; even ignoring inline asm it's known to 
quickly bit-rot (despite Jakubs heroic efforts).  From that follows that 
it has a somewhat higher chance of emitting slow sequences.

I don't see realistic and desirable other options.  For completeness 
considerations (inline asm) I think changing GNU as is the better choice.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scheduling x86 dispatch windows
  2010-06-14 13:37                     ` Michael Matz
@ 2010-06-14 16:06                       ` Jakub Jelinek
  0 siblings, 0 replies; 28+ messages in thread
From: Jakub Jelinek @ 2010-06-14 16:06 UTC (permalink / raw)
  To: Michael Matz; +Cc: H.J. Lu, Quentin Neill, Jeff Law, binutils, gcc

On Mon, Jun 14, 2010 at 03:14:37PM +0200, Michael Matz wrote:
> Doing the change in GNU as has the advantage that all insn lengths are 
> available without any work, i.e. it will handle e.g. inline asm; and that 
> relaxation also is implemented just fine (it exists already in order to 
> decide which jump form it's going to use); it has a high chance of always 
> emitting the correct sequences.  It has the disadvantage that GNU as would 
> emit (no-op) prefixes that the asm author didn't write.
> 
> Doing the change in GCC has the advantage that it would know about this 
> change in instruction size (and therefore also could calculate sizes of 
> jumps more correctly).  It has at least the disadvantage to need to do the 
> tedious job of ensuring all insn lengths are correct, which by necessicity 
> won't be done for inline asm; even ignoring inline asm it's known to 
> quickly bit-rot (despite Jakubs heroic efforts).  From that follows that 
> it has a somewhat higher chance of emitting slow sequences.
> 
> I don't see realistic and desirable other options.  For completeness 
> considerations (inline asm) I think changing GNU as is the better choice.

Doing it in GNU as without the user asking for it might break code though
(e.g. when code has jmp .+16 or something similar).
So, IMHO it is to be done in GNU as, it should be only requested through
directives.  Say directives that set start and end of an insn block which
would request padding the whole block to certain alignment and CPU to
optimize for while doing that.  This wouldn't be useful just for this future AMD
CPU, but e.g. could insert a prefix or two here and there to ensure loop
start is already aligned at 16 byte (or some other) boundary without having
to add a nop there.  Of course when optimizing for some CPUs the strategy
could be not to include any prefixes in the sequence, just add normal
aligning nop at the end.

	Jakub

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-06-14 13:26 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-10 17:24 Scheduling x86 dispatch windows reza yazdani
2010-06-10 19:52 ` Quentin Neill
2010-06-10 20:12   ` H.J. Lu
2010-06-10 20:25     ` Jeff Law
2010-06-10 20:38       ` Joern Rennecke
2010-06-10 21:54       ` Quentin Neill
2010-06-10 22:09         ` H.J. Lu
2010-06-10 22:40           ` Quentin Neill
2010-06-10 22:48             ` H.J. Lu
2010-06-11 23:36               ` Quentin Neill
2010-06-12 18:54                 ` H.J. Lu
2010-06-13 21:54                   ` H.J. Lu
2010-06-14 13:37                     ` Michael Matz
2010-06-14 16:06                       ` Jakub Jelinek
2010-06-11  0:44       ` Daniel Jacobowitz
2010-06-11  5:58         ` Quentin Neill
2010-06-11 16:46           ` Daniel Jacobowitz
2010-06-11 19:21             ` Quentin Neill
2010-06-11 19:41               ` H.J. Lu
2010-06-11 19:50               ` Jakub Jelinek
2010-06-12 11:25               ` Andi Kleen
2010-06-12 22:45                 ` Ian Lance Taylor
2010-06-13 12:35                   ` Andi Kleen
2010-06-13 13:09                     ` Joern Rennecke
2010-06-13 14:36                       ` Andi Kleen
2010-06-13 15:02                         ` Joern Rennecke
2010-06-13 19:56                           ` Chris Lattner
2010-06-13 16:28                         ` Frank Ch. Eigler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).