Blocking Components In SID
==========================
This document outlines the design, implementation and use of blocking
components in SID.

Introduction
============
SID is a serial simulator. Components in the simulation are notified, one by
one, by a scheduling component that it is their turn to perform an activity.
Each component completes its activity and returns control to the scheduler
which then notifies the next component in turn. One might consider one
complete round of scheduling to represent one "cycle" of execution for the
simulated system.

There is no need for arbitration of resources, since only one component has
control at any one time. For most systems, simulating concurrency is not
a problem, since the timing of accesses to a resource by more than one
component is often unimportant.

One exception is when the purpose of the simulation is to model the
arbitration of access to a resource by more than one component. It may be
useful to simulate different arbitration schemes in order to determine which
one will be best for the system being designed. In this case, we need some way
of blocking access to a resource and a way to arbitrate among requests which
arrive in the same "cycle".

In order to achieve this, the accessing component requires some way to be
informed that a request has been denied as well as some way of being informed
that access is later granted. This can be accomplished by the introduction of
a new bus status, sid::bus::busy, and by the addition of a new mix-in class for
components -- blocking_component -- which provides components with the
ability to save state so that access to a resource can be retried when the
component is next activated by the scheduler.

Components may be arbitrarily complex, and the point at which access
is denied, or blocked, may be arbitrarily deep within the logic of the
implementation. One general way of saving state under these conditions
is to use a separate thread (the child thread) to perform the work of the
component. When the component is activated
by the scheduler, it activates its child thread to perform the task. If the
task becomes blocked for any reason, the child thread is suspended at the point
at which it became blocked, the parent thread regains control and, in turn,
returns control to the scheduler. During the next "cycle", when the component
is activated again, the child thread is awakened and retries the activity
which was blocked. This pattern repeats until the "cycle" in which the
component is no longer blocked and the task completes. At that point the child
thread is suspended again, but the point of suspension is at the beginning of
its task.

Note that the execution is still synchronous and deterministic, since only one
thread executes at any one time, having been given control in order to
perform its task and suspending when the task becomes blocked or is completed.
Similarly, when a child thread is activated, the activating thread (parent
thread) suspends until the child thread becomes blocked or completes its task.
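
The handoff described above can be sketched with a mutex and a condition
variable. This is only a toy model written for this document, not the code in
sidblockingutil.h: the class toy_blocking_component, its members and the
busy_cycles counter are all hypothetical, and the "resource" is simply a
counter which reports busy a fixed number of times.

```cpp
#include <cassert>
#include <pthread.h>

// Toy model of the parent/child handoff. Every name here is hypothetical;
// this is not the code from sidblockingutil.h.
class toy_blocking_component
{
public:
  // busy_cycles: how many times the simulated resource reports "busy"
  // before the task is allowed to complete.
  explicit toy_blocking_component (int busy_cycles)
    : child_turn (false), blocked (false), quit (false),
      busy_left (busy_cycles), completed (0)
  {
    pthread_mutex_init (&lock, NULL);
    pthread_cond_init (&changed, NULL);
    int rc = pthread_create (&child, NULL, child_root, this);
    assert (rc == 0);
    (void) rc;
  }

  ~toy_blocking_component ()
  {
    // Wake the child one last time so that it can leave its loop.
    pthread_mutex_lock (&lock);
    quit = true;
    child_turn = true;
    pthread_cond_broadcast (&changed);
    pthread_mutex_unlock (&lock);
    pthread_join (child, NULL);
    pthread_cond_destroy (&changed);
    pthread_mutex_destroy (&lock);
  }

  // Parent side, called once per scheduler "cycle". Resumes the child and
  // sleeps until the child hands control back. Returns true if the task
  // completed and false if it is still blocked.
  bool step ()
  {
    pthread_mutex_lock (&lock);
    child_turn = true;
    pthread_cond_broadcast (&changed);
    while (child_turn)
      pthread_cond_wait (&changed, &lock);
    bool done = !blocked;
    pthread_mutex_unlock (&lock);
    return done;
  }

  int tasks_completed () const { return completed; }

private:
  // Child side: hand control to the parent and sleep until resumed.
  // The caller must hold the lock.
  void yield_to_parent (bool is_blocked)
  {
    blocked = is_blocked;
    child_turn = false;
    pthread_cond_broadcast (&changed);
    while (!child_turn)
      pthread_cond_wait (&changed, &lock);
  }

  static void *child_root (void *self)
  {
    toy_blocking_component *c = static_cast<toy_blocking_component *> (self);
    pthread_mutex_lock (&c->lock);
    while (!c->child_turn)               // wait for the first activation
      pthread_cond_wait (&c->changed, &c->lock);
    while (!c->quit)
      {
        // The task: retry for as long as the resource is busy, suspending
        // at the point of blockage each time.
        while (c->busy_left > 0 && !c->quit)
          {
            --c->busy_left;
            c->yield_to_parent (true);
          }
        if (c->quit)
          break;
        ++c->completed;
        c->yield_to_parent (false);      // suspended at the start of the task
      }
    pthread_mutex_unlock (&c->lock);
    return NULL;
  }

  pthread_mutex_t lock;
  pthread_cond_t changed;
  pthread_t child;
  bool child_turn, blocked, quit;
  int busy_left, completed;
};
```

With busy_cycles set to 2, the first two calls to step () return false
(blocked) and the third returns true. Only one of the two threads ever runs at
a time, so the result is the same on every run.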

The following patch contains the implementation of the blocking_component
class, the implementation of a new blocking_cache_component and some
changes needed to support the implementation of a blocking cpu
component. I have an implementation of a blocking cpu which, unfortunately,
cannot be contributed at this time; however, I have provided a skeletal
sample implementation below. The patch also contains the implementation of
a virtual base class, bus_arbitrator, which can be extended to provide the
implementation of a bus arbitrator component.

The Patch
=========
This section will describe the changes and additions introduced by the
patch which follows.

sid/include/sidblockingutil.h:
------------------------------
This new header contains the definition of the sidutil::blocking_component
class, which is designed for virtual inheritance, similar to the existing
component "mix-in" classes like fixed_attribute_map_component.

This class is used to implement the threaded state saving algorithm described
above and may be virtually inherited by any component. The threads are
implemented using POSIX pthreads.

The constructor is declared as follows:

 blocking_component::blocking_component (void *child_self, void *(*f)(void *));

The 'child_self' argument is the 'this' pointer of the class which inherits
from blocking_component and is used to give the child thread access to the
class.

The 'f' argument is the entry point to the child thread. child_self will be
passed to this function when the child thread is created.

Note that blocking_component inherits from
fixed_attribute_map_with_logging_component. This is because component logging
was used to help debug the implementation and remains for use in debugging
possible future problems.

A boolean attribute, "blockable?" is provided to allow the blocking behaviour
to be enabled and disabled.

The remaining methods are as follows:

protected:
  // Called by the parent thread to ensure that a child thread exists
  //
  void need_child_thread ();
  // Called by the parent thread to signal the child thread to resume
  //
  int continue_child_thread_and_wait ();
public:
  // Called by the child thread once when it is created.
  //
  void child_init ();
  // Called by the child thread to signal normal completion of the child task
  //
  void child_completed ();
  // Called by the child thread to signal that it is blocked
  //
  void child_blocked ();
private:	    
  // Called by need_child_thread
  //
  void parent_init ();
  // Called by continue_child_thread_and_wait
  //
  int wait_for_child_thread ();
  // Called by child_completed and child_blocked
  //
  void child_wait_for_resume ();

The typical logic for the parent thread of the component is:

1) Component is activated (pin driven or bus receives request)
2) call need_child_thread ()
3) setup any state needed by the child
4) call continue_child_thread_and_wait ()
   - the parent thread suspends here until the child gives up control
5) return control to the activating component

For the logic of the child thread, we will use the child thread of the
blocking_cache_component, which is very typical:

extern "C" void *
blocking_cache_child_thread_root (void *comp)
{
  // Set up this thread to receive and handle signals from the parent thread.
  // This need only be done once.
  //
  blocking_cache_component *cache = static_cast<blocking_cache_component *>(comp);
  cache->child_init ();

  for (;;)
    {
      // Signal completion and wait for the signal to resume
      cache->child_completed ();

      // Now perform the transaction
      cache->perform_transaction ();
    }

  // We should never reach here.
  return NULL;
}

This function is called when the child thread is created (when the parent
thread calls need_child_thread). Its logic is as follows:

1) calls child_init () once and then signals completion right away. The parent
   thread will awaken it almost immediately (see parent logic above).
2) when awakened, it performs its activity and either
   a) signals completion if control returns to the main loop or
   b) signals that it is blocked if that condition arises during
      perform_transaction ().

   In either case, the child thread will wait for the parent thread to
   reawaken it.

Note that the child thread is never created if the component is never
activated and that a single child thread is used for the duration of the
simulation for this component.

Configury Changes
-----------------
Solaris requires the definition of some macros when using pthreads in order
to enable thread safety. The changes to sid/component/configure.in
ensure that these macro definitions are available for the
components which need this. The Makefile.in changes are a result
of running autoconf.

cache_component Changes: sid/component/cache/cache.{cxx,h}
----------------------------------------------------------
These are changes to the existing cache_component class which were necessary to
support the implementation of the new blocking_cache_component and its
application in bus arbitration modelling.

Changes include:

o A new operation-status pin which reports the status of the last operation
  - This is needed to return status for operations which are initiated by
    driving a pin and which could become blocked, such as
    flush-and-invalidate.

o Logic and a pin (data-width) for accessing the downstream components in units
  of 4 or 8 bytes.
  - This was needed for the implementation of an internal bus model which
    operated in units of 4 or 8 bytes.

o Logic and a pin (total-latency) for accumulating the total latency of a cache
  line flush or fill.
  - This is needed for determining the actual latency of a flush or refill
    burst in the presence of bus arbitration downstream

o Logic and virtual methods for handling unsuccessful reads/writes.
  - These are used as a hook by blocking_cache_component for handling accesses
    which are blocked downstream.

o Virtual methods (lock_downstream, unlock_downstream) required for modelling
  exclusive access to a downstream bus interface during a read/write burst.

o Virtual read/write methods
  - Used by blocking_cache_component to implement blockable reads/writes

o Virtual pin handlers
  - Used by blocking_cache_component to implement blockable operations

o New methods (read_downstream, write_downstream) simply encapsulate some logic
  which would have otherwise been coded identically in several places.

o Fixed a bug in flush_set, invalidate_set, flush_and_invalidate_set. The pins
  were driven with an address, which these methods were treating as a cache set
  index. Introduced the cache::addr_to_index method to convert the address to a
  set index before use.
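
To illustrate the kind of conversion involved in the last item, here is a
minimal sketch of mapping an address to a set index. The struct
toy_cache_geometry and its fields are hypothetical, and the actual
cache::addr_to_index in the patch may compute the index differently; the
sketch assumes the common scheme of discarding the offset within the line and
wrapping to the number of sets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical geometry for a set-associative cache; not the real
// cache_component members.
struct toy_cache_geometry
{
  std::uint32_t line_size;  // bytes per cache line
  std::uint32_t num_sets;   // number of sets

  // Drop the offset-within-line bits, then wrap to the number of sets.
  std::uint32_t addr_to_index (std::uint32_t address) const
  {
    assert (line_size > 0 && num_sets > 0);
    return (address / line_size) % num_sets;
  }
};
```

Treating the raw address as an index, as the old code did, would only be
correct for addresses that happen to be smaller than the number of sets.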

blocking_cache_component: sid/component/cache/cache.{cxx,h}
-----------------------------------------------------------
This class inherits from cache_component and uses the new virtual interfaces
to implement blocking behaviour when a downstream component returns
sid::bus::busy.

o handle_{read,write}_error: These virtual methods are called when a request
downstream returns something other than sid::bus::ok. If the status is not
sid::bus::busy, then the status is passed upstream as usual. Otherwise,
child_blocked is called (on the child thread), which suspends that thread and
returns control to the parent. The child thread remains suspended until it is
awoken again by the parent.

o The remaining methods are blockable versions of the handlers for each type of
cache request (bus reads/writes and transactions initiated by driving input
pins).
In each case the blockable implementation checks the "blockable?" attribute and
calls the normal handler if it is false. Otherwise it:
o calls need_child_thread to ensure that the child thread has been created
o sets up the transaction details for the child thread
o calls continue_child_thread_and_wait to execute the transaction
o returns or reports the transaction status when control is returned

New cache component types
-------------------------
Using the implementation above, several new cache component types are now
available:

hw-blocking-cache-basic
hw-blocking-cache-buffer-8
hw-blocking-cache-direct/<s>kb/<l>
hw-blocking-cache-<a>/<s>kb/<l>

Each of these corresponds to an existing non-blocking cache type.

BlockingCacheCfg: sid/main/dynamic/commonCfg.{cxx,h}
----------------------------------------------------
The configuration class has been added to support the creation of the new
cache types above.

Changes to enable blocking cpu implementation used to model bus arbitration
===========================================================================
These changes make the implementation of a blocking cpu component possible.
I have such a cpu implemented, however I am unable to contribute it at this
time.

sid/component/cgen-cpu/cgen-cpu.h
---------------------------------
o GETMEM*, SETMEM*, GETIMEM*, SETIMEM* are
no longer 'const' methods, since blocking during these operations may require
some internal state to be changed.

sid/include/sidbusutil.h
-------------------------
o The readAny and writeAny methods of word_bus now track the maximum latency
of the reads and writes performed during the transaction and return that as the
overall latency of the transaction.
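
As an illustration of the latency rule, the sketch below combines the
statuses of the partial accesses which make up one transaction. The types
toy_code and toy_status and the function combine_partial_accesses are
hypothetical stand-ins, not the word_bus code, and stopping at the first
failing access is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for sid::bus::status and its status codes.
enum toy_code { toy_ok, toy_busy, toy_unmapped };
struct toy_status { toy_code code; unsigned latency; };

// Combine the statuses of the partial accesses that make up one
// word-wide transaction.
toy_status combine_partial_accesses (const std::vector<toy_status> &parts)
{
  toy_status overall = { toy_ok, 0 };
  for (std::size_t i = 0; i < parts.size (); ++i)
    {
      if (parts[i].latency > overall.latency)
        overall.latency = parts[i].latency;   // keep the maximum, not the sum
      if (parts[i].code != toy_ok)
        {
          overall.code = parts[i].code;       // assumption: stop at first failure
          break;
        }
    }
  return overall;
}
```

For three successful partial accesses with latencies 1, 4 and 2, the overall
latency reported is 4.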

sid/include/sidcomp.h
---------------------
o A new enumerator has been added --- sid::bus::busy.

sid/include/sidcpuutil.h
------------------------
o basic_cpu now inherits virtually from its base classes in order to avoid
unexpected complications when mixing in blocking_component lower in the
hierarchy.

o {read,write}_{insn,data}_memory* are no longer 'const' since the
implementation of blocking on reads/writes may require state changes.

o New virtual methods handle_{insn,data}_memory_{read,write}_error may be used
as hooks for implementing blocking on reads/writes. The default methods
return false to indicate that the error was not handled.

o New virtual methods record_{insn,data}_{read,write}_latency may be used to
record the latency caused by blocked reads/writes. The default methods add
the given latency to total_latency.

o {read,write}_{insn,data}_memory now call the new methods documented above.
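
The way these hooks fit together can be sketched as follows. This is a
simplified model, not the basic_cpu code: toy_cpu, toy_retrying_cpu,
toy_status, busy_replies and the fake bus_read are hypothetical, and the retry
loop after a handled error is an assumption about how the hooks are used.

```cpp
#include <cassert>

// Hypothetical stand-ins for sid::bus::status and its status codes.
enum toy_code { toy_ok, toy_busy, toy_unmapped };
struct toy_status { toy_code code; unsigned latency; };

class toy_cpu
{
public:
  toy_cpu () : total_latency (0), busy_replies (0) {}
  virtual ~toy_cpu () {}

  // Returns true if the read eventually succeeded.
  bool read_data_memory (unsigned address, unsigned &value)
  {
    for (;;)
      {
        toy_status s = bus_read (address, value);
        if (s.code == toy_ok)
          {
            record_data_read_latency (s);
            return true;
          }
        if (!handle_data_memory_read_error (s, address))
          return false;          // not handled: report the failure
        // handled: assume the hook wants the access retried
      }
  }

  unsigned total_latency;
  unsigned busy_replies;   // how many more times the fake bus says "busy"

protected:
  // Default hooks: errors are not handled, latency is accumulated.
  virtual bool handle_data_memory_read_error (toy_status, unsigned &)
    { return false; }
  virtual void record_data_read_latency (toy_status s)
    { total_latency += s.latency; }

private:
  // Stand-in for a downstream bus.
  toy_status bus_read (unsigned address, unsigned &value)
  {
    toy_status s = { toy_ok, 2 };
    if (busy_replies > 0)
      {
        --busy_replies;
        s.code = toy_busy;
        s.latency = 1;
        return s;
      }
    value = address + 1;
    return s;
  }
};

class toy_retrying_cpu : public toy_cpu
{
protected:
  bool handle_data_memory_read_error (toy_status s, unsigned &) override
  {
    if (s.code != toy_busy)
      return false;
    total_latency += s.latency;  // a blocking cpu would suspend here instead
    return true;
  }
};
```

A cpu using the default hook gives up on the first busy reply, while the
derived class absorbs busy replies and retries until the read succeeds.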

Miscellaneous Changes
=====================

sid/include/sidattrutil.h
-------------------------
o Some methods and members which had previously been moved to the logger class
in sidmiscutil.h were still present in sid_attribute_map_with_logging_component
and were unused. These have been removed.

o The check_level method is declared to return bool, but was not returning
a value. This has been corrected.

o The members and methods of sid_attribute_map_with_logging_component are now
protected (some were private) to allow access from inheriting classes.

Sample implementation of a blocking cpu component
=================================================
The sample below is a skeletal implementation of a blocking cpu component
which blocks on reads/writes from/to data/insn memory when sid::bus::busy
is returned.

In order to model latency in the presence of other components, the sample
records the latency returned with the status of every read/write (blocked
or not), schedules itself so that it will not be
called to step again until that latency has expired and then blocks itself
for the same duration (see record_latency).

------------------------------------------------------
extern "C" void *blocking_cpu_child_thread_root (void *comp);

// Abstract class!
class blocking_cpu: public cgen_bi_endian_cpu, public blocking_component
  {
    public:
      blocking_cpu ();
      ~blocking_cpu () throw() { };

      // blockable thread support
      //
    public:
      virtual void step_pin_handler (sid::host_int_4);

      void parent_step_pin_handler (sid::host_int_4 v)
      {
	blocked_latency = 0;
	cgen_bi_endian_cpu::step_pin_handler (v);
      }

    protected:
      virtual bool handle_insn_memory_read_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
      virtual bool handle_insn_memory_write_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
      virtual bool handle_data_memory_read_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
      virtual bool handle_data_memory_write_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }

      // Handles errors for all of the above.
      bool handle_bus_error (sid::bus::status s);

      virtual void record_insn_memory_read_latency (sid::bus::status s)
        { record_latency (s); }
      virtual void record_data_memory_read_latency (sid::bus::status s)
        { record_latency (s); }

      void record_latency (sid::bus::status s)
        {
	  if (s.latency == 0)
	    return;
	  total_latency += s.latency;
	  if (blockable)
	    {
	      blocked_latency += s.latency;
	      cgen_bi_endian_cpu::stepped (s.latency);
	      child_blocked ();
	    }
	}
      virtual void stepped (sid::host_int_4 n)
        {
	  cgen_bi_endian_cpu::stepped (n - blocked_latency);
	}
      sid::host_int_4 blocked_latency;
  };


// Constructor
blocking_cpu::blocking_cpu ()
  : blocking_component (this, blocking_cpu_child_thread_root)
{
}

// Virtual override of step_pin_handler
//
void 
blocking_cpu::step_pin_handler (sid::host_int_4 v)
{
  if (blockable)
    {
      // Signal the child thread to resume
      need_child_thread ();
      continue_child_thread_and_wait ();
      return;
    }

  cgen_bi_endian_cpu::step_pin_handler (v);
}

// Handles bus errors from reads and writes from/to insn and data memory.
// Specifically, bus::busy is handled in blockable mode.
//
bool
blocking_cpu::handle_bus_error (sid::bus::status s)
{
  if (s != sid::bus::busy)
    return false; // not handled

  // Reschedule for after the length of time the bus will be busy.
  // This will also block this child thread so that we continue
  // from here when scheduled again.
  record_latency (s);

  return true;
}

// This function is the root of the blockable child thread. It gets passed
// to pthread_create.
//
extern "C" void *
blocking_cpu_child_thread_root (void *comp)
{
  // Set up this thread to receive and handle signals from the parent thread.
  // this need only be done once.
  //
  blocking_cpu *cpu = static_cast<blocking_cpu *>(comp);
  cpu->child_init ();

  for (;;)
    {
      // Signal completion and wait for the signal to resume
      cpu->child_completed ();

      // Call the parent class' step_pin_handler
      cpu->parent_step_pin_handler (1);
    }

  // We should never reach here.
  return NULL;
}

New virtual base class: sidutil::bus_arbitrator
===============================================
This class is designed to be the base class for a customized bus arbitrator
component. The component is designed to accept read/write requests from
multiple upstream busses and to map them to multiple downstream accessors
while prioritizing the requests using an implementation defined strategy.
Upstream and downstream interfaces are identified using integral indices the
assignment of which is implementation defined.

Features include:

o read/write methods which identify the upstream interface, the address and
the size of the request. Mapping of upstream requests to downstream interfaces
is implementation defined.

o helper classes, input_interface and bus_request, help automate the delivery
of upstream requests to the arbitration logic.

o virtual methods for customizing the behaviour of the arbitrator. In many
cases, the default implementations are sufficient.

o passthrough capability which bypasses the arbitration logic when the system
is initializing or is idle (e.g. stopped by GDB).

o scheduling and methods to manage the passing of time (cycles) provide
the capability to compute accurate latencies for requests.

Adding upstream interfaces
--------------------------
The input_interface class inherits from sid::bus, so upstream interfaces are
added in the usual way using add_accessor. Each input_interface is assigned
a unique integer index when constructed, so that the arbitration logic knows
which interface is making each request.
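
The role of the index can be sketched as follows. The classes
toy_input_interface and toy_arbitrator, the members up0, up1 and
last_upstream, and the simplified read signature are hypothetical; the real
input_interface inherits from sid::bus and is registered with add_accessor.

```cpp
#include <cassert>

class toy_arbitrator;

// Hypothetical upstream interface: it remembers its own index and passes
// it along with every request.
class toy_input_interface
{
public:
  toy_input_interface (toy_arbitrator *owner, int index)
    : owner (owner), index (index) {}
  int read (unsigned address);
private:
  toy_arbitrator *owner;
  int index;
};

class toy_arbitrator
{
public:
  toy_arbitrator () : up0 (this, 0), up1 (this, 1), last_upstream (-1) {}

  // Records which upstream interface made the request.
  int arbitrate_read (int upstream, unsigned address)
  {
    (void) address;
    last_upstream = upstream;
    return 0;
  }

  toy_input_interface up0, up1;
  int last_upstream;
};

inline int toy_input_interface::read (unsigned address)
{
  return owner->arbitrate_read (index, address);
}
```

A read arriving on up1 reaches the arbitration logic tagged with index 1.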

Virtual methods
---------------
These methods may be specialized in order to implement arbitrary arbitration
strategies:

  virtual bool prioritize_request (bus_request &r);

This method examines the given request. It returns true if it should be
serviced right away and returns false otherwise.

The default method simply returns true (i.e. there is no arbitration).
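
As one example of a non-default strategy, the sketch below grants at most one
request per "cycle" and denies the rest. The class toy_one_grant_arbitrator
and the struct toy_bus_request are hypothetical and do not derive from the
real bus_arbitrator; resetting the grant in step_cycle is an assumption of
the sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for bus_request.
struct toy_bus_request { int upstream; unsigned address; };

class toy_one_grant_arbitrator
{
public:
  toy_one_grant_arbitrator () : granted_this_cycle (false) {}

  // Grant the first request seen in a cycle; deny all later ones.
  bool prioritize_request (toy_bus_request &r)
  {
    (void) r;
    if (granted_this_cycle)
      return false;
    granted_this_cycle = true;
    return true;
  }

  // Called once per simulated cycle: a new grant becomes available.
  void step_cycle () { granted_this_cycle = false; }

private:
  bool granted_this_cycle;
};
```

A denied requester would receive sid::bus::busy and retry in a later cycle.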

  virtual void lock_downstream (int upstream, int downstream);

If the model requires locking an interface for the duration of several accesses
then this method should lock the given downstream interface if the given
upstream interface is locked and unlock it otherwise. The mechanism for locking
an interface (e.g. pin, attribute, etc.) is implementation defined.

The default method simply returns without providing any locking.

  virtual sid::bus::status set_route_busy (bus_request &r, sid::bus::status s);

This method should set state indicating that the route represented by the
request r is busy for the number of cycles indicated by the latency contained
within the status s. How this is done is implementation defined.

The default method simply returns s without setting any busy state.

  virtual bool check_route_busy (int upstream, int downstream);

This method is called after prioritize_request has indicated
that a request should be processed. It returns true if the route through the
arbitrator from the upstream interface to the downstream interface is busy.
This can happen, for example,  if a previous request used one of the interfaces
and the latency of that request has not yet elapsed.

The default implementation simply returns false (i.e. not busy).

  virtual sid::bus::status busy_status ();

This method is called after prioritize_request has determined that a request
can not be handled right away or after check_route_busy has returned true. It
should return sid::bus::busy with the latency set to the minimum
number of "cycles" which the requesting component should wait before trying
again.

The default implementation sets the latency to 1.

  virtual void step_cycle ();

Handles the step-event pin which is normally driven by the target scheduler.

The default implementation simply calls another virtual method,
update_busy_routes.

  virtual void update_busy_routes ();

This method should update any state associated with interfaces being busy and
should be called once per simulated "cycle". By default, it is called once by
the step_cycle method each time the step-event pin is driven.

The default implementation does nothing.
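
One possible shape for the busy state shared by set_route_busy,
check_route_busy and update_busy_routes is a countdown per downstream
interface. The class toy_route_table is hypothetical, and keying the state on
the downstream index alone is an assumption; an implementation may equally
track upstream interfaces or complete routes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical busy-route bookkeeping: one countdown per downstream index.
class toy_route_table
{
public:
  explicit toy_route_table (std::size_t n_downstream)
    : remaining (n_downstream, 0) {}

  // A granted request occupies the route for 'latency' cycles.
  void set_route_busy (std::size_t downstream, unsigned latency)
  {
    assert (downstream < remaining.size ());
    remaining[downstream] = latency;
  }

  bool check_route_busy (std::size_t downstream) const
  {
    assert (downstream < remaining.size ());
    return remaining[downstream] > 0;
  }

  // Called once per simulated cycle: one cycle of every countdown elapses.
  void update_busy_routes ()
  {
    for (std::size_t i = 0; i < remaining.size (); ++i)
      if (remaining[i] > 0)
        --remaining[i];
  }

private:
  std::vector<unsigned> remaining;
};
```

A route marked busy for 2 cycles is reported as busy until
update_busy_routes has been called twice.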

  virtual void reschedule (sid::host_int_2 latency);

This method reschedules the arbitrator using the step-control pin. It should
be called by the implementation whenever internal state which must be
updated as time passes has been set. This is implementation defined, however
this generally occurs when:

o a request has been accepted and set_route_busy has saved state associated
with the busy route
o the step-event pin has been driven and update_busy_routes has updated the
state of busy routes

If the internal state is changed such that no updates are required with the
passage of time, then reschedule need not be called.

The default implementation ignores the given latency and reschedules for 1 tick
later.

  virtual int downstream_for_address (sid::host_int_4 address) = 0;

Returns the index of the downstream accessor associated with the given
address.

Indices are assigned by the implementation.

  virtual sid::bus *downstream_bus (int downstream) = 0;

This method should return a pointer to the downstream accessor identified by
the given index. The index will have been obtained from the
downstream_for_address method.

  virtual const char *up2str (int upstream) = 0;

This method maps an upstream interface index to a name. It is used in logging
messages.

Arbitration
-----------
Each read or write request on an upstream interface will trigger a call to
arbitrate_read or arbitrate_write respectively. These methods will 
check whether the request should be passed through. If so the request is
passed immediately to the proper downstream accessor. Otherwise they will:

o create a bus_request representing the read/write request
o pass the bus_request to prioritize_request. If true is returned then the
request is handled immediately using perform_read or perform_write. Otherwise
sid::bus::busy is returned with the latency computed by busy_status.
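
The flow above can be sketched as follows. This is not the bus_arbitrator
code: toy_arbitrator, toy_busy_route_arbitrator, toy_status and
toy_bus_request are hypothetical, the fake perform_read and address map are
invented for the sketch, and the points at which check_route_busy and
set_route_busy are called are assumptions based on the method descriptions.

```cpp
#include <cassert>

// Hypothetical stand-ins for sid::bus::status and bus_request.
enum toy_code { toy_ok, toy_busy };
struct toy_status { toy_code code; unsigned latency; };
struct toy_bus_request { int upstream; int downstream; unsigned address; };

class toy_arbitrator
{
public:
  toy_arbitrator () : passthrough (false) {}
  virtual ~toy_arbitrator () {}

  bool passthrough;

  toy_status arbitrate_read (int upstream, unsigned address)
  {
    int downstream = downstream_for_address (address);
    if (passthrough)
      return perform_read (downstream, address);   // bypass arbitration

    toy_bus_request r = { upstream, downstream, address };
    if (!prioritize_request (r))
      return busy_status ();
    if (check_route_busy (upstream, downstream))
      return busy_status ();
    return set_route_busy (r, perform_read (downstream, address));
  }

protected:
  // Defaults mirror the documented ones: no arbitration, nothing busy.
  virtual bool prioritize_request (toy_bus_request &) { return true; }
  virtual bool check_route_busy (int, int) { return false; }
  virtual toy_status set_route_busy (toy_bus_request &, toy_status s)
    { return s; }
  virtual toy_status busy_status ()
    { toy_status s = { toy_busy, 1 }; return s; }

  // Invented for the sketch: two downstream regions and a fixed latency.
  virtual int downstream_for_address (unsigned address)
    { return address >= 0x8000 ? 1 : 0; }
  virtual toy_status perform_read (int, unsigned)
    { toy_status s = { toy_ok, 3 }; return s; }
};

// A variant whose routes are always busy, to show the denial path.
class toy_busy_route_arbitrator : public toy_arbitrator
{
protected:
  bool check_route_busy (int, int) override { return true; }
};
```

With the defaults every request is granted; with a busy route the requester
receives a busy status with latency 1, unless passthrough is set.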

Scheduling
----------
The bus_arbitrator component has a step-event pin and a step-control pin which
are intended to be connected in the usual way with the target scheduler
component. The implementation should cause the arbitrator to be scheduled in
such a way that the passage of time in "cycles" can be managed, if necessary.
For example, if a request is granted and has a latency of n "cycles", then
the arbitrator should schedule itself such that the passage of those n cycles
can be detected.

Passthrough
-----------
The arbitration logic will only be executed if the "passthrough" pin is
inactive and the "running" and "active" pins are both active.
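
Stated as a predicate, with pin levels modelled as plain booleans (the
function names arbitration_enabled and check_passthrough_examples are
hypothetical):

```cpp
#include <cassert>

// True when requests should go through the arbitration logic rather than
// being passed straight through.
inline bool arbitration_enabled (bool passthrough, bool running, bool active)
{
  return !passthrough && running && active;
}

// The three situations mentioned in this section.
inline void check_passthrough_examples ()
{
  assert (!arbitration_enabled (false, false, true));  // still loading
  assert (!arbitration_enabled (false, true, false));  // stopped, e.g. by GDB
  assert (!arbitration_enabled (true, true, true));    // explicit passthrough
  assert (arbitration_enabled (false, true, true));    // normal simulation
}
```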

Thus, if the "running", "active" and "passthrough" pins are connected as
follows, requests to the arbitrator will automatically be passed through
(bypassing the arbitration logic) during loading of the executable and when GDB
has stopped the simulation:

o the "running" pin should be connected to an init-seq output which is driven
after the one connected to the loader's "load!" pin.

o the "active" pin should be connected to the sim-sched's "active" pin.

o the "passthrough" pin may be connected to the output pin of any component
which has a need to set the arbitrator into passthrough mode. For example,
a cpu component should drive this pin with a non-zero value before executing
a syscall via the gloss component and drive it again with a value of zero
after the syscall finishes.