Blocking Components In SID
==========================

This document outlines the design, implementation and use of blocking components in SID.

Introduction
============

SID is a serial simulator. Components in the simulation are notified, one by one, by a scheduling component that it is their turn to perform an activity. Each component completes its activity and returns control to the scheduler which then notifies the next component in turn. One might consider one complete round of scheduling to represent one "cycle" of execution for the simulated system. There is no need for arbitration of resources, since only one component has control at any one time.

For most systems, simulating concurrency is not a problem, since the timing of accesses to a resource by more than one component is often unimportant. One exception is when the purpose of the simulation is to model the arbitration of access to a resource by more than one component. It may be useful to simulate different arbitration schemes in order to determine which one will be best for the system being designed. In this case, we need some way of blocking access to a resource and a way to arbitrate among requests which arrive in the same "cycle".

In order to achieve this, the accessing component requires some way to be informed that a request has been denied as well as some way of being informed that access is later granted. This can be accomplished by the introduction of a new bus status, sid::bus::busy, and by the addition of a new mix-in class for components -- blocking_component -- which provides components with the ability to save state so that access to a resource can be retried when the component is next activated by the scheduler.

Components may be arbitrarily complex and the point at which access is denied, or blocked may be arbitrarily deep within the logic of the implementation.
One way of saving state in a general way under these conditions is by using a separate thread to perform the work of the component (child thread). When the component is activated by the scheduler, it activates its child thread to perform the task. If the task becomes blocked for any reason, the child thread is suspended at the point at which it became blocked, the parent thread regains control and, in turn, returns control to the scheduler. During the next "cycle" when the component is activated again, the child thread is awakened and it retries the activity which was blocked. This pattern repeats until the "cycle" at which the component is no longer blocked. In this case, the child thread is still suspended; however, the point of suspension is at the beginning of its task.

Note that the execution is still synchronous and deterministic, since only one thread executes at any one time, having been given control in order to perform its task and suspending when the task becomes blocked or is completed. Similarly, when a child thread is activated, the activating thread (parent thread) suspends until the child thread becomes blocked or completes its task.

The following patch contains the implementation of the blocking_component class, the implementation of a new blocking_cache_component and some changes needed in order to support the implementation of a blocking cpu component (I have an implementation of a blocking cpu which, unfortunately, cannot be contributed at this time; however, I have provided a skeletal sample implementation below). The patch also contains the implementation of a virtual base class, bus_arbitrator, which can be extended to provide the implementation of a bus arbitrator component.

The Patch
=========

This section will describe the changes and additions introduced by the patch which follows.

sid/include/sidblockingutil.h:
------------------------------
This new header contains the definition of the sidutil::blocking_component class which is designed for virtual inheritance similar to the existing component "mix-in" classes like fixed_attribute_map_component. This class is used to implement the threaded state saving algorithm described above and may be virtually inherited by any component. The threads are implemented using POSIX pthreads.

The constructor is declared as follows:

  blocking_component::blocking_component (void *child_self, void *(*f)(void *));

The 'child_self' argument is the 'this' pointer of the class which inherits from blocking_component and is used to give the child thread access to the class. The 'f' argument is the entry point to the child thread. child_self will be passed to this function when the child thread is created.

Note that blocking_component inherits from fixed_attribute_map_with_logging_component. This is because component logging was used to help debug the implementation and remains for use in debugging possible future problems.

A boolean attribute, "blockable?", is provided to allow the blocking behaviour to be enabled and disabled.

The remaining methods are as follows:

protected:
  // Called by the parent thread to ensure that a child thread exists
  //
  void need_child_thread ();
  // Called by the parent thread to signal the child thread to resume
  //
  int continue_child_thread_and_wait ();
public:
  // Called by the child thread once when it is created.
  //
  void child_init ();
  // Called by the child thread to signal normal completion of the child task
  //
  void child_completed ();
  // Called by the child thread to signal that it is blocked
  //
  void child_blocked ();
private:
  // Called by need_child_thread
  //
  void parent_init ();
  // Called by continue_child_thread_and_wait
  //
  int wait_for_child_thread ();
  // Called by child_completed and child_blocked
  //
  void child_wait_for_resume ();

The typical logic for the parent thread of the component is:

  1) Component is activated (pin driven or bus receives request)
  2) call need_child_thread ()
  3) set up any state needed by the child
  4) call continue_child_thread_and_wait ()
     - the parent thread suspends here until the child gives up control
  5) return control to the activating component

For the logic of the child thread, we will use the child thread of the blocking_cache_component, which is very typical:

extern "C" void *
blocking_cache_child_thread_root (void *comp)
{
  // Set up this thread to receive and handle signals from the parent thread.
  // This need only be done once.
  //
  blocking_cache_component *cache = static_cast<blocking_cache_component *>(comp);
  cache->child_init ();

  for (;;)
    {
      // Signal completion and wait for the signal to resume
      cache->child_completed ();

      // Now perform the transaction
      cache->perform_transaction ();
    }

  // We should never reach here.
  return NULL;
}

This function is called when the child thread is created (when the parent thread calls need_child_thread). Its logic is as follows:

  1) calls child_init () once and then signals completion right away. The parent thread will awaken it almost immediately (see parent logic above).
  2) when awakened, it performs its activity and either
     a) signals completion if control returns to the main loop or
     b) signals that it is blocked if that condition arises during perform_transaction ().
     In either case, the child thread will wait for the parent thread to reawaken it.
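
The parent/child handoff described above can be sketched outside SID as follows. This is only an illustration under stated assumptions, not the code from the patch: it assumes C++11 and swaps in std::thread with a mutex and a condition variable in place of the pthreads mechanism used by blocking_component. The class name handoff_sketch, the bool return value of continue_child_thread_and_wait and the quit handling in the destructor are hypothetical and exist only so that the example is self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the mix-in. Method names mirror the document,
// but nothing here is taken from sidblockingutil.h.
class handoff_sketch
{
public:
  typedef std::function<void (handoff_sketch &)> task_fn;

  explicit handoff_sketch (task_fn task)
    : task_ (task), quit_ (false), done_ (false), turn_ (parent_turn) {}

  ~handoff_sketch ()
  {
    if (!child_.joinable ())
      return;                       // child was never created
    {
      std::lock_guard<std::mutex> lock (m_);
      quit_ = true;                 // assumption: shut the child down on exit
      turn_ = child_turn;
    }
    cv_.notify_all ();
    child_.join ();
  }

  // Parent side: create the child lazily, on first activation.
  void need_child_thread ()
  {
    if (child_.joinable ())
      return;
    turn_ = child_turn;             // no child exists yet, so no lock needed
    child_ = std::thread (&handoff_sketch::child_root, this);
    wait_for_child ();              // returns once the child has parked itself
  }

  // Parent side: resume the child, then sleep until it completes or blocks.
  // Returns true on completion and false if the child reported it is blocked.
  bool continue_child_thread_and_wait ()
  {
    {
      std::lock_guard<std::mutex> lock (m_);
      turn_ = child_turn;
    }
    cv_.notify_all ();
    return wait_for_child ();
  }

  // Child side: report a blocked transaction and sleep until resumed.
  void child_blocked () { yield_to_parent (false); }

private:
  struct quit_request {};
  enum whose_turn { parent_turn, child_turn };

  bool wait_for_child ()
  {
    std::unique_lock<std::mutex> lock (m_);
    cv_.wait (lock, [this] { return turn_ == parent_turn; });
    return done_;
  }

  void yield_to_parent (bool done)
  {
    std::unique_lock<std::mutex> lock (m_);
    done_ = done;
    turn_ = parent_turn;
    cv_.notify_all ();
    cv_.wait (lock, [this] { return turn_ == child_turn; });
    if (quit_)
      throw quit_request ();        // unwind the child, even from mid-task
  }

  void child_root ()
  {
    try
      {
        for (;;)
          {
            yield_to_parent (true); // signal completion, wait for resume
            task_ (*this);          // perform the transaction
          }
      }
    catch (const quit_request &) {}
  }

  task_fn task_;
  std::mutex m_;
  std::condition_variable cv_;
  bool quit_;
  bool done_;
  whose_turn turn_;
  std::thread child_;
};
```

In this sketch a task that calls child_blocked () twice before finishing makes continue_child_thread_and_wait () return false on the first two activations and true on the third. Only one of the two threads is ever runnable at a time, which is the property that keeps the simulation deterministic.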

Note that the child thread is never created if the component is never activated and that a single child thread is used for the duration of the simulation for this component.

Configury Changes
-----------------
Solaris requires the definition of some macros when using pthreads in order to enable thread safety. The changes to sid/component/configure.in ensure that these macro definitions are available for the components which need them. The Makefile.in changes are a result of running autoconf.

cache_component Changes: sid/component/cache/cache.{cxx,h}
----------------------------------------------------------
These are changes to the existing cache_component class which were necessary to support the implementation of the new blocking_cache_component and its application in bus arbitration modelling. Changes include:

o A new operation-status pin which reports the status of the last operation
  - This is needed to return status for operations which are initiated by driving a pin and which could become blocked, such as flush-and-invalidate.
o Logic and a pin (data-width) for accessing the downstream components in units of 4 or 8 bytes.
  - This was needed for the implementation of an internal bus model which operated in units of 4 or 8 bytes.
o Logic and a pin (total-latency) for accumulating the total latency of a cache line flush or fill.
  - This is needed for determining the actual latency of a flush or refill burst in the presence of bus arbitration downstream.
o Logic and virtual methods for handling unsuccessful reads/writes.
  - These are used as a hook by blocking_cache_component for handling accesses which are blocked downstream.
o Virtual methods (lock_downstream, unlock_downstream) required for modelling exclusive access to a downstream bus interface during a read/write burst.
o Virtual read/write methods
  - Used by blocking_cache_component to implement blockable reads/writes
o Virtual pin handlers
  - Used by blocking_cache_component to implement blockable operations
o New methods (read_downstream, write_downstream) simply encapsulate some logic which would have otherwise been coded identically in several places.
o Fixed a bug in flush_set, invalidate_set, flush_and_invalidate_set. The pins were driven with an address, which these methods were treating as a cache set index. Introduced the cache::addr_to_index method to convert the address to a set index before use.

blocking_cache_component: sid/component/cache/cache.{cxx,h}
-----------------------------------------------------------
This class inherits from cache_component and uses the new virtual interfaces to implement blocking behaviour when a downstream component returns sid::bus::busy.

o handle_{read,write}_error: These virtual methods are called when a request downstream returns something other than sid::bus::ok. If the status is not sid::bus::busy, then the status is passed upstream as usual. Otherwise, child_blocked (child thread) is called which will suspend the thread and return control to the parent. The child thread will remain suspended until it is awoken again by the parent.

o The remaining methods are blockable versions of the handlers for each type of cache request (bus reads/writes and transactions initiated by driving input pins). In each case the blockable implementation checks the "blockable?" attribute and calls the normal handler if it is false.
  Otherwise it
    o calls need_child_thread to ensure that the child thread has been created
    o sets up the transaction details for the child thread
    o calls continue_child_thread_and_wait to execute the transaction
    o returns or reports the transaction status when control is returned

New cache component types
-------------------------
Using the implementation above, several new cache component types are now available:

  hw-blocking-cache-basic
  hw-blocking-cache-buffer-8
  hw-blocking-cache-direct/kb/
  hw-blocking-cache-/kb/

Each of these corresponds to an existing non-blocking cache type.

BlockingCacheCfg: sid/main/dynamic/commonCfg.{cxx,h}
----------------------------------------------------
The configuration class has been added to support the creation of the new cache types above.

Changes to enable blocking cpu implementation used to model bus arbitration
===========================================================================
These changes make the implementation of a blocking cpu component possible. I have such a cpu implemented; however, I am unable to contribute it at this time.

sid/component/cgen-cpu/cgen-cpu.h
---------------------------------
o GETMEM*, SETMEM*, GETIMEM*, SETIMEM* are no longer 'const' methods, since blocking during these operations may require some internal state to be changed.

sid/include/sidbusutil.h
-------------------------
o The readAny and writeAny methods of word_bus now track the maximum latency of the reads and writes performed during the transaction and return that as the overall latency of the transaction.

sid/include/sidcomp.h
---------------------
o A new enumerator has been added --- sid::bus::busy.

sid/include/sidcpuutil.h
------------------------
o basic_cpu now inherits virtually from its base classes in order to avoid unexpected complications when mixing in blocking_component lower in the hierarchy.
o {read,write}_{insn,data}_memory* are no longer 'const' since the implementation of blocking on reads/writes may require state changes.
o New virtual methods handle_{insn,data}_memory_{read,write}_error may be used as hooks for implementing blocking on reads/writes. The default methods return false to indicate that the error was not handled.
o New virtual methods record_{insn,data}_{read,write}_latency may be used to record the latency caused by blocked reads/writes. The default methods add the given latency to total_latency.
o {read,write}_{insn,data}_memory now call the new methods documented above.

Miscellaneous Changes
====================

sid/include/sidattrutil.h
-------------------------
o Some methods and members which had previously been moved to the logger class in sidmiscutil.h were still present in sid_attribute_map_with_logging_component and were unused. These have been removed.
o The check_level method is declared to return bool, but was not returning anything.
o The members and methods of sid_attribute_map_with_logging_component are now protected (some were private) to allow access from inheriting classes.

Sample implementation of a blocking cpu component
=================================================
The sample below is a skeletal implementation of a blocking cpu component which blocks on reads/writes from/to data/insn memory when sid::bus::busy is returned. In order to model latency in the presence of other components, it notes the latency returned with the status of all reads/writes (blocked and unblocked), schedules itself such that it won't be called to step again until that latency has expired and then blocks itself for the same duration (see record_latency).

------------------------------------------------------
extern "C" void *blocking_cpu_child_thread_root (void *comp);

// Abstract class!
class blocking_cpu :
  public cgen_bi_endian_cpu,
  public blocking_component
{
public:
  blocking_cpu ();
  ~blocking_cpu () throw() { }

  // blockable thread support
  //
public:
  virtual void step_pin_handler (sid::host_int_4);
  void parent_step_pin_handler (sid::host_int_4 v)
  {
    blocked_latency = 0;
    cgen_bi_endian_cpu::step_pin_handler (v);
  }

protected:
  virtual bool handle_insn_memory_read_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
  virtual bool handle_insn_memory_write_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
  virtual bool handle_data_memory_read_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }
  virtual bool handle_data_memory_write_error (sid::bus::status s, sid::host_int_4 & address) { return handle_bus_error (s); }

  // Handles errors for all of the above.
  bool handle_bus_error (sid::bus::status s);

  virtual void record_insn_memory_read_latency (sid::bus::status s) { record_latency (s); }
  virtual void record_data_memory_read_latency (sid::bus::status s) { record_latency (s); }

  void record_latency (sid::bus::status s)
  {
    if (s.latency == 0)
      return;
    total_latency += s.latency;
    if (blockable)
      {
        blocked_latency += s.latency;
        cgen_bi_endian_cpu::stepped (s.latency);
        child_blocked ();
      }
  }

  virtual void stepped (sid::host_int_4 n)
  {
    cgen_bi_endian_cpu::stepped (n - blocked_latency);
  }

  sid::host_int_4 blocked_latency;
};

// Constructor
blocking_cpu::blocking_cpu ()
  : blocking_component (this, blocking_cpu_child_thread_root)
{
}

// Virtual override of step_pin_handler
//
void
blocking_cpu::step_pin_handler (sid::host_int_4 v)
{
  if (blockable)
    {
      // Signal the child thread to resume
      need_child_thread ();
      continue_child_thread_and_wait ();
      return;
    }

  cgen_bi_endian_cpu::step_pin_handler (v);
}

// Handles bus errors from reads and writes from/to insn and data memory.
// Specifically, bus::busy is handled in blockable mode.
//
bool
blocking_cpu::handle_bus_error (sid::bus::status s)
{
  if (s != sid::bus::busy)
    return false; // not handled

  // Reschedule for after the length of time the bus will be busy.
  // This will also block this child thread so that we continue
  // from here when scheduled again.
  record_latency (s);
  return true;
}

// This function is the root of the blockable child thread. It gets passed
// to pthread_create.
//
extern "C" void *
blocking_cpu_child_thread_root (void *comp)
{
  // Set up this thread to receive and handle signals from the parent thread.
  // This need only be done once.
  //
  blocking_cpu *cpu = static_cast<blocking_cpu *>(comp);
  cpu->child_init ();

  for (;;)
    {
      // Signal completion and wait for the signal to resume
      cpu->child_completed ();

      // Call the parent class' step_pin_handler
      cpu->parent_step_pin_handler (1);
    }

  // We should never reach here.
  return NULL;
}

New virtual base class: sidutil::bus_arbitrator
===============================================
This class is designed to be the base class for a customized bus arbitrator component. The component is designed to accept read/write requests from multiple upstream busses and to map them to multiple downstream accessors while prioritizing the requests using an implementation defined strategy. Upstream and downstream interfaces are identified using integral indices, the assignment of which is implementation defined.

Features include:

o read/write methods which identify the upstream interface, the address and the size of the request. Mapping of upstream requests to downstream interfaces is implementation defined.
o helper classes, input_interface and bus_request, help automate the delivery of upstream requests to the arbitration logic.
o virtual methods for customizing the behaviour of the arbitrator. In many cases, the default implementations are sufficient.
o passthrough capability which bypasses the arbitration logic when the system is initializing or is idle (e.g. stopped by GDB).
o scheduling and methods to manage the passing of time (cycles) provide the capability to compute accurate latencies for requests.

Adding upstream interfaces
--------------------------
The input_interface class inherits from sid::bus, so upstream interfaces are added in the usual way using add_accessor. Each input_interface is assigned a unique integer index when constructed, so that the arbitration logic knows which interface is making each request.

Virtual methods
---------------
These methods may be specialized in order to implement arbitrary arbitration strategies:

virtual bool prioritize_request (bus_request &r);
  This method examines the given request. It returns true if it should be serviced right away and returns false otherwise. The default method simply returns true (i.e. there is no arbitration).

virtual void lock_downstream (int upstream, int downstream);
  If the model requires locking an interface for the duration of several accesses then this method should lock the given downstream interface if the given upstream interface is locked and unlock it otherwise. The mechanism for locking an interface (e.g. pin, attribute, etc.) is implementation defined. The default method simply returns without providing any locking.

virtual sid::bus::status set_route_busy (bus_request &r, sid::bus::status s);
  This method should set state indicating that the route represented by the request r is busy for the number of cycles indicated by the latency contained within the status s. How this is done is implementation defined. The default method simply returns s without setting any busy state.

virtual bool check_route_busy (int upstream, int downstream);
  This method is called after prioritize_request has indicated that a request should be processed. It returns true if the route through the arbitrator from the upstream interface to the downstream interface is busy.
  This can happen, for example, if a previous request used one of the interfaces and the latency of that request has not yet elapsed. The default implementation simply returns false (i.e. not busy).

virtual sid::bus::status busy_status ();
  This method is called after prioritize_request has determined that a request cannot be handled right away or after check_route_busy has returned true. It should return sid::bus::busy with the latency set to the minimum number of "cycles" which the requesting component should wait before trying again. The default implementation sets the latency to 1.

virtual void step_cycle ();
  Handles the step-event pin which is normally driven by the target scheduler. The default implementation simply calls another virtual method, update_busy_routes.

virtual void update_busy_routes ();
  This method should update any state associated with interfaces being busy and should be called once per simulated "cycle". By default, it is called once by the step_cycle method each time the step-event pin is driven. The default implementation does nothing.

virtual void reschedule (sid::host_int_2 latency);
  This method reschedules the arbitrator using the step-control pin. It should be called by the implementation whenever internal state which must be updated as time passes has been set. This is implementation defined; however, this generally occurs when:
    o a request has been accepted and set_route_busy has saved state associated with the busy route
    o the step-event pin has been driven and update_busy_routes has updated the state of busy routes
  If the internal state is changed such that no updates are required with the passage of time, then reschedule need not be called. The default implementation ignores the given latency and reschedules for 1 tick later.

virtual int downstream_for_address (sid::host_int_4 address) = 0;
  Returns the index of the downstream accessor associated with the given address. Indices are assigned by the implementation.

virtual sid::bus *downstream_bus (int downstream) = 0;
  This method should return a pointer to the downstream accessor identified by the given index. The index will have been obtained from the downstream_for_address method.

virtual const char *up2str (int upstream) = 0;
  This method maps an upstream interface index to a name. It is used in logging messages.

Arbitration
-----------
Each read or write request on an upstream interface will trigger a call to arbitrate_read or arbitrate_write respectively. These methods will check whether the request should be passed through. If so, the request is passed immediately to the proper downstream accessor. Otherwise they will:

o create a bus_request representing the read/write request
o pass the bus_request to prioritize_request. If true is returned then the request is handled immediately using perform_read or perform_write. Otherwise sid::bus::busy is returned with the latency computed by busy_status.

Scheduling
----------
The bus_arbitrator component has a step-event pin and a step-control pin which are intended to be connected in the usual way with the target scheduler component. The implementation should cause the arbitrator to be scheduled in such a way that the passage of time in "cycles" can be managed, if necessary. For example, if a request is granted and has a latency of n "cycles", then the arbitrator should schedule itself such that the passage of those n cycles can be detected.

Passthrough
-----------
The arbitration logic will only be executed if the "passthrough" pin is inactive and the "running" and "active" pins are both active. Thus, if the "running", "active" and "passthrough" pins are connected as follows, requests to the arbitrator will automatically be passed through (bypassing the arbitration logic) during loading of the executable and when GDB has stopped the simulation:

o the "running" pin should be connected to an init-seq output which is driven after the one connected to the loader's "load!" pin.
o the "active" pin should be connected to the sim-sched's "active" pin.
o the "passthrough" pin may be connected to the output pin of any component which has a need to set the arbitrator into passthrough mode. For example, a cpu component should drive this pin with a non-zero value before executing a syscall via the gloss component and drive it again with a value of zero after the syscall finishes.
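
To make the interplay of the arbitrator's virtual methods more concrete, the following is a small standalone sketch of one possible busy-route policy. It is not derived from sidutil::bus_arbitrator and does not use any SID headers: toy_status, toy_arbitrator and the request method are hypothetical stand-ins, and the policy (a downstream interface stays busy for the latency of the last granted request) is just one assumption about how set_route_busy, check_route_busy and update_busy_routes might cooperate.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins; none of these types come from the SID headers.
enum toy_code { toy_ok, toy_busy };
struct toy_status { toy_code code; unsigned latency; };

class toy_arbitrator
{
public:
  // Follows the documented order: prioritize the request, check the route,
  // then either grant it or answer busy. 'cost' is the latency, in cycles,
  // that the (imaginary) downstream access would report.
  toy_status request (int upstream, int downstream, unsigned cost)
  {
    if (!prioritize_request (upstream)
        || check_route_busy (upstream, downstream))
      return busy_status ();
    toy_status s = { toy_ok, cost };
    return set_route_busy (upstream, downstream, s);
  }

  // Called once per simulated "cycle"; in the real component this would be
  // triggered by the step-event pin.
  void step_cycle () { update_busy_routes (); }

private:
  // Same as the documented default: every request is eligible.
  bool prioritize_request (int) { return true; }

  // Assumed policy: a route is busy while its downstream interface is busy.
  bool check_route_busy (int, int downstream)
  {
    std::map<int, unsigned>::const_iterator it = busy_.find (downstream);
    return it != busy_.end () && it->second > 0;
  }

  // Remember how many cycles the downstream interface remains occupied.
  toy_status set_route_busy (int, int downstream, toy_status s)
  {
    busy_[downstream] = s.latency;
    return s;
  }

  // Same as the documented default: ask the requester to retry in 1 cycle.
  toy_status busy_status ()
  {
    toy_status s = { toy_busy, 1 };
    return s;
  }

  // Count down every busy downstream interface by one cycle.
  void update_busy_routes ()
  {
    for (auto &entry : busy_)
      if (entry.second > 0)
        --entry.second;
  }

  std::map<int, unsigned> busy_;  // downstream index -> remaining busy cycles
};
```

With this policy, a second upstream interface requesting a downstream that was just granted for two cycles is told to retry until step_cycle has run twice. A blocking component upstream would see each busy answer as a reason to call child_blocked and try again on its next activation.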