Analysis and optimisation of heterogeneous real-time embedded systems

Pop, Paul; Eles, Petru; Peng, Zebo

Published in:
IEE Proceedings - Computers and digital Techniques

Link to article, DOI:
10.1049/ip-cdt:20045069

Publication date:
2005

Document Version
Publisher's PDF, also known as Version of record

Link back to DTU Orbit

Citation (APA):

DTU Library
Technical Information Center of Denmark

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
- You may not further distribute the material or use it for any profit-making activity or commercial gain
- You may freely distribute the URL identifying the publication in the public portal

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
1 Introduction

Embedded real-time systems have to be designed such that they implement correctly the required functionality. In addition, they have to fulfil a wide range of competing constraints: development cost, unit cost, reliability, security, size, performance, power consumption, flexibility, time-to-market, maintainability, correctness, safety, etc. Very important for the correct functioning of such systems are their timing constraints: ‘the correctness of the system behaviour depends not only on the logical results of the computations, but also on the physical instant at which these results are produced’ [1].

Real-time systems have been classified as hard real-time and soft real-time systems [1]. Basically, hard real-time systems are systems where failing to meet a timing constraint can potentially have catastrophic consequences. For example, a brake-by-wire system in a car failing to react within a given time interval can result in a fatal accident. On the other hand, a multimedia system, which is a soft real-time system, can, under certain circumstances, tolerate a certain amount of delays, resulting maybe in a patchier picture, without serious consequences besides some possible inconvenience to the user.

Many real-time applications, following physical, modularity or safety constraints, are implemented using distributed architectures. Such systems are composed of several different types of hardware components, interconnected in a network. For such systems, the communication between the functions implemented on different nodes has an important impact on the overall system properties such as performance, cost, maintainability, etc.

The analysis and optimisation approaches presented in this paper are aimed towards heterogeneous distributed hard real-time systems that implement safety-critical applications where timing constraints are of utmost importance to the correct behaviour of the application.

1.1 Automotive electronics

Although the discussion in this paper is valid for several application areas, it is useful, for understanding the distributed embedded real-time systems evolution and design challenges, to exemplify the developments in a particular area.

If we take the example of automotive manufacturers, they were reluctant, until recently, to use computer controlled functions onboard vehicles. Today, this attitude has changed for several reasons. First, there is a constant market demand for increased vehicle performance, more functionality, less fuel consumption and less exhausts, all of these at lower costs. Then, from the manufacturers’ side, there is a need for shorter time-to-market and reduced development and manufacturing costs. These, combined with the advancements of semiconductor technology, which is delivering ever increasing performance at lower and lower costs, has led to the rapid increase in the number of electronically controlled functions onboard a vehicle [2].

The amount of electronic content in an average car in 1977 had a cost of $110. In 2004, that cost was $1341, and it was expected that this figure would reach $1476 by the year 2005, continuing to increase because of the introduction of sophisticated electronics found until now only in high-end cars [3, 4]. It is estimated that in 2006 the electronics inside...
a car will amount to 25% of the total cost of the vehicle (35% for the high-end models), a quarter of which will be due to semiconductors [3, 5]. High-end vehicles currently have up to 100 microprocessors implementing and controlling various parts of their functionality. The total market for semiconductors in vehicles is predicted to grow from $8.9$ billions in $1998$ to $21$ billion in $2005$, amounting to 10% of the total worldwide semiconductors market [2, 3].

At the same time, with the increased complexity, the type of functions implemented by embedded automotive electronics systems has also evolved. Thanks to the semiconductor revolution, in the late 1950s, electronic devices became small enough to be installed on board vehicles. In the 1960s the first analogue fuel injection system appeared, and in the 1970s analogue devices for controlling transmission, carburettor and spark advance timing were developed. The oil crisis of the 1970s led to the demand for engine control devices that improved the efficiency of the engine, thus reducing fuel consumption. In this context, the first microprocessor based injection control system appeared in 1976 in the USA. During the 1980s, more sophisticated systems began to appear, like electronically controlled braking systems, dashboards, information and navigation systems, air conditioning systems, etc. In the 1990s, development and improvement have concentrated in areas like safety and convenience. Today, it is not uncommon to have highly critical functions like steering or braking implemented through electronic functionality only, without any mechanical backup, as is the case in drive-by-wire and brake-by-wire systems [6, 7].

The complexity of electronics in modern vehicles is growing at a very high pace, and the constraints – in terms of functionality, performance, reliability, cost and time-to-market – are getting tighter. Therefore, the task of designing such systems is becoming increasingly important and difficult at the same time. New design techniques are needed, which are able to:

- successfully manage the complexity of embedded systems
- meet the constraints imposed by the application domain
- shorten the time-to-market
- reduce development and manufacturing costs.

The success of such new design methods depends on the availability of analysis and optimisation techniques, beyond those corresponding to the state-of-the-art, which are presented in the following Section.

2 Heterogeneous real-time embedded systems

2.1 Heterogeneous hardware architecture

Currently, distributed real-time systems are implemented using architectures where each node is dedicated to the implementation of a single function or class of functions. The complete system can be, in general, composed of several networks, interconnected with each other (see Fig. 1). Each network has its own communication protocol, and inter-network communication is via a gateway which is a node connected to both networks. The architecture can contain several such networks, having different types of topologies.

A network is composed of several different types of hardware components, called nodes. Typically, every node, also called an electronic control unit (ECU), has a communication controller, CPU, RAM, ROM and an I/O interface to sensors and actuators. Nodes can also have ASICs in order to accelerate parts of their functionality.

The microcontrollers used in a node and the type of network protocol employed are influenced by the nature of the functionality and the imposed real-time, fault-tolerance and power constraints. In the automotive electronics area, the functionality is typically divided into two classes, depending on the level of criticalness:

- **Body electronics** refers to the functionality that controls simple devices such as the lights, the mirrors, the windows and the dashboard. The constraints of the body electronic functions are determined by the reaction time of the human operator, which is in the range of $100$ ms to $200$ ms. A typical body electronics system within a vehicle consists of a network of 10 to 20 nodes that are interconnected by a low bandwidth communication network like LIN. A node is usually implemented using a single-chip 8 bit microcontroller (e.g. Motorola 68HC05 or Motorola 68HC11) with some hundred bytes of RAM and kilobytes of ROM, I/O points to connect sensors and to control actuators, and a simple network interface. Moreover, the memory size is growing by more than 25% each year [6, 8].

- **System electronics** are concerned with the control of vehicle functions that are related to the movement of the vehicle. Examples of system electronics applications are engine control, braking, suspension and vehicle dynamics control. The timing constraints of system electronic functions are in the range of a couple of ms to 20 ms, requiring 16-bit or 32-bit microcontrollers (e.g. Motorola 68332) with about 16 kbytes of RAM and 256 kbytes of ROM. These microcontrollers have built-in communication controllers (e.g. the 68HC11 and 68HC12 automotive family of microcontrollers have an on-chip CAN controller), I/O to sensors and actuators, and are interconnected by high bandwidth networks [6, 8].

Section 5 presents more details concerning the hardware and software architecture considered by our analysis and optimisation techniques.

2.2 Heterogeneous communication protocols

As the communications become a critical component, new protocols are needed that can cope with the high bandwidth and predictability required.

There are several communication protocols for real-time networks. Among the protocols that have been proposed for vehicle multiplexing, only the controller area network (CAN) [9], the local interconnection network (LIN) [10] and SAE’s J1850 [11] are currently in use on a large scale. Moreover, only a few of them are suitable for safety-critical applications where predictability is mandatory [12]. A survey and comparison of communication protocols for safety-critical embedded systems is available in [12]. Communication activities can be triggered either dynamically, in response to an event, or statically, at predetermined moments in time.
Therefore, on one hand, there are protocols that schedule the messages statically based on the progression of time, for example, the SAFEbus [13] and SPIDER [14] protocols for the avionics industry, and the TTCAN [15] and time-triggered protocol (TTP) [1] intended for the automotive industry.

On the other hand, there are several communication protocols where message scheduling is performed dynamically, such as the controller area network (CAN) used in a large number of application areas including automotive electronics, LonWorks [16] and Profibus [17] for real-time systems in general, etc. Out of these, CAN is the most well known and widespread event-driven communication protocol in the area of distributed embedded real-time systems.

However, there is also a hybrid type of communication protocol, such as Byteflight [18] introduced by BMW for automotive applications and the FlexRay protocol [19], that allows the sharing of the bus by event-driven and time-driven messages.

The time-triggered protocols have the advantage of simplicity and predictability, while event-triggered protocols are flexible and have low cost. Moreover, protocols like TTP offer fault-tolerant services necessary in implementing safety-critical applications. However, it has been shown [20] that event-driven protocols like CAN are also predictable, and fault-tolerant services can also be offered on top of protocols like the TTCAN. A hybrid communication protocol like FlexRay offers some of the advantages of both worlds.

2.3 Heterogeneous scheduling policies

The automotive suppliers will select, based on their own requirements, the scheduling policy to be used in their ECU. The main approaches to scheduling are:

- **Static cyclic scheduling** algorithms are used to build, offline, a schedule table with activation times for each process, such that the timing constraints of processes are satisfied.
- **Fixed priority scheduling** (FPS). In this scheduling approach each process has a fixed (static) priority which is computed off-line. The decision about which ready process to activate is taken on-line according to their priority.
- **Earliest deadline first** (EDF). In this case, that process will be activated which has the nearest deadline. Typically, processes scheduled off-line using static cyclic scheduling are non-pre-emptible, while processes scheduled using techniques such as FPS and EDF are pre-emptible. Another important distinction is between the event-triggered and time-triggered approaches.
- **Time-triggered**

In the time-triggered approach activities are initiated at predetermined points in time. In a distributed time-triggered system it is assumed that the clocks of all nodes are synchronised to provide a global notion of time. Time-triggered systems are typically implemented using non-preemptive static cyclic scheduling, where the process activation or message communication is done based on a schedule table built off-line.

- **Event-triggered**

In the event-triggered approach activities happen when a significant change of state occurs. Event-triggered systems are typically implemented using pre-emptive fixed-priority based scheduling, or earliest deadline first, where, as a response to an event, the appropriate process is invoked to service it.

There has been a long debate in the real-time and embedded systems communities concerning the advantages of each approach [1, 21, 22]. Several aspects have been considered in favour of one or the other approach, such as flexibility, predictability, jitter control, processor utilisation and testability.

An interesting comparison of the ET and TT approaches, from a more industrial, in particular automotive perspective, can be found in [23]. The conclusion there is that one has to choose the right approach, depending on the particularities of the application.

For certain applications, several scheduling approaches can be used together. Efficient implementation of new, highly sophisticated automotive applications entails the use of time-triggered process sets together with event-triggered ones implemented on top of complex distributed architectures.

2.4 Distributed safety-critical applications

Considering the automotive industry, the way functionality has been distributed on an architecture has evolved over time. Initially, distributed real-time systems were implemented using architectures where each node is dedicated to the implementation of a single function or class of functions, allowing the system integrators to purchase nodes implementing required functions from different vendors, and to integrate them into their system [24]. There are several problems related to this restricted mapping of functionality:

- The number of such nodes in the architecture has exploded, reaching, for example, more than 100 in a high-end car, incurring heavy cost and performance penalties.
- The resulting solutions are sub-optimal in many aspects, and do not use the available resources efficiently to reduce costs. For example, it is not possible to move a function from one node to another node where there are enough available resources (e.g. memory and computation power).
- Emerging functionality, such as brake-by-wire in the automotive industry, is inherently distributed, and achieving an efficient fault-tolerant implementation is very difficult in the current setting.

This has created a huge pressure to reduce the number of nodes by integrating several functions in one node and, at the same time, to distribute certain functionality over several nodes (see Fig. 2). Although an application is typically distributed over one single network, we begin to see applications that are distributed across several networks. For example, in Fig. 2, the third application, represented as black dots, is distributed over two networks.

This trend is driven by the need to further reduce costs, improve resource usage, but also by application constraints.
like having to be physically close to particular sensors and actuators. Moreover, not only are these applications distributed across networks, but their functions can exchange critical information through the gateway nodes.

3 Schedulability analysis

There is a large quantity of research related to scheduling and schedulability analysis, with results having been incorporated in analysis tools such as TimeWiz [27], RapidRMA [28], RTA-OSEK Planner [29] and Aires [30]. The tools determine if the timing constraints of the functionality are met, support the designer in exploring several design scenarios and help to design optimised implementations.

Typically, the timing analysis considers independent processes running on single processors. However, very often functionality consists of distributed processes that have data and control dependencies, exclusion constraints, etc. New schedulability analysis techniques are needed which can handle distributed applications, data and control dependencies, and accurately take into account the details of the communication protocols that have an important influence on the timing properties. Moreover, highly complex and safety critical applications can in the future be distributed across several networks, and can use different, heterogeneous, scheduling policies.

Pre-emptive scheduling of independent processes with static priorities running on single-processor architectures has its roots in the work of Liu and Layland [31]. The approach has been later extended to accommodate more general computational models and has also been applied to distributed systems [32]. The reader is referred to [25, 26, 33] for surveys on this topic. Static cyclic scheduling of a set of data dependent software processes on a multiprocessor architecture has also been intensively researched [1, 34].

In [35] an earlier deadline first strategy is used for non-pre-emptive scheduling of processes with possible data dependencies. Pre-emptive and non-pre-emptive static scheduling are combined in the cosynthesis environment described in [36, 37]. In many of the previous scheduling approaches researchers have assumed that processes are scheduled independently. However, processes can be sporadic or aperiodic, are seldom independent, and normally they exhibit precedence and exclusion constraints. Knowledge regarding these dependencies can be used in order to improve the accuracy of schedulability analyses and the quality of the produced schedules [38].

It has been claimed [22] that static cyclic scheduling is the only approach that can provide efficient solutions to applications that exhibit data dependencies. However, advances in the area of fixed priority pre-emptive scheduling show that such applications can also be handled with other scheduling strategies [39].

One way of dealing with data dependencies between processes in the context of static priority based scheduling has been indirectly addressed by the extensions proposed for the schedulability analysis of distributed systems through the use of the release jitter [32]. Release jitter is the worst case delay between the arrival of a process and its release (when it is placed in the ready-queue for the processor) and can include the communication delay due to the transmission of a message on the communication channel.

In [32, 40], time offset relationships and phases, respectively, are used to model data dependencies. Offset and phase are similar concepts that express the existence of a fixed interval in time between the arrivals of sets of processes. The authors show that by introducing such concepts into the computational model, the pessimism of the analysis is significantly reduced when bounding the time behaviour of the system. The concept of dynamic offsets has been later introduced and used to model data dependencies [41].

Currently, more and more real-time systems are used in physically distributed environments and have to be implemented on distributed architectures to meet reliability, functional and performance constraints.

Researchers have often ignored or very much simplified the communication infrastructure. One typical approach is to consider communications as processes with a given execution time (depending on the amount of information exchanged) and to schedule them as any other process, without considering issues like communication protocol, bus arbitration, packaging of messages, clock synchronisation, etc. [40].

Tindell and Clark [32] integrate processor and communication scheduling and provide a ‘holistic’ schedulability analysis in the context of distributed real-time systems. The validity of the analysis has been later confirmed in [42].

In the case of a distributed system the response time of a process also depends on the communication delay due to messages. In [32] the analysis for messages is done in a similar way as for processes: a message is seen as a non-pre-emptible process that is ‘running’ on a bus. The response time analyses for processes and messages are combined by realising that the jitter (the delay between the arrival of a process – the time when it becomes ready for execution – and the start of its execution) of a destination process depends on the communication delay (the time it takes for a message to reach the destination process, from the moment it has been produced by the sender process) between sending and receiving a message. Several researchers have provided analyses that bound the communication delay for a given communication protocol:

- controller area network protocol [20]
- time-division multiple access protocol [32]
- asynchronous transfer mode protocol [43]
- token ring protocol [44]
- fibre distributed data interface protocol [45]
- time-triggered protocol [46]
- FlexRay protocol [47].

Based on their own requirements, the suppliers choose one particular scheduling policy to be used. However, for certain applications, several scheduling approaches can be used together.

One approach to the design of such systems is to allow ET and TT processes to share the same processor as well as static (TT) and dynamic (ET) communications to share the same bus. Bus sharing of TT and ET messages is supported by protocols which support both static and dynamic communication [19]. We have addressed the problem of timing analysis for such systems [47].

A fundamentally different architectural approach to heterogeneous TT/ET systems is that of heterogeneous multi-clusters, where each cluster can be either TT or ET. In a time-triggered cluster, processes and messages are scheduled according to a static cyclic policy, with the bus implementing a TDMA protocol such as, for example, the time-triggered protocol (introduced in Section 5.1). On event-triggered clusters the processes are scheduled according to a priority based pre-emptive approach, while messages are transmitted using the priority-based CAN bus.
In this context, we have proposed an approach to schedulability analysis for multi-cluster distributed embedded systems [48]. This analysis will be outlined in Section 7.

When several event-driven scheduling policies are used in a heterogeneous system, another approach to the verification of timing properties is to use the technique presented in [49] which couples the analysis of local scheduling strategies via an event interface model.

4 Design optimisation

4.1 Traditional design methodology

There are several methodologies for real-time embedded systems design. The aim of a design methodology is to co-ordinate the design tasks such that the time-to-market is minimised, the design constraints are satisfied, and various parameters are optimised.

The main design tasks that have to be performed are described in the following Sections.

4.1.1 Functional analysis and design: The functionality of the host system, into which the electronic system is embedded, is normally described using a formalism from that particular domain of application. For example, if the host system is a vehicle, then its functionality is described in terms of control algorithms using differential equations, which are modelling the behaviour of the vehicle and its environment. At the level of the embedded real-time system which controls the host system, the functionality is typically described as a set of functions, accepting certain inputs and producing some output values.

The typical automotive application is a control application. The controller reads inputs from sensors, and uses the actuators to control the physical environment (the vehicle). A controller can have several modes of operation, and can interact with other electronic functions, or with the driver through switches and instruments.

During the functional analysis and design stage, the desired functionality is specified, analysed and decomposed into sub-functions based on the experience of the designer. Several suppliers and manufacturers have started to use tools like Statemate [50], Matlab/Simulink [51], ASCET/SD [52] and SystemBuild/MatrixX [53] for describing the functionality, to eliminate the ambiguities and to avoid producing incomplete or incoherent specifications.

At the level of functional analysis the exploration is currently limited to evaluating several alternative control algorithms for solving the control problem. Once the functionality has been captured using tools like Matlab/Simulink, useful explorations can involve simulations of executable specifications to determine the correctness of the behaviour, and to assess certain properties of chosen solutions.

4.1.2 Architecture selection and mapping: The architecture selection task decides what components to include in the hardware architecture and how these components are connected.

According to current practice, architecture selection is an ad hoc process, based on the experience of the designer and previous product versions.

The mapping task has to decide what part of the functionality should be implemented on which of the selected components.

The manufacturers integrate components from suppliers, and thus the design space is severely restricted in current practice, by the fact that the mapping of functionality to an ECU is fixed.

4.1.3 Software design and implementation: This is the phase in which the software is designed and the code is written.

The code for the functions is developed manually for efficiency reasons, and thus the exploration that would be allowed by automatic code generation is limited.

At this stage the correctness of the software is analysed through simulations, but there is no analysis of timing constraints, which is left for the scheduling and schedulability analysis stage.

4.1.4 Scheduling and schedulability analysis: Once the functions have been defined and the code has been written, the scheduling task is responsible for determining the execution strategy for the functions inside an ECU, such that the timing constraints are satisfied.

Simulation is extensively used to determine whether the timing constraints are satisfied. However, simulations are very time-consuming and provide no guarantees that the timing constraints are met.

In the context of static cyclic scheduling, deriving a schedule table is a complex design exploration problem. Static cyclic scheduling of a set of data dependent software processes on a multiprocessor architecture has been researched in [1, 34]. Such research has been used in commercial tools like TTP-Plan [54], which derives the static schedules for processes and messages in a time-triggered system using the time-triggered protocol for communication.

If fixed priority pre-emptive scheduling is used, exploration is used to determine how to allocate priorities to a set of distributed processes [55]. Their priority assignment heuristic is based on the schedulability analysis from [32]. For the earliest deadline, first the issue of distributing the global deadlines to local deadlines has to be addressed [56].

4.1.5 Integration: In this phase the manufacturer has to integrate the ECUs from different suppliers.

There is a lack of tools that can analyse the performance of the interacting functionality, and thus the manufacturer has to rely on simulation runs using the realistic environment of a prototype car. Detecting potential problems at such a late stage requires time-consuming extensive simulations. Moreover, once a problem is identified it takes a very long time to go through all the previous stages in order to fix it. This leads to large delays on the time-to-market.

To reduce the large simulation times, and to guarantee that potential violations of timing constraints are detected, manufacturers have started to use in-house analysis tools and commercially available tools such as Volcano Network Architect (for the CAN and LIN buses) [57].

Volcano makes inter-ECU communication transparent for the programmer. The programmer only deals with signals that have to be sent and received, and the details of the network are hidden. Volcano provides basic API calls for manipulating signals. To achieve interoperability between ECUs from different suppliers, Volcano uses a publish/subscribe model for defining the signal requirements. Published signals are made available to the system integrator by the suppliers, while subscribed signals are required as inputs to the ECU. The system integrator makes
the publish/subscribe connections by creating a set of CAN frames, and creating a mapping between the data in frames and signals [58]. Volcano uses the analysis in [20] for bounding the communication delay of messages transmitted using the CAN bus.

4.1.6 Calibration, testing and verification: These are the final stages of the design process. Because not enough analysis, testing and verification has been done in earlier stages of the design, these stages tend to be very time-consuming, and problems identified here lead to large delays in product delivery.

4.2 Function architecture co-design and platform based design

New design methodologies are needed, which can handle the increasing complexity of heterogeneous systems, and their competing requirements in terms of performance, reliability, low power consumption, cost, time-to-market, etc. As the complexity of the systems continues to increase, the development time lengths dramatically, and the manufacturing costs become prohibitively high. To cope with this complexity, it is necessary to reuse as much as possible at all levels of the design process, and to work at higher and higher abstraction levels.

Function/architecture co-design is a methodology proposed in [59, 60], which addresses the design process at higher abstraction levels. Function/architecture co-design uses a top-down synthesis approach, where trade-offs are evaluated at a high level of abstraction. The main characteristic of this methodology is the use, at the same time with the top-down synthesis, of a bottom-up evaluation of design alternatives, without the need to perform a full synthesis of the design. The approach to obtaining accurate evaluations is to use an accurate modeling of the behavior and architecture, and to develop analysis techniques that are able to derive estimates and to formally verify properties relative to a certain design alternative. The determined estimates and properties, together with user-specified constraints, are then used to drive the synthesis process.

Thus, several architectures are evaluated to determine whether they are suited for the specified system functionality. There are two extremes in the degrees of freedom available for choosing an architecture. At one end, the architecture is already given, and no modifications are possible. At the other end of the spectrum, no constraints are imposed on the architecture selection, and the synthesis task has to determine, from scratch, the best architecture for the required functionality. These two situations are, however, not common in practice. Often, a hardware platform is available, which can be parameterised (e.g., size of memory, speed of the buses, etc.). In this case, the synthesis task is to derive the parameters of the platform architecture such that the functionality of the system is successfully implemented. Once an architecture is determined and/or parameterised, the function/architecture co-design continues with the mapping of functionality onto the instantiated architecture.

This methodology has been used in research tools like Polis [61] and Metropolis [62], and has also led to commercial tools such as the Virtual Component Co-design (VCC) [63].

To reduce costs, especially in the case of a mass market product, the system architecture is usually reused, with some modifications, for several product lines. Such a common architecture is denoted by the term platform, and consequently the design tasks related to such an approach are grouped under the term platform-based design [64]. The platform consists of a hardware infrastructure together with software components that will be used for several product versions, and will be shared with other product lines, in the hope of reducing costs and the time-to-market.

The authors in [64] have proposed techniques for deriving such a platform for a given family of applications. Their approach can be used within any design methodology for determining a system platform that later on can be parameterised and instantiated to a desired system architecture.

Considering a given application or family of applications, the system platform has to be instantiated, deciding on certain parameters, and lower level details, to suit that particular application(s). The search for an architecture instance starts from a certain platform and a given application. The application is mapped and compiled on an architecture instance, and the performance numbers are derived, typically using simulation. If the designer is not satisfied with the performance of the instantiated architecture, the process is repeated.

In the remainder of the paper we will consider a platform consisting of event- and time-triggered clusters, using the CAN and TTP protocols for communication, respectively. We will discuss analysis and optimisation techniques for the configuration of the platform such that the given application is schedulable.

5 Multi-cluster systems

One class of heterogeneous real-time embedded systems is that of multi-cluster systems. We consider architectures consisting of two clusters – one time-triggered and the other event-triggered – interconnected by gateways (see Fig. 2):

- In a time-triggered cluster (TTC) processes and messages are scheduled according to a static cyclic policy, with the bus implementing a TDMA protocol such as, for example, the time-triggered protocol (TTP) [65].
- On event-triggered clusters (ETC) the processes are scheduled according to a priority based pre-emptive approach, while messages are transmitted using the priority-based CAN bus [9].

Sections 5.1 and 5.2 present the hardware and software architecture of a two-cluster system, respectively while Section 5.3 presents the application model used. Section 6 will introduce design problems characteristic for multi-cluster systems composed of time-triggered clusters interconnected with event-triggered clusters: the partitioning of functionality between the TT and ET clusters, the mapping of functionality to the nodes inside a cluster, and the packing of application messages to frames on the TTP and CAN buses. Then, Section 8 will present two optimisation strategies for the frame packing problem.

5.1 Hardware architecture

A cluster is composed of nodes which share a broadcast communication channel. Let \( N_T \) (\( N_E \)) be the set of nodes on the TTC (ETC). Every node \( N_i \in N_T \cup N_E \) includes a communication controller and a CPU, along with other components. The gateways, connected to both types of clusters, have two communication controllers, for TTP and CAN. The communication controllers implement the protocol services, and run independently of the node’s CPU. Communication with the CPU is performed through a message base interface (MBI).
Communication between the nodes on a TTC is based on the TTP [65]. The TTP integrates all the services necessary for fault-tolerant real-time systems. The bus access scheme is time-division multiple-access (TDMA), meaning that each node \( N_i \) on the TTC, including the gateway node, can transmit only during a predetermined time interval, the TDMA slot \( S_i \). In such a slot, a node can send several messages packed in a frame. A sequence of slots corresponding to all the nodes in the architecture is called a TDMA round. A node can have only one slot in a TDMA round. Several TDMA rounds can be combined together in a cycle that is repeated periodically. The sequence and length of the slots are the same for all the TDMA rounds. However, the length and contents of the frames may differ.

The TDMA access scheme is imposed by a message descriptor list (MEDL) that is located in every TTP controller. The MEDL serves as a schedule table for the TTP controller which has to know when to send/receive a frame to/from the communication channel.

There are two types of frames in the TTP. The initialisation frames, or I-frames, which are needed for the initialisation of a node, and the normal frames, or N-frames, which are the data frames containing, in their data field, the application messages. A TTP data frame (Fig. 3) consists of the following fields: start of frame bit (SOF), control field, a data field up to 16 bytes containing one or more messages, and a cyclic redundancy check (CRC) field. Frames are delimited by the inter-frame delimiter (IDF, 3 bits).

For example, the data efficiency of a frame that carries 8 bytes of application data, i.e. the percentage of transmitted bits which are the actual data bits needed by the application, is 69.5% (64 data bits transmitted in a 92-bit frame, without considering the details of a particular physical layer). Note that no identifier bits are necessary, as the TTP controllers know from their MEDL what frame to expect at a given point in time. In general, the protocol efficiency is in the range of 60–80% [66].

On an ETC, the CAN [9] protocol is used for communication. The CAN bus is a priority bus that employs a collision avoidance mechanism, whereby the node that transmits the frame with the highest priority wins the contention. Frame priorities are unique and are encoded in the frame identifiers, which are the first bits to be transmitted on the bus.

In the case of CAN 2.0A (Fig. 4), there are four frame types: data frame, remote frame, error frame and overload frame. We are interested in the composition of the data frame, depicted in Fig. 3. A data frame contains seven fields: SOF, arbitration field that encodes the 11-bit frame identifier, a control field, a data field up to 8 bytes, a CRC field, an acknowledgment (ACK) field, and an end of frame (EOF).

In this case, for a frame that carries 8 bytes of application data, we will have an efficiency of 47.4% [67]. The typical CAN protocol efficiency is in the range of 25–35% [66].

### 5.2 Software architecture

A real-time kernel is responsible for activation of processes and transmission of messages on each node. On a TTC, the processes are activated based on the local schedule tables, and messages are transmitted according to the MEDL. On an ETC, we have a scheduler that decides on activation of ready processes and transmission of messages, based on their priorities.

In Fig. 5 we illustrate our message passing mechanism. Here we concentrate on the communication between processes located on different clusters. For message passing within a TTC the reader is directed to [68], while the infrastructure needed for communications on an ETC has been detailed in [20].

Let us consider the example in Fig. 5, where we have an application consisting of four processes and four messages (depicted in Fig. 5b) mapped on the two clusters in Fig. 5c. Processes \( P_1 \) and \( P_2 \) are mapped on node \( N_1 \) of the TTC, while \( P_2 \) and \( P_3 \) are mapped on node \( N_2 \) of the ETC. Process \( P_1 \) sends messages \( m_1 \) and \( m_2 \) to processes \( P_2 \) and \( P_3 \), respectively, while \( P_2 \) and \( P_3 \) send messages \( m_3 \) and \( m_4 \) to \( P_2 \). All messages have a size of 1 byte.

The transmission of messages from the TTC to the ETC takes place in the following way (see Fig. 5). \( P_1 \), which is statically scheduled, is activated according to the schedule table, and when it finishes it calls the send kernel function in order to send \( m_1 \) and \( m_2 \), indicated in the Figure by the number (1). Messages \( m_1 \) and \( m_2 \) have to be sent from node \( N_1 \) to node \( N_2 \). At a certain time, known from the schedule table, the kernel transfers \( m_1 \) and \( m_2 \) to the TTP controller by packing them into a frame in the MBI. Later on, the TTP controller knows from its MEDL when it has to take the frame from the MBI to broadcast it on the bus. In our example, the timing information in the schedule table of the kernel and the MEDL is determined in such a way that the broadcasting of the frame is done in the slot \( S_1 \) of round 2 (2). The TTP controller of the gateway node \( N_G \)
knows from its MEDL that it has to read a frame from slot $S_1$ of round 2 and to transfer it into its MBI (3). Invoked periodically, having the highest priority on node $N_G$, and with a period which guarantees that no messages are lost, the gateway process $T$ copies messages $m_1$ and $m_2$ from the MBI to the TTP-to-CAN priority-ordered message queue $Out_{CAN}$ (4). Let us assume that on the ETC messages $m_1$ and $m_2$ are sent independently, one per frame. The highest priority frame in the queue, in our case the frame $f_1$ containing $m_1$, will tentatively be broadcast on the CAN bus (5). Whenever $f_1$ is the highest priority frame on the CAN bus, it will successfully be broadcast and will be received by the interested nodes, in our case node $N_2$ (6). The CAN communication controller of node $N_2$ receiving $f_1$ will copy it in the transfer buffer between the controller and the CPU, and raise an interrupt which will activate a delivery process, responsible to activate the corresponding receiving process, in our case $P_3$, and hand over message $m_1$ that finally arrives at the destination (7).

Message $m_3$ (depicted in Fig. 5 as a grey rectangle labelled ‘$m_3^\prime$’), sent by process $P_3$ from the ETC, will be transmitted to process $P_4$ on the TTC. The transmission starts when $P_3$ calls its send function and enqueues $m_3$ in the priority-ordered $Out_{TTP}$ queue (8). When the frame $f_3$ containing $m_3$ has the highest priority on the bus, it will be removed from the queue (9) and broadcast on the CAN bus (10). Several messages can be packed into a frame in order to increase the efficiency of data transmission. For example, $m_3$ can wait in the queue until $m_4$ is produced by $P_3$, in order to be packed together with $m_4$ in a frame. When $f_3$ arrives at the gateway’s CAN controller it raises an interrupt. Based on this interrupt, the gateway transfer process $T$ is activated, and $m_3$ is unpacked from $f_3$ and placed in the $Out_{TTP}$ FIFO queue (11). The gateway node $N_G$ is only able to broadcast on the TTC in the slot $S_{TTP}$ of the TDMA rounds circulating on the TTP bus. According to the MEDL of the gateway, a set of messages not exceeding $size_{TTP}$ of the data field of the frame travelling in slot $S_{TTP}$ will be removed from the front of the $Out_{TTP}$ queue in every round, and packed in the $S_{TTP}$ slot (12). Once the frame is broadcast (13) it will arrive at node $N_1$ (14), where all the messages in the frame will be copied in the input buffers of the destination processes (15). Process $P_4$ is activated according to the schedule table, which has to be constructed such that it accounts for the worst-case communication delay of message $m_3$, bounded by the analysis in Section 7.1, and thus, when $P_4$ starts executing, it will find $m_3$ in its input buffer.

As part of our frame packing approach, we generate all the MEDLs on the TTC (i.e. the TT frames and the sequence of the TDMA slots), as well as the ET frames and their priorities on the ETC such that the global system is schedulable.

5.3 Application model

We model an application $\Gamma$ as a set of process graphs $G_i \in \Gamma$ (see Fig. 6). Nodes in the graph represent processes and arcs represent dependency between the connected processes. A process is a sequence of computations (corresponding to several building blocks in a programming language) which starts when all its inputs are available. When it finishes executing, the process produces its output values. Processes can be pre-emptible or non-pre-emptible. Non-pre-emptible processes are processes that cannot be interrupted during their execution, and are mapped on the TTC. Pre-emptible processes can be interrupted during their execution, and are mapped on the ETC. For example, a higher priority process has to be activated to service an event, in this case, the lower priority process will be temporary pre-empted until the higher priority process finishes its execution.

A process graph is polar, which means that there are two nodes, called source and sink, that conventionally represent the first and last process. If needed, these nodes are introduced as dummy processes so that all other nodes in the graph are successors of the source and predecessors of the sink, respectively.

The communication time between processes mapped on the same processor is considered to be part of the process worst-case execution time and is not modelled explicitly. Communication between processes mapped to different processors is performed by message passing over the buses and, if needed, through the gateway. Such message passing is modelled as a communication process inserted on the arc connecting the sender and the receiver process (the black dots in Fig. 6).

Potential communication between processes in different applications is not part of the model. Technically, such a communication is implemented by the kernels based on asynchronous non-blocking send and receive primitives. Such messages are considered non-critical and are not affected by real-time constraints. Therefore, communications of this nature will not be addressed in this paper.

Each process $P_i$ is mapped on a processor $\mathcal{M}(P_i)$ (mapping represented by hashing in Fig. 6), and has
a worst case execution time $C_i$ on that processor (depicted to the left of each node). The designer can provide manually such worst-case times, or tools can be used in order to determine the worst-case execution time of a piece of code on a given processor [69].

For each message we know its size (in bytes, indicated to its left) and its period, which is identical to that of the sender process. Processes and messages activated based on events also have a uniquely assigned priority -- $priority_{p}$ for processes and $priority_{m}$ for messages.

All processes and messages belonging to a process graph $G_i$ have the same period $T_i = T_{G_i}$, which is the period of the process graph. A deadline $D_{G_i}$ is imposed on each process graph $G_i$. Deadlines can also be placed locally on processes. Release times of some processes as well as multiple deadlines can be easily modelled by inserting dummy nodes between certain processes and the source or the sink node, respectively. These dummy nodes represent processes with a certain execution time but which are not allocated to any processing element.

6 Multi-cluster optimisation

Considering the type of applications and systems described in the preceding Section, and using the analysis outlined in Section 7, several design optimisation problems can be addressed.

In this Section, we present problems which are characteristic to applications distributed across multi-cluster systems consisting of heterogeneous TT and ET networks:

- Section 6.1 briefly outlines the problem of partitioning the processes of an application into time-triggered and event-triggered domains, and their mapping to the nodes of the clusters.
- Section 6.2 presents the problem of packing of messages to frames, which is of utmost importance in cost-sensitive embedded systems where resources, such as communication bandwidth, have to be fully utilised [58, 70, 71]. This problem will be discussed in more detail in Section 8.

The goal of these optimisation problems is to produce an implementation which meets all the timing constraints (i.e. the application is schedulable).

To drive our optimisation algorithms towards schedulable solutions, we characterise a given frame packing configuration using the degree of schedulability of the application. The degree of schedulability [72] is calculated as

$$\delta_T = \begin{cases} c_1 = \sum_{i=1}^{n} \max(0, r_i - D_i) & \text{if } c_1 > 0 \\ c_2 = \sum_{i=1}^{n} (r_i - D_i) & \text{if } c_1 = 0 \end{cases}$$

where $n$ is the number of processes in the application, $r_i$ is the worst-case response time of a process $P_i$ and $D_i$ its deadline. The worst-case response times are calculated by the MultiClusterScheduling algorithm using the response time analysis presented in Section 7.

If the application is not schedulable, the term $c_1$ will be positive, and, in this case, the cost function is equal to $c_1$. However, if the process set is schedulable, $c_1 = 0$ and we use $c_2$ as a cost function, as it is able to differentiate between two alternatives, both leading to a schedulable process set. For a given set of optimisation parameters leading to a schedulable process set, a smaller $c_2$ means that we have improved the worst-case response times of the processes, so the application can potentially be implemented on a cheaper hardware architecture (with slower processors and/or buses). Improving the degree of schedulability can also lead to an improvement in the quality of control for control applications.

6.1 Partitioning and mapping

By partitioning we denote the decision whether a certain process should be assigned to the TT or the ET domain (and, implicitly, to a TTC or an ETC, respectively) Mapping a process means assigning it to a particular node inside a cluster. Very often, the partitioning decision is taken based on the experience and preferences of the designer, considering aspects like the functionality implemented by the process, the hardness of the constraints, sensitivity to jitter, legacy constraints, etc. Let $P$ be the set of processes in the application $\Gamma$. We denote with $P_T \subseteq P$ the subset of processes which the designer has assigned to the TT cluster, while $P_E \subseteq P$ contains processes which are assigned to the ET cluster.

Many processes, however, do not exhibit certain particular features or requirements which obviously lead to their implementation as TT or ET activities. The subset $P^* = P \setminus (P_T \cup P_E)$ of processes could be assigned to any of the TT or ET domains. Decisions concerning the partitioning of this set of activities can lead to various trade-offs concerning, for example, the schedulability properties of the system, the amount of communication exchanged through the gateway, the size of the schedule tables, etc.

For part of the partitioned processes, the designer might have already decided their mapping. For example, certain processes, due to constraints like having to be close to sensors/actuators, have to be physically located in a particular hardware unit. They represent the sets $P_T^M \subseteq P_T$ and $P_E^M \subseteq P_E$ of already mapped TT and ET processes, respectively. Consequently, we denote with $P_T^M = P_T \setminus P_T^M$ the TT processes for which the mapping has not yet been decided, and similarly, with $P_E^M = P_E \setminus P_E^M$ the unmapped
ET processes. The set $\mathcal{P}^\prime = \mathcal{P}_T^f \cup \mathcal{P}_E^f \cup \mathcal{P}^\dagger$ then represents all the unmapped processes in the application.

The mapping of messages is decided implicitly by the mapping of processes. Thus, a message exchanged between two processes on the TTC (ETC) will be mapped on the TTP bus (CAN bus) if these processes are allocated to different nodes. If the communication takes place between two clusters, two message instances will be created, one mapped on the TTP bus and one on the CAN bus. The first message is sent from the sender node to the gateway, while the second message is sent from the gateway to the receiving node.

Using the notation introduced, the partitioning and mapping problem can be described more exactly as follows. As an input we have an application $G$ given as a set of process graphs and a two-cluster system consisting of a TT and an ET cluster. As introduced previously, $P_T$ and $P_E$ are the sets of processes already partitioned into TT and ET, respectively. Also, $P_M^T \subseteq P_T$ and $P_M^E \subseteq P_E$ are the sets of already mapped TT and ET processes. We are interested to find a partitioning for processes in $P^\dagger = P \setminus (P_T \cup P_E)$ and decide a mapping for processes in $P/C3 = P \setminus (P_T \cup P_E \cup P^\dagger)$, where $P_T = P_T \setminus P_M^T$, and $P_E = P_E \setminus P_M^E$ such that imposed deadlines are guaranteed to be satisfied.

### 6.2 Frame packing

In both the TTP and CAN protocols messages are not sent independently, but several messages having similar timing properties are usually packed into frames. In many application areas, like automotive electronics, messages range from one single bit (e.g. the state of a device) to a couple of bytes (e.g. vehicle speed, etc.). Transmitting such small messages one per frame would create a high communication overhead, which can cause long delays leading to an unschedulable system. For example, 65 bits have to be transmitted on CAN for delivering one single bit of application data. Moreover, a given frame configuration defines the exact behaviour of a node on the network, which is very important when integrating nodes from different suppliers.

Let us consider the motivational example in Fig. 7, where we have the process graph from Fig. 7d mapped on the two-cluster system from Fig. 7e: $P_1$ and $P_4$ are mapped on node $N_1$ from the TTC, while $P_2$ and $P_3$ are mapped on $N_2$ from ETC. The data field of the frames is represented with a black rectangle, while the other frame fields are depicted with a grey colour. We consider a physical implementation of the buses such that the frames will take the time indicated in the Figure by the length of their rectangles. We are interested to find a frame configuration such that the application is schedulable.

In the system configuration of Fig. 7a we consider that, on the TTP bus, the node $N_1$ transmits in the first slot ($S_1$) of the TDMA round, while the gateway transmits in the second slot ($S_G$). Process $P_3$ has a higher priority than process $P_2$, hence $P_2$ will be interrupted by $P_3$ when it receives message $m_2$. In such a setting, $P_4$ will miss its deadline, which is depicted as a thick vertical line in Fig. 7. Changing the frame configuration as in Fig. 7b, so that messages $m_1$ and $m_2$ are packed into frame $f_1$ and slot $S_G$ of the gateway comes first, processes $P_2$ and $P_3$ will receive $m_1$ and $m_2$ sooner and thus reduce the worst-case response time of the process graph, which is still larger than the deadline. In Fig. 7c, we also pack $m_3$ and $m_4$ into $f_2$. In such a situation, the sending of $m_3$ will have to be delayed until $m_4$ is queued by $P_2$. Nevertheless, the worst-case response time of the application is further reduced, which means that the deadline is met, and thus the system is schedulable.

**Fig. 7** Frame-packing optimisation example
However, packing more messages will not necessarily reduce the worst-case response times further, as it might increase too much the worst-case response times of messages that have to wait for the frame to be assembled, as is the case with message $m_3$ in Fig. 7c.

This design optimisation problem can be formulated more exactly as follows. As input to the frame-packing problem we have an application $\Gamma$ given as a set of process graphs mapped on an architecture consisting of a TTC and an ETC interconnected through a gateway. We consider that the partitioning and mapping of processes has been already decided.

We are interested to find a mapping of messages to frames (a frame packing configuration) denoted by a 4-tuple $\psi = <\pi, \beta, \sigma>$ such that the application $\Gamma$ is schedulable. Once a schedulable system is found, we are interested to further improve the ‘degree of schedulability’ so the application can potentially be implemented on a cheaper hardware architecture (with slower buses and processors).

Determining a frame configuration $\psi$ means deciding on:

- the mapping of application messages transmitted on the ETC to frames (the set of ETC frames $a$), and their relative priorities, $\pi$. Note that the ETC frames $a$ have to include messages transmitted from an ETC node to a TTC node, messages transmitted inside the ETC cluster, and those messages transmitted from the TTC to the ETC.
- the mapping of messages transmitted on the TTC to frames, denoted by the set of TTC frames $\beta$, and the sequence $\sigma$ of slots in a TDMA round. The slot sizes are determined based on the set $\beta$, and are calculated such that they can accommodate the largest frame sent in that particular slot. We consider that messages transmitted from the ETC to the TTC are not statically allocated to frames. Rather, we will dynamically pack messages originating from the ETC into the ‘gateway frame’, for which we have to decide the data field length (see Section 5.2).

Several details related to the schedulability analysis were omitted from the discussion of the example. These details will be discussed in Section 7.

7 Multi-cluster analysis and scheduling

Once a partitioning and a mapping is decided, and a frame packing configuration is fixed, the processes and messages have to be scheduled. For the TTC this means building the schedule tables, while for the ETC the priorities of the ET processes have to be determined and their schedulability has to be analysed.

The analysis presented in this Section works under the following assumptions:

- All the processes belonging to a process graph $G$ have the same period $T_G$. However, process graphs can have different periods.
- The offsets are static (as opposed to dynamic [42]), and are smaller than the period.
- The deadlines are arbitrary, i.e. they can be larger than the period.

The basic idea is that on the TTC an application is schedulable if it is possible to build a schedule table such that the timing requirements are satisfied.

On the ETC, the answer to whether or not a system is schedulable is given by a schedulability analysis. In this paper, for the ETC we use a response time analysis, where the schedulability test consists of the comparison between the worst-case response time $r_i$ of a process $P_i$ and its deadline $D_i$. Response time analysis of data dependent processes with static priority pre-emptive scheduling has been proposed in [39, 40, 42] and has also been extended to consider the CAN protocol [20]. The authors use the concept of offset to handle data dependencies. Thus, each process $P_i$ is characterised by an offset $O_i$, measured from the start of the process graph, that indicates the earliest possible start time of $P_i$. Such an offset is, for example, $O_2$ in Fig. 7a, as process $P_3$ cannot start before receiving $m_1$. The same is true for messages, their offset indicating the earliest possible transmission time. The response time analysis employed is presented in Section 7.1.

However, determining the schedulability of an application mapped on a multi-cluster system cannot be addressed separately for each type of cluster, since the inter-cluster communication creates a circular dependency: the static schedules determined for the TTC influence through the offsets the worst-case response times of the processes on the ETC, which in their turn influence the schedule table construction on the TTC. In Fig. 7b, packing $m_1$ and $m_2$ in the same frame leads to equal offsets for $P_3$ and $P_3$. Because of this, $P_3$ will delay $P_2$ (which would not be the case if $m_2$ sent to $P_1$ were scheduled in round 3, for example) and thus the placement of $P_3$ in the schedule table has to be accordingly delayed to guarantee the arrivals of $m_3$ and $m_4$.

In our analysis we consider the influence between the two clusters by making the following observations:

- The start time of process $P_i$ in a schedule table on the TTC is its offset $O_i$.
- The worst-case response time $r_i$ of a TT process is its worst case execution time, i.e. $r_i = C_i$ (TT processes are not pre-emptible).
- The worst-case response times of the messages exchanged between two clusters have to be calculated according to the schedulability analysis described in Section 7.1.
- The offsets have to be set by a scheduling algorithm such that the precedence relationships are preserved. This means that, if process $P_B$ depends on process $P_A$, the following condition must hold: $O_B \geq O_A + r_A$. Note that for the

MultiClusterScheduling\($\Gamma, M, \psi$\)

- determines the set of offsets $\phi$ and worst-case response times $\rho$

\begin{algorithm}
  \begin{algorithmic}[1]
    \State \textbf{for each} $O_i \in \phi \text{ do } O_i = 0 $ \textbf{end for}$ \quad \text{initially all offsets are zero}$
    \State $2 \quad \text{determine initial values for the worst-case response times}$
    \State $3 \quad \text{according to the analysis in Section 7.1}$
    \State $4 \quad \rho = \text{ResponseTimeAnalysis}(\Gamma, M, \psi, \phi)$
    \State $5 \quad \text{determine new values for the offsets, based on the response times}$
    \State $6 \quad \phi^{\text{new}} = \text{ListScheduling}(\Gamma, M, \psi, \rho)$
    \State $7 \quad \text{consider the system unschedulable at first}$
    \State $8 \quad \text{repeat} \quad \text{iteratively improve the degree of schedulability } \delta_t$
    \State $9 \quad \text{for each} \ O^{\text{new}}_i \in \phi^{\text{new}} \text{ do } \quad \text{for each newly calculated offset}$
    \State $10 \quad \quad \phi^{\text{old}} = \phi, O_i, \phi, O_i = \phi^{\text{new}} \quad \text{set the new offset, remember old}$
    \State $11 \quad \quad \rho^{\text{new}} = \text{ResponseTimeAnalysis}(\Gamma, M, \psi, \phi)$
    \State $12 \quad \quad \delta_t^{\text{new}} = \text{SchedulabilityDegree}(\Gamma, \rho)$
    \State $13 \quad \quad \text{if } \delta_t^{\text{new}} < \delta_t \text{ then } \quad \text{the schedulability has improved}$
    \State $14 \quad \quad \quad \text{offsets are recalculated using } \rho^{\text{new}}$
    \State $15 \quad \quad \phi^{\text{new}} = \text{ListScheduling}(\Gamma, M, \psi, \rho^{\text{new}})$
    \State $16 \quad \quad \text{break} \quad \text{exit the for-each loop}$
    \State $17 \quad \text{else} \quad \text{the schedulability has not improved}$
    \State $18 \quad \quad \phi, O_i = \phi^{\text{old}} \quad \text{restore the old offset}$
    \State $19 \quad \text{end for}$
    \State $20 \quad \text{until } \delta_t \text{ has not changed}$
    \State $21 \quad \text{return } \phi, \rho, \delta_t$
  \end{algorithmic}
\end{algorithm}

Fig. 8 MultiClusterScheduling algorithm
processes on a TTC which receive messages from the ETC this translates to setting the start times of the processes such that a process is not activated before the worst-case arrival time of the message from the ETC. In general, offsets on the TTC are set such that all the necessary messages are present at the process invocation.

The MultiClusterScheduling algorithm in Fig. 8 receives as input the application $\Gamma$, the frame configuration $\psi$, and produces the offsets $\phi$ and worst-case response times $\rho$.

The algorithm sets initially all the offsets to 0 (line 1). Then, the worst-case response times are calculated using the ResponseTimeAnalysis function (line 4) using the analysis presented in Section 7.1. The fixed-point iterations that calculate the response times at line 3 will converge if processor and bus loads are smaller than 100% [39]. Based on these worst-case response times, we determine new values $\phi^\text{new}$ for the offsets using a list scheduling algorithm (line 6). We now have a schedule table for the TTC and worst-case response times for the ETC, which are pessimistic. The following loop will reduce the pessimism of the worst-case response times.

The multi-cluster scheduling algorithm loops until the degree of schedulability $\delta_{T}$ of the application $\Gamma$ cannot be further reduced (lines 8–20). In each loop iteration, we select a new offset from the set of $\phi^\text{new}$ offsets (line 10), and run the response time analysis (line 11) to see if the degree of schedulability has improved (line 12). If $\delta_{T}$ has not improved, we continue with the next offset in $\phi^\text{new}$.

When a new offset $O_{f}$ leads to an improved $\delta_{T}$ we exit the for-each loop 9–19 that examines offsets from $\phi^\text{new}$. The loop iteration 8–20 continues with a new set of offsets, determined by ListScheduling at line 15, based on the worst-case response times $\rho^\text{new}$ corresponding to the previously accepted offset.

In the multi-cluster scheduling algorithm, the calculation of offsets is performed by the list scheduling algorithm presented in Fig. 9. In each iteration, the algorithm visits the processes and messages in the ReadyList. A process or a message in the application is placed in the ReadyList if all its predecessors have already been scheduled. The list is ordered based on the priorities presented in [73]. The algorithm terminates when all processes and messages have been visited.

In each loop iteration, the algorithm calculates the earliest time moment (offset) when the process or message $node_{i}$ can start (lines 5–7). There are four situations:

1. The visited node is an ET message. The message $m_{i}$ is packed into its frame $f$ (line 9), and the offset $O_{f}$ of the frame is updated. The frame can only be transmitted after all the sender processes that pack messages in this frame have finished executing. The offset of message $m_{i}$ packed to frame $f$ is equal to the frame offset $O_{f}$.
2. The node is a TT message. In this case, when the frame is ready for transmission, it is scheduled using the ScheduleTTFrame function (presented in Fig. 10), which returns the $round$ and the $slot$ where the frame has been placed (line 16 in Fig. 9). In Fig. 10, the round immediately following offset is the initial candidate to be considered (line 2). However, it can be too late to catch the allocated slot, in which case the next round is considered (line 4). For this candidate round, we have to check if the slot is not occupied by another frame. If so, the communication has to be delayed for another round (line 7). Once a frame has been scheduled, we can determine the offsets and worst-case response times (Fig. 9, line 18). For all the messages in the frame the offset is equal to the start of the slot in the TDMA round, and the worst-case response time is the slot length.
3. The algorithm visits a process $P_{1}$ mapped on an ETC node. A process on the ETC can start as soon as its predecessors have finished and its inputs have arrived, hence $O_{f} = offset$ (line 22). However, $P_{1}$ might experience, later on, interference from higher priority processes.
4. Process $P_{1}$ is mapped on a TTC node. In this case, besides waiting for the predecessors to finish executing, $P_{1}$ will also have to wait for its processor $M(P_{1})$ to become available (line 23). The earliest time when the processor is available is returned by the ProcessorAvailable function.

Let us now turn the attention back to the multi-cluster scheduling algorithm in Fig. 8. The algorithm stops when the $\delta_{T}$ of the application $\Gamma$ is no longer improved, or when a limit imposed on the number of iterations has been reached. Since in a loop iteration we do not accept a solution with a larger $\delta_{T}$, the algorithm will terminate when in a loop iteration we are no longer able to improve $\delta_{T}$ by modifying the offsets.

**Fig. 9 ListScheduling algorithm**

**Fig. 10 Frame scheduling on TTC**
7.1 Schedulability analysis for ETC

For the ETC we use a response time analysis. A response time analysis has two steps. In the first step, the analysis derives the worst-case response time of each process (the time it takes from the moment it is ready for execution until it has finished executing). The second step compares the worst-case response time of each process to its deadline and, if the response times are smaller than or equal to the deadlines, the system is schedulable. The analysis presented in this Section is used in the Response Time Analysis function (line 4 of the algorithm in Fig. 8).

Thus, the response time analysis in [74] uses the following equation for determining the worst-case response time $r_i$ of a process $P_i$:

$$r_i = C_i + \sum_{P_j \in hp(P_i)} \left( \frac{r_j}{T_j} \right) C_j$$  \hspace{1cm} (2)

where $C_i$ is the worst-case execution time of process $P_i$, $T_j$ is the period of process $P_j$, and $hp(P_i)$ denotes the set of processes that have a priority higher than the priority of $P_i$.

The summation term, representing the interference $I_i$ of higher priority processes on $P_i$, increases monotonically in $r_i$, and thus solutions can be found using a recurrence relation. Moreover, the recurrence relations that calculate the worst-case response time are guaranteed to converge if the processor utilisation is under 100%.

The previously presented analysis assumes that the deadline of a process is smaller than or equal to its period. This assumption has later been relaxed [32] to consider arbitrary deadlines (i.e. deadlines can be larger than the period). Thus, the worst-case response time $r_i$ of a process $P_i$ becomes

$$r_i = \max_{q=0,1,2...} \left( J_i + w_i(q) - qT_i \right)$$  \hspace{1cm} (3)

where $J_i$ is the jitter of process $P_i$ (the worst-case delay between the arrival of a process and the start of its execution), $q$ is the number of busy periods being examined, and $w_i(q)$ is the width of the level-$i$ busy period starting at time $qT_i$. The level-$i$ busy period is defined as the maximum time a processor executes processes of priority greater than or equal to the priority of process $P_i$, and is calculated as [32]

$$w_i(q) = (q+1)C_i + B_i + \sum_{P_j \in hp(P_i)} \left( \frac{w_j(q) + J_j}{T_j} \right) C_j$$  \hspace{1cm} (4)

The pessimism of the previous analysis can be reduced by using the information related to the precedence relations between processes. The basic idea is to exclude certain worst-case scenarios, from the critical instant analysis, which are impossible due to precedence constraints.

Methods for schedulability analysis of data dependent processes with static priority pre-emptive scheduling have been proposed in [39–42]. They use the concept of offset (or phase) to handle data dependencies. In [39], Tindell shows that the pessimism of the analysis is reduced through the introduction of offsets. The offsets have to be determined by the designer.

In their analysis [39], the response time of a process $P_i$ is

$$r_i = \max_{q=0,1,2...} \max_{P_j \in G} \left( w_i(q) + O_j + J_j - T_G \times \left( q + \left( O_j + J_j - O_i - J_i \right) \frac{T_G}{T_j} \right) - O_i \right)$$  \hspace{1cm} (5)

where $T_G$ is the period of the process graph $G$, $O_i$ and $O_j$ are offsets of processes $P_i$ and $P_j$, respectively, and $J_i$ and $J_j$ are the release jitters of $P_i$ and $P_j$. In (5), the level-$i$ busy period starting at time $qT_G$ is

$$w_i(q) = (q+1)C_i + B_i + I_i$$  \hspace{1cm} (6)

In the previous equation, the blocking term $B_i$ represents interference from lower priority processes that are in their critical section and cannot be interrupted, and $C_i$ represents the worst-case execution time of process $P_i$. The last term captures the interference $I_i$ from higher priority processes in the application, including higher priority processes from other process graphs. The reader is directed to [39] for the details of the interference calculation.

Although this analysis is exact (both necessary and sufficient), it is computationally infeasible to evaluate. Hence, [39] proposes a feasible but not exact analysis (sufficient but not necessary) for solving (5). Our implementations use the feasible analysis provided in [39] for deriving the worst-case response time of a process $P_i$.

We are now interested to determine the worst-case response time of frames and the worst-case queuing delays experienced by a frame in a communication controller.

Regarding the worst-case response time of messages, we have extended the analysis from [20] and applied it for frames on the CAN bus:

$$r_f = \max_{q=0,1,2...} \left( J_f + W_f(q) + (1 + q)C_f \right)$$  \hspace{1cm} (7)

where $J_f$ is the jitter of frame $f$ which in the worst case is equal to the largest worst-case response time $r_{s(m)}$ of a sender process $S(m)$ which sends message $m$ packed into frame $f$.

$$J_f = \max_{m \in G} (r_{s(m)})$$  \hspace{1cm} (8)

In (7), $W_f$ is the worst-case queuing delay experienced by $f$ at the communication controller, and is calculated as

$$W_f(q) = w_f(q) - qT_f$$  \hspace{1cm} (9)

where $q$ is the number of busy periods being examined and $w_f(q)$ is the width of the level-$i$ busy period starting at time $qT_f$.

Moreover, in (7), $C_f$ is the worst-case time it takes for a frame $f$ to reach the destination controller. On CAN, $C_f$ depends on the frame configuration and the size of the data field, $s_f$, while on TTP it is equal to the slot size in which $f$ is transmitted.

The worst-case response time of message $m$ packed into a frame $f$ can be determined by observing that $r_m = r_f$.

The worst-case queuing delay for a frame ($W_f$ in equation (7)) is calculated differently for each type of queue:

1. The output queue of an ETC node, in which case $W_f^{OUT}$ represents the worst-case time a frame $f$ has to spend in the $OUT_{N_f}$ queue on ETC node $N_f$. An example of such a frame is the one containing message $m_i$ in Fig. 7a, which is sent by process $P_2$ from the ETC node $N_2$ to the gateway node $N_G$, and has to wait in the $OUT_{N_f}$ queue.

2. The TTP-to-CAN queue of the gateway node, in which case $W_f^{CAN}$ is the worst-case time a frame $f$ has to spend in the $OUT_{N_f}$ queue of node $N_e$. In Fig. 7a, the frame containing $m_1$ is sent from the TTC node $N_1$ to the ETC node $N_2$, and has to wait in the $OUT_{CAN}$ queue of gateway node $N_G$ before it is transmitted on the CAN bus.

3. The CAN-to-TTP queue of the gateway node, where $W_f^{TTP}$ captures the time $f$ has to spend in the $OUT_{TTP}$ queue node $N_G$. Such a situation is present in Fig. 7a, where the
frame with \( m_3 \) is sent from the ETC node \( N_3 \) to the TTC node \( N_1 \) through the gateway node \( N_G \), where it has to wait in the Out\_TTP queue before it is transmitted on the TTP bus, in the \( S_G \) slot of node \( N_G \).

On the TTC, the synchronisation between processes and the TDMA bus configuration is solved through the proper synthesis of schedule tables, and hence no output queues are needed. The frames sent from a TTC node to another TTC node are taken into account when determining the offsets, and are not involved directly in the ETC analysis.

The following Sections show how the worst queueing delays are calculated for each of the previous three cases.

### 7.1.1 Worst-case queueing delays in the Out\_Ni and Out\_CAN queues

The analyses for \( W_f^{Ni} \) and \( W_f^{CAN} \) are similar. Once \( f \) is the highest priority frame in the Out\_CAN queue, it will be sent by the gateway’s CAN controller as a frame; therefore, the same equation for \( W_f \) can be used:

\[
w_f(q) = B_f + \sum_{v_j \in \text{hp}(f)} \left( \frac{w_f(q) + J_f}{T_j} \right) C_j \tag{10}\]

The intuition is that \( f \) has to wait, in the worst case, first for the largest lower priority frame that is just being transmitted (\( B_f \)) as well as for the higher priority \( f_j \in \text{hp}(f) \) frames that have to be transmitted ahead of \( f \) (the second term). In the worst case, the time it takes for the largest lower priority frame \( f_k \in \text{lp}(f) \) to be transmitted to its destination is:

\[
B_f = \max_{f_k \in \text{lp}(f)} (C_k) \tag{11}\]

Note that, in our case, \( \text{lp}(f) \) and \( \text{hp}(f) \) also include messages produced by the gateway node, transferred from the TTC to the ETC.

### 7.1.2 Worst-case queueing delay in the Out\_TTP queue

The time a frame \( f \) has to spend in the Out\_TTP queue in the worst case depends on the total size of messages queued ahead of \( f \) (Out\_TTP is a FIFO queue), \( s_{SG} \) of the data field of the frame fitting into the gateway slot responsible for carrying the CAN messages on the TTP bus, and the period \( T_{TDMA} \) with which this slot \( S_G \) is circulating on the bus [46];

\[
w_f^{TTP}(q) = B_f + \left( \frac{(q + 1)s_f + I_f(w_f(q))}{S_G} \right) T_{TDMA} \tag{12}\]

where \( I_f \) is the total size of the frames queued ahead of \( f \). Those frames \( f_j \in \text{hp}(f) \) are ahead of \( f \), which have been sent from the ETC to the TTC, and have higher priority than \( f \):

\[
I_f(w) = \sum_{v_j \in \text{hp}(f)} \left( \frac{w_f + J_f}{T_j} \right) s_j \tag{13}\]

where the frame jitter \( J_f \) is given by (8).

The blocking term \( B_f \) is the time interval in which \( f \) cannot be transmitted because the slot \( S_G \) of the TDMA round has not arrived yet. In the worst case (i.e. the frame \( f \) has just missed the slot \( S_G \)), the frame has to wait an entire round \( T_{TDMA} \) for the slot \( S_G \) in the next TDMA round.

### 8 Frame-packing optimisation strategy

The general multi-cluster optimisation strategy is outlined in Fig. 11. The MultiClusterConfiguration strategy has two steps:

1. In the first step, line 3, the application is partitioned on the TTC and ETC clusters, and processes are mapped to the nodes of the architecture using the Partitioning-AndMapping function. The partitioning and mapping can be done with an optimisation heuristic like the one presented in [75]. As part of the partitioning and mapping process, an initial frame configuration \( \psi^0 = \langle \pi^0, \alpha^0, \beta^0, \sigma^0 \rangle \) is derived. Messages exchanged by processes partitioned to the TTC will be mapped to TTC frames, while messages exchanged on the ETC will be mapped to ETC frames. For each message sent from a TTC process to an ETC process, we create an additional message on the ETC, and we map this message to an ETC frame. The sequence \( \sigma^0 \) of slots for the TTC is decided by assigning in order nodes to the slots \( (S_f = N) \). One message is assigned per frame in the initial set \( \beta^0 \) of TTC frames. For the ETC, the frames in the set \( \pi^0 \) initially hold each one single message, and we calculate the message priorities \( \pi^0 \) based on the deadlines of the receiver processes.

2. The frame packing optimisation is performed as the second step (line 5 in Fig. 11). The FramePacking-Optimisation function receives as input the application \( \Gamma \), the mapping \( M \) of processes to resources and the initial frame configuration \( \psi^0 \), and it produces as output the optimised frame packing configuration \( \psi \). Such an optimisation problem is NP complete [76], so obtaining the optimal solution is not feasible. In this paper, we propose two frame packing optimisation strategies, one based on a simulated annealing approach, presented in Section 8.1, while the other, outlined in Section 8.2, is based on a greedy heuristic that uses the problem-specific knowledge intelligently in order to explore the design space.

If after these steps the application is unschedulable, we conclude that no satisfactory implementation could be found with the available amount of resources.

Testing if the application \( \Gamma \) is schedulable is done using the MultiClusterScheduling (MCS) algorithm (line 7 in Fig. 11). The multicenter scheduling algorithm, presented in Fig. 8, takes as input an application \( \Gamma \), a mapping \( M \) and an initial frame configuration \( \psi^0 \), builds the TT schedule tables, sets the ET priorities for processes and provides the global analysis.

#### 8.1 Frame packing with simulated annealing

The first algorithm we have developed is based on a simulated annealing (SA) strategy [76], and is presented in Fig. 12. The algorithm takes as input the application \( \Gamma \), a mapping \( M \) and an initial frame configuration \( \psi^0 \), and determines the frame configuration \( \psi \) which leads to the best degree of schedulability \( \delta_f \) (the smaller the value, the more schedulable the system; see Section 6).

```plaintext
MultiClusterConfiguration(\Gamma) 1 – determine an initial partitioning and mapping \( M \),
2 – and an initial frame configuration \( \psi^0 \),
3 \( \langle M, \psi^0 \rangle = \text{PartitioningAndMapping}(\Gamma) \)
4 – the frame packing optimisation algorithm
5 \( \psi = \text{FramePackingOptimisation}(\Gamma, M, \psi^0) \)
6 – test if the resulted configuration leads to a schedulable application
7 if MultiClusterScheduling (\Gamma, M, \psi) returns schedulable then
8 \( \text{return } M, \psi \)
9 else
10 \( \text{return unschedulable} \)
11 end
end MultiClusterConfiguration
```

Fig. 11 General frame packing strategy
Determining a frame configuration $\psi$ means finding the set of ETC frames $\alpha$ and their relative priorities $\pi$, and the set of TTC frames $\beta$, including the sequence $\sigma$ of slots in a TDMA round.

The main feature of an SA strategy is that it tries to escape from a local optimum by randomly selecting a new solution from the neighbours of the current solution. The new solution is accepted if it is an improved solution (lines 9–10 of the algorithm in Fig. 12). However, a worse solution can also be accepted with a certain probability that depends on the deterioration of the cost function and on a control parameter called temperature (lines 12–13).

In Fig. 12 we give a short description of this algorithm. An essential component of the algorithm is the generation of a new solution $\psi_{\text{new}}$, starting from the current one $\psi_{\text{current}}$. The neighbours of the current solution $\psi_{\text{current}}$ are obtained by performing transformations (called moves) on the current frame configuration $\psi_{\text{current}}$ (line 8). We consider the following moves:

- moving a message $m$ from a frame $f_1$ to another frame $f_2$ (or moving $m$ into a separate single-message frame)
- swapping the priorities of two frames in $\alpha$
- swapping two slots in the sequence $\sigma$ of slots in a TDMA round.

For the implementation of this algorithm, the parameters $T_I$ (initial temperature), $T_L$ (temperature length), $\varepsilon$ (cooling ratio) and the stopping criterion have to be determined. They define the ‘cooling schedule’ and have a decisive impact on the quality of the solutions and the CPU time consumed. We are interested to obtain values for $T_I$, $T_L$ and $\varepsilon$ that will guarantee the finding of good quality solutions in a short time.

We performed long runs of up to 48 h with the SA algorithm, for ten synthetic process graphs (two for each graph dimension of 80, 160, 240 320, 400, see Section 9), and the best ever solution produced has been considered as the optimum. Based on further experiments we have determined the parameters of the SA algorithm so that the optimisation time is reduced as much as possible but the near-optimal result is still produced. For example, for the graphs with 320 nodes, $T_I$ is 700, $T_L$ is 500 and $\varepsilon$ is 0.98.

The algorithm stops if for three consecutive temperatures no new solution has been accepted.

**SimulatedAnnealing** ($\Gamma, \mathcal{M}, \psi_{\text{opt}}$)

1. given an application $\Gamma$ finds out if it is schedulable and produces
2. $\xi = \langle x, \pi, \beta, \sigma \rangle > \xi$ leading to the smallest $\delta_T$
3. initial frame configuration
4. $\psi_{\text{current}} = \psi_{\text{opt}}$
5. $\text{temperature} = \text{initial temperature} T_I$
6. repeat
7. for $i = 1$ to temperature length $T_L$ do
8. generate randomly a neighboring solution $\psi_{\text{new}}$ of $\psi_{\text{current}}$
9. MultiClusterScheduling($\Gamma, \mathcal{M}, \psi_{\text{frame}}$)
10. $\delta = \text{MultiClusterScheduling}($ $\Gamma, \mathcal{M}, \psi_{\text{current}}$)$
11. if $\delta > 0$ then $\psi_{\text{current}} = \psi_{\text{new}}$
12. else
13. generate $q = \text{Random}(0, 1)$
14. if $q < e^{-\delta_{\text{temperature}}}$ then $\psi_{\text{current}} = \psi_{\text{new}}$
15. end if
16. until stopping criterion is met
17. return SchedulabilityTest($\Gamma, \mathcal{M}, \psi_{\text{best}}$). solution $\psi_{\text{best}}$
18. end SimulatedAnnealing

**8.2 Frame packing greedy heuristic**

The OptimizeFramePacking greedy heuristic (Fig. 13) constructs the solution by progressively selecting the best candidate in terms of the degree of schedulability.

We start by observing that all activities taking place in a multi-cluster system are ordered in time using the offset information, determined in the **StaticScheduling** function based on the worst-case response times known so far and the application structure (i.e., the dependencies in the process graph). Thus, our greedy heuristic outlined in Fig. 13 starts with building two lists of messages ordered according to the ascending value of their offsets, one for the TTC, $\text{messages}_T$, and one for ETC, $\text{messages}_E$. Our heuristic is to consider for packing in the same frame messages which are adjacent in the ordered lists. For example, let us consider that we have three messages, $m_1$ of 1 byte, $m_2$ of 2 bytes and $m_3$ of 3 bytes, and that messages are ordered as $m_2$, $m_1$, $m_3$, based on the offset information. Also, assume that our heuristic has suggested two frames, frame $f_1$ with a data field of 4 bytes, and $f_2$ with a data field of 2 bytes. The **PackMessages** function will start with $m_1$ and pack it in frame $f_1$. It continues with $m_2$, which is also packed into $f_1$, since there is space left for it. Finally, $m_3$ is packed in $f_2$, since there is no space left for it in $f_1$.

The algorithm tries to determine, using the for-each loops in Fig. 13, the best frame configuration. The algorithm starts from the initial frame configuration $\psi_{\text{opt}}$, and progressively determines the best change to the current configuration. The quality of a frame configuration is measured using the **MultiClusterScheduling** algorithm, which calculates the degree of schedulability $\delta_T$ (line 13). Once a configuration parameter has been fixed in the outer loops it is used by the inner loops:

- Lines 10–15: The innermost loops determine the best size $S_a$ for the currently investigated frame $f_a$ in the ETC frame configuration $\text{size}_{\text{current}}$. Thus, several frame sizes are tried (line 11), each with a size returned by **RecomendedSizes**, to see if it improves the current configuration. The **RecomendedSizes** (messages) list is built recognising that only messages adjacent in the messages list will be packed into the same frame. Sizes of frames are determined as a sum resulting from adding the sizes of combinations of adjacent messages, not exceeding 8 bytes. For the previous example, with $m_1$, $m_2$ and $m_3$ of 1, 2 and 3 bytes, respectively, the frame sizes recommended will be of 1, 2, 3, 4 and 6 bytes. A size of 5 bytes will not be recommended since there are no adjacent messages that can be summed together to obtain 5 bytes of data.

- Lines 9–16: This loop determines the best frame configuration $\mathcal{X}$ This means deciding on how many frames to include in $\mathcal{X}$ (line 9), and which are the best sizes for them. In $\mathcal{X}$ there can be any number of frames, from one single frame to $n_s$ frames (in which case each frame carries one single message). Once a configuration $\mathcal{X}_{\text{best}}$ for the ETC, minimising $\delta_T$ has been determined (saved in line 16), the algorithm looks for the frame configuration $\beta$ which will further improve $\delta_T$.

- Lines 7–17: The best size for a frame $f_\beta$ is determined similarly to the size for a frame $f_a$.

- Lines 6–18: The best frame configuration $\beta_{\text{best}}$ is determined. For each frame configuration $\beta$ tried, the algorithm loops again through the innermost loops to see if there are better frame configurations $\mathcal{X}$ in the context of the current frame configuration $\beta_{\text{current}}$.

- Lines 4–19: After a $\beta_{\text{best}}$ has been decided, the algorithm looks for a slot sequence $\sigma$, starting with the first slot, and tries to find the node which, when transmitting in this slot,
OptimizeFramePacking \((\Gamma, M, \psi^0)\) – produces the frame configuration \(\psi\) leading to the smallest degree of schedulability \(\delta_F\).

1. \(\pi^0\) = HOPA – the initial priorities \(\pi^0\) are updated using the HOPA heuristic.
2. – build the message lists ordered ascending on their offsets
3. messages = ordered list of \(n_t\) messages on the TTC; messages = ordered list of \(n_t\) messages on the ETC
4. for each slot \(\sigma_{current}\) do for each slot \(\sigma_{current} \wedge \text{slot} \neq \) do – determine the best TTP slot sequence \(\sigma\)
5. Swap(slot, slot2) – tentatively swap slots slot1 with slot2
6. for each \(\sigma_{current}\) with 1 to \(n_f\) frames do – determine the best frame packing configuration \(\beta\) for the TTC
7. for each frame \(f_{current}\) for each frame size \(S_b \in \text{RecommendedSizes(messages)}\) do – determine the best frame size for \(f_{current}\)

\(\beta_{current}, f_{current}, S = S_b\),
8. for each \(\sigma_{current}\) with 1 to \(n_f\) frames do – determine the best frame packing configuration \(\sigma\) for the ETC
9. for each frame \(f_{current}\) do for each frame size \(S_b \in \text{RecommendedSizes(messages)}\) do – determine the best frame size for \(f_{current}\)
10. \(\sigma_{current}, f_{current}, S = S_b\),
11. \(\psi_{current} = \langle \sigma_{current}, \psi_{current}, \sigma_{current} > \) PackMessages \((\psi_{current}, \text{messages} \cup \text{messages})\),
12. \(\delta_{current} = \text{MultiClusterScheduling} \((\Gamma, M, \psi_{current})\),
13. if \(\psi_{current}(\psi_{current}, \text{best so far})\) then \(\psi_{best} = \psi_{current}\) end if – remember the best configuration so far
14. end for; end for;
15. if \(\exists \psi_{best}\) then \(\sigma_{current} = \sigma_{best}\) end if – remember the best frame packing configuration \(\sigma\)
16. end for;
17. if \(\exists \psi_{best}\) then \(\sigma_{current}, f_{current}, S = \beta_{best}, f_{current}, S\) end if – remember the best frame size for \(\psi_{current}\)
18. end for;
19. if \(\exists \psi_{best}\) then \(\psi_{current}, \text{slot} = \sigma_{current}, \text{slot}\) end if – remember the best slot sequence \(\sigma\);
20. return SchedulabilityTest \((\Gamma, M, \psi_{best})\); \(\psi_{best}\)
end OptimizeFramePacking

\[\text{deviation} = \frac{\delta_{approach} - \delta_{SA}}{\delta_{SA}} \times 100\]  

The second column presents the maximum percentage deviation from the SA result, and the third column presents the average execution time of the algorithm, in seconds. For the SA algorithm we present only its average execution times.

Table 1 shows that, when packing messages to frames, the degree of schedulability improves dramatically compared to...
the straightforward approach. The greedy heuristic OptimizeFramePacking performs well for all the graph dimensions, having, for example, run-times which are on average under 50 for applications with 240 processes.

When deciding on which heuristic to use for design space exploration or system synthesis, an important issue is the execution time. On average, ouroptimisation heuristics needed a couple of minutes to produce results, while the simulated annealing approach had an execution time of up to 6h.

9.1 Vehicle cruise controller

A typical safety critical application with hard real-time constraints is a vehicle cruise controller (CC). We have considered a CC system derived from a requirement specification provided by the industry. The CC delivers the following functionality: it maintains a constant speed for speeds over 35 km/h and under 200 km/h, offers an interface (buttons) to increase or decrease the reference speed, and is able to resume its operation at the previous reference speed. The CC operation is suspended when the driver presses the brake pedal.

The specification assumes that the CC will operate in an environment consisting of two clusters. There are four nodes which functionally interact with the CC system: the anti-lock braking system (ABS), the transmission control module (TCM), the engine control module (ECM), and the electronic throttle module (ETM) (see Fig. 14).

It has been decided to map the functionality (processes) of the CC over these four nodes. The ECM and ETM nodes have an 8-bit Motorola M68HC11 family CPU with 128 kbytes of memory, while the ABS and TCM are equipped with a 16-bit Motorola M68HC12 CPU and 256 kbytes of memory. The 16-bit CPUs are twice as fast than the 8-bit ones. The transmission speed of the communication channel is 256 kbit/s and the frequency of the TTP controller was chosen to be 20 MHz.

We have modelled the specification of the CC system using a set of 32 processes and 17 messages as described in [77], where the mapping of processes to the nodes is also given. The period was chosen as 250 ms, equal to the deadline.

In this setting, the straightforward approach SF produced an end-to-end worst-case response time of 320 ms, greater than the deadline, while both the OFP and SA heuristics produced a schedulable system with a worst-case response time of 172 ms.

This shows that the optimisation heuristic proposed, driven by our schedulability analysis, is able to identify that frame packing configuration which increases the schedulability degree of an application, allowing the developers to reduce the implementation cost of a system.

Fig. 14  Hardware architecture for cruise controller

10 Conclusions

Heterogeneous distributed real-time systems are used in several application areas to implement increasingly complex applications that have tight timing constraints. The heterogeneity is manifested not only at the hardware and communication protocol levels, but also at the level of the scheduling policies used. To reduce costs and use the available resources more efficiently, the applications are distributed across several networks.

We have introduced the current state-of-the-art analysis and optimisation techniques available for such systems, and addressed in more detail a special class of heterogeneous distributed real-time embedded systems called multi-cluster systems.

We have presented an analysis for multi-cluster systems and outlined several characteristic design problems, related to the partitioning and mapping of functionality, and the optimisation of the access to the communication infrastructure. An approach to schedulability-driven frame packing for the synthesis of multi-cluster systems was presented as an example of solving such a design optimisation problem. We have developed two optimisation heuristics for frame configuration synthesis which are able to determine frame configurations that lead to a schedulable system. We have shown that, by considering the frame packing problem, we are able to synthetize schedulable hard real-time systems and to potentially reduce the overall cost of the architecture.

The main message of this paper is that efficient analysis and optimisation methods are needed and can be developed for the efficient implementation of applications distributed over interconnected heterogeneous networks.

11 References
