MARCO SOLIERI - personale UniMoRe

Nuova ricerca

MARCO SOLIERI

COLLABORATORE IN SPIN OFF
Dipartimento di Scienze Fisiche, Informatiche e Matematiche sede ex-Fisica

Pubblicazioni

2023 - Time-sensitive autonomous architectures [Articolo su rivista]
Ferraro, D; Palazzi, L; Gavioli, F; Guzzinati, M; Bernardi, A; Rouxel, B; Burgio, P; Solieri, M
abstract

Autonomous and software-defined vehicles (ASDVs) feature highly complex systems, coupling safety-critical and non-critical components such as infotainment. These systems require the highest connectivity, both inside the vehicle and with the outside world. An effective solution for network communication lies in Time-Sensitive Networking (TSN) which enables high-bandwidth and low-latency communications in a mixed-criticality environment. In this work, we present Time-Sensitive Autonomous Architectures (TSAA) to enable TSN in ASDVs. The software architecture is based on a hypervisor providing strong isolation and virtual access to TSN for virtual machines (VMs). TSAA latest iteration includes an autonomous car controlled by two Xilinx accelerators and a multiport TSN switch. We discuss the engineering challenges and the performance evaluation of the project demonstrator. In addition, we propose a Proof-of-Concept design of virtualized TSN to enable multiple VMs executing on a single board taking advantage of the inherent guarantees offered by TSN.

2022 - Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs [Articolo su rivista]
Brilli, Gianluca; Cavicchioli, Roberto; Solieri, Marco; Valente, Paolo; Marongiu, Andrea
abstract

High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoC) that couple multi-core CPUs with accelerators such as GPU, FPGA, or AI engines. Adopting HeSoCs in the context of real-time workloads is not immediately possible, though, as contention on shared resources like the memory hierarchy—and in particular the main memory (DRAM)—causes unpredictable latency increase. To tackle this problem, both the research community and certification authorities mandate (i) that accesses from parallel threads to the shared system resources (typically, main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees, but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently-proposed bandwidth limitation concept that builds on top of a mutually-exclusive schedule but still allows the threads currently not entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has been discussed in the context of a multi-core CPU, but the same principle applies also to a more complex system such as an HeSoC. In this article, we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept incarnation of both schemes on two HeSoCs: an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and the costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference among the host CPU and the accelerators.

2022 - Real-Time Requirements for ADAS Platforms Featuring Shared Memory Hierarchies [Articolo su rivista]
Capodieci, Nicola; Burgio, Paolo; Cavicchioli, Roberto; Olmedo, Ignacio Sanudo; Solieri, Marco; Bertogna, Marko
abstract

2021 - SPHERE: A Multi-SoC Architecture for Next-Generation Cyber-Physical Systems Based on Heterogeneous Platforms [Articolo su rivista]
Biondi, A.; Casini, D.; Cicero, G.; Borgioli, N.; Buttazzo, G.; Patti, G.; Leonardi, L.; Bello, L. L.; Solieri, M.; Burgio, P.; Olmedo, I. S.; Ruocco, A.; Palazzi, L.; Bertogna, M.; Cilardo, A.; Mazzocca, N.; Mazzeo, A.
abstract

This paper presents SPHERE, a project aimed at the realization of an integrated framework to abstract the hardware complexity of interconnected, modern system-on-chips (SoC) and simplify the management of their heterogeneous computational resources. The SPHERE framework leverages hypervisor technology to virtualize computational resources and isolate the behavior of different subsystems running on the same platform, while providing safety, security, and real-time communication mechanisms. The main challenges addressed by SPHERE are discussed in the paper along with a set of new technologies developed in the context of the project. They include isolation mechanisms for mixed-criticality applications, predictable I/O virtualization, the management of time-sensitive networks with heterogeneous traffic flows, and the management of field-programmable gate arrays (FPGA) to provide efficient implementations for cryptography modules, as well as hardware acceleration for deep neural networks. The SPHERE architecture is validated through an autonomous driving use-case.

2021 - The Predictable Execution Model in Practice: Compiling Real Applications for COTS Hardware [Articolo su rivista]
Forsberg, B.; Solieri, M.; Bertogna, M.; Benini, L.; Marongiu, A.
abstract

Adoption of multi- and many-core processors in real-time systems has so far been slowed down, if not totally barred, due do the difficulty in providing analytical real-time guarantees on worst-case execution times. The Predictable Execution Model (PREM) has been proposed to solve this problem, but its practical support requires significant code refactoring, a task better suited for a compilation tool chain than human programmers. Implementing a PREM compiler presents significant challenges to conform to PREM requirements, such as guaranteed upper bounds on memory footprint and the generation of efficient schedulable non-preemptive regions. This article presents a comprehensive description on how a PREM compiler can be implemented, based on several years of experience from the community. We provide accumulated insights on how to best balance conformance to real-time requirements and performance and present novel techniques that extend the applicability from simple benchmark suites to real-world applications. We show that code transformed by the PREM compiler enables timing predictable execution on modern commercial off-the-shelf hardware, providing novel insights on how PREM can protect 99.4% of memory accesses on random replacement policy caches at only 16% performance loss on benchmarks from the PolyBench benchmark suite. Finally, we show that the requirements imposed on the programming model are well-aligned with current coding guidelines for timing critical software, promoting easy adoption.

2020 - A Systematic Assessment of Embedded Neural Networks for Object Detection [Relazione in Atti di Convegno]
Verucchi, M.; Brilli, G.; Sapienza, D.; Verasani, M.; Arena, M.; Gatti, F.; Capotondi, A.; Cavicchioli, R.; Bertogna, M.; Solieri, M.
abstract

Object detection is arguably one of the most important and complex tasks to enable the advent of next-generation autonomous systems. Recent advancements in deep learning techniques allowed a significant improvement in detection accuracy and latency of modern neural networks, allowing their adoption in automotive, avionics and industrial embedded systems, where performances are required to meet size, weight and power constraints.Multiple benchmarks and surveys exist to compare state-of-the-art detection networks, profiling important metrics, like precision, latency and power efficiency on Commercial-off-the-Shelf (COTS) embedded platforms. However, we observed a fundamental lack of fairness in the existing comparisons, with a number of implicit assumptions that may significantly bias the metrics of interest. This includes using heterogeneous settings for the input size, training dataset, threshold confidences, and, most importantly, platform-specific optimizations, that are especially important when assessing latency and energy-related values. The lack of uniform comparisons is mainly due to the significant effort required to re-implement network models, whenever openly available, on the specific platforms, to properly configure the available acceleration engines for optimizing performance, and to re-train the model using a homogeneous dataset.This paper aims at filling this gap, providing a comprehensive and fair comparison of the best-in-class Convolution Neural Networks (CNNs) for real-time embedded systems, detailing the effort made to achieve an unbiased characterization on cutting-edge system-on-chips. Multi-dimensional trade-offs are explored for achieving a proper configuration of the available programmable accelerators for neural inference, adopting the best available software libraries. To stimulate the adoption of fair benchmarking assessments, the framework is released to the public in an open source repository.

2020 - BAO: A lightweight static partitioning hypervisor for modern multi-core embedded systems [Relazione in Atti di Convegno]
Martins, J.; Tavares, A.; Solieri, M.; Bertogna, M.; Pinto, S.
abstract

Given the increasingly complex and mixed-criticality nature of modern embedded systems, virtualization emerges as a natural solution to achieve strong spatial and temporal isolation. Widely used hypervisors such as KVM and Xen were not designed having embedded constraints and requirements in mind. The static partitioning architecture pioneered by Jailhouse seems to address embedded concerns. However, Jailhouse still depends on Linux to boot and manage its VMs. In this paper, we present the Bao hypervisor, a minimal, standalone and clean-slate implementation of the static partitioning architecture for Armv8 and RISC-V platforms. Preliminary results regarding size, boot, performance, and interrupt latency, show this approach incurs only minimal virtualization overhead. Bao will soon be publicly available, in hopes of engaging both industry and academia on improving Bao's safety, security, and real-time guarantees.

2020 - Contending memory in heterogeneous SoCs: Evolution in NVIDIA Tegra embedded platforms [Relazione in Atti di Convegno]
Capodieci, N.; Cavicchioli, R.; Olmedo, I. S.; Solieri, M.; Bertogna, M.
abstract

Modern embedded platforms are known to be constrained by size, weight and power (SWaP) requirements. In such contexts, achieving the desired performance-per-watt target calls for increasing the number of processors rather than ramping up their voltage and frequency. Hence, generation after generation, modern heterogeneous System on Chips (SoC) present a higher number of cores within their CPU complexes as well as a wider variety of accelerators that leverages massively parallel compute architectures. Previous literature demonstrated that while increasing parallelism is theoretically optimal for improving on average performance, shared memory hierarchies (i.e. caches and system DRAM) act as a bottleneck by exposing the platform processors to severe contention on memory accesses, hence dramatically impacting performance and timing predictability. In this work we characterize how subsequent generations of embedded platforms from the NVIDIA Tegra family balanced the increasing parallelism of each platform's processors with the consequent higher potential on memory interference. We also present an open-source software for generating test scenarios aimed at measuring memory contention in highly heterogeneous SoCs.

2020 - Evaluating Controlled Memory Request Injection to Counter PREM Memory Underutilization [Relazione in Atti di Convegno]
Cavicchioli, R.; Capodieci, N.; Solieri, M.; Bertogna, M.; Valente, P.; Marongiu, A.
abstract

Modern heterogeneous systems-on-chip (HeSoC) feature high-performance multi-core CPUs tightly integrated with data-parallel accelerators. Such HeSoCS heavily rely on shared resources, which hinder their adoption in the context of Real-Time systems. The predictable execution model (PREM) has proven effective at preventing uncontrolled execution time lengthening due to memory interference in HeSoC sharing main memory (DRAM). However, PREM only allows one task at a time to access memory, which inherently under-utilizes the available memory bandwidth in modern HeSoCs. In this paper, we conduct a thorough experimental study aimed at assessing the potential benefits of extending PREM so as to inject controlled amounts of memory requests coming from other tasks than the one currently granted exclusive DRAM access. Focusing on a state-of-the-art HeSoC, the NVIDIA TX2, we extensively characterize the relation between the injected bandwidth and the latency experienced by the task under test. The results confirm that for various types of workload it is possible to exploit the available bandwidth much more efficiently than standard PREM arbitration, often close to its maximum, while keeping latency inflation below 10%. We discuss possible practical implementation directions, highlighting the expected benefits and technical challenges.

2020 - Real-Time Virtualization for Industrial Automation [Relazione in Atti di Convegno]
Scordino, C.; Savino, I. M.; Cuomo, L.; Miccio, L.; Tagliavini, A.; Bertogna, M.; Solieri, M.
abstract

The industry has recently shown a growing interest in running non-critical activities (e.g., design tools) together with real-time control tasks on open-source COTS platforms.In this paper, we present a modular platform for industrial automation based on open-source components. Thanks to the isolation provided by a hypervisor, the same platform can run both the real-time control code and the design tools, reducing the overall hardware costs.Besides illustrating the software architecture and describing the various open-source components, we illustrate extensions needed for convenient applicability, and we address the problem of interference on hardware resources shared by multiple operating systems, which undermines the performance of the real-time control. A set of experimental results show the effectiveness and the performance of the proposed solution.

2020 - The Key Role of Memory in Next-Generation Embedded Systems for Military Applications [Relazione in Atti di Convegno]
Sañudo, Ignacio; Cortimiglia, Paolo; Miccio, Luca; Solieri, Marco; Burgio, Paolo; Di Biagio, Christian; Felici, Franco; Nuzzo, Giovanni; Bertogna, Marko
abstract

With the increasing use of multi-core platforms in safety-related domains, aircraft system integrators and authorities exhibit a concern about the impact of concurrent access to shared-resources in the Worst-Case Execution Time (WCET). This paper highlights the need for accurate memory-centric scheduling mechanisms for guaranteeing prioritized memory accesses to Real-Time safety-related components of the system. We implemented a software technique called cache coloring that demonstrates that isolation at timing and spatial level can be achieved by managing the lines that can be evicted in the cache. In order to show the effectiveness of this technique, the timing properties of a real application are considered as a use case, this application is made of parallel tasks that show different trade-offs between computation and memory loads.

2019 - Deterministic memory hierarchy and virtualization for modern multi-core embedded systems [Relazione in Atti di Convegno]
Kloda, Tomasz; Solieri, M.; Mancuso, R.; Capodieci, N.; Valente, P.; Bertogna, M.
abstract

One of the main predictability bottlenecks of modern multi-core embedded systems is contention for access to shared memory resources. Partitioning and software-driven allocation of memory resources is an effective strategy to mitigate contention in the memory hierarchy. Unfortunately, however, many of the strategies adopted so far can have unforeseen side-effects when practically implemented latest-generation, high-performance embedded platforms. Predictability is further jeopardized by cache eviction policies based on random replacement, targeting average performance instead of timing determinism.In this paper, we present a framework of software-based techniques to restore memory access determinism in high-performance embedded systems. Our approach leverages OS-transparent and DMA-friendly cache coloring, in combination with an invalidation-driven allocation (IDA) technique. The proposed method allows protecting important cache blocks from (i) external eviction by tasks concurrently executing on different cores, and (ii) internal eviction by tasks running on the same core. A working implementation obtained by extending the Jailhouse partitioning hypervisor is presented and evaluated with a combination of synthetic and real benchmarks.

2019 - Novel methodologies for predictable CPU-to-GPU command offloading [Relazione in Atti di Convegno]
Cavicchioli, R.; Capodieci, N.; Solieri, Marco; Bertogna, M.
abstract

There is an increasing industrial and academic interest towards a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as General Purpose-Graphic Processing Units (GP-GPUs). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature, and that may significantly affect real-time performance if not properly treated, i.e., the time spent by the CPU for submitting GP-GPU operations. We will show that the impact of CPU-to-GPU kernel submissions may be indeed relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small and consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API, and by Vulkan, a novel open standard GPU API that allows an improved control of GPU command submissions. We will show that this added control may significantly improve the application performance and predictability due to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous Real-Time systems. Our findings are evaluated on a latest generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.

Università degli studi di Modena e Reggio Emilia

Pubblicazioni