
ROBERTO CAVICCHIOLI

Fixed-term Researcher (Art. 24(3)(b))
Department of Communication and Economics




Publications

2024 - ShareBERT: Embeddings Are Capable of Learning Hidden Layers [Conference paper]
Hu, Jia Cheng; Cavicchioli, Roberto; Berardinelli, Giulia; Capotondi, Alessandro


2024 - The Degree of Entanglement: Cyber-Physical Awareness in Digital Twin Applications [Conference paper]
Picone, Marco; Mariani, Stefano; Cavicchioli, Roberto; Burgio, Paolo; Cherif, Arslane Hamza

A defining feature of a Digital Twin (DT) is its level of "entanglement": the strength with which the twin is interconnected with its physical counterpart. Despite its importance, this characteristic has not yet been fully investigated, and its impact on application design is underestimated. In this paper, we define the concept of "Degree of Entanglement" (DoE), which provides an operational model for assessing the strength of the entanglement between a DT and its physical counterpart. We also propose an interoperable representation of DoE within the Web of Things (WoT) framework, which enables DT-driven applications to dynamically adapt to changes in the physical environment. We evaluate our proposal using two realistic use cases, demonstrating the practical utility of DoE in supporting, for instance, context-aware decisions and adaptiveness.


2023 - 5G MEC Architecture for Vulnerable Road Users Management Through Smart City Data Fusion [Conference paper]
Rossini, Enrico; Pietri, Marcello; Cavicchioli, Roberto; Picone, Marco; Mamei, Marco; Querio, Roberto; Colazzo, Laura; Procopio, Roberto


2023 - A Request for Clarity over the End of Sequence Token in the Self-Critical Sequence Training [Conference paper]
Hu, Jia Cheng; Cavicchioli, R.; Capotondi, A.

The Image Captioning research field is currently compromised by the lack of transparency and awareness over the End-of-Sequence token in the Self-Critical Sequence Training. If the token is omitted, a model can boost its performance by up to +4.1 CIDEr-D using trivial sentence fragments. While this phenomenon poses an obstacle to a fair evaluation and comparison of established works, people involved in new projects are given the arduous choice between lower scores and unsatisfactory descriptions due to the competitive nature of the research. This work proposes to solve the problem by spreading awareness of the issue itself. In particular, we invite future works to share a simple and informative signature with the help of a library called SacreEOS. Code available at: https://github.com/jchenghu/sacreeos.


2023 - Brief Announcement: Optimized GPU-accelerated Feature Extraction for ORB-SLAM Systems [Conference paper]
Muzzini, F.; Capodieci, N.; Cavicchioli, R.; Rouxel, B.

Reducing the execution time of the ORB-SLAM algorithm is crucial for autonomous vehicles, since the algorithm is computationally intensive for embedded boards. We propose a parallel GPU-based implementation, able to run on embedded boards, of the Tracking part of the ORB-SLAM2/3 algorithm. Our implementation is not simply a GPU port of the tracking phase. Instead, we propose a novel method to accelerate image pyramid construction on GPUs. A comparison against state-of-the-art CPU and GPU implementations, considering both computational time and trajectory error, shows improved execution times on well-known datasets, such as KITTI and EuRoC.
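For readers unfamiliar with the structure being accelerated, an image pyramid is a strictly level-by-level computation, which is what makes per-level GPU parallelization attractive. The sketch below is a deliberately naive serial version using 2x2 averaging; the actual ORB-SLAM2/3 pyramid uses a fractional scale factor (about 1.2) and filtering, and all names here are ours, not the paper's.

```python
def build_pyramid(img, levels):
    """Toy image pyramid: each level halves the previous one by 2x2
    averaging. ORB-SLAM2/3 actually uses a fractional scale factor
    (~1.2) with smoothing; this sketch only shows the serial,
    level-by-level dependency that a GPU implementation parallelizes
    within each level."""
    pyramid = [img]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        nxt = [[(prev[2 * y][2 * x] + prev[2 * y][2 * x + 1] +
                 prev[2 * y + 1][2 * x] + prev[2 * y + 1][2 * x + 1]) / 4.0
                for x in range(w)]
               for y in range(h)]
        pyramid.append(nxt)
    return pyramid
```

Within a single level every output pixel is independent, so each level maps naturally onto one GPU kernel launch; the levels themselves remain sequential.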


2023 - Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning [Conference paper]
Hu, Jia Cheng; Cavicchioli, Roberto; Capotondi, Alessandro


2023 - Heterogeneous Encoders Scaling in the Transformer for Neural Machine Translation [Conference paper]
Hu, Jia Cheng; Cavicchioli, R.; Berardinelli, G.; Capotondi, A.


2023 - Machine Learning Techniques for Understanding and Predicting Memory Interference in CPU-GPU Embedded Systems [Conference paper]
Masola, A.; Capodieci, N.; Rouxel, B.; Franchini, G.; Cavicchioli, R.


2023 - Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments [Conference paper]
Masola, A.; Capodieci, N.; Cavicchioli, R.; Olmedo, I. S.; Rouxel, B.


2022 - A Novel Real-Time Edge-Cloud Big Data Management and Analytics Framework for Smart Cities [Journal article]
Cavicchioli, Roberto; Martoglia, Riccardo; Verucchi, Micaela

Exposing city information to dynamic, distributed, powerful, scalable, and user-friendly big data systems is expected to enable the implementation of a wide range of new opportunities; however, the size, heterogeneity and geographical dispersion of the data often make it difficult to combine, analyze and consume them in a single system. In the context of the H2020 CLASS project, we describe an innovative framework aiming to facilitate the design of advanced big-data analytics workflows. The proposal covers the whole compute continuum, from edge to cloud, and relies on a well-organized distributed infrastructure exploiting: a) edge solutions with advanced computer vision technologies enabling the real-time generation of “rich” data from a vast array of sensor types; b) cloud data management techniques offering efficient storage, real-time querying and updating of the high-frequency incoming data at different granularity levels. We specifically focus on obstacle detection and tracking for edge processing, and consider a traffic density monitoring application, with hierarchical data aggregation features for cloud processing; the discussed techniques will constitute the groundwork enabling many further services. The tests are performed on the real use-case of the Modena Automotive Smart Area (MASA).


2022 - Driving Safety & Awareness Cooperative Business Model exploiting the 5GMETA platform [Conference paper]
Canova, C.; Pacella, F.; Brevi, D.; Apruzzese, M.; Cavicchioli, R.

The introduction of 5G in the automotive sector paves the way to new business opportunities and stakeholder collaboration, thanks to the availability of new connected-vehicle technologies. To emphasise the opportunities of improved data sharing, different business modelling approaches exist, depending on the phase of the data value chain that the technologies are able to cover. The 5GMETA open platform aims to leverage CCAM-based captured data to stimulate, facilitate and feed innovative products and services, exploring the possibilities enabled by collaborative business models. This paper describes a new business model, focused on a Use Case (UC) on "Driving Safety and Awareness" that exploits the 5GMETA platform, and highlights its business innovation. In this context, the value proposition lies in the ability to increase road safety thanks to better driving-misbehaviour detection, thus providing benefits to a wide range of potential stakeholders. The UC has its foundation in the real-time data collected, in a scalable and reliable way, by the platform. The main innovation that the adoption of such a business model will bring is the creation of new partnerships, which may be reinforced by a clear definition of the profit-sharing strategy among the actors involved.


2022 - Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs [Journal article]
Brilli, Gianluca; Cavicchioli, Roberto; Solieri, Marco; Valente, Paolo; Marongiu, Andrea

High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoC) that couple multi-core CPUs with accelerators such as GPU, FPGA, or AI engines. Adopting HeSoCs in the context of real-time workloads is not immediately possible, though, as contention on shared resources like the memory hierarchy—and in particular the main memory (DRAM)—causes unpredictable latency increase. To tackle this problem, both the research community and certification authorities mandate (i) that accesses from parallel threads to the shared system resources (typically, main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees, but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently-proposed bandwidth limitation concept that builds on top of a mutually-exclusive schedule but still allows the threads currently not entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has been discussed in the context of a multi-core CPU, but the same principle applies also to a more complex system such as an HeSoC. In this article, we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept incarnation of both schemes on two HeSoCs: an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and the costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference among the host CPU and the accelerators.


2022 - Real-Time Requirements for ADAS Platforms Featuring Shared Memory Hierarchies [Journal article]
Capodieci, Nicola; Burgio, Paolo; Cavicchioli, Roberto; Olmedo, Ignacio Sanudo; Solieri, Marco; Bertogna, Marko


2021 - A Taxonomy of Modern GPGPU Programming Methods: On the Benefits of a Unified Specification [Journal article]
Capodieci, N.; Cavicchioli, R.; Marongiu, A.

Several Application Programming Interfaces (APIs) and frameworks have been proposed to simplify the development of General-Purpose GPU (GPGPU) applications. GPGPU application development typically involves specific customization for the target operating systems and hardware devices. The effort to port applications from one API to another (or to develop multi-target applications) is complicated by the availability of a plethora of specifications which, in essence, offer very similar underlying functionality. In this work we provide an in-depth study of six state-of-the-art GPGPU APIs. From these we derive a taxonomy of the common semantics and propose a unified specification. We describe a methodology to translate this unified specification into different target APIs. This simplifies cross-platform application development and provides a clean framework for benchmarking. Our proposed unified specification is called GUST (GPGPU Unified Specification and Translation) and it captures common functionality found in compute-only APIs (e.g., CUDA and OpenCL), in the compute pipeline of traditional graphics-oriented APIs (e.g., OpenGL and Direct3D11) and in last-generation bare-metal APIs (e.g., Vulkan and Direct3D12). The proposed translation methodology resolves differences between specific APIs in a transparent manner, without hiding the available tuning knobs for compute kernel optimizations, and fosters best programming practices in a simple manner.


2021 - Improving emergency response in the era of ADAS vehicles in the Smart City [Journal article]
Capodieci, N.; Cavicchioli, R.; Muzzini, F.; Montagna, L.

Management of emergency vehicles can be fostered within a Smart City, i.e. an urban environment in which many IoT devices are orchestrated by a distributed intelligence able to suggest to road users the best course of action in different traffic situations. By extending MATSim (Multi-Agent Transport Simulation Software), we design and test appropriate mitigation strategies for traffic accidents occurring within an existing urban area augmented with V2V (Vehicle-to-Vehicle) and V2I (Vehicle-to-Infrastructure) capabilities and cars with Advanced Driver Assistance Systems (ADAS). Further, we propose traffic congestion models and related mechanisms for reducing the time emergency vehicles need to respond to accidents.


2021 - The HPC-DAG Task Model for Heterogeneous Real-Time Systems [Journal article]
Houssam-Eddine, Z.; Capodieci, N.; Cavicchioli, R.; Lipari, G.; Bertogna, M.

Recent commercial hardware platforms for embedded real-time systems feature heterogeneous processing units and computing accelerators on the same System-on-Chip. When designing complex real-time applications for such architectures, the designer is exposed to a number of difficult choices, like deciding on which compute engine to execute a certain task, or what degree of parallelism to adopt for a given function. To help the designer explore the wide space of design choices and tune the scheduling parameters, we propose a novel real-time application model, called HPC-DAG (Heterogeneous Parallel Condition Directed Acyclic Graph Model), specifically conceived for heterogeneous platforms. An HPC-DAG allows the system designer to specify alternative implementations of a software component for different processing engines, as well as conditional branches to model if-then-else statements. We also propose a schedulability analysis for the HPC-DAG model and a set of heuristic allocation algorithms aimed at improving schedulability for latency-sensitive applications. Our analysis takes into account the cost of preempting a task, which can be non-negligible on certain processors. We show the use of our approach on a realistic case study, and we demonstrate its effectiveness by comparing it with state-of-the-art algorithms previously proposed in the literature.
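The core modeling idea, alternative per-engine implementations for each node, can be sketched with a toy structure and a deliberately naive allocator. The task names and WCET numbers below are invented for illustration, and the paper's heuristics additionally account for whole-graph schedulability, preemption cost and conditional branches.

```python
# Each DAG node lists alternative implementations, one per engine,
# with a (hypothetical) worst-case execution time for each.
tasks = {
    "preprocess": {"CPU": 4.0, "GPU": 1.5},
    "inference":  {"GPU": 3.0},            # GPU-only implementation
    "planning":   {"CPU": 2.0, "GPU": 2.5},
}

def naive_allocation(tasks):
    """Pick, for every node, the engine with the smallest WCET.
    A real HPC-DAG allocator must instead optimize schedulability
    of the whole graph, not each node in isolation."""
    return {name: min(alts, key=alts.get) for name, alts in tasks.items()}
```

Even this trivial example shows why the design space is large: "preprocess" is cheaper on the GPU but competes there with "inference", so the locally optimal choice is not necessarily the schedulable one.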


2021 - vkpolybench: A crossplatform Vulkan Compute port of the PolyBench/GPU benchmark suite [Journal article]
Capodieci, N.; Cavicchioli, R.

PolyBench is a well-known set of benchmarks characterized by embarrassingly parallel kernels able to run on Graphic Processing Units (GPUs). While Polybench GPU kernels leverage well-established GP-GPU APIs such as CUDA and OpenCL, in this paper we present vkpolybench, a crossplatform PolyBench/GPU port built on top of Vulkan. Vulkan is the recently released Khronos standard for heterogeneous CPU–GPU computing that is gaining significant traction lately. Compared to CUDA and OpenCL, the Vulkan API improves GPU utilization while reducing CPU overheads.


2020 - A Systematic Assessment of Embedded Neural Networks for Object Detection [Conference paper]
Verucchi, M.; Brilli, G.; Sapienza, D.; Verasani, M.; Arena, M.; Gatti, F.; Capotondi, A.; Cavicchioli, R.; Bertogna, M.; Solieri, M.

Object detection is arguably one of the most important and complex tasks to enable the advent of next-generation autonomous systems. Recent advancements in deep learning techniques allowed a significant improvement in the detection accuracy and latency of modern neural networks, allowing their adoption in automotive, avionics and industrial embedded systems, where performance is required to meet size, weight and power constraints. Multiple benchmarks and surveys exist to compare state-of-the-art detection networks, profiling important metrics, like precision, latency and power efficiency, on Commercial-off-the-Shelf (COTS) embedded platforms. However, we observed a fundamental lack of fairness in the existing comparisons, with a number of implicit assumptions that may significantly bias the metrics of interest. These include using heterogeneous settings for the input size, training dataset, confidence thresholds, and, most importantly, platform-specific optimizations, which are especially important when assessing latency and energy-related values. The lack of uniform comparisons is mainly due to the significant effort required to re-implement network models, whenever openly available, on the specific platforms, to properly configure the available acceleration engines for optimizing performance, and to re-train the model using a homogeneous dataset. This paper aims at filling this gap, providing a comprehensive and fair comparison of the best-in-class Convolutional Neural Networks (CNNs) for real-time embedded systems, detailing the effort made to achieve an unbiased characterization on cutting-edge system-on-chips. Multi-dimensional trade-offs are explored for achieving a proper configuration of the available programmable accelerators for neural inference, adopting the best available software libraries. To stimulate the adoption of fair benchmarking assessments, the framework is released to the public in an open-source repository.


2020 - Automatic stochastic dithering techniques on GPU: Image quality and processing time improved [Journal article]
Franchini, G.; Cavicchioli, R.; Hu, J. C.

Dithering, or error diffusion, is a technique used to obtain a binary image, suitable for printing, from a grayscale one. At each step, the algorithm computes an allowed value of a pixel from a grayscale one by applying a threshold, therefore causing a conversion error. To obtain the optical illusion of a continuous tone, the resulting error is distributed to adjacent pixels. The literature offers many algorithms of this type, to cite some: Jarvis, Judice and Ninke (JJN), Stucki, Atkinson, Burkes and Sierra; but the best known and most widely used is Floyd-Steinberg. We compared various types of dithering, which differ from each other in the weights and number of pixels involved in the error diffusion scheme. All these algorithms suffer from two problems: artifacts and slowness. First, we address the artifacts, which are undesired texture patterns generated by the dithering algorithm, leading to less appealing visual results. To address this problem, we developed a stochastic version of Floyd-Steinberg's algorithm. The Weighted Signal to Noise Ratio (WSNR) is adopted to evaluate the outcome of the procedure, an error measure based on human visual perception that also takes artifacts into account. This measure behaves similarly to a low-pass filter and, in particular, exploits a contrast sensitivity function to compare the algorithm's result and the original image in terms of similarity. We show that the new stochastic algorithm is better in terms of both the WSNR measurement and visual analysis. Secondly, we address the method's inherent computational slowness: we implemented a parallel version of the Floyd-Steinberg algorithm that takes advantage of GPGPU (General Purpose Graphics Processing Unit) computing, drastically reducing the execution time. Specifically, we observed a quadratic time complexity with respect to the input size for the serial case, whereas the computational time required by our parallel implementation increased linearly. We then evaluated both the image quality and the performance of the parallel algorithm on an extensive image database. Finally, to make the method fully automatic, an empirical technique is presented to choose the best degree of stochasticity.
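As a point of reference for the error-diffusion scheme discussed above, here is a minimal serial sketch of Floyd-Steinberg with an optional stochastic threshold. It is illustrative only (pure Python, not the authors' GPU implementation), and the parameter names are ours.

```python
import random

# Classic Floyd-Steinberg kernel: the quantization error of a pixel is
# spread to its right and lower neighbours in these proportions.
FS_WEIGHTS = [((0, 1), 7 / 16), ((1, -1), 3 / 16),
              ((1, 0), 5 / 16), ((1, 1), 1 / 16)]

def dither(image, stochastic=0.0, seed=0):
    """Binarize a grayscale image (list of rows, values in [0, 1]).

    `stochastic` perturbs the 0.5 threshold by a uniform offset in
    [-stochastic, +stochastic] -- a simple way to break up the regular
    texture artifacts of the deterministic algorithm."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    buf = [row[:] for row in image]        # working copy accumulating errors
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            t = 0.5 + rng.uniform(-stochastic, stochastic)
            out[y][x] = 1 if buf[y][x] >= t else 0
            err = buf[y][x] - out[y][x]    # quantization error of this pixel
            for (dy, dx), wgt in FS_WEIGHTS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    buf[ny][nx] += err * wgt
    return out
```

Because the error is diffused (apart from border clipping), the mean of the binary output stays close to the mean of the input image, which is exactly the continuous-tone illusion the abstract describes.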


2020 - Contending memory in heterogeneous SoCs: Evolution in NVIDIA Tegra embedded platforms [Conference paper]
Capodieci, N.; Cavicchioli, R.; Olmedo, I. S.; Solieri, M.; Bertogna, M.

Modern embedded platforms are known to be constrained by size, weight and power (SWaP) requirements. In such contexts, achieving the desired performance-per-watt target calls for increasing the number of processors rather than ramping up their voltage and frequency. Hence, generation after generation, modern heterogeneous Systems on Chip (SoC) present a higher number of cores within their CPU complexes, as well as a wider variety of accelerators that leverage massively parallel compute architectures. Previous literature demonstrated that while increasing parallelism is theoretically optimal for improving average performance, shared memory hierarchies (i.e. caches and system DRAM) act as a bottleneck by exposing the platform processors to severe contention on memory accesses, hence dramatically impacting performance and timing predictability. In this work we characterize how subsequent generations of embedded platforms from the NVIDIA Tegra family balanced the increasing parallelism of each platform's processors against the consequent higher potential for memory interference. We also present open-source software for generating test scenarios aimed at measuring memory contention in highly heterogeneous SoCs.


2020 - Evaluating Controlled Memory Request Injection to Counter PREM Memory Underutilization [Conference paper]
Cavicchioli, R.; Capodieci, N.; Solieri, M.; Bertogna, M.; Valente, P.; Marongiu, A.

Modern heterogeneous systems-on-chip (HeSoC) feature high-performance multi-core CPUs tightly integrated with data-parallel accelerators. Such HeSoCs heavily rely on shared resources, which hinders their adoption in the context of Real-Time systems. The predictable execution model (PREM) has proven effective at preventing uncontrolled execution time lengthening due to memory interference in HeSoCs sharing main memory (DRAM). However, PREM only allows one task at a time to access memory, which inherently under-utilizes the available memory bandwidth in modern HeSoCs. In this paper, we conduct a thorough experimental study aimed at assessing the potential benefits of extending PREM so as to inject controlled amounts of memory requests coming from tasks other than the one currently granted exclusive DRAM access. Focusing on a state-of-the-art HeSoC, the NVIDIA TX2, we extensively characterize the relation between the injected bandwidth and the latency experienced by the task under test. The results confirm that for various types of workload it is possible to exploit the available bandwidth much more efficiently than standard PREM arbitration allows, often close to its maximum, while keeping latency inflation below 10%. We discuss possible practical implementation directions, highlighting the expected benefits and technical challenges.


2020 - GPU acceleration of a model-based iterative method for Digital Breast Tomosynthesis [Journal article]
Cavicchioli, R.; Hu, J. C.; Loli Piccolomini, E.; Morotti, E.; Zanni, L.

Digital Breast Tomosynthesis (DBT) is a modern 3D Computed Tomography X-ray technique for the early detection of breast tumors, which is receiving growing interest in the medical and scientific community. Since DBT performs incomplete sampling of data, the image reconstruction approaches based on iterative methods are preferable to the classical analytic techniques, such as the Filtered Back Projection algorithm, providing fewer artifacts. In this work, we consider a Model-Based Iterative Reconstruction (MBIR) method well suited to describe the DBT data acquisition process and to include prior information on the reconstructed image. We propose a gradient-based solver named Scaled Gradient Projection (SGP) for the solution of the constrained optimization problem arising in the considered MBIR method. Even if the SGP algorithm exhibits fast convergence, the time required on a serial computer for the reconstruction of a real DBT data set is too long for the clinical needs. In this paper we propose a parallel SGP version designed to perform the most expensive computations of each iteration on Graphics Processing Unit (GPU). We apply the proposed parallel approach on three different GPU boards, with computational performance comparable with that of the boards usually installed in commercial DBT systems. The numerical results show that the proposed GPU-based MBIR method provides accurate reconstructions in a time suitable for clinical trials.
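The constrained problem the SGP solver targets, a least-squares data-fidelity term under a nonnegativity constraint, can be reduced to a few lines in its plainest form. The sketch below is a fixed-step projected-gradient method on a tiny dense system; the paper's SGP additionally uses diagonal scaling and adaptive step lengths, and operates on the DBT projection operator rather than an explicit matrix, so this is only the skeleton of the idea.

```python
def matvec(A, x):
    """Dense matrix-vector product (pure Python, illustration only)."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def projected_gradient(A, b, step, iters=200):
    """Minimize ||Ax - b||^2 subject to x >= 0 by projected gradient:
    take a gradient step, then clip negative entries back to zero.
    SGP refines this skeleton with scaling and step-length selection."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        r = [ri - bi for ri, bi in zip(matvec(A, x), b)]   # residual Ax - b
        # gradient of ||Ax - b||^2 is 2 * A^T r
        g = [2 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [max(0.0, xi - step * gi) for xi, gi in zip(x, g)]  # project on x >= 0
    return x
```

On a trivial identity system the iterates converge to the nonnegative part of the unconstrained solution, which is the behaviour the projection enforces.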


2019 - Deadline-Based Scheduling for GPU with Preemption Support [Conference paper]
Capodieci, N.; Cavicchioli, R.; Bertogna, M.; Paramakuru, A.

Modern automotive-grade embedded computing platforms feature high-performance Graphics Processing Units (GPUs) to support the massively parallel processing power needed for next-generation autonomous driving applications (e.g., Deep Neural Network (DNN) inference, sensor fusion, path planning, etc.). As these workload-intensive activities are pushed to higher criticality levels, there is a stronger need for more predictable scheduling algorithms that are able to guarantee predictability without overly sacrificing GPU utilization. Unfortunately, the real-time literature on GPU scheduling mostly considered limited (or null) preemption capabilities, while previous efforts in broader domains were often based on programming models and APIs that were not designed to support the real-time requirements of recurring workloads. In this paper, we present the design of a prototype real-time scheduler for GPU activities on an embedded System on a Chip (SoC) featuring a cutting-edge GPU architecture by NVIDIA adopted in the autonomous driving domain. The scheduler runs as a software partition on top of the NVIDIA hypervisor, and it leverages latest-generation architectural features, such as pixel-level and thread-level preemption. Such a design allowed us to implement and test a preemptive Earliest Deadline First (EDF) scheduler for GPU tasks providing bandwidth isolation by means of a Constant Bandwidth Server (CBS). Our work involved investigating alternative programming models for compute APIs, allowing us to characterize CPU-to-GPU command submission with more detailed scheduling information. A detailed experimental characterization is presented to show the significant schedulability improvement of recurring real-time GPU tasks.
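The EDF-over-CBS arbitration described above boils down to two rules: always dispatch the server with the earliest deadline, and when a server exhausts its budget, replenish it and postpone its deadline by one period. The sketch below shows only those two rules; the server names and numbers are ours, and the actual scheduler of course also handles GPU preemption points and command submission.

```python
class CBS:
    """Toy Constant Bandwidth Server: budget Q, period P.
    When the budget is exhausted, the deadline is postponed by P and
    the budget replenished, which bounds the server's bandwidth to
    Q/P no matter how much work its tasks request."""
    def __init__(self, name, budget, period, now=0.0):
        self.name = name
        self.max_budget = budget
        self.period = period
        self.budget = budget
        self.deadline = now + period

    def run_for(self, dt):
        """Account dt time units of execution against the budget."""
        self.budget -= dt
        while self.budget <= 0.0:          # CBS replenishment rule
            self.deadline += self.period
            self.budget += self.max_budget

def edf_pick(servers):
    """Earliest Deadline First: dispatch the server whose current
    deadline is smallest."""
    return min(servers, key=lambda s: s.deadline)
```

The bandwidth-isolation property follows directly: a server that overruns only pushes its own deadline later, so EDF naturally deprioritizes it instead of letting it starve the others.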


2019 - Novel methodologies for predictable CPU-to-GPU command offloading [Conference paper]
Cavicchioli, R.; Capodieci, N.; Solieri, Marco; Bertogna, M.

There is an increasing industrial and academic interest towards a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as General Purpose-Graphic Processing Units (GP-GPUs). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature, and that may significantly affect real-time performance if not properly treated, i.e., the time spent by the CPU for submitting GP-GPU operations. We will show that the impact of CPU-to-GPU kernel submissions may be indeed relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small and consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API, and by Vulkan, a novel open standard GPU API that allows an improved control of GPU command submissions. We will show that this added control may significantly improve the application performance and predictability due to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous Real-Time systems. Our findings are evaluated on a latest generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.


2019 - Stochastic Floyd-Steinberg dithering on GPU: image quality and processing time improved [Conference paper]
Franchini, G.; Cavicchioli, R.; Hu, J. C.

Error diffusion dithering is a technique used to represent a grey-scale image in a format usable by a printer. At every step, an algorithm converts the grey-scale value of a pixel to a new value within the allowed ones, generating a conversion error. To achieve the illusion of a continuous tone, the error is distributed to the neighboring pixels. Among the existing algorithms, the most commonly used is Floyd-Steinberg. However, this algorithm suffers from two issues: artifacts and slowness. Artifacts are textures that can appear after the image elaboration, making it visually different from the original one. In order to avoid this effect, we use a stochastic version of the Floyd-Steinberg algorithm. To evaluate the results, we apply the Weighted Signal to Noise Ratio (WSNR), a visual-based model that accounts for the perceptibility of dithered textures. This filter has a low-pass characteristic and, in particular, uses a Contrast Sensitivity Function to evaluate the similarity between the original image and the final one. Our claim is that the new stochastic algorithm performs better under both the WSNR measure and visual analysis. Secondly, we address slowness: we describe a parallel version of the Floyd-Steinberg algorithm that exploits the GPU (Graphics Processing Unit), drastically reducing the time spent. Specifically, we noticed that the serial version's computational time increases quadratically with the input size, while the parallel version's increases linearly. Both the image quality and the computational performance of the parallel algorithm are evaluated on several large-scale images.


2018 - A Perspective on Safety and Real-Time Issues for GPU Accelerated ADAS [Conference paper]
Sanudo Olmedo, Ignacio; Capodieci, Nicola; Cavicchioli, Roberto

The current trend in designing Advanced Driving Assistance Systems (ADAS) is to enhance their computing power by using modern multi/many-core accelerators. For many critical applications such as pedestrian detection, line following, and path planning, the Graphics Processing Unit (GPU) is the most popular choice for obtaining orders-of-magnitude increases in performance at modest power consumption. This is made possible by exploiting the general-purpose nature of today's GPUs, as such devices are known to deliver unprecedented performance per watt on generic embarrassingly parallel workloads (as opposed to just graphics rendering, which previous GPU generations were exclusively designed to sustain). In this work, we explore novel challenges that system engineers have to face in terms of real-time constraints and functional safety when the GPU is the chosen accelerator. More specifically, we investigate how much of the safety standards currently applied to traditional platforms can be translated to a GPU-accelerated platform used in critical scenarios.


2018 - A survey on shared disk I/O management in virtualized environments under real time constraints [Journal article]
Sanudo Olmedo, Ignacio; Cavicchioli, Roberto; Capodieci, Nicola; Valente, Paolo; Bertogna, Marko

In the embedded systems domain, hypervisors are increasingly being adopted to guarantee timing isolation and appropriate hardware resource sharing among different software components. However, managing concurrent and parallel requests to shared hardware resources in a predictable way still represents an open issue. We argue that hypervisors can be an effective means to achieve an efficient and predictable arbitration of competing requests to shared devices in order to satisfy real-time requirements. As a representative example, we consider the case of mass storage (I/O) devices like Hard Disk Drives (HDD) and Solid State Disks (SSD), whose access times are orders of magnitude higher than those of central memory and CPU caches, therefore having a greater impact on overall task delays. We provide a comprehensive and up-to-date survey of the literature on I/O management within virtualized environments, focusing on software solutions proposed in the open source community, and discussing their main limitations in terms of real-time performance. Then, we discuss how the research in this subject may evolve in the future, highlighting the importance of techniques that are focused on scheduling not only the processing bandwidth, but also the access to other important shared resources, like I/O devices.


2018 - Work-in-Progress: NVIDIA GPU Scheduling Details in Virtualized Environments [Relazione in Atti di Convegno]
Capodieci, Nicola; Cavicchioli, Roberto; Bertogna, Marko
abstract

Modern automotive-grade embedded platforms feature high-performance Graphics Processing Units (GPUs) to support the massively parallel processing power needed for next-generation autonomous driving applications. Hence, a GPU scheduling approach with strong real-time guarantees is needed. While previous research efforts focused on reverse engineering the GPU ecosystem in order to understand and control GPU scheduling on NVIDIA platforms, we provide an in-depth explanation of the NVIDIA standard approach to GPU application scheduling on a Drive PX platform. Then, we discuss how a privileged scheduling server can be used to enforce arbitrary scheduling policies in a virtualized environment.


2017 - A software stack for next-generation automotive systems on many-core heterogeneous platforms [Articolo su rivista]
Burgio, Paolo; Bertogna, Marko; Capodieci, Nicola; Cavicchioli, Roberto; Sojka, Michal; Houdek, Přemysl; Marongiu, Andrea; Gai, Paolo; Scordino, Claudio; Morelli, Bruno
abstract

The next generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making such platforms the ideal candidate architecture. Integrating tens to hundreds of computing elements that run at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using "traditional" state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little or no use in safety-critical applications. Project Hercules will address this issue and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which achieves an order-of-magnitude improvement in predictable performance on top of cutting-edge Commercial-Off-The-Shelf (COTS) components. The proposed software stack lets both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.


2017 - Memory interference characterization between CPU cores and integrated GPUs in mixed-criticality platforms [Relazione in Atti di Convegno]
Cavicchioli, R.; Capodieci, N.; Bertogna, M.
abstract

Most of today's mixed-criticality platforms feature Systems on Chip (SoCs) where a multi-core CPU complex (the host) competes with an integrated Graphics Processing Unit (iGPU, the device) for access to central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access for both clients through often undisclosed or non-priority-driven mechanisms. This aspect becomes critical when the iGPU is a high-performance, massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel memory-centric scheduling mechanisms. We analyzed several well-known, commercially available platforms in order to estimate variations in throughput and latency under various memory access patterns, on both the host and the device side.


2017 - SiGAMMA: Server based integrated GPU arbitration mechanism for memory accesses [Relazione in Atti di Convegno]
Capodieci, Nicola; Cavicchioli, Roberto; Valente, Paolo; Bertogna, Marko
abstract

In embedded systems, CPUs and GPUs typically share main memory. The resulting memory contention may significantly inflate the duration of CPU tasks in a hard-to-predict way. Although initial solutions have been devised to control this undesired inflation, these approaches do not consider the interference caused by memory-intensive components of COTS embedded systems, such as integrated Graphics Processing Units. Dealing with this kind of interference might require custom-made hardware components that are not integrated in off-the-shelf platforms. We address these important issues by proposing a memory-arbitration mechanism, SiGAMMA (Siσ), for eliminating the interference on CPU tasks caused by conflicting memory requests from the GPU. Tasks on the CPU are assumed to comply with a prefetch-based execution model (PREM) proposed in the real-time literature, while memory accesses from the GPU are arbitrated through a predictable mechanism that avoids contention. Our experiments show that Siσ proves to be very effective in guaranteeing almost no inflation of the memory phases of CPU tasks, while at the same time avoiding excessive starvation of GPU tasks.


2016 - A survey on shared disk I/O management in virtualized environments under real time constraints [Relazione in Atti di Convegno]
Sanudo, I.; Cavicchioli, R.; Capodieci, N.; Valente, P.; Bertogna, M.
abstract

In the embedded systems domain, hypervisors are increasingly being adopted to guarantee timing isolation and appropriate hardware resource sharing among different software components. However, managing concurrent and parallel requests to shared hardware resources in a predictable way still represents an open issue. We argue that hypervisors can be an effective means to achieve an efficient and predictable arbitration of competing requests to shared devices in order to satisfy real-time requirements. As a representative example, we consider the case of mass storage (I/O) devices like Hard Disk Drives (HDDs) and Solid State Disks (SSDs), whose access times are orders of magnitude higher than those of central memory and CPU caches, and which therefore have a greater impact on overall task delays. We provide a comprehensive and up-to-date survey of the literature on I/O management within virtualized environments, focusing on software solutions proposed in the open-source community and discussing their main limitations in terms of real-time performance. Then, we discuss how research on this subject may evolve in the future, highlighting the importance of techniques that schedule not only the processing bandwidth, but also the access to other important shared resources, like I/O devices.


2015 - Cross-correlation of electrical measurements via physics-based device simulations: Linking electrical and structural characteristics [Relazione in Atti di Convegno]
Padovani, A.; Larcher, L.; Vandelli, L.; Bertocchi, M.; Cavicchioli, R.; Veksler, D.; Bersuker, G.
abstract

We present a comprehensive simulation framework to interpret the electrical characteristics (I-V, C-V, G-V, Charge-Pumping, BTI, CVS, RVS, ...) commonly used for material characterization and reliability analysis of gate dielectric stacks in modern semiconductor devices. By accounting for the physical processes controlling charge transport through the dielectric (e.g., carrier trapping/de-trapping at defect sites, defect generation, etc.), which is modeled using a novel approach based on material characteristics, the simulations provide a unique link between the electrical measurement data and specific atomic defects in the dielectric stack. Within this methodology, the software allows accurate defect spectroscopy by cross-correlating measurements of pre-stress electrical parameters (I-V, C-V, BTI). These data are then used to project the stack reliability through simulations of stress-induced leakage current (SILC) and time-dependent dielectric degradation trends, demonstrating the tool's capabilities as a technology characterization/optimization benchmark.


2015 - Substrate and temperature influence on the trap density distribution in high-k III-V MOSFETs [Relazione in Atti di Convegno]
Sereni, G.; Vandelli, L.; Cavicchioli, R.; Larcher, L.; Veksler, D.; Bersuker, G.
abstract

In this work we apply a new spectroscopic technique based on the simulation of capacitance and conductance measurements to investigate the defect density in high-κ/III-V MOSFETs. This technique exploits the simulation of C-V and G-V curves measured over a wide frequency range to extract the defect density map in the energy-position domain. The technique was used to investigate the role of the substrate material and of temperature on the interfacial and bulk defect distributions in the gate stack of InGaAs MOS capacitors grown on both InP and Si substrates. It was found that the substrate material does not affect the defect density in the gate dielectric stack. Applying the technique to C-V and G-V curves measured at different temperatures allows extracting the relaxation energy of defects, an important parameter connected to their atomic nature.


2013 - ML estimation of wavelet regularization hyperparameters in inverse problems [Relazione in Atti di Convegno]
Cavicchioli, Roberto; C., Chaux; L., Blanc Feraud; Zanni, Luca
abstract

In this paper we are interested in regularization hyperparameter estimation by maximum likelihood in inverse problems with wavelet regularization. One parameter per subband is estimated by a gradient ascent algorithm. We have to face two main difficulties: i) sampling the a posteriori image distribution to compute the gradient; ii) choosing a suitable step size to ensure good convergence properties. We first show that introducing an auxiliary variable makes the sampling feasible using the classical Metropolis-Hastings algorithm and Gibbs sampler. Secondly, we propose an adaptive step-size selection and a line-search strategy to improve the gradient-based method. The good performance of the proposed approach is demonstrated on both synthetic and real data.
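The two ingredients of the abstract above — posterior sampling and a gradient-ascent hyperparameter update — can be illustrated with a deliberately simplified sketch: a 1-D denoising toy problem with a single Laplace-prior hyperparameter, estimated by maximum likelihood via random-walk Metropolis-Hastings. The wavelet model, the auxiliary variable, and the adaptive step-size of the paper are all omitted; every numerical choice here is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: y = x + Gaussian noise, i.i.d. Laplace(lam) prior on x.
n, sigma = 50, 0.5
x_true = rng.laplace(scale=0.5, size=n)          # true lam = 1/scale = 2
y = x_true + sigma * rng.normal(size=n)

def log_post(x, lam):
    # log p(x | y, lam) up to an additive constant
    return -np.sum((y - x) ** 2) / (2 * sigma ** 2) - lam * np.sum(np.abs(x))

def posterior_mean_abs(lam, n_iter=1500, step=0.05):
    # Random-walk Metropolis-Hastings estimate of E[sum|x_i| given y, lam]
    x, lp, trace = y.copy(), log_post(y, lam), []
    for _ in range(n_iter):
        prop = x + step * rng.normal(size=n)
        lp_prop = log_post(prop, lam)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            x, lp = prop, lp_prop
        trace.append(np.sum(np.abs(x)))
    return np.mean(trace[n_iter // 2:])           # discard burn-in

# Gradient ascent on log p(y | lam); for the Laplace prior (Fisher identity):
#   d/dlam log p(y | lam) = n/lam - E[sum|x_i| given y, lam].
lam, lr = 1.0, 0.05
for _ in range(25):
    grad = n / lam - posterior_mean_abs(lam)
    lam = max(lam + lr * grad / n, 1e-3)          # normalized ascent step
print(round(lam, 2))
```

A fixed step size is used here; the paper's contribution includes replacing it with an adaptive selection plus line search, which matters for convergence on realistic image-scale problems.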


2013 - Towards real-time image deconvolution: application to confocal and STED microscopy [Articolo su rivista]
R., Zanella; G., Zanghirati; Cavicchioli, Roberto; Zanni, Luca; P., Boccacci; M., Bertero; G., Vicidomini
abstract

Although deconvolution can improve the quality of any type of microscope, the high computational time required has so far limited its widespread adoption. Here we demonstrate the ability of the scaled gradient projection (SGP) method to provide accelerated versions of the algorithms most used in microscopy. To achieve further increases in efficiency, we also consider implementations on graphics processing units (GPUs). We test the proposed algorithms on both synthetic and real data from confocal and STED microscopy. Combining the SGP method with the GPU implementation, we achieve speed-up factors from about 25 to 690 with respect to the conventional algorithm. The excellent results obtained on STED microscopy images demonstrate the synergy between super-resolution techniques and image deconvolution. Furthermore, the real-time processing preserves one of the most important properties of STED microscopy, i.e., the ability to provide fast sub-diffraction-resolution recordings.
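A minimal 1-D sketch of the baseline that SGP accelerates — Richardson-Lucy (RL) iterations for Poisson data — can help fix ideas. The paper works on 2-D/3-D confocal and STED images and its actual contribution (the SGP acceleration and the GPU port) is not reproduced here; the signal, PSF, and iteration count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D deconvolution with Poisson noise: a few point sources blurred by a
# Gaussian PSF, plus a constant background to keep counts strictly positive.
n, bg = 128, 1.0
x_true = np.zeros(n)
x_true[[30, 60, 61, 90]] = [40.0, 25.0, 25.0, 10.0]
psf = np.exp(-0.5 * ((np.arange(n) - n // 2) / 2.0) ** 2)
psf /= psf.sum()
otf = np.fft.fft(np.fft.ifftshift(psf))        # centered PSF -> transfer function

def A(v):
    # circular convolution with the (symmetric) PSF, so A^T = A here
    return np.real(np.fft.ifft(np.fft.fft(v) * otf))

y = rng.poisson(A(x_true) + bg).astype(float)

def resid(v):
    return np.linalg.norm(A(v) + bg - y) / np.linalg.norm(y)

x = np.full(n, y.mean())                       # flat positive starting point
err0 = resid(x)
for _ in range(200):
    # RL multiplicative update: x <- x * A^T( y / (A x + bg) );
    # the usual A^T 1 denominator equals 1 for a normalized circular PSF.
    x *= A(y / (A(x) + bg))
err = resid(x)
```

The update is purely multiplicative, which is why RL preserves nonnegativity; the hundreds of such iterations typically required are exactly the cost that the SGP acceleration and the GPU implementation attack.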


2012 - Efficient deconvolution methods for astronomical imaging: algorithms and IDL-GPU codes [Articolo su rivista]
Prato, Marco; Cavicchioli, Roberto; Zanni, Luca; P., Boccacci; M., Bertero
abstract

Context. The Richardson-Lucy (RL) method is the most popular deconvolution method in astronomy because it preserves the number of counts and the nonnegativity of the original object. Regularization is, in general, obtained by an early stopping of RL iterations; in the case of point-wise objects such as binaries or open star clusters, iterations can be pushed to convergence. However, it is well known that RL is not an efficient method: in most cases and, in particular, for low noise levels, acceptable solutions are obtained at the cost of hundreds or thousands of iterations. Therefore, several approaches for accelerating RL have been proposed. They are mainly based on the remark that RL is a scaled gradient method for the minimization of the Kullback-Leibler (KL) divergence, or Csiszar I-divergence, which represents the data-fidelity function in the case of Poisson noise. In this framework, a line search along the descent direction is considered for reducing the number of iterations. Aims. In a recent paper, a general optimization method, denoted as the scaled gradient projection (SGP) method, has been proposed for the constrained minimization of continuously differentiable convex functions. It is applicable to the nonnegative minimization of the KL divergence. If the scaling suggested by RL is used in this method, then it provides a considerable speedup of RL. The aim of this paper is therefore to apply SGP to a number of imaging problems in astronomy, such as single-image deconvolution, multiple-image deconvolution, and boundary-effect correction. Methods. Deconvolution methods are derived by applying SGP to the minimization of the KL divergence for the imaging problems mentioned above, and the corresponding algorithms are implemented in IDL. For all the algorithms, several stopping rules are introduced, including one based on a recently proposed discrepancy principle for Poisson data. For a further increase of efficiency, implementation on GPUs (Graphics Processing Units) is also considered. Results. The proposed algorithms are tested on simulated images. The speedup of the SGP methods with respect to the corresponding RL methods strongly depends on the problem and on the specific object to be reconstructed; in our simulations it ranges from about 4 to more than 30. Moreover, significant speedups, up to two orders of magnitude, have been observed between the serial and parallel implementations of the algorithms. The codes are available upon request.
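The "RL is a scaled gradient method" remark that this abstract builds on can be stated compactly in standard notation (not taken verbatim from the paper). With the KL data-fidelity function for Poisson data,

```latex
f(x) = \mathrm{KL}(y \,\|\, Ax) = \sum_{i}\Big[\, y_i \ln\frac{y_i}{(Ax)_i} + (Ax)_i - y_i \Big],
\qquad
\nabla f(x) = A^{\top}\mathbf{1} - A^{\top}\frac{y}{Ax},
```

the RL update is exactly a scaled gradient step with unit steplength and diagonal scaling:

```latex
x^{(k+1)} = x^{(k)} - D_k\,\nabla f(x^{(k)}),
\qquad
D_k = \mathrm{diag}\!\left(\frac{x^{(k)}}{A^{\top}\mathbf{1}}\right)
\;\Longrightarrow\;
x^{(k+1)} = \frac{x^{(k)}}{A^{\top}\mathbf{1}} \odot A^{\top}\frac{y}{Ax^{(k)}}.
```

SGP keeps the same RL-suggested scaling $D_k$ but adds an adaptive steplength, a projection onto the nonnegative orthant, and a line search, which is where the reported speedup over plain RL comes from.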


2012 - Efficient multi-image deconvolution in astronomy [Poster]
Cavicchioli, Roberto; Prato, Marco; Zanni, Luca; Boccacci, P.; Bertero, M.
abstract

The deconvolution of astronomical images by the Richardson-Lucy method (RLM) is extended here to the problem of multiple-image deconvolution and the reduction of boundary effects. We present the multiple-image RLM in its accelerated gradient version, SGP (Scaled Gradient Projection). Numerical simulations indicate that the approach can provide excellent results with a considerable reduction of the boundary effects. By also exploiting GPUlib in the IDL code, we obtained a remarkable acceleration of up to two orders of magnitude.


2012 - Optimization methods for digital image restoration on MPP multicore architectures [Capitolo/Saggio]
Cavicchioli, Roberto; Prearo, A; Zanella, R; Zanghirati, G; Zanni, Luca
abstract

We consider the numerical solution, on modern multicore architectures, of large-scale optimization problems arising in image restoration. An efficient solution of these optimization problems is important in several areas, such as medical imaging, microscopy and astronomy, where large-scale imaging is a basic task. To face these challenging problems, a lot of effort has been put into designing effective algorithms, which have largely improved on the classical optimization strategies usually applied in image processing. Nevertheless, in many large-scale applications even these improved algorithms do not provide the expected reconstruction in a reasonable time. In these cases, modern multiprocessor architectures represent an important resource for reducing the reconstruction time. One can consider several possibilities for a parallel computational scenario. One is the use of Graphics Processing Units (GPUs): they were originally designed to perform many simple operations on matrices and vectors with high efficiency and low accuracy (single-precision arithmetic), but they have recently seen a huge development of both computational power and accuracy (double-precision arithmetic), while still retaining compactness and low price. Another possibility is the use of last-generation multi-core CPUs, where general-purpose, very powerful computational cores are integrated inside the same CPU and several CPUs can be hosted on the same motherboard, sharing a central memory: they can perform completely different and asynchronous tasks, as well as cooperate by suitably distributing the workload of a complex task. Additional opportunities are offered by the more classical clusters of nodes, usually connected in different distributed-memory topologies to form large-scale high-performance machines with tens to hundreds of thousands of processors. Needless to say, various mixes of these architectures (such as clusters of GPUs) are also possible, and indeed commercially available.
It should be noticed, however, that all the mentioned scenarios can exist even in very small and cheap configurations. This is particularly relevant for GPUs: initially targeted at 3D graphics applications, they have been employed in many other scientific computing areas, such as signal and image reconstruction. Recent applications show that in many cases GPU performance is comparable to that of a medium-sized cluster, at a fraction of its cost. Thus, even small laboratories that cannot afford a cluster can benefit from a substantial reduction of computing time compared to a standard CPU system. Nevertheless, for very large problems, such as 3D imaging in confocal microscopy, the size of the GPU's dedicated on-device memory can become a limit to performance. For this reason, the ability to exploit the scalability of clusters by means of standard MPI implementations is still crucial for facing very large-scale applications. Here, we deal with both the GPU and the MPI implementation of an optimization algorithm, called the Scaled Gradient Projection (SGP) method, that applies to several imaging problems. GPU versions of this method have been evaluated recently, while an MPI version is presented in this work for both deblurring and denoising problems. A computational study of the different implementations is reported, showing the enhancements provided by these parallel approaches in solving both 2D and 3D imaging problems.


2011 - SGP-IDL: a Scaled Gradient Projection method for image deconvolution in an Interactive Data Language environment [Software]
Prato, Marco; Cavicchioli, Roberto; Zanni, Luca; Boccacci, P.; Bertero, M.
abstract

An Interactive Data Language (IDL) package for the single and multiple deconvolution of 2D images corrupted by Poisson noise, with the optional inclusion of a boundary-effect correction. Following a maximum likelihood approach, SGP-IDL computes a deconvolved image by early stopping of the scaled gradient projection (SGP) algorithm applied to the optimization problem arising from the minimization of the generalized Kullback-Leibler divergence between the computed image and the observed image. The algorithms have also been implemented for Graphics Processing Units (GPUs).


2011 - SGP-dec: A Scaled Gradient Projection method for 2D and 3D image deconvolution [Software]
R., Zanella; Zanni, Luca; G., Zanghirati; Cavicchioli, Roberto
abstract

SGP-dec is a Matlab package for the deconvolution of 2D and 3D images corrupted by Poisson noise. Following a maximum likelihood approach, SGP-dec computes a deconvolved image by early stopping an iterative method for the minimization of the generalized Kullback-Leibler divergence. The iterative minimization method implemented by SGP-dec is a Scaled Gradient Projection (SGP) algorithm that can be considered an acceleration of the Expectation Maximization method, also known as the Richardson-Lucy method. The main feature of the SGP algorithm is the combination of inexpensive diagonally scaled gradient directions with adaptive Barzilai-Borwein steplength rules specially designed for these directions; global convergence properties are ensured by exploiting a line-search strategy (monotone or nonmonotone) along the feasible direction. The SGP algorithm is provided for use as an iterative regularization method; this means that a regularized reconstruction can be obtained by early stopping of the SGP sequence. Several early stopping strategies can be selected, based on different criteria: maximum number of iterations, distance between successive iterates or function values, and the discrepancy principle; the user must choose a stopping criterion and fix suitable values for the parameters it involves.
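The ingredients listed above (diagonal scaling, a Barzilai-Borwein steplength, projection onto the nonnegative orthant, line search along the feasible direction) can be sketched on a toy problem. The package itself is Matlab and targets the KL divergence; this simplified Python sketch uses a nonnegative least-squares objective, only the BB1 rule, and only the monotone search, so every detail here is an illustrative assumption rather than the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonnegative least-squares problem: min_{x >= 0} 0.5 * ||A x - b||^2,
# built so that the optimum value is exactly zero.
m, n = 40, 20
A = rng.random((m, n))
b = A @ np.abs(rng.normal(size=n))

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

x = np.ones(n)
f0, g, alpha = f(x), grad(x), 1.0
for _ in range(200):
    scale = np.clip(x, 1e-10, 1e10)                  # diagonal scaling D_k
    d = np.maximum(x - alpha * scale * g, 0.0) - x   # projected scaled step
    # monotone Armijo line search along the feasible direction d
    lam, fx = 1.0, f(x)
    while f(x + lam * d) > fx + 1e-4 * lam * (g @ d) and lam > 1e-10:
        lam *= 0.5
    x_new = x + lam * d
    g_new = grad(x_new)
    s, z = x_new - x, g_new - g
    if s @ z > 0:
        alpha = min(max((s @ s) / (s @ z), 1e-5), 1e5)  # BB1 steplength
    x, g = x_new, g_new
```

Because each iterate is a convex combination of the previous point and its projection, nonnegativity is preserved exactly, which is the feasibility property the projection step exists to guarantee.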


2009 - An Efficient Algorithm for Planted Composite Motif Extraction [Relazione in Atti di Convegno]
Federico, M.; Valente, Paolo; Leoncini, Mauro; Montangero, Manuela; Cavicchioli, R.
abstract

In this paper we present an algorithm for the problem of planted structured motif extraction from a set of sequences. This problem is closely related to the structured motif extraction problem, which has many important applications in molecular biology. We propose an algorithm that uses a simple two-stage approach: first it extracts simple motifs, then it combines the simple motifs to extract structured motifs. We compare our algorithm with existing algorithms whose code is available and which are based on more complex approaches. Our experiments show that, even though the problem is NP-hard in general, our algorithm is able to handle complex instances in a reasonable amount of time.
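The two-stage idea can be illustrated with a deliberately naive sketch: stage one extracts simple motifs (exact length-L substrings occurring in every sequence), stage two combines pairs of them under a gap constraint. This toy version allows no mismatches and only one gap, so it is far simpler than the paper's algorithm; the function names and the example data are invented for illustration.

```python
from itertools import product

def simple_motifs(seqs, L):
    # Stage 1: length-L substrings that occur in every sequence.
    common = None
    for s in seqs:
        found = {s[i:i + L] for i in range(len(s) - L + 1)}
        common = found if common is None else common & found
    return common or set()

def occurrences(s, m):
    return [i for i in range(len(s) - len(m) + 1) if s.startswith(m, i)]

def structured_motifs(seqs, L, gap_min, gap_max):
    # Stage 2: keep pairs (m1, m2) of simple motifs such that every sequence
    # contains m1 followed by m2 with a gap in [gap_min, gap_max].
    singles = simple_motifs(seqs, L)
    result = set()
    for m1, m2 in product(singles, repeat=2):
        if all(any(gap_min <= p2 - (p1 + L) <= gap_max
                   for p1 in occurrences(s, m1)
                   for p2 in occurrences(s, m2))
               for s in seqs):
            result.add((m1, m2))
    return result

# Three toy sequences with a planted "ACG ..2.. TGA" composite motif:
seqs = ["ACGAATGA", "TTACGCCTGATT", "ACGGTTGAC"]
print(structured_motifs(seqs, 3, 2, 2))   # -> {('ACG', 'TGA')}
```

Filtering through the common-substring stage first is what keeps the pairwise combination stage small in practice, which mirrors the rationale given in the abstract for the two-stage design.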