PAOLO BURGIO - UniMoRe staff

PAOLO BURGIO

Fixed-term Researcher (art. 24, para. 3, letter B)
Department of Physics, Informatics and Mathematics, former Mathematics building

Publications

2024 - The Degree of Entanglement: Cyber-Physical Awareness in Digital Twin Applications [Conference proceedings paper]
Picone, Marco; Mariani, Stefano; Cavicchioli, Roberto; Burgio, Paolo; Cherif, Arslane Hamza

A defining feature of a Digital Twin (DT) is its level of "entanglement": the strength with which the twin is interconnected with its physical counterpart. Despite its importance, this characteristic has not yet been fully investigated, and its impact on applications' design is underestimated. In this paper, we define the concept of "Degree of Entanglement" (DoE), which provides an operational model for assessing the strength of the entanglement between a DT and its physical counterpart. We also propose an interoperable representation of DoE within the Web of Things (WoT) framework, which enables DT-driven applications to dynamically adapt to changes in the physical environment. We evaluate our proposal using two realistic use cases, demonstrating the practical utility of DoE in supporting, for instance, context-awareness decisions and adaptiveness.

2023 - Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-based Heterogeneous SoCs [Conference proceedings paper]
Brilli, G.; Valente, G.; Capotondi, A.; Burgio, P.; Di Mascio, T.; Valente, P.; Marongiu, A.

2023 - Optimized Local Path Planner Implementation for GPU-Accelerated Embedded Systems [Journal article]
Muzzini, F.; Capodieci, N.; Ramanzin, F.; Burgio, P.

Autonomous vehicles are latency-sensitive systems. The planning phase is a critical component of such systems, during which the in-vehicle compute platform is responsible for determining the future maneuvers that the vehicle will follow. In this paper, we present a GPU-accelerated optimized implementation of the Frenet Path Planner, a widely known path planning algorithm. Unlike the current state-of-the-art, our implementation accelerates the entire algorithm, including the path generation and collision avoidance phases. We measure the execution time of our implementation and demonstrate dramatic speedups compared to the CPU baseline implementation. Additionally, we evaluate the impact of different precision types (double, float, half) on trajectory errors to investigate the tradeoff between completion latencies and computation precision.

2023 - The Advantage of Using Traffic Rules for Motion Prediction in Intersections (TRMPI) [Conference proceedings paper]
Moazen, I.; Burgio, P.

Motion prediction in autonomous driving is essential for correct and reliable planning. The influence of road agents on each other makes it even more challenging. However, most prior works have not considered these interactions, and planning against such predictions reduces the ability to represent the possible future interactions among the different agents. In this work, we propose a model that predicts the agents' behavior jointly. We apply a masking strategy to our model as the query. Our model architecture employs attention across agent interactions, traffic rules in intersections, and road elements. We evaluate our model on autonomous driving datasets for behavior prediction and test it on the CARLA simulator. Our work demonstrates that motion prediction by a model with a masking strategy, attention, and traffic rules can …

2023 - Time-sensitive autonomous architectures [Journal article]
Ferraro, D; Palazzi, L; Gavioli, F; Guzzinati, M; Bernardi, A; Rouxel, B; Burgio, P; Solieri, M

Autonomous and software-defined vehicles (ASDVs) feature highly complex systems, coupling safety-critical and non-critical components such as infotainment. These systems require the highest connectivity, both inside the vehicle and with the outside world. An effective solution for network communication lies in Time-Sensitive Networking (TSN), which enables high-bandwidth and low-latency communications in a mixed-criticality environment. In this work, we present Time-Sensitive Autonomous Architectures (TSAA) to enable TSN in ASDVs. The software architecture is based on a hypervisor providing strong isolation and virtual access to TSN for virtual machines (VMs). The latest TSAA iteration includes an autonomous car controlled by two Xilinx accelerators and a multiport TSN switch. We discuss the engineering challenges and the performance evaluation of the project demonstrator. In addition, we propose a proof-of-concept design of virtualized TSN that enables multiple VMs executing on a single board to take advantage of the inherent guarantees offered by TSN.

2022 - An FPGA Overlay for Efficient Real-Time Localization in 1/10th Scale Autonomous Vehicles [Conference proceedings paper]
Bernardi, Andrea; Brilli, Gianluca; Capotondi, Alessandro; Marongiu, Andrea; Burgio, Paolo

Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Different from software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully-functional 1/10th-scale racing autonomous car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution both from the hardware and software viewpoint, while providing better performance and energy efficiency compared to state-of-the-art solutions.

2022 - Motion Sickness Minimization Alerting System Using The Next Curvature Topology [Conference proceedings paper]
Moazen, I; Burgio, P; Castellano, A

Current intelligent car prototypes are increasingly moving towards full autonomy, where no driver is required. If an automated vehicle has rearward- and forward-facing seats and none of the passengers pays attention to the road, they increasingly experience motion sickness because they cannot anticipate the future motion trajectory. In this paper, we focus on anticipatory audio and video cues, using pleasant sounds and a Human Machine Interface to inform the passengers about upcoming trajectories that may make them sick. To anticipate the next moves, we require a system that evaluates the next 1 kilometer of road using the map. The road is assessed based on the number of turns and the maximum allowed speed, which lead to lateral accelerations high enough, according to the Motion Sickness Dose Value, to make the passengers sick. The system alerts the passengers through a Human Machine Interface to focus on the road in order to prevent motion sickness. We evaluate our method using the Motion Sickness Dose Value. Based on this work, we can prevent sickness due to lateral accelerations by making the passengers focus on the road, decreasing the vestibular conflict.

2022 - Real-Time Requirements for ADAS Platforms Featuring Shared Memory Hierarchies [Journal article]
Capodieci, Nicola; Burgio, Paolo; Cavicchioli, Roberto; Olmedo, Ignacio Sanudo; Solieri, Marco; Bertogna, Marko

2022 - Sentient Spaces: Intelligent Totem Use Case in the ECSEL FRACTAL Project [Conference proceedings paper]
Caruso, F; Di Mascio, T; Frigioni, D; Pomante, L; Valente, G; Delucchi, S; Burgio, P; Di Frangia, M; Paganin, L; Garibotto, C; Vallocchia, D

The objective of the FRACTAL project is to create a novel approach to reliable edge computing. The FRACTAL computing node will be the building block of scalable Internet of Things (from Low Computing to High Computing Edge Nodes). The node will also have the capability of learning how to improve its performance against the uncertainty of the environment. In such a context, this paper presents in detail one of the key use cases: an Internet-of-Things solution, represented by intelligent totems for advertisement and wayfinding services, within advanced ICT-based shopping malls conceived as a sentient space. The paper outlines the reference scenario and provides an overview of the architecture and the functionality of the demonstrator, as well as a roadmap for its development and evaluation.

2022 - Understanding and Mitigating Memory Interference in FPGA-based HeSoCs [Conference proceedings paper]
Brilli, G.; Capotondi, A.; Burgio, P.; Marongiu, A.

Like most high-end embedded systems, FPGA-based systems-on-chip (SoC) are increasingly adopting heterogeneous designs, where CPU cores, the configurable logic and other ICs all share the interconnect and the main memory (DRAM) controller. This paradigm is scalable and reduces production costs and time-to-market, but creates resource contention issues, which ultimately affect the programs' timing. This problem has been widely studied on CPU- and GPU-based systems, along with strategies to mitigate such effects, but little has been done so far to systematically study the problem on FPGA-based SoCs. This work provides an in-depth analysis of memory interference on such systems, targeting two state-of-the-art commercial FPGA SoCs. We also discuss architectural support for Controlled Memory Request Injection (CMRI), a technique that has proven effective at reducing the bandwidth under-utilization implied by naive schemes that solve the interference problem by only allowing mutually exclusive access to the shared resources. Our experimental results show that: i) memory interference can slow down CPU tasks by up to 16× in the tested FPGA-based SoCs; ii) CMRI allows exploiting more than 40% of the memory bandwidth available to FPGA accelerators (normally completely unused in PREM-like schemes), keeping the slowdown due to interference below 10%.

2021 - A Full-Featured, Enhanced Cost Function to Mitigate Motion Sickness in Semi- and Fully-autonomous Vehicles [Conference proceedings paper]
Moazen, I; Burgio, P

Current fully and semi-autonomous car prototypes increasingly feature complex algorithms for lateral and longitudinal control of the vehicle. Unfortunately, in some cases, they might cause unpleasant and unwanted effects on the human body, such as motion sickness, ultimately harming passengers' comfort and driving experience. Motion sickness is due to a conflict between visual and vestibular inputs, and in the worst case might cause loss of control over one's movements and a reduced ability to anticipate the direction of movement. In this paper, we focus on the five main physical characteristics that affect motion sickness and include them in the cost function, to provide a quality experience to vehicle passengers. We implemented our approach in a state-of-the-art Model Predictive Controller, to be used in a real Autonomous Vehicle. Preliminary tests using the Unreal Engine simulator have already shown that our approach is viable and effective. We evaluated it using Motion Sickness Dose Value and Illness Rating, and then tested it on our embedded platform, an NVIDIA Jetson AGX Xavier, which is representative of the next-generation AV Domain Controller.

2021 - Performance modeling of heterogeneous HW platforms [Journal article]
Rehm, F.; Dasari, D.; Hamann, A.; Pressler, M.; Ziegenbein, D.; Seitter, J.; Sanudo, I.; Capodieci, N.; Burgio, P.; Bertogna, M.

The push towards automated and connected driving functionalities mandates the use of heterogeneous hardware platforms in order to provide the required computational resources. For these platforms, established methods for performance modeling in industry are no longer effective or adequate. In this paper, we explore the detailed problem of mapping a prototypical autonomous driving application on an Nvidia Tegra X2 platform while considering different constraints of the application, including end-to-end latencies of event chains spanning CPU and GPU boundaries. With the given use-case and platform, we propose modeling concepts in Amalthea, capturing the architectural aspects of heterogeneous platforms and also the execution structure of the application. These models can be fed into appropriate tools to predict performance properties. We proposed the above problem in the Workshop on Analysis Tools and Methodologies for Embedded and Real-time Systems (WATERS) Industrial Challenge 2019 and, in response, academics came up with different solutions. In this paper, we evaluate these different solutions and summarize all approaches. The lessons learned from this challenge are then used to improve on the simplifying assumptions we made in our original formulation and to discuss future modeling extensions.

2021 - Programmable systems for intelligence in automobiles (PRYSTINE): Final results after Year 3 [Conference proceedings paper]
Druml, N.; Ryabokon, A.; Schorn, R.; Koszescha, J.; Ozols, K.; Levinskis, A.; Novickis, R.; Nigussie, E.; Isoaho, J.; Solmaz, S.; Stettinger, G.; Diaz, S.; Marcano, M.; Villagra, J.; Medina, J.; Schwarz, M.; Artunedo, A.; Comi, M.; Beekelaar, R.; Ozcelik, O.; Tasdelen, E. A.; Gurbuz, Y.; Saijets, J.; Kyynarainen, J.; Morits, D.; Debaillie, B.; Rykunov, M.; Escamilla, J.; Vanne, J.; Korhonen, T.; Holma, K.; Matzhold, E. -M.; Novara, C.; Tango, F.; Burgio, P.; Calafiore, G.; Karimshoushtari, M.; Boulay, E.; Dhaens, M.; Praet, K.; Zwijnenberg, H.; Palm, H.; Ortega, D. A.; Kalali, E.; Pensala, T.; Kyytinen, A.; Larsen, M.; Veledar, O.; Macher, G.; Lafer, M.; Giraudi, L.; Reckenzaun, J.; Hammer, D.; Mohan, N.; Schmid, J.; Hoss, A.; Ophir, S.; Dubey, A.; Fuchs, J.; Lubke, M.; Anghel, A.; Ristea, N. -C.; Torngren, M.; Musralina, A.; Harter, M.; Jose, J. M.; Dimitrakopoulos, G.

Autonomous driving is disrupting the automotive industry as we know it today. For this, fail-operational behavior is essential in the sense, plan, and act stages of the automation chain in order to handle safety-critical situations on its own, which currently is not reached with state-of-the-art approaches. The European ECSEL research project PRYSTINE realizes Fail-operational Urban Surround perceptION (FUSION) based on robust Radar and LiDAR sensor fusion and control functions in order to enable safe automated driving in urban and rural environments. This paper showcases some of the key exploitable results (e.g., novel Radar sensors, innovative embedded control and E/E architectures, pioneering sensor fusion approaches, AI-controlled vehicle demonstrators) achieved until its final year 3.

2021 - SPHERE: A Multi-SoC Architecture for Next-Generation Cyber-Physical Systems Based on Heterogeneous Platforms [Journal article]
Biondi, A.; Casini, D.; Cicero, G.; Borgioli, N.; Buttazzo, G.; Patti, G.; Leonardi, L.; Bello, L. L.; Solieri, M.; Burgio, P.; Olmedo, I. S.; Ruocco, A.; Palazzi, L.; Bertogna, M.; Cilardo, A.; Mazzocca, N.; Mazzeo, A.

This paper presents SPHERE, a project aimed at the realization of an integrated framework to abstract the hardware complexity of interconnected, modern system-on-chips (SoC) and simplify the management of their heterogeneous computational resources. The SPHERE framework leverages hypervisor technology to virtualize computational resources and isolate the behavior of different subsystems running on the same platform, while providing safety, security, and real-time communication mechanisms. The main challenges addressed by SPHERE are discussed in the paper along with a set of new technologies developed in the context of the project. They include isolation mechanisms for mixed-criticality applications, predictable I/O virtualization, the management of time-sensitive networks with heterogeneous traffic flows, and the management of field-programmable gate arrays (FPGA) to provide efficient implementations for cryptography modules, as well as hardware acceleration for deep neural networks. The SPHERE architecture is validated through an autonomous driving use-case.

2020 - An automatic scenario generator for validation of automated valet parking systems [Conference proceedings paper]
Tagliavini, A.; Ferraro, D.; Kloda, T.; Burgio, P.

A primary goal of self-driving car manufacturers is to create an autonomous car system that is clearly and demonstrably safer than an average human-controlled car. Real-world tests are expensive, time-consuming and potentially dangerous, so virtual simulation is required. Autonomous valet parking is expected to be the first commercially available automated driving function without a human driver at the wheel (SAE Level 4). Although many simulation solutions for the automotive market already exist, none of them features parking environments. In this paper, we propose a new virtual scenario generator for parking sites. The tool populates synthetic parking maps with objects and actions related to these environments: cars driving from the drop-off point towards the vacant slots, and randomly placed parked cars, each with a given probability of exiting its slot. The generated scenarios are in the OpenSCENARIO format and are fully simulated in the Virtual Test Drive simulator.

2020 - Artificial Neural Networks: The Missing Link Between Curiosity and Accuracy [Conference proceedings paper]
Franchini, G.; Burgio, P.; Zanni, L.

Artificial Neural Networks, as the name itself suggests, are biologically inspired algorithms designed to simulate the way in which the human brain processes information. Like neurons, which consist of a cell nucleus that receives input from other neurons through a web of input terminals, an Artificial Neural Network includes hundreds of single units (artificial neurons or processing elements), connected by coefficients (weights) and organized in layers. The power of neural computation comes from connecting neurons in a network: in fact, an Artificial Neural Network can manage multiple pieces of information at the same time. What is not fully understood is the most efficient way to train an Artificial Neural Network, and in particular the best mini-batch size to maximize accuracy while minimizing training time. The idea developed in this study has its roots in the biological world, which inspired the creation of Artificial Neural Networks in the first place. Humans have altered the face of the world through extraordinary adaptive and technological advances: those changes were made possible by our cognitive structure, particularly the ability to reason and build causal models of external events. This dynamism is made possible by a high degree of curiosity. In the biological world, and especially in human beings, curiosity arises from the constant search for knowledge and information: behaviours that support the information sampling mechanism range from the very small (initial mini-batch size) to the very elaborate and sustained (increasing mini-batch size). The goal of this project is to train an Artificial Neural Network by dynamically and adaptively increasing the mini-batch size (using a validation set); our hypothesis is that this training method will be more efficient (in terms of time and costs) than the ones implemented so far.

2020 - Graphic Interfaces in ADAS: From requirements to implementation [Conference proceedings paper]
Masola, A.; Gabbi, C.; Castellano, A.; Capodieci, N.; Burgio, P.

In this paper we report our experiences in designing and implementing a digital virtual cockpit to be installed as a component within the software stack of an Advanced Driving Assisted System (ADAS). Since in next-generation automotive embedded platforms both autonomous driving related workloads and virtual cockpit rendering tasks will co-run in a hypervisor-mediated environment, they will share computational resources. For this purpose, our work has been developed by following a requirement-driven approach in which regulations, usability and visual attractiveness requirements have to be taken into account by balancing their impact in terms of computational resources of the embedded platform in which such graphics interfaces are deployed. The graphic interfaces we realized consist of a set of 2D frames for the instrument cluster (for displaying the tachometer and the speedometer) and a screen area in which a 3D representation of the vehicle surroundings is rendered alongside driving directions and the point-cloud obtained through a LIDAR. All these components are able to alert the driver of imminent and/or nearby driving hazards.

2020 - Human-automation interaction through shared and traded control applications [Conference proceedings paper]
Marcano, M.; Diaz, S.; Perez, J.; Castellano, A.; Landini, E.; Tango, F.; Burgio, P.

Automated and Highly-automated Vehicles still need to interact with the driver at different cognitive levels. Those at SAE Level 1 or 2 keep the human in the loop all the time and require strong participation of the driver at the control level. To increase the driver's safety, trust and comfort with this kind of automation, systems with a strong cooperative component are needed. This paper introduces the design of a vehicle controller based on shared control, together with an arbitration system, and the design of a Human-Machine Interface (HMI) to foster mutual understanding between driver and automation in a lane-keeping task. The driver-automation cooperation is achieved through incremental support, along a continuous spectrum from manual to full automation. Additionally, the design of an HMI to support the driver in a takeover maneuver is presented. This functionality is a key component of SAE Level 3 and 4 vehicles.

2020 - Preface [Conference proceedings paper]
Steinhorst, S.; Deshmukh, J. V.; Ernst, R.; Saidi, S.; Ziegenbein, D.; Besselink, B.; Burgio, P.; Deshmukh, J. V.; Hamad, M.; Hamann, A.; Jin, X.; Lin, X.; Maggio, M.; Mundhenk, P.; Ramanathan, S.; Saidi, S.; Shanker, S.; Steinhorst, S.; Terechko, A.

2020 - Real-Time clustering and LiDAR-camera fusion on embedded platforms for self-driving cars [Conference proceedings paper]
Verucchi, M.; Bartoli, L.; Bagni, F.; Gatti, F.; Burgio, P.; Bertogna, M.

3D object detection and classification are crucial tasks for perception in Autonomous Driving (AD). To promptly and correctly react to environment changes and avoid hazards, it is of paramount importance to perform those operations with high accuracy and in real-time. One of the most widely adopted strategies to improve the detection precision is to fuse information from different sensors, such as cameras and LiDAR. However, sensor fusion is a computationally intensive task that may be difficult to execute in real-time on embedded platforms. In this paper, we present a new approach for LiDAR and camera fusion that is suitable for execution within the tight timing requirements of an autonomous driving system. The proposed method is based on a new clustering algorithm developed for the LiDAR point cloud, a new technique for the alignment of the sensors, and an optimization of the Yolo-v3 neural network. The efficiency of the proposed method is validated by comparing it against state-of-the-art solutions on commercial embedded platforms.

2020 - The Key Role of Memory in Next-Generation Embedded Systems for Military Applications [Conference proceedings paper]
Sañudo, Ignacio; Cortimiglia, Paolo; Miccio, Luca; Solieri, Marco; Burgio, Paolo; Di Biagio, Christian; Felici, Franco; Nuzzo, Giovanni; Bertogna, Marko

With the increasing use of multi-core platforms in safety-related domains, aircraft system integrators and authorities are concerned about the impact of concurrent access to shared resources on the Worst-Case Execution Time (WCET). This paper highlights the need for accurate memory-centric scheduling mechanisms that guarantee prioritized memory accesses to the real-time, safety-related components of the system. We implemented a software technique called cache coloring, which demonstrates that isolation at the timing and spatial level can be achieved by managing the lines that can be evicted in the cache. To show the effectiveness of this technique, the timing properties of a real application are considered as a use case. This application is made of parallel tasks that show different trade-offs between computation and memory loads.

2019 - An open source research framework for IoT-capable smart traffic lights [Conference proceedings paper]
Brilli, G.; Burgio, P.

Recent technological advances are completely reshaping the way we build our cities, and the way we enjoy them. Future smart cities will employ a number of smart sensors, which work cooperatively to deliver advanced services that improve security and quality of life. The capability of deploying and testing such technologies directly in the field is paramount to research, but comes with a significant effort in terms of time and cost. For this reason, we introduce an open-source design framework for highly-connected smart sensors, and we implement it in an advanced controller for traffic lights, providing a single component to support researchers and engineers from the earliest stages of development in laboratories through to research and testing in the field.

2019 - F1/10: An Open-Source Autonomous Cyber-Physical Platform [Working paper]
O'Kelly, Matthew; Sukhil, Varundev; Abbas, Houssam; Harkins, Jack; Kao, Chris; Vardhan Pant, Yash; Mangharam, Rahul; Agarwal, Dipshil; Behl, Madhur; Burgio, Paolo; Bertogna, Marko

In 2005 DARPA labeled the realization of viable autonomous vehicles (AVs) a grand challenge; a short time later the idea became a moonshot that could change the automotive industry. Today, the question of safety stands between reality and solved. Given the right platform the CPS community is poised to offer unique insights. However, testing the limits of safety and performance on real vehicles is costly and hazardous. The use of such vehicles is also outside the reach of most researchers and students. In this paper, we present F1/10: an open-source, affordable, and high-performance 1/10 scale autonomous vehicle testbed. The F1/10 testbed carries a full suite of sensors, perception, planning, control, and networking software stacks that are similar to full scale solutions. We demonstrate key examples of the research enabled by the F1/10 testbed, and how the platform can be used to augment research and education in autonomous systems, making autonomy more accessible.

2019 - PRYSTINE - Technical Progress after Year 1 [Conference proceedings paper]
Druml, N.; Veledar, O.; Macher, G.; Stettinger, G.; Selim, S.; Reckenzaun, J.; Diaz, S. E.; Marcano, M.; Villagra, J.; Beekelaar, R.; Jany-Luig, J.; Corredoira, M. M.; Burgio, P.; Ballato, C.; Debaillie, B.; Van Meurs, L.; Terechko, A.; Tango, F.; Ryabokon, A.; Anghel, A.; Icoglu, O.; Kumar, S. S.; Dimitrakopoulos, G.

Among the current trends that will affect society in the coming years, autonomous driving stands out as having the potential to disruptively change the automotive industry as we know it today. For this, fail-operational behavior is essential in the sense, plan, and act stages of the automation chain in order to handle safety-critical situations on its own, which is currently not achieved by state-of-the-art approaches, partly because reliable environment perception and sensor fusion are missing. PRYSTINE will realize Fail-operational Urban Surround perceptION (FUSION), which is based on robust Radar and LiDAR sensor fusion and control functions, in order to enable safe automated driving in urban and rural environments. In this paper, we detail the vision of the PRYSTINE project and we showcase the results achieved during the first year.

2019 - System Performance Modelling of Heterogeneous HW Platforms: An Automated Driving Case Study [Conference proceedings paper]
Wurst, F.; Dasari, D.; Hamann, A.; Ziegenbein, D.; Sanudo, I.; Capodieci, N.; Bertogna, M.; Burgio, P.

The push towards automated and connected driving functionalities mandates the use of heterogeneous HW platforms in order to provide the required computational resources. For these platforms, the established methods for performance modelling in industry are no longer effective. In this paper, we propose an initial modelling concept for heterogeneous platforms which can then be fed into appropriate tools to derive effective performance predictions. The approach is demonstrated for a prototypical automated driving application on the Nvidia Tegra X2 platform.

2018 - Convolutional Neural Networks on Embedded Automotive Platforms: A Qualitative Comparison [Conference proceedings paper]
Brilli, Gianluca; Burgio, Paolo; Bertogna, Marko

In the last decade, the rise of power-efficient, heterogeneous embedded platforms paved the way to the effective adoption of neural networks in several application domains. Especially, many-core accelerators (e.g., GPUs and FPGAs) are used to run Convolutional Neural Networks, e.g., in autonomous vehicles, and industry 4.0. At the same time, advanced research on neural networks is producing interesting results in computer vision applications, and NN packages for computer vision object detection and categorization such as YOLO, GoogleNet and AlexNet reached an unprecedented level of accuracy and performance. With this work, we aim at validating the effectiveness and efficiency of most recent networks on state-of-the-art embedded platforms, with commercial-off-the-shelf System-on-Chips such as the NVIDIA Tegra X2 and Xilinx Ultrascale+. In our vision, this work will support the choice of the most appropriate CNN package and computing system, and at the same time tries to “make some order” in the field.

2018 - Mapping, scheduling, and schedulability analysis [Book chapter/Essay]
Burgio, P.; Bertogna, M.; Melani, A.; Quinones, E.; Serrano, M. A.

This chapter presents how the P-SOCRATES framework addresses the issue of scheduling multiple real-time tasks (RT tasks), made of multiple and concurrent non-preemptable task parts. In its most generic form, the scheduling problem in the architectural framework is a dual problem: scheduling task-to-threads, and scheduling thread-to-core replication.

2017 - A software stack for next-generation automotive systems on many-core heterogeneous platforms [Journal article]
Burgio, Paolo; Bertogna, Marko; Capodieci, Nicola; Cavicchioli, Roberto; Sojka, Michal; Houdek, Přemysl; Marongiu, Andrea; Gai, Paolo; Scordino, Claudio; Morelli, Bruno

The next generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making those platforms the ideal candidate architecture. Integrating tens to hundreds of computing elements that run at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using “traditional” state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Having impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little, if any, use in safety-critical applications. Project Hercules will address this issue, and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which allows achieving an order-of-magnitude of predictable performance on top of cutting-edge Commercial-Off-The-Shelf components (COTS). The proposed software stack will let both real-time and non real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.

2017 - Adaptive coordination in autonomous driving: Motivations and perspectives [Conference proceedings paper]
Bertogna, Marko; Burgio, Paolo; Cabri, Giacomo; Capodieci, Nicola
abstract

As autonomous cars enter the mainstream, new research directions are opening up across several domains, from hardware design to control systems, from energy efficiency to computer vision. An exciting direction of research is the coordination of different vehicles, moving the focus from the single vehicle to a collective system. In this paper we propose some challenging examples that show the motivations for a coordination approach in autonomous driving. Moreover, we present some techniques borrowed from distributed artificial intelligence that can be exploited to tackle these challenges.

2016 - A Software Stack for Next-Generation Automotive Systems on Many-Core Heterogeneous Platforms [Conference proceedings paper]
Burgio, Paolo; Bertogna, Marko; Olmedo, Ignacio Sanudo; Gai, Paolo; Marongiu, Andrea; Sojka, Michal
abstract

The advent of commercial-off-the-shelf (COTS) heterogeneous many-core platforms is opening up a series of opportunities in the embedded computing market. Integrating multiple computing elements running at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption. These platforms can be successfully adopted to build the next generation of self-driving vehicles, where Advanced Driver Assistance Systems (ADAS) need to process unprecedentedly high computing workloads within low power budgets. Unfortunately, the current methodologies for providing real-time guarantees are ineffective when applied to the complex architectures of modern many-cores. Having impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little, if any, use to these applications. Project HERCULES will provide the required technological infrastructure to obtain an order-of-magnitude improvement in the cost and power consumption of next-generation automotive systems. This paper presents the integrated software framework of the project, which allows achieving predictable performance on top of cutting-edge heterogeneous COTS platforms. The proposed software stack will let both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.

2016 - Simulating next-generation cyber-physical computing platforms [REPRINT] [Conference proceedings paper]
Burgio, P.; Alvarez, C.; Ayguade, E.; Filgueras, A.; Jimenez-Gonzalez, D.; Martorell, X.; Navarro, N.; Giorgi, R.
abstract

In specific domains, such as cyber-physical systems, platforms are quickly evolving to include multiple (many-)cores and programmable logic in a single system-on-chip, while including interfaces to commodity sensors/actuators. Programmable logic (e.g., FPGA) allows for greater flexibility and dependability. However, the task of extracting the performance/watt potential of heterogeneous many-cores is often delegated to the application level, and this has strong implications for the HW/SW co-design process. Enabling fast prototyping of a board being designed is paramount to enable low time-to-market for applications running on it and, ultimately, for the whole platform: programmers must be provided with accurate hardware models to support the software development cycle at the very early stages of the design process. Virtual platforms fulfill this need, provided that they can in turn be efficiently developed and tested within a timespan of a few months. In this position paper we share our experience in the sphere of the AXIOM project, identifying key properties that virtual platforms modeling next-generation cyber-physical systems should have to quickly enable simulation-based software development for these platforms.

2015 - A memory-centric approach to enable timing-predictability within embedded many-core accelerators [Conference proceedings paper]
Burgio, Paolo; Marongiu, Andrea; Valente, Paolo; Bertogna, Marko
abstract

There is an increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices within industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system. This concerns, in particular, concurrent accesses to shared memory and communication resources. Existing worst-case execution time (WCET) analyses are extremely pessimistic, especially when adopted for systems composed of hundreds-to-thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, taking the first step towards predictable many-core systems.

2015 - Efficient Implementation of Genetic Algorithms on GP-GPU with Scheduled Persistent CUDA Threads [Conference proceedings paper]
Capodieci, Nicola; Burgio, Paolo
abstract

In this paper we present a heavily exploration-oriented implementation of genetic algorithms to be executed on graphics processing units (GPUs), optimized with our novel mechanism for scheduling GPU-side synchronized jobs, which takes inspiration from the concept of persistent threads. Persistent threads allow an efficient distribution of workloads throughout the GPU so as to fully exploit the CUDA (NVIDIA's proprietary Compute Unified Device Architecture) architecture. Our approach (named Scheduled Light Kernel, SLK) uses a specifically designed data structure for issuing sequences of commands from the CPU to the GPU, which is able to minimize CPU-GPU communications, exploit streams of concurrent execution of different device-side functions within different Streaming Multiprocessors, and minimize kernel launch overhead. Results obtained on two completely different experimental settings show that our approach is able to dramatically increase the performance of the tested genetic algorithms compared to the baseline implementation that (while still running on a GPU) does not exploit our proposed approach. Our proposed SLK approach does not require substantial code rewriting and is also compared to newly introduced features in the latest CUDA development toolkit, such as nested kernel invocations for dynamic parallelism.

2015 - P-SOCRATES: A parallel software framework for time-critical many-core systems [Journal article]
Pinho, Luís Miguel; Nélis, Vincent; Yomsi, Patrick Meumeu; Quiñones, Eduardo; Bertogna, Marko; Burgio, Paolo; Marongiu, Andrea; Scordino, Claudio; Gai, Paolo; Ramponi, Michele; Mardiak, Michal
abstract

The current generation of computing platforms is embracing multi-core and many-core processors to improve the overall performance of the system while meeting the stringent energy budgets requested by the market. Parallel programming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche in the high-performance computing (HPC) field, but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high performance coming from both the HPC and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real time, challenging the performance capabilities of current architectures. This converging demand raises the problem of how to guarantee timing requirements in the presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both the HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives.

2015 - Simulating next-generation cyber-physical computing platforms [Conference proceedings paper]
Burgio, P.; Alvarez, C.; Ayguade, E.; Filgueras, A.; Jimenez-Gonzalez, D.; Martorell, X.; Navarro, N.; Giorgi, R.
abstract

2014 - A HLS-Based Toolflow to Design Next-Generation Heterogeneous Many-Core Platforms with Shared Memory [Conference proceedings paper]
Burgio, Paolo; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

This work describes how we use High-Level Synthesis (HLS) to support design space exploration (DSE) of heterogeneous many-core systems. Modern embedded systems increasingly couple hardware accelerators and processing cores on the same chip, to trade specialization of the platform to an application domain for increased performance and energy efficiency. However, the process of designing such a platform is complex and error-prone, and requires skills in algorithmic aspects, hardware synthesis, and software engineering. DSE can partially be automated, and thus simplified, by coupling the use of HLS tools and virtual prototyping platforms. In this paper we enable the design space exploration of heterogeneous many-cores adopting a shared-memory architecture template, where communication and synchronization between the hardware accelerators and the cores happen through L1 shared memory. This communication infrastructure leverages a "zero-copy" scheme, which simplifies both the design process of the platform and the development of applications on top of it. Moreover, the shared-memory template perfectly fits the semantics of several high-level programming models, such as OpenMP. We provide programmers with simple yet powerful abstractions to exploit accelerators from within an OpenMP application, and propose a low-cost implementation of the necessary runtime support. An HLS-based automatic design flow is set up to quickly explore the design space using a cycle-accurate virtual platform.

2014 - A tightly-coupled hardware controller to improve scalability and programmability of shared-memory heterogeneous clusters [Conference proceedings paper]
Burgio, Paolo; Danilo, Robin; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

Modern designs for embedded many-core systems increasingly include application-specific units to accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency compared to software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW Processing Units (HWPUs) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a high number of HWPUs in such a system poses two main challenges, namely architectural scalability and programmability. In this paper we describe an optimized Data Pump (DP), which connects several accelerators to a restricted set of communication ports and acts as a virtualization layer for programming, exposing FIFO queues to offload “HW tasks” to them through a set of lightweight APIs. In this work, we aim at optimizing both these mechanisms, to reduce module area and to make the programming sequence easier and lighter, respectively.

2014 - Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters [Conference proceedings paper]
Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; Marongiu, Andrea; Benini, Luca
abstract

Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a block tightly coupled to the processing elements (PEs) within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, which is fundamental to achieving fast dynamic scheduling and, ultimately, to enabling fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.

2013 - Architecture and programming model support for efficient heterogeneous computing on tightly-coupled shared-memory clusters [Conference proceedings paper]
Burgio, P.; Marongiu, A.; Danilo, R.; Coussy, P.; Benini, L.
abstract

Modern computer vision and image processing embedded systems exploit hardware acceleration inside scalable parallel architectures, such as tightly-coupled clusters, to achieve stringent performance and energy efficiency targets. Architectural heterogeneity typically makes software development cumbersome, thus shared-memory processor-to-accelerator communication is usually preferred to simplify code offloading to HW IPs for critical computational kernels. However, tightly coupling a large number of accelerators and processors in a shared-memory cluster is a challenging task, since the complexity of the resulting system quickly becomes too large. We tackle these issues by proposing a template for a heterogeneous shared-memory cluster which scales to a large number of accelerators, achieving up to 40% better performance/area/watt than simply designing larger main interconnects to accommodate several HW IPs. In addition, following a trend towards standardization of acceleration capabilities of future embedded systems, we develop a programming model which simplifies application development for heterogeneous clusters.

2013 - Enabling fine-grained OpenMP tasking on tightly-coupled shared memory clusters [Conference proceedings paper]
Burgio, Paolo; Tagliavini, Giuseppe; Marongiu, Andrea; Benini, Luca
abstract

Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, efficiently supporting it on cluster-based embedded SoCs is not easy, because the fine-grained parallel workload present in embedded applications cannot tolerate high memory and run-time overheads. In this paper we present our design of the runtime support layer for OpenMP tasking on an embedded shared memory cluster, identifying key aspects of achieving performance and discussing important architectural support for removing major bottlenecks.

2013 - Variation-tolerant OpenMP tasking on tightly-coupled processor clusters [Conference proceedings paper]
Rahimi, Abbas; Marongiu, Andrea; Burgio, Paolo; Gupta, Rajesh K.; Benini, Luca
abstract

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advance across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22]; consequently, the instructions per cycle (IPC) of a 16-core processor cluster are increased by up to 1.51× (1.17× on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16) and across a wide temperature range (ΔT=90°C).

2012 - Fast and lightweight support for nested parallelism on cluster-based embedded many-cores [Conference proceedings paper]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

Several recent many-core accelerators have been architected as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used (with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level), which makes memory operations non-uniform (NUMA). Nested parallelism represents a powerful programming abstraction for these architectures, where a first level of parallelism can be used to distribute coarse-grained tasks to clusters, and additional levels of fine-grained parallelism can be distributed to processors within a cluster. This paper presents a lightweight and highly optimized support for nested parallelism on cluster-based embedded many-cores. We assess the costs of enabling multi-level parallelization and demonstrate that our techniques allow extracting high degrees of parallelism.

2012 - OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters [Conference proceedings paper]
Burgio, Paolo; Marongiu, Andrea; Heller, D.; Chavet, C.; Coussy, P.; Benini, Luca
abstract

Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. To assist effective design of such systems there is a need, on the one hand, for clear methodologies to streamline accelerator definition and instantiation and, on the other, for architectural templates and runtime techniques that minimize processor-to-accelerator communication costs. In this paper we present an architecture featuring tightly-coupled processors and accelerators, with zero-copy communication. Efficient programming is supported by an extended OpenMP programming model, where custom directives allow specializing code regions for execution on parallel cores, accelerators, or a mix of the two. Our integrated approach enables fast yet accurate exploration of accelerator-based HW and SW architectures.

2011 - Bus access design for combined worst and average case execution time optimization of predictable real-time applications on multiprocessor systems-on-chip [Conference proceedings paper]
Rosen, J.; Neikter, C. -F.; Eles, P.; Peng, Z.; Burgio, P.; Benini, L.
abstract

Optimization techniques for improving the average-case execution time of an application, for which predictability with respect to time is not required, have been investigated for a long time in many different contexts. However, this has traditionally been done without paying attention to the worst-case execution time. For predictable real-time applications, on the other hand, the focus has been solely on worst-case execution time optimization, ignoring how this affects the execution time in the average case. In this paper, we show that having a good average-case delay can be important also for real-time applications for which predictability is required. Furthermore, for real-time applications running on multiprocessor systems-on-chip, we present a technique for optimizing the average case and the worst case simultaneously, allowing for a good average-case execution time while still keeping the worst case as small as possible.

2011 - MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor [Conference proceedings paper]
Franceschelli, A.; Burgio, P.; Tagliavini, G.; Marongiu, A.; Ruggiero, M.; Lombardi, M.; Bonfietti, A.; Milano, M.; Benini, L.
abstract

We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.

2011 - Supporting OpenMP on a multi-cluster embedded MPSoC [Journal article]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection.

2010 - Adaptive TDMA bus allocation and elastic scheduling: A unified approach for enhancing robustness in multi-core RT systems [Conference proceedings paper]
Burgio, P.; Ruggiero, M.; Esposito, F.; Marinoni, M.; Buttazzo, G.; Benini, L.
abstract

Next-generation real-time systems will be increasingly based on heterogeneous MPSoC design paradigms, where predictability and performance will be key issues to deal with. Such issues can be tackled both at the hardware level, by embedding technologies such as TDMA busses, and at the OS level, where suitable scheduling techniques can improve performance and reduce energy consumption. Among these, elastic scheduling has been proven to provide satisfactory results by dynamically reducing task periods at run-time to ensure the highest possible utilization of the processors. On the other hand, elastic scheduling lowers the degree of predictability and increases the complexity of the analysis at the system level. This reduces the benefits given by the TDMA bus, which relies on high-level task analysis for a robust and efficient slot allocation. Starting from this consideration, we propose a system where elastic scheduling and the TDMA bus work synergistically. We introduce a QoS-aware adaptive bus service which takes the best of both techniques while mitigating their drawbacks. We show that the overhead introduced by the coordination action is small and is in any case outweighed by the benefits of the overall strategy in terms of performance and predictability guarantees.

2010 - Evaluating OpenMP Support Costs on MPSoCs [Conference proceedings paper]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

The ever-increasing complexity of MPSoCs is making the production of software the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim at facilitating application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware. In this paper we present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that efficiently exploit the memory hierarchy. Experimental results on different benchmarks confirm the effectiveness of the optimizations in terms of performance improvements.

2010 - Vertical stealing: robust, locality-aware do-all workload distribution for 3D MPSoCs [Conference proceedings paper]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

In this paper we address the issue of efficient do-all workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency and high-bandwidth access to multiple, large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from neighbors by implementing runtime support for work stealing.
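
The two-step approach described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the round-based simulation, the input format (one list of data-owning processor ids per loop iteration) and the definition of "neighbors" as adjacent processor ids are all assumptions made for illustration only.

```python
from collections import Counter, deque


def locality_schedule(accesses, num_procs):
    """Step 1 (assumed model): give each iteration to the processor
    that owns most of the data the iteration touches."""
    queues = [deque() for _ in range(num_procs)]
    for iteration, owners in enumerate(accesses):
        # most_common(1) returns the most frequent owner; ties go to
        # the owner that appears first in `owners`.
        best_owner = Counter(owners).most_common(1)[0][0]
        queues[best_owner].append(iteration)
    return queues


def run_with_stealing(queues):
    """Step 2 (assumed model): round-based simulation in which every
    processor runs one iteration per round; an idle processor steals
    one iteration from its fuller adjacent neighbor."""
    num_procs = len(queues)
    executed = [[] for _ in range(num_procs)]
    while any(queues):
        for p in range(num_procs):
            if queues[p]:
                # Local work first, taken from the head of the queue.
                executed[p].append(queues[p].popleft())
                continue
            # Idle: inspect the left and right neighbors only.
            neighbors = [n for n in (p - 1, p + 1) if 0 <= n < num_procs]
            victim = max(neighbors, key=lambda n: len(queues[n]), default=None)
            if victim is not None and queues[victim]:
                # Steal from the tail of the victim's queue.
                executed[p].append(queues[victim].pop())
    return executed


if __name__ == "__main__":
    # Hypothetical loop with 6 iterations on 2 processors; each inner
    # list names the owners of the data touched by that iteration.
    accesses = [[0, 0, 1], [0], [0, 0], [0, 1, 0], [1], [0]]
    queues = locality_schedule(accesses, num_procs=2)
    print([list(q) for q in queues])
    print(run_with_stealing(queues))
```

In this toy input the locality step alone leaves processor 1 with a single iteration, so the stealing step is what rebalances the load; stolen iteration numbers show up in the second processor's list.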

Università degli studi di Modena e Reggio Emilia