
PAOLO BURGIO

Fixed-term researcher (art. 24, para. 3, letter A) at: Dipartimento di Ingegneria "Enzo Ferrari"




Publications

2020 - An automatic scenario generator for validation of automated valet parking systems [Conference paper]
Tagliavini, A.; Ferraro, D.; Kloda, T.; Burgio, P.
abstract

A primary goal of self-driving car manufacturers is to create an autonomous car system that is clearly and demonstrably safer than an average human-controlled car. Real-world tests are expensive, time-consuming and potentially dangerous; virtual simulation is therefore required. Autonomous valet parking is expected to be the first commercially available automated driving function without a human driver at the wheel (SAE Level 4). Although many simulation solutions for the automotive market already exist, none of them features parking environments. In this paper, we propose a new virtual scenario generator for parking sites. The tool populates synthetic parking maps with objects and actions related to these environments: cars driving from the drop-off point towards the vacant slots, and randomly placed parked cars, each with a given probability of exiting its slot. The generated scenarios are in the OpenSCENARIO format and are fully simulated in the Virtual Test Drive simulator.
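The populating step described in this abstract can be illustrated with a minimal sketch. It is not the authors' tool: the names `ParkedCar` and `generate_scenario`, the flat list of numbered slots, and the `occupancy` parameter are all assumptions made for illustration, and no OpenSCENARIO output is produced.

```python
import random
from dataclasses import dataclass


@dataclass
class ParkedCar:
    slot: int          # index of the occupied parking slot (hypothetical layout)
    exit_prob: float   # probability that this car leaves its slot


def generate_scenario(num_slots, occupancy, exit_prob, seed=None):
    """Randomly place parked cars and decide which of them will exit.

    All parameters are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    num_parked = round(num_slots * occupancy)
    # Pick distinct slots at random for the parked cars.
    occupied = sorted(rng.sample(range(num_slots), num_parked))
    occupied_set = set(occupied)
    parked = [ParkedCar(slot=s, exit_prob=exit_prob) for s in occupied]
    # Remaining slots are the targets for cars arriving from the drop-off point.
    vacant = [s for s in range(num_slots) if s not in occupied_set]
    # Each parked car exits with its own given probability.
    exiting = [car.slot for car in parked if rng.random() < car.exit_prob]
    return {"parked": parked, "vacant": vacant, "exiting": exiting}
```

Passing a fixed `seed` makes a generated scenario reproducible, which is convenient when the same scenario must be replayed in a simulator.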


2020 - Artificial Neural Networks: The Missing Link Between Curiosity and Accuracy [Conference paper]
Franchini, G.; Burgio, P.; Zanni, L.
abstract

Artificial Neural Networks, as the name itself suggests, are biologically inspired algorithms designed to simulate the way in which the human brain processes information. Like neurons, which consist of a cell nucleus that receives input from other neurons through a web of input terminals, an Artificial Neural Network includes hundreds of single units (artificial neurons or processing elements), connected by coefficients (weights) and organized in layers. The power of neural computation comes from connecting neurons in a network: in fact, an Artificial Neural Network can handle multiple pieces of information at the same time. What is not fully understood is the most efficient way to train an Artificial Neural Network, and in particular what the best mini-batch size is to maximize accuracy while minimizing training time. The idea developed in this study has its roots in the biological world, which inspired the creation of Artificial Neural Networks in the first place. Humans have altered the face of the world through extraordinary adaptive and technological advances: those changes were made possible by our cognitive structure, particularly the ability to reason and build causal models of external events. This dynamism is made possible by a high degree of curiosity. In the biological world, and especially in human beings, curiosity arises from the constant search for knowledge and information: behaviours that support the information sampling mechanism range from the very small (initial mini-batch size) to the very elaborate and sustained (increasing mini-batch size). The goal of this project is to train an Artificial Neural Network by dynamically and adaptively increasing the mini-batch size (using a validation set); our hypothesis is that this training method will be more efficient (in terms of time and costs) than those implemented so far.
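One way to picture a validation-driven mini-batch schedule is the sketch below. The abstract does not state the actual update rule, so everything here is an assumption: the function name `next_batch_size`, the doubling factor, the cap, and the rule "grow the batch when the validation loss stops improving".

```python
def next_batch_size(batch_size, val_losses, growth=2, max_size=1024, tol=1e-3):
    """Return the mini-batch size to use for the next epoch.

    Hypothetical rule: if the validation loss improved by less than `tol`
    over the last epoch, multiply the batch size by `growth` (capped at
    `max_size`); otherwise keep the current size.
    """
    if len(val_losses) < 2:
        # Not enough history yet to judge progress.
        return batch_size
    improvement = val_losses[-2] - val_losses[-1]
    if improvement < tol:
        return min(batch_size * growth, max_size)
    return batch_size
```

In a training loop, this would be called once per epoch after evaluating on the validation set, starting from a small initial mini-batch size.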


2020 - Human-automation interaction through shared and traded control applications [Conference paper]
Marcano, M.; Diaz, S.; Perez, J.; Castellano, A.; Landini, E.; Tango, F.; Burgio, P.
abstract

Automated and Highly-automated Vehicles still need to interact with the driver at different cognitive levels. Those at SAE Level 1 or 2 keep the human in the loop all the time and require strong participation of the driver at the control level. To increase the driver's safety, trust and comfort with this kind of automation, systems with a strong cooperative component are needed. This paper introduces the design of a vehicle controller based on shared control, together with an arbitration system, and the design of a Human-Machine Interface (HMI) to foster mutual understanding between driver and automation in a lane-keeping task. The driver-automation cooperation is achieved through incremental support, in a continuous spectrum from manual to full automation. Additionally, the design of an HMI to support the driver in a takeover maneuver is presented. This functionality is a key component of SAE Level 3 and 4 vehicles.


2020 - Real-Time clustering and LiDAR-camera fusion on embedded platforms for self-driving cars [Conference paper]
Verucchi, M.; Bartoli, L.; Bagni, F.; Gatti, F.; Burgio, P.; Bertogna, M.
abstract

3D object detection and classification are crucial tasks for perception in Autonomous Driving (AD). To promptly and correctly react to environment changes and avoid hazards, it is of paramount importance to perform those operations with high accuracy and in real-time. One of the most widely adopted strategies to improve detection precision is to fuse information from different sensors, e.g., cameras and LiDAR. However, sensor fusion is a computationally intensive task that may be difficult to execute in real-time on embedded platforms. In this paper, we present a new approach for LiDAR and camera fusion that is suitable for execution within the tight timing requirements of an autonomous driving system. The proposed method is based on a new clustering algorithm developed for the LiDAR point cloud, a new technique for the alignment of the sensors, and an optimization of the Yolo-v3 neural network. The efficiency of the proposed method is validated by comparing it against state-of-the-art solutions on commercial embedded platforms.


2020 - The Key Role of Memory in Next-Generation Embedded Systems for Military Applications [Conference paper]
Sañudo, Ignacio; Cortimiglia, Paolo; Miccio, Luca; Solieri, Marco; Burgio, Paolo; Di Biagio, Christian; Felici, Franco; Nuzzo, Giovanni; Bertogna, Marko
abstract

With the increasing use of multi-core platforms in safety-related domains, aircraft system integrators and authorities are concerned about the impact of concurrent accesses to shared resources on the Worst-Case Execution Time (WCET). This paper highlights the need for accurate memory-centric scheduling mechanisms that guarantee prioritized memory accesses to the Real-Time safety-related components of the system. We implemented a software technique called cache coloring, which demonstrates that isolation at the timing and spatial levels can be achieved by managing the lines that can be evicted in the cache. To show the effectiveness of this technique, the timing properties of a real application are considered as a use case; this application is made of parallel tasks that show different trade-offs between computation and memory loads.
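As general background on cache coloring (not the paper's implementation), the sketch below shows how a page color is commonly derived from a physical address in a physically indexed, set-associative cache. The function names and the example cache geometry are assumptions chosen for illustration.

```python
def num_colors(cache_size, ways, page_size=4096):
    """Number of page colors for a physically indexed, set-associative cache.

    One cache way covers cache_size / ways bytes; each page-sized slice of a
    way is one color.
    """
    way_size = cache_size // ways
    return max(1, way_size // page_size)


def page_color(phys_addr, cache_size, ways, page_size=4096):
    """Color of the physical page containing `phys_addr`.

    Pages with different colors map to disjoint cache sets, so they cannot
    evict each other's lines.
    """
    page_number = phys_addr // page_size
    return page_number % num_colors(cache_size, ways, page_size)
```

With a hypothetical 2 MiB, 16-way cache and 4 KiB pages there are 32 colors. An allocator that gives a critical task only pages of colors it does not share with other tasks keeps that task's cache lines from being evicted by them.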


2019 - An open source research framework for IoT-capable smart traffic lights [Conference paper]
Brilli, G.; Burgio, P.
abstract

Recent technological advances are completely reshaping the way we build our cities, and the way we enjoy them. Future smart cities will employ a number of smart sensors, which work cooperatively to deliver advanced services that improve security and quality of life. The capability of deploying and testing such technologies directly in the field is paramount to research; however, it comes with a significant effort in terms of time and cost. For this reason, we introduce an open-source design framework for highly-connected smart sensors, and we implement it in an advanced traffic light controller, providing a single component to support researchers and engineers from the earliest stages of development in laboratories through to research and testing in the field.


2019 - PRYSTINE - Technical Progress after Year 1 [Conference paper]
Druml, N.; Veledar, O.; Macher, G.; Stettinger, G.; Selim, S.; Reckenzaun, J.; Diaz, S. E.; Marcano, M.; Villagra, J.; Beekelaar, R.; Jany-Luig, J.; Corredoira, M. M.; Burgio, P.; Ballato, C.; Debaillie, B.; Van Meurs, L.; Terechko, A.; Tango, F.; Ryabokon, A.; Anghel, A.; Icoglu, O.; Kumar, S. S.; Dimitrakopoulos, G.
abstract

Among the current trends that will affect society in the coming years, autonomous driving stands out as having the potential to disruptively change the automotive industry as we know it today. For this, fail-operational behavior is essential in the sense, plan, and act stages of the automation chain in order to handle safety-critical situations on its own. State-of-the-art approaches do not currently achieve this, partly because reliable environment perception and sensor fusion are missing. PRYSTINE will realize Fail-operational Urban Surround perceptION (FUSION), which is based on robust Radar and LiDAR sensor fusion and control functions, in order to enable safe automated driving in urban and rural environments. In this paper, we detail the vision of the PRYSTINE project and we showcase the results achieved during the first year.


2019 - System Performance Modelling of Heterogeneous HW Platforms: An Automated Driving Case Study [Conference paper]
Wurst, F.; Dasari, D.; Hamann, A.; Ziegenbein, D.; Sanudo, I.; Capodieci, N.; Bertogna, M.; Burgio, P.
abstract

The push towards automated and connected driving functionalities mandates the use of heterogeneous HW platforms in order to provide the required computational resources. For these platforms, the established methods for performance modelling in industry are no longer effective. In this paper, we propose an initial modelling concept for heterogeneous platforms which can then be fed into appropriate tools to derive effective performance predictions. The approach is demonstrated for a prototypical automated driving application on the Nvidia Tegra X2 platform.


2018 - Convolutional Neural Networks on Embedded Automotive Platforms: A Qualitative Comparison [Conference paper]
Brilli, Gianluca; Burgio, Paolo; Bertogna, Marko
abstract

In the last decade, the rise of power-efficient, heterogeneous embedded platforms paved the way to the effective adoption of neural networks in several application domains. In particular, many-core accelerators (e.g., GPUs and FPGAs) are used to run Convolutional Neural Networks, e.g., in autonomous vehicles and Industry 4.0. At the same time, advanced research on neural networks is producing interesting results in computer vision applications, and NN packages for computer vision object detection and categorization such as YOLO, GoogleNet and AlexNet have reached an unprecedented level of accuracy and performance. With this work, we aim at validating the effectiveness and efficiency of the most recent networks on state-of-the-art embedded platforms, with commercial-off-the-shelf System-on-Chips such as the NVIDIA Tegra X2 and Xilinx Ultrascale+. In our vision, this work will support the choice of the most appropriate CNN package and computing system, and at the same time it tries to “make some order” in the field.


2018 - Mapping, scheduling, and schedulability analysis [Book chapter/Essay]
Burgio, P.; Bertogna, M.; Melani, A.; Quinones, E.; Serrano, M. A.
abstract

This chapter presents how the P-SOCRATES framework addresses the issue of scheduling multiple real-time tasks (RT tasks), made of multiple and concurrent non-preemptable task parts. In its most generic form, the scheduling problem in the architectural framework is a dual problem: scheduling task-to-threads, and scheduling thread-to-core replication.


2017 - A software stack for next-generation automotive systems on many-core heterogeneous platforms [Journal article]
Burgio, Paolo; Bertogna, Marko; Capodieci, Nicola; Cavicchioli, Roberto; Sojka, Michal; Houdek, Přemysl; Marongiu, Andrea; Gai, Paolo; Scordino, Claudio; Morelli, Bruno
abstract

The next generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making those platforms the ideal candidate architecture. Integrating tens to hundreds of computing elements that run at lower frequencies yields impressive performance at a reduced power consumption that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using “traditional” state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Having impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little if any use in safety-critical applications. Project Hercules will address this issue, and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which allows achieving an order-of-magnitude of predictable performance on top of cutting-edge Commercial-Off-The-Shelf (COTS) components. The proposed software stack will let both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.


2017 - Adaptive coordination in autonomous driving: Motivations and perspectives [Conference paper]
Bertogna, Marko; Burgio, Paolo; Cabri, Giacomo; Capodieci, Nicola
abstract

As autonomous cars are entering the mainstream, new research directions are opening that involve several domains, from hardware design to control systems, from energy efficiency to computer vision. An exciting direction of research is represented by the coordination of the different vehicles, moving the focus from the single vehicle to a collective system. In this paper we propose some challenging examples that show the motivations for a coordination approach in autonomous driving. Moreover, we present some techniques borrowed from distributed artificial intelligence that can be exploited to tackle the previously mentioned challenges.


2016 - A Software Stack for Next-Generation Automotive Systems on Many-Core Heterogeneous Platforms [Conference paper]
Burgio, Paolo; Bertogna, Marko; Olmedo, Ignacio Sanudo; Gai, Paolo; Marongiu, Andrea; Sojka, Michal
abstract

The advent of commercial-off-the-shelf (COTS) heterogeneous many-core platforms is opening up a series of opportunities in the embedded computing market. Integrating multiple computing elements running at lower frequencies yields impressive performance at a reduced power consumption. These platforms can be successfully adopted to build the next generation of self-driving vehicles, where Advanced Driver Assistance Systems (ADAS) need to process unprecedentedly high computing workloads at low power budgets. Unfortunately, the current methodologies for providing real-time guarantees are ineffective when applied to the complex architectures of modern many-cores. Having impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little if any use to these applications. Project HERCULES will provide the required technological infrastructure to obtain an order-of-magnitude improvement in the cost and power consumption of next-generation automotive systems. This paper presents the integrated software framework of the project, which allows achieving predictable performance on top of cutting-edge heterogeneous COTS platforms. The proposed software stack will let both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.


2016 - Simulating next-generation cyber-physical computing platforms [REPRINT] [Conference paper]
Burgio, P.; Alvarez, C.; Ayguade, E.; Filgueras, A.; Jimenez-Gonzalez, D.; Martorell, X.; Navarro, N.; Giorgi, R.
abstract

In specific domains, such as cyber-physical systems, platforms are quickly evolving to include multiple (many-) cores and programmable logic in a single system-on-chip, while including interfaces to commodity sensors/actuators. Programmable Logic (e.g., FPGA) allows for greater flexibility and dependability. However, the task of extracting the performance/watt potential of heterogeneous many-cores is often demanded at the application level, and this has strong implications for the HW/SW co-design process. Enabling fast prototyping of a board being designed is paramount to enable low time-to-market for applications running on it, and ultimately, for the whole platform: programmers must be provided with accurate hardware models, to support the software development cycle at the very early stages of the design process. Virtual platforms fulfill this need, provided that they can in turn be efficiently developed and tested within a timespan of a few months. In this position paper we share our experience in the sphere of the AXIOM project, identifying key properties that virtual platforms modeling next-generation cyber-physical systems should have to quickly enable simulation-based software development for these platforms.


2015 - A memory-centric approach to enable timing-predictability within embedded many-core accelerators [Conference paper]
Burgio, Paolo; Marongiu, Andrea; Valente, Paolo; Bertogna, Marko
abstract

There is an increasing interest among real-time systems architects for multi- and many-core accelerated platforms. The main obstacle towards the adoption of such devices within industrial settings is related to the difficulties in tightly estimating the multiple interferences that may arise among the parallel components of the system. This in particular concerns concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds-to-thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications, if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters mostly affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, moving the first step towards predictable many-core systems.


2015 - Efficient Implementation of Genetic Algorithms on GP-GPU with Scheduled Persistent CUDA Threads [Conference paper]
Capodieci, Nicola; Burgio, Paolo
abstract

In this paper we present a heavily exploration-oriented implementation of genetic algorithms to be executed on graphics processing units (GPUs), optimized with our novel mechanism for scheduling GPU-side synchronized jobs, which takes inspiration from the concept of persistent threads. Persistent Threads allow an efficient distribution of workloads throughout the GPU so as to fully exploit the CUDA (NVIDIA's proprietary Compute Unified Device Architecture) architecture. Our approach (named Scheduled Light Kernel, SLK) uses a specifically designed data structure for issuing sequences of commands from the CPU to the GPU. It is able to minimize CPU-GPU communications, exploit streams of concurrent execution of different device-side functions within different Streaming Multiprocessors, and minimize kernel launch overhead. Results obtained on two completely different experimental settings show that our approach is able to dramatically increase the performance of the tested genetic algorithms compared to the baseline implementation that (while still running on a GPU) does not exploit our proposed approach. Our proposed SLK approach does not require substantial code rewriting and is also compared to newly introduced features in the latest CUDA development toolkit, such as nested kernel invocations for dynamic parallelism.


2015 - P-SOCRATES: A parallel software framework for time-critical many-core systems [Journal article]
Pinho, Luís Miguel; Nélis, Vincent; Yomsi, Patrick Meumeu; Quiñones, Eduardo; Bertogna, Marko; Burgio, Paolo; Marongiu, Andrea; Scordino, Claudio; Gai, Paolo; Ramponi, Michele; Mardiak, Michal
abstract

The current generation of computing platforms is embracing multi-core and many-core processors to improve the overall performance of the system, meeting at the same time the stringent energy budgets requested by the market. Parallel programming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche in the high performance computing (HPC) field, but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high-performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. This converging demand raises the problem of how to guarantee timing requirements in the presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives.


2015 - Simulating next-generation cyber-physical computing platforms [Conference paper]
Burgio, P.; Alvarez, C.; Ayguade, E.; Filgueras, A.; Jimenez-Gonzalez, D.; Martorell, X.; Navarro, N.; Giorgi, R.
abstract

In specific domains, such as cyber-physical systems, platforms are quickly evolving to include multiple (many-) cores and programmable logic in a single system-on-chip, while including interfaces to commodity sensors/actuators. Programmable Logic (e.g., FPGA) allows for greater flexibility and dependability. However, the task of extracting the performance/watt potential of heterogeneous many-cores is often demanded at the application level, and this has strong implications for the HW/SW co-design process. Enabling fast prototyping of a board being designed is paramount to enable low time-to-market for applications running on it, and ultimately, for the whole platform: programmers must be provided with accurate hardware models, to support the software development cycle at the very early stages of the design process. Virtual platforms fulfill this need, provided that they can in turn be efficiently developed and tested within a timespan of a few months. In this position paper we share our experience in the sphere of the AXIOM project, identifying key properties that virtual platforms modeling next-generation cyber-physical systems should have to quickly enable simulation-based software development for these platforms.


2014 - A HLS-Based Toolflow to Design Next-Generation Heterogeneous Many-Core Platforms with Shared Memory [Conference paper]
Burgio, Paolo; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

This work describes how we use High-Level Synthesis (HLS) to support design space exploration (DSE) of heterogeneous many-core systems. Modern embedded systems increasingly couple hardware accelerators and processing cores on the same chip, to trade specialization of the platform to an application domain for increased performance and energy efficiency. However, the process of designing such a platform is complex and error-prone, and requires skills in algorithmic aspects, hardware synthesis, and software engineering. DSE can partially be automated, and thus simplified, by coupling the use of HLS tools and virtual prototyping platforms. In this paper we enable the design space exploration of heterogeneous many-cores adopting a shared-memory architecture template, where communication and synchronization between the hardware accelerators and the cores happen through L1 shared memory. This communication infrastructure leverages a "zero-copy" scheme, which simplifies both the design process of the platform and the development of applications on top of it. Moreover, the shared-memory template perfectly fits the semantics of several high-level programming models, such as OpenMP. We provide programmers with simple yet powerful abstractions to exploit accelerators from within an OpenMP application, and propose a low-cost implementation of the necessary runtime support. An HLS-based automatic design flow is set up, to quickly explore the design space using a cycle-accurate virtual platform.


2014 - A tightly-coupled hardware controller to improve scalability and programmability of shared-memory heterogeneous clusters [Conference paper]
Burgio, Paolo; Danilo, Robin; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

Modern designs for embedded many-core systems increasingly include application-specific units to accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency compared to software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW Processing Units (HWPUs) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a high number of HWPUs in such a system poses two main challenges, namely, architectural scalability and programmability. In this paper we describe an optimized Data Pump (DP) which connects several accelerators to a restricted set of communication ports, and acts as a virtualization layer for programming, exposing FIFO queues to offload “HW tasks” to them through a set of lightweight APIs. In this work, we aim at optimizing both these mechanisms, to reduce module area and to make the programming sequence easier and lighter, respectively.


2014 - Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters [Conference paper]
Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; Marongiu, Andrea; Benini, Luca
abstract

Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.


2013 - Architecture and programming model support for efficient heterogeneous computing on tightly-coupled shared-memory clusters [Conference paper]
Burgio, P.; Marongiu, A.; Danilo, R.; Coussy, P.; Benini, L.
abstract

Modern computer vision and image processing embedded systems exploit hardware acceleration inside scalable parallel architectures, such as tightly-coupled clusters, to achieve stringent performance and energy efficiency targets. Architectural heterogeneity typically makes software development cumbersome, thus shared memory processor-to-accelerator communication is typically preferred to simplify code offloading to HW IPs for critical computational kernels. However, tightly coupling a large number of accelerators and processors in a shared memory cluster is a challenging task, since the complexity of the resulting system quickly becomes too large. We tackle these issues by proposing a template of heterogeneous shared memory cluster which scales to a large number of accelerators, achieving up to 40% better performance/area/watt than simply designing larger main interconnects to accommodate several HW IPs. In addition, following a trend towards standardization of acceleration capabilities of future embedded systems, we develop a programming model which simplifies application development for heterogeneous clusters.


2013 - Enabling Fine-Grained OpenMP Tasking on Tightly-Coupled Shared Memory Clusters. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013 [Conference paper]
Burgio, Paolo; Tagliavini, Giuseppe; Marongiu, Andrea; Benini, Luca
abstract

Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, efficiently supporting it on cluster-based embedded SoCs is not easy, because the fine-grained parallel workload present in embedded applications cannot tolerate high memory and run-time overheads. In this paper we present our design of the runtime support layer to OpenMP tasking for an embedded shared memory cluster, identifying key aspects to achieving performance and discussing important architectural support to removing major bottlenecks.


2013 - Variation-tolerant OpenMP Tasking on Tightly-coupled Processor Clusters. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013 [Conference paper]
Rahimi, Abbas; Marongiu, Andrea; Burgio, Paolo; Gupta, Rajesh K.; Benini, Luca
abstract

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advance across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) Hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) Software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22]; consequently, the instructions per cycle (IPC) of a 16-core processor cluster is increased by up to 1.51× (1.17× on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16), and across a wide temperature range (ΔT=90°C).


2012 - Fast and lightweight support for nested parallelism on cluster-based embedded many-cores [Conference proceedings paper]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

Several recent many-core accelerators have been architected as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used (with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level), which makes memory operations non-uniform (NUMA). Nested parallelism represents a powerful programming abstraction for these architectures, where a first level of parallelism can be used to distribute coarse-grained tasks to clusters, and additional levels of fine-grained parallelism can be distributed to processors within a cluster. This paper presents lightweight and highly optimized support for nested parallelism on cluster-based embedded many-cores. We assess the costs of enabling multi-level parallelization and demonstrate that our techniques allow high degrees of parallelism to be extracted.
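
As an illustration of the two-level distribution described in this abstract, the following is a minimal sketch that statically splits a loop first across clusters and then across the cores of each cluster. It is not the runtime presented in the paper: the function names `split` and `nested_partition`, and the choice of a static, contiguous partitioning, are assumptions made only for this example.

```python
def split(start, stop, parts):
    """Split [start, stop) into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks receive one additional iteration.
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def nested_partition(n_iters, n_clusters, cores_per_cluster):
    """Hypothetical two-level plan: (cluster, core) -> half-open iteration range."""
    plan = {}
    # Outer level: one coarse-grained chunk per cluster.
    for cluster, (lo, hi) in enumerate(split(0, n_iters, n_clusters)):
        # Inner level: fine-grained split of that chunk across the cluster's cores.
        for core, sub_range in enumerate(split(lo, hi, cores_per_cluster)):
            plan[(cluster, core)] = sub_range
    return plan
```

Calling `nested_partition(10, 2, 2)` gives each of the two clusters five iterations and then divides those between its two cores, so every core works on a range that is local to its own cluster's chunk.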


2012 - OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters [Conference proceedings paper]
Burgio, Paolo; Marongiu, Andrea; Heller, D.; Chavet, C.; Coussy, P.; Benini, Luca
abstract

Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. Effective design of such systems requires, on the one hand, clear methodologies to streamline accelerator definition and instantiation and, on the other, architectural templates and runtime techniques that minimize processor-to-accelerator communication costs. In this paper we present an architecture featuring tightly-coupled processors and accelerators, with zero-copy communication. Efficient programming is supported by an extended OpenMP programming model, where custom directives allow code regions to be specialized for execution on parallel cores, accelerators, or a mix of the two. Our integrated approach enables fast yet accurate exploration of accelerator-based HW and SW architectures.


2011 - Bus access design for combined worst and average case execution time optimization of predictable real-time applications on multiprocessor systems-on-chip [Conference proceedings paper]
Rosen, J.; Neikter, C. -F.; Eles, P.; Peng, Z.; Burgio, P.; Benini, L.
abstract

Optimization techniques for improving the average-case execution time of an application, for which predictability with respect to time is not required, have been investigated for a long time in many different contexts. However, this has traditionally been done without paying attention to the worst-case execution time. For predictable real-time applications, on the other hand, the focus has been solely on worst-case execution time optimization, ignoring how this affects the execution time in the average case. In this paper, we show that having a good average-case delay can be important also for real-time applications for which predictability is required. Furthermore, for real-time applications running on multiprocessor systems-on-chip, we present a technique for optimizing the average case and the worst case simultaneously, allowing for a good average-case execution time while still keeping the worst case as small as possible.


2011 - MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor [Conference proceedings paper]
Franceschelli, A.; Burgio, P.; Tagliavini, G.; Marongiu, A.; Ruggiero, M.; Lombardi, M.; Bonfietti, A.; Milano, M.; Benini, L.
abstract

We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.


2011 - Supporting OpenMP on a multi-cluster embedded MPSoC [Journal article]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection.


2010 - Adaptive TDMA bus allocation and elastic scheduling: A unified approach for enhancing robustness in multi-core RT systems [Conference proceedings paper]
Burgio, P.; Ruggiero, M.; Esposito, F.; Marinoni, M.; Buttazzo, G.; Benini, L.
abstract

Next-generation real-time systems will be increasingly based on heterogeneous MPSoC design paradigms, where predictability and performance will be key issues to deal with. Such issues can be tackled both at the hardware level, by embedding technologies such as TDMA buses, and at the OS level, where suitable scheduling techniques can improve performance and reduce energy consumption. Among these, elastic scheduling has been shown to provide satisfactory results by dynamically reducing task periods at run-time to ensure the highest possible utilization of the processors. On the other hand, elastic scheduling lowers the degree of predictability and increases the complexity of the analysis at the system level. This reduces the benefits given by the TDMA bus, which relies on high-level task analysis for a robust and efficient slot allocation. Starting from this consideration, we propose a system where elastic scheduling and the TDMA bus work synergistically. We introduce a QoS-aware adaptive bus service which takes the best of both techniques while mitigating their drawbacks. We show that the overhead introduced by the coordination is small and is outweighed by the benefits of the overall strategy in terms of performance and predictability guarantees.


2010 - Evaluating OpenMP Support Costs on MPSoCs [Conference proceedings paper]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

The ever-increasing complexity of MPSoCs is making the production of software the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim at facilitating application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware. In this paper we present an extensive evaluation of the cost associated with supporting OpenMP on such machines, investigating several implementation variants that efficiently exploit the memory hierarchy. Experimental results on different benchmarks confirm the effectiveness of the optimizations in terms of performance improvements.


2010 - Vertical stealing: robust, locality-aware do-all workload distribution for 3D MPSoCs [Conference proceedings paper]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

In this paper we address the issue of efficient do-all workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency and high-bandwidth access to multiple, large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from neighbors by implementing runtime support for work stealing.
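
The two-step approach in this abstract can be sketched as a small simulation. This is only an illustration under stated assumptions, not the compiler pass or runtime from the paper: the names `locality_schedule` and `run_with_stealing`, the per-iteration access lists, the round-based execution model, and the policy of stealing from the tail of the longest neighbor queue are all hypothetical choices made for the example.

```python
from collections import Counter, deque


def locality_schedule(accesses, owner, num_procs):
    """Step 1 (assumed model): give each iteration to the processor that owns
    most of the data blocks it touches.

    accesses[i] lists the data blocks referenced by iteration i;
    owner[b] is the processor whose local memory holds block b.
    """
    queues = [deque() for _ in range(num_procs)]
    for iteration, blocks in enumerate(accesses):
        counts = Counter(owner[b] for b in blocks)
        # Ties are broken in favor of the lowest processor index.
        best = max(range(num_procs), key=lambda p: (counts[p], -p))
        queues[best].append(iteration)
    return queues


def run_with_stealing(queues, neighbors):
    """Step 2 (assumed model): execute in rounds; a processor with an empty
    queue steals one iteration from the tail of its longest neighbor queue."""
    executed = [[] for _ in queues]
    while any(queues):
        for proc, queue in enumerate(queues):
            if queue:
                executed[proc].append(queue.popleft())
            else:
                victims = [n for n in neighbors[proc] if queues[n]]
                if victims:
                    victim = max(victims, key=lambda n: len(queues[n]))
                    executed[proc].append(queues[victim].pop())
    return executed
```

With two processors where most iterations touch data owned by processor 0, step 1 produces an unbalanced assignment, and step 2 lets processor 1 pick up part of processor 0's remaining work once its own queue is empty.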