
ANDREA MARONGIU

Associate Professor
Dipartimento di Scienze Fisiche, Informatiche e Matematiche (sede ex-Matematica)




Publications

2023 - Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-based Heterogeneous SoCs [Conference paper]
Brilli, G.; Valente, G.; Capotondi, A.; Burgio, P.; Di Mascio, T.; Valente, P.; Marongiu, A.


2022 - An FPGA Overlay for Efficient Real-Time Localization in 1/10th Scale Autonomous Vehicles [Conference paper]
Bernardi, Andrea; Brilli, Gianluca; Capotondi, Alessandro; Marongiu, Andrea; Burgio, Paolo

Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Unlike software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully-functional 1/10th-scale racing autonomous car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution from both the hardware and software viewpoints, while providing better performance and energy efficiency compared to state-of-the-art solutions.


2022 - Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs [Journal article]
Brilli, Gianluca; Cavicchioli, Roberto; Solieri, Marco; Valente, Paolo; Marongiu, Andrea

High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoC) that couple multi-core CPUs with accelerators such as GPU, FPGA, or AI engines. Adopting HeSoCs in the context of real-time workloads is not immediately possible, though, as contention on shared resources like the memory hierarchy—and in particular the main memory (DRAM)—causes unpredictable latency increase. To tackle this problem, both the research community and certification authorities mandate (i) that accesses from parallel threads to the shared system resources (typically, main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees, but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently-proposed bandwidth limitation concept that builds on top of a mutually-exclusive schedule but still allows the threads currently not entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has been discussed in the context of a multi-core CPU, but the same principle applies also to a more complex system such as an HeSoC. In this article, we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept incarnation of both schemes on two HeSoCs: an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and the costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference among the host CPU and the accelerators.
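The bandwidth-regulation flavor of CMRI can be pictured with a small sketch. The token-bucket model below is purely illustrative (class and parameter names are hypothetical, not the paper's implementation): tasks not holding exclusive DRAM access may still inject a bounded number of memory requests per regulation period, soaking up otherwise unused bandwidth without endangering the timing guarantee of the privileged task.

```python
# Illustrative CMRI-style bandwidth regulator (hypothetical, not the paper's code):
# a non-privileged task may inject at most `budget` memory requests per
# regulation period; further requests stall until the next period begins.
class Regulator:
    def __init__(self, budget):
        self.budget = budget   # requests allowed per regulation period
        self.tokens = budget   # requests still allowed in this period

    def new_period(self):
        self.tokens = self.budget

    def try_inject(self):
        """Return True if a memory request may be issued now, False if it stalls."""
        if self.tokens == 0:
            return False
        self.tokens -= 1
        return True

reg = Regulator(budget=4)
issued = sum(reg.try_inject() for _ in range(10))
# Only 4 of the 10 requests are injected this period; the rest wait.
```

Tuning `budget` corresponds to choosing how much residual bandwidth the regulated tasks may consume, the knob the article characterizes experimentally on the TX2 and UltraScale+ platforms.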


2022 - Reconciling QoS and Concurrency in NVIDIA GPUs via Warp-Level Scheduling [Conference paper]
Singh, J.; Olmedo, I. S.; Capodieci, N.; Marongiu, A.; Caccamo, M.

The widespread deployment of NVIDIA GPUs in latency-sensitive systems today requires predictable GPU multi-tasking, which cannot be trivially achieved. The NVIDIA CUDA API allows programmers to easily exploit the processing power provided by these massively parallel accelerators and is one of the major reasons behind their ubiquity. However, NVIDIA GPUs and the CUDA programming model favor throughput instead of latency and timing predictability. Hence, providing real-time and quality-of-service (QoS) properties to GPU applications presents an interesting research challenge. Such a challenge is paramount when considering simultaneous multikernel (SMK) scenarios, wherein kernels are executed concurrently within each streaming multiprocessor (SM). In this work, we explore QoS-based fine-grained multitasking in SMK via job arbitration at the lowest level of the GPU scheduling hierarchy, i.e., between warps. We present QoS-aware warp scheduling (QAWS) and evaluate it against state-of-the-art, kernel-agnostic policies seen in NVIDIA hardware today. Since the NVIDIA ecosystem lacks a mechanism to specify and enforce kernel priority at the warp granularity, we implement and evaluate our proposed warp scheduling policy on GPGPU-Sim. QAWS not only improves the response time of the higher priority tasks but also has comparable or better throughput than the state-of-the-art policies.


2022 - Understanding and Mitigating Memory Interference in FPGA-based HeSoCs [Conference paper]
Brilli, G.; Capotondi, A.; Burgio, P.; Marongiu, A.

Like most high-end embedded systems, FPGA-based systems-on-chip (SoC) are increasingly adopting heterogeneous designs, where CPU cores, the configurable logic and other ICs all share interconnect and main memory (DRAM) controller. This paradigm is scalable and reduces production costs and time-to-market, but creates resource contention issues, which ultimately affects the programs' timing. This problem has been widely studied on CPU- and GPU-based systems, along with strategies to mitigate such effects, but little has been done so far to systematically study the problem on FPGA-based SoCs. This work provides an in-depth analysis of memory interference on such systems, targeting two state-of-the-art commercial FPGA SoCs. We also discuss architectural support for Controlled Memory Request Injection (CMRI), a technique that has proven effective at reducing the bandwidth under-utilization implied by naive schemes that solve the interference problem by only allowing mutually exclusive access to the shared resources. Our experimental results show that: i) memory interference can slow down CPU tasks by up to 16× in the tested FPGA-based SoCs; ii) CMRI allows exploiting more than 40% of the memory bandwidth available to FPGA accelerators (normally completely unused in PREM-like schemes), keeping the slowdown due to interference below 10%.


2021 - A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment [Conference paper]
Bellocchi, Gianluca; Capotondi, Alessandro; Conti, Francesco; Marongiu, Andrea

Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core.


2021 - A Taxonomy of Modern GPGPU Programming Methods: On the Benefits of a Unified Specification [Journal article]
Capodieci, N.; Cavicchioli, R.; Marongiu, A.

Several Application Programming Interfaces (APIs) and frameworks have been proposed to simplify the development of General-Purpose GPU (GPGPU) applications. GPGPU application development typically involves specific customization for the target operating systems and hardware devices. The effort to port applications from one API to the other (or to develop multi-target applications) is complicated by the availability of a plethora of specifications, which in essence offers very similar underlying functionality. In this work we provide an in-depth study of six state-of-the-art GPGPU APIs. From these we derive a taxonomy of the common semantics and propose a unified specification. We describe a methodology to translate this unified specification into different target APIs. This simplifies cross-platform application development and provides a clean framework for benchmarking. Our proposed unified specification is called GUST (GPGPU Unified Specification and Translation) and it captures common functionality found in compute-only APIs (e.g., CUDA and OpenCL), in the compute pipeline of traditional graphic-oriented APIs (e.g., OpenGL and Direct3D11) and in last-generation bare-metal APIs (e.g., Vulkan and Direct3D12). The proposed translation methodology solves differences between specific APIs in a transparent manner, without hiding available tuning knobs for compute kernel optimizations and fostering best programming practices in a simple manner.


2021 - HePREM: A Predictable Execution Model for GPU-based Heterogeneous SoCs [Journal article]
Forsberg, B.; Benini, L.; Marongiu, A.

The ever-increasing need for computational power in embedded devices has led to the adoption of heterogeneous SoCs combining a general-purpose CPU with a data-parallel accelerator. These systems rely on a shared main memory (DRAM), which makes them highly susceptible to memory interference. A promising software technique to counter such effects is the Predictable Execution Model (PREM). PREM ensures robustness to interference by separating programs into a sequence of memory and compute phases, and by enforcing a platform-level schedule where only a single processing subsystem is permitted to execute a memory phase at a time. This article demonstrates for the first time how PREM can be applied to heterogeneous SoCs, based on a synchronization technique for memory isolation between CPU and GPU plus a compiler to transform GPU kernels into PREM-compliant code. For compute-bound GPU workloads sharing the DRAM bandwidth 50/50 with the CPU we guarantee near-zero timing variability at a performance loss of just 59 percent, which is one to two orders of magnitude smaller than the worst case we see for unmodified programs under memory interference.


2021 - The Predictable Execution Model in Practice: Compiling Real Applications for COTS Hardware [Journal article]
Forsberg, B.; Solieri, M.; Bertogna, M.; Benini, L.; Marongiu, A.

Adoption of multi- and many-core processors in real-time systems has so far been slowed down, if not totally barred, due to the difficulty in providing analytical real-time guarantees on worst-case execution times. The Predictable Execution Model (PREM) has been proposed to solve this problem, but its practical support requires significant code refactoring, a task better suited for a compilation tool chain than human programmers. Implementing a PREM compiler presents significant challenges to conform to PREM requirements, such as guaranteed upper bounds on memory footprint and the generation of efficient schedulable non-preemptive regions. This article presents a comprehensive description of how a PREM compiler can be implemented, based on several years of experience from the community. We provide accumulated insights on how to best balance conformance to real-time requirements and performance and present novel techniques that extend the applicability from simple benchmark suites to real-world applications. We show that code transformed by the PREM compiler enables timing-predictable execution on modern commercial off-the-shelf hardware, providing novel insights on how PREM can protect 99.4% of memory accesses on random replacement policy caches at only 16% performance loss on benchmarks from the PolyBench benchmark suite. Finally, we show that the requirements imposed on the programming model are well-aligned with current coding guidelines for timing critical software, promoting easy adoption.
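The phase separation at the core of PREM can be sketched in a few lines. This is a conceptual model only (not the compiler's actual output): data is first prefetched into local storage during a memory phase, serialized system-wide so only one task touches DRAM at a time; the compute phase then runs exclusively on local data and is immune to DRAM contention.

```python
from threading import Lock

# Conceptual PREM sketch (illustrative, not the PREM compiler's output).
# `dram_phase` models the platform-level rule that only one task may
# execute a memory phase at any time.
dram_phase = Lock()

def prem_task(dram_data):
    with dram_phase:                  # memory phase: exclusive DRAM access
        local = list(dram_data)       # prefetch into (modeled) local memory
    return sum(x * x for x in local)  # compute phase: no DRAM traffic

result = prem_task([1, 2, 3])  # computes 1 + 4 + 9 = 14
```

The compiler's real job, as the article details, is to carve actual application code into such phases automatically while bounding the memory footprint of each prefetched region.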


2021 - Unmanned vehicles in smart farming: A survey and a glance at future horizons [Conference paper]
Madronal, D.; Palumbo, F.; Capotondi, A.; Marongiu, A.

Smart and Precision Agriculture is nowadays already exploiting advanced drones and machinery, but in the near future more complex and intelligent applications are expected to be brought on board to improve production both qualitatively and quantitatively. In this paper, we present an overview of the current usage of autonomous drones in this field and of the augmented computing capabilities that they could count on when companion computers are coupled to flight controllers. The paper also presents a novel architecture for companion computers that is under development within the Comp4Drones ECSEL-JU project.


2020 - A Synergistic Approach to Predictable Compilation and Scheduling on Commodity Multi-Cores [Conference paper]
Forsberg, B.; Mattheeuws, M.; Kurth, A.; Marongiu, A.; Benini, L.

Commodity multi-cores are still uncommon in real-time systems, as resource sharing complicates traditional timing analysis. The Predictable Execution Model (PREM) tackles this issue in software, through scheduling and code refactoring. State-of-the-art PREM compilers analyze tasks one at a time, maximizing task-level performance metrics, and are oblivious to system-level scheduling effects (e.g. memory serialization when tasks are co-scheduled). We propose a solution that allows PREM code generation and system scheduling to interact, based on a genetic algorithm aimed at maximizing overall system performance. Experiments on commodity hardware show that the performance increase can be as high as 31% compared to standard PREM code generation, without negatively impacting the predictability guarantees.


2020 - Dissecting the CUDA scheduling hierarchy: A Performance and Predictability Perspective [Conference paper]
Olmedo, I. S.; Capodieci, N.; Martinez, J. L.; Marongiu, A.; Bertogna, M.

Over the last few years, the ever-increasing use of Graphic Processing Units (GPUs) in safety-related domains has opened up many research problems in the real-time community. The closed and proprietary nature of the scheduling mechanisms deployed in NVIDIA GPUs, for instance, represents a major obstacle in deriving a proper schedulability analysis for latency-sensitive applications. Existing literature addresses these issues by either (i) providing simplified models for heterogeneous CPU-GPU systems and their associated scheduling policies, or (ii) providing insights about these arbitration mechanisms obtained through reverse engineering. In this paper, we take one step further by correcting and consolidating previously published assumptions about the hierarchical scheduling policies of NVIDIA GPUs and their proprietary CUDA application programming interface. We also discuss how such mechanisms evolved with recently released GPU micro-architectures, and how such changes influence the scheduling models to be exploited by real-time system engineers.


2020 - Evaluating Controlled Memory Request Injection to Counter PREM Memory Underutilization [Conference paper]
Cavicchioli, R.; Capodieci, N.; Solieri, M.; Bertogna, M.; Valente, P.; Marongiu, A.

Modern heterogeneous systems-on-chip (HeSoC) feature high-performance multi-core CPUs tightly integrated with data-parallel accelerators. Such HeSoCs heavily rely on shared resources, which hinder their adoption in the context of real-time systems. The predictable execution model (PREM) has proven effective at preventing uncontrolled execution time lengthening due to memory interference in HeSoCs sharing main memory (DRAM). However, PREM only allows one task at a time to access memory, which inherently under-utilizes the available memory bandwidth in modern HeSoCs. In this paper, we conduct a thorough experimental study aimed at assessing the potential benefits of extending PREM so as to inject controlled amounts of memory requests coming from tasks other than the one currently granted exclusive DRAM access. Focusing on a state-of-the-art HeSoC, the NVIDIA TX2, we extensively characterize the relation between the injected bandwidth and the latency experienced by the task under test. The results confirm that for various types of workload it is possible to exploit the available bandwidth much more efficiently than standard PREM arbitration, often close to its maximum, while keeping latency inflation below 10%. We discuss possible practical implementation directions, highlighting the expected benefits and technical challenges.


2020 - FlexFloat: A Software Library for Transprecision Computing [Journal article]
Tagliavini, Giuseppe; Marongiu, Andrea; Benini, Luca

In recent years approximate computing has been extensively explored as a paradigm to design hardware and software solutions that save energy by trading off on the quality of the computed results. In applications that involve numerical computations with wide dynamic range, precision tuning of floating-point (FP) variables is a key knob to leverage the energy/quality trade-off of program results. This aspect assumes maximum relevance in the transprecision computing scenario, where accuracy of data is tuned at fine grain in application code. Performing precision tuning at fine grain requires a software development flow that streamlines the assessment of which variables have “precision slack” within an application. In this paper we introduce FlexFloat, an open-source software library that has been expressly designed to aid the development of transprecision applications. FlexFloat provides a C/C++ interface for supporting multiple FP formats. Unlike alternative libraries, FlexFloat enables control of the bit-width of mantissa and exponent fields and provides advanced features for the collection of runtime statistics, reducing the FP emulation time compared to state-of-the-art solutions. Its design allows emulating the behavior of standard IEEE FP types and custom extensions for reduced-precision computation. This makes the library suitable for adoption in multiple contexts, from manual exploration to integration into automatic tools. Experimental findings demonstrate that our approach can be used to perform a complete precision analysis from which multiple program versions can be derived, depending on the energy/quality trade-off. Furthermore, we show that the adoption of our methodology can lead to a significant reduction of energy consumption even on current commercial hardware (an embedded GPGPU).
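The core idea behind reduced-precision emulation can be illustrated in a few lines. The sketch below is NOT FlexFloat's actual API (the function name and rounding scheme are hypothetical); it merely shows the effect of storing a binary32 value with fewer mantissa bits, which is the kind of "precision slack" exploration the library streamlines.

```python
import struct

# Conceptual reduced-precision emulation (illustrative, not FlexFloat's API):
# round a binary32 value so only `mant_bits` mantissa bits are kept,
# mimicking storage in a narrower custom FP format.
def truncate_precision(x: float, mant_bits: int) -> float:
    drop = 23 - mant_bits  # binary32 carries 23 explicit mantissa bits
    if drop <= 0:
        return x
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    half = 1 << (drop - 1)
    bits = (bits + half) & ~((1 << drop) - 1)  # round to nearest, clear low bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# With 10 mantissa bits (binary16-like precision), pi loses its low-order bits:
approx = truncate_precision(3.14159265, 10)  # 3.140625
```

Running such an emulated type through an application and comparing outputs against the full-precision run is precisely the workflow that tells a developer which variables tolerate a narrower format.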


2020 - Mixed-data-model heterogeneous compilation and OpenMP offloading [Conference paper]
Kurth, A.; Wolters, K.; Forsberg, B.; Capotondi, A.; Marongiu, A.; Grosser, T.; Benini, L.

Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
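The address-width gap the abstract describes can be made concrete with a toy model (window base, size, and function names are invented for illustration, not taken from the paper): if the accelerator can only address a 32-bit window of the host's 64-bit space, a host pointer is exchanged as a 32-bit offset from an agreed window base.

```python
# Toy model of a 64+32-bit mixed-data-model system (illustrative only):
# the accelerator addresses a 32-bit window of the host address space,
# so host pointers are translated to offsets from the window base.
WINDOW_BASE = 0x0000_0040_0000_0000  # hypothetical shared-memory window base
WINDOW_SIZE = 1 << 32                # accelerator sees 32-bit addresses

def to_accel(host_ptr: int) -> int:
    """Translate a 64-bit host pointer to a 32-bit accelerator offset."""
    off = host_ptr - WINDOW_BASE
    assert 0 <= off < WINDOW_SIZE, "pointer outside accelerator window"
    return off

def to_host(accel_off: int) -> int:
    """Translate a 32-bit accelerator offset back to a host pointer."""
    return WINDOW_BASE + accel_off

p = WINDOW_BASE + 0x1234
round_trip = to_host(to_accel(p))  # == p
```

A mixed-data-model compiler, as presented in the paper, automates exactly this kind of pointer-width translation so that shared data structures remain valid on both sides of the offload boundary.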


2019 - Combining PREM compilation and static scheduling for high-performance and predictable MPSoC execution [Journal article]
Matejka, J.; Forsberg, B.; Sojka, M.; Sucha, P.; Benini, L.; Marongiu, A.; Hanzalek, Z.

Many applications require both high performance and predictable timing. High performance can be provided by COTS multi-core systems-on-chip (MPSoC); however, as cores in these systems share main memory, they are susceptible to interference from each other, which is a problem for timing predictability. We achieve predictability on multi-cores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases, and schedules these such that only a single core is executing a memory phase at a time. We present a toolchain consisting of a compiler and a scheduling tool. Our compiler uses region- and loop-based analysis and performs tiling to transform application code into PREM-compliant binaries. In addition to enabling predictable execution, the compiler transformation optimizes accesses to the shared main memory. The scheduling tool uses a state-of-the-art heuristic algorithm and is able to schedule industrial-size instances. For smaller instances, we compare the results of the algorithm with optimal solutions found by solving an integer linear programming model. Furthermore, we solve the problem of scheduling execution on multiple cores while preventing interference of memory phases. We evaluate our toolchain on Advanced Driver Assistance System (ADAS) application workloads running on an NVIDIA Tegra X1 embedded system-on-chip (SoC). The results show that our approach maintains similar average performance to the original (unmodified) program code and execution, while reducing variance of completion times by a factor of 9 with the identified optimal solutions and by a factor of 5 with schedules generated by our heuristic scheduler.


2019 - Design and Evaluation of SmallFloat SIMD extensions to the RISC-V ISA [Conference paper]
Tagliavini, G.; Mach, S.; Rossi, D.; Marongiu, A.; Benini, L.

RISC-V is an open-source instruction set architecture (ISA) with a modular design consisting of a mandatory base part plus optional extensions. The RISC-V 32IMFC ISA configuration has been widely adopted for the design of new-generation, low-power processors. Motivated by the important energy savings that smaller-than-32-bit FP types have enabled in several application domains and related compute platforms, some recent studies have published encouraging early results for their adoption in RISC-V processors. In this paper we introduce a set of ISA extensions for RISC-V 32IMFC, supporting scalar and SIMD operations (fitting the 32-bit register size) for 8-bit and two 16-bit FP types. The proposed extensions are enabled by exposing the new FP types to the standard C/C++ type system, and an implementation for the RISC-V GCC compiler is presented. As a further, novel contribution, we extensively characterize the performance and energy savings achievable with the proposed extensions. On average, experimental results show that their adoption provides benefits in terms of performance (1.64× speedup for 16-bit and 2.18× for 8-bit types) and energy consumption (30% saving for 16-bit and 50% for 8-bit types). We also illustrate an approach based on automatic precision tuning to make effective use of the new FP types.
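The sub-word SIMD idea behind these extensions can be sketched abstractly (this models lane packing in software with integer lanes for simplicity, not the proposed FP instructions themselves): two 16-bit lanes share one 32-bit register, so a single operation processes both lanes at once, which is where speedups on the order of the reported 1.64× come from.

```python
# Illustrative sub-word SIMD packing (conceptual model, not the proposed ISA):
# two 16-bit lanes live in one 32-bit register; one "instruction" operates
# on both lanes, with per-lane wraparound and no cross-lane carry.
MASK16 = 0xFFFF

def pack2(lo: int, hi: int) -> int:
    """Pack two 16-bit lane values into one 32-bit register image."""
    return (hi & MASK16) << 16 | (lo & MASK16)

def add2(a: int, b: int) -> int:
    """Lane-wise 16-bit addition: each lane wraps independently."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return hi << 16 | lo

v = add2(pack2(1, 2), pack2(10, 20))  # lanes become (11, 22)
```

In the real extensions the lanes hold 16-bit (or four 8-bit) FP values and the hardware performs the per-lane arithmetic, but the register-packing principle is the same.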


2019 - Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU [Journal article]
Vogel, Pirmin; Marongiu, Andrea; Benini, Luca

A key enabler for the ever-increasing adoption of FPGA accelerators is the availability of frameworks allowing for the seamless coupling to general-purpose host processors. Embedded FPGA+CPU systems still heavily rely on copy-based host-to-accelerator communication, which complicates application development. In this paper, we present a hardware/software framework for enabling transparent, shared virtual memory for FPGA accelerators in embedded SoCs. It can use a hard-macro IOMMU if available, or a configurable soft-core IOMMU that we provide. We explore different TLB configurations and provide a comparison with other designs for shared virtual memory to gain insight on performance-critical IOMMU components. Experimental results using pointer-rich benchmarks show that our framework not only simplifies FPGA-accelerated application development, it also achieves up to 13x speedup compared to traditional copy-based offloading.


2019 - Extending the Lifetime of Nano-Blimps via Dynamic Motor Control [Journal article]
Palossi, Daniele; Gomez, Andres; Draskovic, Stefan; Marongiu, Andrea; Thiele, Lothar; Benini, Luca

Nano-sized unmanned aerial vehicles (UAVs), e.g. quadcopters, have received significant attention in recent years. Although their capabilities have grown, they continue to have very limited flight times, tens of minutes at most. The main constraints are the battery’s energy density and the engine power required for flight. In this work, we present a nano-sized blimp platform, consisting of a helium balloon and a rotorcraft. Thanks to the lift provided by helium, the blimp requires relatively little energy to remain at a stable altitude. This lift, however, decreases with time as the balloon inevitably deflates requiring additional control mechanisms to keep the desired altitude. We study how duty-cycling high power actuators can further reduce the average energy requirements for hovering. With the addition of a solar panel, it is even feasible to sustain tens or hundreds of flight hours in modest lighting conditions. Furthermore, we study how a balloon’s deflation rate affects the blimp’s energy budget and lifetime. A functioning 68-gram prototype was thoroughly characterized and its lifetime was measured under different harvesting conditions and different power management strategies. Both our system model and the experimental results indicate our proposed platform requires less than 200 mW to hover indefinitely with an ideal balloon. With a non-ideal balloon the maximum lifetime of ∼400 h is bounded by the rotor’s maximum thrust. This represents, to the best of our knowledge, the first nano-size UAV for long term hovering with low power requirements.


2019 - Taming Data Caches for Predictable Execution on GPU-based SoCs [Conference paper]
Forsberg, B.; Benini, L.; Marongiu, A.

Heterogeneous SoCs (HeSoCs) typically share a single DRAM between the CPU and GPU, making workloads susceptible to memory interference, and predictable execution troublesome. State-of-the-art predictable execution models (PREM) for HeSoCs prefetch data to the GPU scratchpad memory (SPM), for computations to be insensitive to CPU-generated DRAM traffic. However, the amount of work that the small SPM sizes allow is typically insufficient to absorb CPU/GPU synchronization costs. On-chip caches are larger, and would solve this issue, but have been argued too unpredictable due to self-evictions. We show how self-eviction can be minimized in GPU caches via careful management of prefetches, thus lowering the performance cost, while retaining timing predictability.


2018 - A Transprecision Floating-Point Architecture for Energy-Efficient Embedded Computing [Conference paper]
Mach, Stefan; Rossi, Davide; Tagliavini, Giuseppe; Marongiu, Andrea; Benini, Luca

Ultra-low power computing is a key enabler of deeply embedded platforms used in domains such as distributed sensing, the internet of things, and wearable computing. The rising computational demands and high dynamic range of target algorithms often call for hardware support of floating-point (FP) arithmetic and high system energy efficiency. In light of transprecision computing, where accuracy of data is consciously changed during the execution of applications, custom FP types are being used to optimize a wide range of problems. We support two such custom types - one 16 bit and one 8 bit wide - together with IEEE binary16 as a set of 'smallFloat' formats. We present an FP arithmetic unit capable of performing basic operations on smallFloat formats as well as conversions. To boost performance and energy efficiency, the smallFloat unit is extended with SIMD-style vectorization support to operate on a conventional word width of 32 bit. Finally, it is added into the execution stage of a low-power 32-bit RISC-V processor core and integrated as part of an SoC in a 65nm process. We show that the energy efficiency for processing smallFloat data in this amended system is 18% higher than the binary32 baseline, thus enabling hardware-supported power savings for applications making use of transprecision.


2018 - A Transprecision Floating-Point Platform for Ultra-Low Power Computing [Conference paper]
Tagliavini, Giuseppe; Mach, Stefan; Rossi, Davide; Marongiu, Andrea; Benini, Luca

In modern low-power embedded platforms, floating-point (FP) operations emerge as a major contributor to the energy consumption of compute-intensive applications with large dynamic range. Experimental evidence shows that 50% of the energy consumed by a core and its data memory is related to FP computations. The adoption of FP formats requiring a lower number of bits is an interesting opportunity to reduce energy consumption, since it allows simplifying the arithmetic circuitry and reducing the memory bandwidth between memory and registers by enabling vectorization. From a theoretical point of view, the adoption of multiple FP types perfectly fits with the principle of transprecision computing, allowing fine-grained control of approximation while meeting specified constraints on the precision of final results. In this paper we propose an extended FP type system with complete hardware support to enable transprecision computing on low-power embedded processors, including two standard formats (binary32 and binary16) and two new formats (binary8 and binary16alt). First, we introduce a software library that enables exploration of FP types by tuning both precision and dynamic range of program variables. Then, we present a methodology to integrate our library with an external tool for precision tuning, and experimental results that highlight the clear benefits of introducing the new formats. Finally, we present the design of a transprecision FP unit capable of handling 8-bit and 16-bit operations in addition to standard 32-bit operations. Experimental results on FP-intensive benchmarks show that up to 90% of FP operations can be safely scaled down to 8-bit or 16-bit formats. Thanks to precision tuning and vectorization, execution time is decreased by 12% and memory accesses are reduced by 27% on average, leading to a reduction of energy consumption up to 30%.


2018 - Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution [Relazione in Atti di Convegno]
Matějka, Joel; Hanzálek, Zdeněk; Forsberg, Björn; Benini, Luca; Sojka, Michal; Marongiu, Andrea
abstract

Many applications require both high performance and predictable timing. High performance can be provided by COTS Multi-Core Systems on Chip (MPSoC); however, as cores in these systems share the memory bandwidth, they are susceptible to interference from each other, which is a problem for timing predictability. We achieve predictability on multi-cores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases, and schedules these such that only a single core is executing a memory phase at a time. We present a toolchain consisting of a compiler and an Integer Linear Programming scheduling model. Our compiler uses loop analysis and tiling to transform application code into PREM-compliant binaries. Furthermore, we solve the problem of scheduling execution on multiple cores while preventing interference of memory phases. We evaluate our toolchain on an Advanced-Driver-Assistance-Systems-like scenario containing matrix multiplications and FFT computations on the NVIDIA TX1. The results show that our approach maintains similar average performance and reduces the variance of completion times by a factor of 9.
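The core PREM idea, splitting tasks into memory and compute phases and letting only one core at a time execute a memory phase, can be sketched in a few lines. This is a toy model (threads standing in for cores, a lock standing in for the memory-phase arbitration), not the paper's compiler/ILP toolchain:

```python
import threading

memory_bus = threading.Lock()   # models the arbitrated, single-DRAM bottleneck

def prem_task(tid, data, results):
    # Memory phase: copy inputs to "local" storage, serialized by the lock
    with memory_bus:
        local = list(data)
    # Compute phase: operates on local data only, may overlap across cores
    acc = sum(v * v for v in local)
    # Memory phase: write results back, again under the lock
    with memory_bus:
        results[tid] = acc

results = {}
threads = [threading.Thread(target=prem_task, args=(i, range(1000), results))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # each task computes sum(v*v for v in range(1000)) = 332833500
```

The real toolchain derives the phase boundaries automatically (via loop tiling) and replaces the lock with an offline ILP schedule of the memory phases.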


2018 - Embedded operating systems [Capitolo/Saggio]
Scordino, C.; Guidieri, E.; Morelli, B.; Marongiu, A.; Tagliavini, G.; Gai, P.
abstract

In this chapter, we will provide a description of existing open-source operating systems (OSs) which have been analyzed with the objective of providing a port for the reference architecture described in Chapter 2. Among the various possibilities, the ERIKA Enterprise RTOS (Real-Time Operating System) and Linux with preemption patches have been selected. A description of the porting effort on the reference architecture is also provided.


2018 - Energy-quality scalable integrated circuits and systems: Continuing energy scaling in the twilight of Moore's law [Articolo su rivista]
Alioto, M.; De, V.; Marongiu, A.
abstract

This paper aims to take stock of recent advances in the field of energy-quality (EQ) scalable circuits and systems, a promising direction to continue the historical exponential energy downscaling under diminished returns from technology and voltage scaling. EQ-scalable systems explicitly trade off energy and quality at different levels of abstraction and sub-systems, dealing with 'quality' as an explicit design requirement, and reducing energy whenever the application, the task, or the dataset allows quality degradation (e.g., vision and machine learning). A general framework for EQ-scalable systems based on the concept of quality slack is presented along with scalable architectures. A taxonomy of techniques to trade off energy and quality, a VLSI perspective, and possible quality control strategies are then discussed. The state of the art is surveyed to put the advances in its different sub-fields into a unitary perspective, emphasizing on-going and prospective trends. At the component level, the generality of the EQ-scaling concept is shown through several examples, ranging from logic to analog circuits, memories, data converters, and accelerators. Interesting implications of the joint adoption of EQ scaling and machine learning are also discussed, suggesting that their synergy leaves ample room for further energy and performance improvements. From the abstraction-level viewpoint, EQ scaling is discussed from the circuit level to architectures, the hardware-software interface, the programming language, the compiler level, and run-time adaptation. Several case studies are discussed to put EQ scaling in the context of real-world applications.


2018 - Guest Editorial Energy-Quality Scalable Circuits and Systems for Sensing and Computing: From Approximate to Communication-Inspired and Learning-Based [Articolo su rivista]
Alioto, M.; De, V.; Marongiu, A.
abstract


2018 - Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures [Articolo su rivista]
Papagiannopoulou, Dimitra; Marongiu, Andrea; Moreshet, Tali; Benini, Luca; Herlihy, Maurice; Bahar, R. Iris
abstract

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories rather than hardware-managed caches, which require some form of cache coherence management. These “coherence-free” systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging, and we conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes.


2018 - HePREM: Enabling predictable GPU execution on heterogeneous SoC [Relazione in Atti di Convegno]
Forsberg, Bjorn; Benini, Luca; Marongiu, Andrea
abstract

Heterogeneous systems-on-a-chip are increasingly embracing shared memory designs, in which a single DRAM is used for both the main CPU and an integrated GPU. This architectural paradigm reduces the overheads associated with data movements and simplifies programmability. However, the deployment of real-time workloads on such architectures is troublesome, as memory contention significantly increases execution time of tasks and the pessimism in worst-case execution time (WCET) estimates. The Predictable Execution Model (PREM) separates memory and computation phases in real-time codes, then arbitrates memory phases from different tasks such that only one core at a time can access the DRAM. This paper revisits the original PREM proposal in the context of heterogeneous SoCs, proposing a compiler-based approach to make GPU codes PREM-compliant. Starting from high-level specifications of computation offloading, suitable program regions are selected and separated into memory and compute phases. Our experimental results show that the proposed technique is able to reduce the sensitivity of GPU kernels to memory interference to near zero, and achieves up to a 20× reduction in the measured WCET.


2018 - HERO: An open-source research platform for HW/SW exploration of heterogeneous manycore systems [Relazione in Atti di Convegno]
Kurth, A.; Capotondi, Alessandro; Vogel, P.; Benini, L.; Marongiu, A.
abstract

Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine “standard platform” software support (e.g. the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles the challenges of such systems in a novel way. HERO’s host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source many-core processing engine based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the run-time overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions.


2018 - High-performance and time-predictable embedded computing [Curatela]
Pinho, L. M.; Quinones, E.; Bertogna, M.; Marongiu, A.; Nelis, V.; Gai, P.; Sancho, J.
abstract

Nowadays, computing systems are so prevalent in our lives that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Actuating outside of the required timing bounds may cause the failure of the system, which is vital for systems like planes, cars, business monitoring, e-trading, etc. High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices which are able to deliver high performance whilst guaranteeing the required application timing bounds. Technical topics discussed in the book include: parallel embedded platforms, programming models, mapping and scheduling of parallel computations, timing and schedulability analysis, runtimes and operating systems. The work reflected in this book was done in the scope of the European project P-SOCRATES, funded under the FP7 framework program of the European Commission. High-Performance and Time-Predictable Embedded Computing is ideal for personnel in the computer/communication/embedded industries, as well as academic staff and master/research students in computer science, embedded systems, cyber-physical systems and the internet of things.


2018 - Introduction [Capitolo/Saggio]
Pinho, L. M.; Quinones, E.; Bertogna, M.; Marongiu, A.; Nelis, V.; Gai, P.; Sancho, J.
abstract

This chapter provides an overview of the book theme, motivating the need for high-performance and time-predictable embedded computing. It describes the challenges introduced by the need for time-predictability on the one hand, and high performance on the other, discussing at a high level how these contradictory requirements can be simultaneously supported.


2018 - Manycore platforms [Capitolo/Saggio]
Marongiu, A.; Nelis, V.; Yomsi, P. M.
abstract

This chapter surveys state-of-the-art manycore platforms. It discusses the historical evolution of computing platforms over the past decades and the technical hurdles that led to the manycore revolution, then presents in detail several manycore platforms, outlining (i) the key architectural traits that enable scalability to several tens or hundreds of processing cores and (ii) the shared resources that are responsible for unpredictable timing.


2018 - On the cost of freedom from interference in heterogeneous SoCs [Relazione in Atti di Convegno]
Forsberg, Björn; Benini, Luca; Marongiu, Andrea
abstract

In heterogeneous CPU+GPU SoCs, where a single DRAM is shared between both devices, concurrent memory accesses from both devices can lead to slowdowns due to memory interference. This prevents the deployment of real-time tasks, which must be guaranteed to complete before a set deadline. Freedom from interference can be guaranteed through software memory scheduling, but this may come at a significant cost due to frequent CPU-GPU synchronizations. In this paper we provide a compile-time model to help developers make informed decisions on how to achieve freedom from interference at the lowest cost.


2018 - OpenMP runtime [Capitolo/Saggio]
Marongiu, A.; Tagliavini, G.; Quinones, E.
abstract

This chapter introduces the design of the OpenMP runtime and its key components, the offloading library and the tasking runtime library. Starting from the execution model introduced in the previous chapters, we first abstractly describe the interactions among the main actors involved in program execution. Then we focus on the optimized design of the offloading library and the tasking runtime library, followed by their performance characterization.


2018 - Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators [Articolo su rivista]
Tagliavini, Giuseppe; Haugou, Germain; Marongiu, Andrea; Benini, Luca
abstract

In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth, and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant to the OpenVX standard, and based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reduction of time and bandwidth, even when the main memory bandwidth for the accelerator is severely constrained.


2018 - Predictable parallel programming with OpenMP [Capitolo/Saggio]
Serrano, M. A.; Royuela, S.; Marongiu, A.; Quinones, E.
abstract

This chapter motivates the use of the OpenMP (Open Multi-Processing) parallel programming model to develop future critical real-time embedded systems, and analyzes the time-predictable properties of the OpenMP tasking model. Moreover, this chapter presents the set of compiler techniques needed to extract the timing information of an OpenMP program in the form of an OpenMP Directed Acyclic Graph (OpenMP-DAG).
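The essence of extracting a task DAG from OpenMP dependence clauses can be sketched compactly. The snippet below is an illustrative simplification, not the chapter's actual algorithm: it links each task to the most recent earlier task that wrote one of its `depend(in:...)` variables.

```python
def build_dag(tasks):
    """tasks: list of (name, ins, outs) in program order -> set of edges.
    ins/outs model OpenMP depend(in:...) / depend(out:...) clauses."""
    last_writer = {}   # variable -> name of the task that last wrote it
    edges = set()
    for name, ins, outs in tasks:
        for var in ins:                      # read-after-write dependence
            if var in last_writer:
                edges.add((last_writer[var], name))
        for var in outs:                     # this task becomes the writer
            last_writer[var] = name
    return edges

dag = build_dag([
    ("t1", [],         ["a"]),
    ("t2", [],         ["b"]),
    ("t3", ["a", "b"], ["c"]),
    ("t4", ["c"],      []),
])
print(sorted(dag))   # [('t1', 't3'), ('t2', 't3'), ('t3', 't4')]
```

A timing analysis would then annotate each node with its WCET and reason over paths of this DAG; a full implementation would also handle write-after-read and write-after-write dependences.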


2018 - Preface [Capitolo/Saggio]
Pinho, L. M.; Quinones, E.; Bertogna, M.; Marongiu, A.; Nelis, V.; Gai, P.; Sancho, J.
abstract


2018 - Runtime Support for Multiple Offload-Based Programming Models on Clustered Manycore Accelerators [Articolo su rivista]
Capotondi, A; Marongiu, A; Benini, L
abstract

Heterogeneous systems coupling a main host processor with one or more manycore accelerators are being adopted at virtually every scale to achieve ever-increasing GOps/Watt targets. The increased hardware complexity of such systems is paired at the application level by a growing number of applications concurrently running on the system. Techniques that enable efficient sharing of accelerator resources and support multiple programming models will thus be increasingly important for future heterogeneous SoCs. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of computation kernels offloaded from different programming models. The runtime supports spatial partitioning, where clusters can be grouped into several virtual accelerator instances. Our runtime design is modular and relies on a generic component for resource (cluster) scheduling, plus specialized components which deploy generic offload requests into the target programming model semantics. We evaluate the proposed runtime system on two real heterogeneous systems, focusing on two concrete use cases: i) single-user, multi-application high-end embedded systems and ii) multi-user, multi-workload low-power microservers. In the first case, our approach achieves 93% efficiency in terms of available accelerator resource exploitation. In the second case, our support allows a 47% performance improvement compared to single-programming-model systems.


2018 - Scalable and Efficient Virtual Memory Sharing in Heterogeneous SoCs with TLB Prefetching and MMU-Aware DMA Engine [Relazione in Atti di Convegno]
Kurth, A.; Vogel, P.; Marongiu, A.; Benini, L.
abstract

Shared virtual memory (SVM) is key in heterogeneous systems on chip (SoCs), which combine a general-purpose host processor with a many-core accelerator, both for programmability and to avoid data duplication. However, SVM can bring a significant run-time overhead when translation lookaside buffer (TLB) entries are missing. Moreover, allowing DMA burst transfers to write SVM traditionally requires buffers to absorb transfers that miss in the TLB. These buffers have to be overprovisioned for the maximum burst size, wasting precious on-chip memory, and stall all SVM accesses once they are full, hampering the scalability of parallel accelerators. In this work, we present our SVM solution that avoids the majority of TLB misses with prefetching, supports parallel burst DMA transfers without additional buffers, and can be scaled with the workload and number of parallel processors. Our solution is based on three novel concepts: To minimize the rate of TLB misses, the TLB is proactively filled by compiler-generated Prefetching Helper Threads, which use run-time information to issue timely prefetches. To reduce the latency of TLB misses, misses are handled by a variable number of parallel Miss Handling Helper Threads. To support parallel burst DMA transfers to SVM without additional buffers, we add lightweight hardware to a standard DMA engine to detect and react to TLB misses. Compared to the state of the art, our work improves accelerator performance for memory-intensive kernels by up to 4× and by up to 60% for irregular and regular memory access patterns, respectively.


2018 - Synergistic HW/SW Approximation Techniques for Ultralow-Power Parallel Computing [Articolo su rivista]
Tagliavini, Giuseppe; Rossi, Davide; Marongiu, Andrea; Benini, Luca
abstract

Ultralow-power embedded systems have recently started the move to multicore designs. Aggressive voltage scaling techniques have the potential to reduce the power consumption within the admitted envelope, but memory operations on standard six-transistor static RAM (6T-SRAM) cells become unreliable at low voltages. While standard cell memory (SCM) overcomes this limitation, it has much lower area density than SRAM, and thus it is too costly. On the other hand, several applications have inherent tolerance to computation errors, and executing such workloads with approximation has already proven a viable way to reduce energy consumption. In this paper, we propose a novel HW/SW approach to design energy-efficient ultralow-power systems which combines the key ideas of approximate computing and hybrid memory systems featuring both SCM and 6T-SRAM. We introduce novel hardware support to split error-tolerant data so as to host the most significant bits in the SCM and the least significant bits (LSBs) in the 6T-SRAM. This allows the memory system to be powered at a low voltage while ensuring correct operation, by binding possible (bit-flip) errors to the LSBs only. In addition, by organizing 6T-SRAM banks into multiple independent voltage domains, we enable fine-grained, software-controlled voltage switching policies. At the software level, we propose language constructs to specify which regions of code and which variables are tolerant to approximation, plus compiler support to optimize data placement. Experimental results show that our proposal can reduce the energy consumption of the memory system by 47% on average, always complying with the result accuracy required by practical application constraints.
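Why binding errors to the LSBs bounds the worst-case damage can be shown with a short simulation. This is a toy model of the idea, not the paper's hardware: the 16-bit word size, the 8/8 MSB/LSB split, and the 10% flip probability are all assumptions for illustration.

```python
import random

MSB_BITS, LSB_BITS = 8, 8    # assumed 16-bit words, split 8/8 for illustration

def split(word):
    """Reliable SCM half (MSBs) and error-prone 6T-SRAM half (LSBs)."""
    return word >> LSB_BITS, word & ((1 << LSB_BITS) - 1)

def inject_lsb_errors(lsb, flip_prob, rng):
    """Model low-voltage bit flips: they can only hit the LSB half."""
    for bit in range(LSB_BITS):
        if rng.random() < flip_prob:
            lsb ^= 1 << bit
    return lsb

rng = random.Random(0)
for _ in range(1000):
    word = rng.randrange(1 << 16)
    msb, lsb = split(word)
    noisy = (msb << LSB_BITS) | inject_lsb_errors(lsb, 0.1, rng)
    # Since the MSBs are stored reliably, the error is bounded by the LSB range
    assert abs(noisy - word) < (1 << LSB_BITS)
print("worst-case error is below", 1 << LSB_BITS)
```

Shrinking the LSB partition tightens the error bound at the cost of storing more bits in the larger, always-reliable SCM.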


2018 - The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores [Articolo su rivista]
Loi, Igor; Capotondi, Alessandro; Rossi, Davide; Marongiu, Andrea; Benini, Luca
abstract

High performance and extreme energy efficiency are strong requirements for a fast-growing number of edge-node Internet of Things (IoT) applications. While traditional Ultra-Low-Power designs rely on single-core micro-controllers (MCU), a new generation of architectures leveraging fully programmable tightly-coupled clusters of near-threshold processors is emerging, joining the performance gain of parallel execution over multiple cores with the energy efficiency of low-voltage operation. In this work we tackle one of the most critical energy-efficiency bottlenecks for these architectures: instruction memory hierarchy. Exploiting the instruction locality typical of data-parallel applications, we explore two different shared instruction cache architectures, based on energy-efficient latch-based memory banks: one leveraging a crossbar between processors and single-port banks (SP), and one leveraging banks with multiple read ports (MP). We evaluate the proposed architectures on a set of signal processing applications with different executable sizes and working-sets. The results show that the shared cache architectures are able to efficiently execute a much wider set of applications (including those featuring large memory footprint and irregular access patterns) with a much smaller area and with much better energy efficiency with respect to the private cache. The multi-port cache is suitable for sizes up to a few kB, improving performance by up to 40%, energy efficiency by up to 20%, and energy × area efficiency by up to 30% with respect to the private cache. The single-port solution is more suitable for larger cache sizes (up to 16 kB), providing up to 20% better energy × area efficiency than the multi-port, and up to 30% better energy efficiency than private cache.


2018 - Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking [Articolo su rivista]
Tagliavini, G; Cesarini, D; Marongiu, A
abstract

In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the urge for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.


2017 - A software stack for next-generation automotive systems on many-core heterogeneous platforms [Articolo su rivista]
Burgio, Paolo; Bertogna, Marko; Capodieci, Nicola; Cavicchioli, Roberto; Sojka, Michal; Houdek, Přemysl; Marongiu, Andrea; Gai, Paolo; Scordino, Claudio; Morelli, Bruno
abstract

The next generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making those platforms the ideal candidate architecture. Integrating tens to hundreds of computing elements that run at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using “traditional” state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Having impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little if any use in safety-critical applications. Project Hercules will address this issue, and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which allows achieving an order of magnitude of predictable performance on top of cutting-edge Commercial-Off-The-Shelf (COTS) components. The proposed software stack will let both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.


2017 - Edge-TM: Exploiting transactional memory for error tolerance and energy efficiency [Articolo su rivista]
Papagiannopoulou, Dimitra; Marongiu, Andrea; Moreshet, Tali; Herlihy, Maurice; Bahar, R. Iris
abstract

Scaling of semiconductor devices has enabled higher levels of integration and performance improvements at the price of making devices more susceptible to the effects of static and dynamic variability. Adding safety margins (guardbands) on the operating frequency or supply voltage prevents timing errors, but has a negative impact on performance and energy consumption. We propose Edge-TM, an adaptive hardware/software error management policy that (i) optimistically scales the voltage beyond the edge of safe operation for better energy savings and (ii) works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism. The policy applies dynamic voltage scaling (DVS) (while keeping frequency fixed) based on the feedback provided by the HTM, which makes it simple and generally applicable. Experiments on an embedded platform show our technique is capable of a 57% energy improvement compared to using voltage guardbands, and an extra 21-24% improvement over existing state-of-the-art error tolerance solutions, at nominal area and time overhead.


2017 - Efficient virtual memory sharing via on-accelerator page table walking in heterogeneous embedded SoCs [Articolo su rivista]
Vogel, Pirmin; Kurth, Andreas; Weinbuch, Johannes; Marongiu, Andrea; Benini, Luca
abstract

Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited in the context of more power- and area-constrained embedded systems, and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle synchronization and miss handling in software in case of page misses, which may incur considerable run-time overheads. In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.


2017 - Enabling zero-copy OpenMP offloading on the PULP many-core accelerator [Relazione in Atti di Convegno]
Capotondi, Alessandro; Marongiu, Andrea
abstract

Many-core heterogeneous designs are nowadays widely available among embedded systems. Initiatives such as the HSA push for a model where the host processor and the accelerator(s) communicate via coherent, Unified Virtual Memory (UVM). In this paper we describe our experience in porting the OpenMP v4 programming model to a low-end, heterogeneous embedded system based on the PULP many-core accelerator, featuring lightweight (software-managed) UVM support. We describe a GCC-based toolchain which enables: i) the automatic generation of host and accelerator binaries from a single, high-level OpenMP parallel program; ii) the automatic instrumentation of the accelerator program to transparently manage UVM. This enables up to 4× faster execution compared to traditional copy-based offload mechanisms.


2017 - GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model [Relazione in Atti di Convegno]
Forsberg, Björn; Palossi, Daniele; Marongiu, Andrea; Benini, Luca
abstract

Path planning is one of the key functional blocks for autonomous vehicles, constantly updating their route in real-time. Heterogeneous many-cores are appealing candidates for its execution, but the high degree of resource sharing results in very unpredictable timing behavior. The predictable execution model (PREM) has the potential to enable the deployment of real-time applications on top of commercial off-the-shelf (COTS) heterogeneous systems by separating compute and memory operations, and scheduling the latter in an interference-free manner. This paper studies PREM applied to a state-of-the-art path planner running on an NVIDIA Tegra X1, providing insight on memory sharing and its impact on performance and predictability. The results show that PREM reduces the execution time variance to near-zero, providing a 3× decrease in the worst case execution time.


2017 - GPUguard: Towards supporting a predictable execution model for heterogeneous SoC [Relazione in Atti di Convegno]
Forsberg, Bjorn; Marongiu, Andrea; Benini, Luca
abstract

The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this work we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.


2017 - HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA [Relazione in Atti di Convegno]
Kurth, Andreas; Vogel, Pirmin; Capotondi, Alessandro; Marongiu, Andrea; Benini, Luca
abstract


2017 - Lightweight virtual memory support for zero-copy sharing of pointer-rich data structures in heterogeneous embedded SoCs [Articolo su rivista]
Vogel, Pirmin; Marongiu, Andrea; Benini, Luca
abstract

While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed, which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores, and enables zero-copy sharing of pointer-rich data structures.


2017 - Message from ANDARE'17 general and program chairs [Breve Introduzione]
Bartolini, A.; Cardoso, J. M. P.; Silvano, C.; Palermo, G.; Barbosa, J.; Marongiu, A.; Mustafa, D.; Rohou, E.; Mantovani, F.; Agosta, G.; Martinovic, J.; Pingali, K.; Slaninova, K.; Benini, L.; Cytowski, M.; Palkovic, M.; Gerndt, M.; Sanna, N.; Diniz, P.; Rusitoru, R.; Eigenmann, R.; Patki, T.; Fahringer, T.; Rosendard, T.
abstract


2017 - On the accuracy of near-optimal CPU-based path planning for UAVs [Relazione in Atti di Convegno]
Palossi, Daniele; Marongiu, Andrea; Benini, Luca
abstract

Path planning is one of the key functional blocks for any autonomous aerial vehicle (UAV). The goal of a path planner module is to constantly update the route of the vehicle based on information sensed in real-time. Given the high computational requirements of this task, heterogeneous many-cores are appealing candidates for its execution. Approximate path computation has proven a promising approach to reduce total execution time, at the cost of a slight loss in accuracy. In this work we study performance and accuracy of state-of-the-art, near-optimal parallel path planning in combination with program transformations aimed at ensuring efficient use of embedded GPU resources. We propose a profile-based algorithmic variant which boosts GPU execution by up to ∼7×, while maintaining the accuracy loss below 5%.


2017 - Ultra low-power visual odometry for nano-scale unmanned aerial vehicles [Relazione in Atti di Convegno]
Palossi, Daniele; Marongiu, Andrea; Benini, Luca
abstract

One of the fundamental functionalities for autonomous navigation of Unmanned Aerial Vehicles (UAVs) is the hovering capability. State-of-the-art techniques for implementing hovering on standard-size UAVs process the camera stream to determine position and orientation (visual odometry). Similar techniques are considered unaffordable in the context of nano-scale UAVs (i.e. few centimeters of diameter), where the ultra-constrained power envelopes of tiny rotor-crafts limit the onboard computational capabilities to those of low-power microcontrollers. In this work we study how the emerging ultra-low-power parallel computing paradigm could enable the execution of complex hovering algorithmic flows onto nano-scale UAVs. We provide insight on the software pipeline, the parallelization opportunities and the impact of several algorithmic enhancements. Results demonstrate that the proposed software flow and architecture can deliver unprecedented GOPS/W, achieving 117 frames per second within a power envelope of 10 mW.


2016 - A Software Stack for Next-Generation Automotive Systems on Many-Core Heterogeneous Platforms [Relazione in Atti di Convegno]
Burgio, Paolo; Bertogna, Marko; Olmedo, Ignacio Sanudo; Gai, Paolo; Marongiu, Andrea; Sojka, Michal
abstract

The advent of commercial off-the-shelf (COTS) heterogeneous many-core platforms is opening up a series of opportunities in the embedded computing market. Integrating multiple computing elements running at lower frequencies makes it possible to obtain impressive performance capabilities at a reduced power consumption. These platforms can be successfully adopted to build the next generation of self-driving vehicles, where Advanced Driver Assistance Systems (ADAS) need to process unprecedentedly high computing workloads at low power budgets. Unfortunately, the current methodologies for providing real-time guarantees are ineffective when applied to the complex architectures of modern many-cores. Impressive average performance with no guaranteed bounds on the response times of the critical computing activities is of little if any use to these applications. Project HERCULES will provide the required technological infrastructure to obtain an order-of-magnitude improvement in the cost and power consumption of next-generation automotive systems. This paper presents the integrated software framework of the project, which allows achieving predictable performance on top of cutting-edge heterogeneous COTS platforms. The proposed software stack will let both real-time and non-real-time applications coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.


2016 - Always-on motion detection with application-level error control on a near-threshold approximate computing platform [Relazione in Atti di Convegno]
Tagliavini, Giuseppe; Marongiu, Andrea; Rossi, Davide; Benini, Luca
abstract

Pushing supply voltages in the near-threshold region is today one of the main avenues to minimize power consumption in digital integrated circuits. This works well with logic units, but memory operations on standard six-transistor static RAM (6T-SRAM) cells become unreliable at low voltages. Standard cell memory (SCM) works fully reliably at near-threshold voltages, but has much lower area density than 6T-SRAM and thus it is too costly. Hybrid memory designs based on a combination of 6T-SRAM and SCM have the potential to combine the best from both worlds, provided that appropriate software techniques for their management are used. Several embedded applications exhibit inherent tolerance to data approximation: this feature can be exploited by mapping error-tolerant data onto unreliable 6T-SRAM while keeping critical information error-free in SCM. However, one key issue is bounding error when it is input-data dependent. In this work we consider the motion detection stage of a computer vision pipeline, which is a major power bottleneck in always-on computer vision systems. We introduce an application-level metric for defining suitable tolerance thresholds and an associated runtime mechanism for their control. At each accuracy checkpoint the error on the computation is checked. If the runtime detects that an error threshold has been exceeded, the voltage settings are adjusted. Using this methodology, we achieve a significant reduction of the total energy consumption (up to 33% in the best case) while maintaining a tight control on quality of results.


2016 - An energy-efficient parallel algorithm for real-time near-optimal UAV path planning [Relazione in Atti di Convegno]
Palossi, D; Furci, M; Naldi, R; Marongiu, A; Marconi, L; Benini, L
abstract

We propose a shortest-trajectory planning algorithm implementation for Unmanned Aerial Vehicles (UAVs) on an embedded GPU. Our goal is the development of a fast, energy-efficient global planner for multi-rotor UAVs supporting the human operator during rescue missions. The work is based on an OpenCL parallel, non-deterministic version of the Dijkstra algorithm to solve the Single Source Shortest Path (SSSP) problem. Our planner is suitable for real-time path re-computation in dynamically varying environments of up to 200 m². Results demonstrate the efficacy of the approach, showing speedups of up to 74×, saving up to 98% of energy versus the sequential benchmark, while reaching near-optimal path selection, keeping the average path cost error smaller than 1.2%.


2016 - An optimized task-based runtime system for resource-constrained parallel accelerators [Relazione in Atti di Convegno]
Cesarini, D; Marongiu, A; Benini, L
abstract

Manycore accelerators have recently proven a promising solution for increasingly powerful and energy-efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering flexible support to fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing to several state-of-the-art tasking implementations, using the most representative parallelization patterns. The experimental results confirm that our solution achieves near-ideal speedups for tasks as small as 5K cycles.


2016 - Controlling NUMA effects in embedded manycore applications with lightweight nested parallelism support [Articolo su rivista]
Marongiu, A; Capotondi, A; Benini, L
abstract

Embedded manycore architectures are often organized as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory operations nonuniform (NUMA). Due to NUMA, regular applications typically employed in the embedded domain (e.g., image processing, computer vision, etc.) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism represents a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available, whose overhead does not dominate the run-time of fine-grained applications; (ii) a mechanism to control thread binding at the cluster-level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded manycores, integrating our primitives in the OpenMP runtime system, and implementing a new directive to control NUMA-aware nested parallelism mapping. We explore on a set of real application use cases how NUMA makes regular parallel workloads behave as irregular, and how our approach makes it possible to control such effects and achieve up to 28× speedup versus flat parallelism.


2016 - Enabling OpenVX support in mW-scale parallel accelerators [Relazione in Atti di Convegno]
Tagliavini, G; Haugou, G; Marongiu, A; Benini, L
abstract

mW-scale parallel accelerators are a promising target for application domains such as the Internet of Things (IoT), which require strong compliance with a limited power budget combined with high performance capabilities. An important use case is given by smart sensing devices featuring increasingly sophisticated vision capabilities, at the cost of an increasing amount of near-sensor computation power. OpenVX is an emerging standard for embedded vision, providing a C-based application programming interface and a runtime environment. OpenVX is designed to maximize functional and performance portability across diverse hardware platforms. However, state-of-the-art implementations rely on memory-hungry data structures, which cannot be supported in constrained devices. In this paper we propose an alternative and novel approach to provide OpenVX support in mW-scale parallel accelerators. Our main contributions are: (i) an extension to the original OpenVX model to support static management of application graphs in the form of binary files; (ii) the definition of a companion runtime environment providing lightweight support to execute binary graphs in a resource-constrained environment. Our approach achieves a 68% memory footprint reduction and a 3× execution speed-up compared to a baseline implementation. At the same time, data memory bandwidth is reduced by 10% and energy efficiency is improved by 2×.


2016 - Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms [Relazione in Atti di Convegno]
Conti, F; Palossi, D; Marongiu, A; Rossi, D; Benini, L
abstract

The stringent power constraints of complex microcontroller based devices (e.g. smart sensors for the IoT) represent an obstacle to the introduction of sophisticated functionality. Programmable accelerators would be extremely beneficial to provide the flexibility and energy efficiency required by fast-evolving IoT applications; however, the integration complexity and sub-10mW power budgets have been considered insurmountable obstacles so far. In this paper we demonstrate the feasibility of coupling a low power microcontroller unit (MCU) with a heterogeneous programmable accelerator for speeding-up computation-intensive algorithms at an ultra-low power (ULP) sub-10mW budget. Specifically, we develop a heterogeneous architecture coupling a Cortex-M series MCU with PULP, a programmable accelerator for ULP parallel computing. Complex functionality is enabled by the support for offloading parallel computational kernels from the MCU to the accelerator using the OpenMP programming model. We prototype this platform using a STM Nucleo board and a PULP FPGA emulator. We show that our methodology can deliver up to 60× gains in performance and energy efficiency on a diverse set of applications, opening the way for a new class of ULP heterogeneous architectures.


2016 - Exploring Single Source Shortest Path Parallelization on Shared Memory Accelerators [Relazione in Atti di Convegno]
Palossi, Daniele; Marongiu, Andrea
abstract

Single Source Shortest Path (SSSP) algorithms are widely used in embedded systems for several applications. The emerging trend towards the adoption of heterogeneous designs in embedded devices, where low-power parallel accelerators are coupled to the main processor, opens new opportunities to deliver superior performance/watt, but calls for efficient parallel SSSP implementations. In this work we provide a detailed exploration of the Δ-stepping algorithm performance on a representative heterogeneous embedded system, TI Keystone II, considering the impact of several parallelization parameters (threading, load balancing, synchronization).


2016 - He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores [Articolo su rivista]
Conti, F; Marongiu, A; Pilkington, C; Benini, L
abstract

The end of Dennardian scaling in advanced technologies brought about new architectural templates to overcome the so-called utilization wall and provide Moore’s Law-like performance and energy scaling in embedded SoCs. One of the most promising templates, architectural heterogeneity, is hindered by high cost due to the design space explosion and the lack of effective exploration tools. Our work provides three contributions towards a scalable and effective methodology for design space exploration in embedded MC-SoCs. First, we present the He-P2012 architecture, augmenting the state-of-art STMicroelectronics P2012 platform with heterogeneous shared-L1 coprocessors called HW processing elements (HWPE). Second, we propose a novel methodology for the semi-automatic definition and instantiation of shared-memory HWPEs from a C source, supporting both simple and structured data types. Third, we demonstrate that the integration of HWPEs can provide significant performance and energy efficiency benefits on a set of benchmarks originally developed for the homogeneous P2012, achieving up to 123× speedup on the accelerated code region (∼98% of Amdahl’s law limit) while saving 2/3 of the energy.


2016 - On the effectiveness of OpenMP teams for cluster-based many-core accelerators [Relazione in Atti di Convegno]
Capotondi, Alessandro; Marongiu, Andrea
abstract

With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC-capable. Heterogeneous on-chip systems (SoC) that couple a general-purpose host processor to a many-core accelerator are becoming more and more widespread, and provide tremendous peak performance/watt, well suited to execute HPC-class programs. The increased computation potential is however traded off against ease of programming. Application developers are indeed required to manually deal with outlining code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to/from the accelerator. In addition, since most many-cores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer should also take care of properly mapping the parallel computation so as to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.


2016 - PULP: A parallel ultra low power platform for next generation IoT applications [Relazione in Atti di Convegno]
Rossi, Davide; Conti, Francesco; Marongiu, Andrea; Pullini, Antonio; Loi, Igor; Gautschi, Michael; Tagliavini, Giuseppe; Capotondi, Alessandro; Flatresse, Philippe; Benini, Luca
abstract

This article consists of a collection of slides from the authors' conference presentation.


2016 - Thrifty-malloc: A HW/SW codesign for the dynamic management of hardware transactional memory in embedded multicore systems [Relazione in Atti di Convegno]
Carle, T; Papagiannopoulou, D; Moreshet, T; Marongiu, A; Herlihy, M; Bahar, R I
abstract

We present thrifty-malloc: a transaction-friendly dynamic memory manager for high-end embedded multicore systems. The manager combines modularity, ease-of-use and hardware transactional memory (HTM) compatibility in a lightweight and memory-efficient design. Thrifty-malloc is easy to deploy and configure for non-expert programmers, yet provides good performance with low memory overhead for highly-parallel embedded applications running on massively parallel processor arrays (MPPAs) or many-core architectures. In addition, the transparent mechanisms that increase our manager's resilience to unpredictable dynamic situations incur a low timing overhead in comparison to established techniques.


2016 - VirtualSoC: A research tool for modern MPSoCs [Articolo su rivista]
Bortolotti, Daniele; Marongiu, Andrea; Benini, Luca
abstract

Architectural heterogeneity has proven to be an effective design paradigm to cope with an ever-increasing demand for computational power within tight energy budgets, in virtually every computing domain. Programmable manycore accelerators are currently widely used not only in high-performance computing systems, but also in embedded devices, in which they operate as coprocessors under the control of a general-purpose CPU (the host processor). Clearly, such powerful hardware architectures are paired with sophisticated and complex software ecosystems, composed of operating systems, programming models plus associated runtime engines, and increasingly complex user applications with related libraries. System modeling has always played a key role in early architectural exploration or software development when the real hardware is not available. The necessity of efficiently coping with the huge HW/SW design space provided by the described heterogeneous Systems on Chip (SoCs) calls for advanced full-system simulation methodologies and tools, capable of assessing various metrics for the functional and nonfunctional properties of the target system. In this article, we describe VirtualSoC, a simulation tool targeting the full-system simulation of massively parallel heterogeneous SoCs. We also describe how VirtualSoC has been successfully adopted in several research projects.


2015 - A framework for optimizing OpenVX applications performance on embedded manycore accelerators [Relazione in Atti di Convegno]
Tagliavini, Giuseppe; Haugou, Germain; Marongiu, Andrea; Benini, Luca
abstract


2015 - A memory-centric approach to enable timing-predictability within embedded many-core accelerators [Relazione in Atti di Convegno]
Burgio, Paolo; Marongiu, Andrea; Valente, Paolo; Bertogna, Marko
abstract

There is an increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle towards the adoption of such devices within industrial settings is related to the difficulties in tightly estimating the multiple interferences that may arise among the parallel components of the system. This in particular concerns concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds to thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications, if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters mostly affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, moving the first step towards predictable many-core systems.


2015 - ADRENALINE: An OpenVX Environment to Optimize Embedded Vision Applications on Many-core Accelerators [Relazione in Atti di Convegno]
Tagliavini, Giuseppe; Haugou, Germain; Marongiu, Andrea; Benini, Luca
abstract

The acceleration of Computer Vision algorithms is an important enabler to support the more and more pervasive applications of the embedded vision domain. Heterogeneous systems featuring a clustered many-core accelerator are a very promising target for embedded vision workloads, but the code optimization for these platforms is a challenging task. In this work we introduce ADRENALINE, a novel framework for fast prototyping and optimization of OpenVX applications for heterogeneous SoCs with many-core accelerators. ADRENALINE consists of an optimized OpenVX run-time system and a virtual platform, and it is intended to provide support to a wide range of end users. We highlight the benefits of this approach in different optimization contexts.


2015 - An Evaluation of Memory Sharing Performance for Heterogeneous Embedded SoCs with Many-Core Accelerators [Relazione in Atti di Convegno]
Vogel, Pirmin; Marongiu, Andrea; Benini, Luca
abstract

Today’s systems-on-chip (SoCs) more and more conform to the models envisioned by the Heterogeneous System Architecture (HSA) foundation in which massively parallel, programmable many-core accelerators (PMCAs) not only cooperate but also coherently share memory with a powerful, multi-core host processor. Allowing direct access to system memory from both sides greatly simplifies application development, but it increases the potential interference to the memory system due to the PMCA. In this work, we evaluate the impact of a PMCA’s memory traffic on the host performance using the Xilinx Zynq-7000 SoC. This platform features a dual-core ARM Cortex-A9 CPU, as well as a field-programmable gate array (FPGA), which we use to model a PMCA. Synthetic workload, real benchmarks from the MiBench and ALPBench suites, and collaborative workloads all show that the interference generated by the PMCA can significantly reduce the memory bandwidth seen by the host (on average up to 25% for host applications).


2015 - Architecture support for tightly-coupled multi-core clusters with shared-memory HW accelerators [Articolo su rivista]
Dehyadegari, Masoud; Marongiu, Andrea; Kakoee, Mohammad Reza; Mohammadi, Siamak; Yazdani, Naser; Benini, Luca
abstract

Coupling processors with acceleration hardware is an effective manner to improve energy efficiency of embedded systems. Many-core is nowadays a dominating design paradigm for SoCs, which opens new challenges and opportunities for designing HW blocks. Exploring acceleration solutions that naturally fit into well-established parallel programming models and that can be incrementally added on top of existing parallel applications is thus extremely important. In this paper we focus on tightly-coupled multi-core cluster architectures, representative of the basic building block of the most recent many-cores, and we enhance it with dedicated HW processing units (HWPU). We propose an architecture where the HWPUs share the same L1 data memory through which processors also communicate, implementing a zero-copy communication model. High-level synthesis (HLS) tools are used to generate HW blocks, then a custom wrapper interfaces the latter to the tightly coupled cluster. We validate our proposal on RTL models, running both synthetic workload and real applications. Experimental results demonstrate that on average our solution provides nearly identical performance to traditional private-memory coarse-grained accelerators, but it achieves up to 32 percent better performance/area/watt and it requires only minimal modifications to legacy parallel codes.


2015 - Enabling Scalable and Fine-Grained Nested Parallelism on Embedded Many-cores [Relazione in Atti di Convegno]
Capotondi, Alessandro; Marongiu, Andrea; Benini, Luca
abstract


2015 - GPU Acceleration for simulating massively parallel many-core platforms [Articolo su rivista]
Raghav, Shivani; Ruggiero, Martino; Marongiu, Andrea; Pinto, Christian; Atienza, David; Benini, Luca
abstract

Emerging massively parallel architectures such as a general-purpose processor plus many-core programmable accelerators are creating an increasing demand for novel methods to perform their architectural simulation. Most state-of-the-art simulation technologies are exceedingly slow, and the need to model full-system many-core architectures adds further to the complexity issues. This paper presents a novel methodology to accelerate the simulation of many-core coprocessors using GPU platforms. We demonstrate the challenges, feasibility and benefits of our idea to use a heterogeneous system (CPU and GPU) to simulate the future architecture of many-core heterogeneous platforms. The target architecture selected to evaluate our methodology consists of an ARM general-purpose CPU coupled with a many-core coprocessor with thousands of simple in-order cores connected in a tile network. This work presents optimization techniques used to parallelize the simulation specifically for acceleration on GPUs. We partition the full-system simulation between CPU and GPU, where the target general-purpose CPU is simulated on the host CPU, whereas the many-core coprocessor is simulated on the NVIDIA Tesla 2070 GPU platform. Our experiments show performance of up to 50 MIPS when simulating the entire heterogeneous chip, and high scalability with an increasing number of cores on the coprocessor.


2015 - Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs [Relazione in Atti di Convegno]
Vogel, Pirmin; Marongiu, Andrea; Benini, Luca
abstract

While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA) as envisioned by the Heterogeneous System Architecture (HSA) foundation, their low-power counterparts targeting the embedded domain still lack basic features like virtual memory support for accelerators. As opposed to simply passing virtual address pointers, explicit data management involving copies is needed to share data between host processor and accelerators which hampers programmability and performance. In this work, we present a mixed hardware/software solution to enable lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip (SoCs). Based on an input/output translation lookaside buffer (IOTLB), efficiently managed by a kernel-level driver module running on the host, our solution features a considerably lower design complexity compared to conventional input/output memory management units. Using our evaluation platform based on the Xilinx Zynq-7000 SoC with a many-core accelerator implemented in the programmable logic, we demonstrate the effectiveness of our solution and the benefits of virtual memory support for embedded heterogeneous SoCs.


2015 - OpenMP and Timing Predictability: A Possible Union? [Relazione in Atti di Convegno]
Vargas, Roberto; Quinones, Eduardo; Marongiu, Andrea
abstract


2015 - Playing with fire: Transactional memory revisited for error-resilient and energy-efficient MPSOC execution [Relazione in Atti di Convegno]
Papagiannopoulou, Dimitra; Benini, Luca; Marongiu, Andrea; Herlihy, Maurice; Moreshet, Tali; Bahar, Iris
abstract

As silicon integration technology pushes toward atomic dimensions, errors due to static and dynamic variability are an increasing concern. To avoid such errors, designers often turn to "guardband" restrictions on the operating frequency and voltage. If guardbands are too conservative, they limit performance and waste energy, but less conservative guardbands risk moving the system closer to its Critical Operating Point (COP), a frequency-voltage pair that, if surpassed, causes massive instruction failures. In this paper, we propose a novel scheme that makes it possible to dynamically adjust to an evolving COP and operate at highly reduced margins, while guaranteeing forward progress. Specifically, our scheme dynamically monitors the platform and adaptively adjusts to the COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory (HTM) for error recovery. Experiments demonstrate that our technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient MPSoC execution.


2015 - P-SOCRATES: A parallel software framework for time-critical many-core systems [Articolo su rivista]
Pinho, Luís Miguel; Nélis, Vincent; Yomsi, Patrick Meumeu; Quiñones, Eduardo; Bertogna, Marko; Burgio, Paolo; Marongiu, Andrea; Scordino, Claudio; Gai, Paolo; Ramponi, Michele; Mardiak, Michal
abstract

The current generation of computing platforms is embracing multi-core and many-core processors to improve the overall performance of the system, meeting at the same time the stringent energy budgets requested by the market. Parallel programming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche in the high performance computing (HPC) field, but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. This converging demand raises the problem of how to guarantee timing requirements in the presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives.


2015 - Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators [Relazione in Atti di Convegno]
Capotondi, Alessandro; Haugou, Germain; Marongiu, Andrea; Benini, Luca
abstract


2015 - Simplifying Many-Core-Based Heterogeneous SoC Programming with Offload Directives [Articolo su rivista]
Marongiu, Andrea; Capotondi, Alessandro; Tagliavini, Giuseppe; Benini, Luca
abstract

Multiprocessor systems-on-chip (MPSoC) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, they are programmed at the cost of major code rewrites with low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, very close performance to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.


2015 - Synergistic architecture and programming model support for approximate micropower computing [Relazione in Atti di Convegno]
Tagliavini, Giuseppe; Rossi, Davide; Benini, Luca; Marongiu, Andrea
abstract

Energy consumption is a major constraining factor for embedded multi-core systems. Using aggressive voltage scaling can reduce power consumption, but memory operations become unreliable. Several embedded applications exhibit inherent tolerance to approximate computation, and marking the parts that can tolerate errors has proven a viable way to reduce energy consumption. In this work we propose an extension to OpenMP to specify which regions of code and data are tolerant to approximation. A compiler pass places data into memory regions with different reliability guarantees according to their tolerance to errors. The voltage supply level is dynamically adjusted according to tolerance policies, with the overall goal of minimizing energy in full compliance with precision constraints.
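A hypothetical sketch of what such an annotation could look like: the `approx` pragma below is invented for illustration (the abstract does not give the actual directive names). A standard compiler ignores the unknown pragma, so the code here runs with exact semantics; on the proposed system, a compiler pass would use the annotation to place the buffer in a low-voltage, less-reliable memory region.

```c
#include <stddef.h>

/* HYPOTHETICAL annotation, not the paper's actual syntax: declare that
 * the pixel buffer tolerates occasional bit errors, so it may be placed
 * in an aggressively voltage-scaled (unreliable) memory region. */
#pragma approx tolerant_data(pixels)
int average_brightness(const unsigned char *pixels, size_t n)
{
    /* Error-tolerant reduction: a flipped bit in one pixel shifts the
     * average only slightly, so approximation is acceptable here. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pixels[i];
    return (int)(sum / (long)n);
}
```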


2015 - Task scheduling strategies to mitigate hardware variability in embedded shared memory clusters [Relazione in Atti di Convegno]
Rahimi, Abbas; Cesarini, Daniele; Marongiu, Andrea; Gupta Rajesh, K.; Benini, Luca
abstract

Manufacturing and environmental variations cause timing errors that are typically avoided by conservative design guardbands or corrected by circuit-level error detection and correction. These measures incur energy and performance penalties. This paper considers methods to reduce this cost by expanding the scope of variability mitigation through the software stack. In particular, we propose workload deployment methods that reduce the likelihood of timing errors in shared memory clusters of processor cores. This and other methods are incorporated in a runtime layer in the OpenMP framework that enables parsimonious countermeasures against timing errors induced by hardware variability. The runtime system 'introspectively' monitors the costs of task execution on various cores and transparently associates descriptive metadata with the tasks. By utilizing the characterized metadata, we propose several policies that enhance the cluster's choices for scheduling tasks to cores according to measured hardware variability and system workload. We devise efficient task scheduling strategies for the simultaneous management of variability and workload by exploiting centralized and distributed approaches to workload distribution. Both schedulers surpass current state-of-the-art approaches; the distributed (or the centralized) achieves on average 30% (or 17%) energy, and 17% (4%) performance improvement.


2015 - Timing characterization of OpenMP4 tasking model [Relazione in Atti di Convegno]
Serrano, Maria A.; Melani, Alessandra; Vargas, Roberto; Marongiu, Andrea; Bertogna, Marko; Quiñones, Eduardo
abstract

OpenMP is increasingly being supported by the newest high-end embedded many-core processors. Despite the lack of any notion of real-time execution, the latest specification of OpenMP (v4.0) introduces a tasking model that resembles the way real-time embedded applications are modeled and designed, i.e., as a set of periodic task graphs. This makes OpenMP4 a convenient candidate to be adopted in future real-time systems. However, OpenMP4 also incorporates features to guarantee backward compatibility with previous versions that limit its practical usability in real-time systems. The most notable example is the distinction between tied and untied tasks. Tied tasks force all parts of a task to be executed on the same thread that started the execution, whereas a suspended untied task is allowed to resume execution on a different thread. Moreover, tied tasks are forbidden from being scheduled on threads in which other non-descendant tied tasks are suspended. As a result, the execution model of tied tasks, which is the default model in OpenMP to simplify coexistence with legacy constructs, clearly restricts performance and has serious implications for the response-time analysis of OpenMP4 applications, making it difficult to adopt in real-time environments. In this paper, we revisit OpenMP design choices, introducing timing predictability as a new and key metric of interest. Our first results confirm that even if tied tasks can be timing-analyzed, the quality of the analysis is much worse than with untied tasks. We thus reason about the benefits of using untied tasks, deriving a response-time analysis for this model, and thus allowing the OpenMP4 untied model to be applied to real-time systems.
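The tied/untied distinction discussed in the abstract is expressed per task in standard OpenMP. A minimal sketch (a task-parallel Fibonacci, a textbook example, not taken from the paper): the `untied` clause allows a task that suspends at the `taskwait` to resume on any thread, removing the scheduling restriction that complicates the response-time analysis of tied tasks. Without `-fopenmp` the pragmas are ignored and the function runs as plain recursion.

```c
/* Task-parallel Fibonacci illustrating the "untied" clause from the
 * OpenMP tasking model. Omitting "untied" would make these tasks tied
 * (the OpenMP default): each part of a task would then be forced onto
 * the thread that first started it. */
int fib(int n)
{
    if (n < 2)
        return n;
    int x, y;
    #pragma omp task untied shared(x)
    x = fib(n - 1);
    #pragma omp task untied shared(y)
    y = fib(n - 2);
    /* Suspension point: an untied task may resume on a different thread */
    #pragma omp taskwait
    return x + y;
}
```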


2014 - A HLS-Based Toolflow to Design Next-Generation Heterogeneous Many-Core Platforms with Shared Memory [Relazione in Atti di Convegno]
Burgio, Paolo; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

This work describes how we use High-Level Synthesis to support design space exploration (DSE) of heterogeneous many-core systems. Modern embedded systems increasingly couple hardware accelerators and processing cores on the same chip, to trade specialization of the platform to an application domain for increased performance and energy efficiency. However, the process of designing such a platform is complex and error-prone, and requires skills on algorithmic aspects, hardware synthesis, and software engineering. DSE can partially be automated, and thus simplified, by coupling the use of HLS tools and virtual prototyping platforms. In this paper we enable the design space exploration of heterogeneous many-cores adopting a shared-memory architecture template, where communication and synchronization between the hardware accelerators and the cores happens through L1 shared memory. This communication infrastructure leverages a "zero-copy" scheme, which simplifies both the design process of the platform and the development of applications on top of it. Moreover, the shared-memory template perfectly fits the semantics of several high-level programming models, such as OpenMP. We provide programmers with simple yet powerful abstractions to exploit accelerators from within an OpenMP application, and propose a low-cost implementation of the necessary runtime support. An HLS-based automatic design flow is set up, to quickly explore the design space using a cycle-accurate virtual platform.


2014 - A tightly-coupled hardware controller to improve scalability and programmability of shared-memory heterogeneous clusters [Relazione in Atti di Convegno]
Burgio, Paolo; Danilo, Robin; Marongiu, Andrea; Coussy, Philippe; Benini, Luca
abstract

Modern designs for embedded many-core systems increasingly include application-specific units to accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency compared to software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW units (HWPU) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a high number of HW Processing Units (HWPUs) in such a system poses two main challenges, namely, architectural scalability and programmability. In this paper we describe an optimized Data Pump (DP) which connects several accelerators to a restricted set of communication ports, and acts as a virtualization layer for programming, exposing FIFO queues to offload "HW tasks" to them through a set of lightweight APIs. In this work, we aim at optimizing both these mechanisms, respectively reducing module area and making the programming sequence easier and lighter.


2014 - A Virtualization Framework for IOMMU-less Many-Core Accelerators [Relazione in Atti di Convegno]
Pinto, Christian; Marongiu, Andrea; Benini, Luca
abstract


2014 - Augmenting manycore programmable accelerators with photonic interconnect technology for the high-end embedded computing domain [Relazione in Atti di Convegno]
Balboni, Marco; Obon Marta, Ortin; Capotondi, Alessandro; Tatenguem Hervé, Fankem; Ghiribaldi, Alberto; Ramini, Luca; Viñal, Victor; Marongiu, Andrea; Bertozzi, Davide
abstract


2014 - Energy efficient parallel computing on the PULP platform with support for OpenMP [Relazione in Atti di Convegno]
Rossi, Davide; Loi, Igor; Conti, Francesco; Tagliavini, Giuseppe; Pullini, Antonio; Marongiu, Andrea
abstract


2014 - He-P2012: Architectural heterogeneity exploration on a scalable many-core platform [Relazione in Atti di Convegno]
Conti, Francesco; Pilkington, Chuck; Marongiu, Andrea; Benini, Luca
abstract

Architectural heterogeneity is a promising solution to overcome the utilization wall and provide Moore's Law-like performance scaling in future SoCs. However, heterogeneous architectures increase the size and complexity of the design space, and significant enhancements are required to tools and methodologies to explore this design space effectively. In this work, we describe an extension to the STMicroelectronics P2012 platform and simulation flow to support tightly-coupled shared memory HW processing elements (HWPE), we propose a methodology for the semi-automatic instantiation of HWPEs from a C program, and we explore several architectural variants on a set of computer vision benchmarks.


2014 - Improving resilience to timing errors by exposing variability effects to software in tightly-coupled processor clusters [Articolo su rivista]
Rahimi, Abbas; Cesarini, Daniele; Marongiu, Andrea; Gupta Rajesh, K.; Benini, Luca
abstract

Manufacturing and environmental variations cause timing errors in microelectronic processors that are typically avoided by ultra-conservative multi-corner design margins or corrected by error detection and recovery mechanisms at the circuit level. In contrast, we present here runtime software support for cost-effective countermeasures against hardware timing failures during system operation. We propose a variability-aware OpenMP (VOMP) programming environment, suitable for tightly-coupled shared memory processor clusters, that relies upon modeling across the hardware/software interface. VOMP is implemented as an extension to the OpenMP v3.0 programming model that covers various parallel work-sharing constructs, including sections and tasks. Using the notion of work-unit vulnerability (WUV) proposed here, we capture timing errors caused by circuit-level variability as high-level software knowledge. WUV consists of descriptive metadata to characterize the impact of variability on different work-unit types running on various cores. As such, WUV provides a useful abstraction of hardware variability to efficiently allocate a given work-unit to a suitable core for execution. VOMP enables hardware/software collaboration with online variability monitors in hardware and runtime scheduling in software. The hardware provides online per-core characterization of WUV metadata. This metadata is made available by carefully placing key data structures in a shared L1 memory and is used by the VOMP schedulers. Our results show that VOMP greatly reduces the cost of timing error recovery compared to the baseline schedulers of OpenMP, yielding speedups of 3%–36% for tasks, and 26%–49% for sections. Further, VOMP reaches energy savings of 2%–46% and 15%–50% for tasks and sections, respectively.


2014 - On the Relevance of Architectural Awareness for Efficient Fork/Join Support on Cluster-Based Manycores [Relazione in Atti di Convegno]
Al-Khalissi, Hayder; Berekovic, Mladen; Marongiu, Andrea
abstract

Several recent manycores leverage a hierarchical design, where small-medium numbers of cores are grouped inside clusters and enjoy low-latency, high-bandwidth local communication through fast L1 scratchpad memories. Several clusters can be interconnected through a network-on-chip (NoC), which ensures system scalability but introduces non-uniform memory access (NUMA) effects: the cost to access a specific memory location depends on the physical path that the corresponding transactions traverse. These peculiarities of the HW must clearly be carefully taken into account when designing support for programming models. In this paper we study how architectural awareness is key to supporting efficient and streamlined fork/join primitives. We compare hierarchical fork/join operations to "flat" ones, where there is no notion of the hierarchical interconnection system, considering two real-world manycores: Intel SCC and STMicroelectronics STHORM.


2014 - P-SOCRATES: A Parallel Software Framework for Time-Critical Many-Core Systems [Relazione in Atti di Convegno]
Pinho, L M; Quinones, E; Bertogna, M; Marongiu, A; Carlos, J P; Scordino, C; Ramponi, M
abstract

The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. This converging demand, however, raises the problem of how to guarantee timing requirements in the presence of parallel execution. This paper presents the approach of project P-SOCRATES for the design of an integrated framework for the execution of workload-intensive applications with real-time requirements on top of next-generation commercial-off-the-shelf (COTS) platforms based on many-core accelerated architectures. The time-criticality and parallelisation challenges are addressed by merging techniques coming from both the HPC and EC domains, identifying the main sources of indeterminism and proposing efficient mapping and scheduling algorithms, along with the associated timing and schedulability analysis, to guarantee the real-time and performance requirements of the applications.


2014 - Speculative synchronization for coherence-free embedded NUMA architectures [Relazione in Atti di Convegno]
Papagiannopoulou, Dimitra; Moreshet, Tali; Marongiu, Andrea; Benini, Luca; Herlihy, Maurice; Iris Bahar, R.
abstract

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access (NUMA) costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware-managed caches that require some form of cache coherence management. These "coherence-free" systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this paper, we present a new scheme for hardware transactional memory support within a cluster-based NUMA system that lacks an underlying cache-coherence protocol. To the best of our knowledge, this is the first design for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our design can achieve significant performance improvements over traditional lock-based schemes.


2014 - The Challenge of Time-Predictability in Modern Many-Core Architectures [Relazione in Atti di Convegno]
Nélis, Vincent; Yomsi, Patrick Meumeu; Pinho, Luís Miguel; Fonseca, José Carlos; Bertogna, Marko; Quiñones, Eduardo; Vargas, Roberto; Marongiu, Andrea
abstract

The recent technological advancements and market trends are causing an interesting phenomenon towards the convergence of High-Performance Computing (HPC) and Embedded Computing (EC) domains. Many recent HPC applications require huge amounts of information to be processed within a bounded amount of time while EC systems are increasingly concerned with providing higher performance in real-time. The convergence of these two domains towards systems requiring both high performance and a predictable time-behavior challenges the capabilities of current hardware architectures. Fortunately, the advent of next-generation many-core embedded platforms has the chance of intercepting this converging need for predictability and high-performance, allowing HPC and EC applications to be executed on efficient and powerful heterogeneous architectures integrating general-purpose processors with many-core computing fabrics. However, addressing this mixed set of requirements is not without its own challenges and it is now of paramount importance to develop new techniques to exploit the massively parallel computation capabilities of many-core platforms in a predictable way.


2014 - Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters [Relazione in Atti di Convegno]
Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; Marongiu, Andrea; Benini, Luca
abstract

Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.


2013 - A variability-aware OpenMP environment for efficient execution of accuracy-configurable computation on shared-FPU processor clusters [Relazione in Atti di Convegno]
Rahimi, Abbas; Marongiu, Andrea; Gupta, Rajesh K.; Benini, Luca
abstract

We propose a tightly-coupled, multi-core cluster architecture with shared, variation-tolerant, and accuracy-reconfigurable floating-point units (FPUs). The resilient shared-FPUs dynamically characterize FP pipeline vulnerability (FPV) and expose it as metadata to a software scheduler for reducing the cost of error correction. To further reduce this cost, our programming and runtime environment also supports controlled approximate computation through a combination of design-time and runtime techniques. We provide OpenMP extensions (as custom directives) for FP computations to specify parts of a program that can be executed approximately. We use a profiling technique to identify tolerable error significance and error rate thresholds in error-tolerant image processing applications. This information guides an application-driven hardware FPU synthesis and optimization design flow to generate efficient FPUs. At runtime, the scheduler utilizes FPV metadata and promotes FPUs to accurate mode, or demotes them to approximate mode depending upon the code region requirements. We demonstrate the effectiveness of our approach (in terms of energy savings) on a 16-core tightly-coupled cluster with eight shared-FPUs for both error-tolerant and general-purpose error-intolerant applications.


2013 - An integrated, programming model-driven framework for NoC–QoS support in cluster-based embedded many-cores [Articolo su rivista]
Joven, J.; Marongiu, A.; Angiolini, F.; Benini, L.; De Micheli, G.
abstract

Embedded SoC designs are embracing the many-core paradigm to deliver the required performance to run an ever-increasing number of applications in parallel. Networks-on-Chip (NoC) are considered a convenient technology to implement many-core embedded platforms. The complex and non-uniform nature of the traffic flows generated when multiple parallel applications are running simultaneously calls for Quality-of-Service (QoS) extensions in the NoC, but to efficiently exploit such services it is necessary to expose them to the software in an easy-to-use yet efficient manner. In this paper we present an integrated hardware/software approach for delivering QoS on top of a hybrid OpenMP-MPI parallel programming model. Our experimental results show the effectiveness of our proposal over a broad range of benchmarks and application mappings, demonstrating the ability to manage parallelism under QoS requirements effortlessly from the programming model.


2013 - Architecture and programming model support for efficient heterogeneous computing on tightly-coupled shared-memory clusters [Relazione in Atti di Convegno]
Burgio, P.; Marongiu, A.; Danilo, R.; Coussy, P.; Benini, L.
abstract

Modern computer vision and image processing embedded systems exploit hardware acceleration inside scalable parallel architectures, such as tightly-coupled clusters, to achieve stringent performance and energy efficiency targets. Architectural heterogeneity typically makes software development cumbersome, thus shared memory processor-to-accelerator communication is typically preferred to simplify code offloading to HW IPs for critical computational kernels. However, tightly coupling a large number of accelerators and processors in a shared memory cluster is a challenging task, since the complexity of the resulting system quickly becomes too large. We tackle these issues by proposing a template of heterogeneous shared memory cluster which scales to a large number of accelerators, achieving up to 40% better performance/area/watt than simply designing larger main interconnects to accommodate several HW IPs. In addition, following a trend towards standardization of acceleration capabilities of future embedded systems, we develop a programming model which simplifies application development for heterogeneous clusters.


2013 - Enabling Fine-Grained OpenMP Tasking on Tightly-Coupled Shared Memory Clusters (Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013) [Relazione in Atti di Convegno]
Burgio, Paolo; Tagliavini, Giuseppe; Marongiu, Andrea; Benini, Luca
abstract

Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, efficiently supporting it on cluster-based embedded SoCs is not easy, because the fine-grained parallel workload present in embedded applications cannot tolerate high memory and run-time overheads. In this paper we present our design of the runtime support layer for OpenMP tasking on an embedded shared memory cluster, identifying key aspects to achieving performance and discussing important architectural support for removing major bottlenecks.


2013 - Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP [Relazione in Atti di Convegno]
Marongiu, Andrea; Capotondi, Alessandro; Giuseppe, Tagliavini; Luca, Benini
abstract

Heterogeneous architectures based on one fast-clocked, moderately multicore "host" processor plus a many-core accelerator represent one promising way to satisfy the ever-increasing GOps/W requirements of embedded systems-on-chip. However, heterogeneous computing comes at the cost of increased programming complexity, requiring major rewrites of the applications in a low-level programming style (e.g., OpenCL). In this paper we present a programming model, compiler and runtime system for a prototype board from STMicroelectronics featuring an ARM9 host and a STHORM many-core accelerator. The programming model is based on OpenMP, with additional directives to efficiently program the accelerator from a single host program. The proposed multi-ISA compilation toolchain hides the whole process of outlining an accelerator program, compiling and loading it to the STHORM platform, and implementing data sharing between the host and the accelerator. Our experimental results show that we achieve performance very close to hand-optimized OpenCL codes, at a significantly lower programming complexity.


2013 - SIMinG-1k: A thousand-core simulator running on general-purpose graphical processing units [Articolo su rivista]
Raghav, S.; Marongiu, A.; Pinto, C.; Ruggiero, M.; Atienza, D.; Benini, L.
abstract

This paper introduces SIMinG-1k, a manycore simulator infrastructure. SIMinG-1k is a graphics-processing-unit-accelerated, parallel simulator for design-space exploration of large-scale manycore systems. It features an optimal trade-off between modeling accuracy and simulation speed. Its main objectives are high performance, flexibility, and the ability to simulate thousands of cores. SIMinG-1k can model different architectures (currently, we support ARM (available from: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0100i/index.html) and Intel x86) using a two-step approach where the architecture-specific front end is decoupled from a fast, parallel manycore virtual machine running on a graphical processing unit platform. We evaluate the simulator for target architectures with up to 4096 cores. Our results demonstrate very high scalability and almost linear speedup with simulation of an increasing number of cores.


2013 - Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters [Relazione in Atti di Convegno]
Conti, Francesco; Marongiu, Andrea; Benini, Luca
abstract

Several many-core designs tackle scalability issues by leveraging tightly-coupled clusters as building blocks, where low-latency, high-bandwidth interconnection between a small/medium number of cores and L1 memory achieves high performance/watt. Tight coupling of hardware accelerators into these multicore clusters constitutes a promising approach to further improve performance/area/watt. However, accelerators are often clocked at a lower frequency than processor clusters for energy efficiency reasons. In this paper, we propose a technique to integrate shared-memory accelerators within the tightly-coupled clusters of the STMicroelectronics STHORM architecture. Our methodology significantly relaxes timing constraints for tightly-coupled accelerators, while optimizing data bandwidth. In addition, our technique allows the accelerator to operate at an integer submultiple of the cluster frequency. Experimental results show that the proposed approach recovers up to 84% of the slow-down implied by the reduced accelerator speed.


2013 - Transparent and energy-efficient speculation on NUMA architectures for embedded MPSoCs (Proceedings of the First International Workshop on Many-core Embedded Systems - MES '13) [Relazione in Atti di Convegno]
Papagiannopoulou, Dimitra; Iris Bahar, R.; Moreshet, Tali; Herlihy, Maurice; Marongiu, Andrea; Benini, Luca
abstract

High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, and home entertainment centers are becoming ubiquitous. Like their general-purpose counterparts, and for many of the same energy-related reasons, embedded systems are turning to multicore architectures. Moreover, as the demand for more compute-intensive capabilities for embedded systems increases, these multicore architectures will evolve into many-core systems for improved performance or performance/area/Watt. These systems are often organized as cluster-based Non-Uniform Memory Access (NUMA) architectures that provide the programmer with a shared-memory abstraction, with the cost of sharing memory (in terms of performance, energy, and complexity) varying substantially depending on the locations of the communicating processes. This paper investigates one of the principal challenges presented by these emerging NUMA architectures for embedded systems: providing efficient, energy-effective and convenient mechanisms for synchronization and communication. In this paper, we propose an initial solution based on hardware support for speculative synchronization.


2013 - Variation-tolerant OpenMP Tasking on Tightly-coupled Processor Clusters (Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013) [Relazione in Atti di Convegno]
Rahimi, Abbas; Marongiu, Andrea; Burgio, Paolo; Gupta, Rajesh K.; Benini, Luca
abstract

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) the hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) the software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22]; consequently, the instructions per cycle (IPC) of a 16-core processor cluster is increased by up to 1.51× (1.17× on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16), and across a wide temperature range (ΔT = 90°C).


2013 - VirtualSoC: A Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip [Relazione in Atti di Convegno]
Bortolotti, Daniele; Pinto, Christian; Marongiu, Andrea; Ruggiero, Martino; Benini, Luca
abstract

Driven by the flexibility, performance and cost constraints of demanding modern applications, heterogeneous System-on-Chip (SoC) is the dominant design paradigm in the embedded computing domain. SoC architecture and heterogeneity clearly provide wider power/performance scaling, combining high-performance and power-efficient general-purpose cores along with massively parallel many-core-based accelerators. Besides the complex hardware, these kinds of platforms generally also host an advanced software ecosystem, composed of an operating system, several communication protocol stacks, and various computationally demanding user applications. The necessity to efficiently cope with the huge HW/SW design space provided by this scenario clearly makes the full-system simulator one of the most important design tools. We present in this paper a new emulation framework, called VirtualSoC, targeting the full-system simulation of massively parallel heterogeneous SoCs.


2012 - A Tightly-Coupled Multi-Core Cluster with Shared-Memory HW Accelerators [Relazione in Atti di Convegno]
M., Dehyadegari; Marongiu, Andrea; Kakoee, Mohammad Reza; Benini, Luca; S., Mohammadi; N., Yazdani
abstract

Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation and (ii) developing architectural templates and run-time techniques for minimizing the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPU), with zero-copy communication. We also provide a simple programming API, which simplifies the process of offloading jobs to HWPUs.


2012 - An OpenMP Compiler for Efficient Use of Distributed Scratchpad Memory in MPSoCs [Articolo su rivista]
Marongiu, A.; Benini, L.
abstract

Most of today’s state-of-the-art processors for mobile and embedded systems feature on-chip scratchpad memories. To efficiently exploit the advantages of low-latency high-bandwidth memory modules in the hierarchy, there is the need for programming models and/or language features that expose such architectural details. On the other hand, effectively exploiting the limited on-chip memory space requires the programmer to devise an efficient partitioning and distributed placement of shared data at the application level. In this paper, we propose a programming framework that combines the ease of use of OpenMP with simple, yet powerful, language extensions to trigger array data partitioning. Our compiler exploits profiled information on array access count to automatically generate data allocation schemes optimized for locality of references.


2012 - Design of a collective communication infrastructure for barrier synchronization in cluster-based nanoscale MPSoCs [Relazione in Atti di Convegno]
Abellan, J. L.; Fernandez, J.; Acacio, M. E.; Bertozzi, Davide; Bortolotti, Daniele; Marongiu, Andrea; Benini, Luca
abstract

Barrier synchronization is a key programming primitive for shared-memory embedded MPSoCs. As the core count increases, software implementations cannot provide the needed performance and scalability, thus making hardware acceleration critical. In this paper we describe an interconnect extension implemented with standard cells and a mainstream industrial toolflow. We show that the area overhead is marginal with respect to the performance improvements of the resulting hardware-accelerated barriers. We integrate our HW barrier into the OpenMP programming model and discuss its synchronization efficiency compared with traditional software implementations.


2012 - Fast and lightweight support for nested parallelism on cluster-based embedded many-cores [Relazione in Atti di Convegno]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

Several recent many-core accelerators have been architected as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used – with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level – which makes memory operations non-uniform (NUMA). Nested parallelism represents a powerful programming abstraction for these architectures, where a first level of parallelism can be used to distribute coarse-grained tasks to clusters, and additional levels of fine-grained parallelism can be distributed to processors within a cluster. This paper presents a lightweight and highly optimized support for nested parallelism on cluster-based embedded many-cores. We assess the costs of enabling multi-level parallelization and demonstrate that our techniques allow high degrees of parallelism to be extracted.


2012 - Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting [Relazione in Atti di Convegno]
S., Raghav; Marongiu, Andrea; Pinto, Christian; D., Atienza; Ruggiero, Martino; Benini, Luca
abstract

Modern systems-on-chip are evolving towards complex and heterogeneous platforms with general-purpose processors coupled with massively parallel many-core accelerator fabrics (e.g. embedded GPUs). Platform developers are looking for efficient full-system simulators capable of simulating complex applications, middleware and operating systems on these heterogeneous targets. Unfortunately, current virtual platforms are not able to tackle the complexity and heterogeneity of state-of-the-art SoCs. Software emulators, such as the open-source QEMU project, cope quite well in terms of simulation speed and functional accuracy with homogeneous coarse-grained multi-cores. The main contribution of this paper is the introduction of a novel virtual prototyping technique which exploits the heterogeneous accelerators available in commodity PCs to tackle the heterogeneity challenge in full-SoC system simulation. In a nutshell, our approach makes it possible to partition simulation between the host CPU and GPU. More specifically, QEMU runs on the host CPU and the simulation of many-core accelerators is offloaded, through semi-hosting, to the host GPU. Our experimental results confirm the flexibility and efficiency of our enhanced QEMU environment.


2012 - Low-overhead barrier synchronization for openmp-like parallelism on the single-chip cloud computer [Relazione in Atti di Convegno]
Al-Khalissi, H.; Marongiu, A.; Berekovic, M.
abstract

To simplify program development for the Single-chip Cloud Computer (SCC) it is desirable to have high-level, shared-memory-based parallel programming abstractions (e.g., an OpenMP-like programming model). Central to any such programming model are barrier synchronization primitives, which coordinate the work of parallel threads. To allow high-level barrier constructs to deliver good performance, we need an efficient implementation of the underlying synchronization algorithm. In this work, we consider some of the most widely used approaches for barrier synchronization on the SCC, which constitute the basis for implementing OpenMP-like parallelism. In particular, we consider optimizations that leverage SCC-specific hardware support for synchronization, or its explicitly-managed memory buffers. We provide a detailed evaluation of the performance achieved by the different approaches.


2012 - OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters [Relazione in Atti di Convegno]
Burgio, Paolo; Marongiu, Andrea; D., Heller; C., Chavet; P., Coussy; Benini, Luca
abstract

Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. To assist the effective design of such systems there is the need, on one hand, for clear methodologies to streamline accelerator definition and instantiation and, on the other, for architectural templates and runtime techniques that minimize processor-to-accelerator communication costs. In this paper we present an architecture featuring tightly-coupled processors and accelerators, with zero-copy communication. Efficient programming is supported by an extended OpenMP programming model, where custom directives allow the programmer to specialize code regions for execution on parallel cores, accelerators, or a mix of the two. Our integrated approach enables fast yet accurate exploration of accelerator-based HW and SW architectures.


2011 - Exploring Instruction caching strategies for tightly-coupled shared-memory clusters [Relazione in Atti di Convegno]
Bortolotti, D.; Paterna, F.; Pinto, C.; Marongiu, A.; Ruggiero, M.; Benini, L.
abstract

Several Chip-Multiprocessor designs today leverage tightly-coupled computing clusters as a building block. These clusters consist of a fairly large number N of simple cores, featuring fast communication through a shared multi-banked L1 data memory and ≈ 1 instruction per cycle (IPC) per core. Thus, the aggregated I-fetch bandwidth approaches f × N, where f is the cluster clock frequency. An effective instruction cache architecture is key to supporting this I-fetch bandwidth. In this paper we compare the two main architectures for instruction caching targeting tightly coupled CMP clusters: (i) private instruction caches per core and (ii) a shared instruction cache per cluster. We developed a cycle-accurate model of the tightly coupled cluster with several configurable architectural parameters for exploration, plus a programming environment targeted at efficient data-parallel computing. We conduct an in-depth study of the two architectural templates based on both synthetic microbenchmarks and real program workloads. Our results provide useful insights and guidelines for designers.


2011 - GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms [Relazione in Atti di Convegno]
Pinto, C.; Raghav, S.; Marongiu, A.; Ruggiero, M.; Atienza, D.; Benini, L.
abstract

The multicore revolution and the ever-increasing complexity of computing systems are dramatically changing system design, analysis and programming of computing platforms. Future architectures will feature hundreds to thousands of simple processors and on-chip memories connected through a network-on-chip. Architectural simulators will remain primary tools for design space exploration, software development and performance evaluation of these massively parallel architectures. However, architectural simulation performance is a serious concern, as virtual platforms and simulation technology are not able to tackle the complexity of future thousand-core scenarios. The main contribution of this paper is the development of a new simulation approach and technology for many-core processors which exploits the enormous parallel processing capability of low-cost and widely available General Purpose Graphics Processing Units (GPGPUs). The simulation of many-core architectures indeed exhibits a high level of parallelism and is inherently parallelizable, but GPGPU acceleration of architectural simulation requires an in-depth revision of the data structures and functional partitioning traditionally used in parallel simulation. We demonstrate our GPGPU simulator on a target architecture composed of several cores (ARM ISA based), with instruction and data caches, connected through a Network-on-Chip (NoC). Our experiments confirm the feasibility of our approach.


2011 - MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor [Relazione in Atti di Convegno]
Franceschelli, A.; Burgio, P.; Tagliavini, G.; Marongiu, A.; Ruggiero, M.; Lombardi, M.; Bonfietti, A.; Milano, M.; Benini, L.
abstract

We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.


2011 - SoC-TM: Integrated HW/SW Support for Transactional Memory Programming on Embedded MPSoCs [Relazione in Atti di Convegno]
C., Ferri; Marongiu, Andrea; B., Lipton; I. R., Bahar; Benini, Luca; M., Herlihy; T., Moreshet
abstract

Two overriding concerns in the development of embedded MPSoCs are ease of programming and hardware complexity. In this paper we present SoC-TM, an integrated HW/SW solution for transactional programming on embedded MPSoCs. Our proposal leverages a Hardware Transactional Memory (HTM) design, based on a dedicated HW module for conflict management, whose functionality is exposed to the software through compiler directives, implemented as an extension to the popular OpenMP programming model. To further improve ease of programming, our framework supports speculative parallelism, thanks to the ability to enforce a given commit order in hardware. Our experimental results confirm that SoC-TM is a viable and cost-effective solution for embedded MPSoCs, in terms of energy, performance and productivity.


2011 - Supporting OpenMP on a multi-cluster embedded MPSoC [Articolo su rivista]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection.


2010 - Efficient OpenMP data mapping for multicore platforms with vertically stacked memory [Relazione in Atti di Convegno]
Marongiu, Andrea; Ruggiero, M.; Benini, Luca
abstract

Emerging TSV-based 3D integration technologies have shown great promise to overcome scalability limitations in 2D designs by stacking multiple memory dies on top of a many-core die. Application software developers need programming models and tools to fully exploit the potential of vertically stacked memory. In this work, we focus on efficient data mapping for SPMD parallel applications on an explicitly managed 3D-stacked memory hierarchy, which requires the placement of data across multiple vertical memory stacks to be carefully optimized. We propose a programming framework with compiler support that enables array partitioning. Partitions are mapped to the 3D-stacked memory on top of the processor that accesses them most frequently, to take advantage of the lower latency of the vertical interconnect and to minimize high-latency traffic on the horizontal plane.


2010 - Evaluating OpenMP Support Costs on MPSoCs [Relazione in Atti di Convegno]
Marongiu, Andrea; Burgio, Paolo; Benini, Luca
abstract

The ever-increasing complexity of MPSoCs is making the production of software the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim at facilitating application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware. In this paper we present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that efficiently exploit the memory hierarchy. Experimental results on different benchmarks confirm the effectiveness of the optimizations in terms of performance improvements.


2010 - Exploring programming model-driven QoS support for NoC-based platforms [Relazione in Atti di Convegno]
Joven, J.; Marongiu, A.; Angiolini, F.; Benini, L.; DE MICHELI, G.
abstract

Networks-on-Chip (NoCs) are being increasingly considered as a central enabling technology to communication-centric designs as more and more IP blocks are integrated on the same SoC. Embedded applications, in turn, are becoming extremely sophisticated, and often require guaranteed levels of service and performance. The complex and non-uniform nature of network traffic generated by parallel applications running on a large number of possibly heterogeneous IPs makes a strong case for providing Quality of Service (QoS) support for traffic streams over the NoC infrastructure. In this paper we consider an integrated hardware/software approach for delivering QoS at the application level. We designed NoC hardware support, low-level middleware and APIs which enable QoS control at the application level. Furthermore, we identify a set of programming abstractions useful to associate the notion of priority to each running task in the system. An initial implementation of this programming model is also presented, which leverages a set of extensions to a MPSoC-specific OpenMP compiler and runtime environment.


2010 - Scalable instruction set simulator for thousand-core architectures running on GPGPUs [Relazione in Atti di Convegno]
Raghav, S.; Ruggiero, M.; Atienza, D.; Pinto, C.; Marongiu, A.; Benini, L.
abstract

Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent potential processing parallelism available in modern GPGPUs.


2010 - Vertical stealing: robust, locality-aware do-all workload distribution for 3D MPSoCs [Relazione in Atti di Convegno]
Marongiu, A.; Burgio, P.; Benini, L.
abstract

In this paper we address the issue of efficient do-all workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency, high-bandwidth access to multiple large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from their neighbors by implementing runtime support for work stealing.


2009 - Efficient OpenMP support and extensions for MPSoCs with explicitly managed memory hierarchy [Relazione in Atti di Convegno]
Marongiu, A.; Benini, L.
abstract

OpenMP is a de facto standard interface of the shared address space parallel programming model. Recently, there have been many attempts to use it as a programming environment for embedded MultiProcessor Systems-On-Chip (MPSoCs). This is due both to the ease of specifying parallel execution within a sequential code with OpenMP directives, and to the lack of a standard parallel programming method on MPSoCs. However, MPSoC platforms for embedded applications often feature non-uniform, explicitly managed memory hierarchies with no hardware cache coherency as well as heterogeneous cores with heterogeneous run-time systems. In this paper we present an optimized implementation of the compiler and runtime support infrastructure for OpenMP programming for a non-cache-coherent distributed memory MPSoC with explicitly managed scratchpad memories (SPM). The proposed framework features specific extensions to the OpenMP programming model that leverage explicit management of the memory hierarchy. Experimental results on different real-life applications confirm the effectiveness of the optimization in terms of performance improvements.


2009 - OpenMP Support for NBTI-Induced Aging Tolerance in MPSoCs [Relazione in Atti di Convegno]
Marongiu, A.; Acquaviva, A.; Benini, L.
abstract

This book constitutes the refereed proceedings of the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems, SSS 2009, held in Lyon, France, in November 2009. The 49 revised full papers and 14 brief announcements presented together with three invited talks were carefully reviewed and selected from 126 submissions. The papers address all safety- and security-related aspects of self-stabilizing systems in various areas, most of them related to self-* systems. The special topics were alternative systems and models, autonomic computational science, cloud computing, embedded systems, fault tolerance in distributed systems / dependability, formal methods in distributed systems, grid computing, mobility and dynamic networks, multicore computing, peer-to-peer systems, self-organizing systems, sensor networks, stabilization, and system safety and security.


2008 - Analysis of Power Management Strategies for a Large-Scale SoC Platform in 65nm Technology [Relazione in Atti di Convegno]
Marongiu, Andrea; Acquaviva, Andrea; Benini, Luca; Bartolini, Andrea
abstract

Leakage power has become a major concern in nanometer technologies (65nm and beyond), and new strategies are being proposed to overcome the limitations of traditional dynamic voltage and frequency scaling (DVFS) and shutdown (SD) approaches in dealing with leakage power and with its strong dependency on temperature and process variations. Even though many researchers have proposed effective point solutions to these issues, a detailed analysis of their impact on a commercial large-scale multimedia SoC platform is still missing. This paper presents an explorative and comparative analysis of DVFS and SD power management options on a multi-million-gate SoC in 65nm technology, and provides methodology directions and design insights.


2007 - Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms [Relazione in Atti di Convegno]
Marongiu, A.; Benini, L.; Kandemir, M.
abstract

Many MPSoC applications are loop-intensive and amenable to automatic parallelization with suitable compiler support. One of the key components of any compiler-parallelized code is the barrier instructions used to perform global synchronization across parallel processors. This scenario calls for a lightweight synchronization infrastructure. In this work we describe a lightweight barrier support library for a non-cache-coherent MPSoC architecture. The library is coupled with a parallelizing compiler front-end to set up a complete automated flow which, starting from sequential code, produces parallelized binary code that can be directly executed on an MPSoC target (a multi-core non-cache-coherent ARM7 platform). This tool-flow has been characterized in terms of system performance and energy.


2006 - Automatic application partitioning on FPGA/CPU systems based on detailed low-level information [Relazione in Atti di Convegno]
Busonera, Giovanni; Marongiu, Andrea; Carta, Salvatore; Raffo, Luigi
abstract

Reconfigurable FPGA/CPU systems are widely described in the literature as a viable processing solution for embedded and high-end processing. One of the key issues of this kind of approach is the partitioning of code between CPU and FPGA. The development of automatic partitioning tools makes it possible to obtain an optimized architecture without specific knowledge of digital design. In this paper we present a framework which, starting from an ANSI C application code: (i) automatically identifies code fragments suitable for hardware implementation as specialized functional units; (ii) generates synthesizable code for all these segments and sends it to a synthesis tool; (iii) selects, from the synthesis results, the segments to be implemented on the FPGA; and (iv) generates the bitstream to configure the FPGA and the modified C code to be executed on the CPU. We applied this tool to standard benchmarks obtaining, with respect to the state of the art, an improvement of up to 250% in the accuracy of performance estimation for the selected code segments. This leads to a more optimized code partitioning.