
ALESSANDRO CAPOTONDI

Fixed-term Researcher (Law 240/10)
Dipartimento di Scienze Fisiche, Informatiche e Matematiche (sede ex-Matematica)




Publications

2024 - ShareBERT: Embeddings Are Capable of Learning Hidden Layers [Conference paper]
Hu, Jia Cheng; Cavicchioli, Roberto; Berardinelli, Giulia; Capotondi, Alessandro
abstract


2023 - A Request for Clarity over the End of Sequence Token in the Self-Critical Sequence Training [Conference paper]
Hu, Jia Cheng; Cavicchioli, R.; Capotondi, A.
abstract

The Image Captioning research field is currently compromised by the lack of transparency and awareness over the End-of-Sequence token (Eos) in the Self-Critical Sequence Training. If the token is omitted, a model can boost its performance by up to +4.1 CIDEr-D using trivial sentence fragments. While this phenomenon poses an obstacle to a fair evaluation and comparison of established works, people involved in new projects are given the arduous choice between lower scores and unsatisfactory descriptions due to the competitive nature of the research. This work proposes to solve the problem by spreading awareness of the issue itself. In particular, we invite future works to share a simple and informative signature with the help of a library called SacreEOS. Code available at: https://github.com/jchenghu/sacreeos.


2023 - Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning [Conference paper]
Hu, Jia Cheng; Cavicchioli, Roberto; Capotondi, Alessandro
abstract


2023 - Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-based Heterogeneous SoCs [Conference paper]
Brilli, G.; Valente, G.; Capotondi, A.; Burgio, P.; Di Mascio, T.; Valente, P.; Marongiu, A.
abstract


2023 - HULK-V: a Heterogeneous Ultra-low-power Linux capable RISC-V SoC [Conference paper]
Valente, Luca; Tortorella, Yvan; Sinigaglia, Mattia; Tagliavini, Giuseppe; Capotondi, Alessandro; Benini, Luca; Rossi, Davide
abstract

IoT applications span a wide range in performance and memory footprint, under tight cost and power constraints. High-end applications rely on power-hungry Systems-on-Chip (SoCs) featuring powerful processors, large LPDDR/DDR3/4/5 memories, and supporting full-fledged Operating Systems (OS). On the contrary, low-end applications typically rely on Ultra-Low-Power microcontrollers with a “close to metal” software environment and simple micro-kernel-based runtimes. Emerging applications and trends of IoT require the “best of both worlds”: cheap and low-power SoC systems with a well-known and agile software environment based on a full-fledged OS (e.g., Linux), coupled with extreme energy efficiency and parallel digital signal processing capabilities. We present HULK-V: an open-source Heterogeneous Linux-capable RISC-V-based SoC coupling a 64-bit RISC-V processor with an 8-core Programmable Multi-Core Accelerator (PMCA), delivering up to 13.8 GOps, up to 157 GOps/W, and accelerating the execution of complex DSP and ML tasks by up to 112× over the host processor. HULK-V leverages a lightweight, fully digital memory hierarchy based on HyperRAM IoT DRAM that exposes up to 512 MB of DRAM memory to the host CPU. Thanks to HyperRAM, HULK-V doubles the energy efficiency without significant performance loss compared to using power-hungry LPDDR memories, which require expensive and large mixed-signal PHYs. HULK-V, implemented in GlobalFoundries 22nm FDX technology, is a fully digital, ultra-low-cost SoC running a 64-bit Linux software stack with OpenMP host-to-PMCA offload within a power envelope of just 250 mW.
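
The host-to-PMCA offload mentioned above follows the standard OpenMP accelerator model. As a rough sketch only (the FIR kernel, buffer sizes and map clauses below are invented for illustration and are not taken from the HULK-V software stack), a Linux host program can push a small DSP kernel to the accelerator like this:

#include <stdint.h>

/* Hypothetical 8-tap FIR filter offloaded from the 64-bit Linux host to the
 * 8-core PMCA via standard OpenMP 4.x target directives. */
void fir8(const int16_t *in, const int16_t *coeff, int32_t *out, int n)
{
    /* map() moves the buffers between host DRAM and accelerator-visible memory */
    #pragma omp target map(to: in[0:n], coeff[0:8]) map(from: out[0:n])
    {
        for (int i = 0; i < 8; i++)
            out[i] = 0;
        #pragma omp parallel for
        for (int i = 8; i < n; i++) {
            int32_t acc = 0;
            for (int k = 0; k < 8; k++)
                acc += (int32_t)in[i - k] * coeff[k];
            out[i] = acc;
        }
    }
}

The parallel loop is distributed over the accelerator cores by the OpenMP runtime, while the host core remains free to run the rest of the Linux software stack.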


2023 - Heterogeneous Encoders Scaling in the Transformer for Neural Machine Translation [Conference paper]
Hu, Jia Cheng; Cavicchioli, R.; Berardinelli, G.; Capotondi, A.
abstract


2022 - An FPGA Overlay for Efficient Real-Time Localization in 1/10th Scale Autonomous Vehicles [Conference paper]
Bernardi, Andrea; Brilli, Gianluca; Capotondi, Alessandro; Marongiu, Andrea; Burgio, Paolo
abstract

Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Different from software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully-functional 1/10th-scale racing autonomous car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution both from the hardware and software viewpoint, while providing better performance and energy efficiency compared to state-of-the-art solutions.
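
For context on the accelerated kernel, the computational core of LiDAR-based particle-filter localization is the per-particle weight update against the current scan, sketched below in plain C. This is a generic textbook formulation for illustration only; the beam model, the variance and all names are assumptions, not the structure of the FPGA IP described in the paper.

#include <math.h>

typedef struct { float x, y, theta, weight; } particle_t;

/* Re-weight each particle by how well the ranges predicted from its pose
 * match the measured LiDAR scan (independent Gaussian beam model). */
void update_weights(particle_t *p, int n_particles,
                    const float *scan, int n_beams,
                    float (*expected_range)(const particle_t *pose, int beam))
{
    const float var = 0.04f;                 /* illustrative beam variance */
    for (int i = 0; i < n_particles; i++) {
        float log_w = 0.0f;
        for (int b = 0; b < n_beams; b++) {
            float err = scan[b] - expected_range(&p[i], b); /* ray-cast on the map */
            log_w -= (err * err) / (2.0f * var);
        }
        p[i].weight = expf(log_w);
    }
}

After weighting, particles are resampled in proportion to their weights; in the paper the particle filter runs as FPGA logic while the rest of the pipeline stays on the programmable cores.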


2022 - Understanding and Mitigating Memory Interference in FPGA-based HeSoCs [Conference paper]
Brilli, G.; Capotondi, A.; Burgio, P.; Marongiu, A.
abstract

Like most high-end embedded systems, FPGA-based systems-on-chip (SoC) are increasingly adopting heterogeneous designs, where CPU cores, the configurable logic and other ICs all share the interconnect and main memory (DRAM) controller. This paradigm is scalable and reduces production costs and time-to-market, but creates resource contention issues, which ultimately affect the programs' timing. This problem has been widely studied on CPU- and GPU-based systems, along with strategies to mitigate such effects, but little has been done so far to systematically study the problem on FPGA-based SoCs. This work provides an in-depth analysis of memory interference on such systems, targeting two state-of-the-art commercial FPGA SoCs. We also discuss architectural support for Controlled Memory Request Injection (CMRI), a technique that has proven effective at reducing the bandwidth under-utilization implied by naive schemes that solve the interference problem by only allowing mutually exclusive access to the shared resources. Our experimental results show that: i) memory interference can slow down CPU tasks by up to 16× in the tested FPGA-based SoCs; ii) CMRI makes it possible to exploit more than 40% of the memory bandwidth available to FPGA accelerators (normally completely unused in PREM-like schemes), keeping the slowdown due to interference below 10%.
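
The idea behind CMRI can be modelled as a budget-per-window regulator: instead of stalling FPGA masters entirely while critical CPU tasks run (as PREM-like schemes do), each master may inject a bounded number of memory transactions per regulation window. The C model below is only a sketch of that policy with invented names and fields; in the paper the mechanism is architectural support, not software.

#include <stdint.h>

typedef struct {
    uint32_t budget_beats;   /* max bus beats the accelerator may inject per window */
    uint32_t window_cycles;  /* length of the regulation window */
    uint32_t issued;         /* beats already issued in the current window */
    uint32_t cycle;          /* position inside the current window */
} cmri_regulator_t;

/* Returns 1 if a pending request of req_beats beats may be injected now. */
static int cmri_grant(cmri_regulator_t *r, uint32_t req_beats)
{
    if (++r->cycle >= r->window_cycles) {   /* new window: refill the budget */
        r->cycle = 0;
        r->issued = 0;
    }
    if (r->issued + req_beats > r->budget_beats)
        return 0;                           /* hold the request to cap interference */
    r->issued += req_beats;
    return 1;
}

Tuning the per-window budget trades accelerator bandwidth against the slowdown inflicted on CPU tasks, which is the trade-off the paper quantifies.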


2021 - A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment [Conference paper]
Bellocchi, Gianluca; Capotondi, Alessandro; Conti, Francesco; Marongiu, Andrea
abstract

Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core.
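
To make the plug-and-play integration concrete, the sketch below shows the kind of bare-metal code a proxy core typically runs to drive a wrapped IP through memory-mapped registers and shared-memory buffers. The addresses, register layout and handshake bits are invented placeholders, not the overlay wrapper's actual interface.

#include <stdint.h>

/* Hypothetical register map of a wrapped acceleration IP. */
#define ACC_CTRL (*(volatile uint32_t *)0x40000000u) /* bit0: start, bit1: done */
#define ACC_SRC  (*(volatile uint32_t *)0x40000004u) /* source buffer address   */
#define ACC_DST  (*(volatile uint32_t *)0x40000008u) /* destination address     */
#define ACC_LEN  (*(volatile uint32_t *)0x4000000Cu) /* transfer length (bytes) */

/* Runs on a RISC-V core of the overlay: program the IP and wait for completion.
 * Buffers live in shared-memory banks visible to both the IP and the host. */
void run_accelerator(uint32_t src_addr, uint32_t dst_addr, uint32_t len)
{
    ACC_SRC  = src_addr;
    ACC_DST  = dst_addr;
    ACC_LEN  = len;
    ACC_CTRL = 0x1;                      /* raise start */
    while ((ACC_CTRL & 0x2) == 0)
        ;                                /* spin until the IP signals done */
}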


2021 - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [Journal article]
Ravaglia, L.; Rusci, M.; Nadalini, D.; Capotondi, A.; Conti, F.; Benini, L.
abstract

In the last few years, research and development on Deep Learning models and techniques for ultra-low-power devices - in a word, TinyML - has mainly focused on a train-then-deploy assumption, with static models that cannot be adapted to newly collected data without cloud-based data collection and fine-tuning. Latent Replay-based Continual Learning (CL) techniques (Pellegrini et al., 2020) enable online, serverless adaptation in principle, but so far they have still been too computation- and memory-hungry for ultra-low-power TinyML devices, which are typically based on microcontrollers. In this work, we introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor. We rethink the baseline Latent Replay CL algorithm, leveraging quantization of the frozen stage of the model and Latent Replays (LRs) to reduce their memory cost with minimal impact on accuracy. In particular, 8-bit compression of the LR memory proves to be almost lossless (-0.26% with 3000 LRs) compared to the full-precision baseline implementation, but requires 4× less memory, while 7-bit can also be used with an additional minimal accuracy degradation (up to 5%). We also introduce optimized primitives for forward and backward propagation on the PULP processor, together with data tiling strategies to fully exploit its memory hierarchy while maximizing efficiency. Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64 MB of memory - an amount compatible with embedding in TinyML devices. On an advanced 22nm prototype of our platform, called VEGA, the proposed solution performs on average 65× faster than a low-power STM32 L4 microcontroller and is 37× more energy efficient - enough for a lifetime of 535 h when learning a new mini-batch of data once every minute.
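
The 8-bit LR compression mentioned above amounts to storing the frozen-stage activations with a per-tensor scale and expanding them back to float only when they are replayed. A minimal sketch of that idea follows; the symmetric per-tensor scheme and the function names are assumptions for illustration, not the exact quantizer used on VEGA.

#include <stdint.h>
#include <math.h>

/* Compress a latent activation tensor to 8-bit offset-binary storage. */
void quantize_latent(const float *act, uint8_t *lr, int n, float *scale)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(act[i]) > max_abs)
            max_abs = fabsf(act[i]);
    *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f; /* symmetric 8-bit range */
    for (int i = 0; i < n; i++)
        lr[i] = (uint8_t)(lroundf(act[i] / *scale) + 128);
}

/* Expand a stored replay back to float when it is fed to the learning stage. */
void dequantize_latent(const uint8_t *lr, float *act, int n, float scale)
{
    for (int i = 0; i < n; i++)
        act[i] = (float)((int)lr[i] - 128) * scale;
}

Storing uint8 instead of float32 is what gives the 4× memory reduction quoted in the abstract; a 7-bit variant would simply drop one more bit at a small accuracy cost.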


2021 - Robustifying the deployment of tinyML models for autonomous mini-vehicles [Conference paper]
de Prado, M.; Rusci, M.; Donze, R.; Capotondi, A.; Monnerat, S.; Benini, L.; Pazos, N.
abstract

Standard-size autonomous navigation vehicles have rapidly improved thanks to the breakthroughs of deep learning. However, scaling autonomous driving to low-power systems deployed in dynamic environments poses several challenges that prevent their adoption. To address them, we propose a closed-loop learning flow for autonomous driving mini-vehicles that includes the target environment in-the-loop. We leverage a family of compact and high-throughput tinyCNNs to control the mini-vehicle, which learn in the target environment by imitating a computer vision algorithm, i.e., the expert. Thus, the tinyCNNs, having only access to an on-board fast-rate linear camera, gain robustness to lighting conditions and improve over time. Further, we leverage GAP8, a parallel ultra-low-power RISC-V SoC, to meet the inference requirements. When running the family of CNNs, our GAP8 solution outperforms any other implementation on the STM32L4 and NXP k64f (Cortex-M4), reducing the latency by over 13× and the energy consumption by 92%.


2021 - Robustifying the deployment of tinyML models for autonomous mini-vehicles [Journal article]
de Prado, M.; Rusci, M.; Capotondi, A.; Donze, R.; Benini, L.; Pazos, N.
abstract

Standard-sized autonomous vehicles have rapidly improved thanks to the breakthroughs of deep learning. However, scaling autonomous driving to mini-vehicles poses several challenges due to their limited on-board storage and computing capabilities. Moreover, autonomous systems lack robustness when deployed in dynamic environments where the underlying distribution is different from the distribution learned during training. To address these challenges, we propose a closed-loop learning flow for autonomous driving mini-vehicles that includes the target deployment environment in-the-loop. We leverage a family of compact and high-throughput tinyCNNs to control the mini-vehicle that learn by imitating a computer vision algorithm, i.e., the expert, in the target environment. Thus, the tinyCNNs, having only access to an on-board fast-rate linear camera, gain robustness to lighting conditions and improve over time. Moreover, we introduce an online predictor that can choose between different tinyCNN models at runtime—trading accuracy and latency— which minimises the inference’s energy consumption by up to 3.2×. Finally, we leverage GAP8, a parallel ultra-low-power RISC-V-based micro-controller unit (MCU), to meet the real-time inference requirements. When running the family of tinyCNNs, our solution running on GAP8 outperforms any other implementation on the STM32L4 and NXP k64f (traditional single-core MCUs), reducing the latency by over 13× and the energy consumption by 92%.
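
One common way to realize the runtime model selection described above (not necessarily the paper's exact policy) is a cascade: run the cheapest tinyCNN first and escalate to a larger one only when its confidence is too low, so most frames pay the low-energy path. A hedged C sketch with invented model handles and thresholds:

/* Hypothetical cascaded predictor: names, fields and the confidence threshold
 * are illustrative, not measurements or code from the paper. */
typedef struct {
    float (*infer)(const unsigned char *frame, int *steer); /* returns confidence */
    float energy_mj;                                        /* energy per inference */
} model_t;

/* Models are ordered from cheapest to most accurate. */
int predict_steering(const model_t *models, int n_models,
                     const unsigned char *frame, float conf_threshold)
{
    int steer = 0;
    for (int m = 0; m < n_models; m++) {
        float conf = models[m].infer(frame, &steer);
        if (conf >= conf_threshold || m == n_models - 1)
            break;               /* confident enough, or no larger model left */
    }
    return steer;
}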


2021 - Unmanned vehicles in smart farming: A survey and a glance at future horizons [Conference paper]
Madronal, D.; Palumbo, F.; Capotondi, A.; Marongiu, A.
abstract

Smart and Precision Agriculture is nowadays already exploiting advanced drones and machinery, but it is foreseen that in the near future more complex and intelligent applications will need to be brought on board to improve production both qualitatively and quantitatively. In this paper, we present an overview of the current usage of autonomous drones in this field and of the augmented computing capabilities that they could count on when companion computers are coupled to flight controllers. The paper also presents a novel architecture for companion computers that is under development within the Comp4Drones ECSEL-JU project.


2020 - A Systematic Assessment of Embedded Neural Networks for Object Detection [Conference paper]
Verucchi, M.; Brilli, G.; Sapienza, D.; Verasani, M.; Arena, M.; Gatti, F.; Capotondi, A.; Cavicchioli, R.; Bertogna, M.; Solieri, M.
abstract

Object detection is arguably one of the most important and complex tasks to enable the advent of next-generation autonomous systems. Recent advancements in deep learning techniques allowed a significant improvement in detection accuracy and latency of modern neural networks, allowing their adoption in automotive, avionics and industrial embedded systems, where performances are required to meet size, weight and power constraints. Multiple benchmarks and surveys exist to compare state-of-the-art detection networks, profiling important metrics, like precision, latency and power efficiency on Commercial-off-the-Shelf (COTS) embedded platforms. However, we observed a fundamental lack of fairness in the existing comparisons, with a number of implicit assumptions that may significantly bias the metrics of interest. This includes using heterogeneous settings for the input size, training dataset, threshold confidences, and, most importantly, platform-specific optimizations, which are especially important when assessing latency and energy-related values. The lack of uniform comparisons is mainly due to the significant effort required to re-implement network models, whenever openly available, on the specific platforms, to properly configure the available acceleration engines for optimizing performance, and to re-train the model using a homogeneous dataset. This paper aims at filling this gap, providing a comprehensive and fair comparison of the best-in-class Convolutional Neural Networks (CNNs) for real-time embedded systems, detailing the effort made to achieve an unbiased characterization on cutting-edge system-on-chips. Multi-dimensional trade-offs are explored for achieving a proper configuration of the available programmable accelerators for neural inference, adopting the best available software libraries. To stimulate the adoption of fair benchmarking assessments, the framework is released to the public in an open-source repository.


2020 - CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge Devices [Journal article]
Capotondi, A.; Rusci, M.; Fariselli, M.; Benini, L.
abstract

Low-precision integer arithmetic is a necessary ingredient for enabling Deep Learning inference on tiny and resource-constrained IoT edge devices. This brief presents CMix-NN, a flexible, open-source, mixed low-precision (independent quantization of weight and activation tensors at 8, 4, 2 bits) inference library for low-bitwidth Quantized Networks (CMix-NN is available at https://github.com/EEESlab/CMix-NN). CMix-NN efficiently supports both Per-Layer and Per-Channel quantization strategies of weights and activations. Thanks to CMix-NN, we deploy on an STM32H7 microcontroller a set of MobileNet family networks with the largest input resolution (224×224) and higher accuracies (up to 68% Top-1) when compressed with a mixed low-precision technique, achieving up to +8% accuracy improvement with respect to any other published solution for MCU devices.
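
The difference between the two quantization strategies mentioned above is simply whether the integer requantization of a convolution output uses one scale for the whole tensor or one scale per output channel. The sketch below illustrates this in plain C; it is a generic illustration and does not reproduce the CMix-NN API or its optimized kernels.

#include <stdint.h>

/* Integer-only requantization of a convolution accumulator (int32 -> uint8)
 * via a fixed-point multiply-and-shift. */
static inline uint8_t requant(int32_t acc, int32_t mult, int shift, int zero_point)
{
    int64_t v = ((int64_t)acc * mult) >> shift;
    v += zero_point;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

/* Per-Layer: one scale (mult/shift) shared by every output channel. */
void requant_per_layer(const int32_t *acc, uint8_t *out, int n_ch, int hw,
                       int32_t mult, int shift, int zp)
{
    for (int c = 0; c < n_ch; c++)
        for (int i = 0; i < hw; i++)
            out[c * hw + i] = requant(acc[c * hw + i], mult, shift, zp);
}

/* Per-Channel: an independent scale for each output channel. */
void requant_per_channel(const int32_t *acc, uint8_t *out, int n_ch, int hw,
                         const int32_t *mult, const int *shift, int zp)
{
    for (int c = 0; c < n_ch; c++)
        for (int i = 0; i < hw; i++)
            out[c * hw + i] = requant(acc[c * hw + i], mult[c], shift[c], zp);
}

Per-channel scaling tracks the weight distribution of each filter and usually recovers accuracy at low bit-widths, at the cost of storing one scale per channel.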


2020 - Exploring NEURAghe: A Customizable Template for APSoC-Based CNN Inference at the Edge [Journal article]
Meloni, P.; Loi, D.; Deriu, G.; Carreras, M.; Conti, F.; Capotondi, A.; Rossi, D.
abstract

The NEURAghe architecture has proved to be a powerful accelerator for deep convolutional neural networks running on heterogeneous architectures based on Xilinx Zynq-7000 all programmable system-on-chips. NEURAghe exploits the processing system and the programmable logic available in these devices to improve performance through parallelism, and to widen the scope of use-cases that can be supported. In this letter, we extend the NEURAghe template-based architecture to guarantee design-time scalability to multiprocessor SoCs with vastly different cost, size, and power envelope, such as Xilinx's Z-7007s, Z-7020, and Z-7045. The proposed architecture achieves state-of-the-art performance and cost effectiveness in all the analyzed configurations, reaching up to 335 GOps/s on the Z-7045.


2020 - Leveraging Automated Mixed-Low-Precision Quantization for Tiny Edge Microcontrollers [Conference paper]
Rusci, M.; Fariselli, M.; Capotondi, A.; Benini, L.
abstract

The severe on-chip memory limitations are currently preventing the deployment of the most accurate Deep Neural Network (DNN) models on tiny MicroController Units (MCUs), even if leveraging an effective 8-bit quantization scheme. To tackle this issue, in this paper we present an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices. Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors, under tight constraints on RAM and FLASH embedded memory sizes. We conduct an experimental analysis on MobileNetV1, MobileNetV2 and MNasNet models for Imagenet classification. Concerning the quantization policy search, the RL agent selects quantization policies that maximize the memory utilization. Given an MCU-class memory bound of 2 MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions quantized with a non-uniform function, which is not tailored for CPUs featuring integer-only arithmetic. This denotes the viability of uniform quantization, required for MCU deployments, for deep weights compression. When also limiting the activation memory budget to 512 kB, the best MobileNetV1 model scores up to 68.4% on Imagenet thanks to the found quantization policy, making it 4% more accurate than the other 8-bit networks fitting the same memory constraints.
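
The RAM/FLASH constraints that bound the agent's search can be stated as simple arithmetic: weight storage is the per-layer sum of parameter count times bit-width, and the activation budget must hold the largest input/output pair of adjacent tensors. The sketch below expresses that check in C; the layer descriptor and the pairing rule are simplifications introduced here, not the actual cost model of the HAQ-based flow.

#include <stddef.h>

/* Illustrative per-layer descriptor: parameter count, output activation count,
 * and the bit-widths chosen by the search (2, 4 or 8). */
typedef struct { size_t n_params; size_t n_act; int w_bits; int a_bits; } layer_q_t;

static size_t weight_bytes(const layer_q_t *l, int n_layers)
{
    size_t total = 0;
    for (int i = 0; i < n_layers; i++)
        total += (l[i].n_params * (size_t)l[i].w_bits + 7) / 8;
    return total;
}

/* A policy is admissible if all weights fit in FLASH and, for every layer,
 * its input and output activation tensors fit together in RAM. */
int policy_fits(const layer_q_t *l, int n_layers, size_t flash_bytes, size_t ram_bytes)
{
    if (weight_bytes(l, n_layers) > flash_bytes)
        return 0;
    for (int i = 0; i + 1 < n_layers; i++) {
        size_t act = (l[i].n_act * (size_t)l[i].a_bits
                    + l[i + 1].n_act * (size_t)l[i + 1].a_bits + 7) / 8;
        if (act > ram_bytes)
            return 0;
    }
    return 1;
}

With flash_bytes = 2*1024*1024 and ram_bytes = 512*1024 this mirrors the 2 MB and 512 kB bounds quoted in the abstract.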


2020 - Memory-Latency-Accuracy Trade-Offs for Continual Learning on a RISC-V Extreme-Edge Node [Conference paper]
Ravaglia, L.; Rusci, M.; Capotondi, A.; Conti, F.; Pellegrini, L.; Lomonaco, V.; Maltoni, D.; Benini, L.
abstract

AI-powered edge devices currently lack the ability to adapt their embedded inference models to the ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying the memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low-power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves a peak performance of 2.21 and 1.70 MAC/cycle, respectively, when running forward and backward steps of the gradient descent. We report the trade-off between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL when targeting an image classification task on the CORe50 dataset. For a CL setting that retrains all the layers, taking 5 h to learn a new class and achieving up to 77.3% precision, a more efficient solution retrains only part of the network, reaching an accuracy of 72.5% with a memory requirement of 300 MB and a computation latency of 1.5 hours. On the other hand, retraining only the last layer results in the fastest (867 ms) and least memory-hungry (20 MB) solution, but scores 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25× faster than a typical MCU device, on which CL is still impractical, and demonstrates an 11× gain in terms of energy consumption with respect to mobile-class solutions.


2020 - Mixed-data-model heterogeneous compilation and OpenMP offloading [Conference paper]
Kurth, A.; Wolters, K.; Forsberg, B.; Capotondi, A.; Marongiu, A.; Grosser, T.; Benini, L.
abstract

Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7 % compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
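
The data-model discrepancy can be made concrete with a small sketch: every pointer crossing the offload boundary must be checked and narrowed from the host's 64-bit data model to the accelerator's 32-bit one. The window base, size and helper below are invented for illustration; the paper's compiler performs this translation transparently inside the lowered OpenMP target regions.

#include <stdint.h>
#include <assert.h>

/* Invented placeholders: a 1 GiB window of host memory assumed to be mapped
 * at accelerator address 0, so it is reachable with 32-bit pointers. */
#define ACC_WINDOW_BASE 0x80000000ull
#define ACC_WINDOW_SIZE 0x40000000ull

typedef uint32_t acc_ptr_t;   /* the accelerator's native pointer width */

/* Narrow a 64-bit host pointer to the accelerator's 32-bit data model,
 * checking that the buffer actually lies inside the shared window. */
acc_ptr_t host_to_acc(const void *host_ptr)
{
    uint64_t a = (uint64_t)(uintptr_t)host_ptr;
    assert(a >= ACC_WINDOW_BASE && a < ACC_WINDOW_BASE + ACC_WINDOW_SIZE);
    return (acc_ptr_t)(a - ACC_WINDOW_BASE);
}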


2018 - Hero: An open-source research platform for HW/SW exploration of heterogeneous manycore systems [Conference paper]
Kurth, A.; Capotondi, Alessandro; Vogel, P.; Benini, L.; Marongiu, A.
abstract

Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine “standard platform” software support (e.g. the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way. HERO’s host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source many-core processing engine, based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA implemented on an FPGA fabric is coupled with a hard ARM Cortex-A host processor, and show that the run time overhead compared to manually written PMCA code operating on private physical memory is lower than 10 % for pivotal benchmarks and operating conditions.


2018 - NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs [Journal article]
Meloni, P.; Capotondi, A.; Deriu, G.; Brian, M.; Conti, F.; Rossi, D.; Raffo, L.; Benini, L.
abstract

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: While the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.


2018 - Runtime Support for Multiple Offload-Based Programming Models on Clustered Manycore Accelerators [Journal article]
Capotondi, A; Marongiu, A; Benini, L
abstract

Heterogeneous systems coupling a main host processor with one or more manycore accelerators are being adopted virtually at every scale to achieve ever-increasing GOps/Watt targets. The increased hardware complexity of such systems is paired at the application level by a growing number of applications concurrently running on the system. Techniques that enable efficient accelerator resources sharing, supporting multiple programming models will thus be increasingly important for future heterogeneous SoCs. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of offloaded computation kernels from different programming models. The runtime supports spatial partitioning, where clusters can be grouped into several virtual accelerator instances. Our runtime design is modular and relies on a generic component for resource (cluster) scheduling, plus specialized components which deploy generic offload requests into the target programming model semantics. We evaluate the proposed runtime system on two real heterogeneous systems, focusing on two concrete use cases: i) single-user, multi-application high-end embedded systems and ii) multi-user, multi-workload low-power microservers. In the first case, our approach achieves 93% efficiency in terms of available accelerator resource exploitation. In the second case, our support allows 47% performance improvement compared to single-programming model systems.


2018 - The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores [Journal article]
Loi, Igor; Capotondi, Alessandro; Rossi, Davide; Marongiu, Andrea; Benini, Luca
abstract

High performance and extreme energy efficiency are strong requirements for a fast-growing number of edge-node Internet of Things (IoT) applications. While traditional Ultra-Low-Power designs rely on single-core micro-controllers (MCU), a new generation of architectures leveraging fully programmable tightly-coupled clusters of near-threshold processors is emerging, joining the performance gain of parallel execution over multiple cores with the energy efficiency of low-voltage operation. In this work we tackle one of the most critical energy-efficiency bottlenecks for these architectures: instruction memory hierarchy. Exploiting the instruction locality typical of data-parallel applications, we explore two different shared instruction cache architectures, based on energy-efficient latch-based memory banks: one leveraging a crossbar between processors and single-port banks (SP), and one leveraging banks with multiple read ports (MP). We evaluate the proposed architectures on a set of signal processing applications with different executable sizes and working-sets. The results show that the shared cache architectures are able to efficiently execute a much wider set of applications (including those featuring large memory footprint and irregular access patterns) with a much smaller area and with much better energy efficiency with respect to the private cache. The multi-port cache is suitable for sizes up to a few kB, improving performance by up to 40%, energy efficiency by up to 20%, and energy × area efficiency by up to 30% with respect to the private cache. The single-port solution is more suitable for larger cache sizes (up to 16 kB), providing up to 20% better energy × area efficiency than the multi-port, and up to 30% better energy efficiency than private cache.


2018 - Work-in-Progress: Quantized NNs as the Definitive Solution for Inference on Low-Power ARM MCUs? [Conference paper]
Rusci, M.; Capotondi, A.; Conti, F.; Benini, L.
abstract

High energy efficiency and low memory footprint are the key requirements for the deployment of deep learning based analytics on low-power microcontrollers. Here we present work-in-progress results with Q-bit Quantized Neural Networks (QNNs) deployed on a commercial Cortex-M7 class microcontroller by means of an extension to the ARM CMSIS-NN library. We show that i) for Q=4 and Q=2 low memory footprint QNNs can be deployed with an energy overhead of 30% and 36% respectively against the 8-bit CMSIS-NN due to the lack of quantization support in the ISA; ii) for Q=1 native instructions can be used, yielding an energy and latency reduction of ∼3.8× with respect to CMSIS-NN. Our initial results suggest that a small set of QNN-related specialized instructions could improve performance by as much as 7.5× for Q=4, 13.6× for Q=2 and 6.5× for binary NNs.
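
The overhead for Q=4 and Q=2 quoted above comes from the fact that, without ISA support, sub-byte operands must be unpacked with shifts and masks before every multiply-accumulate. A minimal C sketch of a Q=4 dot product, written here for illustration rather than taken from the CMSIS-NN extension:

#include <stdint.h>

/* Dot product between 4-bit signed weights (two per byte) and 8-bit activations.
 * n must be even; the nibble unpacking is the per-MAC overhead discussed above. */
int32_t dot_q4(const uint8_t *packed_w, const int8_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        uint8_t byte = packed_w[i / 2];
        int8_t w0 = (int8_t)(byte << 4) >> 4;   /* low nibble, sign-extended  */
        int8_t w1 = (int8_t)byte >> 4;          /* high nibble, sign-extended */
        acc += w0 * x[i] + w1 * x[i + 1];
    }
    return acc;
}

For Q=1, the multiply-accumulate degenerates into XNOR and popcount operations, which is why binary networks map well onto the existing instructions.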


2017 - Enabling zero-copy OpenMP offloading on the PULP many-core accelerator [Conference paper]
Capotondi, Alessandro; Marongiu, Andrea
abstract

Many-core heterogeneous designs are nowadays widely available among embedded systems. Initiatives such as the HSA push for a model where the host processor and the accelerator(s) communicate via coherent, Unified Virtual Memory (UVM). In this paper we describe our experience in porting the OpenMP v4 programming model to a low-end, heterogeneous embedded system based on the PULP many-core accelerator featuring lightweight (software-managed) UVM support. We describe a GCC-based toolchain which enables: i) the automatic generation of host and accelerator binaries from a single, high-level OpenMP parallel program; ii) the automatic instrumentation of the accelerator program to transparently manage UVM. This enables up to 4× faster execution compared to traditional copy-based offload mechanisms.
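
The gap between the two offload styles measured in the paper can be sketched with two variants of the same OpenMP target region: a copy-based one that maps the buffer back and forth, and a zero-copy one that, thanks to the shared (software-managed) UVM, lets the accelerator dereference the host buffer directly. The example assumes OpenMP 4.5 is_device_ptr semantics as a stand-in for the paper's GCC-based mechanism.

#include <stddef.h>

/* Copy-based offload: buf is copied to accelerator memory and back. */
void scale_copy_based(float *buf, size_t n, float k)
{
    #pragma omp target map(tofrom: buf[0:n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}

/* Zero-copy offload: the accelerator works on the very same buffer
 * through the unified virtual memory, so no copies are issued. */
void scale_zero_copy(float *buf, size_t n, float k)
{
    #pragma omp target is_device_ptr(buf)
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}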


2017 - HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA [Conference paper]
Kurth, Andreas; Vogel, Pirmin; Capotondi, Alessandro; Marongiu, Andrea; Benini, Luca
abstract


2016 - Controlling NUMA effects in embedded manycore applications with lightweight nested parallelism support [Journal article]
Marongiu, A; Capotondi, A; Benini, L
abstract

Embedded manycore architectures are often organized as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used, with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory operations non-uniform (NUMA). Due to NUMA, regular applications typically employed in the embedded domain (e.g., image processing, computer vision, etc.) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism represents a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available, whose overhead does not dominate the run-time of fine-grained applications; (ii) a mechanism to control thread binding at the cluster level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded manycores, integrating our primitives in the OpenMP runtime system and implementing a new directive to control NUMA-aware nested parallelism mapping. We explore on a set of real application use cases how NUMA makes regular parallel workloads behave as irregular ones, and how our approach makes it possible to control such effects and achieve up to 28× speedup versus flat parallelism.
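
The mapping problem described above can be sketched with standard OpenMP nested parallelism: an outer team with one thread per cluster and an inner team spanning the cores of that cluster, so each tile of data is touched only by the cluster that owns it. The paper's own NUMA-mapping directive is not reproduced here; proc_bind below is only the closest standard analogue.

#include <omp.h>

#define CLUSTERS 4
#define CORES    8

/* One tile per cluster: the outer level picks the cluster, the inner level
 * parallelizes over the cores of that cluster, keeping accesses local. */
void process_tiles(float *tiles[CLUSTERS], int tile_elems)
{
    omp_set_nested(1);
    #pragma omp parallel num_threads(CLUSTERS) proc_bind(spread)
    {
        int c = omp_get_thread_num();
        #pragma omp parallel num_threads(CORES) proc_bind(close)
        {
            #pragma omp for
            for (int i = 0; i < tile_elems; i++)
                tiles[c][i] *= 2.0f;
        }
    }
}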


2016 - On the effectiveness of OpenMP teams for cluster-based many-core accelerators [Conference paper]
Capotondi, Alessandro; Marongiu, Andrea
abstract

With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC-capable. Heterogeneous on-chip systems (SoC) that couple a general-purpose host processor to a many-core accelerator are becoming more and more widespread, and provide tremendous peak performance/watt, well suited to execute HPC-class programs. The increased computation potential is however traded off against ease of programming. Application developers are indeed required to manually deal with outlining code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to/from the accelerator. In addition, since most many-cores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer should also take care of properly mapping the parallel computation so as to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing it to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.
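
A minimal example of the cluster-aware constructs evaluated in the paper: target teams creates a league with (ideally) one team per accelerator cluster, distribute splits the iteration space across teams, and parallel for splits each chunk across the cores of a cluster. The SAXPY kernel below is just a placeholder workload.

/* SAXPY spread over clusters (teams) and cores (parallel for). */
void saxpy_teams(const float *x, float *y, float a, int n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}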


2016 - PULP: A parallel ultra low power platform for next generation IoT applications [Conference paper]
Rossi, Davide; Conti, Francesco; Marongiu, Andrea; Pullini, Antonio; Loi, Igor; Gautschi, Michael; Tagliavini, Giuseppe; Capotondi, Alessandro; Flatresse, Philippe; Benini, Luca
abstract

This article consists of a collection of slides from the authors' conference presentation.


2015 - Enabling Scalable and Fine-Grained Nested Parallelism on Embedded Many-cores [Conference paper]
Capotondi, Alessandro; Marongiu, Andrea; Benini, Luca
abstract


2015 - Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators [Conference paper]
Capotondi, Alessandro; Haugou, Germain; Marongiu, Andrea; Benini, Luca
abstract


2015 - Simplifying Many-Core-Based Heterogeneous SoC Programming with Offload Directives [Journal article]
Marongiu, Andrea; Capotondi, Alessandro; Tagliavini, Giuseppe; Benini, Luca
abstract

Multiprocessor systems-on-chip (MPSoC) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, they are programmed at the cost of major code rewrites with low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, very close performance to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.


2014 - Augmenting manycore programmable accelerators with photonic interconnect technology for the high-end embedded computing domain [Conference paper]
Balboni, Marco; Ortin Obon, Marta; Capotondi, Alessandro; Tatenguem Fankem, Hervé; Ghiribaldi, Alberto; Ramini, Luca; Viñals, Victor; Marongiu, Andrea; Bertozzi, Davide
abstract


2013 - Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP [Conference paper]
Marongiu, Andrea; Capotondi, Alessandro; Giuseppe, Tagliavini; Luca, Benini
abstract

Heterogeneous architectures based on one fast-clocked, moderately multicore "host" processor plus a many-core accelerator represent one promising way to satisfy the ever-increasing GOps/W requirements of embedded systems-on-chip. However, heterogeneous computing comes at the cost of increased programming complexity, requiring a major rewrite of the applications with a low-level programming style (e.g., OpenCL). In this paper we present a programming model, compiler and runtime system for a prototype board from STMicroelectronics featuring an ARM9 host and a STHORM many-core accelerator. The programming model is based on OpenMP, with additional directives to efficiently program the accelerator from a single host program. The proposed multi-ISA compilation toolchain hides the entire process of outlining an accelerator program, compiling and loading it to the STHORM platform, and implementing data sharing between the host and the accelerator. Our experimental results show that we achieve very close performance to hand-optimized OpenCL codes, at a significantly lower programming complexity.