Simone CALDERARA - personale UniMoRe

Nuova ricerca

Simone CALDERARA

Professore Ordinario
Dipartimento di Ingegneria "Enzo Ferrari"

Pubblicazioni

2024 - A Graph-Based Multi-Scale Approach with Knowledge Distillation for WSI Classification [Articolo su rivista]
Bontempo, Gianpaolo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa
abstract

The usage of Multi Instance Learning (MIL) for classifying Whole Slide Images (WSIs) has recently increased. Due to their gigapixel size, the pixel-level annotation of such data is extremely expensive and time-consuming, practically unfeasible. For this reason, multiple automatic approaches have been raised in the last years to support clinical practice and diagnosis. Unfortunately, most state-of-the-art proposals apply attention mechanisms without considering the spatial instance correlation and usually work on a single-scale resolution. To leverage the full potential of pyramidal structured WSI, we propose a graph-based multi-scale MIL approach, DAS-MIL. Our model comprises three modules: i) a self-supervised feature extractor, ii) a graph-based architecture that precedes the MIL mechanism and aims at creating a more contextualized representation of the WSI structure by considering the mutual (spatial) instance correlation both inter and intra-scale. Finally, iii) a (self) distillation loss between resolutions is introduced to compensate for their informative gap and significantly improve the final prediction. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining a +2.7% AUC and +3.7% accuracy on the popular Camelyon16 benchmark.

2024 - ClusterFix: A Cluster-Based Debiasing Approach without Protected-Group Supervision [Relazione in Atti di Convegno]
Capitani, Giacomo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa
abstract

The failures of Deep Networks can sometimes be ascribed to biases in the data or algorithmic choices. Existing debiasing approaches exploit prior knowledge to avoid unintended solutions; we acknowledge that, in real-world settings, it could be unfeasible to gather enough prior information to characterize the bias, or it could even raise ethical considerations. We hence propose a novel debiasing approach, termed ClusterFix, which does not require any external hint about the nature of biases. Such an approach alters the standard empirical risk minimization and introduces a per-example weight, encoding how critical and far from the majority an example is. Notably, the weights consider how difficult it is for the model to infer the correct pseudo-label, which is obtained in a self-supervised manner by dividing examples into multiple clusters. Extensive experiments show that the misclassification error incurred in identifying the correct cluster allows for identifying examples prone to bias-related issues. As a result, our approach outperforms existing methods on standard benchmarks for bias removal and fairness.

2024 - Semantic Residual Prompts for Continual Learning [Relazione in Atti di Convegno]
Menabue, Martin; Frascaroli, Emanuele; Boschini, Matteo; Sangineto, Enver; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone
abstract

Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and train a few parameter vectors termed prompts. Most of these methods organize these vectors in a pool of key-value pairs and use the input image as query to retrieve the prompts (values). However, as keys are learned while tasks progress, the prompting selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we leverage a foundation model (CLIP) to select our prompts within a two-level adaptation mechanism. Specifically, the first level leverages a standard textual prompt pool for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets. The codebase is available at https://github.com/aimagelab/mammoth.

2023 - Buffer-MIL: Robust Multi-instance Learning with a Buffer-Based Approach [Relazione in Atti di Convegno]
Bontempo, G.; Lumetti, L.; Porrello, A.; Bolelli, F.; Calderara, S.; Ficarra, E.
abstract

Histopathological image analysis is a critical area of research with the potential to aid pathologists in faster and more accurate diagnoses. However, Whole-Slide Images (WSIs) present challenges for deep learning frameworks due to their large size and lack of pixel-level annotations. Multi-Instance Learning (MIL) is a popular approach that can be employed for handling WSIs, treating each slide as a bag composed of multiple patches or instances. In this work we propose Buffer-MIL, which aims at tackling the covariate shift and class imbalance characterizing most of the existing histopathological datasets. With this goal, a buffer containing the most representative instances of each disease-positive slide of the training set is incorporated into our model. An attention mechanism is then used to compare all the instances against the buffer, to find the most critical ones in a given slide. We evaluate Buffer-MIL on two publicly available WSI datasets, Camelyon16 and TCGA lung cancer, outperforming current state-of-the-art models by 2.2% of accuracy on Camelyon16.

2023 - Class-Incremental Continual Learning into the eXtended DER-verse [Articolo su rivista]
Boschini, Matteo; Bonicelli, Lorenzo; Buzzega, Pietro; Porrello, Angelo; Calderara, Simone
abstract

The staple of human intelligence is the capability of acquiring knowledge in a continuous fashion. In stark contrast, Deep Networks forget catastrophically and, for this reason, the sub-field of Class-Incremental Continual Learning fosters methods that learn a sequence of tasks incrementally, blending sequentially-gained knowledge into a comprehensive prediction. This work aims at assessing and overcoming the pitfalls of our previous proposal Dark Experience Replay (DER), a simple and effective approach that combines rehearsal and Knowledge Distillation. Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the abilities to i) revise its replay memory to welcome novel information regarding past data ii) pave the way for learning yet unseen classes. We show that the application of these strategies leads to remarkable improvements; indeed, the resulting method – termed eXtended-DER (X-DER) – outperforms the state of the art on both standard benchmarks (such as CIFAR-100 and miniImageNet) and a novel one here introduced. To gain a better understanding, we further provide extensive ablation studies that corroborate and extend the findings of our previous research (e.g. the value of Knowledge Distillation and flatter minima in continual learning setups). We make our results fully reproducible; the codebase is available at https://github.com/aimagelab/mammoth.

2023 - Consistency-Based Self-supervised Learning for Temporal Anomaly Localization [Relazione in Atti di Convegno]
Panariello, A.; Porrello, A.; Calderara, S.; Cucchiara, R.
abstract

2023 - DAS-MIL: Distilling Across Scales for MILClassification of Histological WSIs [Relazione in Atti di Convegno]
Bontempo, Gianpaolo; Porrello, Angelo; Bolelli, Federico; Calderara, Simone; Ficarra, Elisa
abstract

The adoption of Multi-Instance Learning (MIL) for classifying Whole-Slide Images (WSIs) has increased in recent years. Indeed, pixel-level annotation of gigapixel WSI is mostly unfeasible and time-consuming in practice. For this reason, MIL approaches have been profitably integrated with the most recent deep-learning solutions for WSI classification to support clinical practice and diagnosis. Nevertheless, the majority of such approaches overlook the multi-scale nature of the WSIs; the few existing hierarchical MIL proposals simply flatten the multi-scale representations by concatenation or summation of features vectors, neglecting the spatial structure of the WSI. Our work aims to unleash the full potential of pyramidal structured WSI; to do so, we propose a graph-based multi-scale MIL approach, termed DAS-MIL, that exploits message passing to let information flows across multiple scales. By means of a knowledge distillation schema, the alignment between the latent space representation at different resolutions is encouraged while preserving the diversity in the informative content. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining a +1.9% AUC and +3.3¬curacy on the popular Camelyon16 benchmark.

2023 - Input Perturbation Reduces Exposure Bias in Diffusion Models [Relazione in Atti di Convegno]
Ning, M.; Sangineto, E.; Porrello, A.; Calderara, S.; Cucchiara, R.
abstract

Denoising Diffusion Probabilistic Models have shown an impressive generation quality although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is available at https://github.com/forever208/DDPM-IP.

2023 - Let's stay close: An examination of the effects of imagined contact on behavior toward children with disability [Articolo su rivista]
Cocco, V. M.; Bisagno, E.; Bernardo, G. A. D.; Bicocchi, N.; Calderara, S.; Palazzi, A.; Cucchiara, R.; Zambonelli, F.; Cadamuro, A.; Stathi, S.; Crisp, R.; Vezzali, L.
abstract

In line with current developments in indirect intergroup contact literature, we conducted a field study using the imagined contact paradigm among high-status (Italian children) and low-status (children with foreign origins) group members (N = 122; 53 females, mean age = 7.52 years). The experiment aimed to improve attitudes and behavior toward a different low-status group, children with disability. To assess behavior, we focused on an objective measure that captures the physical distance between participants and a child with disability over the course of a five-minute interaction (i.e., while playing together). Results from a 3-week intervention revealed that in the case of high-status children imagined contact, relative to a no-intervention control condition, improved outgroup attitudes and behavior, and strengthened helping and contact intentions. These effects however did not emerge among low-status children. The results are discussed in the context of intergroup contact literature, with emphasis on the implications of imagined contact for educational settings.

2023 - Neuro Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal [Working paper]
Marconato, Emanuele; Bontempo, Gianpaolo; Ficarra, Elisa; Calderara, Simone; Passerini, Andrea; Teso, Stefano
abstract

2023 - Neuro-Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal [Relazione in Atti di Convegno]
Marconato, E.; Bontempo, G.; Ficarra, E.; Calderara, S.; Passerini, A.; Teso, S.
abstract

We introduce Neuro-Symbolic Continual Learning, where a model has to solve a sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to high-level concepts and compute predictions by reasoning consistently with prior knowledge. Our key observation is that neuro-symbolic tasks, although different, often share concepts whose semantics remains stable over time. Traditional approaches fall short: existing continual strategies ignore knowledge altogether, while stock neuro-symbolic architectures suffer from catastrophic forgetting. We show that leveraging prior knowledge by combining neurosymbolic architectures with continual strategies does help avoid catastrophic forgetting, but also that doing so can yield models affected by reasoning shortcuts. These undermine the semantics of the acquired concepts, even when detailed prior knowledge is provided upfront and inference is exact, and in turn continual performance. To overcome these issues, we introduce COOL, a COncept-level cOntinual Learning strategy tailored for neuro-symbolic continual problems that acquires high-quality concepts and remembers them over time. Our experiments on three novel benchmarks highlights how COOL attains sustained high performance on neuro-symbolic continual learning tasks in which other strategies fail.

2023 - Spotting Virus from Satellites: Modeling the Circulation of West Nile Virus Through Graph Neural Networks [Articolo su rivista]
Bonicelli, Lorenzo; Porrello, Angelo; Vincenzi, Stefano; Ippoliti, Carla; Iapaolo, Federica; Conte, Annamaria; Calderara, Simone
abstract

2023 - TrackFlow: Multi-Object Tracking with Normalizing Flows [Relazione in Atti di Convegno]
Mancusi, Gianluca; Panariello, Aniello; Porrello, Angelo; Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
abstract

The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they imply these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.

2022 - Catastrophic Forgetting in Continual Concept Bottleneck Models [Relazione in Atti di Convegno]
Marconato, E.; Bontempo, G.; Teso, S.; Ficarra, E.; Calderara, S.; Passerini, A.
abstract

2022 - Continual semi-supervised learning through contrastive interpolation consistency [Articolo su rivista]
Boschini, Matteo; Buzzega, Pietro; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone
abstract

Continual Learning (CL) investigates how to train Deep Networks on a stream of tasks without incurring forgetting. CL settings proposed in literature assume that every incoming example is paired with ground-truth annotations. However, this clashes with many real-world applications: gathering labeled data, which is in itself tedious and expensive, becomes infeasible when data flow as a stream. This work explores Continual Semi-Supervised Learning (CSSL): here, only a small fraction of labeled input examples are shown to the learner. We assess how current CL methods (e.g.: EWC, LwF, iCaRL, ER, GDumb, DER) perform in this novel and challenging scenario, where overfitting entangles forgetting. Subsequently, we design a novel CSSL method that exploits metric learning and consistency regularization to leverage unlabeled examples while learning. We show that our proposal exhibits higher resilience to diminishing supervision and, even more surprisingly, relying only on supervision suffices to outperform SOTA methods trained under full supervision.

2022 - Effects of Auxiliary Knowledge on Continual Learning [Relazione in Atti di Convegno]
Bellitto, Giovanni; Pennisi, Matteo; Palazzo, Simone; Bonicelli, Lorenzo; Boschini, Matteo; Calderara, Simone; Spampinato, Concetto
abstract

In Continual Learning (CL), a neural network is trained on a stream of data whose distribution changes over time. In this context, the main problem is how to learn new information without forgetting old knowledge (i.e., Catastrophic Forgetting). Most existing CL approaches focus on finding solutions to preserve acquired knowledge, so working on the past of the model. However, we argue that as the model has to continually learn new tasks, it is also important to put focus on the present knowledge that could improve following tasks learning. In this paper we propose a new, simple, CL algorithm that focuses on solving the current task in a way that might facilitate the learning of the next ones. More specifically, our approach combines the main data stream with a secondary, diverse and uncorrelated stream, from which the network can draw auxiliary knowledge. This helps the model from different perspectives, since auxiliary data may contain useful features for the current and the next tasks and incoming task classes can be mapped onto auxiliary classes. Furthermore, the addition of data to the current task is implicitly making the classifier more robust as we are forcing the extraction of more discriminative features. Our method can outperform existing state-of-the-art models on the most common CL Image Classification benchmarks.

2022 - First Steps Towards 3D Pedestrian Detection and Tracking from Single Image [Relazione in Atti di Convegno]
Mancusi, G.; Fabbri, M.; Egidi, S.; Verasani, M.; Scarabelli, P.; Calderara, S.; Cucchiara, R.
abstract

Since decades, the problem of multiple people tracking has been tackled leveraging 2D data only. However, people moves and interact in a three-dimensional space. For this reason, using only 2D data might be limiting and overly challenging, especially due to occlusions and multiple overlapping people. In this paper, we take advantage of 3D synthetic data from the novel MOTSynth dataset, to train our proposed 3D people detector, whose observations are fed to a tracker that works in the corresponding 3D space. Compared to conventional 2D trackers, we show an overall improvement in performance with a reduction of identity switches on both real and synthetic data. Additionally, we propose a tracker that jointly exploits 3D and 2D data, showing an improvement over the proposed baselines. Our experiments demonstrate that 3D data can be beneficial, and we believe this paper will pave the road for future efforts in leveraging 3D data for tackling multiple people tracking. The code is available at (https://github.com/GianlucaMancusi/LoCO-Det ).

2022 - How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting [Relazione in Atti di Convegno]
Monti, A.; Porrello, A.; Calderara, S.; Coscia, P.; Ballan, L.; Cucchiara, R.
abstract

Accurate prediction of future human positions is an essential task for modern video-surveillance systems. Current state-of-the-art models usually rely on a "history" of past tracked locations (e.g., 3 to 5 seconds) to predict a plausible sequence of future locations (e.g., up to the next 5 seconds). We feel that this common schema neglects critical traits of realistic applications: as the collection of input trajectories involves machine perception (i.e., detection and tracking), incorrect detection and fragmentation errors may accumulate in crowded scenes, leading to tracking drifts. On this account, the model would be fed with corrupted and noisy input data, thus fatally affecting its prediction performance.In this regard, we focus on delivering accurate predictions when only few input observations are used, thus potentially lowering the risks associated with automatic perception. To this end, we conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one, the latter fed with fewer observations (just two ones). We show that a properly defined teacher supervision allows a student network to perform comparably to state-of-the-art approaches that demand more observations. Besides, extensive experiments on common trajectory forecasting datasets highlight that our student network better generalizes to unseen scenarios.

2022 - Learning the Quality of Machine Permutations in Job Shop Scheduling [Articolo su rivista]
Corsini, A.; Calderara, S.; Dell'Amico, M.
abstract

In recent years, the power demonstrated by Machine Learning (ML) has increasingly attracted the interest of the optimization community that is starting to leverage ML for enhancing and automating the design of algorithms. One combinatorial optimization problem recently tackled with ML is the Job Shop scheduling Problem (JSP). Most of the works on the JSP using ML focus on Deep Reinforcement Learning (DRL), and only a few of them leverage supervised learning techniques. The recurrent reasons for avoiding supervised learning seem to be the difficulty in casting the right learning task, i.e., what is meaningful to predict, and how to obtain labels. Therefore, we first propose a novel supervised learning task that aims at predicting the quality of machine permutations. Then, we design an original methodology to estimate this quality, and we use these estimations to create an accurate sequential deep learning model (binary accuracy above 95%). Finally, we empirically demonstrate the value of predicting the quality of machine permutations by enhancing the performance of a simple Tabu Search algorithm inspired by the works in the literature.

2022 - On the Effectiveness of Lipschitz-Driven Rehearsal in Continual Learning [Relazione in Atti di Convegno]
Bonicelli, Lorenzo; Boschini, Matteo; Porrello, Angelo; Spampinato, Concetto; Calderara, Simone
abstract

Rehearsal approaches enjoy immense popularity with Continual Learning (CL) practitioners. These methods collect samples from previously encountered data distributions in a small memory buffer; subsequently, they repeatedly optimize on the latter to prevent catastrophic forgetting. This work draws attention to a hidden pitfall of this widespread practice: repeated optimization on a small pool of data inevitably leads to tight and unstable decision boundaries, which are a major hindrance to generalization. To address this issue, we propose Lipschitz-DrivEn Rehearsal (LiDER), a surrogate objective that induces smoothness in the backbone network by constraining its layer-wise Lipschitz constants w.r.t. replay examples. By means of extensive experiments, we show that applying LiDER delivers a stable performance gain to several state-of-the-art rehearsal CL methods across multiple datasets, both in the presence and absence of pre-training. Through additional ablative experiments, we highlight peculiar aspects of buffer overfitting in CL and better characterize the effect produced by LiDER. Code is available at https://github.com/aimagelab/LiDER

2022 - SeeFar: Vehicle Speed Estimation and Flow Analysis from a Moving UAV [Relazione in Atti di Convegno]
Ning, M.; Ma, X.; Lu, Y.; Calderara, S.; Cucchiara, R.
abstract

Visual perception from drones has been largely investigated for Intelligent Traffic Monitoring System (ITMS) recently. In this paper, we introduce SeeFar to achieve vehicle speed estimation and traffic flow analysis based on YOLOv5 and DeepSORT from a moving drone. SeeFar differs from previous works in three key ways: the speed estimation and flow analysis components are integrated into a unified framework; our method of predicting car speed has the least constraints while maintaining a high accuracy; our flow analysor is direction-aware and outlier-aware. Specifically, we design the speed estimator only using the camera imaging geometry, where the transformation between world space and image space is completed by the variable Ground Sampling Distance. Besides, previous papers do not evaluate their speed estimators at scale due to the difficulty of obtaining the ground truth, we therefore propose a simple yet efficient approach to estimate the true speeds of vehicles via the prior size of the road signs. We evaluate SeeFar on our ten videos that contain 929 vehicle samples. Experiments on these sequences demonstrate the effectiveness of SeeFar by achieving 98.0% accuracy of speed estimation and 99.1% accuracy of traffic volume prediction, respectively.

2022 - Transfer without Forgetting [Relazione in Atti di Convegno]
Boschini, Matteo; Bonicelli, Lorenzo; Porrello, Angelo; Bellitto, Giovanni; Pennisi, Matteo; Palazzo, Simone; Spampinato, Concetto; Calderara, Simone
abstract

This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach building upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, averaging a 4.81% gain in Class-Incremental accuracy over a variety of datasets and different buffer sizes.

2022 - Warp and Learn: Novel Views Generation for Vehicles and Other Objects [Articolo su rivista]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
abstract

In this work we introduce a new self-supervised, semi-parametric approach for synthesizing novel views of a vehicle starting from a single monocular image.Differently from parametric (i.e. entirely learning-based) methods, we show how a-priori geometric knowledge about the object and the 3D world can be successfully integrated into a deep learning based image generation framework. As this geometric component is not learnt, we call our approach semi-parametric.In particular, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) is then trained to generate a realistic image starting from this geometric guidance.This blend between parametric and non-parametric components allows us to i) operate in a real-world scenario, ii) preserve high-frequency visual information such as textures, iii) handle truly arbitrary 3D roto-translations of the input and iv) perform shape transfer to completely different 3D models. Eventually, we show that our approach can be easily complemented with synthetic data and extended to other rigid objects with completely different topology, even in presence of concave structures and holes.A comprehensive experimental analysis against state-of-the-art competitors shows the efficacy of our method both from a quantitative and a perceptive point of view.

2021 - AC-VRNN: Attentive Conditional-VRNN for multi-future trajectory prediction [Articolo su rivista]
Bertugli, A.; Calderara, S.; Coscia, P.; Ballan, L.; Cucchiara, R.
abstract

Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications. A key component of this task is represented by the inherently multi-modal nature of human paths which makes socially acceptable multiple futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing most likely moving directions and forcing the model to consider past observed dynamics in generating future positions. Human interactions are modelled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods.

2021 - Avalanche: An end-to-end library for continual learning [Relazione in Atti di Convegno]
Lomonaco, V.; Pellegrini, L.; Cossu, A.; Carta, A.; Graffieti, G.; Hayes, T. L.; De Lange, M.; Masana, M.; Pomponi, J.; Van De Ven, G. M.; Mundt, M.; She, Q.; Cooper, K.; Forest, J.; Belouadah, E.; Calderara, S.; Parisi, G. I.; Cuzzolin, F.; Tolias, A. S.; Scardapane, S.; Antiga, L.; Ahmad, S.; Popescu, A.; Kanan, C.; Van De Weijer, J.; Tuytelaars, T.; Bacciu, D.; Maltoni, D.
abstract

Learning continually from non-stationary data streams is a long-standing goal and a challenging problem in machine learning. Recently, we have witnessed a renewed and fast-growing interest in continual learning, especially within the deep learning community. However, algorithmic solutions are often difficult to re-implement, evaluate and port across different settings, where even results on standard benchmarks are hard to reproduce. In this work, we propose Avalanche, an open-source end-to-end library for continual learning research based on PyTorch. Avalanche is designed to provide a shared and collaborative codebase for fast prototyping, training, and reproducible evaluation of continual learning algorithms.

2021 - DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting [Relazione in Atti di Convegno]
Monti, Alessio; Bertugli, Alessia; Calderara, Simone; Cucchiara, Rita
abstract

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address the aforementioned aspects by proposing a new recurrent generative model that considers both single agents' future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and to integrate it with data about agents' possible future objectives. Our proposal is general enough to be applied to different scenarios: the model achieves state-of-the-art results in both urban environments and also in sports applications.

2021 - Extracting accurate long-term behavior changes from a large pig dataset [Relazione in Atti di Convegno]
Bergamini, L.; Pini, S.; Simoni, A.; Vezzani, R.; Calderara, S.; Eath, R. B. D.; Fisher, R. B.
abstract

Visual observation of uncontrolled real-world behavior leads to noisy observations, complicated by occlusions, ambiguity, variable motion rates, detection and tracking errors, slow transitions between behaviors, etc. We show in this paper that reliable estimates of long-term trends can be extracted given enough data, even though estimates from individual frames may be noisy. We validate this concept using a new public dataset of approximately 20+ million daytime pig observations over 6 weeks of their main growth stage, and we provide annotations for various tasks including 5 individual behaviors. Our pipeline chains detection, tracking and behavior classification combining deep and shallow computer vision techniques. While individual detections may be noisy, we show that long-term behavior changes can still be extracted reliably, and we validate these results qualitatively on the full dataset. Eventually, starting from raw RGB video data we are able to both tell what pigs main daily activities are, and how these change through time.

2021 - Future Urban Scenes Generation Through Vehicles Synthesis [Relazione in Atti di Convegno]
Simoni, Alessandro; Bergamini, Luca; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two stages approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm; i.e. generating a synthetic representation of an object undergoing a geometrical roto-translation in the 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user itself. This allows us to generate a set of diverse realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real world dataset.

2021 - MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking? [Relazione in Atti di Convegno]
Fabbri, Matteo; Braso, Guillem; Maugeri, Gianluca; Cetintas, Orcun; Gasparini, Riccardo; Osep, Aljosa; Calderara, Simone; Leal-Taixe, Laura; Cucchiara, Rita
abstract

2021 - RMS-Net: Regression and Masking for Soccer Event Spotting [Relazione in Atti di Convegno]
Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita
abstract

2021 - The color out of space: learning self-supervised representations for Earth Observation imagery [Relazione in Atti di Convegno]
Vincenzi, Stefano; Porrello, Angelo; Buzzega, Pietro; Cipriano, Marco; Fronte, Pietro; Cuccu, Roberto; Ippoliti, Carla; Conte, Annamaria; Calderara, Simone
abstract

The recent growth in the number of satellite images fosters the development of effective deep-learning techniques for Remote Sensing (RS). However, their full potential is untapped due to the lack of large annotated datasets. Such a problem is usually countered by fine-tuning a feature extractor that is previously trained on the ImageNet dataset. Unfortunately, the domain of natural images differs from the RS one, which hinders the final performance. In this work, we propose to learn meaningful representations from satellite imagery, leveraging its high-dimensionality spectral bands to reconstruct the visible colors. We conduct experiments on land cover classification (BigEarthNet) and West Nile Virus detection, showing that colorization is a solid pretext task for training a feature extractor. Furthermore, we qualitatively observe that guesses based on natural images and colorization rely on different parts of the input. This paves the way to an ensemble model that eventually outperforms both the above-mentioned techniques.

2021 - Training convolutional neural networks to score pneumonia in slaughtered pigs [Articolo su rivista]
Bonicelli, L.; Trachtman, A. R.; Rosamilia, A.; Liuzzo, G.; Hattab, J.; Alcaraz, E. M.; Del Negro, E.; Vincenzi, S.; Dondona, A. C.; Calderara, S.; Marruchella, G.
abstract

The slaughterhouse can act as a valid checkpoint to estimate the prevalence and the economic impact of diseases in farm animals. At present, scoring lesions is a challenging and time‐consuming activity, which is carried out by veterinarians serving the slaughter chain. Over recent years, artificial intelligence(AI) has gained traction in many fields of research, including livestock production. In particular, AI‐based methods appear able to solve highly repetitive tasks and to consistently analyze large amounts of data, such as those collected by veterinarians during postmortem inspection in high‐throughput slaughterhouses. The present study aims to develop an AI‐based method capable of recognizing and quantifying enzootic pneumonia‐like lesions on digital images captured from slaughtered pigs under routine abattoir conditions. Overall, the data indicate that the AI‐based method proposed herein could properly identify and score enzootic pneumonia‐like lesions without interfering with the slaughter chain routine. According to European legislation, the application of such a method avoids the handling of carcasses and organs, decreasing the risk of microbial contamination, and could provide further alternatives in the field of food hygiene.

2021 - Video action detection by learning graph-based spatio-temporal interactions [Articolo su rivista]
Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita
abstract

Action Detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, a deeper focus has been added on relationship modelling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure which can connect entities from consecutive clips, thus considering long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.

2020 - Anomaly Detection for Vision-based Railway Inspection [Relazione in Atti di Convegno]
Gasparini, Riccardo; Pini, Stefano; Borghi, Guido; Scaglione, Giuseppe; Calderara, Simone; Fedeli, Eugenio; Cucchiara, Rita
abstract

2020 - Anomaly Detection, Localization and Classification for Railway Inspection [Relazione in Atti di Convegno]
Gasparini, Riccardo; D'Eusanio, Andrea; Borghi, Guido; Pini, Stefano; Scaglione, Giuseppe; Calderara, Simone; Fedeli, Eugenio; Cucchiara, Rita
abstract

2020 - Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation [Relazione in Atti di Convegno]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene.

2020 - Conditional Channel Gated Networks for Task-Aware Continual Learning [Relazione in Atti di Convegno]
Abati, Davide; Tomczak, Jakub; Blankevoort, Tijmen; Calderara, Simone; Cucchiara, Rita; Bejnordi, Babak Ehteshami
abstract

2020 - Dark Experience for General Continual Learning: a Strong, Simple Baseline [Relazione in Atti di Convegno]
Buzzega, Pietro; Boschini, Matteo; Porrello, Angelo; Abati, Davide; Calderara, Simone
abstract

Continual Learning has inspired a plethora of approaches and evaluation settings; however, the majority of them overlooks the properties of a practical scenario, where the data stream cannot be shaped as a sequence of tasks and offline training is not viable. We work towards General Continual Learning (GCL), where task boundaries blur and the domain and class distributions shift either gradually or suddenly. We address it through mixing rehearsal with knowledge distillation and regularization; our simple baseline, Dark Experience Replay, matches the network's logits sampled throughout the optimization trajectory, thus promoting consistency with its past. By conducting an extensive analysis on both standard benchmarks and a novel GCL evaluation setting (MNIST-360), we show that such a seemingly simple baseline outperforms consolidated approaches and leverages limited resources. We further explore the generalization capabilities of our objective, showing its regularization being beneficial beyond mere performance.

2020 - Deep learning-based method for vision-guided robotic grasping of unknown objects [Articolo su rivista]
Bergamini, L.; Sposato, M.; Pellicciari, M.; Peruzzini, M.; Calderara, S.; Schmidt, J.
abstract

Nowadays, robots are heavily used in factories for different tasks, most of them including grasping and manipulation of generic objects in unstructured scenarios. In order to better mimic a human operator involved in a grasping action, where he/she needs to identify the object and detect an optimal grasp by means of visual information, a widely adopted sensing solution is Artificial Vision. Nonetheless, state-of-art applications need long training and fine-tuning for manually build the object's model that is used at run-time during the normal operations, which reduce the overall operational throughput of the robotic system. To overcome such limits, the paper presents a framework based on Deep Convolutional Neural Networks (DCNN) to predict both single and multiple grasp poses for multiple objects all at once, using a single RGB image as input. Thanks to a novel loss function, our framework is trained in an end-to-end fashion and matches state-of-art accuracy with a substantially smaller architecture, which gives unprecedented real-time performances during experimental tests, and makes the application reliable for working on real robots. The system has been implemented using the ROS framework and tested on a Baxter collaborative robot.

2020 - Face-from-Depth for Head Pose Estimation on Depth Images [Articolo su rivista]
Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

Depth cameras allow to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make unusable common RGB sensors. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performances. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method overcomes several recent state-of-art works based on both intensity and depth input data, running in real-time at more than 30 frames per second.

2020 - Predicting WNV circulation in Italy using earth observation data and extreme gradient boosting model [Articolo su rivista]
Candeloro, L.; Ippoliti, C.; Iapaolo, F.; Monaco, F.; Morelli, D.; Cuccu, R.; Fronte, P.; Calderara, S.; Vincenzi, S.; Porrello, A.; D'Alterio, N.; Calistri, P.; Conte, A.
abstract

West Nile Disease (WND) is one of the most spread zoonosis in Italy and Europe caused by a vector-borne virus. Its transmission cycle is well understood, with birds acting as the primary hosts and mosquito vectors transmitting the virus to other birds, while humans and horses are occasional dead-end hosts. Identifying suitable environmental conditions across large areas containing multiple species of potential hosts and vectors can be difficult. The recent and massive availability of Earth Observation data and the continuous development of innovative Machine Learning methods can contribute to automatically identify patterns in big datasets and to make highly accurate identification of areas at risk. In this paper, we investigated the West Nile Virus (WNV) circulation in relation to Land Surface Temperature, Normalized Difference Vegetation Index and Surface Soil Moisture collected during the 160 days before the infection took place, with the aim of evaluating the predictive capacity of lagged remotely sensed variables in the identification of areas at risk for WNV circulation. WNV detection in mosquitoes, birds and horses in 2017, 2018 and 2019, has been collected from the National Information System for Animal Disease Notification. An Extreme Gradient Boosting model was trained with data from 2017 and 2018 and tested for the 2019 epidemic, predicting the spatio-temporal WNV circulation two weeks in advance with an overall accuracy of 0.84. This work lays the basis for a future early warning system that could alert public authorities when climatic and environmental conditions become favourable to the onset and spread of WNV.

2020 - Rethinking Experience Replay: a Bag of Tricks for Continual Learning [Relazione in Atti di Convegno]
Buzzega, Pietro; Boschini, Matteo; Porrello, Angelo; Calderara, Simone
abstract

In Continual Learning, a Neural Network is trained on a stream of data whose distribution shifts over time. Under these assumptions, it is especially challenging to improve on classes appearing later in the stream while remaining accurate on previous ones. This is due to the infamous problem of catastrophic forgetting, which causes a quick performance degradation when the classifier focuses on learning new categories. Recent literature proposed various approaches to tackle this issue, often resorting to very sophisticated techniques. In this work, we show that naïve rehearsal can be patched to achieve similar performance. We point out some shortcomings that restrain Experience Replay (ER) and propose five tricks to mitigate them. Experiments show that ER, thus enhanced, displays an accuracy gain of 51.2 and 26.9 percentage points on the CIFAR-10 and CIFAR-100 datasets respectively (memory buffer size 1000). As a result, it surpasses current state-of-the-art rehearsal-based methods.

2020 - Robust Re-Identification by Multiple Views Knowledge Distillation [Relazione in Atti di Convegno]
Porrello, Angelo; Bergamini, Luca; Calderara, Simone
abstract

To achieve robustness in Re-Identification, standard methods leverage tracking information in a Video-To-Video fashion. However, these solutions face a large drop in performance for single image queries (e.g., Image-To-Video setting). Recent works address this severe degradation by transferring temporal information from a Video-based network to an Image-based one. In this work, we devise a training strategy that allows the transfer of a superior knowledge, arising from a set of views depicting the target object. Our proposal - Views Knowledge Distillation (VKD) - pins this visual variety as a supervision signal within a teacher-student framework, where the teacher educates a student who observes fewer views. As a result, the student outperforms not only its teacher but also the current state-of-the-art in Image-To-Video by a wide margin (6.3% mAP on MARS, 8.6% on Duke-Video-ReId and 5% on VeRi-776). A thorough analysis - on Person, Vehicle and Animal Re-ID - investigates the properties of VKD from a qualitatively and quantitatively perspective.

2020 - Scoring pleurisy in slaughtered pigs using convolutional neural networks [Articolo su rivista]
Trachtman, A. R.; Bergamini, L.; Palazzi, A.; Porrello, A.; Capobianco Dondona, A.; Del Negro, E.; Paolini, A.; Vignola, G.; Calderara, S.; Marruchella, G.
abstract

Diseases of the respiratory system are known to negatively impact the profitability of the pig industry, worldwide. Considering the relatively short lifespan of pigs, lesions can be still evident at slaughter, where they can be usefully recorded and scored. Therefore, the slaughterhouse represents a key check-point to assess the health status of pigs, providing unique and valuable feedback to the farm, as well as an important source of data for epidemiological studies. Although relevant, scoring lesions in slaughtered pigs represents a very time-consuming and costly activity, thus making difficult their systematic recording. The present study has been carried out to train a convolutional neural network-based system to automatically score pleurisy in slaughtered pigs. The automation of such a process would be extremely helpful to enable a systematic examination of all slaughtered livestock. Overall, our data indicate that the proposed system is well able to differentiate half carcasses affected with pleurisy from healthy ones, with an overall accuracy of 85.5%. The system was better able to recognize severely affected half carcasses as compared with those showing less severe lesions. The training of convolutional neural networks to identify and score pneumonia, on the one hand, and the achievement of trials in large capacity slaughterhouses, on the other, represent the natural pursuance of the present study. As a result, convolutional neural network-based technologies could provide a fast and cheap tool to systematically record lesions in slaughtered pigs, thus supplying an enormous amount of useful data to all stakeholders in the pig industry.

2019 - A Deep-learning-based approach to VM behavior Identification in Cloud Systems [Relazione in Atti di Convegno]
Stefanini, M.; Lancellotti, R.; Baraldi, L.; Calderara, S.
abstract

2019 - Can adversarial networks hallucinate occluded people with a plausible aspect? [Articolo su rivista]
Fulgeri, F.; Fabbri, Matteo; Alletto, Stefano; Calderara, S.; Cucchiara, R.
abstract

When you see a person in a crowd, occluded by other persons, you miss visual information that can be used to recognize, re-identify or simply classify him or her. You can imagine its appearance given your experience, nothing more. Similarly, AI solutions can try to hallucinate missing information with specific deep learning architectures, suitably trained with people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version in input, that should be a) without occlusion b) similar at pixel level to a completely visible people shape c) capable to conserve similar visual attributes (e.g. male/female) of the original one. For the purpose, we propose a new approach by integrating the state-of-the-art of neural network architectures, namely U-nets and GANs, as well as discriminative attribute classification nets, with an architecture specifically designed to de-occlude people shapes. The network is trained to optimize a Loss function which could take into account the aforementioned objectives. As well we propose two datasets for testing our solution: the first one, occluded RAP, created automatically by occluding real shapes of the RAP dataset created by Li et al. (2016) (which collects also attributes of the people aspect); the second is a large synthetic dataset, AiC, generated in computer graphics with data extracted from the GTA video game, that contains 3D data of occluded objects by construction. Results are impressive and outperform any other previous proposal. This result could be an initial step to many further researches to recognize people and their behavior in an open crowded world.

2019 - Classifying Signals on Irregular Domains via Convolutional Cluster Pooling [Relazione in Atti di Convegno]
Porrello, Angelo; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.

2019 - End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization [Relazione in Atti di Convegno]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
abstract

Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object on the image according to its predicted pose; in this way the alignment error between the observed and the re-projected object silhouette can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Eventually we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently and novel viewpoints in which both objects arrangement and mutual pose are preserved can be rendered. Differentiable renderer code is available at:https://github.com/ndrplz/tensorflow-mesh-renderer.

2019 - Gait-Based Diplegia Classification Using LSMT Networks [Articolo su rivista]
Ferrari, Alberto; Bergamini, Luca; Guerzoni, Giorgio; Calderara, Simone; Bicocchi, Nicola; Vitetta, Giorgio; Borghi, Corrado; Neviani, Rita; Ferrari, Adriano
abstract

Diplegia is a specific subcategory of the wide spectrum of motion disorders gathered under the name of cerebral palsy. Recent works proposed to use gait analysis for diplegia classification paving the way for automated analysis. A clinically established gait-based classification system divides diplegic patients into 4 main forms, each one associated with a peculiar walking pattern. In this work, we apply two different deep learning techniques, namely, multilayer perceptron and recurrent neural networks, to automatically classify children into the 4 clinical forms. For the analysis, we used a dataset comprising gait data of 174 patients collected by means of an optoelectronic system. The measurements describing walking patterns have been processed to extract 27 angular parameters and then used to train both kinds of neural networks. Classification results are comparable with those provided by experts in 3 out of 4 forms.

2019 - Latent Space Autoregression for Novelty Detection [Relazione in Atti di Convegno]
Abati, Davide; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita
abstract

Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity. Despite its importance in different application settings, designing a novelty detector is utterly complex due to the unpredictable nature of novelties and its inaccessibility during the training procedure, factors which expose the unsupervised nature of the problem. In our proposal, we design a general framework where we equip a deep autoencoder with a parametric density estimator that learns the probability distribution underlying its latent representations through an autoregressive procedure. We show that a maximum likelihood objective, optimized in conjunction with the reconstruction of normal samples, effectively acts as a regularizer for the task at hand, by minimizing the differential entropy of the distribution spanned by latent vectors. In addition to providing a very general formulation, extensive experiments of our model on publicly available datasets deliver on-par or superior performances if compared to state-of-the-art methods in one-class and video anomaly detection settings. Differently from prior works, our proposal does not make any assumption about the nature of the novelties, making our work readily applicable to diverse contexts.

2019 - METODO DI VALUTAZIONE DI UNO STATO DI SALUTE DI UN ELEMENTO ANATOMICO, RELATIVO DISPOSITIVO DI VALUTAZIONE E RELATIVO SISTEMA DI VALUTAZIONE [Brevetto]
Giuseppe, Marrucchella; Bergamini, Luca; Porrello, Angelo; Del Negro, Ercole; Capobianco Dondona, Andrea; Di Tondo, Francesco; Calderara, Simone
abstract

Sistema in grado di rilevare le lesioni delle mezzene al macello attraverso l'utilizzo di tecniche di deep learning per individuazioni del tipo di lesioni presenti

2019 - Predicting the Driver's Focus of Attention: the DR(eye)VE Project [Articolo su rivista]
Palazzi, Andrea; Abati, Davide; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

Predicting the Driver's Focus of Attention: the DR(eye)VE Project Andrea Palazzi, Davide Abati, Simone Calderara, Francesco Solera, Rita Cucchiara (Submitted on 10 May 2017 (v1), last revised 6 Jun 2018 (this version, v3)) In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from roof-mounted camera), further enriched by other sensors measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.

2019 - Segmentation Guided Scoring of Pathological Lesions in Swine Through CNNs [Relazione in Atti di Convegno]
Bergamini, L.; Trachtman, A. R.; Palazzi, A.; Negro, E. D.; Capobianco Dondona, A.; Marruchella, G.; Calderara, S.
abstract

The slaughterhouse is widely recognised as a useful checkpoint for assessing the health status of livestock. At the moment, this is implemented through the application of scoring systems by human experts. The automation of this process would be extremely helpful for veterinarians to enable a systematic examination of all slaughtered livestock, positively influencing herd management. However, such systems are not yet available, mainly because of a critical lack of annotated data. In this work we: (i) introduce a large scale dataset to enable the development and benchmarking of these systems, featuring more than 4000 high-resolution swine carcass images annotated by domain experts with pixel-level segmentation; (ii) exploit part of this annotation to train a deep learning model in the task of pleural lesion scoring. In this setting, we propose a segmentation-guided framework which stacks together a fully convolutional neural network performing semantic segmentation with a rule-based classifier integrating a-priori veterinary knowledge in the process. Thorough experimental analysis against state-of-the-art baselines proves our method to be superior both in terms of accuracy and in terms of model interpretability. Code and dataset are publicly available here: https://github.com/lucabergamini/swine-lesion-scoring.

2019 - Self-Supervised Optical Flow Estimation by Projective Bootstrap [Articolo su rivista]
Alletto, Stefano; Abati, Davide; Calderara, Simone; Cucchiara, Rita; Rigazio, Luca
abstract

Dense optical flow estimation is complex and time consuming, with state-of-the-art methods relying either on large synthetic data sets or on pipelines requiring up to a few minutes per frame pair. In this paper, we address the problem of optical flow estimation in the automotive scenario in a self-supervised manner. We argue that optical flow can be cast as a geometrical warping between two successive video frames and devise a deep architecture to estimate such transformation in two stages. First, a dense pixel-level flow is computed with a projective bootstrap on rigid surfaces. We show how such global transformation can be approximated with a homography and extend spatial transformer layers so that they can be employed to compute the flow field implied by such transformation. Subsequently, we refine the prediction by feeding a second, deeper network that accounts for moving objects. A final reconstruction loss compares the warping of frame Xₜ with the subsequent frame Xₜ₊₁ and guides both estimates. The model has the speed advantages of end-to-end deep architectures while achieving competitive performances, both outperforming recent unsupervised methods and showing good generalization capabilities on new automotive data sets.

2019 - Spotting Insects from Satellites: Modeling the Presence of Culicoides Imicola Through Deep CNNs [Relazione in Atti di Convegno]
Vincenzi, Stefano; Porrello, Angelo; Buzzega, Pietro; Conte, Annamaria; Ippoliti, Carla; Candeloro, Luca; Di Lorenzo, Alessio; Capobianco Dondona, Andrea; Calderara, Simone
abstract

Nowadays, Vector-Borne Diseases (VBDs) raise a severe threat for public health, accounting for a considerable amount of human illnesses. Recently, several surveillance plans have been put in place for limiting the spread of such diseases, typically involving on-field measurements. Such a systematic and effective plan still misses, due to the high costs and efforts required for implementing it. Ideally, any attempt in this field should consider the triangle vectors-host-pathogen, which is strictly linked to the environmental and climatic conditions. In this paper, we exploit satellite imagery from Sentinel-2 mission, as we believe they encode the environmental factors responsible for the vector's spread. Our analysis - conducted in a data-driver fashion - couples spectral images with ground-truth information on the abundance of Culicoides imicola. In this respect, we frame our task as a binary classification problem, underpinning Convolutional Neural Networks (CNNs) as being able to learn useful representation from multi-band images. Additionally, we provide a multi-instance variant, aimed at extracting temporal patterns from a short sequence of spectral images. Experiments show promising results, providing the foundations for novel supportive tools, which could depict where surveillance and prevention measures could be prioritized.

2018 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Articolo su rivista]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we discuss the effectiveness of convolutional neural networks (CNNs) models in saliency prediction. We present a set of Deep Learning architectures developed by us, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We will show how these deep networks closely recall the early saliency models, although improved with the semantics learned from the human ground-truth. Eventually, we will present a use-case in which saliency prediction is used to improve the automatic description of images.

2018 - Comportamento non verbale intergruppi “oggettivo”: una replica dello studio di Dovidio, kawakami e Gaertner (2002) [Abstract in Atti di Convegno]
Di Bernardo, Gian Antonio; Vezzali, Loris; Giovannini, Dino; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia; Cocco, Veronica Margherita
abstract

Vi è una lunga tradizione di ricerca che ha analizzato il comportamento non verbale, anche considerando relazioni intergruppi. Solitamente, questi studi si avvalgono di valutazioni di coder esterni, che tuttavia sono soggettive e aperte a distorsioni. Abbiamo condotto uno studio in cui si è preso come riferimento il celebre studio di Dovidio, Kawakami e Gaertner (2002), apportando tuttavia alcune modifiche e considerando la relazione tra bianchi e neri. Partecipanti bianchi, dopo aver completato misure di pregiudizio esplicito e implicito, incontravano (in ordine contro-bilanciato) un collaboratore bianco e uno nero. Con ognuno di essi, parlavano per tre minuti di un argomento neutro e di un argomento saliente per la distinzione di gruppo (in ordine contro-bilanciato). Tali interazioni erano registrate con una telecamera kinect, che è in grado di tenere conto della componente tridimensionale del movimento. I risultati hanno rivelato vari elementi di interesse. Anzitutto, si sono creati indici oggettivi, a partire da un’analisi della letteratura, alcuni dei quali non possono essere rilevati da coder esterni, quali distanza interpersonale e volume di spazio tra le persone. I risultati hanno messo in luce alcuni aspetti rilevanti: (1) l’atteggiamento implicito è associato a vari indici di comportamento non verbale, i quali mediano sulle valutazioni dei partecipanti fornite dai collaboratori; (2) le interazioni vanno considerate in maniera dinamica, tenendo conto che si sviluppano nel tempo; (3) ciò che può essere importante è il comportamento non verbale globale, piuttosto che alcuni indici specifici pre-determinati dagli sperimentatori.

2018 - Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face [Relazione in Atti di Convegno]
Fabbri, Matteo; Borghi, Guido; Lanzi, Fabio; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

Can faces acquired by low-cost depth sensors be useful to see some characteristic details of the faces? Typically the answer is not. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features for unknown individual faces, it is capable to reconstruct plausible faces; their appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network capability to hallucinate with some Perceptual Probes, as for instance face aspect classification or landmark detection. Depth face can be used in spite of the correspondent RGB images, that often are not available for darkness of difficult luminance conditions. Experimental results are very promising and are as far as better than previous proposed approaches: this domain translation can constitute a new way to exploit depth data in new future applications.

2018 - Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World [Relazione in Atti di Convegno]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Palazzi, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problems of occluded targets. For the purpose, we propose a deep network architecture that jointly extracts people body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts, by hallucinating plausible solutions of not visible joints. We propose a new end-to-end architecture composed by four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations we created the vastest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data exhibits good generalization capabilities also on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.

2018 - Metodo e sistema per il riconoscimento biometrico univoco di un animale, basati sull'utilizzo di tecniche di deep learning [Brevetto]
Calderara, Simone; Bergamini, Luca; Capobianco Dondona, Andrea; Del Negro, Ercole; Di Tondo, Francesco
abstract

La presente invenzione descrive un metodo e sistema per il riconoscimento biometrico univoco di un animale, basato sull’utilizzo di tecniche di deep learning. Il metodo è caratterizzato dalle seguenti fasi: a. fase di allenamento su di un dominio umano ed un dominio animale per l’ottenimento di embedding animali in uno spazio latente omologo a quello umano per mezzo di reti neurali convolutive; b. memorizzazione degli embedding animali ottenuti in una banca dati; c. riconoscimento di una identità animale per mezzo di reti neurali convolutive. La presente invenzione comprende anche un sistema per il riconoscimento biometrico univoco di un animale che utilizza il metodo precedentemente descritto.

2018 - Multi-views Embedding for Cattle Re-identification [Relazione in Atti di Convegno]
Bergamini, Luca; Porrello, Angelo; Andrea Capobianco Dondona, ; Ercole Del Negro, ; Mauro, Mattioli; Nicola, D’Alterio; Calderara, Simone
abstract

People re-identification task has seen enormous improvements in the latest years, mainly due to the development of better image features extraction from deep Convolutional Neural Networks (CNN) and the availability of large datasets. However, little research has been conducted on animal identification and re-identification, even if this knowledge may be useful in a rich variety of different scenarios. Here, we tackle cattle re-identification exploiting deep CNN and show how this task is poorly related to the human one, presenting unique challenges that make it far from being solved. We present various baselines, both based on deep architectures or on standard machine learning algorithms, and compared them with our solution. Finally, a rich ablation study has been conducted to further investigate the unique peculiarities of this task.

2018 - Unsupervised vehicle re-identification using triplet networks [Relazione in Atti di Convegno]
Marin-Reyes, P. A.; Bergamini, L.; Lorenzo-Navarro, J.; Palazzi, A.; Calderara, S.; Cucchiara, R.
abstract

Vehicle re-identification plays a major role in modern smart surveillance systems. Specifically, the task requires the capability to predict the identity of a given vehicle, given a dataset of known associations, collected from different views and surveillance cameras. Generally, it can be cast as a ranking problem: given a probe image of a vehicle, the model needs to rank all database images based on their similarities w.r.t the probe image. In line with recent research, we devise a metric learning model that employs a supervision based on local constraints. In particular, we leverage pairwise and triplet constraints for training a network capable of assigning a high degree of similarity to samples sharing the same identity, while keeping different identities distant in feature space. Eventually, we show how vehicle tracking can be exploited to automatically generate a weakly labelled dataset that can be used to train the deep network for the task of vehicle re-identification. Learning and evaluation is carried out on the NVIDIA AI city challenge videos.

2018 - Using Kinect camera for investigating intergroup non-verbal human interactions [Abstract in Atti di Convegno]
Vezzali, Loris; Di Bernardo, Gian Antonio; Cadamuro, Alessia; Cocco, Veronica Margherita; Crapolicchio, Eleonora; Bicocchi, Nicola; Calderara, Simone; Giovannini, Dino; Zambonelli, Franco; Cucchiara, Rita
abstract

A long tradition in social psychology focused on nonverbal behaviour displayed during dyadic interactions generally relying on evaluations from external coders. However, in addition to the fact that external coders may be biased, they may not capture certain type of behavioural indices. We designed three studies examining explicit and implicit prejudice as predictors of nonberval behaviour as reflected in objective indices provided by Kinect cameras. In the first study, we considered White-Black relations from the perspective of 36 White participants. Results revealed that implicit prejudice was associated with a reduction in interpersonal distance and in the volume of space between Whites and Blacks (vs. Whites and Whites), which in turn were associated with evaluations by collaborators taking part in the interaction. In the second study, 37 non-HIV participants interacted with HIV individuals. We found that implicit prejudice was associated with reduced volume of space between interactants over time (a process of bias overcorrection) only when they tried hard to control their behaviour (as captured by a stroop test). In the third study 35 non-disabled children interacted with disabled children. Results revealed that implicit prejudice was associated with reduced interpersonal distance over time.

2017 - A new era in the study of intergroup nonverbal behaviour: Studying intergroup dyadic interactions “online” [Abstract in Atti di Convegno]
DI BERNARDO, GIAN ANTONIO; Vezzali, Loris; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia
abstract

We examined predictors and consequences of intergroup nonverbal behaviour by relying on new technologies and new objective indices. In three studies, both in the laboratory and in the field with children, behaviour was a function of implicit prejudice.

2017 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Relazione in Atti di Convegno]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Eventually, we present a video-specific architecture based on the C3D network, which can extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amount of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned by human ground-thuth.

2017 - From Groups to Leaders and Back. Exploring Mutual Predictability Between Social Groups and Their Leaders [Capitolo/Saggio]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Recently, social theories and empirical observations identified small groups and leaders as the basic elements which shape a crowd. This leads to an intermediate level of abstraction that is placed between the crowd as a flow of people, and the crowd as a collection of individuals. Consequently, automatic analysis of crowds in computer vision is also experiencing a shift in focus from individuals to groups and from small groups to their leaders. In this chapter, we present state-of-the-art solutions to the groups and leaders detection problem, which are able to account for physical factors as well as for sociological evidence observed over short time windows. The presented algorithms are framed as structured learning problems over the set of individual trajectories. However, the way trajectories are exploited to predict the structure of the crowd is not fixed but rather learned from recorded and annotated data, enabling the method to adapt these concepts to different scenarios, densities, cultures, and other unobservable complexities. Additionally, we investigate the relation between leaders and their groups and propose the first attempt to exploit leadership as prior knowledge for group detection.

2017 - Generative Adversarial Models for People Attribute Recognition in Surveillance [Relazione in Atti di Convegno]
Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we propose a deep architecture for detecting people attributes (e.g. gender, race, clothing ...) in surveillance contexts. Our proposal explicitly deal with poor resolution and occlusion issues that often occur in surveillance footages by enhancing the images by means of Deep Convolutional Generative Adversarial Networks (DCGAN). Experiments show that by combining both our Generative Reconstruction and Deep Attribute Classification Network we can effectively extract attributes even when resolution is poor and in presence of strong occlusions up to 80% of the whole person figure.

2017 - Learning Where to Attend Like a Human Driver [Relazione in Atti di Convegno]
Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

Despite the advent of autonomous cars, it's likely - at least in the near future - that human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task. In this paper we study the dynamics of the driver's gaze and use it as a proxy to understand related attentional mechanisms. First, we build our analysis upon two questions: where and what the driver is looking at? Second, we model the driver's gaze by training a coarse-to-fine convolutional network on short sequences extracted from the DR(eye)VE dataset. Experimental comparison against different baselines reveal that the driver's gaze can indeed be learnt to some extent, despite i) being highly subjective and ii) having only one driver's gaze available for each sequence due to the irreproducibility of the scene. Eventually, we advocate for a new assisted driving paradigm which suggests to the driver, with no intervention, where she should focus her attention.

2017 - Learning to Map Vehicles into Bird's Eye View [Relazione in Atti di Convegno]
Palazzi, Andrea; Borghi, Guido; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

Awareness of the road scene is an essential component for both autonomous vehicles and Advances Driver Assistance Systems and is gaining importance both for the academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M couples of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep-network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that is able to generalize on real-world data despite having been trained solely on synthetic ones.

2017 - Signal Processing and Machine Learning for Diplegia Classification [Relazione in Atti di Convegno]
Bergamini, Luca; Calderara, Simone; Bicocchi, Nicola; Ferrari, Alberto; Vitetta, Giorgio
abstract

Diplegia is one of the most common forms of a broad family of motion disorders named cerebral palsy (CP) affecting the voluntary muscular system. In recent years, various classification criteria have been proposed for CP, to assist in diagnosis, clinical decision-making and communication. In this manuscript, we divide the spastic forms of CP into 4 other categories according to a previous classification criterion and propose a machine learning approach for automatically classifying patients. Training and validation of our approach are based on data about 200 patients acquired using 19 markers and high frequency VICON cameras in an Italian hospital. Our approach makes use of the latest deep learning techniques. More specifically, it involves a multi-layer perceptron network (MLP), combined with Fourier analysis. An encouraging classification performance is obtained for two of the four classes.

2017 - Tracking social groups within and across cameras [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita
abstract

We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the problem as a supervised clustering problem where a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set on the problem it comes with the proposal of a suitable evaluation measure. Results of adopting learning for the task are encouraging, scoring a +15% improvement in F1 measure over a non-learning based clustering baseline. To our knowledge this is the first proposal of this kind dealing with multi-camera group tracking.

2016 - DR(eye)VE: a Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving [Relazione in Atti di Convegno]
Alletto, Stefano; Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Autonomous and assisted driving are undoubtedly hot topics in computer vision. However, the driving task is extremely complex and a deep understanding of drivers' behavior is still lacking. Several researchers are now investigating the attention mechanism in order to define computational models for detecting salient and interesting objects in the scene. Nevertheless, most of these models only refer to bottom up visual saliency and are focused on still images. Instead, during the driving experience the temporal nature and peculiarity of the task influence the attention mechanisms, leading to the conclusion that real life driving data is mandatory. In this paper we propose a novel and publicly available dataset acquired during actual driving. Our dataset, composed by more than 500,000 frames, contains drivers' gaze fixations and their temporal integration providing task-specific saliency maps. Geo-referenced locations, driving speed and course complete the set of released data. To the best of our knowledge, this is the first publicly available dataset of this kind and can foster new discussions on better understanding, exploiting and reproducing the driver's attention process in the autonomous and assisted cars of future generations.

2016 - Quick, accurate, smart: 3D computer vision technology helps assessing confined animals' behaviour [Articolo su rivista]
Barnard, Shanis; Calderara, Simone; Pistocchi, Simone; Cucchiara, Rita; Podaliri Vulpiani, Michele; Messori, Stefano; Ferri, Nicola
abstract

Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment. Behavioural indicators are thus among the preferred parameters to assess welfare. However, behavioural recording (usually from video) can be very time consuming and the accuracy and reliability of the output rely on the experience and background of the observers. The outburst of new video technology and computer image processing gives the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can be later labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour in time or between individuals could be assessed. The software accuracy in correctly detecting the dogs' behaviour was checked through a validation process. An automatic behaviour recognition system, independent from human subjectivity, could add scientific knowledge on animals' quality of life in confinement as well as saving time and resources. This 3D framework was designed to be invariant to the dog's shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.

2016 - Socially Constrained Structural Learning for Groups Detection in Crowd [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on both ground truth trajectories and tracklets previously extracted by available detector/tracker systems.

2016 - Spotting prejudice with nonverbal behaviours [Relazione in Atti di Convegno]
Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Vezzali, Loris; DI BERNARDO, GIAN ANTONIO; Zambonelli, Franco; Cucchiara, Rita
abstract

Despite prejudice cannot be directly observed, nonverbal behaviours provide profound hints on people inclinations. In this paper, we use recent sensing technologies and machine learning techniques to automatically infer the results of psychological questionnaires frequently used to assess implicit prejudice. In particular, we recorded 32 students discussing with both white and black collaborators. Then, we identiﬁed a set of features allowing automatic extraction and measured their degree of correlation with psychological scores. Results conﬁrmed that automated analysis of nonverbal behaviour is actually possible thus paving the way for innovative clinical tools and eventually more secure societies.

2016 - Transductive People Tracking in Unconstrained Surveillance [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Long term tracking of people in unconstrained scenarios is still an open problem due to the absence of constant elements in the problem setting. The camera, when active, may move and both the background and the target appearance may change abruptly leading to the inadequacy of most standard tracking techniques. We propose to exploit a learning approach that considers the tracking task as a semi supervised learning (SSL) problem. Given few target samples the aim is to search the target occurrences in the video stream re-interpreting the problem as label propagation on a similarity graph. We propose a solution based on graph transduction that works iteratively frame by frame. Additionally, in order to avoid drifting, we introduce an update strategy based on an evolutionary clustering technique that chooses the visual templates that better describe target appearance evolving the model during the processing of the video. Since we model people appearance by means of covariance matrices on color and gradient information our framework is directly related to structure learning on Riemannian manifolds. Tests on publicly available datasets and comparisons with stateof- the-art techniques allow to conclude that our solution exhibit interesting performances in terms of tracking precision and recall in most of the considered scenarios.

2015 - Active query process for digital video surveillance forensic applications [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Multimedia forensics is a new emerging discipline regarding the analysis and exploitation of digital data as support for investigation to extract probative elements. Among them, visual data about people and people activities, extracted from videos in an efficient way, are becoming day by day more appealing for forensics, due to the availability of large video-surveillance footage. Thus, many research studies and prototypes investigate the analysis of soft biometrics data, such as people appearance and people trajectories. In this work, we propose new solutions for querying and retrieving visual data in an interactive and active fashion for soft biometrics in forensics. The innovative proposal joins the capability of transductive learning for semi-supervised search by similarity and a typical multimedia methodology based on user-guided relevance feedback to allow an active interaction with the visual data of people, appearance and trajectory in large surveillance areas. Approaches proposed are very general and can be exploited independently by the surveillance setting and the type of video analytic tools.

2015 - Learning to Divide and Conquer for Online Multi-Target Tracking [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are previously extracted independently in each frame and then objects trajectories are built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem in local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlights a significant improvement of tracking performances (MOTA +10%) over the state of the art.

2015 - Learning to identify leaders in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Leader identification is a crucial task in social analysis, crowd management and emergency planning. In this paper, we investigate a computational model for the individuation of leaders in crowded scenes. We deal with the lack of a formal definition of leadership by learning, in a supervised fashion, a metric space based exclusively on people spatiotemporal information. Based on Tarde's work on crowd psychology, individuals are modeled as nodes of a directed graph and leaders inherits their relevance thanks to other members references. We note this is analogous to the way websites are ranked by the PageRank algorithm. During experiments, we observed different feature weights depending on the specific type of crowd, highlighting the impossibility to provide a unique interpretation of leadership. To our knowledge, this is the first attempt to study leader identification as a metric learning problem

2015 - Towards the evaluation of reproducible robustness in tracking-by-detection [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Conventional experiments on MTT are built upon the belief that fixing the detections to different trackers is sufficient to obtain a fair comparison. In this work we argue how the true behavior of a tracker is exposed when evaluated by varying the input detections rather than by fixing them. We propose a systematic and reproducible protocol and a MATLAB toolbox for generating synthetic data starting from ground truth detections, a proper set of metrics to understand and compare trackers peculiarities and respective visualization solutions.

2015 - Understanding social relationships in egocentric vision [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

The understanding of mutual people interaction is a key component for recognizing people social behavior, but it strongly relies on a personal point of view resulting difficult to be a-priori modeled. We propose the adoption of the unique head mounted cameras first person perspective (ego-vision) to promptly detect people interaction in different social contexts. The proposal relies on a complete and reliable system that extracts people׳s head pose combining landmarks and shape descriptors in a temporal smoothed HMM framework. Finally, interactions are detected through supervised clustering on mutual head orientation and people distances exploiting a structural learning framework that specifically adjusts the clustering measure according to a peculiar scenario. Our solution provides the flexibility to capture the interactions disregarding the number of individuals involved and their level of acquaintance in context with a variable degree of social involvement. The proposed system shows competitive performances on both publicly available ego-vision datasets and ad hoc benchmarks built with real life situations.

2014 - A complete system for garment segmentation and color classification [Articolo su rivista]
Manfredi, Marco; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

In this paper, we propose a general approach for automatic segmentation, color-based retrieval and classification of garments in fashion store databases, exploiting shape and color information. The garment segmentation is automatically initialized by learning geometric constraints and shape cues, then it is performed by modeling both skin and accessory colors with Gaussian Mixture Models. For color similarity retrieval and classification, to adapt the color description to the users’ perception and the company marketing directives, a color histogram with an optimized binning strategy, learned on the given color classes, is introduced and combined with HOG features for garment classification. Experiments validating the proposed strategy, and a free-to-use dataset publicly available for scientific purposes, are finally detailed.

2014 - Detection of static groups and crowds gathered in open spaces by texture classification [Articolo su rivista]
Manfredi, Marco; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

A surveillance system specifically developed to manage crowded scenes is described in this paper. In particular we focused on static crowds, composed by groups of people gathered and stayed in the same place for a while. The detection and spatial localization of static crowd situations is performed by means of a One Class Support Vector Machine, working on texture features extracted at patch level. Spatial regions containing crowds are identified and filtered using motion information to prevent noise and false alarms due to moving flows of people. By means of one class classification and inner texture descriptors, we are able to obtain, from a single training set, a sufficiently general crowd model that can be used for all the scenarios that shares a similar viewpoint. Tests on public datasets and real setups validate the proposed system.

2014 - From Ego to Nos-Vision: Detecting Social Relationships in First-Person Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

In this paper we present a novel approach to detect groups in ego-vision scenarios. People in the scene are tracked through the video sequence and their head pose and 3D location are estimated. Based on the concept of f-formation, we define with the orientation and distance an inherently social pairwise feature that describes the affinity of a pair of people in the scene. We apply a correlation clustering algorithm that merges pairs of people into socially related groups. Due to the very shifting nature of social interactions and the different meanings that orientations and distances can assume in different contexts, we learn the weight vector of the correlation clustering using Structural SVMs. We extensively test our approach on two publicly available datasets showing encouraging results when detecting groups from first-person camera views.

2014 - Head Pose Estimation in First-Person Camera Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a new method for head pose real-time estimation in ego-vision scenarios that is a key step in the understanding of social interactions. In order to robustly detect head under changing aspect ratio, scale and orientation we use and extend the Hough-Based Tracker which allows to follow simultaneously each subject in the scene. In an ego-vision scenario where a group interacts in a discussion, each subject's head orientation will be more likely to remain focused for a while on the person who has the floor. In order to encode this behavior we include a stateful Hidden Markov Model technique that enforces the predicted pose with the temporal coherence from a video sequence. We extensively test our approach on several indoor and outdoor ego-vision videos with high illumination variations showing its validity and outperforming other recent related state of the art approaches.

2014 - Kernelized Structural Classification for 3D Dogs Body Parts Detection [Relazione in Atti di Convegno]
Pistocchi, Simone; Calderara, Simone; Barnard, S.; Ferri, N.; Cucchiara, Rita
abstract

Despite pattern recognition methods for human behavioral analysis has flourished in the last decade, animal behavioral analysis has been almost neglected. Those few approaches are mostly focused on preserving livestock economic value while attention on the welfare of companion animals, like dogs, is now emerging as a social need. In this work, following the analogy with human behavior recognition, we propose a system for recognizing body parts of dogs kept in pens. We decide to adopt both 2D and 3D features in order to obtain a rich description of the dog model. Images are acquired using the Microsoft Kinect to capture the depth map images of the dog. Upon depth maps a Structural Support Vector Machine (SSVM) is employed to identify the body parts using both 3D features and 2D images. The proposal relies on a kernelized discriminative structural classificator specifically tailored for dogs independently from the size and breed. The classification is performed in an online fashion using the LaRank optimization technique to obtaining real time performances. Promising results have emerged during the experimental evaluation carried out at a dog shelter, managed by IZSAM, in Teramo, Italy.

2014 - Pattern recognition and crowd analysis [Articolo su rivista]
Bandini, S.; Calderara, S.; Cucchiara, R.
abstract

2014 - Visual Tracking: An Experimental Survey [Articolo su rivista]
A. W. M., Smeulder; D. M., Chu; Cucchiara, Rita; Calderara, Simone; A., Dehghan; M., Shah
abstract

There is a large variety of trackers, which have been proposed in the literature during the last two decades with some mixed success. Object tracking in realistic scenarios is difficult problem, therefore it remains a most active area of research in Computer Vision. A good tracker should perform well in a large number of videos involving illumination changes, occlusion, clutter, camera motion, low contrast, specularities and at least six more aspects. However, the performance of proposed trackers have been evaluated typically on less than ten videos, or on the special purpose datasets. In this paper, we aim to evaluate trackers systematically and experimentally on 315 video fragments covering above aspects. We selected a set of nineteen trackers to include a wide variety of algorithms often cited in literature, supplemented with trackers appearing in 2010 and 2011 for which the code was publicly available. We demonstrate that trackers can be evaluated objectively by survival curves, Kaplan Meier statistics, and Grubs testing. We find that in the evaluation practice the F-score is as effective as the object tracking accuracy (OTA) score. The analysis under a large variety of circumstances provides objective insight into the strengths and weaknesses of trackers.

2013 - Social Groups Detection in Crowd through Shape-Augmented Structured LearningImage Analysis and Processing – ICIAP 2013 [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone
abstract

Most of the behaviors people exhibit while being part of a crowd are social processes that tend to emerge among groups and as a consequence, detecting groups in crowds is becoming an important issue in modern behavior analysis. We propose a supervised correlation clustering technique that employs Structural SVM and a proxemic based feature to learn how to partition people trajectories in groups, by injecting in the model socially plausible shape configurations. By taking into account social groups patterns, the system is able to outperform state of the art methods on two publicly available benchmark sets of videos.

2013 - Social groups detection in crowd through shape-augmented structured learning [Relazione in Atti di Convegno]
Solera, F.; Calderara, S.
abstract

2013 - Structured learning for detection of social groups in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Group detection in crowds will play a key role in future behavior analysis surveillance systems. In this work we build a new Structural SVM-based learning framework able to solve the group detection task by exploiting annotated video data to deduce a sociologically motivated distance measure founded on Hall's proxemics and Granger's causality. We improve over state-of-the-art results even in the most crowded test scenarios, while keeping the classification time affordable for quasi-real time applications. A new scoring scheme specifically designed for the group detection task is also proposed.

2012 - Integrate tool for online analysis and offline mining of people trajectories [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

In the past literature, online alarm-based video-surveillance and offline forensic-based data mining systems are often treated separately, even from different scientific communities. However, the founding techniques are almost the same and, despite some examples in commercial systems, the cases on which an integrated approach is followed are limited. For this reason, this study describes an integrated tool capable of putting together these two subsystems in an effective way. Despite its generality, the proposal is here reported in the case of people trajectory analysis, both in real time and offline. Trajectories are modelled based on either their spatial location or their shape, and proper similarity measures are proposed. Special solutions to meet real-time requirements in both cases are also presented and the trade-off between efficiency and efficacy is analysed by comparing when using a statistical model and when not. Examples of results in large datasets acquired in the University campus are reported as preliminary evaluation of the system.

2012 - Learning Non-Target Items for Interesting Clothes Segmentation in Fashion Images [Relazione in Atti di Convegno]
Grana, Costantino; Calderara, Simone; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we propose a color-based approach for skin detection and interest garment selection aimed at an automatic segmentation of pieces of clothing. For both purposes, the color description is extracted by an iterative energy minimization approach and an automatic initialization strategy is proposed by learning geometric constraints and shape cues. Experiments confirms the good performance of this technique both in the context of skin removal and in the context of classification of garments.

2012 - Understanding dyadic interactions applying proxemic theory on videosurveillance trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

Understanding social and collective people behaviour in open spaces is one of the frontier of modern video surveillance. Many sociological theories, and proxemics in particular, have been proved their validity as a support for classifying and interpreting human behaviour. Proxemics suggest some simple but effective behavioural rules, useful to understand what people are doing and their social involvement with other individuals. In this paper we propose to extend the proxemics analysis along the time and provide a solution for analysing sequences of proxemic states computed between trajectories of people pairs (dyads). Trajectories, computed from videosurveillance videos, are first analysed and converted to a sequence of symbols according to proxemic theory. Then an elastic measure for comparing those sequences is introduced. Finally, interactions are classified both in an off-line unsupervised way and in an on-line fashion. Results on videosurveillance data, demonstrate that sequences of proxemic states can be effective in characterizing mutual interactions and experiments in capturing the most frequent dyads interactions and on-line classifying them when a labelled training set is available are proposed.

2011 - Appearance tracking by transduction in surveillance scenarios [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

We propose a formulation of people tracking problem as a Transductive Learning (TL) problem. TL is an effective semi-supervised learning technique by which many classification problems have been recently reinterpreted as learning labels from incomplete datasets. In our proposal the joint exploitation of spectral graph theory and Riemannian manifold learning tools leads to the formulation of a robust approach for appearance based tracking in Video Surveillance scenarios. The key advantage of the presented method is a continuously updated model of the tracked target, used in the TL process, that allows to on-line learn the target visual appearance and consequently to improve the tracker accuracy. Experiments on public datasets show an encouraging advancement over alternative state-of the-art techniques.

2011 - Detecting Anomalies in People’s Trajectories using Spectral Graph Analysis [Articolo su rivista]
Calderara, Simone; Uri, Heinemann; Prati, Andrea; Cucchiara, Rita; Naftali, Tishby
abstract

Video surveillance is becoming the technology of choice for monitoring crowded areas for security threats. While video provides ample information for human inspectors, there is a great need for robust automated techniques that can efficiently detect anomalous behavior in streaming video from single ormultiple cameras. In this work we synergistically combine two state-of-the-art methodologies. The rst is the ability to track and label single person trajectories in a crowded area using multiple video cameras, and the second is a new class of novelty detection algorithms based on spectral analysis of graphs. By representing the trajectories as sequences of transitions betweennodes in a graph, shared individual trajectories capture only a small subspace of the possible trajectories on the graph. This subspace is characterized by large connected components of the graph, which are spanned by the eigenvectors with the low eigenvalues of the graph Laplacian matrix. Using this technique, we develop robust invariant distance measures for detectinganomalous trajectories, and demonstrate their application on realvideo data.

2011 - Feature Space Warping Relevance Feedback with Transductive Learning [Relazione in Atti di Convegno]
Borghesani, Daniele; Coppi, Dalia; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

Relevance feedback is a widely adopted approach to improve content-based information retrieval systems by keeping the user in the retrieval loop. Among the fundamental relevance feedback approaches, feature space warping has been proposed as an effective approach for bridging the gap between high-level semantics and the low-level features. Recently, combination of feature space warping and query point movement techniques has been proposed in contrast to learning based approaches, showing good performance under dierent data distributions. In this paper we propose to merge feature space warping and transductive learning, in order to benet from both the ability of adapting data to the user hints and the information coming from unlabeled samples. Experimental results on an image retrieval task reveal signicant performance improvements from the proposed method.

2011 - Iterative active querying for surveillance data retrieval in crime detection and forensics [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Large sets of visual data are now available both, in real time andoff line, at time of investigation in multimedia forensics, however passive querying systems often encounter difﬁculties in retrieving signiﬁcant results. In this paper we propose an iterativeactive querying system for video surveillance and forensic applications based on the continuous interaction between the userand the system. The positive and negative user feedbacks areexploited as the input of a graph based transductive procedurefor iteratively reﬁning the initial query results. Experimentsare shown using people trajectories and people appearance asdistance metrics.

2011 - Markerless Body Part Tracking for Action Recognition [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognising human actions bytracking body parts without using artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians (MoG) is then employed to extract and track significantpoints on this map, corresponding to significant regions on the human silhouette. The evolution of the mixture in time is analysed by transforming it in a sequence of symbols (corresponding to a MoG). The similarity between actions is computed by applying global alignment and dynamic programming techniques to the corresponding sequences and using a variational approximation of the Kullback-Leibler divergence to measure the dissimilarity between two MoGs. Experiments on publicly available datasets and comparison with existing methods are provided.

2011 - Mixtures of von Mises Distributions for People Trajectory Shape Analysis [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

People trajectory analysis is a recurrent task inmany pattern recognition applications, such as surveillance,behavior analysis, video annotation, and many others. In thispaper we propose a new framework for analyzing trajectoryshape, invariant to spatial shifts of the people motion in thescene. In order to cope with the noise and the uncertainty ofthe trajectory samples, we propose to describe the trajectoriesas a sequence of angles modelled by distributions of circularstatistics, i.e. a mixture of von Mises (MovM) distributions.To deal with MovM, we define a new specific EM algorithmfor estimating the parameters and derive a closed form of theBhattacharyya distance between single vM pdfs. Trajectories arethen modelled with a sequence of symbols, corresponding tothe most suitable distribution in the mixture, and comparedeach other after a global alignment procedure to cope withtrajectories of different lengths. The trajectories in the trainingset are clustered according with their shape similarity in an offlinephase, and testing trajectories are then classified with aspecific on-line EM, based on sufficient statistics. The approachis particularly suitable for classifying people trajectories in videosurveillance, searching for abnormal (i.e. infrequent) paths. Testson synthetic and real data are provided with also a completecomparison with other circular statistical and alignment methods.

2011 - People appearance tracing in video by spectral graph transduction [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Following people in different video sources is a challenging task: variations in the type of camera, in the lighting conditions, in the scene settings (e.g. crowd or occlusions) and in the point of view must be accounted. In this paper we propose a system based only on appearance information that, disregarding temporal and spatial information, can be flexibly applied on both moving and static cameras. We exploit the joint use of transductive learning and spectral properties of graph Laplacians proposing a formulation of the people tracing problem as a semi-supervised classification. The knowledge encoded in two labeled input sets of positive and negative samples of the target person and the continuous spectral update of these models allow us to obtain a robust approach for people tracing in surveillance video sequences. Experiments on publicly available datasets show satisfactory results and exhibit a good robustness in dealing with short and long term occlusions.

2011 - Vision based smoke detection system using image energy and color information [Articolo su rivista]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

Smoke detection is a crucial task in many video surveillance applications and could have a great impact to raise the level of safety of urban areas. Many commercial smoke detection sensors exist but most of them cannot be applied in open space or outdoor scenarios. With this aim, the paper presents a smoke detection system that uses a common CCD camera sensor to detect smoke in images and trigger alarms. First, a proper background model is proposed to reliably extract smoke regions and avoid over-segmentation and false positives in outdoor scenarios where many distractors are present, such as moving trees or light reflexes. A novel Bayesian approach is adopted to detect smoke regions in the scene analyzing image energy by means of the Wavelet Transform coefficients and Color Information. A statistical model of image energy is built, using a temporal Gaussian Mixture, to analyze the energy decay that typically occurs when smoke covers the scene then the detection is strengthen evaluating the color blending between a reference smoke color and the input frame. The proposed system is capable of detecting rapidly smoke events both in night and in day conditions with a reduced number of false alarms hence is particularly suitable for monitoring large outdoor scenarios where common sensors would fail. An extensive experimental campaign both on recorded videos and live cameras evaluates the efficacy and efficiency of the system in many real world scenarios, such as outdoor storages and forests.

2010 - A Videosurveillance data browsing software architecture for forensics: From trajectories similarities to video fragments [Relazione in Atti di Convegno]
Aravecchia, M.; Calderara, S.; Chiossi, S.; Cucchiara, R.
abstract

The information contained in digital video surveillance repositories can present relevant hints, when not even legal evidence, during investigations. As the amount of video data often forbids manual search, some tools have been developed during the past years in order to aid investigators in the look up process. We propose an application for forensic video analysis which aims at analysing the activities in a given scenario, particularly focusing on trajectories followed by people and their visual appearances. The recorded videos can be browsed by investigators thanks to a user-friendly interface, allowing easy information retrieval, through the choice of the best mining strategy. The underlying application architecture implements different feature and query models as well as query optimization strategies in order to return the best response in terms of both efficacy and efficiency.

2010 - Alignment-based Similarity of People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for comparing people trajectories for video surveillance applications, based on semi-directional statistics. In fact, the modelling of a trajectory as a sequence of angles, speeds and time lags, requires the use of a statistical tool capable to jointly consider periodic and linear variables. Our statistical method is compared with two state-of-the-art methods.

2010 - Moving pixels in static cameras: detecting dangerous situations due to environment or people [Capitolo/Saggio]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Dangerous situations arise in everyday life and many efforts have been lavished to exploit technology to increase the level of safety in urban areas. Video analysis is absolutely one of the most important and emerging technology for security purposes. Automatic video surveillance systems commonly analyze the scene searching for moving objects. Well known techniques exist to cope with this problem that is commonly referred as change detection". Every time a dierence against a reference model is sensed, it should be analyzed to allow the system to discriminateamong a usual situation or a possible threat. When the sensor is a camera, motion is the key element to detect changes and moving objects must be correctly classied according to their nature. In this context we can distinguish among two dierent kinds of threat that can lead to dangerous situations in a video-surveilled environment. The first one is due to environmental changes such as rain, fog or smoke present in the scene. This kind of phenomena are sensed by the camera as moving pixelsand, subsequently as moving objects in the scene. This kind of threats shares some common characteristics such as texture, shape and color information and can be detected observing the features' evolution in time. The second situation arises whenpeople are directly responsible of the dangerous situation. In this case a subject is acting in an unusual way leading to an abnormal situation. From the sensor's point of view, moving pixels are still observed, but specic features and time-dependent statistical models should be adopted to learn and then correctly detect unusual and dangerous behaviors. With these premises, this chapter will present two different case studies. The rst one describes the detection of environmental changes in theobserved scene and details the problem of reliably detecting smoke in outdoor environments using both motion information and global image features, such as color information and texture energy computed by the means of the Wavelet transform.The second refers to the problem of detecting suspicious or abnormal people behaviors by means of people trajectory analysis in a multiple cameras video-surveillance scenario. Specically, a technique to infer and learn the concept of normality is proposed jointly with a suitable statistical tool to model and robustly compare people trajectories.

2010 - People trajectory mining with statistical pattern recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

People social interaction analysis is a complex and interesting problem that can be faced from several points of view depending on the application context. In videosurveillance contexts many indicators of people habits and relations exist and, among these, people trajectories analysis can reveal many aspects of the way people behave in social environments. We propose a statistical framework for trajectories mining that analyzes, in an integrated solution, several aspects of the trajectories such as location, shape and speed properties. Three different models are proposed to deal with non-idealities of the selected features in conjunction with a robust inexact- matching similarity measure for comparing sequences with different lengths. Experimental results in a real scenario demonstrates the efficacy of the framework in clustering people trajectories with the purpose of analyze frequent behaviors in complex environments.

2009 - A Real-Time System for Abnormal Path Detection [Relazione in Atti di Convegno]
Calderara, Simone; C., Alaimo; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a real-time system capable to extract andmodel object trajectories from a multi-camera setup with theaim of identifying abnormal paths. The trajectories are modeledas a sequence of positional distributions (2D Gaussians)and clustered in the training phase by exploiting an innovativedistance measure based on a global alignment techniqueand Bhattacharyya distance between Gaussians. An on-lineclassification procedure is proposed in order to on-the-fly classifynew trajectories into either “normal” or “abnormal” (in thesense of rarely seen before, thus unusual and potentially interesting).Experiments on a real scenario will be presented.

2009 - Learning People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a system for people trajectory shape analysis by exploiting a statistical approach which accounts for sequences of both directional (the directions of the trajectory) and linear (the speeds) data. A semi-directional distribution (AWLG - Approximated Wrapped and Linear Gaussian) is used with a mixture to find main directions and speeds. A variational version of the mutual information criterion is proposed to prove the statistical dependency of the data. Then, in order to compare data sequences, we define an inexact method with a Kullback-Leibler-based distance measure and employ a global alignment technique is to handle sequences of different lengths and with local shifts or deformations. A comprehensive analysis of variable dependency and parameter estimation techniques are reported and evaluated on both synthetic and real data sets.

2009 - Statistical Pattern Recognition for Multi-Camera Detection, Tracking and Trajectory Analysis [Capitolo/Saggio]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This chapter will address most of the aspects of modern video surveillance with the reference to the research activity conducted at University of Modena and Reggio Emilia, Italy, within the scopes of the national FREE SURF (FREE SUrveillance in a pRivacy-respectFul way) and NATO-funded BE SAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction) projects. Moving object detection and tracking from a single camera, multi-camera consistent labeling and trajectory shape analysis for path classification will be the main topics of this chapter.

2009 - Statistical pattern recognition for multi-camera detection, tracking, and trajectory analysis [Capitolo/Saggio]
Calderara, S.; Cucchiara, R.; Vezzani, R.; Prati, A.
abstract

2009 - Video surveillance and multimedia forensics: an application to trajectory analysis [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper reports an application of trajectory analysis in which forensics and video surveillance techniques are jointly employed for providing a new tool of multimedia forensics. Advanced video surveillance techniques are used to extract from a multi-camera system the trajectories of the moving people which are then modelled by either their positions (projected on the ground plane) or their directions of movement. Both these two representations can be very suitable for querying large video repositories, by searching for similar trajectories in terms of either sequences of positions or trajectory shape (encoded as sequence of angles, where positions do not care). Preliminary examples of the possible use of this approach are shown.

2008 - "Inside the Bible": Segmentation, Annotation and Retrieval for a New Browsing Experience [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a system for automatic segmentation, annotation and image retrieval based on content, focused on illuminated manuscripts and in particular the Borso D'Este Holy Bible. To enhance the interaction possibilities with this work, full of decorations and illustrations, we exploit some well known document analysis techniques in addition to some new approaches, in order to achieve good segmentation of pages into meaningful visual objects with the relative annotation. We wanted to extend the standard keyword-based retrieval approach in a commentary with a modern visual-based retrieval by appearance similarity: an entire software user interface for exploration and visual search of illuminated manuscripts.

2008 - A Markerless Approach for Consistent Action Recognition in a Multi-camera System [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need of artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments showed the potentiality of the proposed approach.

2008 - Action Signature: a Novel Holistic Representation for Action Recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Recognizing different actions with a unique approach can be a difficult task. This paper proposes a novel holistic representation of actions that we called "action signature". This 1D trajectory is obtained by parsing the 2D image containing the orientations of the gradient calculated on the motion feature map called motion-history image. In this way, the trajectory is a sketch representation of how the object motion varies in time. A robust statistical framework based on mixtures of von Mises distributions and dynamic programming for sequence alignment are used to compare and classify actions/trajectories. The experimental results show a rather high accuracy in distinguishing quite complicated actions, such as drinking, jumping, or abandoning an object.

2008 - Bayesian-competitive Consistent Labeling for People Surveillance [Articolo su rivista]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper presents a novel and robust approach to consistent labeling for people surveillance in multi-camera systems. A general framework scalable to any number of cameras with overlapped views is devised. An off-line training process automatically computes ground-plane homography and recovers epipolar geometry. When a new object is detected in any one camera, hypotheses for potential matching objects in the other cameras are established. Each of the hypotheses is evaluated using a prior and likelihood value. The prior accounts for the positions of the potential matching objects, while the likelihood is computed by warping the vertical axis of the new object on the field of view of the other cameras and measuring the amount of match. In the likelihood, two contributions (forward and backward) are considered so as to correctly handle the case of groups of people merged into single objects. Eventually, a maximum-a-posteriori approach estimates the best label assignment for the new object. Comparisons with other methods based on homography and extensive outdoor experiments demonstrate that the proposed approach is accurate and robust in coping with segmentation errors and in disambiguating groups.

2008 - HECOL: Homography and Epipolar-based Consistent Labeling for Outdoor Park Surveillance [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Outdoor surveillance is one of the most attractive application of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous consistent labeling of tracked objects among the different views. This work proposes a new approach for coping with this problem in multi-camera systems with overlapped Fields of View (FoVs). The presence of overlapped zones allows the definition of a geometry-based approach to reconstruct correspondences between FoVs, using only homography and epipolar lines (hereinafter HECOL: Homography and Epipolar-based COnsistent Labeling) computed automatically with a training phase. We also propose a complete system that provides segmentation and tracking of people in each camera module. Segmentation is performed by means of the SAKBOT (Statistical and Knowledge Based Object Tracker) approach, suitably modified to cope with multi-modal backgrounds, reflections and other artefacts, typical of outdoor scenes. The extracted objects are tracked using a statistical appearance model robust against occlusions and segmentation errors. The main novelty of this paper is the approach to consistent labeling. A specific Camera Transition Graph is adopted to efficiently select the possible correspondence hypotheses between labels. A Bayesian MAP optimization assigns consistent labels to objects detected by several points of views: the object axis is computed from the shape tracked in each camera module and homography and epipolar lines allow a correct axis warping in other image planes. Both forward and backward probability contributions from the two different warping directions make the approach robust against segmentation errors, and capable of disambiguating groups of people. The system has been tested in a real setup of a urban public park, within the Italian LAICA (Laboratory of Ambient Intelligence for a friendly city) project. The experiments show how the system can correctly track and label objects in a distributed system with real-time performance. Comparisons with simpler consistent labeling methods and extensive outdoor experiments with ground truth demonstrate the accuracy and robustness of the proposed approach.

2008 - Reliable smoke detection system in the domains of image energy and color [Relazione in Atti di Convegno]
Piccinini, Paolo; Calderara, Simone; Cucchiara, Rita
abstract

Smoke detection calls for a reliable and fast distinction between background, moving objects and variable shapes that are recognizable as smoke. In our system we propose a stable background suppression module joined with a smoke detection module working on segmented objects. It exploits two features: the energy variation in wavelet model and a color model of the smoke. The decrease of energy ratio in wavelet domain between background and current image is a clue to detect smoke representing the variations of texture level. A mixture of Gaussians models this texture ratio for temporal evolution. The color model is used as reference to measure the deviation of the current pixel color from the model. The two features have been combined using a Bayesian classifier to detect smoke in the scene. Experiments on real data and a comparison between our background model and Gaussian Mixture(MOG) model for smoke detection are presented. © 2008 IEEE.

2008 - Smoke detection in video surveillance: A MoG model in the wavelet domain [Relazione in Atti di Convegno]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

The paper presents a new fast and robust technique of smoke detection in video surveillance images. The approach aims at detecting the spring or the presence of smoke by analyzing color and texture features of moving objects, segmented with background subtraction. The proposal embodies some novelties: first the temporal behavior of the smoke is modeled by a Mixture of Gaussians (MoG ) of the energy variation in the wavelet domain. The MoG takes into account the image energy variation due to either external luminance changes or the smoke propagation. It allows a distinction to energy variation due to the presence of real moving objects such as people and vehicles. Second, this textural analysis is enriched by a color analysis based on the blending function. Third, a Bayesian model is defined where the texture and color features, detected at block level, contributes to model the likelihood while a global evaluation of the entire image models the prior probability contribution. The resulting approach is very flexible and can be adopted in conjunction to a whichever video surveillance system based on dynamic background model. Several tests on tens of different contexts, both outdoor and indoor prove its robustness and precision. © 2008 Springer-Verlag Berlin Heidelberg.

2008 - Smoke detection in videosurveillance: the use of VISOR (Video Surveillance On-line Repository) [Relazione in Atti di Convegno]
Vezzani, Roberto; Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

Visor (VIdeo Surveillance Online Repository) is a large videorepository, designed for containing annotated video surveillancefootages, comparing annotations, evaluating systemperformance, and performing retrieval tasks. The web interfaceallows video browse, query by annotated conceptsor by keywords, compressed video preview, media downloadand upload. The repository contains metadata annotations,both manually created ground-truth data and automaticallyobtained outputs of particular systems. An exampleof application is the collection of videos and annotationsfor smoke detection, an important video surveillance task. Inthis paper we present the architecture of ViSOR, the build-insurveillance ontology which integrates many concepts, alsocoming from LSCOM, and MediaMill, the annotation toolsand the visualization of results for performance evaluation.The annotation is obtained with an automatic smoke detectionsystem, capable to detect people, moving objects, andsmoke in real-time.

2008 - Using circular statistics for trajectory shape analysis [Relazione in Atti di Convegno]
Prati, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

The analysis of patterns of movement is a crucial task for several surveillance applications, for instance to classify normal or abnormal people trajectories on the basis of their occurrence. This paper proposes to model the shape of a single trajectory as a sequence of angles described using a Mixture of Von Mises (MoVM) distribution. A complete EM (Expectation Maximization) algorithm is derived for MoVM parameters estimation and an on-line version proposed to meet real time requirement. Maximum-A-Posteriori is used to encode the trajectory as a sequence of symbols corresponding to the MoVM components. Iterative k-medoids clustering groups trajectories in a variable number of similarity classes. The similarity is computed aligning (with dynamic programming) two sequences and considering as symbol-to-symbol distance the Bhattacharyya distance between von Mises distributions. Extensive experiments have been performed on both synthetic and real data. ©2008 IEEE.

2007 - A Distributed Outdoor Video Surveillance System for Detection of Abnormal People Trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Distributed surveillance systems are nowadays widely adopted to monitor large areas for security purposes. In this paper, we present a complete multicamera system designed for people tracking from multiple partially overlapped views and capable of inferring and detecting abnormal people trajectories. Detection and tracking are performed by means of background suppression and an appearance-based probabilistic approach. Objects' label ambiguities are geometrically solved and the concept of "normality" is learned from data using a robust statistical model based on Von Mises distributions. Abnormal trajectories are detected using a first-order Bayesian network and, for each abnormal event, the appearance of the subject from each view is logged. Experiments demonstrate that our system can process with real-time performance up to three cameras simultaneously in an unsupervised setup and under varying environmental conditions.

2007 - A Dynamic Programming Technique for Classifying Trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, A.
abstract

This paper proposes the exploitation of a dynamic programming technique for efficiently comparing people trajectories adopting an encoding scheme that jointly takes into account both the direction and the velocity of movement. With this approach, each pair of trajectories in the training set is compared and the corresponding distance computed. Clustering is achieved by using the k-medoids algorithm and each cluster is modeled with a 1-D Gaussian over the distance from the medoid. A MAP framework is adopted for the testing phase. The reported results are encouraging.

2007 - Detection of Abnormal Behaviors using a Mixture of Von Mises Distributions [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper proposes the use of a mixture of Von Mises distributions to detect abnormal behaviors of moving people. The mixture is created from an unsupervised training set by exploiting k-medoids clustering algorithm based on Bhattacharyya distance between distributions. The extracted medoids are used as modes in the multi-modal mixture whose weights are the priors of the specific medoid. Given the mixture model a new trajectory is verified on the model by considering each direction composing it as independent. Experiments over a real scenario composed of multiple, partially-overlapped cameras are reported.

2006 - Group Detection at Camera Handoff for Collecting People Appearance in Multi-camera Systems [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable to fully characterize groups of people and to manage miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects' principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.

2006 - Multimedia Surveillance: Content-based Retrieval with Multicamera People Tracking [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Multimedia surveillance relates to the exploitation of multimedia tools for retrieving information from surveillance data, for emerging applications such as video post-analysis for forensic purposes. Searching for all the sequences in which a certain person was present is a typical query that is carried out by means of example images. Unfortunately, surveillance cameras often have low resolution, making retrieval based on appearance difficult. This paper proposes to exploit a two-step retrieval process that merges similarity-based retrieval with multicamera tracking-based retrieval able to create consistent traces of a person from different views and, thus, different resolutions. A mixture model is used to summarize these traces into a single prototype on which retrieval is performed. Experimental results demonstrate the accuracy of the retrieval process also in the case of varying illumination conditions.

2006 - Reliable background suppression for complex scenes [Relazione in Atti di Convegno]
Calderara, Simone; Melli, Rudy Mirko; Prati, Andrea; Cucchiara, Rita
abstract

This paper describes a system for motion detection based on background suppression,specifically conceived for working in complex scenes with vacillating background,camouflage, illumination changing, etc.. The system contains proper techniques for background bootstrapping, shadow removal, ghost suppression and selective updating of the background model. The results on the challenging videos provided in VSSN '06 Open Source Algorithm Competition dataset demonstrate that the proposed system outperforms the widely-used mixture-of-Gaussians approach.

2006 - The LAICA project: Experiments on Multicamera People Tracking and Logging [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from differentviewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable tofully characterize groups of people and to manage miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects’ principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label con£guration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.

2005 - Consistent labeling for multi-camera object tracking [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new approach to multi-camera object tracking based on the consistent labeling. An automatic and reliable procedure allows to obtain the homographic transformation between two overlapped views, without any manual calibration of the cameras. Object's positions are matched by using the homography when the object is firstly detected in one of the two views. The approach has been tested also in the case of simultaneous transitions and in the case in which people are detected as a group during the transition. Promising results are reported over a real setup of overlapped cameras.

2005 - Entry Edge of Field of View for multi-camera tracking in distributed video surveillance [Relazione in Atti di Convegno]
Calderara, Simone; Vezzani, Roberto; Prati, Andrea; Cucchiara, Rita
abstract

Efficient solution to people tracking in distributed videosurveillance is requested to monitor crowded and large environments.This paper proposes a novel use of the EntryEdges of Field of View (E2oFoV) to solve the consistentlabeling problem between partially overlapped views. Anautomatic and reliable procedure allows to obtain the homographictransformation between two overlapped views,without any manual calibration of the cameras. Throughthe homography, the consistent labeling is established eachtime a new track is detected in one of the cameras. A CameraTransition Graph (CTG) is defined to speed up the establishmentprocess by reducing the search space. Experimentalresults prove the effectiveness of the proposed solutionalso in challenging conditions.

Università degli studi di Modena e Reggio Emilia

Pubblicazioni