Nuova ricerca

Simone CALDERARA

Professore Associato presso: Dipartimento di Ingegneria "Enzo Ferrari"


Home | Curriculum(pdf) | Didattica |


Pubblicazioni

2020 - Anomaly Detection for Vision-based Railway Inspection [Relazione in Atti di Convegno]
Gasparini, Riccardo; Pini, Stefano; Borghi, Guido; Scaglione, Giuseppe; Calderara, Simone; Fedeli, Eugenio; Cucchiara, Rita
abstract


2020 - Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation [Relazione in Atti di Convegno]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene.


2020 - Deep learning-based method for vision-guided robotic grasping of unknown objects [Articolo su rivista]
Bergamini, L.; Sposato, M.; Pellicciari, M.; Peruzzini, M.; Calderara, S.; Schmidt, J.
abstract

Nowadays, robots are heavily used in factories for different tasks, most of them including grasping and manipulation of generic objects in unstructured scenarios. In order to better mimic a human operator involved in a grasping action, where he/she needs to identify the object and detect an optimal grasp by means of visual information, a widely adopted sensing solution is Artificial Vision. Nonetheless, state-of-art applications need long training and fine-tuning for manually build the object's model that is used at run-time during the normal operations, which reduce the overall operational throughput of the robotic system. To overcome such limits, the paper presents a framework based on Deep Convolutional Neural Networks (DCNN) to predict both single and multiple grasp poses for multiple objects all at once, using a single RGB image as input. Thanks to a novel loss function, our framework is trained in an end-to-end fashion and matches state-of-art accuracy with a substantially smaller architecture, which gives unprecedented real-time performances during experimental tests, and makes the application reliable for working on real robots. The system has been implemented using the ROS framework and tested on a Baxter collaborative robot.


2020 - Future Urban Scenes Generation Through Vehicles Synthesis [Relazione in Atti di Convegno]
Simoni, Alessandro; Bergamini, Luca; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract


2019 - A Deep-learning-based approach to VM behavior Identification in Cloud Systems [Relazione in Atti di Convegno]
Stefanini, M.; Lancellotti, R.; Baraldi, L.; Calderara, S.
abstract


2019 - Can adversarial networks hallucinate occluded people with a plausible aspect? [Articolo su rivista]
Fulgeri, F.; Fabbri, Matteo; Alletto, Stefano; Calderara, S.; Cucchiara, R.
abstract

When you see a person in a crowd, occluded by other persons, you miss visual information that can be used to recognize, re-identify or simply classify him or her. You can imagine its appearance given your experience, nothing more. Similarly, AI solutions can try to hallucinate missing information with specific deep learning architectures, suitably trained with people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version in input, that should be a) without occlusion b) similar at pixel level to a completely visible people shape c) capable to conserve similar visual attributes (e.g. male/female) of the original one. For the purpose, we propose a new approach by integrating the state-of-the-art of neural network architectures, namely U-nets and GANs, as well as discriminative attribute classification nets, with an architecture specifically designed to de-occlude people shapes. The network is trained to optimize a Loss function which could take into account the aforementioned objectives. As well we propose two datasets for testing our solution: the first one, occluded RAP, created automatically by occluding real shapes of the RAP dataset created by Li et al. (2016) (which collects also attributes of the people aspect); the second is a large synthetic dataset, AiC, generated in computer graphics with data extracted from the GTA video game, that contains 3D data of occluded objects by construction. Results are impressive and outperform any other previous proposal. This result could be an initial step to many further researches to recognize people and their behavior in an open crowded world.


2019 - Classifying Signals on Irregular Domains via Convolutional Cluster Pooling [Relazione in Atti di Convegno]
Porrello, Angelo; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.


2019 - Gait-Based Diplegia Classification Using LSMT Networks [Articolo su rivista]
Ferrari, Alberto; Bergamini, Luca; Guerzoni, Giorgio; Calderara, Simone; Bicocchi, Nicola; Vitetta, Giorgio; Borghi, Corrado; Neviani, Rita; Ferrari, Adriano
abstract

Diplegia is a specific subcategory of the wide spectrum of motion disorders gathered under the name of cerebral palsy. Recent works proposed to use gait analysis for diplegia classification paving the way for automated analysis. A clinically established gait-based classification system divides diplegic patients into 4 main forms, each one associated with a peculiar walking pattern. In this work, we apply two different deep learning techniques, namely, multilayer perceptron and recurrent neural networks, to automatically classify children into the 4 clinical forms. For the analysis, we used a dataset comprising gait data of 174 patients collected by means of an optoelectronic system. The measurements describing walking patterns have been processed to extract 27 angular parameters and then used to train both kinds of neural networks. Classification results are comparable with those provided by experts in 3 out of 4 forms.


2019 - Latent Space Autoregression for Novelty Detection [Relazione in Atti di Convegno]
Abati, Davide; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita
abstract

Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity. Despite its importance in different application settings, designing a novelty detector is utterly complex due to the unpredictable nature of novelties and its inaccessibility during the training procedure, factors which expose the unsupervised nature of the problem. In our proposal, we design a general framework where we equip a deep autoencoder with a parametric density estimator that learns the probability distribution underlying its latent representations through an autoregressive procedure. We show that a maximum likelihood objective, optimized in conjunction with the reconstruction of normal samples, effectively acts as a regularizer for the task at hand, by minimizing the differential entropy of the distribution spanned by latent vectors. In addition to providing a very general formulation, extensive experiments of our model on publicly available datasets deliver on-par or superior performances if compared to state-of-the-art methods in one-class and video anomaly detection settings. Differently from prior works, our proposal does not make any assumption about the nature of the novelties, making our work readily applicable to diverse contexts.


2019 - METODO DI VALUTAZIONE DI UNO STATO DI SALUTE DI UN ELEMENTO ANATOMICO, RELATIVO DISPOSITIVO DI VALUTAZIONE E RELATIVO SISTEMA DI VALUTAZIONE [Brevetto]
Giuseppe, Marrucchella; Bergamini, Luca; Porrello, Angelo; Del Negro, Ercole; Capobianco Dondona, Andrea; Di Tondo, Francesco; Calderara, Simone
abstract

Sistema in grado di rilevare le lesioni delle mezzene al macello attraverso l'utilizzo di tecniche di deep learning per individuazioni del tipo di lesioni presenti


2019 - Predicting the Driver's Focus of Attention: the DR(eye)VE Project [Articolo su rivista]
Palazzi, Andrea; Abati, Davide; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

Predicting the Driver's Focus of Attention: the DR(eye)VE Project Andrea Palazzi, Davide Abati, Simone Calderara, Francesco Solera, Rita Cucchiara (Submitted on 10 May 2017 (v1), last revised 6 Jun 2018 (this version, v3)) In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from roof-mounted camera), further enriched by other sensors measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.


2018 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Articolo su rivista]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we discuss the effectiveness of convolutional neural networks (CNNs) models in saliency prediction. We present a set of Deep Learning architectures developed by us, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We will show how these deep networks closely recall the early saliency models, although improved with the semantics learned from the human ground-truth. Eventually, we will present a use-case in which saliency prediction is used to improve the automatic description of images.


2018 - Comportamento non verbale intergruppi “oggettivo”: una replica dello studio di Dovidio, kawakami e Gaertner (2002) [Abstract in Atti di Convegno]
Di Bernardo, Gian Antonio; Vezzali, Loris; Giovannini, Dino; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia; Cocco, Veronica Margherita
abstract

Vi è una lunga tradizione di ricerca che ha analizzato il comportamento non verbale, anche considerando relazioni intergruppi. Solitamente, questi studi si avvalgono di valutazioni di coder esterni, che tuttavia sono soggettive e aperte a distorsioni. Abbiamo condotto uno studio in cui si è preso come riferimento il celebre studio di Dovidio, Kawakami e Gaertner (2002), apportando tuttavia alcune modifiche e considerando la relazione tra bianchi e neri. Partecipanti bianchi, dopo aver completato misure di pregiudizio esplicito e implicito, incontravano (in ordine contro-bilanciato) un collaboratore bianco e uno nero. Con ognuno di essi, parlavano per tre minuti di un argomento neutro e di un argomento saliente per la distinzione di gruppo (in ordine contro-bilanciato). Tali interazioni erano registrate con una telecamera kinect, che è in grado di tenere conto della componente tridimensionale del movimento. I risultati hanno rivelato vari elementi di interesse. Anzitutto, si sono creati indici oggettivi, a partire da un’analisi della letteratura, alcuni dei quali non possono essere rilevati da coder esterni, quali distanza interpersonale e volume di spazio tra le persone. I risultati hanno messo in luce alcuni aspetti rilevanti: (1) l’atteggiamento implicito è associato a vari indici di comportamento non verbale, i quali mediano sulle valutazioni dei partecipanti fornite dai collaboratori; (2) le interazioni vanno considerate in maniera dinamica, tenendo conto che si sviluppano nel tempo; (3) ciò che può essere importante è il comportamento non verbale globale, piuttosto che alcuni indici specifici pre-determinati dagli sperimentatori.


2018 - Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face [Relazione in Atti di Convegno]
Fabbri, Matteo; Borghi, Guido; Lanzi, Fabio; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

Can faces acquired by low-cost depth sensors be useful to see some characteristic details of the faces? Typically the answer is not. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features for unknown individual faces, it is capable to reconstruct plausible faces; their appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network capability to hallucinate with some Perceptual Probes, as for instance face aspect classification or landmark detection. Depth face can be used in spite of the correspondent RGB images, that often are not available for darkness of difficult luminance conditions. Experimental results are very promising and are as far as better than previous proposed approaches: this domain translation can constitute a new way to exploit depth data in new future applications.


2018 - End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization [Relazione in Atti di Convegno]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
abstract

Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object on the image according to its predicted pose; in this way the alignment error between the observed and the re-projected object silhouette can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Eventually we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently and novel viewpoints in which both objects arrangement and mutual pose are preserved can be rendered. Differentiable renderer code is available at:https://github.com/ndrplz/tensorflow-mesh-renderer.


2018 - Face-from-Depth for Head Pose Estimation on Depth Images [Articolo su rivista]
BORGHI, GUIDO; FABBRI, MATTEO; VEZZANI, Roberto; CALDERARA, Simone; CUCCHIARA, Rita
abstract

Depth cameras allow to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make unusable common RGB sensors. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performances. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method overcomes several recent state-of-art works based on both intensity and depth input data, running in real-time at more than 30 frames per second.


2018 - Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World [Relazione in Atti di Convegno]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Palazzi, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problems of occluded targets. For the purpose, we propose a deep network architecture that jointly extracts people body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts, by hallucinating plausible solutions of not visible joints. We propose a new end-to-end architecture composed by four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations we created the vastest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data exhibits good generalization capabilities also on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.


2018 - Metodo e sistema per il riconoscimento biometrico univoco di un animale, basati sull'utilizzo di tecniche di deep learning [Brevetto]
Calderara, Simone; Bergamini, Luca; Capobianco Dondona, Andrea; Del Negro, Ercole; Di Tondo, Francesco
abstract

La presente invenzione descrive un metodo e sistema per il riconoscimento biometrico univoco di un animale, basato sull’utilizzo di tecniche di deep learning. Il metodo è caratterizzato dalle seguenti fasi: a. fase di allenamento su di un dominio umano ed un dominio animale per l’ottenimento di embedding animali in uno spazio latente omologo a quello umano per mezzo di reti neurali convolutive; b. memorizzazione degli embedding animali ottenuti in una banca dati; c. riconoscimento di una identità animale per mezzo di reti neurali convolutive. La presente invenzione comprende anche un sistema per il riconoscimento biometrico univoco di un animale che utilizza il metodo precedentemente descritto.


2018 - Multi-views Embedding for Cattle Re-identification [Relazione in Atti di Convegno]
Bergamini, Luca; Porrello, Angelo; Andrea Capobianco Dondona, ; Ercole Del Negro, ; Mauro, Mattioli; Nicola, D’Alterio; Calderara, Simone
abstract

People re-identification task has seen enormous improvements in the latest years, mainly due to the development of better image features extraction from deep Convolutional Neural Networks (CNN) and the availability of large datasets. However, little research has been conducted on animal identification and re-identification, even if this knowledge may be useful in a rich variety of different scenarios. Here, we tackle cattle re-identification exploiting deep CNN and show how this task is poorly related to the human one, presenting unique challenges that make it far from being solved. We present various baselines, both based on deep architectures or on standard machine learning algorithms, and compared them with our solution. Finally, a rich ablation study has been conducted to further investigate the unique peculiarities of this task.


2018 - Self-Supervised Optical Flow Estimation by Projective Bootstrap [Articolo su rivista]
Alletto, Stefano; Abati, Davide; Calderara, Simone; Cucchiara, Rita; Rigazio, Luca
abstract

Dense optical flow estimation is complex and time consuming, with state-of-the-art methods relying either on large synthetic data sets or on pipelines requiring up to a few minutes per frame pair. In this paper, we address the problem of optical flow estimation in the automotive scenario in a self-supervised manner. We argue that optical flow can be cast as a geometrical warping between two successive video frames and devise a deep architecture to estimate such transformation in two stages. First, a dense pixel-level flow is computed with a projective bootstrap on rigid surfaces. We show how such global transformation can be approximated with a homography and extend spatial transformer layers so that they can be employed to compute the flow field implied by such transformation. Subsequently, we refine the prediction by feeding a second, deeper network that accounts for moving objects. A final reconstruction loss compares the warping of frame Xₜ with the subsequent frame Xₜ₊₁ and guides both estimates. The model has the speed advantages of end-to-end deep architectures while achieving competitive performances, both outperforming recent unsupervised methods and showing good generalization capabilities on new automotive data sets.


2018 - Using Kinect camera for investigating intergroup non-verbal human interactions [Abstract in Atti di Convegno]
Vezzali, Loris; Di Bernardo, Gian Antonio; Cadamuro, Alessia; Cocco, Veronica Margherita; Crapolicchio, Eleonora; Bicocchi, Nicola; Calderara, Simone; Giovannini, Dino; Zambonelli, Franco; Cucchiara, Rita
abstract

A long tradition in social psychology focused on nonverbal behaviour displayed during dyadic interactions generally relying on evaluations from external coders. However, in addition to the fact that external coders may be biased, they may not capture certain type of behavioural indices. We designed three studies examining explicit and implicit prejudice as predictors of nonberval behaviour as reflected in objective indices provided by Kinect cameras. In the first study, we considered White-Black relations from the perspective of 36 White participants. Results revealed that implicit prejudice was associated with a reduction in interpersonal distance and in the volume of space between Whites and Blacks (vs. Whites and Whites), which in turn were associated with evaluations by collaborators taking part in the interaction. In the second study, 37 non-HIV participants interacted with HIV individuals. We found that implicit prejudice was associated with reduced volume of space between interactants over time (a process of bias overcorrection) only when they tried hard to control their behaviour (as captured by a stroop test). In the third study 35 non-disabled children interacted with disabled children. Results revealed that implicit prejudice was associated with reduced interpersonal distance over time.


2017 - A new era in the study of intergroup nonverbal behaviour: Studying intergroup dyadic interactions “online” [Abstract in Atti di Convegno]
DI BERNARDO, GIAN ANTONIO; Vezzali, Loris; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia
abstract

We examined predictors and consequences of intergroup nonverbal behaviour by relying on new technologies and new objective indices. In three studies, both in the laboratory and in the field with children, behaviour was a function of implicit prejudice.


2017 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Relazione in Atti di Convegno]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Eventually, we present a video-specific architecture based on the C3D network, which can extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amount of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned by human ground-thuth.


2017 - From Groups to Leaders and Back [Capitolo/Saggio]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Recently, social theories and empirical observations identified small groups and leaders as the basic elements which shape a crowd. This leads to an intermediate level of abstraction that is placed between the crowd as a flow of people, and the crowd as a collection of individuals. Consequently, automatic analysis of crowds in computer vision is also experiencing a shift in focus from individuals to groups and from small groups to their leaders. In this chapter, we present state-of-the-art solutions to the groups and leaders detection problem, which are able to account for physical factors as well as for sociological evidence observed over short time windows. The presented algorithms are framed as structured learning problems over the set of individual trajectories. However, the way trajectories are exploited to predict the structure of the crowd is not fixed but rather learned from recorded and annotated data, enabling the method to adapt these concepts to different scenarios, densities, cultures, and other unobservable complexities. Additionally, we investigate the relation between leaders and their groups and propose the first attempt to exploit leadership as prior knowledge for group detection.


2017 - Generative Adversarial Models for People Attribute Recognition in Surveillance [Relazione in Atti di Convegno]
Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we propose a deep architecture for detecting people attributes (e.g. gender, race, clothing ...) in surveillance contexts. Our proposal explicitly deal with poor resolution and occlusion issues that often occur in surveillance footages by enhancing the images by means of Deep Convolutional Generative Adversarial Networks (DCGAN). Experiments show that by combining both our Generative Reconstruction and Deep Attribute Classification Network we can effectively extract attributes even when resolution is poor and in presence of strong occlusions up to 80% of the whole person figure.


2017 - Learning Where to Attend Like a Human Driver [Relazione in Atti di Convegno]
Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

Despite the advent of autonomous cars, it's likely - at least in the near future - that human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task. In this paper we study the dynamics of the driver's gaze and use it as a proxy to understand related attentional mechanisms. First, we build our analysis upon two questions: where and what the driver is looking at? Second, we model the driver's gaze by training a coarse-to-fine convolutional network on short sequences extracted from the DR(eye)VE dataset. Experimental comparison against different baselines reveal that the driver's gaze can indeed be learnt to some extent, despite i) being highly subjective and ii) having only one driver's gaze available for each sequence due to the irreproducibility of the scene. Eventually, we advocate for a new assisted driving paradigm which suggests to the driver, with no intervention, where she should focus her attention.


2017 - Learning to Map Vehicles into Bird's Eye View [Relazione in Atti di Convegno]
Palazzi, Andrea; Borghi, Guido; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

Awareness of the road scene is an essential component for both autonomous vehicles and Advances Driver Assistance Systems and is gaining importance both for the academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M couples of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep-network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that is able to generalize on real-world data despite having been trained solely on synthetic ones.


2017 - Signal Processing and Machine Learning for Diplegia Classification [Relazione in Atti di Convegno]
Bergamini, Luca; Calderara, Simone; Bicocchi, Nicola; Ferrari, Alberto; Vitetta, Giorgio
abstract

Diplegia is one of the most common forms of a broad family of motion disorders named cerebral palsy (CP) affecting the voluntary muscular system. In recent years, various classification criteria have been proposed for CP, to assist in diagnosis, clinical decision-making and communication. In this manuscript, we divide the spastic forms of CP into 4 other categories according to a previous classification criterion and propose a machine learning approach for automatically classifying patients. Training and validation of our approach are based on data about 200 patients acquired using 19 markers and high frequency VICON cameras in an Italian hospital. Our approach makes use of the latest deep learning techniques. More specifically, it involves a multi-layer perceptron network (MLP), combined with Fourier analysis. An encouraging classification performance is obtained for two of the four classes.


2017 - Tracking social groups within and across cameras [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita
abstract

We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the problem as a supervised clustering problem where a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set on the problem it comes with the proposal of a suitable evaluation measure. Results of adopting learning for the task are encouraging, scoring a +15% improvement in F1 measure over a non-learning based clustering baseline. To our knowledge this is the first proposal of this kind dealing with multi-camera group tracking.


2016 - DR(eye)VE: a Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving [Relazione in Atti di Convegno]
Alletto, Stefano; Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Autonomous and assisted driving are undoubtedly hot topics in computer vision. However, the driving task is extremely complex and a deep understanding of drivers' behavior is still lacking. Several researchers are now investigating the attention mechanism in order to define computational models for detecting salient and interesting objects in the scene. Nevertheless, most of these models only refer to bottom up visual saliency and are focused on still images. Instead, during the driving experience the temporal nature and peculiarity of the task influence the attention mechanisms, leading to the conclusion that real life driving data is mandatory. In this paper we propose a novel and publicly available dataset acquired during actual driving. Our dataset, composed by more than 500,000 frames, contains drivers' gaze fixations and their temporal integration providing task-specific saliency maps. Geo-referenced locations, driving speed and course complete the set of released data. To the best of our knowledge, this is the first publicly available dataset of this kind and can foster new discussions on better understanding, exploiting and reproducing the driver's attention process in the autonomous and assisted cars of future generations.


2016 - Quick, accurate, smart: 3D computer vision technology helps assessing confined animals' behaviour [Articolo su rivista]
Barnard, Shanis; Calderara, Simone; Pistocchi, Simone; Cucchiara, Rita; Podaliri Vulpiani, Michele; Messori, Stefano; Ferri, Nicola
abstract

Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment. Behavioural indicators are thus among the preferred parameters to assess welfare. However, behavioural recording (usually from video) can be very time consuming and the accuracy and reliability of the output rely on the experience and background of the observers. The outburst of new video technology and computer image processing gives the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can be later labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour in time or between individuals could be assessed. The software accuracy in correctly detecting the dogs' behaviour was checked through a validation process. An automatic behaviour recognition system, independent from human subjectivity, could add scientific knowledge on animals' quality of life in confinement as well as saving time and resources. This 3D framework was designed to be invariant to the dog's shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.


2016 - Socially Constrained Structural Learning for Groups Detection in Crowd [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on both ground truth trajectories and tracklets previously extracted by available detector/tracker systems.


2016 - Spotting prejudice with nonverbal behaviours [Relazione in Atti di Convegno]
Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Vezzali, Loris; DI BERNARDO, GIAN ANTONIO; Zambonelli, Franco; Cucchiara, Rita
abstract

Despite prejudice cannot be directly observed, nonverbal behaviours provide profound hints on people inclinations. In this paper, we use recent sensing technologies and machine learning techniques to automatically infer the results of psychological questionnaires frequently used to assess implicit prejudice. In particular, we recorded 32 students discussing with both white and black collaborators. Then, we identified a set of features allowing automatic extraction and measured their degree of correlation with psychological scores. Results confirmed that automated analysis of nonverbal behaviour is actually possible thus paving the way for innovative clinical tools and eventually more secure societies.


2016 - Transductive People Tracking in Unconstrained Surveillance [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Long term tracking of people in unconstrained scenarios is still an open problem due to the absence of constant elements in the problem setting. The camera, when active, may move and both the background and the target appearance may change abruptly leading to the inadequacy of most standard tracking techniques. We propose to exploit a learning approach that considers the tracking task as a semi supervised learning (SSL) problem. Given few target samples the aim is to search the target occurrences in the video stream re-interpreting the problem as label propagation on a similarity graph. We propose a solution based on graph transduction that works iteratively frame by frame. Additionally, in order to avoid drifting, we introduce an update strategy based on an evolutionary clustering technique that chooses the visual templates that better describe target appearance evolving the model during the processing of the video. Since we model people appearance by means of covariance matrices on color and gradient information our framework is directly related to structure learning on Riemannian manifolds. Tests on publicly available datasets and comparisons with stateof- the-art techniques allow to conclude that our solution exhibit interesting performances in terms of tracking precision and recall in most of the considered scenarios.


2015 - Active query process for digital video surveillance forensic applications [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Multimedia forensics is a new emerging discipline regarding the analysis and exploitation of digital data as support for investigation to extract probative elements. Among them, visual data about people and people activities, extracted from videos in an efficient way, are becoming day by day more appealing for forensics, due to the availability of large video-surveillance footage. Thus, many research studies and prototypes investigate the analysis of soft biometrics data, such as people appearance and people trajectories. In this work, we propose new solutions for querying and retrieving visual data in an interactive and active fashion for soft biometrics in forensics. The innovative proposal joins the capability of transductive learning for semi-supervised search by similarity and a typical multimedia methodology based on user-guided relevance feedback to allow an active interaction with the visual data of people, appearance and trajectory in large surveillance areas. Approaches proposed are very general and can be exploited independently by the surveillance setting and the type of video analytic tools.


2015 - Learning to Divide and Conquer for Online Multi-Target Tracking [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are previously extracted independently in each frame and then objects trajectories are built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem in local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlights a significant improvement of tracking performances (MOTA +10%) over the state of the art.


2015 - Learning to identify leaders in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Leader identification is a crucial task in social analysis, crowd management and emergency planning. In this paper, we investigate a computational model for the individuation of leaders in crowded scenes. We deal with the lack of a formal definition of leadership by learning, in a supervised fashion, a metric space based exclusively on people spatiotemporal information. Based on Tarde's work on crowd psychology, individuals are modeled as nodes of a directed graph and leaders inherits their relevance thanks to other members references. We note this is analogous to the way websites are ranked by the PageRank algorithm. During experiments, we observed different feature weights depending on the specific type of crowd, highlighting the impossibility to provide a unique interpretation of leadership. To our knowledge, this is the first attempt to study leader identification as a metric learning problem


2015 - Towards the evaluation of reproducible robustness in tracking-by-detection [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Conventional experiments on MTT are built upon the belief that fixing the detections to different trackers is sufficient to obtain a fair comparison. In this work we argue how the true behavior of a tracker is exposed when evaluated by varying the input detections rather than by fixing them. We propose a systematic and reproducible protocol and a MATLAB toolbox for generating synthetic data starting from ground truth detections, a proper set of metrics to understand and compare trackers peculiarities and respective visualization solutions.


2015 - Understanding social relationships in egocentric vision [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

The understanding of mutual people interaction is a key component for recognizing people social behavior, but it strongly relies on a personal point of view resulting difficult to be a-priori modeled. We propose the adoption of the unique head mounted cameras first person perspective (ego-vision) to promptly detect people interaction in different social contexts. The proposal relies on a complete and reliable system that extracts people׳s head pose combining landmarks and shape descriptors in a temporal smoothed HMM framework. Finally, interactions are detected through supervised clustering on mutual head orientation and people distances exploiting a structural learning framework that specifically adjusts the clustering measure according to a peculiar scenario. Our solution provides the flexibility to capture the interactions disregarding the number of individuals involved and their level of acquaintance in context with a variable degree of social involvement. The proposed system shows competitive performances on both publicly available ego-vision datasets and ad hoc benchmarks built with real life situations.


2014 - A complete system for garment segmentation and color classification [Articolo su rivista]
Manfredi, Marco; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

In this paper, we propose a general approach for automatic segmentation, color-based retrieval and classification of garments in fashion store databases, exploiting shape and color information. The garment segmentation is automatically initialized by learning geometric constraints and shape cues, then it is performed by modeling both skin and accessory colors with Gaussian Mixture Models. For color similarity retrieval and classification, to adapt the color description to the users’ perception and the company marketing directives, a color histogram with an optimized binning strategy, learned on the given color classes, is introduced and combined with HOG features for garment classification. Experiments validating the proposed strategy, and a free-to-use dataset publicly available for scientific purposes, are finally detailed.


2014 - Detection of static groups and crowds gathered in open spaces by texture classification [Articolo su rivista]
Manfredi, Marco; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

A surveillance system specifically developed to manage crowded scenes is described in this paper. In particular we focused on static crowds, composed by groups of people gathered and stayed in the same place for a while. The detection and spatial localization of static crowd situations is performed by means of a One Class Support Vector Machine, working on texture features extracted at patch level. Spatial regions containing crowds are identified and filtered using motion information to prevent noise and false alarms due to moving flows of people. By means of one class classification and inner texture descriptors, we are able to obtain, from a single training set, a sufficiently general crowd model that can be used for all the scenarios that shares a similar viewpoint. Tests on public datasets and real setups validate the proposed system.


2014 - From Ego to Nos-Vision: Detecting Social Relationships in First-Person Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

In this paper we present a novel approach to detect groups in ego-vision scenarios. People in the scene are tracked through the video sequence and their head pose and 3D location are estimated. Based on the concept of f-formation, we define with the orientation and distance an inherently social pairwise feature that describes the affinity of a pair of people in the scene. We apply a correlation clustering algorithm that merges pairs of people into socially related groups. Due to the very shifting nature of social interactions and the different meanings that orientations and distances can assume in different contexts, we learn the weight vector of the correlation clustering using Structural SVMs. We extensively test our approach on two publicly available datasets showing encouraging results when detecting groups from first-person camera views.


2014 - Head Pose Estimation in First-Person Camera Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a new method for head pose real-time estimation in ego-vision scenarios that is a key step in the understanding of social interactions. In order to robustly detect head under changing aspect ratio, scale and orientation we use and extend the Hough-Based Tracker which allows to follow simultaneously each subject in the scene. In an ego-vision scenario where a group interacts in a discussion, each subject's head orientation will be more likely to remain focused for a while on the person who has the floor. In order to encode this behavior we include a stateful Hidden Markov Model technique that enforces the predicted pose with the temporal coherence from a video sequence. We extensively test our approach on several indoor and outdoor ego-vision videos with high illumination variations showing its validity and outperforming other recent related state of the art approaches.


2014 - Kernelized Structural Classification for 3D Dogs Body Parts Detection [Relazione in Atti di Convegno]
Pistocchi, Simone; Calderara, Simone; Barnard, S.; Ferri, N.; Cucchiara, Rita
abstract

Despite pattern recognition methods for human behavioral analysis has flourished in the last decade, animal behavioral analysis has been almost neglected. Those few approaches are mostly focused on preserving livestock economic value while attention on the welfare of companion animals, like dogs, is now emerging as a social need. In this work, following the analogy with human behavior recognition, we propose a system for recognizing body parts of dogs kept in pens. We decide to adopt both 2D and 3D features in order to obtain a rich description of the dog model. Images are acquired using the Microsoft Kinect to capture the depth map images of the dog. Upon depth maps a Structural Support Vector Machine (SSVM) is employed to identify the body parts using both 3D features and 2D images. The proposal relies on a kernelized discriminative structural classificator specifically tailored for dogs independently from the size and breed. The classification is performed in an online fashion using the LaRank optimization technique to obtaining real time performances. Promising results have emerged during the experimental evaluation carried out at a dog shelter, managed by IZSAM, in Teramo, Italy.


2014 - Visual Tracking: An Experimental Survey [Articolo su rivista]
A. W. M., Smeulder; D. M., Chu; Cucchiara, Rita; Calderara, Simone; A., Dehghan; M., Shah
abstract

There is a large variety of trackers, which have been proposed in the literature during the last two decades with some mixed success. Object tracking in realistic scenarios is difficult problem, therefore it remains a most active area of research in Computer Vision. A good tracker should perform well in a large number of videos involving illumination changes, occlusion, clutter, camera motion, low contrast, specularities and at least six more aspects. However, the performance of proposed trackers have been evaluated typically on less than ten videos, or on the special purpose datasets. In this paper, we aim to evaluate trackers systematically and experimentally on 315 video fragments covering above aspects. We selected a set of nineteen trackers to include a wide variety of algorithms often cited in literature, supplemented with trackers appearing in 2010 and 2011 for which the code was publicly available. We demonstrate that trackers can be evaluated objectively by survival curves, Kaplan Meier statistics, and Grubs testing. We find that in the evaluation practice the F-score is as effective as the object tracking accuracy (OTA) score. The analysis under a large variety of circumstances provides objective insight into the strengths and weaknesses of trackers.


2013 - Social Groups Detection in Crowd through Shape-Augmented Structured LearningImage Analysis and Processing – ICIAP 2013 [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone
abstract

Most of the behaviors people exhibit while being part of a crowd are social processes that tend to emerge among groups and as a consequence, detecting groups in crowds is becoming an important issue in modern behavior analysis. We propose a supervised correlation clustering technique that employs Structural SVM and a proxemic based feature to learn how to partition people trajectories in groups, by injecting in the model socially plausible shape configurations. By taking into account social groups patterns, the system is able to outperform state of the art methods on two publicly available benchmark sets of videos.


2013 - Structured learning for detection of social groups in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Group detection in crowds will play a key role in future behavior analysis surveillance systems. In this work we build a new Structural SVM-based learning framework able to solve the group detection task by exploiting annotated video data to deduce a sociologically motivated distance measure founded on Hall's proxemics and Granger's causality. We improve over state-of-the-art results even in the most crowded test scenarios, while keeping the classification time affordable for quasi-real time applications. A new scoring scheme specifically designed for the group detection task is also proposed.


2012 - Integrate tool for online analysis and offline mining of people trajectories [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

In the past literature, online alarm-based video-surveillance and offline forensic-based data mining systems are often treated separately, even from different scientific communities. However, the founding techniques are almost the same and, despite some examples in commercial systems, the cases on which an integrated approach is followed are limited. For this reason, this study describes an integrated tool capable of putting together these two subsystems in an effective way. Despite its generality, the proposal is here reported in the case of people trajectory analysis, both in real time and offline. Trajectories are modelled based on either their spatial location or their shape, and proper similarity measures are proposed. Special solutions to meet real-time requirements in both cases are also presented and the trade-off between efficiency and efficacy is analysed by comparing when using a statistical model and when not. Examples of results in large datasets acquired in the University campus are reported as preliminary evaluation of the system.


2012 - Learning Non-Target Items for Interesting Clothes Segmentation in Fashion Images [Relazione in Atti di Convegno]
Grana, Costantino; Calderara, Simone; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we propose a color-based approach for skin detection and interest garment selection aimed at an automatic segmentation of pieces of clothing. For both purposes, the color description is extracted by an iterative energy minimization approach and an automatic initialization strategy is proposed by learning geometric constraints and shape cues. Experiments confirms the good performance of this technique both in the context of skin removal and in the context of classification of garments.


2012 - Understanding dyadic interactions applying proxemic theory on videosurveillance trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

Understanding social and collective people behaviour in open spaces is one of the frontier of modern video surveillance. Many sociological theories, and proxemics in particular, have been proved their validity as a support for classifying and interpreting human behaviour. Proxemics suggest some simple but effective behavioural rules, useful to understand what people are doing and their social involvement with other individuals. In this paper we propose to extend the proxemics analysis along the time and provide a solution for analysing sequences of proxemic states computed between trajectories of people pairs (dyads). Trajectories, computed from videosurveillance videos, are first analysed and converted to a sequence of symbols according to proxemic theory. Then an elastic measure for comparing those sequences is introduced. Finally, interactions are classified both in an off-line unsupervised way and in an on-line fashion. Results on videosurveillance data, demonstrate that sequences of proxemic states can be effective in characterizing mutual interactions and experiments in capturing the most frequent dyads interactions and on-line classifying them when a labelled training set is available are proposed.


2011 - Appearance tracking by transduction in surveillance scenarios [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

We propose a formulation of people tracking problem as a Transductive Learning (TL) problem. TL is an effective semi-supervised learning technique by which many classification problems have been recently reinterpreted as learning labels from incomplete datasets. In our proposal the joint exploitation of spectral graph theory and Riemannian manifold learning tools leads to the formulation of a robust approach for appearance based tracking in Video Surveillance scenarios. The key advantage of the presented method is a continuously updated model of the tracked target, used in the TL process, that allows to on-line learn the target visual appearance and consequently to improve the tracker accuracy. Experiments on public datasets show an encouraging advancement over alternative state-of the-art techniques.


2011 - Detecting Anomalies in People’s Trajectories using Spectral Graph Analysis [Articolo su rivista]
Calderara, Simone; Uri, Heinemann; Prati, Andrea; Cucchiara, Rita; Naftali, Tishby
abstract

Video surveillance is becoming the technology of choice for monitoring crowded areas for security threats. While video provides ample information for human inspectors, there is a great need for robust automated techniques that can efficiently detect anomalous behavior in streaming video from single ormultiple cameras. In this work we synergistically combine two state-of-the-art methodologies. The rst is the ability to track and label single person trajectories in a crowded area using multiple video cameras, and the second is a new class of novelty detection algorithms based on spectral analysis of graphs. By representing the trajectories as sequences of transitions betweennodes in a graph, shared individual trajectories capture only a small subspace of the possible trajectories on the graph. This subspace is characterized by large connected components of the graph, which are spanned by the eigenvectors with the low eigenvalues of the graph Laplacian matrix. Using this technique, we develop robust invariant distance measures for detectinganomalous trajectories, and demonstrate their application on realvideo data.


2011 - Feature Space Warping Relevance Feedback with Transductive Learning [Relazione in Atti di Convegno]
Borghesani, Daniele; Coppi, Dalia; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

Relevance feedback is a widely adopted approach to improve content-based information retrieval systems by keeping the user in the retrieval loop. Among the fundamental relevance feedback approaches, feature space warping has been proposed as an effective approach for bridging the gap between high-level semantics and the low-level features. Recently, combination of feature space warping and query point movement techniques has been proposed in contrast to learning based approaches, showing good performance under dierent data distributions. In this paper we propose to merge feature space warping and transductive learning, in order to benet from both the ability of adapting data to the user hints and the information coming from unlabeled samples. Experimental results on an image retrieval task reveal signicant performance improvements from the proposed method.


2011 - Iterative active querying for surveillance data retrieval in crime detection and forensics [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Large sets of visual data are now available both, in real time andoff line, at time of investigation in multimedia forensics, however passive querying systems often encounter difficulties in retrieving significant results. In this paper we propose an iterativeactive querying system for video surveillance and forensic applications based on the continuous interaction between the userand the system. The positive and negative user feedbacks areexploited as the input of a graph based transductive procedurefor iteratively refining the initial query results. Experimentsare shown using people trajectories and people appearance asdistance metrics.


2011 - Markerless Body Part Tracking for Action Recognition [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognising human actions bytracking body parts without using artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians (MoG) is then employed to extract and track significantpoints on this map, corresponding to significant regions on the human silhouette. The evolution of the mixture in time is analysed by transforming it in a sequence of symbols (corresponding to a MoG). The similarity between actions is computed by applying global alignment and dynamic programming techniques to the corresponding sequences and using a variational approximation of the Kullback-Leibler divergence to measure the dissimilarity between two MoGs. Experiments on publicly available datasets and comparison with existing methods are provided.


2011 - Mixtures of von Mises Distributions for People Trajectory Shape Analysis [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

People trajectory analysis is a recurrent task inmany pattern recognition applications, such as surveillance,behavior analysis, video annotation, and many others. In thispaper we propose a new framework for analyzing trajectoryshape, invariant to spatial shifts of the people motion in thescene. In order to cope with the noise and the uncertainty ofthe trajectory samples, we propose to describe the trajectoriesas a sequence of angles modelled by distributions of circularstatistics, i.e. a mixture of von Mises (MovM) distributions.To deal with MovM, we define a new specific EM algorithmfor estimating the parameters and derive a closed form of theBhattacharyya distance between single vM pdfs. Trajectories arethen modelled with a sequence of symbols, corresponding tothe most suitable distribution in the mixture, and comparedeach other after a global alignment procedure to cope withtrajectories of different lengths. The trajectories in the trainingset are clustered according with their shape similarity in an offlinephase, and testing trajectories are then classified with aspecific on-line EM, based on sufficient statistics. The approachis particularly suitable for classifying people trajectories in videosurveillance, searching for abnormal (i.e. infrequent) paths. Testson synthetic and real data are provided with also a completecomparison with other circular statistical and alignment methods.


2011 - People appearance tracing in video by spectral graph transduction [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Following people in different video sources is a challenging task: variations in the type of camera, in the lighting conditions, in the scene settings (e.g. crowd or occlusions) and in the point of view must be accounted. In this paper we propose a system based only on appearance information that, disregarding temporal and spatial information, can be flexibly applied on both moving and static cameras. We exploit the joint use of transductive learning and spectral properties of graph Laplacians proposing a formulation of the people tracing problem as a semi-supervised classification. The knowledge encoded in two labeled input sets of positive and negative samples of the target person and the continuous spectral update of these models allow us to obtain a robust approach for people tracing in surveillance video sequences. Experiments on publicly available datasets show satisfactory results and exhibit a good robustness in dealing with short and long term occlusions.


2011 - Vision based smoke detection system using image energy and color information [Articolo su rivista]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

Smoke detection is a crucial task in many video surveillance applications and could have a great impact to raise the level of safety of urban areas. Many commercial smoke detection sensors exist but most of them cannot be applied in open space or outdoor scenarios. With this aim, the paper presents a smoke detection system that uses a common CCD camera sensor to detect smoke in images and trigger alarms. First, a proper background model is proposed to reliably extract smoke regions and avoid over-segmentation and false positives in outdoor scenarios where many distractors are present, such as moving trees or light reflexes. A novel Bayesian approach is adopted to detect smoke regions in the scene analyzing image energy by means of the Wavelet Transform coefficients and Color Information. A statistical model of image energy is built, using a temporal Gaussian Mixture, to analyze the energy decay that typically occurs when smoke covers the scene then the detection is strengthen evaluating the color blending between a reference smoke color and the input frame. The proposed system is capable of detecting rapidly smoke events both in night and in day conditions with a reduced number of false alarms hence is particularly suitable for monitoring large outdoor scenarios where common sensors would fail. An extensive experimental campaign both on recorded videos and live cameras evaluates the efficacy and efficiency of the system in many real world scenarios, such as outdoor storages and forests.


2010 - Alignment-based Similarity of People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for comparing people trajectories for video surveillance applications, based on semi-directional statistics. In fact, the modelling of a trajectory as a sequence of angles, speeds and time lags, requires the use of a statistical tool capable to jointly consider periodic and linear variables. Our statistical method is compared with two state-of-the-art methods.


2010 - Moving pixels in static cameras: detecting dangerous situations due to environment or people [Capitolo/Saggio]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Dangerous situations arise in everyday life and many efforts have been lavished to exploit technology to increase the level of safety in urban areas. Video analysis is absolutely one of the most important and emerging technology for security purposes. Automatic video surveillance systems commonly analyze the scene searching for moving objects. Well known techniques exist to cope with this problem that is commonly referred as \change detection". Every time a dierence against a reference model is sensed, it should be analyzed to allow the system to discriminateamong a usual situation or a possible threat. When the sensor is a camera, motion is the key element to detect changes and moving objects must be correctly classied according to their nature. In this context we can distinguish among two dierent kinds of threat that can lead to dangerous situations in a video-surveilled environment. The first one is due to environmental changes such as rain, fog or smoke present in the scene. This kind of phenomena are sensed by the camera as moving pixelsand, subsequently as moving objects in the scene. This kind of threats shares some common characteristics such as texture, shape and color information and can be detected observing the features' evolution in time. The second situation arises whenpeople are directly responsible of the dangerous situation. In this case a subject is acting in an unusual way leading to an abnormal situation. From the sensor's point of view, moving pixels are still observed, but specic features and time-dependent statistical models should be adopted to learn and then correctly detect unusual and dangerous behaviors. With these premises, this chapter will present two different case studies. The rst one describes the detection of environmental changes in theobserved scene and details the problem of reliably detecting smoke in outdoor environments using both motion information and global image features, such as color information and texture energy computed by the means of the Wavelet transform.The second refers to the problem of detecting suspicious or abnormal people behaviors by means of people trajectory analysis in a multiple cameras video-surveillance scenario. Specically, a technique to infer and learn the concept of normality is proposed jointly with a suitable statistical tool to model and robustly compare people trajectories.


2010 - People trajectory mining with statistical pattern recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

People social interaction analysis is a complex and interesting problem that can be faced from several points of view depending on the application context. In videosurveillance contexts many indicators of people habits and relations exist and, among these, people trajectories analysis can reveal many aspects of the way people behave in social environments. We propose a statistical framework for trajectories mining that analyzes, in an integrated solution, several aspects of the trajectories such as location, shape and speed properties. Three different models are proposed to deal with non-idealities of the selected features in conjunction with a robust inexact- matching similarity measure for comparing sequences with different lengths. Experimental results in a real scenario demonstrates the efficacy of the framework in clustering people trajectories with the purpose of analyze frequent behaviors in complex environments.


2009 - A Real-Time System for Abnormal Path Detection [Relazione in Atti di Convegno]
Calderara, Simone; C., Alaimo; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a real-time system capable to extract andmodel object trajectories from a multi-camera setup with theaim of identifying abnormal paths. The trajectories are modeledas a sequence of positional distributions (2D Gaussians)and clustered in the training phase by exploiting an innovativedistance measure based on a global alignment techniqueand Bhattacharyya distance between Gaussians. An on-lineclassification procedure is proposed in order to on-the-fly classifynew trajectories into either “normal” or “abnormal” (in thesense of rarely seen before, thus unusual and potentially interesting).Experiments on a real scenario will be presented.


2009 - Learning People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a system for people trajectory shape analysis by exploiting a statistical approach which accounts for sequences of both directional (the directions of the trajectory) and linear (the speeds) data. A semi-directional distribution (AWLG - Approximated Wrapped and Linear Gaussian) is used with a mixture to find main directions and speeds. A variational version of the mutual information criterion is proposed to prove the statistical dependency of the data. Then, in order to compare data sequences, we define an inexact method with a Kullback-Leibler-based distance measure and employ a global alignment technique is to handle sequences of different lengths and with local shifts or deformations. A comprehensive analysis of variable dependency and parameter estimation techniques are reported and evaluated on both synthetic and real data sets.


2009 - Statistical Pattern Recognition for Multi-Camera Detection, Tracking and Trajectory Analysis [Capitolo/Saggio]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This chapter will address most of the aspects of modern video surveillance with the reference to the research activity conducted at University of Modena and Reggio Emilia, Italy, within the scopes of the national FREE SURF (FREE SUrveillance in a pRivacy-respectFul way) and NATO-funded BE SAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction) projects. Moving object detection and tracking from a single camera, multi-camera consistent labeling and trajectory shape analysis for path classification will be the main topics of this chapter.


2009 - Video surveillance and multimedia forensics: an application to trajectory analysis [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper reports an application of trajectory analysis in which forensics and video surveillance techniques are jointly employed for providing a new tool of multimedia forensics. Advanced video surveillance techniques are used to extract from a multi-camera system the trajectories of the moving people which are then modelled by either their positions (projected on the ground plane) or their directions of movement. Both these two representations can be very suitable for querying large video repositories, by searching for similar trajectories in terms of either sequences of positions or trajectory shape (encoded as sequence of angles, where positions do not care). Preliminary examples of the possible use of this approach are shown.


2008 - "Inside the Bible": Segmentation, Annotation and Retrieval for a New Browsing Experience [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a system for automatic segmentation, annotation and image retrieval based on content, focused on illuminated manuscripts and in particular the Borso D'Este Holy Bible. To enhance the interaction possibilities with this work, full of decorations and illustrations, we exploit some well known document analysis techniques in addition to some new approaches, in order to achieve good segmentation of pages into meaningful visual objects with the relative annotation. We wanted to extend the standard keyword-based retrieval approach in a commentary with a modern visual-based retrieval by appearance similarity: an entire software user interface for exploration and visual search of illuminated manuscripts.


2008 - A Markerless Approach for Consistent Action Recognition in a Multi-camera System [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need of artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments showed the potentiality of the proposed approach.


2008 - Action Signature: a Novel Holistic Representation for Action Recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Recognizing different actions with a unique approach can be a difficult task. This paper proposes a novel holistic representation of actions that we called "action signature". This 1D trajectory is obtained by parsing the 2D image containing the orientations of the gradient calculated on the motion feature map called motion-history image. In this way, the trajectory is a sketch representation of how the object motion varies in time. A robust statistical framework based on mixtures of von Mises distributions and dynamic programming for sequence alignment are used to compare and classify actions/trajectories. The experimental results show a rather high accuracy in distinguishing quite complicated actions, such as drinking, jumping, or abandoning an object.


2008 - Bayesian-competitive Consistent Labeling for People Surveillance [Articolo su rivista]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper presents a novel and robust approach to consistent labeling for people surveillance in multi-camera systems. A general framework scalable to any number of cameras with overlapped views is devised. An off-line training process automatically computes ground-plane homography and recovers epipolar geometry. When a new object is detected in any one camera, hypotheses for potential matching objects in the other cameras are established. Each of the hypotheses is evaluated using a prior and likelihood value. The prior accounts for the positions of the potential matching objects, while the likelihood is computed by warping the vertical axis of the new object on the field of view of the other cameras and measuring the amount of match. In the likelihood, two contributions (forward and backward) are considered so as to correctly handle the case of groups of people merged into single objects. Eventually, a maximum-a-posteriori approach estimates the best label assignment for the new object. Comparisons with other methods based on homography and extensive outdoor experiments demonstrate that the proposed approach is accurate and robust in coping with segmentation errors and in disambiguating groups.


2008 - HECOL: Homography and Epipolar-based Consistent Labeling for Outdoor Park Surveillance [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Outdoor surveillance is one of the most attractive application of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous consistent labeling of tracked objects among the different views. This work proposes a new approach for coping with this problem in multi-camera systems with overlapped Fields of View (FoVs). The presence of overlapped zones allows the definition of a geometry-based approach to reconstruct correspondences between FoVs, using only homography and epipolar lines (hereinafter HECOL: Homography and Epipolar-based COnsistent Labeling) computed automatically with a training phase. We also propose a complete system that provides segmentation and tracking of people in each camera module. Segmentation is performed by means of the SAKBOT (Statistical and Knowledge Based Object Tracker) approach, suitably modified to cope with multi-modal backgrounds, reflections and other artefacts, typical of outdoor scenes. The extracted objects are tracked using a statistical appearance model robust against occlusions and segmentation errors. The main novelty of this paper is the approach to consistent labeling. A specific Camera Transition Graph is adopted to efficiently select the possible correspondence hypotheses between labels. A Bayesian MAP optimization assigns consistent labels to objects detected by several points of views: the object axis is computed from the shape tracked in each camera module and homography and epipolar lines allow a correct axis warping in other image planes. Both forward and backward probability contributions from the two different warping directions make the approach robust against segmentation errors, and capable of disambiguating groups of people. The system has been tested in a real setup of a urban public park, within the Italian LAICA (Laboratory of Ambient Intelligence for a friendly city) project. The experiments show how the system can correctly track and label objects in a distributed system with real-time performance. Comparisons with simpler consistent labeling methods and extensive outdoor experiments with ground truth demonstrate the accuracy and robustness of the proposed approach.


2008 - Reliable smoke detection system in the domains of image energy and color [Relazione in Atti di Convegno]
Piccinini, Paolo; Calderara, Simone; Cucchiara, Rita
abstract

Smoke detection calls for a reliable and fast distinction between background, moving objects and variable shapes that are recognizable as smoke. In our system we propose a stable background suppression module joined with a smoke detection module working on segmented objects. It exploits two features: the energy variation in wavelet model and a color model of the smoke. The decrease of energy ratio in wavelet domain between background and current image is a clue to detect smoke representing the variations of texture level. A mixture of Gaussians models this texture ratio for temporal evolution. The color model is used as reference to measure the deviation of the current pixel color from the model. The two features have been combined using a Bayesian classifier to detect smoke in the scene. Experiments on real data and a comparison between our background model and Gaussian Mixture(MOG) model for smoke detection are presented. © 2008 IEEE.


2008 - Smoke detection in video surveillance: A MoG model in the wavelet domain [Relazione in Atti di Convegno]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

The paper presents a new fast and robust technique of smoke detection in video surveillance images. The approach aims at detecting the spring or the presence of smoke by analyzing color and texture features of moving objects, segmented with background subtraction. The proposal embodies some novelties: first the temporal behavior of the smoke is modeled by a Mixture of Gaussians (MoG ) of the energy variation in the wavelet domain. The MoG takes into account the image energy variation due to either external luminance changes or the smoke propagation. It allows a distinction to energy variation due to the presence of real moving objects such as people and vehicles. Second, this textural analysis is enriched by a color analysis based on the blending function. Third, a Bayesian model is defined where the texture and color features, detected at block level, contributes to model the likelihood while a global evaluation of the entire image models the prior probability contribution. The resulting approach is very flexible and can be adopted in conjunction to a whichever video surveillance system based on dynamic background model. Several tests on tens of different contexts, both outdoor and indoor prove its robustness and precision. © 2008 Springer-Verlag Berlin Heidelberg.


2008 - Smoke detection in videosurveillance: the use of VISOR (Video Surveillance On-line Repository) [Relazione in Atti di Convegno]
Vezzani, Roberto; Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

Visor (VIdeo Surveillance Online Repository) is a large videorepository, designed for containing annotated video surveillancefootages, comparing annotations, evaluating systemperformance, and performing retrieval tasks. The web interfaceallows video browse, query by annotated conceptsor by keywords, compressed video preview, media downloadand upload. The repository contains metadata annotations,both manually created ground-truth data and automaticallyobtained outputs of particular systems. An exampleof application is the collection of videos and annotationsfor smoke detection, an important video surveillance task. Inthis paper we present the architecture of ViSOR, the build-insurveillance ontology which integrates many concepts, alsocoming from LSCOM, and MediaMill, the annotation toolsand the visualization of results for performance evaluation.The annotation is obtained with an automatic smoke detectionsystem, capable to detect people, moving objects, andsmoke in real-time.


2008 - Using circular statistics for trajectory shape analysis [Relazione in Atti di Convegno]
Prati, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

The analysis of patterns of movement is a crucial task for several surveillance applications, for instance to classify normal or abnormal people trajectories on the basis of their occurrence. This paper proposes to model the shape of a single trajectory as a sequence of angles described using a Mixture of Von Mises (MoVM) distribution. A complete EM (Expectation Maximization) algorithm is derived for MoVM parameters estimation and an on-line version proposed to meet real time requirement. Maximum-A-Posteriori is used to encode the trajectory as a sequence of symbols corresponding to the MoVM components. Iterative k-medoids clustering groups trajectories in a variable number of similarity classes. The similarity is computed aligning (with dynamic programming) two sequences and considering as symbol-to-symbol distance the Bhattacharyya distance between von Mises distributions. Extensive experiments have been performed on both synthetic and real data. ©2008 IEEE.


2007 - A Distributed Outdoor Video Surveillance System for Detection of Abnormal People Trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Distributed surveillance systems are nowadays widely adopted to monitor large areas for security purposes. In this paper, we present a complete multicamera system designed for people tracking from multiple partially overlapped views and capable of inferring and detecting abnormal people trajectories. Detection and tracking are performed by means of background suppression and an appearance-based probabilistic approach. Objects' label ambiguities are geometrically solved and the concept of "normality" is learned from data using a robust statistical model based on Von Mises distributions. Abnormal trajectories are detected using a first-order Bayesian network and, for each abnormal event, the appearance of the subject from each view is logged. Experiments demonstrate that our system can process with real-time performance up to three cameras simultaneously in an unsupervised setup and under varying environmental conditions.


2007 - A Dynamic Programming Technique for Classifying Trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, A.
abstract

This paper proposes the exploitation of a dynamic programming technique for efficiently comparing people trajectories adopting an encoding scheme that jointly takes into account both the direction and the velocity of movement. With this approach, each pair of trajectories in the training set is compared and the corresponding distance computed. Clustering is achieved by using the k-medoids algorithm and each cluster is modeled with a 1-D Gaussian over the distance from the medoid. A MAP framework is adopted for the testing phase. The reported results are encouraging.


2007 - Detection of Abnormal Behaviors using a Mixture of Von Mises Distributions [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper proposes the use of a mixture of Von Mises distributions to detect abnormal behaviors of moving people. The mixture is created from an unsupervised training set by exploiting k-medoids clustering algorithm based on Bhattacharyya distance between distributions. The extracted medoids are used as modes in the multi-modal mixture whose weights are the priors of the specific medoid. Given the mixture model a new trajectory is verified on the model by considering each direction composing it as independent. Experiments over a real scenario composed of multiple, partially-overlapped cameras are reported.


2006 - Group Detection at Camera Handoff for Collecting People Appearance in Multi-camera Systems [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable to fully characterize groups of people and to manage miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects' principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.


2006 - Multimedia Surveillance: Content-based Retrieval with Multicamera People Tracking [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Multimedia surveillance relates to the exploitation of multimedia tools for retrieving information from surveillance data, for emerging applications such as video post-analysis for forensic purposes. Searching for all the sequences in which a certain person was present is a typical query that is carried out by means of example images. Unfortunately, surveillance cameras often have low resolution, making retrieval based on appearance difficult. This paper proposes to exploit a two-step retrieval process that merges similarity-based retrieval with multicamera tracking-based retrieval able to create consistent traces of a person from different views and, thus, different resolutions. A mixture model is used to summarize these traces into a single prototype on which retrieval is performed. Experimental results demonstrate the accuracy of the retrieval process also in the case of varying illumination conditions.


2006 - Reliable background suppression for complex scenes [Relazione in Atti di Convegno]
Calderara, Simone; Melli, Rudy Mirko; Prati, Andrea; Cucchiara, Rita
abstract

This paper describes a system for motion detection based on background suppression,specifically conceived for working in complex scenes with vacillating background,camouflage, illumination changing, etc.. The system contains proper techniques for background bootstrapping, shadow removal, ghost suppression and selective updating of the background model. The results on the challenging videos provided in VSSN '06 Open Source Algorithm Competition dataset demonstrate that the proposed system outperforms the widely-used mixture-of-Gaussians approach.


2006 - The LAICA project: Experiments on Multicamera People Tracking and Logging [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from differentviewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable tofully characterize groups of people and to manage miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects’ principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label con£guration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.


2005 - Consistent labeling for multi-camera object tracking [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new approach to multi-camera object tracking based on the consistent labeling. An automatic and reliable procedure allows to obtain the homographic transformation between two overlapped views, without any manual calibration of the cameras. Object's positions are matched by using the homography when the object is firstly detected in one of the two views. The approach has been tested also in the case of simultaneous transitions and in the case in which people are detected as a group during the transition. Promising results are reported over a real setup of overlapped cameras.


2005 - Entry Edge of Field of View for multi-camera tracking in distributed video surveillance [Relazione in Atti di Convegno]
Calderara, Simone; Vezzani, Roberto; Prati, Andrea; Cucchiara, Rita
abstract

Efficient solution to people tracking in distributed videosurveillance is requested to monitor crowded and large environments.This paper proposes a novel use of the EntryEdges of Field of View (E2oFoV) to solve the consistentlabeling problem between partially overlapped views. Anautomatic and reliable procedure allows to obtain the homographictransformation between two overlapped views,without any manual calibration of the cameras. Throughthe homography, the consistent labeling is established eachtime a new track is detected in one of the cameras. A CameraTransition Graph (CTG) is defined to speed up the establishmentprocess by reducing the search space. Experimentalresults prove the effectiveness of the proposed solutionalso in challenging conditions.