- A feature-based separation method, for separating a plurality of loosely-arranged duplicate articles and a system for actuating the method for supplying a packaging machine [Brevetto]
G. Monti; A. Prati; P. Piccinini; R. Cucchiara
The invention relates to a segmentation method based on the characteristics for segmenting a plurality of duplicate articles (3) arranged loosely, which comprises stages of: acquiring an image (M) of a sample article (30); calculating keypoint-descriptors of the image (M); defining an identifying figure (Z) on the image (M); acquiring a first image (11) of a plurality of duplicate articles; performing a matching of the thus-defined keypoint-descriptor pairs; acquiring a position and an orientation of the identifying figure (Z) with respect to a first keypoint-descriptor pair of the image (M) having a match with a second keypoint-descriptor pair of the first image (11); defining, in the first image (11), an identifying figure of projection as a Euclidean transformation of the identifying figure (Z), with reference to the first and second pairs; applying the two preceding stages to a plurality of keypoint-descriptor pairs of the image (M) having a match with a keypoint-descriptor pair of the first image (11); collecting together identifying figures of projection having between them a predetermined degree of superposing; defining a representative figure for each group of identifying figures of projection which is formed by a minimum predetermined number of identifying figures of projection, which representative figure has a same shape and dimension as an identifying figure of projection, and is selected in order to estimate a position of a corresponding article illustrated in the first image of a plurality of duplicate articles. The invention also relates to a method for picking up articles (3) arranged loosely in a storage zone of articles (5) and for positioning the articles (3) in an outlet station (SU), and a group for actuating the method.

- Feature-based segmentation method, for segmenting a plurality of loosely-arranged duplicate articles and a group for actuating the method for supplying a packaging machine [Brevetto]
G. Monti; A. Prati; P. Piccinini; R. Cucchiara
The invention relates to a segmentation method based on the characteristics for segmenting a plurality of duplicate articles (3) arranged loosely, which comprises stages of: acquiring an image (M) of a sample article (30); calculating keypoint-descriptors of the image (M); defining an identifying figure (Z) on the image (M); acquiring a first image (11) of a plurality of duplicate articles; performing a matching of the thus-defined keypoint-descriptor pairs; acquiring a position and an orientation of the identifying figure (Z) with respect to a first keypoint-descriptor pair of the image (M) having a match with a second keypoint-descriptor pair of the first image (11); defining, in the first image (11), an identifying figure of projection as a Euclidean transformation of the identifying figure (Z), with reference to the first and second pairs; applying the two preceding stages to a plurality of keypoint-descriptor pairs of the image (M) having a match with a keypoint-descriptor pair of the first image (11); collecting together identifying figures of projection having between them a predetermined degree of superposing; defining a representative figure for each group of identifying figures of projection which is formed by a minimum predetermined number of identifying figures of projection, which representative figure has a same shape and dimension as an identifying figure of projection, and is selected in order to estimate a position of a corresponding article illustrated in the first image of a plurality of duplicate articles. The invention also relates to a method for picking up articles (3) arranged loosely in a storage zone of articles (5) and for positioning the articles (3) in an outlet station (SU), and a group for actuating the method.
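The projection stage described above can be illustrated with a minimal sketch. It assumes that each keypoint carries a position and an orientation, that no scale change occurs (a Euclidean transformation), and that the identifying figure is a polygon; the function name `project_figure` and the tuple layout are hypothetical and are not taken from the patent.

```python
import math

def project_figure(figure, model_kp, scene_kp):
    """Map a figure defined on the sample image into the scene image.

    figure   : list of (x, y) vertices on the sample image
    model_kp : (x, y, angle) of the matched keypoint on the sample image
    scene_kp : (x, y, angle) of the matching keypoint in the scene image
    Angles are in radians; scale is assumed to be 1 (rotation + translation only).
    """
    mx, my, ma = model_kp
    sx, sy, sa = scene_kp
    theta = sa - ma                      # relative rotation between the two keypoints
    c, s = math.cos(theta), math.sin(theta)
    projected = []
    for x, y in figure:
        dx, dy = x - mx, y - my          # vertex expressed relative to the model keypoint
        projected.append((sx + c * dx - s * dy, sy + s * dx + c * dy))
    return projected

# Hypothetical example: a 2x2 square around a keypoint at (1, 1),
# matched to a scene keypoint at (10, 5) that is rotated by 90 degrees.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
projected = project_figure(square, (1.0, 1.0, 0.0), (10.0, 5.0, math.pi / 2))
```

Each matched pair yields one projected figure; grouping the figures that overlap sufficiently, as the abstract describes, would be a separate step not shown here.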

2019 - Classifying Signals on Irregular Domains via Convolutional Cluster Pooling [Relazione in Atti di Convegno]
Porrello, Angelo; Abati, Davide; Calderara, Simone; Cucchiara, Rita
We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.

2019 - Recognizing social relationships from an egocentric vision perspective [Capitolo/Saggio]
Alletto, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, and their head pose and 3D location are estimated. Following the formalism of the f-formation, we use orientation and distance to define an inherently social pairwise feature capable of describing how two people stand in relation to one another. We present a Structural SVM based approach that learns how to weight each component of the feature vector depending on the social situation it is applied to. To better understand the social dynamics, we also estimate what we call the social relevance of each subject in a group using a saliency attentive model. Extensive tests on two publicly available datasets show that our solution achieves encouraging results when detecting social groups and their relevant subjects in challenging egocentric scenarios.
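As a rough illustration of a pairwise feature built from orientation and distance, the sketch below assumes 2D ground-plane positions and head yaw angles in radians. The exact feature used in the chapter is not given in the abstract, so the three components chosen here and the name `pairwise_feature` are assumptions.

```python
import math

def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def pairwise_feature(pos_a, yaw_a, pos_b, yaw_b):
    """Return (distance, a's deviation from facing b, b's deviation from facing a)."""
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    distance = math.hypot(dx, dy)
    bearing_ab = math.atan2(dy, dx)      # direction from a towards b
    bearing_ba = math.atan2(-dy, -dx)    # direction from b towards a
    dev_a = abs(wrap_angle(yaw_a - bearing_ab))
    dev_b = abs(wrap_angle(yaw_b - bearing_ba))
    return (distance, dev_a, dev_b)

# Hypothetical example: two people 2 m apart, looking straight at each other.
feature = pairwise_feature((0.0, 0.0), 0.0, (2.0, 0.0), math.pi)
```

Small deviations combined with a short distance suggest that the two people may belong to the same group; how the components are weighted would be learned, which this sketch does not attempt.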

2018 - Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita
While several approaches to bring vision and language together are emerging, none of them has yet addressed the digital humanities domain, which, nevertheless, is a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts from the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and outline future work along the same line.

2018 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Articolo su rivista]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform any early approach on public datasets. In this paper, we discuss the effectiveness of convolutional neural network (CNN) models in saliency prediction. We present a set of deep learning architectures developed by us, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We show how these deep networks closely recall the early saliency models, although improved with the semantics learned from the human ground truth. Finally, we present a use case in which saliency prediction is used to improve the automatic description of images.

2018 - Automatic Image Cropping and Selection using Saliency: an Application to Historical Manuscripts [Relazione in Atti di Convegno]
Cornia, Marcella; Pini, Stefano; Baraldi, Lorenzo; Cucchiara, Rita
Automatic image cropping techniques are particularly important to improve the visual quality of cropped images and can be applied to a wide range of applications such as photo-editing, image compression, and thumbnail selection. In this paper, we propose a saliency-based image cropping method which produces significant cropped images by only relying on the corresponding saliency maps. Experiments on standard image cropping datasets demonstrate the benefit of the proposed solution with respect to other cropping methods. Moreover, we present an image selection method that can be effectively applied to automatically select the most representative pages of historical manuscripts thus improving the navigation of historical digital libraries.

2018 - Comportamento non verbale intergruppi “oggettivo”: una replica dello studio di Dovidio, Kawakami e Gaertner (2002) [Abstract in Atti di Convegno]
Di Bernardo, Gian Antonio; Vezzali, Loris; Giovannini, Dino; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia; Cocco, Veronica Margherita
There is a long tradition of research analyzing nonverbal behavior, including in the context of intergroup relations. These studies usually rely on evaluations by external coders, which are nevertheless subjective and open to bias. We conducted a study that took as its reference the well-known study by Dovidio, Kawakami and Gaertner (2002), while introducing some modifications and considering the relation between Whites and Blacks. White participants, after completing measures of explicit and implicit prejudice, met (in counterbalanced order) a White collaborator and a Black collaborator. With each of them, they talked for three minutes about a neutral topic and about a topic salient to the group distinction (in counterbalanced order). These interactions were recorded with a Kinect camera, which can take into account the three-dimensional component of movement. The results revealed several elements of interest. First, objective indices were created, starting from an analysis of the literature, some of which cannot be detected by external coders, such as interpersonal distance and the volume of space between people. The results highlighted some relevant aspects: (1) implicit attitude is associated with various indices of nonverbal behavior, which mediate its effect on the evaluations of participants provided by the collaborators; (2) interactions should be considered dynamically, taking into account that they develop over time; (3) what may matter is global nonverbal behavior, rather than specific indices predetermined by the experimenters.

2018 - Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face [Relazione in Atti di Convegno]
Fabbri, Matteo; Borghi, Guido; Lanzi, Fabio; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
Can faces acquired by low-cost depth sensors be useful to see some characteristic details of the faces? Typically the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, that is effective for face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features of unknown individual faces, it is capable of reconstructing plausible faces; their appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with some Perceptual Probes, such as face aspect classification or landmark detection. Depth faces can be used in place of the corresponding RGB images, which are often unavailable because of darkness or difficult lighting conditions. Experimental results are very promising and are far better than previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications.

2018 - End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization [Relazione in Atti di Convegno]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object on the image according to its predicted pose; in this way the alignment error between the observed and the re-projected object silhouette can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Finally, we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently, and novel viewpoints in which both object arrangement and mutual pose are preserved can be rendered. Differentiable renderer code is available at: https://github.com/ndrplz/tensorflow-mesh-renderer.

2018 - Face Verification from Depth using Privileged Information [Relazione in Atti di Convegno]
Borghi, Guido; Pini, Stefano; Grazioli, Filippo; Vezzani, Roberto; Cucchiara, Rita
In this paper, a deep Siamese architecture for depth-based face verification is presented. The proposed approach efficiently verifies if two face images belong to the same person while handling a great variety of head poses and occlusions. The architecture, namely JanusNet, consists of a combination of a depth, an RGB and a hybrid Siamese network. During the training phase, the hybrid network learns to extract complementary mid-level convolutional features which mimic the features of the RGB network, simultaneously leveraging the light invariance of depth images. At testing time, the model, relying only on depth data, achieves state-of-the-art results and real-time performance, despite the lack of deep-oriented depth-based datasets.

2018 - Face-from-Depth for Head Pose Estimation on Depth Images [Articolo su rivista]
Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
Depth cameras make it possible to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make common RGB sensors unusable. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method outperforms several recent state-of-the-art works based on both intensity and depth input data, running in real time at more than 30 frames per second.

2018 - Fully Convolutional Network for Head Detection with Depth Images [Relazione in Atti di Convegno]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
Head detection and localization are among the most investigated and demanding tasks of the Computer Vision community. They are also a key element for many disciplines, like Human Computer Interaction, Human Behavior Understanding, Face Analysis and Video Surveillance. In recent decades, much effort has been devoted to developing accurate and reliable head or face detectors on standard RGB images, but only a few solutions concern other types of images, such as depth maps. In this paper, we propose a novel method for head detection on depth images, based on a deep learning approach. In particular, the presented system overcomes the classic sliding-window approach, which is often the main computational bottleneck of many object detectors, through a Fully Convolutional Network. Two public datasets, namely Pandora and Watch-n-Patch, are exploited to train and test the proposed network. Experimental results confirm the effectiveness of the method, which outperforms all state-of-the-art works based on depth images and runs in real time.

2018 - Hands on the wheel: a Dataset for Driver Hand Detection and Tracking [Relazione in Atti di Convegno]
Borghi, Guido; Frigieri, Elia; Vezzani, Roberto; Cucchiara, Rita
The ability to detect, localize and track the hands is crucial in many applications requiring the understanding of a person's behavior, attitude and interactions. In particular, this is true for the automotive context, in which hand analysis makes it possible to predict preparatory movements for maneuvers or to investigate the driver’s attention level. Moreover, due to the recent diffusion of cameras inside new car cockpits, it is feasible to use hand gestures to develop new Human-Car Interaction systems that are more user-friendly and safe. In this paper, we propose a new dataset, called Turms, that consists of infrared images of the driver’s hands, collected from the back of the steering wheel, an innovative point of view. The Leap Motion device has been selected for the recordings, thanks to its stereo capabilities and wide view angle. In addition, we introduce a method to detect the presence and the location of the driver’s hands on the steering wheel during driving tasks.

2018 - LAMV: Learning to align and match videos with kernelized temporal layers [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Douze, Matthijs; Cucchiara, Rita; Jégou, Hervé
This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.
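The idea of finding a temporal alignment by maximizing a score between two sequences of vectors can be sketched with a brute-force search over offsets. This is only a time-domain illustration that uses a plain dot product as the similarity; it is not the kernelized, Fourier-domain layer of the paper, and the name `best_offset` is hypothetical.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_offset(query, reference):
    """Return (offset, score) such that query[t] best matches reference[t + offset]."""
    best = None
    # Try every offset for which the two sequences can overlap.
    for offset in range(-(len(query) - 1), len(reference)):
        score = 0.0
        for t, q in enumerate(query):
            r = t + offset
            if 0 <= r < len(reference):
                score += dot(q, reference[r])
        if best is None or score > best[1]:
            best = (offset, score)
    return best

# Hypothetical frame descriptors: the query is a copy of reference frames 1 and 2.
reference = [(1, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
query = [(0, 1), (1, 1)]
offset, score = best_offset(query, reference)
```

The exhaustive search costs one pass over the query per candidate offset; evaluating all offsets jointly in the Fourier domain is one way to avoid that cost.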

2018 - Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World [Relazione in Atti di Convegno]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Palazzi, Andrea; Vezzani, Roberto; Cucchiara, Rita
Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for non-visible joints. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the largest computer graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is, to date, the largest dataset (about 500,000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data also exhibits good generalization capabilities on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.

2018 - Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention [Articolo su rivista]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
Image captioning has recently been gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Although saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, no model has yet succeeded in effectively incorporating these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We demonstrate, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to different image captioning baselines with and without saliency. Finally, we also show that the trained model can focus on salient and contextual regions during the generation of the caption in an appropriate way.

2018 - Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model [Articolo su rivista]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations. In this paper we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a Convolutional LSTM that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. Additionally, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state of the art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness on different scenarios.
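A prior map generated with a Gaussian function, as mentioned above for modelling the center bias, might look like the following sketch. In the paper the parameters of the priors are learned, whereas here the mean and standard deviation are fixed by hand; the normalized coordinate convention and the name `gaussian_prior` are assumptions.

```python
import math

def gaussian_prior(height, width, mu_x, mu_y, sigma_x, sigma_y):
    """Return a height x width map of an axis-aligned 2D Gaussian.

    Pixel coordinates are normalized to [0, 1]; height and width must be > 1.
    The map is not normalized to sum to one: its peak value is 1.
    """
    prior = []
    for row in range(height):
        y = row / (height - 1)
        line = []
        for col in range(width):
            x = col / (width - 1)
            exponent = ((x - mu_x) ** 2) / (2 * sigma_x ** 2) \
                     + ((y - mu_y) ** 2) / (2 * sigma_y ** 2)
            line.append(math.exp(-exponent))
        prior.append(line)
    return prior

# Hypothetical example: a centered prior on a tiny 5x5 grid.
prior = gaussian_prior(5, 5, 0.5, 0.5, 0.25, 0.25)
```

A map like this peaks at the image center and decays towards the borders, which is the shape of the center bias observed in human fixations.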

2018 - Predicting the Driver's Focus of Attention: the DR(eye)VE Project [Articolo su rivista]
Palazzi, Andrea; Abati, Davide; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from a roof-mounted camera), further enriched by other sensor measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.

2018 - SAM: Pushing the Limits of Saliency Prediction Models [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
The prediction of human eye fixations has recently been gaining a lot of attention thanks to the improvements shown by deep architectures. In our work, we go beyond classical feed-forward networks to predict saliency maps and propose a Saliency Attentive Model which incorporates neural attention mechanisms to iteratively refine predictions. Experiments demonstrate that the proposed strategy outperforms the state of the art by a considerable margin on the largest dataset available for saliency prediction. Here, we provide experimental results on other popular saliency datasets to confirm the effectiveness and the generalization capabilities of our model, which enable us to reach the state of the art on all considered datasets.

2018 - Self-Supervised Optical Flow Estimation by Projective Bootstrap [Articolo su rivista]
Alletto, Stefano; Abati, Davide; Calderara, Simone; Cucchiara, Rita; Rigazio, Luca
Dense optical flow estimation is complex and time consuming, with state-of-the-art methods relying either on large synthetic data sets or on pipelines requiring up to a few minutes per frame pair. In this paper, we address the problem of optical flow estimation in the automotive scenario in a self-supervised manner. We argue that optical flow can be cast as a geometrical warping between two successive video frames and devise a deep architecture to estimate such a transformation in two stages. First, a dense pixel-level flow is computed with a projective bootstrap on rigid surfaces. We show how such a global transformation can be approximated with a homography and extend spatial transformer layers so that they can be employed to compute the flow field implied by such a transformation. Subsequently, we refine the prediction by feeding a second, deeper network that accounts for moving objects. A final reconstruction loss compares the warping of frame Xₜ with the subsequent frame Xₜ₊₁ and guides both estimates. The model has the speed advantages of end-to-end deep architectures while achieving competitive performance, both outperforming recent unsupervised methods and showing good generalization capabilities on new automotive data sets.
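The flow field implied by a homography can be written down directly: each pixel is mapped through the 3x3 matrix in homogeneous coordinates, and the flow is the displacement between the mapped and the original position. The sketch below computes it densely in plain Python; it assumes the homography is already known (in the paper it is estimated by a network), and `homography_flow` is a hypothetical name.

```python
def homography_flow(H, width, height):
    """Return flow[y][x] = (u, v), the displacement implied by the 3x3 homography H."""
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            # Apply H to the homogeneous pixel coordinate (x, y, 1).
            xp = H[0][0] * x + H[0][1] * y + H[0][2]
            yp = H[1][0] * x + H[1][1] * y + H[1][2]
            w = H[2][0] * x + H[2][1] * y + H[2][2]
            # Perspective division, then subtract the original position.
            row.append((xp / w - x, yp / w - y))
        flow.append(row)
    return flow

# Hypothetical example: a pure translation by (+2, -1) gives a constant flow field.
H_shift = [[1.0, 0.0, 2.0],
           [0.0, 1.0, -1.0],
           [0.0, 0.0, 1.0]]
flow = homography_flow(H_shift, 3, 2)
```

A single homography can only describe the motion of a rigid planar scene, which is consistent with the abstract's use of a second stage to account for moving objects.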

2018 - Towards Cycle-Consistent Models for Text and Image Retrieval [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Rezazadegan Tavakoli, Hamed; Cucchiara, Rita
Cross-modal retrieval has recently become a research hot spot, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space in which text and images can be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and which regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.

2018 - Using Kinect camera for investigating intergroup non-verbal human interactions [Abstract in Atti di Convegno]
Vezzali, Loris; Di Bernardo, Gian Antonio; Cadamuro, Alessia; Cocco, Veronica Margherita; Crapolicchio, Eleonora; Bicocchi, Nicola; Calderara, Simone; Giovannini, Dino; Zambonelli, Franco; Cucchiara, Rita
A long tradition in social psychology has focused on nonverbal behaviour displayed during dyadic interactions, generally relying on evaluations from external coders. However, in addition to the fact that external coders may be biased, they may not capture certain types of behavioural indices. We designed three studies examining explicit and implicit prejudice as predictors of nonverbal behaviour as reflected in objective indices provided by Kinect cameras. In the first study, we considered White-Black relations from the perspective of 36 White participants. Results revealed that implicit prejudice was associated with a reduction in interpersonal distance and in the volume of space between Whites and Blacks (vs. Whites and Whites), which in turn were associated with evaluations by collaborators taking part in the interaction. In the second study, 37 non-HIV participants interacted with HIV individuals. We found that implicit prejudice was associated with reduced volume of space between interactants over time (a process of bias overcorrection) only when they tried hard to control their behaviour (as captured by a Stroop test). In the third study, 35 non-disabled children interacted with disabled children. Results revealed that implicit prejudice was associated with reduced interpersonal distance over time.

2018 - Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach [Relazione in Atti di Convegno]
Carraggi, Angelo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected into a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performance in fully supervised settings, its application to semi-supervised scenarios has rarely been investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred to a target dataset in which the pairing between images and sentences is not known, or not useful for training due to the limited size of the set. Experiments are performed on two unsupervised target scenarios, related to the fashion and cultural heritage domains respectively. Results show that our model is able to effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.
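Retrieval in a common embedding space reduces to ranking candidates by their similarity to the query embedding. The sketch below uses cosine similarity, which is a common choice but an assumption here, since the abstract only speaks of distances; the embeddings are made-up toy vectors and `rank_by_similarity` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    return sorted(range(len(candidates)),
                  key=lambda i: cosine(query, candidates[i]),
                  reverse=True)

# Toy example: one image embedding and three sentence embeddings.
image_embedding = [1.0, 0.0, 1.0]
sentence_embeddings = [
    [0.0, 1.0, 0.0],   # orthogonal to the image embedding
    [1.0, 0.1, 0.9],   # nearly parallel to the image embedding
    [1.0, 1.0, 1.0],   # partially aligned
]
ranking = rank_by_similarity(image_embedding, sentence_embeddings)
```

The same ranking function serves both directions of retrieval, from image to sentences and from sentence to images, because both modalities live in the same space.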

2018 - What was Monet seeing while painting? Translating artworks to photo-realistic images [Relazione in Atti di Convegno]
Tomei, Matteo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
State of the art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This leads to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images to realistic copies. In this paper, we present a model capable of translating paintings to photo-realistic images, trained without paired examples. The idea is to enforce a patch level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on Monet, Cezanne and Van Gogh paintings translation tasks, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.

2017 - A Video Library System Using Scene Detection and Automatic Tagging [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
We present a novel video browsing and retrieval system for edited videos, in which videos are automatically decomposed into meaningful and storytelling parts (i.e. scenes) and tagged according to their transcript. The system relies on a Triplet Deep Neural Network which exploits multimodal features, and has been implemented as a set of extensions to the eXo Platform Enterprise Content Management System (ECMS). This set of extensions enables the interactive visualization of a video, its automatic and semi-automatic annotation, as well as a keyword-based search inside the video collection. The platform also allows a natural integration with third-party add-ons, so that automatic annotations can be exploited outside the proposed platform.

2017 - A new era in the study of intergroup nonverbal behaviour: Studying intergroup dyadic interactions “online” [Abstract in Atti di Convegno]
DI BERNARDO, GIAN ANTONIO; Vezzali, Loris; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia
We examined predictors and consequences of intergroup nonverbal behaviour by relying on new technologies and new objective indices. In three studies, both in the laboratory and in the field with children, behaviour was a function of implicit prejudice.

2017 - Affective level design for a role-playing videogame evaluated by a brain–computer interface and machine learning methods [Articolo su rivista]
Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
Game science has become a research field that attracts industry attention owing to a rich worldwide market. To understand the player experience, concepts like flow or boredom mental states require formalization and empirical investigation, taking advantage of the objective data that psychophysiological methods like electroencephalography (EEG) can provide. This work studies affective ludology and presents two different game levels for Neverwinter Nights 2 developed with the aim of manipulating emotions; two sets of affective design guidelines are presented, with a rigorous formalization that considers the characteristics of the role-playing genre and its specific gameplay. An empirical investigation with a brain–computer interface headset has been conducted: by extracting numerical data features, machine learning techniques classify the different activities of the gaming sessions (tasks and events) to verify whether their design differentiation coincides with the affective one. The observed results, also supported by subjective questionnaire data, confirm the validity of the proposed guidelines, suggesting that this evaluation methodology could be extended to other evaluation tasks.

2017 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Relazione in Atti di Convegno]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform any early approach compared against public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which can extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amounts of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned from human ground-truth.

2017 - Embedded Recurrent Network for Head Pose Estimation in Car [Relazione in Atti di Convegno]
Borghi, Guido; Gasparini, Riccardo; Vezzani, Roberto; Cucchiara, Rita
An accurate and fast driver's head pose estimation is a rich source of information, in particular in the automotive context. Head pose is a key element for driver's behavior investigation, pose analysis, attention monitoring and also a useful component to improve the efficacy of Human-Car Interaction systems. In this paper, a Recurrent Neural Network is exploited to tackle the problem of driver head pose estimation, working directly and only on depth images to be more reliable in the presence of varying or insufficient illumination. Experimental results, obtained from two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP Database, prove the efficacy of the proposed method, which outperforms state-of-the-art works. Besides, the entire system is implemented and tested on two embedded boards with real time performance.

2017 - Fast and Accurate Facial Landmark Localization in Depth Images for In-car Applications [Relazione in Atti di Convegno]
Frigieri, Elia; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
Correct and reliable localization of facial landmarks enables several applications in many fields, ranging from Human Computer Interaction to video surveillance. For instance, it can provide a valuable input to monitor the driver's physical state and attention level in the automotive context. In this paper, we tackle the problem of facial landmark localization through a deep approach. The developed system runs in real time and, in particular, is more reliable than state-of-the-art competitors, especially in the presence of light changes and poor illumination, thanks to the use of depth images as input. We also collected and shared a new realistic dataset inside a car, called MotorMark, to train and test the system. In addition, we exploited the public Eurecom Kinect Face Dataset for the evaluation phase, achieving promising results both in terms of accuracy and computational speed.

2017 - From Groups to Leaders and Back [Capitolo/Saggio]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Recently, social theories and empirical observations identified small groups and leaders as the basic elements which shape a crowd. This leads to an intermediate level of abstraction that is placed between the crowd as a flow of people, and the crowd as a collection of individuals. Consequently, automatic analysis of crowds in computer vision is also experiencing a shift in focus from individuals to groups and from small groups to their leaders. In this chapter, we present state-of-the-art solutions to the groups and leaders detection problem, which are able to account for physical factors as well as for sociological evidence observed over short time windows. The presented algorithms are framed as structured learning problems over the set of individual trajectories. However, the way trajectories are exploited to predict the structure of the crowd is not fixed but rather learned from recorded and annotated data, enabling the method to adapt these concepts to different scenarios, densities, cultures, and other unobservable complexities. Additionally, we investigate the relation between leaders and their groups and propose the first attempt to exploit leadership as prior knowledge for group detection.

2017 - Generative Adversarial Models for People Attribute Recognition in Surveillance [Relazione in Atti di Convegno]
Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
In this paper we propose a deep architecture for detecting people attributes (e.g. gender, race, clothing ...) in surveillance contexts. Our proposal explicitly deals with poor resolution and occlusion issues that often occur in surveillance footage by enhancing the images by means of Deep Convolutional Generative Adversarial Networks (DCGAN). Experiments show that by combining both our Generative Reconstruction and Deep Attribute Classification Network we can effectively extract attributes even when resolution is poor and in the presence of strong occlusions up to 80% of the whole person figure.

2017 - Head Detection with Depth Images in the Wild [Relazione in Atti di Convegno]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
Head detection and localization is a demanding task and a key element for many computer vision applications, like video surveillance, Human Computer Interaction and face analysis. The stunning amount of work done for detecting faces on RGB images, together with the availability of huge face datasets, has made it possible to set up very effective systems in that domain. However, due to illumination issues, infrared or depth cameras may be required in real applications. In this paper, we introduce a novel method for head detection on depth images that exploits the classification ability of deep learning approaches. In addition to reducing the dependency on the external illumination, depth images implicitly embed useful information to deal with the scale of the target objects. Two public datasets have been exploited: the first one, called Pandora, is used to train a deep binary classifier with face and non-face images. The second one, collected by Cornell University, is used to perform a cross-dataset test during daily activities in unconstrained environments. Experimental results show that the proposed method outperforms state-of-the-art methods working on depth images.

2017 - Hierarchical Boundary-Aware Neural Encoder for Video Captioning [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell, which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state of the art results on movie description datasets.
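To make the idea of modifying temporal connections at discontinuity points more concrete, here is a deliberately simplified sketch. It is not the LSTM cell proposed in the paper: it uses a scalar tanh recurrence, the boundary flags are supplied by the caller instead of being learned, and the weights `w_x` and `w_h` are arbitrary illustrative values. It only shows how resetting the recurrent state at a boundary yields one code per segment.

```python
import math

def encode_with_boundaries(frames, boundaries, w_x=0.5, w_h=0.5):
    """Encode a sequence of scalar frame features into one code per segment.

    boundaries[t] is True when a discontinuity is assumed just before frame t.
    At a boundary the current state is emitted and the recurrence restarts from zero.
    """
    codes = []
    h = 0.0
    started = False
    for x, is_boundary in zip(frames, boundaries):
        if is_boundary and started:
            codes.append(h)  # close the previous segment
            h = 0.0          # cut the temporal connection
        h = math.tanh(w_x * x + w_h * h)
        started = True
    if started:
        codes.append(h)      # close the last segment
    return codes

# Four frames with a boundary before the third one give two segment codes.
segment_codes = encode_with_boundaries([1.0, 1.0, 1.0, 1.0], [False, False, True, False])
```

The per-segment codes could then be fed to a second recurrent layer to form a hierarchical encoding, which is the general structure the abstract describes.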

2017 - Layout analysis and content classification in digitized books [Relazione in Atti di Convegno]
Corbelli, Andrea; Baraldi, Lorenzo; Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
Automatic layout analysis has proven to be extremely important in the process of digitization of large amounts of documents. In this paper we present a mixed approach to layout analysis, introducing an SVM-aided layout segmentation process and a classification process based on local and geometrical features. The final output of the automatic analysis algorithm is a complete and structured annotation in JSON format, containing the digitized text as well as all the references to the illustrations of the input page, and which can be used by visualization interfaces as well as annotation interfaces. We evaluate our algorithm on a large dataset built upon the first volume of the “Enciclopedia Treccani”.

2017 - Learning Where to Attend Like a Human Driver [Relazione in Atti di Convegno]
Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
Despite the advent of autonomous cars, it's likely - at least in the near future - that human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task. In this paper we study the dynamics of the driver's gaze and use it as a proxy to understand related attentional mechanisms. First, we build our analysis upon two questions: where and what the driver is looking at? Second, we model the driver's gaze by training a coarse-to-fine convolutional network on short sequences extracted from the DR(eye)VE dataset. Experimental comparison against different baselines reveals that the driver's gaze can indeed be learnt to some extent, despite i) being highly subjective and ii) having only one driver's gaze available for each sequence due to the irreproducibility of the scene. Finally, we advocate for a new assisted driving paradigm which suggests to the driver, with no intervention, where she should focus her attention.

2017 - Learning to Map Vehicles into Bird's Eye View [Relazione in Atti di Convegno]
Palazzi, Andrea; Borghi, Guido; Abati, Davide; Calderara, Simone; Cucchiara, Rita
Awareness of the road scene is an essential component for both autonomous vehicles and Advanced Driver Assistance Systems and is gaining importance both for academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M pairs of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that it is able to generalize to real-world data despite having been trained solely on synthetic data.

2017 - Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild [Relazione in Atti di Convegno]
Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities in one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data.

2017 - NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
In recent years video has been swamping the Internet: websites, social networks, and business multimedia systems are adopting video as the most important form of communication and information. Videos are normally accessed as a whole and are not indexed by their visual content. Thus, they are often uploaded as short, manually cut clips with user-provided annotations, keywords and tags for retrieval. In this paper, we propose a prototype multimedia system which addresses these two limitations: it overcomes the need for human intervention in the video setting, thanks to fully deep learning-based solutions, and decomposes the storytelling structure of the video into coherent parts. These parts can be shots, key-frames, scenes and semantically related stories, and are exploited to provide an automatic annotation of the visual content, so that parts of video can be easily retrieved. This also allows a principled re-use of the video itself: users of the platform can indeed produce new storytelling by means of multi-modal presentations, add text and other media, and propose a different visual organization of the content. We present the overall solution, and some experiments on the re-use capability of our platform in edutainment by conducting an extensive user evaluation.

2017 - POSEidon: Face-from-Depth for Driver Pose Estimation [Relazione in Atti di Convegno]
Borghi, Guido; Venturelli, Marco; Vezzani, Roberto; Cucchiara, Rita
Fast and accurate upper-body and head pose estimation is a key task for automatic monitoring of driver attention, a challenging context characterized by severe illumination changes, occlusions and extreme poses. In this work, we present a new deep learning framework for head localization and pose estimation on depth images. The core of the proposal is a regression neural network, called POSEidon, which is composed of three independent convolutional nets followed by a fusion layer, specially conceived for understanding the pose by depth. In addition, to recover the intrinsic value of face appearance for understanding head position and orientation, we propose a new Face-from-Depth approach for learning image faces from depth. Results in face reconstruction are qualitatively impressive. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Results show that our method outperforms all recent state-of-the-art works, running in real time at more than 30 frames per second.

2017 - Segmentation models diversity for object proposals [Articolo su rivista]
Manfredi, Marco; Grana, Costantino; Cucchiara, Rita; Smeulders, Arnold W. M.
In this paper we present a segmentation proposal method which employs a box-hypotheses generation step followed by a lightweight segmentation strategy. Inspired by interactive segmentation, for each automatically placed bounding-box we compute a precise segmentation mask. We introduce diversity in segmentation strategies enhancing a generic model performance exploiting class-independent regional appearance features. Foreground probability scores are learned from groups of objects with peculiar characteristics to specialize segmentation models. We demonstrate results comparable to the state-of-the-art on PASCAL VOC 2012 and a further improvement by merging our proposals with those of a recent solution. The ability to generalize to unseen object categories is demonstrated on Microsoft COCO 2014.

2017 - Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach [Relazione in Atti di Convegno]
Pini, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: firstly, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Moreover, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.

2017 - Tracking social groups within and across cameras [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita
We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the problem as a supervised clustering problem where a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set on the problem it comes with the proposal of a suitable evaluation measure. Results of adopting learning for the task are encouraging, scoring a +15% improvement in F1 measure over a non-learning based clustering baseline. To our knowledge this is the first proposal of this kind dealing with multi-camera group tracking.
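For readers unfamiliar with the F1 measure in which the improvement above is reported, the following sketch computes it from true positive, false positive and false negative counts using the standard definition (harmonic mean of precision and recall). How those counts are obtained for group tracking, and the dedicated evaluation measure proposed with the data set, are not described in the abstract and are not reproduced here; the counts in the example are invented.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Standard F1: harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 correct associations, 2 spurious, 2 missed.
score = f1_score(8, 2, 2)  # precision = recall = 0.8, so F1 = 0.8
```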

2017 - Visual Saliency for Image Captioning in New Multimedia Services [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives makes this task still far from being solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, some forms of unsupervised machine attention mechanisms have been spreading, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to provide captions in images, which can be exploited for many services on cloud and on multimedia data. Experimental evaluations are conducted on the SALICON dataset, which provides groundtruths for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used for image captioning.

2016 - A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita
This paper presents a novel video access and retrieval system for edited videos. The key element of the proposal is that videos are automatically decomposed into semantically coherent parts (called scenes) to provide a more manageable unit for browsing, tagging and searching. The system features an automatic annotation pipeline, with which videos are tagged by exploiting both the transcript and the video itself. Scenes can also be retrieved with textual queries; the best thumbnail for a query is selected according to both semantic and aesthetic criteria.

2016 - A Deep Multi-Level Network for Saliency Prediction [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
This paper presents a novel deep architecture for saliency prediction. Current state of the art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network that weights low and high level feature maps, and a prior learning network. We compare our solution with state of the art saliency models on two public benchmark datasets. Results show that our model outperforms them under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.

2016 - A location-aware architecture for an IoT-based smart museum [Articolo su rivista]
Del Fiore, Giuseppe; Mainetti, Luca; Mighali, Vincenzo; Patrono, Luigi; Alletto, Stefano; Cucchiara, Rita; Serra, Giuseppe
The Internet of Things, whose main goal is to automatically predict users' desires, can find very interesting opportunities in the art and culture field, as tourism is one of the main driving engines of modern society. Currently, the innovation process in this field is growing at a slower pace, so cultural heritage is a prerogative of a restricted category of users. To address this issue, a significant technological improvement is necessary in culture-dedicated locations, which do not usually allow the installation of hardware infrastructures. In this paper, we design and validate a non-invasive indoor location-aware architecture able to enhance the user experience in a museum. The system relies on the user's smartphone and a wearable device (with image recognition and localization capabilities) to automatically deliver personalized cultural contents related to the observed artworks. The proposal was validated in the MUST museum in Lecce (Italy).

2016 - An Indoor Location-aware System for an IoT-based Smart Museum [Articolo su rivista]
Alletto, Stefano; Cucchiara, Rita; Del Fiore, Giuseppe; Mainetti, Luca; Mighali, Vincenzo; Patrono, Luigi; Serra, Giuseppe
The new technologies characterizing the Internet of Things (IoT) make it possible to build real smart environments able to provide advanced services to the users. Recently, these smart environments have also been exploited to renew users' interest in cultural heritage, by guaranteeing real interactive cultural experiences. In this paper, we design and validate an indoor location-aware architecture able to enhance the user experience in a museum. In particular, the proposed system relies on a wearable device that combines image recognition and localization capabilities to automatically provide the users with cultural contents related to the observed artworks. The localization information is obtained by a Bluetooth low energy (BLE) infrastructure installed in the museum. Moreover, the system interacts with the Cloud to store multimedia contents produced by the user and to share environment-generated events on his/her social networks. Finally, several location-aware services, running in the system, control the environment status also according to users' movements. These services interact with physical devices through a multiprotocol middleware. The system has been designed to be easily extensible to other IoT technologies and its effectiveness has been evaluated in the MUST museum, Lecce, Italy.
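One simple way a BLE infrastructure can yield coarse indoor localization is to assign the wearer to the room whose beacon is received with the strongest average signal. The abstract does not say which localization scheme the system uses, so the sketch below is only an assumed illustration: the room names, the RSSI values (in dBm, where values closer to zero indicate a stronger signal) and the `locate` function are all hypothetical.

```python
def locate(readings):
    """Return the room whose beacon has the strongest average RSSI, or None.

    readings maps a room name to a list of recent RSSI samples in dBm.
    Rooms without samples are ignored.
    """
    averages = {
        room: sum(samples) / len(samples)
        for room, samples in readings.items()
        if samples
    }
    if not averages:
        return None
    return max(averages, key=averages.get)

# Hypothetical scan: the beacon in "room_a" is received much more strongly.
current_room = locate({"hall": [-80, -78], "room_a": [-60, -65], "room_b": []})
```

Averaging a short window of samples is a common way to damp RSSI fluctuations; a deployed system would likely need further filtering and calibration.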

2016 - Analysis and Re-use of Videos in Educational Digital Libraries with Automatic Scene Detection [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
The advent of modern approaches to education, like Massive Open Online Courses (MOOC), has made video the basic medium for educating and transmitting knowledge. However, IT tools are still not adequate to allow video content re-use, tagging, annotation and personalization. In this paper we analyze the problem of identifying coherent sequences, called scenes, in order to provide the users with a more manageable editing unit. A simple spectral clustering technique is proposed and compared with state-of-the-art results. We also discuss correct ways to evaluate the performance of automatic scene detection algorithms.

2016 - Body Part Based Re-identification from an Egocentric Perspective [Relazione in Atti di Convegno]
Fergnani, Federica; Alletto, Stefano; Serra, Giuseppe; De Mira, Joaquim; Cucchiara, Rita
With the spread of wearable cameras, many consumer applications ranging from social tagging to video summarization would greatly benefit from people re-identification methods capable of dealing with the egocentric perspective. In this regard, first-person camera views present such a unique setting that traditional re-identification methods result in poor performance when applied to this scenario. In this paper, we present a simple but effective solution that overcomes the limitations of traditional approaches by dividing people images into meaningful body parts. Furthermore, by taking into account human gaze information concerning where people look when trying to recognize a person, we devise a meaningful way to weight the contributions of different body parts. Experimental results validate the proposal on a novel egocentric re-identification dataset, the first of its kind, showing that the performance increase compared to the current state of the art on egocentric sequences is significant.

2016 - Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager [Relazione in Atti di Convegno]
Paci, Francesco; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita; Benini, Luca
With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to finding new efficient, stable and long-running solutions; however, devices are too power-hungry for truly always-on operation, and are aggressively duty-cycled to achieve acceptable lifetimes. In this paper we present a wearable system for context change detection based on an egocentric camera with ultra-low power consumption that can collect data 24/7. Although the resolution of the captured images is low, experimental results in real scenarios demonstrate how our approach, based on Siamese Neural Networks, can achieve visual context awareness. In particular, we compare our solution with hand-crafted features and with state-of-the-art techniques, and propose a novel and challenging dataset composed of roughly 30000 low-resolution images.

2016 - DR(eye)VE: a Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving [Relazione in Atti di Convegno]
Alletto, Stefano; Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Autonomous and assisted driving are undoubtedly hot topics in computer vision. However, the driving task is extremely complex and a deep understanding of drivers' behavior is still lacking. Several researchers are now investigating the attention mechanism in order to define computational models for detecting salient and interesting objects in the scene. Nevertheless, most of these models only refer to bottom-up visual saliency and are focused on still images. Instead, during the driving experience the temporal nature and peculiarity of the task influence the attention mechanisms, leading to the conclusion that real-life driving data is mandatory. In this paper we propose a novel and publicly available dataset acquired during actual driving. Our dataset, composed of more than 500,000 frames, contains drivers' gaze fixations and their temporal integration providing task-specific saliency maps. Geo-referenced locations, driving speed and course complete the set of released data. To the best of our knowledge, this is the first publicly available dataset of this kind and can foster new discussions on better understanding, exploiting and reproducing the driver's attention process in the autonomous and assisted cars of future generations.

2016 - Deep Head Pose Estimation from Depth Data for In-car Automotive Applications [Relazione in Atti di Convegno]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
Recently, deep learning approaches have achieved promising results in various fields of computer vision. In this paper, we tackle the problem of head pose estimation through a Convolutional Neural Network (CNN). Unlike other proposals in the literature, the described system works directly on raw depth data alone. Moreover, head pose estimation is solved as a regression problem and does not rely on visual facial features such as facial landmarks. We tested our system on a well-known public dataset, Biwi Kinect Head Pose, showing that our approach achieves state-of-the-art results and is able to meet real-time performance requirements.

2016 - Exploring Architectural Details Through a Wearable Egocentric Vision Device [Articolo su rivista]
Alletto, Stefano; Abati, Davide; Serra, Giuseppe; Cucchiara, Rita
Augmented user experiences in the cultural heritage domain are in increasing demand by the new digital native tourists of the 21st century. In this paper, we propose a novel solution that aims at assisting the visitor during an outdoor tour of a cultural site using the unique first-person perspective of wearable cameras. In particular, the approach exploits computer vision techniques to retrieve the details by proposing a robust descriptor based on the covariance of local features. Using a lightweight wearable board, the solution can localize the user with respect to the 3D point cloud of the historical landmark and provide him with information about the details he is currently looking at. Experimental results validate the method both in terms of accuracy and computational effort. Furthermore, user evaluation based on real-world experiments shows that the proposal is deemed effective in enriching a cultural experience.

2016 - Eyewear Computing – Augmenting the Human with Head-Mounted Wearable Assistants [Recensione in Volume]
Cucchiara, Rita; Bulling, Andreas; Kunze, Kai; Rehg, James
The seminar was composed of workshops and tutorials on head-mounted eye tracking, egocentric vision, optics, and head-mounted displays. The seminar welcomed 30 academic and industry researchers from Europe, the US, and Asia with a diverse background, including wearable and ubiquitous computing, computer vision, developmental psychology, optics, and human-computer interaction. In contrast to several previous Dagstuhl seminars, we used an ignite talk format to reduce the time of talks to one half-day and to leave the rest of the week for hands-on sessions, group work, general discussions, and socialising. The key results of this seminar are 1) the identification of key research challenges and summaries of breakout groups on multimodal eyewear computing, egocentric vision, security and privacy issues, skill augmentation and task guidance, eyewear computing for gaming, as well as prototyping of VR applications, 2) a list of datasets and research tools for eyewear computing, 3) three small-scale datasets recorded during the seminar, 4) an article in ACM Interactions entitled “Eyewear Computers for Human-Computer Interaction”, as well as 5) two follow-up workshops on “Egocentric Perception, Interaction, and Computing” at the European Conference on Computer Vision (ECCV) as well as “Eyewear Computing” at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp).

2016 - Fast gesture recognition with Multiple Stream Discrete HMMs on 3D Skeletons [Relazione in Atti di Convegno]
Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
HMMs are widely used in action and gesture recognition due to their implementation simplicity, low computational requirements, scalability and high parallelism. They perform well even with a limited training set. All these characteristics are hard to find together in other, even more accurate, methods. In this paper, we propose a novel double-stage classification approach, based on Multiple Stream Discrete Hidden Markov Models (MSD-HMM) and 3D skeleton joint data, able to reach high performance while maintaining all the advantages listed above. The approach makes it possible both to quickly classify pre-segmented gestures (offline classification) and to perform temporal segmentation on streams of gestures (online classification) faster than real time. We test our system on three public datasets, MSRAction3D, UTKinect-Action and MSRDailyAction, and on a new dataset, Kinteract Dataset, explicitly created for Human Computer Interaction (HCI). We obtain state-of-the-art performance on all of them.
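As an illustration of how a discrete HMM with several observation streams can score and classify a pre-segmented gesture, the following is a minimal sketch and not the authors' double-stage method. It assumes that the streams are conditionally independent given the hidden state, so the joint emission probability is the product of the per-stream emissions; all model names, parameters and symbol sequences are hypothetical toy values.

```python
import math

def forward_log_likelihood(pi, A, B_streams, obs_streams):
    """Scaled forward algorithm for a discrete HMM with several observation streams.

    pi          initial state probabilities (length N)
    A           N x N transition matrix, A[i][j] = P(next state j | state i)
    B_streams   one emission table per stream, B[i][k] = P(symbol k | state i)
    obs_streams one symbol sequence per stream, all of the same length
    """
    n_states = len(pi)
    length = len(obs_streams[0])

    def emission(state, t):
        # Assumption: streams are independent given the state, so probabilities multiply.
        p = 1.0
        for B, obs in zip(B_streams, obs_streams):
            p *= B[state][obs[t]]
        return p

    alpha = [pi[i] * emission(i, 0) for i in range(n_states)]
    log_lik = 0.0
    for t in range(length):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states)) * emission(j, t)
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_lik

def classify(models, obs_streams):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(*models[name], obs_streams))

# Hypothetical toy models: two hidden states, two streams, two symbols per stream.
models = {
    "wave": ([0.9, 0.1], [[0.7, 0.3], [0.3, 0.7]],
             [[[0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.2, 0.8]]]),
    "push": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]],
             [[[0.1, 0.9], [0.1, 0.9]], [[0.2, 0.8], [0.2, 0.8]]]),
}
```

Calling `classify(models, [[0, 0, 0], [0, 0, 0]])` selects the model under which the two symbol streams are jointly most likely. One model per gesture class keeps the classifier cheap, which is in line with the low computational requirements noted in the abstract.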

2016 - From Depth Data to Head Pose Estimation: a Siamese approach [Relazione in Atti di Convegno]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
The correct estimation of the head pose is a problem of great importance for many applications. For instance, it is an enabling technology in the automotive field for driver attention monitoring. In this paper, we tackle the pose estimation problem through a deep learning network working in a regression manner. Traditional methods usually rely on visual facial features, such as facial landmarks or nose tip position. In contrast, we exploit a Convolutional Neural Network (CNN) to perform head pose estimation directly from depth data. We exploit a Siamese architecture and we propose a novel loss function to improve the learning of the regression network layer. The system has been tested on two public datasets, Biwi Kinect Head Pose and ICT-3DHP database. The reported results demonstrate the improvement in accuracy with respect to current state-of-the-art approaches and the real-time capabilities of the overall framework.

2016 - Historical Document Digitization through Layout Analysis and Deep Content Classification [Relazione in Atti di Convegno]
Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the process of digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network merged in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.

2016 - Motion Segmentation using Visual and Bio-mechanical Features [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
Nowadays, egocentric wearable devices are becoming increasingly widespread among both the academic community and the general public. For this reason, methods capable of automatically segmenting the video based on the recorder's motion patterns are gaining attention. These devices present the unique opportunity of both high-quality video recordings and multimodal sensor readings. Significant efforts have been made in analyzing either the video stream recorded by these devices or the bio-mechanical sensor information. So far, the integration between these two realities has not been fully addressed, and the real capabilities of these devices are not yet exploited. In this paper, we present a solution to segment a video sequence into motion activities by introducing a novel data fusion technique based on the covariance of visual and bio-mechanical features. The experimental results are promising and show that the proposed integration strategy outperforms the results achieved focusing solely on a single source.

2016 - Multi-Level Net: a Visual Saliency Prediction Model [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
State-of-the-art approaches for saliency prediction are based on Fully Convolutional Networks, in which saliency maps are built using the last layer. In contrast, we here present a novel model that predicts saliency maps exploiting a non-linear combination of features coming from different layers of the network. We also present a new loss function to deal with the imbalance issue on saliency masks. Extensive results on three public datasets demonstrate the robustness of our solution. Our model outperforms the state of the art on SALICON, which is the largest and unconstrained dataset available, and obtains competitive results on MIT300 and CAT2000 benchmarks.

2016 - Optimizing image registration for interactive applications [Relazione in Atti di Convegno]
Gasparini, Riccardo; Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
With the spread of wearable and mobile devices, the demand for interactive augmented reality applications is in constant growth. Among the different possibilities, we focus on the cultural heritage domain, where a key step in the development of applications for augmented cultural experiences is to obtain a precise localization of the user, i.e. the 6 degree-of-freedom pose of the camera acquiring the images used by the application. Current state-of-the-art methods perform this task by extracting local descriptors from a query and exhaustively matching them to a sparse 3D model of the environment. While this procedure obtains good localization performance, due to the vast search space involved in the retrieval of 2D-3D correspondences it is often not feasible in real-time and interactive environments. In this paper we hence propose to perform descriptor quantization to reduce the search space and employ multiple KD-Trees combined with a principal component analysis dimensionality reduction to enable an efficient search. We experimentally show that our solution can halve the computational requirements of the correspondence search with regard to the state of the art while maintaining similar accuracy levels.
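To make the idea of reducing the search space through quantization concrete, the sketch below shows a generic coarse-quantization scheme. It is not the authors' pipeline and it omits the KD-Trees and the PCA step mentioned in the abstract. Model descriptors are assigned to their nearest centroid, and a query is compared only against the descriptors stored in its own cell. The centroids, the 2D "descriptors" and all function names are hypothetical toy choices.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, query):
    """Index of the point closest to the query (brute force)."""
    return min(range(len(points)), key=lambda i: sq_dist(points[i], query))

def build_buckets(descriptors, centroids):
    """Quantize each descriptor to its nearest centroid and group indices per cell."""
    buckets = {c: [] for c in range(len(centroids))}
    for idx, descriptor in enumerate(descriptors):
        buckets[nearest(centroids, descriptor)].append(idx)
    return buckets

def quantized_match(query, descriptors, centroids, buckets):
    """Approximate nearest neighbour: search only the cell the query falls into."""
    candidates = buckets[nearest(centroids, query)]
    if not candidates:
        return None
    return min(candidates, key=lambda i: sq_dist(descriptors[i], query))

# Hypothetical toy data: two cells, five 2D model descriptors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 11.0), (11.0, 9.0)]
buckets = build_buckets(descriptors, centroids)
match = quantized_match((9.5, 10.2), descriptors, centroids, buckets)
```

Here the query is compared against three descriptors instead of five. The saving grows with the number of cells, at the cost of occasionally missing the true nearest neighbour when it lies just across a cell boundary.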

2016 - Quick, accurate, smart: 3D computer vision technology helps assessing confined animals' behaviour [Articolo su rivista]
Barnard, Shanis; Calderara, Simone; Pistocchi, Simone; Cucchiara, Rita; Podaliri-Vulpiani, Michele; Messori, Stefano; Ferri, Nicola
Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment. Behavioural indicators are thus among the preferred parameters to assess welfare. However, behavioural recording (usually from video) can be very time consuming and the accuracy and reliability of the output rely on the experience and background of the observers. The outburst of new video technology and computer image processing gives the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can be later labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour in time or between individuals could be assessed. The software accuracy in correctly detecting the dogs' behaviour was checked through a validation process. An automatic behaviour recognition system, independent from human subjectivity, could add scientific knowledge on animals' quality of life in confinement as well as saving time and resources. This 3D framework was designed to be invariant to the dog's shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. 
The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.

2016 - Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks [Articolo su rivista]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: 1) semantic feature extraction which builds video specific concept detectors; 2) multimodal feature embedding learning, that maps the feature vector of a shot to a space in which the Euclidean distance has task specific semantic properties. The proposed method is able to decompose the video in annotated temporal segments which allow for a query specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted and a strategy to overcome the problem is suggested.

2016 - Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and represent them with thumbnails which are at the same time semantically meaningful and aesthetically remarkable. Videos are first segmented into coherent and story-telling scenes, then a retrieval algorithm based on deep learning is proposed to retrieve the most significant scenes for a textual query. A ranking strategy based on deep features is finally used to tackle the problem of visualizing the best thumbnail. Qualitative and quantitative experiments are conducted on a collection of edited videos to demonstrate the effectiveness of our approach.

2016 - Shot, scene and keyframe ordering for interactive video re-use [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
This paper presents a complete system for shot and scene detection in broadcast videos, as well as a method to select the best representative key-frames, which could be used in new interactive interfaces for accessing large collections of edited videos. The final goal is to enable an improved access to video footage and the re-use of video content with the direct management of user-selected video-clips.

2016 - Socially Constrained Structural Learning for Groups Detection in Crowd [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on both ground truth trajectories and tracklets previously extracted by available detector/tracker systems.
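Dynamic time warping (DTW) is one of the cues the abstract lists for comparing crowd members. The sketch below is the textbook DTW recurrence applied to two 2D trajectories, given only as an illustration. How the authors turn DTW into a feature for the Structural SVM is not described here, and the function name and sample trajectories are hypothetical.

```python
import math

def dtw_distance(traj_a, traj_b):
    """Classic dynamic time warping cost between two sequences of 2D points."""
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i points of A with the first j of B
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in A only
                                    cost[i][j - 1],      # advance in B only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# Hypothetical trajectories: same path walked at different speeds, and a parallel path.
walker = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slow_walker = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
parallel = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
```

`dtw_distance(walker, slow_walker)` is zero because the warping absorbs the pause, whereas `dtw_distance(walker, parallel)` accumulates the constant one-unit offset at every step. This tolerance to differences in timing is what makes DTW a natural choice for comparing people's trajectories.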

2016 - Spotting prejudice with nonverbal behaviours [Relazione in Atti di Convegno]
Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Vezzali, Loris; Di Bernardo, Gian Antonio; Zambonelli, Franco; Cucchiara, Rita
Although prejudice cannot be directly observed, nonverbal behaviours provide profound hints on people's inclinations. In this paper, we use recent sensing technologies and machine learning techniques to automatically infer the results of psychological questionnaires frequently used to assess implicit prejudice. In particular, we recorded 32 students discussing with both white and black collaborators. Then, we identified a set of features allowing automatic extraction and measured their degree of correlation with psychological scores. Results confirmed that automated analysis of nonverbal behaviour is actually possible, thus paving the way for innovative clinical tools and eventually more secure societies.

2016 - Transductive People Tracking in Unconstrained Surveillance [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
Long-term tracking of people in unconstrained scenarios is still an open problem due to the absence of constant elements in the problem setting. The camera, when active, may move, and both the background and the target appearance may change abruptly, leading to the inadequacy of most standard tracking techniques. We propose to exploit a learning approach that considers the tracking task as a semi-supervised learning (SSL) problem. Given a few target samples, the aim is to search for the target occurrences in the video stream, re-interpreting the problem as label propagation on a similarity graph. We propose a solution based on graph transduction that works iteratively frame by frame. Additionally, in order to avoid drifting, we introduce an update strategy based on an evolutionary clustering technique that chooses the visual templates that better describe target appearance, evolving the model during the processing of the video. Since we model people appearance by means of covariance matrices on color and gradient information, our framework is directly related to structure learning on Riemannian manifolds. Tests on publicly available datasets and comparisons with state-of-the-art techniques allow us to conclude that our solution exhibits interesting performance in terms of tracking precision and recall in most of the considered scenarios.
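The notion of label propagation on a similarity graph can be illustrated with a generic iterative scheme. This is a sketch under stated assumptions and not the authors' transductive tracker: a few seed nodes carry known scores (+1 for target, -1 for background), and every node repeatedly takes a weighted average of its neighbours' scores while being pulled back to its seed value. The row-normalised update, the `alpha` value, the function name and the toy affinity matrix are all assumptions made for the example.

```python
def propagate_labels(W, seeds, alpha=0.8, iterations=50):
    """Spread seed scores over a graph with symmetric non-negative affinities W.

    seeds maps node index -> known score; unlabeled nodes start at 0.
    """
    n = len(W)
    # Row-normalise affinities so each node averages its neighbours' scores.
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total if total > 0 else 0.0 for w in row])
    Y = [seeds.get(i, 0.0) for i in range(n)]
    F = Y[:]
    for _ in range(iterations):
        F = [
            alpha * sum(P[i][j] * F[j] for j in range(n)) + (1 - alpha) * Y[i]
            for i in range(n)
        ]
    return F

# Hypothetical toy graph: nodes 0-1 look alike, nodes 2-3 look alike, weak cross links.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
scores = propagate_labels(W, seeds={0: 1.0, 3: -1.0})
```

After propagation, the unlabeled node 1 receives a positive score because it is strongly tied to the target seed, while node 2 receives a negative one. In a tracking setting the nodes would be candidate image patches and the affinities would come from an appearance similarity measure.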

2016 - Video registration in egocentric vision under day and night illumination changes [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
With the spread of wearable devices and head-mounted cameras, a wide range of applications requiring precise user localization are now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on a matching process of local keypoints extracted on the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the actual quality of this 2D-3D matching, and can degrade if environmental conditions change, for example under the steep changes in lighting between day and night. To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated using unconstrained egocentric video sequences both in terms of matching quality and resulting registration performance using different 3D models of historical landmarks. The results show that the proposed method can outperform state-of-the-art registration algorithms, in particular when dealing with the challenges of night and day sequences.

2015 - A Deep Siamese Network for Scene Detection in Broadcast Videos [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and expected results, and propose and release a new benchmark dataset.

2015 - A General-Purpose Sensing Floor Architecture for Human-Environment Interaction [Articolo su rivista]
Vezzani, Roberto; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Cucchiara, Rita
Smart environments are now designed as natural interfaces to capture and understand human behavior without a need for explicit human-computer interaction. In this paper, we present a general-purpose architecture that acquires and understands human behaviors through a sensing floor. The pressure field generated by moving people is captured and analyzed. Specific actions and events are then detected by a low-level processing engine and sent to high-level interfaces providing different functions. The proposed architecture and sensors are modular, general-purpose, cheap, and suitable for both small- and large-area coverage. Some sample entertainment and virtual reality applications that we developed to test the platform are presented.

2015 - Active query process for digital video surveillance forensic applications [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
Multimedia forensics is an emerging discipline concerning the analysis and exploitation of digital data as support for investigation to extract probative elements. Among them, visual data about people and people's activities, extracted from videos in an efficient way, are becoming more appealing for forensics day by day, due to the availability of large amounts of video-surveillance footage. Thus, many research studies and prototypes investigate the analysis of soft biometrics data, such as people's appearance and trajectories. In this work, we propose new solutions for querying and retrieving visual data in an interactive and active fashion for soft biometrics in forensics. The innovative proposal joins the capability of transductive learning for semi-supervised search by similarity and a typical multimedia methodology based on user-guided relevance feedback to allow an active interaction with the visual data of people, appearance and trajectory in large surveillance areas. The proposed approaches are very general and can be exploited independently of the surveillance setting and the type of video analytic tools.

2015 - Automatic configuration and calibration of modular sensing floors [Relazione in Atti di Convegno]
Vezzani, Roberto; Lombardi, Martino; Cucchiara, Rita
Sensing floors are becoming an emerging solution for many privacy-compliant and large-area surveillance systems. Many research and even commercial technologies have been proposed in recent years. Similarly to distributed camera networks, the problem of calibration is crucial, especially when the floors are installed in wide areas. This paper addresses the general problem of automatic calibration and configuration of modular and scalable sensing floors. Working on training data only, the system automatically finds the spatial placement of each sensor module and estimates threshold parameters needed for people detection. Tests on several training sequences captured with a commercial sensing floor are provided to validate the method.

2015 - Classification of Affective Data to Evaluate the Level Design in a Role-Playing Videogame [Relazione in Atti di Convegno]
Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
This paper presents a novel approach to evaluate game level design strategies, applied to role-playing games. Following a set of well-defined guidelines, two game levels were designed for Neverwinter Nights 2 to manipulate particular emotions like boredom or flow, and tested by 13 subjects wearing a brain computer interface helmet. A set of features was extracted from the affective data logs and used to classify different parts of the gaming sessions, to verify the correspondence between the original level aims and the actual effects on people's emotions. The very interesting correlations observed suggest that the technique is extensible to other similar evaluation tasks.

2015 - Detection of Human Movements with Pressure Floor Sensors [Relazione in Atti di Convegno]
Lombardi, Martino; Vezzani, Roberto; Cucchiara, Rita
Following the recent Internet of Everything (IoE) trend, several general-purpose devices have been proposed to acquire as much information as possible from the environment and from people interacting with it. Among these, sensing floors have recently attracted the interest of the research community. In this paper, we propose a new model to store and process floor data. The model does not assume a regular grid distribution of the sensing elements and is based on the ground reaction force (GRF) concept, widely used in biomechanics. It allows the correct detection and tracking of people, outperforming the common background subtraction schema adopted in the past. Several tests on a real sensing floor prototype are reported and discussed.

2015 - Egocentric Object Tracking: An Odometry-Based Solution [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
Tracking objects moving around a person is one of the key steps in human visual augmentation: we could estimate their locations when they are out of our field of view, or know their position, distance or velocity, just to name a few possibilities. This is no easy task: in this paper, we show how current state-of-the-art visual tracking algorithms fail if challenged with a first-person sequence recorded from a wearable camera attached to a moving user. We propose an evaluation that highlights these algorithms' limitations and, accordingly, develop a novel approach based on visual odometry and 3D localization that overcomes many issues typical of egocentric vision. We implement our algorithm on a wearable board and evaluate its robustness, showing in our preliminary experiments an increase in tracking performance of nearly 20% compared to current state-of-the-art techniques.

2015 - Egocentric Video Summarization of Cultural Tour based on User Preferences [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
In this paper, we propose a new method to obtain customized video summarization according to specific user preferences. Our approach is tailored to the Cultural Heritage scenario and is designed to identify candidate shots, select from the original streams only the scenes with behavior patterns related to the presence of relevant experiences, and further filter them in order to obtain a summary matching the user's requested preferences. Our preliminary results show that the proposed approach is able to leverage the user's preferences in order to obtain a customized summary, so that different users may extract different summaries from the same stream.

2015 - Egocentric video personalization in cultural experiences scenarios [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
In this paper we propose a novel approach for egocentric video personalization in a cultural experience scenario, based on automatic shot labelling according to different semantic dimensions, such as web-leveraged knowledge of the surrounding cultural Points Of Interest, information about stops and moves (both relying on geolocalization), and the camera wearer's behaviour. Moreover, we present a video personalization web system based on multi-dimensional semantic classification of shots, designed to help the visitor browse and retrieve relevant information to obtain a customized video. Experimental results show that the proposed techniques for video analysis achieve good performance in unconstrained scenarios, and user evaluation tests confirm that our solution is useful and effective.

2015 - GOLD: Gaussians of Local Descriptors for Image Representation [Articolo su rivista]
Serra, Giuseppe; Grana, Costantino; Manfredi, Marco; Cucchiara, Rita
The Bag of Words paradigm has been the baseline from which several successful image classification solutions were developed in the last decade. These represent images by quantizing local descriptors and summarizing their distribution. The quantization step introduces a dependency on the dataset which, even if it significantly boosts performance in some contexts, severely limits generalization capabilities. In contrast, in this paper we propose to model the local feature distribution with a multivariate Gaussian, without any quantization. The full-rank covariance matrix, which lies on a Riemannian manifold, is projected on the tangent Euclidean space and concatenated to the mean vector. The resulting representation, a Gaussian of local descriptors (GOLD), allows the dot product to be used to closely approximate a distance between distributions without the need for expensive kernel computations. We describe an image by an improved spatial pyramid, which avoids boundary effects with soft assignment: local descriptors contribute to neighboring Gaussians, forming a weighted spatial pyramid of GOLD descriptors. In addition, we extend the model by leveraging dataset characteristics in a mixture of Gaussians formulation, further improving the classification accuracy. To deal with large-scale datasets and high-dimensional feature spaces, the Stochastic Gradient Descent solver is adopted. Experimental results on several publicly available datasets show that the proposed method obtains state-of-the-art performance.

2015 - Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences [Articolo su rivista]
Baraldi, Lorenzo; Paci, Francesco; Serra, Giuseppe; Cucchiara, Rita
We introduce a novel approach to cultural heritage experience: by means of ego-vision embedded devices we develop a system which offers a more natural and entertaining way of accessing museum knowledge. Our method is based on distributed self-gesture and artwork recognition, and needs neither fixed cameras nor radio-frequency identification sensors. We propose the use of dense trajectories sampled around the hand region to perform self-gesture recognition, understanding the way a user naturally interacts with an artwork, and demonstrate that our approach can benefit from distributed training. We test our algorithms on publicly available data sets and we extend our experiments to both virtual and real museum scenarios, where our method shows robustness when challenged with real-world data. Furthermore, we run an extensive performance analysis on our ARM-based wearable device.

2015 - Innovative IoT-aware Services for a Smart Museum [Relazione in Atti di Convegno]
Mighali, Vincenzo; Del Fiore, Giuseppe; Patrono, Luigi; Mainetti, Luca; Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide users with added-value services through low-power and low-cost smart objects is very attractive in many fields. Among these, art and culture represent very interesting examples, as tourism is one of the main driving engines of modern society. In this paper, we propose an IoT-aware architecture to improve the cultural experience of the user, by involving the most important recent innovations in the ICT field. The main components of the proposed architecture are: (i) an indoor localization service based on the Bluetooth Low Energy technology, (ii) a wearable device able to capture and process images related to the user's point of view, (iii) the user's mobile device, used to display customized cultural contents and to share multimedia data in the Cloud, and (iv) a processing center that manages the core of the whole business logic. In particular, it interacts with both wearable and mobile devices, and communicates with the outside world to retrieve contents from the Cloud and to provide services also to external users. The proposal is currently under development and it will be validated in the MUST museum in Lecce.

2015 - Learning to Divide and Conquer for Online Multi-Target Tracking [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are first extracted independently in each frame and then object trajectories are built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in the presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem into local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlight a significant improvement in tracking performance (MOTA +10%) over the state of the art.

2015 - Learning to identify leaders in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Leader identification is a crucial task in social analysis, crowd management and emergency planning. In this paper, we investigate a computational model for the identification of leaders in crowded scenes. We deal with the lack of a formal definition of leadership by learning, in a supervised fashion, a metric space based exclusively on people's spatiotemporal information. Based on Tarde's work on crowd psychology, individuals are modeled as nodes of a directed graph and leaders inherit their relevance from other members' references. We note this is analogous to the way websites are ranked by the PageRank algorithm. During experiments, we observed different feature weights depending on the specific type of crowd, highlighting the impossibility of providing a unique interpretation of leadership. To our knowledge, this is the first attempt to study leader identification as a metric learning problem.
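
The PageRank analogy mentioned in the abstract above can be sketched as follows. This is only an illustration of ranking nodes in a directed "references" graph by power iteration; the adjacency matrix, the damping value and the function name `pagerank` are assumptions, and the paper's learned metric is not reproduced here.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power iteration on a directed graph; adj[i][j] = 1 means node i references node j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without outgoing edges spread their weight uniformly.
    P = np.where(out > 0, A / np.where(out > 0, out, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1.0 - damping) / n + damping * (P.T @ r)
    return r

# Hypothetical crowd of three people: persons 0 and 1 both follow person 2.
follows = [[0, 0, 1],
           [0, 0, 1],
           [0, 0, 0]]
scores = pagerank(follows)
leader = int(np.argmax(scores))
```

In this toy graph the person referenced by the others receives the highest score, which mirrors how relevance is inherited through references.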

2015 - Measuring scene detection performance [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
In this paper we evaluate the performance of scene detection techniques, starting from the classic precision/recall approach, moving to the better designed coverage/overflow measures, and finally proposing an improved metric that addresses frequently observed cases in which the numeric interpretation differs from the expected results. Numerical evaluation is performed on two recent proposals for automatic scene detection, which are compared with a simple but effective novel approach. Experiments are conducted to show how different measures may lead to different interpretations.

2015 - Personalized Egocentric Video Summarization for Cultural Experience [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
Recent egocentric video summarization approaches have dealt with motion analysis and social interaction without considering that the user may be interested in preserving only the part of the video related to his interests. In this paper we propose a new method for personalized video summarization of cultural experiences, with the goal of extracting from the streams only the scenes corresponding to a user's specific topic request, chosen among the shots in which it is possible to deduce that the visitor was focusing on a point of interest. Preliminary experiments show that our approach is promising and allows the visitor to better customize the summary of his experience.

2015 - Scene segmentation using temporal clustering for accessing and re-using broadcast video [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
Scene detection is a fundamental tool for allowing effective video browsing and re-using. In this paper we present a model that automatically divides videos into coherent scenes, which is based on a novel combination of local image descriptors and temporal clustering techniques. Experiments are performed to demonstrate the effectiveness of our approach, by comparing our algorithm against two recent proposals for automatic scene segmentation. We also propose improved performance measures that aim to reduce the gap between numerical evaluation and expected results.

2015 - Shot and Scene Detection via Hierarchical Clustering for Re-using Broadcast Video [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
Video decomposition techniques are fundamental tools for allowing effective video browsing and re-using. In this work, we consider the problem of segmenting broadcast videos into coherent scenes, and propose a scene detection algorithm based on hierarchical clustering, along with a very fast state-of-the-art shot segmentation approach. Experiments are performed to demonstrate the effectiveness of our algorithms, by comparing against recent proposals for automatic shot and scene segmentation.

2015 - Towards the evaluation of reproducible robustness in tracking-by-detection [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Conventional experiments on MTT are built upon the belief that feeding the same fixed detections to different trackers is sufficient to obtain a fair comparison. In this work we argue that the true behavior of a tracker is exposed when it is evaluated by varying the input detections rather than by fixing them. We propose a systematic and reproducible protocol and a MATLAB toolbox for generating synthetic data starting from ground truth detections, a proper set of metrics to understand and compare the trackers' peculiarities, and the respective visualization solutions.

2015 - Understanding social relationships in egocentric vision [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
Understanding mutual interaction between people is a key component for recognizing people's social behavior, but it strongly relies on a personal point of view, which makes it difficult to model a priori. We propose the adoption of the unique first-person perspective of head-mounted cameras (ego-vision) to promptly detect people's interactions in different social contexts. The proposal relies on a complete and reliable system that extracts people's head pose by combining landmarks and shape descriptors in a temporally smoothed HMM framework. Finally, interactions are detected through supervised clustering on mutual head orientation and people distances, exploiting a structural learning framework that specifically adjusts the clustering measure according to a peculiar scenario. Our solution provides the flexibility to capture the interactions regardless of the number of individuals involved and their level of acquaintance, in contexts with a variable degree of social involvement. The proposed system shows competitive performance on both publicly available ego-vision datasets and ad hoc benchmarks built with real-life situations.

2015 - Wearable Vision for Retrieving Architectural Details in Augmented Tourist Experiences [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
The interest in cultural cities is in constant growth, and so is the demand for new multimedia tools and applications that enrich their enjoyment. In this paper we propose an egocentric vision system to enhance tourists' cultural heritage experience. Exploiting a wearable board and a glass-mounted camera, the visitor can retrieve architectural details of the historical building he is observing and receive related multimedia contents. To obtain an effective retrieval procedure we propose a visual descriptor based on the covariance of local features. Unlike common Bag of Words approaches, our feature vector does not rely on a generated visual vocabulary, removing the dependence on a specific dataset and reducing the computational cost. 3D modeling is used to achieve a precise localization of the visitor that allows browsing visible relevant details that the user may otherwise miss. Experimental results conducted on a publicly available cultural heritage dataset show that the proposed feature descriptor outperforms Bag of Words techniques.

2014 - 3D Hough transform for sphere recognition on point clouds [Articolo su rivista]
Camurri, Marco; Vezzani, Roberto; Cucchiara, Rita
Three-dimensional object recognition on range data and 3D point clouds is becoming more important nowadays. Since many real objects have a shape that could be approximated by simple primitives, robust pattern recognition can be used to search for primitive models. For example, the Hough transform is a well-known technique which is largely adopted in 2D image space. In this paper, we systematically analyze different probabilistic/randomized Hough transform algorithms for spherical object detection in dense point clouds. In particular, we study and compare four variants which are characterized by the number of points drawn together for surface computation into the parametric space and we formally discuss their models. We also propose a new method that combines the advantages of both single-point and multi-point approaches for a faster and more accurate detection. The methods are tested on synthetic and real datasets.
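
As an illustration of the multi-point variants mentioned in the abstract above, where several points are drawn together to compute a surface in parameter space, the sketch below recovers the sphere passing through four non-coplanar points by solving a linear system. It is a generic geometric construction, not the algorithm proposed in the paper, and the name `sphere_from_four_points` is hypothetical.

```python
import numpy as np

def sphere_from_four_points(points):
    """Return (center, radius) of the sphere through four non-coplanar 3D points.

    From |p - c|^2 = r^2 we get the linear equation 2 p.c + k = |p|^2,
    with k = r^2 - |c|^2, one equation per point.
    """
    P = np.asarray(points, dtype=float)
    A = np.hstack([2.0 * P, np.ones((4, 1))])
    b = (P ** 2).sum(axis=1)
    sol = np.linalg.solve(A, b)   # raises LinAlgError if the points are coplanar
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius

# Four points on a sphere with center (1, 2, 3) and radius 2.
sample = [(3, 2, 3), (1, 4, 3), (1, 2, 5), (-1, 2, 3)]
center, radius = sphere_from_four_points(sample)
```

In a randomized Hough scheme, each randomly drawn quadruple of points would cast one vote for the resulting (center, radius) cell, whereas single-point variants vote for every parameter combination compatible with one point.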

2014 - A complete system for garment segmentation and color classification [Articolo su rivista]
M. Manfredi; C. Grana; S. Calderara; R. Cucchiara
In this paper, we propose a general approach for automatic segmentation, color-based retrieval and classification of garments in fashion store databases, exploiting shape and color information. The garment segmentation is automatically initialized by learning geometric constraints and shape cues, then it is performed by modeling both skin and accessory colors with Gaussian Mixture Models. For color similarity retrieval and classification, to adapt the color description to the users’ perception and the company marketing directives, a color histogram with an optimized binning strategy, learned on the given color classes, is introduced and combined with HOG features for garment classification. Experiments validating the proposed strategy, and a free-to-use dataset publicly available for scientific purposes, are finally detailed.

2014 - Benchmarking for Person Re-identification [Capitolo/Saggio]
Vezzani, Roberto; Cucchiara, Rita
The evaluation of computer vision and pattern recognition systems is usually a burdensome and time-consuming activity. In this chapter all the benchmarks publicly available for re-identification will be reviewed and compared, starting from the ancestors VIPeR and Caviar to the most recent datasets for 3D modeling such as SARC3d (with calibrated cameras) and RGBD-ID (with range sensors). Specific requirements and constraints are highlighted and reported for each of the described collections. In addition, details on the metrics that are mostly used to test and evaluate the re-identification systems are provided.

2014 - Covariance of Covariance Features for Image Classification [Relazione in Atti di Convegno]
Serra, Giuseppe; Grana, Costantino; Manfredi, Marco; Cucchiara, Rita
In this paper we propose a novel image descriptor built by computing the covariance of pixel-level features on densely sampled patches and encoding them using their covariance. Appropriate projections to the Euclidean space and feature normalizations are employed in order to provide a strong descriptor usable with linear classifiers. In order to remove border effects, we further enhance the Spatial Pyramid representation with bilinear interpolation. Experimental results conducted on two common datasets for object and texture classification show that the performance of our method is comparable with state-of-the-art techniques, while removing any dataset-specific dependency in the feature encoding step.

2014 - Detection of static groups and crowds gathered in open spaces by texture classification [Articolo su rivista]
Manfredi, Marco; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
A surveillance system specifically developed to manage crowded scenes is described in this paper. In particular we focus on static crowds, composed of groups of people who gather and stay in the same place for a while. The detection and spatial localization of static crowd situations is performed by means of a One Class Support Vector Machine, working on texture features extracted at patch level. Spatial regions containing crowds are identified and filtered using motion information to prevent noise and false alarms due to moving flows of people. By means of one-class classification and inner texture descriptors, we are able to obtain, from a single training set, a sufficiently general crowd model that can be used for all the scenarios that share a similar viewpoint. Tests on public datasets and real setups validate the proposed system.

2014 - From Ego to Nos-Vision: Detecting Social Relationships in First-Person Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
In this paper we present a novel approach to detect groups in ego-vision scenarios. People in the scene are tracked through the video sequence and their head pose and 3D location are estimated. Based on the concept of f-formation, we define with the orientation and distance an inherently social pairwise feature that describes the affinity of a pair of people in the scene. We apply a correlation clustering algorithm that merges pairs of people into socially related groups. Due to the very shifting nature of social interactions and the different meanings that orientations and distances can assume in different contexts, we learn the weight vector of the correlation clustering using Structural SVMs. We extensively test our approach on two publicly available datasets showing encouraging results when detecting groups from first-person camera views.

2014 - Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Paci, Francesco; Serra, Giuseppe; Benini, Luca; Cucchiara, Rita
We present a novel method for monocular hand gesture recognition in ego-vision scenarios that deals with static and dynamic gestures and can achieve high accuracy using only a few positive samples. Specifically, we use and extend the dense trajectories approach that has been successfully introduced for action recognition. Dense features are extracted around regions selected by a new hand segmentation technique that integrates superpixel classification, temporal and spatial coherence. We extensively test our gesture recognition and segmentation algorithms on public datasets and propose a new dataset shot with a wearable camera. In addition, we demonstrate that our solution can work in near real-time on a wearable device.

2014 - Head Pose Estimation in First-Person Camera Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
In this paper we present a new method for real-time head pose estimation in ego-vision scenarios, which is a key step in the understanding of social interactions. In order to robustly detect heads under changing aspect ratio, scale and orientation we use and extend the Hough-Based Tracker, which allows each subject in the scene to be followed simultaneously. In an ego-vision scenario where a group interacts in a discussion, each subject's head orientation will be more likely to remain focused for a while on the person who has the floor. In order to encode this behavior we include a stateful Hidden Markov Model technique that enforces the predicted pose with the temporal coherence from a video sequence. We extensively test our approach on several indoor and outdoor ego-vision videos with high illumination variations, showing its validity and outperforming other recent related state-of-the-art approaches.

2014 - Illustrations Segmentation in Digitized Documents Using Local Correlation Features [Relazione in Atti di Convegno]
D. Coppi; C. Grana; R. Cucchiara
In this paper we propose an approach for Document Layout Analysis based on local correlation features. We identify and extract illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions. The proposal has been demonstrated to be effective on historical datasets and to outperform the state of the art in the presence of challenging documents with a large variety of pictorial elements.

2014 - Kernelized Structural Classification for 3D Dogs Body Parts Detection [Relazione in Atti di Convegno]
Pistocchi, Simone; Calderara, Simone; Barnard, S.; Ferri, N.; Cucchiara, Rita
Although pattern recognition methods for human behavioral analysis have flourished in the last decade, animal behavioral analysis has been almost neglected. The few existing approaches are mostly focused on preserving livestock economic value, while attention to the welfare of companion animals, like dogs, is now emerging as a social need. In this work, following the analogy with human behavior recognition, we propose a system for recognizing body parts of dogs kept in pens. We adopt both 2D and 3D features in order to obtain a rich description of the dog model. Images are acquired using the Microsoft Kinect to capture the depth map images of the dog. On the depth maps, a Structural Support Vector Machine (SSVM) is employed to identify the body parts using both 3D features and 2D images. The proposal relies on a kernelized discriminative structural classifier specifically tailored for dogs, independently of size and breed. The classification is performed in an online fashion using the LaRank optimization technique to obtain real-time performance. Promising results have emerged during the experimental evaluation carried out at a dog shelter, managed by IZSAM, in Teramo, Italy.

2014 - Layout analysis and content enrichment of digitized books [Articolo su rivista]
Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Coppi, Dalia; Cucchiara, Rita
In this paper we describe a system for automatically analyzing old documents and creating hyperlinks between different epochs, thus opening ancient documents to young people and making them available on the web with old and current content. We propose a supervised learning approach to segment text and illustrations of digitized old documents using a texture feature based on local correlation, aimed at detecting the repeating patterns of text regions and differentiating them from pictorial elements. Moreover, we present a solution to help the user find contemporary content connected to what is automatically extracted from the ancient documents.

2014 - Learning Graph Cut Energy Functions for Image Segmentation [Relazione in Atti di Convegno]
M. Manfredi; C. Grana; R. Cucchiara
In this paper we address the task of learning how to segment a particular class of objects, by means of a training set of images and their segmentations. In particular we propose a method to overcome the extremely high training time of a previously proposed solution to this problem, Kernelized Structural Support Vector Machines. We employ a one-class SVM working with joint kernels to robustly learn significant support vectors (representative image-mask pairs) and accordingly weight them to build a suitable energy function for the graph cut framework. We report results obtained on two public datasets and a comparison of training times on different training set sizes.

2014 - Learning Superpixel Relations for Supervised Image Segmentation [Relazione in Atti di Convegno]
M. Manfredi; C. Grana; R. Cucchiara
In this paper we propose to extend the well known graph cut segmentation framework by learning superpixel relations and use them to weight superpixel-to-superpixel edges in a superpixel graph. Adjacent superpixel-pairs are analyzed to build an object boundary model, able to discriminate between superpixel-pairs belonging to the same object or placed on the edge between the foreground object and the background. Several superpixel-pair features are investigated and exploited to build a non-linear SVM to learn object boundary appearance. The adoption of this modified graph cut enhances the performance of a previously proposed segmentation method on two publicly available datasets, reaching state-of-the-art results.

2014 - Mapping Appearance Descriptors on 3D Body Models for People Re-identification [Articolo su rivista]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
People Re-identification aims at associating multiple instances of a person’s appearance acquired from different points of view, different cameras, or after a spatial or a limited temporal gap to the same identifier. The basic hypothesis is that the person’s appearance is mostly constant. Many appearance descriptors have been adopted in the past, but they are often subject to severe perspective and view-point issues. In this paper, we propose a complete re-identification framework which exploits non-articulated 3D body models to spatially map appearance descriptors (color and gradient histograms) onto the vertices of a regularly sampled 3D body surface. The matching and the shot integration steps are directly handled in the 3D body model, reducing the effects of occlusions, partial views or pose changes, which normally afflict 2D descriptors. A fast and effective model-to-image alignment is also proposed. It allows operation on common surveillance cameras or image collections. A comprehensive experimental evaluation is presented using the benchmark suite 3DPeS.

2014 - Miniature illustrations retrieval and innovative interaction for digital illuminated manuscripts [Articolo su rivista]
D. Borghesani; C. Grana; R. Cucchiara
In this paper we propose a multimedia solution for the interactive exploration of illuminated manuscripts. We leverage the joint exploitation of content-based image retrieval and relevance feedback to provide an effective mechanism to navigate through the manuscript and add custom knowledge in the form of tags. The similarity retrieval between miniature illustrations is based on covariance descriptors, integrating color, spatial and gradient information. The proposed relevance feedback technique, namely Query Remapping Feature Space Warping, accounts for the user’s opinions by accordingly warping the data points. This is obtained by means of a remapping strategy (from the Riemannian space where covariance matrices lie, referring back to Euclidean space) useful to boost the retrieval performance. Experiments are reported to show the quality of the proposal. Moreover, the complete prototype with user interaction, as already showcased at museums and exhibitions, is presented.

2014 - Substrate for a sensitive floor and method for displaying loads on the substrate [Brevetto]
Lucchese, Claudio; Cucchiara, Rita; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto
The substrate (1; 50) for making a sensitive floor comprises: a first frame made of high-conductivity sensing means (2a-2d) having a first orientation; a second frame made of high-conductivity sensing means (3a-3d) which is adapted to be laid on said first frame and has a second orientation, other than said first orientation, said second frame (3a-3d) forming a support layer for floor finishing products; an element (4) made of a conductive material, which comprises: an elastically compressible thickness (S1), two opposite faces (104, 204) contacting said two first and second frames (2a-2d), (3a-3d), an electric resistor whose resistance is proportional to said thickness (S1).

2014 - Visual Tracking: An Experimental Survey [Articolo su rivista]
A. W. M. Smeulders; D. M. Chu; R. Cucchiara; S. Calderara; A. Dehghan; M. Shah
There is a large variety of trackers, which have been proposed in the literature during the last two decades with some mixed success. Object tracking in realistic scenarios is a difficult problem, therefore it remains a most active area of research in Computer Vision. A good tracker should perform well in a large number of videos involving illumination changes, occlusion, clutter, camera motion, low contrast, specularities and at least six more aspects. However, the performance of proposed trackers has typically been evaluated on fewer than ten videos, or on special-purpose datasets. In this paper, we aim to evaluate trackers systematically and experimentally on 315 video fragments covering the above aspects. We selected a set of nineteen trackers to include a wide variety of algorithms often cited in the literature, supplemented with trackers appearing in 2010 and 2011 for which the code was publicly available. We demonstrate that trackers can be evaluated objectively by survival curves, Kaplan Meier statistics, and Grubbs testing. We find that in the evaluation practice the F-score is as effective as the object tracking accuracy (OTA) score. The analysis under a large variety of circumstances provides objective insight into the strengths and weaknesses of trackers.

2013 - A Fast Approach for Integrating ORB Descriptors in the Bag of Words Model [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; M. Manfredi; R. Cucchiara
In this paper we propose to integrate the recently introduced ORB descriptors into the currently favored approach for image classification, that is, the Bag of Words model. In particular, the problem to be solved is to provide a clustering method able to deal with the binary string nature of the ORB descriptors. We suggest using a k-means-like approach, called k-majority, which substitutes the Euclidean distance with the Hamming distance and uses the majority-selected vector as the new cluster center. Results combining this new approach with other features are provided on the ImageCLEF 2011 dataset.
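
The clustering idea described above (k-means with Hamming distance and a bitwise majority vote as the center update) can be sketched as follows. This is a minimal illustration, not the authors' code: descriptors are assumed to be unpacked into arrays of 0/1 bits, and the farthest-first initialization, the tie-breaking rule (ties go to 1) and the names `hamming` and `k_majority` are assumptions.

```python
import numpy as np

def hamming(a, b):
    """Pairwise Hamming distances between the rows of two 0/1 arrays."""
    return (a[:, None, :] != b[None, :, :]).sum(axis=2)

def k_majority(bits, k, iters=10):
    X = np.asarray(bits, dtype=np.uint8)
    # Deterministic farthest-first initialization (an assumption, chosen for clarity).
    chosen = [X[0]]
    while len(chosen) < k:
        d = hamming(X, np.array(chosen)).min(axis=1)
        chosen.append(X[d.argmax()])
    centers = np.array(chosen)
    for _ in range(iters):
        # Assignment step: nearest center under the Hamming distance.
        labels = hamming(X, centers).argmin(axis=1)
        # Update step: each bit of the center is the majority bit of its members.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = 2 * members.sum(axis=0) >= len(members)
    labels = hamming(X, centers).argmin(axis=1)
    return centers, labels

# Toy data: three all-zero and three all-one 8-bit descriptors.
toy = np.array([[0] * 8] * 3 + [[1] * 8] * 3, dtype=np.uint8)
centers, labels = k_majority(toy, k=2)
```

Real ORB descriptors are 256-bit strings usually stored as 32 bytes; they could be expanded to bits with `np.unpackbits` before being passed to a function like this one.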

2013 - An Automated Picking Workstation for Healthcare Applications [Articolo su rivista]
Piccinini P.; Gamberini R.; Prati A.; Rimini B.; Cucchiara R.
The costs associated with the management of healthcare systems have been subject to continuous scrutiny for some time now, with a view to reducing them without affecting the quality as perceived by final users. A number of different solutions have arisen based on centralisation of healthcare services and investments in Information Technology (IT). One such example is centralised management of pharmaceuticals among a group of hospitals which is then incorporated into the different steps of the automation supply chain. This paper focuses on a new picking workstation available for insertion in automated pharmaceutical distribution centres and which is capable of replacing manual workstations and bringing about improvements in working time. The workstation described uses a sophisticated computer vision algorithm to allow picking of very diverse and complex objects randomly available on a belt or in bins. The algorithm exploits state-of-the-art feature descriptors for an approach that is robust against occlusions and distracting objects, and invariant to scale, rotation or illumination changes. Finally, the performance of the designed picking workstation is tested in a large experimentation focused on the management of pharmaceutical items.

2013 - Automatic Single-Image People Segmentation and Removal for Cultural Heritage Imaging [Relazione in Atti di Convegno]
M. Manfredi; C. Grana; R. Cucchiara
In this paper, the problem of automatic people removal from digital photographs is addressed. Removing unintended people from a scene can be very useful to focus further steps of image analysis only on the object of interest. A supervised segmentation algorithm is presented and tested in several scenarios.

2013 - Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives [Relazione in Atti di Convegno]
C. Grana; G. Serra; M. Manfredi; R. Cucchiara
Several local features have become quite popular for concept detection and search, due to their ability to capture distinctive details. Typically a Bag of Words approach is followed, where a codebook is built by quantizing the local features. In this paper, we propose to represent SIFT local features extracted from an image as a multivariate Gaussian distribution, obtaining a mean vector and a covariance matrix. Unlike common techniques based on the Bag of Words model, our solution does not rely on the construction of a visual vocabulary, thus removing the dependence of the image descriptors on the specific dataset and allowing the features to be immediately retargeted to different classification and search problems. Experiments are conducted on two very different Cultural Heritage image archives, composed of illuminated manuscript miniatures and pictures of architectural elements collected from the web, on which the proposed approach outperforms the Bag of Words technique both in classification and retrieval.

2013 - Hand Segmentation for Gesture Recognition in EGO-Vision [Relazione in Atti di Convegno]
Serra, Giuseppe; Camurri, Marco; Baraldi, Lorenzo; Benedetti, Michela; Cucchiara, Rita
Portable devices for first-person camera views will play a central role in future interactive systems. One necessary step for feasible human-computer guided activities is gesture recognition, preceded by a reliable hand segmentation from egocentric vision. In this work we provide a novel hand segmentation algorithm based on Random Forest superpixel classification that integrates light, time and space consistency. We also propose a gesture recognition method based on Exemplar SVMs, since it requires only a small set of positive samples and is hence well suited to egocentric video applications. Furthermore, this method is enhanced by using segmented images instead of full frames during the test phase. Experimental results show that our hand segmentation algorithm outperforms the state-of-the-art approaches and improves the gesture recognition accuracy on both the publicly available EDSH dataset and our dataset designed for cultural heritage applications.

2013 - Human Behavior Understanding with Wide Area Sensing Floors [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
The research on innovative and natural interfaces aims at developing devices able to capture and understand the human behavior without the need of a direct interaction. In this paper we propose and describe a framework based on a sensing floor device. The pressure field generated by people or objects standing on the floor is captured and analyzed. Local and global features are computed by a low level processing unit and sent to high level interfaces. The framework can be used in different applications, such as entertainment, education or surveillance. A detailed description of the sensing element and the processing architectures is provided, together with some sample applications developed to test the device capabilities.

2013 - Image Classification with Multivariate Gaussian Descriptors [Relazione in Atti di Convegno]
C. Grana; G. Serra; M. Manfredi; R. Cucchiara
Techniques based on the Bag of Words approach represent images by quantizing local descriptors and summarizing their distribution in a histogram. Differently, in this paper we describe an image as a multivariate Gaussian distribution, estimated over the extracted local descriptors. The estimated distribution is mapped to a high-dimensional descriptor by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. To deal with large scale datasets and high dimensional feature spaces, the Stochastic Gradient Descent solver is adopted. The experimental results on Caltech-101 and ImageCLEF2011 show that the method obtains competitive performance with state-of-the-art approaches.

2013 - Learning articulated body models for people re-identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
People re-identification is a challenging problem in surveillance and forensics and it aims at associating multiple instances of the same person which have been acquired from different points of view and after a temporal gap. Image-based appearance features are usually adopted but, in addition to their intrinsically low discriminability, they are subject to perspective and view-point issues. We propose to completely change the approach by mapping local descriptors extracted from RGB-D sensors on a 3D body model for creating a view-independent signature. An original bone-wise color descriptor is generated and reduced with PCA to compute the person signature. The virtual bone set used to map appearance features is learned using a recursive splitting approach. Finally, people matching for re-identification is performed using the Relaxed Pairwise Metric Learning, which simultaneously provides feature reduction and weighting. Experiments on a specific dataset created with the Microsoft Kinect sensor and the OpenNi libraries prove the advantages of the proposed technique with respect to state of the art methods based on 2D or non-articulated 3D body models.

2013 - Lightweight Sign Recognition for Mobile Devices [Relazione in Atti di Convegno]
M. Fornaciari; A. Prati; C. Grana; R. Cucchiara
The diffusion of powerful mobile devices has laid the basis for new applications that implement sophisticated computer vision and pattern recognition algorithms directly on the devices (which are embedded devices). This paper describes the implementation of a complete system for the automatic recognition of places localized on a map through the recognition of significant signs by means of the camera of a mobile device (smartphone, tablet, etc.). The paper proposes a novel classification algorithm based on the innovative use of bag-of-words on ORB features. The recognition is achieved using a simple yet effective search scheme which exploits GPS localization to limit the possible matches. This simple solution brings several advantages, such as speed even on limited-resource devices, usability even with limited training samples, and ease of adaptation to new training samples and classes. The overall architecture of the system is based on a REST-JSON client-server architecture. The experiments have been conducted in a real scenario, evaluating the different parameters which influence the performance.

2013 - Modeling Local Descriptors with Multivariate Gaussians for Object and Scene Recognition [Relazione in Atti di Convegno]
G. Serra; C. Grana; M. Manfredi; R. Cucchiara
Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description and compare its capabilities to histogram based approaches. We use the multivariate Gaussian distribution, applied over the SIFT descriptors, extracted with dense sampling on a spatial pyramid. Every distribution is converted to a high-dimensional descriptor by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. Experiments on Caltech-101 and ImageCLEF2011 are performed using the Stochastic Gradient Descent solver, which makes it possible to deal with large scale datasets and high dimensional feature spaces.

2013 - People reidentification in surveillance and forensics: a Survey [Articolo su rivista]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
The field of surveillance and forensics research is currently shifting focus and is now showing an ever increasing interest in the task of people reidentification. This is the task of assigning the same identifier to all instances of a particular individual captured in a series of images or videos, even after the occurrence of significant gaps over time or space. People reidentification can be a useful tool for people analysis in security as a data association method for long-term tracking in surveillance. However, current identification techniques being utilized present many difficulties and shortcomings. For instance, they rely solely on the exploitation of visual cues such as color, texture, and the object's shape. Despite the many advances in this field, reidentification is still an open problem. This survey aims to tackle all the issues and challenging aspects of people reidentification while simultaneously describing the previously proposed solutions for the encountered problems. This begins with the first attempts of holistic descriptors and progresses to the more recently adopted 2D and 3D model-based approaches. The survey also includes an exhaustive treatise of all the aspects of people reidentification, including available datasets, evaluation metrics, and benchmarking.

2013 - Sensing floors for privacy-compliant surveillance of wide areas [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
Surveillance systems can really benefit from the integration of multiple and heterogeneous sensors. In this paper we describe an innovative sensing floor. Thanks to its low cost and ease of installation, the floor is suitable for both private and public environments, from narrow zones to wide areas. The floor is made by adding a sensing layer below commercial floating tiles. The sensor is scalable, reliable, and completely invisible to the users. The temporal and spatial resolutions of the data are high enough to identify the presence of people, to recognize their behavior and to detect events in a privacy compliant way. Experimental results on a real prototype implementation confirm the potential of the framework.

2013 - Structured learning for detection of social groups in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
Group detection in crowds will play a key role in future behavior analysis surveillance systems. In this work we build a new Structural SVM-based learning framework able to solve the group detection task by exploiting annotated video data to deduce a sociologically motivated distance measure founded on Hall's proxemics and Granger's causality. We improve over state-of-the-art results even in the most crowded test scenarios, while keeping the classification time affordable for quasi-real time applications. A new scoring scheme specifically designed for the group detection task is also proposed.

2013 - UNIMORE at ImageCLEF 2013: Scalable Concept Image Annotation [Relazione in Atti di Convegno]
C. Grana; G. Serra; M. Manfredi; R. Cucchiara; R. Martoglia; F. Mandreoli
In this paper we propose a large-scale image annotation system for the Scalable Concept Image Annotation task. For each concept to be detected, a separate classifier is built using the provided textual annotation. Images are represented as a Multivariate Gaussian distribution of a set of local features extracted over a dense regular grid. Textual analysis of the web pages containing training images is performed to retrieve a relevant set of samples for learning each concept classifier. An online SVM solver based on Stochastic Gradient Descent is used to manage the large amount of training data. Experimental results show that the combination of different kinds of local features encoded with our strategy achieves very competitive performance both in terms of mAP and mean F-measure.

2013 - Video surveillance online repository (ViSOR) [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
This paper describes ViSOR (Video Surveillance Online Repository), designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. The repository is free and researchers can collaborate by sharing their own videos or datasets. Most of the included videos are annotated. Annotations are based on a reference ontology which has been defined by integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, which is aimed at identifying the spatial, temporal and domain detail level used. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.

2012 - 2D Images Map Warping for Improved User Interaction [Relazione in Atti di Convegno]
D. Borghesani; C. Grana; R. Cucchiara
In this paper, we suggest an interaction model designed to fit users' expectations in front of an image retrieval system. A lightweight relevance feedback strategy, working directly on the 2D projection of image features, allows the user to spatially navigate the media collection maintaining the real-time constraint. A preliminary evaluation of this relevance feedback strategy shows good performance compared with other known approaches.

2012 - Class-based color bag of words for fashion retrieval [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
Color signatures, histograms and bag of colors are basic and effective strategies for describing the color content of images, for retrieving images by their color appearance or providing color annotation. In some domains, colors assume a specific meaning for users and the color-based classification and retrieval should mirror the initial suggestions given by users in the training set. For instance, in the fashion world, the names given to the dominant color of a garment or a dress reflect the fashion dictates and not a uniform division of the color space. In this paper we propose a general approach to implement color signature as a trained bag of words, defined on the basis of user defined color classes. The novel Class-based Color Bag of Words is an easily computable bag of words of color, constructed following an approach similar to the Median Cut algorithm, but biased by color distribution in the trained classes. Moreover, to dramatically reduce the computational effort we propose 3D integral histograms, a 3D extension of integral images, easily extensible to many histogram-based signatures in 3D color space. Several comparisons on large fashion datasets confirm the discriminant power of this signature.
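
The idea of extending integral images to a 3D color space can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes colors are already quantized to `bins` levels per channel, and the names `build_integral_histogram` and `box_count` are hypothetical. Once the cumulative table is built, the number of pixels falling in any axis-aligned color box (the kind of box a Median Cut style split produces) is obtained with eight lookups.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
Table = List[List[List[int]]]


def build_integral_histogram(pixels: Sequence[Color], bins: int) -> Table:
    """Return table with table[r][g][b] = number of pixels whose quantized
    color is strictly below (r, g, b) on every channel."""
    size = bins + 1  # one extra zero plane per axis simplifies the queries
    table = [[[0] * size for _ in range(size)] for _ in range(size)]
    for r, g, b in pixels:
        table[r + 1][g + 1][b + 1] += 1
    # Cumulative sums along the third, second and first axis in turn.
    for r in range(size):
        for g in range(size):
            for b in range(1, size):
                table[r][g][b] += table[r][g][b - 1]
    for r in range(size):
        for g in range(1, size):
            for b in range(size):
                table[r][g][b] += table[r][g - 1][b]
    for r in range(1, size):
        for g in range(size):
            for b in range(size):
                table[r][g][b] += table[r - 1][g][b]
    return table


def box_count(table: Table, lo: Color, hi: Color) -> int:
    """Count pixels with lo <= color < hi (per channel) by inclusion-exclusion."""
    r0, g0, b0 = lo
    r1, g1, b1 = hi
    return (
        table[r1][g1][b1]
        - table[r0][g1][b1] - table[r1][g0][b1] - table[r1][g1][b0]
        + table[r0][g0][b1] + table[r0][g1][b0] + table[r1][g0][b0]
        - table[r0][g0][b0]
    )


# Toy example with 4 quantization levels per channel.
pixels = [(0, 0, 0), (1, 2, 3), (3, 3, 3), (1, 2, 3)]
table = build_integral_histogram(pixels, bins=4)
total = box_count(table, (0, 0, 0), (4, 4, 4))
```

Building the table costs one pass over the pixels plus one pass over the bins, after which every box query takes constant time however large the box is.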

2012 - Integrate tool for online analysis and offline mining of people trajectories [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
In the past literature, online alarm-based video-surveillance and offline forensic-based data mining systems are often treated separately, even by different scientific communities. However, the founding techniques are almost the same and, despite some examples in commercial systems, the cases in which an integrated approach is followed are limited. For this reason, this study describes an integrated tool capable of putting together these two subsystems in an effective way. Despite its generality, the proposal is here reported in the case of people trajectory analysis, both in real time and offline. Trajectories are modelled based on either their spatial location or their shape, and proper similarity measures are proposed. Special solutions to meet real-time requirements in both cases are also presented, and the trade-off between efficiency and efficacy is analysed by comparing the results obtained with and without a statistical model. Examples of results on large datasets acquired on the University campus are reported as a preliminary evaluation of the system.

2012 - Intelligent Video Surveillance [Capitolo/Saggio]
R. Cucchiara; A. Prati; R. Vezzani
Safety and security reasons are pushing the growth of surveillance systems, for both prevention and forensic tasks. Unfortunately, most of the installed systems have recording capability only, with quality so poor that it makes them completely unhelpful. This chapter will introduce the concepts of modern systems for Intelligent Video Surveillance (IVS), aiming to provide neither a complete treatment nor a technical description of this topic, but rather a simple and concise panorama of the motivations, components, and trends of these systems. Different from CCTV systems, IVS should be able, for instance, to monitor people in public areas and smart homes, to control urban traffic, and to perform identity assessment for the security and safety of critical infrastructure.

2012 - Learning Non-Target Items for Interesting Clothes Segmentation in Fashion Images [Relazione in Atti di Convegno]
C. Grana; S. Calderara; D. Borghesani; R. Cucchiara
In this paper we propose a color-based approach for skin detection and interest garment selection aimed at an automatic segmentation of pieces of clothing. For both purposes, the color description is extracted by an iterative energy minimization approach and an automatic initialization strategy is proposed by learning geometric constraints and shape cues. Experiments confirm the good performance of this technique both in the context of skin removal and in the context of classification of garments.

2012 - Multimedia for Cultural Heritage: Key Issues [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; D. Borghesani; M. Agosti; A.D. Bagdanov
Multimedia technologies have recently created the conditions for a true revolution in the Cultural Heritage domain, particularly in reference to the study, exploitation, and fruition of artistic works. New opportunities are arising for researchers in the field of multimedia to share their research results with people coming from the field of art and culture, and vice versa. This paper gathers together opinions and ideas shared during the final discussion session at the 1st International Workshop on Multimedia for Cultural Heritage, as a summary of the problems and possible directions to solve them.

2012 - Multistage Particle Windows for Fast and Accurate Object Detection [Articolo su rivista]
G. Gualdi; A. Prati; R. Cucchiara
The common paradigm employed for object detection is the sliding window (SW) search. This approach generates grid-distributed patches, at all possible positions and sizes, which are evaluated by a binary classifier: the trade-off between computational burden and detection accuracy is the real critical point of sliding windows; several methods have been proposed to speed up the search such as adding complementary features. We propose a paradigm that differs from any previous approach, since it casts object detection into a statistical-based search using a Monte Carlo sampling for estimating the likelihood density function with Gaussian kernels. The estimation relies on a multi-stage strategy where the proposal distribution is progressively refined by taking into account the feedback of the classifiers. The method can be easily plugged in a Bayesian-recursive framework to exploit the temporal coherency of the target objects in videos. Several tests on pedestrian and face detection, both on images and videos, with different types of classifiers (cascade of boosted classifiers, soft cascades and SVM) and features (covariance matrices, Haar-like features, integral channel features and histogram of oriented gradients) demonstrate that the proposed method provides higher detection rates and accuracy as well as a lower computational burden w.r.t. sliding window detection.

2012 - People Orientation Recognition by Mixtures of Wrapped Distributions on Random Trees [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
The recognition of people orientation in single images is still an open issue in several real cases, when the image resolution is poor, body parts cannot be distinguished and localized or motion cannot be exploited. However, the estimation of a person orientation, even an approximated one, could be very useful to improve people tracking and re-identification systems, or to provide a coarse alignment of body models on the input images. In these situations, holistic features seem to be more effective and faster than model based 3D reconstructions. In this paper we propose to describe the people appearance with multi-level HoG feature sets and to classify their orientation using an array of Extremely Randomized Trees classifiers trained on quantized directions. The outputs of the classifiers are then integrated into a global continuous probability density function using a Mixture of Approximated Wrapped Gaussian distributions. Experiments on the TUD Multiview Pedestrians, the Sarc3D, and the 3DPeS datasets confirm the efficacy of the method and the improvement with respect to state of the art approaches.
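
The fusion step described above, in which per-direction classifier outputs become a continuous density over the angle, can be sketched with a truncated wrapped Gaussian. This is only a sketch under assumptions: the abstract does not say how the wrapped Gaussian is approximated, so the truncation to a few wraps, the `(weight, mean, sigma)` component format, and all function names here are hypothetical choices for illustration.

```python
import math
from typing import Sequence, Tuple

# (weight, mean angle, sigma), angles in radians within [0, 2*pi)
Component = Tuple[float, float, float]


def wrapped_gaussian(theta: float, mu: float, sigma: float, wraps: int = 1) -> float:
    """Wrapped Gaussian density approximated by summing only the
    replicas shifted by 2*pi*k for k in [-wraps, wraps]."""
    total = 0.0
    for k in range(-wraps, wraps + 1):
        z = (theta - mu + 2.0 * math.pi * k) / sigma
        total += math.exp(-0.5 * z * z)
    return total / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(theta: float, components: Sequence[Component]) -> float:
    """Weighted sum of approximated wrapped Gaussians."""
    return sum(w * wrapped_gaussian(theta, mu, s) for w, mu, s in components)


def most_likely_orientation(components: Sequence[Component], steps: int = 360) -> float:
    """Grid search for the angle with the highest mixture density."""
    grid = [2.0 * math.pi * i / steps for i in range(steps)]
    return max(grid, key=lambda t: mixture_density(t, components))


# Two hypothetical classifier responses: one for 0 rad, one for pi/4 rad.
components = [(0.6, 0.0, 0.4), (0.4, math.pi / 4, 0.4)]
estimate = most_likely_orientation(components)
```

Because the density wraps around the circle, a component centred just below 2π still contributes to angles just above 0, which a plain Gaussian on the real line would miss.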

2012 - Relevance Feedback as an Interactive Navigation Tool [Relazione in Atti di Convegno]
D. Borghesani; C. Grana; R. Cucchiara
Image collections are searched in common retrieval systems in many different ways, but the typical presentation is by means of a grid styled view. In this paper we try to suggest a novel use of relevance feedback as a tool to warp the view and allow the user to spatially navigate the image collection, and at the same time focus on his retrieval aim. This is obtained by the use of a distance based space warping on the 2D projection of the distance matrix.

2012 - Special Issue: Recent Achievements in Multimedia for Cultural Heritage - Guest Editorial [Articolo su rivista]
R. Cucchiara; C. Grana
For quite some time, libraries, document and historical centers from opposite corners of the world have been the caretakers of our rich and assorted social legacy. They have protected and furnished access to the testimonies of knowledge, beauty and inspiration, such as sculptures, paintings, music and literature. The new information technologies have created unbelievable opportunities to make this common heritage more accessible for all. Culture is following the digital path and “memory institutions” are adapting the way in which they communicate with their public. Multimedia technologies have recently created the conditions for a true revolution in the cultural heritage area, with reference to the study, valorization, and fruition of artistic works. New multimedia technologies can be used to plan unique approaches to the perception and fulfillment of the masterful legacy, for instance, through smart cultural objects and new interfaces with the backing of items such as story-telling, gaming and learning. All the plurality of masterpieces (paintings, books, manuscripts, even photos of sculptures and architecture) can be effectively embedded into a “unique paradigm” through digitization. This allows a significant reduction in costs, an enormous expansion of public accessibility (and therefore income), and at the same time a tremendous freedom for data elaboration. In brief, digitization enhances pleasure for the public and usefulness to experts on cultural heritage assets.

2012 - Towards Artistic Collections Navigation Tools based on Relevance Feedback [Relazione in Atti di Convegno]
D. Borghesani; C. Grana; R. Cucchiara
Artistic image collections are usually managed via textual metadata in standard content management systems. More sophisticated searches can be performed using image retrieval technologies based on visual content. Nevertheless, the problem of the information presentation remains. In this paper we try to move beyond the classic grid-styled presentation model, suggesting a novel use of relevance feedback as a navigation tool. Relevance feedback is therefore used to warp the view and allow the user to spatially navigate the image collection, and at the same time focus on his retrieval aim. This is obtained by exploiting a distance based space warping on the 2D projection of the distance matrix. Multitouch gestures are employed to provide feedback by natural interaction with the system.

2012 - Understanding dyadic interactions applying proxemic theory on videosurveillance trajectories [Relazione in Atti di Convegno]
Simone Calderara; Rita Cucchiara
Understanding social and collective people behaviour in open spaces is one of the frontiers of modern video surveillance. Many sociological theories, and proxemics in particular, have proved their validity as a support for classifying and interpreting human behaviour. Proxemics suggests some simple but effective behavioural rules, useful to understand what people are doing and their social involvement with other individuals. In this paper we propose to extend the proxemics analysis along time and provide a solution for analysing sequences of proxemic states computed between trajectories of people pairs (dyads). Trajectories, computed from videosurveillance videos, are first analysed and converted to a sequence of symbols according to proxemic theory. Then an elastic measure for comparing those sequences is introduced. Finally, interactions are classified both in an off-line unsupervised way and in an on-line fashion. Results on videosurveillance data demonstrate that sequences of proxemic states can be effective in characterizing mutual interactions. Experiments on capturing the most frequent dyad interactions, and on classifying them on-line when a labelled training set is available, are also presented.
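
The pipeline outlined above (trajectories, then proxemic symbols, then an elastic comparison of symbol sequences) might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: the symbols are derived from inter-person distance only, the zone boundaries are approximate values commonly quoted for Hall's proxemic zones, and a plain Levenshtein edit distance stands in for the elastic measure, which the abstract does not specify.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

# Approximate upper bounds, in meters, of Hall's intimate, personal and
# social zones (assumed values); anything farther is treated as public.
ZONES = [(0.45, "I"), (1.2, "P"), (3.6, "S")]


def proxemic_symbol(distance: float) -> str:
    """Map an inter-person distance to a zone symbol."""
    for upper, symbol in ZONES:
        if distance < upper:
            return symbol
    return "U"  # public zone


def dyad_sequence(a: Sequence[Point], b: Sequence[Point]) -> str:
    """Convert two time-aligned trajectories into a string of zone symbols."""
    return "".join(
        proxemic_symbol(math.hypot(p[0] - q[0], p[1] - q[1])) for p, q in zip(a, b)
    )


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance; tolerates local stretching of one sequence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Person B approaches a stationary person A.
track_a = [(0.0, 0.0)] * 4
track_b = [(5.0, 0.0), (2.0, 0.0), (1.0, 0.0), (0.3, 0.0)]
approach = dyad_sequence(track_a, track_b)
```

Two approach sequences that differ only in how long the pair lingers in one zone end up a small edit distance apart, which is the kind of tolerance to timing an elastic measure is meant to provide.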

2012 - Veiling Luminance estimation on FPGA-based embedded smart camera [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; P. Santinelli; R. Cucchiara
This paper describes the design and development of a Veiling Luminance estimation system based on the use of a CMOS image sensor, fully implemented on FPGA. The system is composed of the CMOS image sensor, FPGA, DDR SDRAM, USB controller and SPI (Serial Peripheral Interface) Flash. The FPGA is used to build a system-on-chip integrating a soft processor (Xilinx MicroBlaze) and all the hardware blocks needed to handle the external peripherals and memory. The soft processor is used to handle image acquisition and all computational tasks needed to compute the Veiling Luminance value. The advantages of this single chip FPGA implementation include the reduction of the hardware requirements, power consumption, and system complexity. The problem of high dynamic range images has been addressed with multiple acquisitions at different exposure times. Vignetting, radial distortion and angular weighting, as required by the veiling luminance definition, are handled by a single integer look-up table (LUT) access. Results are compared with a state of the art certified instrument.

2011 - 3DPes: 3D People Dataset for Surveillance and Forensics [Relazione in Atti di Convegno]
D. Baltieri; R. Vezzani; R. Cucchiara
The interest of the research community in creating reference datasets for performance analysis is always very high. Although new datasets collecting large amounts of video footage are spreading in surveillance and forensics, few benchmarks with annotation data are available for testing specific tasks and especially for 3D/multi-view analysis. In this paper we present 3DPeS, a new dataset for 3D/multi-view surveillance and forensic applications. This has been designed for discussing and evaluating research results in people re-identification and other related activities (people detection, people segmentation and people tracking). The new assessed version of the dataset contains hundreds of video sequences of 200 people taken from a multi-camera distributed surveillance system over several days, with different light conditions; each person is detected multiple times and from different points of view. In surveillance scenarios, the dataset can be exploited to evaluate people reacquisition, 3D body models and people activity reconstruction algorithms. In forensics it can be adopted too, by relaxing some constraints (e.g. real time) and neglecting some information (e.g. calibration). Some results on this new dataset are presented using state of the art methods for people re-identification as a benchmark for future comparisons.

2011 - A Real-Time Embedded Solution for Skew Correction in Banknote Analysis [Relazione in Atti di Convegno]
Rashid, Adnan; Prati, Andrea; Cucchiara, Rita
Several industrial applications do require embedded solutions both for compacting the hardware occupation and reducing energy consumption, and for achieving high speed performance. This paper presents a computer vision system developed for correcting image skew in applications for banknote analysis and classification. The system must be very efficient and run on a fixed-point DSP with limited computational resources. Consequently, we propose three innovative improvements to basic and general-purpose image processing techniques that can be helpful in other computer vision applications on embedded devices. In particular, we address: a) an efficient labeling with a union-find approach for hole filling, b) a fast Hough transform implementation, and c) a very high-speed estimation of the affine transformation for skew correction. The reported results demonstrate both the accuracy and the efficiency of the system, also in the presence of severe skew. In terms of efficiency, the computational time is reduced by about two orders of magnitude.
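
Point a) above, labeling with union-find for hole filling, can be illustrated generically as follows. This is a textbook-style sketch, not the paper's fixed-point DSP implementation: it assumes a binary image stored as nested lists, 4-connectivity, and a hypothetical virtual "border" node, so that any background region not connected to the image border is treated as a hole and filled.

```python
from typing import List


def find(parent: List[int], x: int) -> int:
    """Return the representative of x, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent: List[int], a: int, b: int) -> None:
    """Merge the sets containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def fill_holes(img: List[List[int]]) -> List[List[int]]:
    """Set to 1 every background (0) region that does not touch the border."""
    h, w = len(img), len(img[0])
    border = h * w  # virtual node standing for "outside the image"
    parent = list(range(h * w + 1))
    # Single raster scan: link each background pixel to its left and
    # upper background neighbours, and to the border node if on the edge.
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            idx = y * w + x
            if y == 0 or x == 0 or y == h - 1 or x == w - 1:
                union(parent, idx, border)
            if x > 0 and img[y][x - 1] == 0:
                union(parent, idx, idx - 1)
            if y > 0 and img[y - 1][x] == 0:
                union(parent, idx, idx - w)
    outside = find(parent, border)
    return [
        [1 if img[y][x] != 0 or find(parent, y * w + x) != outside else 0
         for x in range(w)]
        for y in range(h)
    ]


ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(ring)
```

The scan touches each pixel once and the union-find operations are nearly constant time, which is why this style of labeling suits devices with limited computational resources.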

2011 - A Reasoning Engine for Intruders' Localization in Wide Open Areas using a Network of Cameras and RFIDs [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Haider, Razia; Mandreoli, Federica; Martoglia, Riccardo; Prati, Andrea; Sassatelli, Simona
Wide open areas represent challenging scenarios for surveillance systems, since sensory data can be affected by noise, uncertainty, and distractors. Therefore, the tasks of localizing and identifying targets (e.g., people) in such environments suggest going beyond the use of camera-only deployments. In this paper, we propose an innovative system relying on the joint use of cameras and RFIDs, allowing us to “map” RFID tags to people detected by cameras and, thus, to highlight potential intruders. To this end, sophisticated filtering techniques preserve the uncertainty of data and overcome the heterogeneity of sensors, while an evidential fusion architecture, based on the Transferable Belief Model, combines the two sources of information and manages conflict between them. The conducted experimental evaluation shows very promising results.

2011 - A low-cost system and calibration method for veiling luminance measurement [Relazione in Atti di Convegno]
S. Cattini; C. Grana; R. Cucchiara; L. Rovati
A CCD-based measuring instrument aimed at veiling luminance estimation and the related low-cost calibration method are described. The system may allow the estimation of the optimum luminance levels in road-tunnel lighting, thus both increasing drivers' safety and avoiding wasted energy and hence unjustified higher lighting costs.

2011 - An Evidential Fusion Architecture for People Surveillance in Wide Open Areas [Relazione in Atti di Convegno]
Fornaciari, Michele; Davide, Sottara; Prati, Andrea; Paola, Mello; Cucchiara, Rita
A new evidential fusion architecture is proposed to build a hybrid artificial intelligent system for people surveillance in wide open areas. Authorized people and intruders are identified and localized thanks to the joint employment of cameras and RFID tags. Complex Event Processing and the Transferable Belief Model are exploited for handling noisy data and uncertainty propagation. Experimental results on complex synthetic scenarios demonstrate the accuracy of the proposed solution.

2011 - Appearance tracking by transduction in surveillance scenarios [Relazione in Atti di Convegno]
D. Coppi; S. Calderara; R. Cucchiara
We propose a formulation of the people tracking problem as a Transductive Learning (TL) problem. TL is an effective semi-supervised learning technique by which many classification problems have been recently reinterpreted as learning labels from incomplete datasets. In our proposal the joint exploitation of spectral graph theory and Riemannian manifold learning tools leads to the formulation of a robust approach for appearance based tracking in Video Surveillance scenarios. The key advantage of the presented method is a continuously updated model of the tracked target, used in the TL process, that allows the target visual appearance to be learned on-line and consequently improves the tracker accuracy. Experiments on public datasets show an encouraging advancement over alternative state-of-the-art techniques.

2011 - Automatic segmentation of digitalized historical manuscripts [Articolo su rivista]
C. Grana; D. Borghesani; R. Cucchiara
The artistic content of historical manuscripts provides a lot of challenges in terms of automatic text extraction, picture segmentation and retrieval by similarity. In particular this work addresses the problem of automatic extraction of meaningful pictures, distinguishing them from handwritten text and floral and abstract decorations. The proposed solution firstly employs a circular statistics description of a directional histogram in order to extract text. Then visual descriptors are computed over the pictorial regions of the page: the semantic content is distinguished from the decorative parts using color histograms and a novel texture feature called Gradient Spatial Dependency Matrix. The feature vectors are finally processed using an embedding procedure which allows increased performance in later SVM classification. Results for both feature extraction and embedding based classification are reported, supporting the effectiveness of the proposal on high resolution replicas of artistic manuscripts.

2011 - Contextual Information and Covariance Descriptors for People Surveillance: An Application for Safety of Construction Workers [Articolo su rivista]
G. Gualdi; A. Prati; R. Cucchiara
In computer science, contextual information can be used both to reduce computations and to increase accuracy. This paper discusses how it can be exploited for people surveillance in very cluttered environments in terms of perspective (i.e., weak scene calibration) and appearance of the objects of interest (i.e., relevance feedback on the training of a classifier). These techniques are applied to a pedestrian detector that uses a LogitBoost classifier, appropriately modified to work with covariance descriptors which lie on Riemannian manifolds. On each detected pedestrian, a similar classifier is employed to obtain a precise localization of the head. Two novelties on the algorithms are proposed in this case: polar image transformations to better exploit the circular feature of the head appearance and multispectral image derivatives that catch not only luminance but also chrominance variations. The complete approach has been tested on the surveillance of a construction site to detect workers that do not wear the hard hat: in such scenarios, the complexity and dynamics are very high, making pedestrian detection a real challenge.

2011 - Detecting Anomalies in People’s Trajectories using Spectral Graph Analysis [Articolo su rivista]
Simone Calderara; Uri Heinemann; Andrea Prati; Rita Cucchiara; Naftali Tishby
Video surveillance is becoming the technology of choice for monitoring crowded areas for security threats. While video provides ample information for human inspectors, there is a great need for robust automated techniques that can efficiently detect anomalous behavior in streaming video from single or multiple cameras. In this work we synergistically combine two state-of-the-art methodologies. The first is the ability to track and label single person trajectories in a crowded area using multiple video cameras, and the second is a new class of novelty detection algorithms based on spectral analysis of graphs. By representing the trajectories as sequences of transitions between nodes in a graph, shared individual trajectories capture only a small subspace of the possible trajectories on the graph. This subspace is characterized by large connected components of the graph, which are spanned by the eigenvectors with the low eigenvalues of the graph Laplacian matrix. Using this technique, we develop robust invariant distance measures for detecting anomalous trajectories, and demonstrate their application on real video data.

2011 - Energy-efficient Feedback Tracking on Embedded Smart Cameras by Hardware-level Optimization [Relazione in Atti di Convegno]
Casares, M.; Santinelli, Paolo; Velipasalar, S.; Prati, Andrea; Cucchiara, Rita
Embedded systems have limited processing power, memory and energy. When camera sensors are added to an embedded system, the problem of limited resources becomes even more pronounced. In this paper, we introduce two methodologies to increase the energy-efficiency and battery-life of an embedded smart camera by hardware-level operations when performing object detection and tracking. The CITRIC platform is employed as our embedded smart camera. First, down-sampling is performed at hardware level on the micro-controller of the image sensor rather than performing software-level down-sampling at the main microprocessor of the camera board. In addition, instead of performing object detection and tracking on the whole image, we first estimate the location of the target in the next frame, form a search region around it, then crop the next frame by using the HREF and VSYNC signals at the micro-controller of the image sensor, and perform detection and tracking only in the cropped search region. Thus, the amount of data that is moved from the image sensor to the main memory at each frame is optimized. Also, we can adaptively change the size of the cropped window during tracking depending on the object size. Reducing the amount of transferred data, better use of the memory resources, and delegating image down-sampling and cropping tasks to the micro-controller on the image sensor result in a significant decrease in energy consumption and increase in battery-life. Experimental results show that hardware-level down-sampling and cropping, and performing detection and tracking in cropped regions provide a 41.24% decrease in energy consumption and a 107.2% increase in battery-life. Compared to performing software-level down-sampling and processing whole frames, the proposed methodology provides an additional 8 hours of continuous processing on 4 AA batteries, increasing the lifetime of the camera to 15.5 hours.

2011 - Energy-efficient Object Detection and Tracking on Embedded Smart Cameras by Hardware-level Operations at the Image Sensor [Relazione in Atti di Convegno]
M. Casares; P. Santinelli; S. Velipasalar; A. Prati; R. Cucchiara
Embedded smart cameras have limited processing power, memory and energy. In this paper, we introduce two methodologies to increase the energy-efficiency and the battery-life of an embedded smart camera by hardware-level operations when performing object detection and tracking. We use the CITRIC platform as our embedded smart camera. We first perform down-sampling at hardware-level on the microcontroller of the image sensor rather than performing software-level down-sampling at the main microprocessor of the camera board. In addition, instead of performing object detection on the whole image, we first estimate the location of the target in the next frame, form a search region around it, then crop the next frame by using the HREF and VSYNC signals at the microcontroller of the image sensor, and perform detection and tracking only in the cropped search region. Thus, the amount of data that is moved from the image sensor to the main memory at each frame is greatly reduced. Thanks to reduced data transfer, better use of the memory resources and not occupying the main microprocessor with image down-sampling and cropping tasks, we obtain significant savings in energy consumption and battery-life. Experimental results show that hardware-level down-sampling and cropping, and performing detection in cropped regions provide a 54.14% decrease in energy consumption and a 121.25% increase in battery-life compared to performing software-level down-sampling and processing the whole frame.

2011 - Feature Space Warping Relevance Feedback with Transductive Learning [Relazione in Atti di Convegno]
D. Borghesani; D. Coppi; C. Grana; S. Calderara; R. Cucchiara
Relevance feedback is a widely adopted approach to improve content-based information retrieval systems by keeping the user in the retrieval loop. Among the fundamental relevance feedback approaches, feature space warping has been proposed as an effective approach for bridging the gap between high-level semantics and the low-level features. Recently, a combination of feature space warping and query point movement techniques has been proposed in contrast to learning-based approaches, showing good performance under different data distributions. In this paper we propose to merge feature space warping and transductive learning, in order to benefit from both the ability of adapting data to the user hints and the information coming from unlabeled samples. Experimental results on an image retrieval task reveal significant performance improvements from the proposed method.

2011 - Identification of Intruders in Groups of People using Cameras and RFIDs [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Haider, Razia; Mandreoli, Federica; Prati, Andrea
The identification of intruders in groups of people moving in wide open areas represents a challenging scenario where coordination between cameras can be certainly used but this solution is not enough. In this paper, we propose to go beyond pure vision-based approaches by integrating the use of distributed cameras with the RFID technology. To this end, we introduce a system that “maps” RFID tags to people detected by cameras by using sophisticated techniques to filter the singular modalities and an evidential fusion architecture, based on Transferable Belief Model, to combine the two sources of information and manage conflict between them. The conducted experimental evaluation shows very promising results, especially in treating groups of people.

2011 - Iterative active querying for surveillance data retrieval in crime detection and forensics [Relazione in Atti di Convegno]
D. Coppi; S. Calderara; R. Cucchiara
Large sets of visual data are now available, both in real time and off line, at the time of investigation in multimedia forensics; however, passive querying systems often encounter difficulties in retrieving significant results. In this paper we propose an iterative active querying system for video surveillance and forensic applications based on the continuous interaction between the user and the system. The positive and negative user feedbacks are exploited as the input of a graph-based transductive procedure for iteratively refining the initial query results. Experiments are shown using people trajectories and people appearance as distance metrics.

2011 - Markerless Body Part Tracking for Action Recognition [Articolo su rivista]
S. Calderara; A. Prati; R. Cucchiara
This paper presents a method for recognising human actions by tracking body parts without using artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians (MoG) is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The evolution of the mixture in time is analysed by transforming it into a sequence of symbols (corresponding to a MoG). The similarity between actions is computed by applying global alignment and dynamic programming techniques to the corresponding sequences and using a variational approximation of the Kullback-Leibler divergence to measure the dissimilarity between two MoGs. Experiments on publicly available datasets and comparison with existing methods are provided.

2011 - Mixtures of von Mises Distributions for People Trajectory Shape Analysis [Articolo su rivista]
S. Calderara; A. Prati; R. Cucchiara
People trajectory analysis is a recurrent task in many pattern recognition applications, such as surveillance, behavior analysis, video annotation, and many others. In this paper we propose a new framework for analyzing trajectory shape, invariant to spatial shifts of the people motion in the scene. In order to cope with the noise and the uncertainty of the trajectory samples, we propose to describe the trajectories as a sequence of angles modelled by distributions of circular statistics, i.e. a mixture of von Mises (MovM) distributions. To deal with MovM, we define a new specific EM algorithm for estimating the parameters and derive a closed form of the Bhattacharyya distance between single vM pdfs. Trajectories are then modelled with a sequence of symbols, corresponding to the most suitable distribution in the mixture, and compared with each other after a global alignment procedure to cope with trajectories of different lengths. The trajectories in the training set are clustered according to their shape similarity in an offline phase, and testing trajectories are then classified with a specific on-line EM, based on sufficient statistics. The approach is particularly suitable for classifying people trajectories in video surveillance, searching for abnormal (i.e. infrequent) paths. Tests on synthetic and real data are provided, along with a complete comparison with other circular statistical and alignment methods.
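
The angle-sequence representation mentioned in this abstract can be illustrated with a minimal sketch. It only shows why a sequence of step directions is unaffected by a spatial shift, and how a mean direction is computed with circular statistics; the function names and the toy trajectory are illustrative assumptions, and the mixture of von Mises fit and the EM procedure of the paper are not reproduced here.

```python
import math

def direction_angles(points):
    """Direction angle (radians) of each step between consecutive points."""
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def circular_mean(angles):
    """Mean direction of a set of angles, computed on the unit circle."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

# Hypothetical toy trajectory and a spatially shifted copy of it.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
shifted = [(x + 5.0, y - 3.0) for x, y in trajectory]

angles = direction_angles(trajectory)
shifted_angles = direction_angles(shifted)
mean_direction = circular_mean(angles)
```

Because only differences between consecutive points enter the computation, `angles` and `shifted_angles` coincide, which is the shift invariance the abstract refers to.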

2011 - Multi-view people surveillance using 3D information [Relazione in Atti di Convegno]
D. Baltieri; R. Vezzani; R. Cucchiara; A. Utasi; C. Benedek; T. Sziranyi
In this paper we introduce a novel surveillance system, which uses 3D information extracted from multiple cameras to detect, track and re-identify people. The detection method is based on a 3D Marked Point Process model using two pixel-level features extracted from multi-plane projections of binary foreground masks, and uses a stochastic optimization framework to estimate the position and the height of each person. We apply a rule-based Kalman-filter tracking on the detection results to find the object-to-object correspondence between consecutive time steps. Finally, a 3D body model based long-term tracking module connects broken tracks and is also used to re-identify people.

2011 - Optimal Decision Trees Generation from OR-Decision Tables [Relazione in Atti di Convegno]
C. Grana; M. Montangero; D. Borghesani; R. Cucchiara
In this paper we present a novel dynamic programming algorithm to synthesize an optimal decision tree from OR-decision tables, an extension of standard decision tables that allows choosing between several alternative actions in the same rule. Experiments are reported, showing the computational time improvements over state-of-the-art implementations of connected components labeling using this modelling technique.

2011 - People appearance tracing in video by spectral graph transduction [Relazione in Atti di Convegno]
D. Coppi; S. Calderara; R. Cucchiara
Following people in different video sources is a challenging task: variations in the type of camera, in the lighting conditions, in the scene settings (e.g. crowd or occlusions) and in the point of view must be accounted for. In this paper we propose a system based only on appearance information that, disregarding temporal and spatial information, can be flexibly applied to both moving and static cameras. We exploit the joint use of transductive learning and spectral properties of graph Laplacians, proposing a formulation of the people tracing problem as a semi-supervised classification. The knowledge encoded in two labeled input sets of positive and negative samples of the target person and the continuous spectral update of these models allow us to obtain a robust approach for people tracing in surveillance video sequences. Experiments on publicly available datasets show satisfactory results and exhibit a good robustness in dealing with short and long term occlusions.

2011 - Probabilistic people tracking with appearance models and occlusion classification: The AD-HOC system [Articolo su rivista]
Vezzani, Roberto; Grana, Costantino; Cucchiara, Rita
AD-HOC (Appearance Driven Human tracking with Occlusion Classification) is a complete framework for multiple people tracking in video surveillance applications in presence of large occlusions. The appearance-based approach allows the estimation of the pixel-wise shape of each tracked person even during the occlusion. This peculiarity can be very useful for higher level processes, such as action recognition or event detection. A first step predicts the position of all the objects in the new frame while a MAP framework provides a solution for best placement. A second step associates each candidate foreground pixel to an object according to mutual object position and color similarity. A novel definition of non-visible regions accounts for the parts of the objects that are not detected in the current frame, classifying them as dynamic, scene or apparent occlusions. Results on surveillance videos are reported, using in-house produced videos and the PETS2006 test set.

2011 - Relevance feedback strategies for artistic image collections tagging [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
This paper provides an analysis of relevance feedback techniques in a multimedia system designed for the interactive exploration and annotation of artistic collections, in particular illuminated manuscripts. The relevance feedback is presented not only as a very effective technique to improve the performance of the system, but also as a clever way to increase the user experience, mixing the interactive surfing through the artistic content with the possibility to gather valuable information from the user, and consequently improving the user's retrieval satisfaction. We compare a modification of the Mean-Shift Feature Space Warping algorithm, as representative of the standard RF procedures, and a learning-based technique based on transduction, considered in order to overcome some limitations of the previous technique. Experiments are reported regarding the adopted visual features based on covariance matrices.

2011 - SARC3D: a new 3D body model for People Tracking and Re-identification [Relazione in Atti di Convegno]
Davide Baltieri; Roberto Vezzani; Rita Cucchiara
We propose a new simplified 3D body model (called Sarc3D) for surveillance applications that can be created, updated and compared in real-time. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. For each vertex of the model a signature (color features, reliability and saliency) is computed from the 2D appearance images and exploited for matching. This approach achieves robustness against partial occlusions, pose and viewpoint changes. The complete proposal and a full experimental evaluation are presented, using a new benchmark suite and the PETS2009 dataset.

2011 - Using Monolithic Classifiers On Multi-stage Pedestrian Detection [Relazione in Atti di Convegno]
G. Gualdi; A. Prati; R. Cucchiara
Despite the many efforts in finding effective feature sets or accurate classifiers for people detection, few works have addressed ways of reducing the computational burden introduced by the sliding window paradigm. This paper proposes a multi-stage procedure for refining the search for pedestrians using the HOG features and the monolithic SVM classifier. The multi-stage procedure is based on particle-based estimation of pdfs and exploits the margin provided by the classifier to draw more particles on the areas where the classifier’s response is higher. This iterative algorithm achieves the same accuracy as sliding window using fewer particles (and is thus more efficient) and, conversely, is more accurate when configured to work at the same computational load. Experimental results on publicly available datasets demonstrate that this method, previously proposed for boosted classifiers only, can be successfully applied to monolithic classifiers.

2011 - Vision based smoke detection system using image energy and color information [Articolo su rivista]
S. Calderara; P. Piccinini; R. Cucchiara
Smoke detection is a crucial task in many video surveillance applications and could have a great impact in raising the level of safety of urban areas. Many commercial smoke detection sensors exist but most of them cannot be applied in open space or outdoor scenarios. With this aim, the paper presents a smoke detection system that uses a common CCD camera sensor to detect smoke in images and trigger alarms. First, a proper background model is proposed to reliably extract smoke regions and avoid over-segmentation and false positives in outdoor scenarios where many distractors are present, such as moving trees or light reflections. A novel Bayesian approach is adopted to detect smoke regions in the scene, analyzing image energy by means of the Wavelet Transform coefficients and color information. A statistical model of image energy is built, using a temporal Gaussian Mixture, to analyze the energy decay that typically occurs when smoke covers the scene; the detection is then strengthened by evaluating the color blending between a reference smoke color and the input frame. The proposed system is capable of rapidly detecting smoke events in both night and day conditions with a reduced number of false alarms, hence it is particularly suitable for monitoring large outdoor scenarios where common sensors would fail. An extensive experimental campaign on both recorded videos and live cameras evaluates the efficacy and efficiency of the system in many real world scenarios, such as outdoor storages and forests.

2010 - 3D Body Model Construction and Matching for Real Time People Re-Identification [Relazione in Atti di Convegno]
Davide Baltieri; Roberto Vezzani; Rita Cucchiara
Wide area video surveillance always requires extracting and integrating information coming from different cameras and views. Re-identification of people captured from different cameras or different views is one of the most challenging problems. In this paper, we present a novel approach for people matching with vertices-based 3D human models. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. Colour features are computed from the 2D appearance images and mapped to the 3D model vertices, generating the 3D model for each tracked person. A distance function between 3D models is defined in order to find matches among models belonging to the same person. This approach achieves robustness against partial occlusions, pose and viewpoint changes. A first experimental evaluation is conducted using images extracted from a real camera set-up.

2010 - Alignment-based Similarity of People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
This paper presents a method for comparing people trajectories for video surveillance applications, based on semi-directional statistics. In fact, the modelling of a trajectory as a sequence of angles, speeds and time lags requires the use of a statistical tool capable of jointly considering periodic and linear variables. Our statistical method is compared with two state-of-the-art methods.

2010 - Bag-Of-Words Classification of Miniature Illustrations [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; G. Gualdi; R. Cucchiara
In this paper a system for illuminated manuscripts images analysis is presented. In particular the bag-of-keypoints strategy, commonly adopted for object recognition, image classification and scene recognition, is applied to the classification of automatically extracted miniatures. Pictures are characterized by SURF descriptors, and a classification procedure is performed, comparing the results of Naive Bayes and histogram intersection distance measures.
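
As a minimal sketch of one of the two measures named in this abstract, histogram intersection between two bag-of-keypoints histograms can be written as follows. The four-word vocabulary and the counts are hypothetical, and the SURF extraction, vocabulary construction and Naive Bayes classifier of the paper are not shown.

```python
def normalize(counts):
    """Turn raw visual-word counts into a histogram that sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no entries")
    return [c / total for c in counts]

def histogram_intersection(h1, h2):
    """Similarity of two histograms: sum of the bin-wise minima."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical visual-word counts for two images (vocabulary of 4 words).
image_a = normalize([2, 0, 1, 1])
image_b = normalize([1, 1, 1, 1])

similarity = histogram_intersection(image_a, image_b)
self_similarity = histogram_intersection(image_a, image_a)
```

For normalized histograms the value lies between 0 and 1, and a histogram compared with itself scores 1.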

2010 - Decision Trees for Fast Thinning Algorithms [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
We propose a new efficient approach for neighborhood exploration, optimized with decision tables and decision trees, suitable for local algorithms in image processing. In this work, it is employed to speed up two widely used thinning techniques. The performance gain is shown over a large freely available dataset of scanned document images.

2010 - Event Driven Software Architecture for Multi-camera and Distributed Surveillance Research Systems [Relazione in Atti di Convegno]
R. Vezzani; R. Cucchiara
Surveillance of wide areas with several connected cameras integrated in the same automatic system is no longer a chimera, but modular, scalable and flexible architectures are mandatory to manage them. This paper points out the main issues in the development of distributed surveillance systems and proposes an integrated framework particularly suitable for research purposes. First, exploiting a computer architecture analogy, a three-layer tracking system is proposed, which copes with the integration of both overlapping and non-overlapping cameras. Then, a static service-oriented architecture is adopted to collect and manage the plethora of high-level modules, such as face detection and recognition, posture and action classification, and so on. Finally, the overall architecture is controlled by an event-driven communication infrastructure, which assures the scalability and the flexibility of the system.

2010 - Fast Background Initialization with Recursive Hadamard Transform [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
In this paper, we present a new and fast technique for background estimation from cluttered image sequences. Most of the background initialization approaches developed so far collect a number of initial frames and then require a slow estimation step which introduces a delay whenever it is applied. Conversely, the proposed technique redistributes the computational load among all the frames by means of a patch by patch preprocessing, which makes the overall algorithm more suitable for real-time applications. For each patch location a prototype set is created and maintained. The background is then iteratively estimated by choosing from each set the most appropriate candidate patch, which should verify a sort of frequency coherence with its neighbors. To this aim, the Hadamard transform has been adopted which requires less computation time than the commonly used DCT. Finally, a refinement step exploits spatial continuity constraints along the patch borders to prevent erroneous patch selections. The approach has been compared with the state of the art on videos from available datasets (ViSOR and CAVIAR), showing a speed up of about 10 times and an improved accuracy.
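
The computational appeal of the Hadamard transform can be seen in a minimal sketch of the standard fast Walsh-Hadamard butterfly on a 1D signal, which uses only additions and subtractions. This is the textbook in-place algorithm, shown under the assumption of a power-of-two input length; it is not the recursive patch-based formulation of the paper, and the sample signal is illustrative.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine blocks of size h pairwise with sum/difference butterflies.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
# Applying the transform twice returns the input scaled by its length.
roundtrip = [v // len(signal) for v in fwht(spectrum)]
```

The transform needs n log2(n) additions or subtractions and no multiplications, and applying it twice recovers the input up to a factor of n.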

2010 - HMM Based Action Recognition with Projection Histogram Features [Relazione in Atti di Convegno]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
Hidden Markov Models (HMM) have been widely used for action recognition, since they make it easy to model the temporal evolution of a single numeric feature or a set of numeric features extracted from the data. The selection of the feature set and the related emission probability function are the key issues to be defined. In particular, if the training set is not sufficiently large, a manual or automatic feature selection and reduction is mandatory. In this paper we propose to model the emission probability function as a Mixture of Gaussians, and the feature set is obtained from the projection histograms of the foreground mask. The projection histograms contain the number of moving pixels for each row and for each column of the frame and they provide sufficient information to infer the instantaneous posture of the person. Then, the HMM framework recovers the temporal evolution of the postures, thereby recognizing the global action. The proposed method has been successfully tested on the UT-Tower and on the Weizmann Datasets.
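
The projection histograms described in this abstract can be sketched in a few lines: for a binary foreground mask, count the moving pixels in each row and in each column. The tiny mask below is a hypothetical example, and the Mixture of Gaussians emission model and the HMM itself are not shown.

```python
def projection_histograms(mask):
    """Return (row_counts, column_counts) of non-zero pixels in a binary mask."""
    row_counts = [sum(1 for v in row if v) for row in mask]
    col_counts = [sum(1 for v in col if v) for col in zip(*mask)]
    return row_counts, col_counts

# Hypothetical 4x4 foreground mask (1 = moving pixel).
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

rows, cols = projection_histograms(mask)
```

Both histograms sum to the total number of foreground pixels, so the pair acts as a compact per-frame description of the silhouette.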

2010 - High Performance Connected Components Labeling on FPGA [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; P. Santinelli; R. Cucchiara
This paper proposes a comparison of the two most advanced algorithms for connected components labeling, highlighting how they perform on a soft core SoC architecture based on FPGA. In particular we test our block based connected components labeling algorithm, optimized with decision tables and decision trees. The embedded system is composed of the CMOS image sensor, FPGA, DDR SDRAM, USB controller and SPI Flash. Results highlight the importance of caching and instructions and data cache sizes for high performance image processing tasks.

2010 - Improving classification and retrieval of illuminated manuscripts with semantic information [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
In this paper we detail a proposal of exploitation of expert-made commentaries in a unified system for illuminated manuscripts images analysis. In particular we will explore the possibility to improve the automatic segmentation of meaningful pictures, as well as the retrieval by similarity search engine, using clusters of keywords extracted from commentaries as semantic information.

2010 - Moving pixels in static cameras: detecting dangerous situations due to environment or people [Capitolo/Saggio]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
Dangerous situations arise in everyday life and many efforts have been made to exploit technology to increase the level of safety in urban areas. Video analysis is one of the most important emerging technologies for security purposes. Automatic video surveillance systems commonly analyze the scene searching for moving objects. Well-known techniques exist to cope with this problem, which is commonly referred to as "change detection". Every time a difference against a reference model is sensed, it should be analyzed to allow the system to discriminate between a usual situation and a possible threat. When the sensor is a camera, motion is the key element to detect changes, and moving objects must be correctly classified according to their nature. In this context we can distinguish between two different kinds of threat that can lead to dangerous situations in a video-surveilled environment. The first one is due to environmental changes such as rain, fog or smoke present in the scene. These phenomena are sensed by the camera as moving pixels and, subsequently, as moving objects in the scene. Threats of this kind share some common characteristics such as texture, shape and color information and can be detected by observing the features' evolution in time. The second situation arises when people are directly responsible for the dangerous situation. In this case a subject is acting in an unusual way, leading to an abnormal situation. From the sensor's point of view, moving pixels are still observed, but specific features and time-dependent statistical models should be adopted to learn and then correctly detect unusual and dangerous behaviors. With these premises, this chapter will present two different case studies.
The first one describes the detection of environmental changes in the observed scene and details the problem of reliably detecting smoke in outdoor environments using both motion information and global image features, such as color information and texture energy computed by means of the Wavelet transform. The second refers to the problem of detecting suspicious or abnormal people behaviors by means of people trajectory analysis in a multiple-camera video-surveillance scenario. Specifically, a technique to infer and learn the concept of normality is proposed jointly with a suitable statistical tool to model and robustly compare people trajectories.

2010 - Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
Many works address the problem of object detection by means of machine learning with boosted classifiers. They exploit sliding window search, spanning the whole image: the patches, at all possible positions and sizes, are sent to the classifier. Several methods have been proposed to speed up the search (adding complementary features or using specialized hardware). In this paper we propose a statistical-based search approach for object detection which uses a Monte Carlo sampling approach for estimating the likelihood density function with Gaussian kernels. The estimation relies on a multi-stage strategy where the proposal distribution is progressively refined by taking into account the feedback of the classifier (i.e. its response). For videos, this approach is plugged in a Bayesian-recursive framework which exploits the temporal coherency of the pedestrians. Several tests on both still images and videos on common datasets are provided in order to demonstrate the relevant speedup and the increased localization accuracy with respect to sliding window strategy using a pedestrian classifier based on covariance descriptors and a cascade of Logitboost classifiers.

2010 - Mutual Calibration of Camera Motes and RFIDs for People Localization and Identification [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Prati, Andrea; Santinelli, Paolo
Achieving both localization and identification of people in a wide open area using only cameras can be a challenging task, which imposes conflicting requirements: high resolution for identification, but low resolution for wide coverage in localization. Consequently, this paper proposes the joint use of cameras (only devoted to localization) and RFID sensors (devoted to identification) with the final objective of detecting and localizing intruders. To ground the observations on a common coordinate system, a calibration procedure is defined. This procedure only demands a training phase with a single person moving in the scene holding an RFID tag. Although preliminary, the results demonstrate that this calibration is sufficiently accurate to be applied in different scenarios where the area of overlap between the field of view (FoV) of a camera and the "field of sense" (FoS) of a (blind) sensor must be efficiently determined.

2010 - Optimized Block-based Connected Components Labeling with Decision Trees [Articolo su rivista]
C. Grana; D. Borghesani; R. Cucchiara
In this paper we define a new paradigm for 8-connection labeling, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses. Firstly we exploit and extend the decision table formalism introducing OR-decision tables, in which multiple alternative actions are managed. An automatic procedure to synthesize the optimal decision tree from the decision table is used, providing the most effective conditions evaluation order. Secondly we propose a new scanning technique that moves on a 2x2 pixel grid over the image, which is optimized by the automatically generated decision tree. An extensive comparison with the state-of-the-art approaches is proposed, both on synthetic and real datasets. The synthetic dataset is composed of random images of different sizes and densities, while the real datasets are an artistic image analysis dataset, a document analysis dataset for text detection and recognition, and finally a standard resolution dataset for picture segmentation tasks. The algorithm provides an impressive speedup over the state-of-the-art algorithms.

2010 - People trajectory mining with statistical pattern recognition [Relazione in Atti di Convegno]
S. Calderara; R. Cucchiara
abstract

The analysis of people's social interactions is a complex and interesting problem that can be approached from several points of view depending on the application context. In video surveillance contexts many indicators of people's habits and relations exist and, among these, the analysis of people's trajectories can reveal many aspects of the way people behave in social environments. We propose a statistical framework for trajectory mining that analyzes, in an integrated solution, several aspects of the trajectories such as location, shape and speed properties. Three different models are proposed to deal with non-idealities of the selected features, in conjunction with a robust inexact-matching similarity measure for comparing sequences of different lengths. Experimental results in a real scenario demonstrate the efficacy of the framework in clustering people's trajectories with the purpose of analyzing frequent behaviors in complex environments.

2010 - Perspective and Appearance Context for People Surveillance in Open Areas [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

Contextual information can be used both to reduce computations and to increase accuracy and this paper presents how it can be exploited for people surveillance in terms of perspective (i.e. weak scene calibration) and appearance of the objects of interest (i.e. relevance feedback on the training of a classifier). These techniques are applied to a pedestrian detector that exploits covariance descriptors through a LogitBoost classifier on Riemannian manifolds. The approach has been tested on a construction working site where complexity and dynamics are very high, making human detection a real challenge. The experimental results demonstrate the improvements achieved by the proposed approach.

2010 - Polar Representation of Covariance Descriptors for Circular Features [Articolo su rivista]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

The use of polar representation of covariance descriptors, suitable for the classification of circular feature sets, is proposed. It overcomes the implicit limits of state-of-the-art methods based on axis-oriented rectangular patches. The suitability of the proposed solution is verified on two case studies, namely head detection and polymer classification in photomicrograph contexts.

2010 - Rerum Novarum: Interactive Exploration of Illuminated Manuscripts [Relazione in Atti di Convegno]
D. Borghesani; C. Grana; R. Cucchiara
abstract

This paper describes an interactive application for the exploration and annotation of illuminated manuscripts, which typically contain thousands of pictures, used to comment or embellish the manuscript Gothic text. The system is composed of a modern user interface for browsing, surfing and querying, an automatic segmentation module, to ease the initial picture extraction task, and a similarity-based retrieval engine, used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results.

2010 - Surfing on Artistic Documents with Visually Assisted Tagging [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

This paper describes a complete architecture for the interactive exploration and annotation of artistic collections. In particular, the focus is on Renaissance illuminated manuscripts, which typically contain thousands of pictures, used to comment or embellish the manuscript Gothic text. The final aim is to create a human-centered multimedia application that allows non-practitioners to enjoy these masterpieces and expert users to share their knowledge. The system is composed of a modern user interface for browsing, surfing and querying, an automatic segmentation module, to ease the initial picture extraction task, and a similarity-based retrieval engine, used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results. Experiments are reported regarding the adopted visual features based on covariance matrices and the Mean Shift Feature Space Warping relevance feedback. Finally, some hints on the user interface for museum installations are discussed.

2010 - Unsupervised Learning in Body-area Networks [Relazione in Atti di Convegno]
Bicocchi, Nicola; Lasagni, Matteo; Mamei, Marco; Prati, Andrea; Cucchiara, Rita; Zambonelli, Franco
abstract

Pattern recognition is becoming a key application in body-area networks. This paper presents a framework promoting unsupervised training for multi-modal, multi-sensor classification systems. Specifically, it enables sensors provided with pattern-recognition capabilities to autonomously supervise the learning process of other sensors. The approach is discussed using a case study combining a smart camera and a body-worn accelerometer. The body-worn accelerometer sensor is trained to recognize four user activities by pairing accelerometer data with labels coming from the camera. Experimental results illustrate the applicability of the approach in different conditions.

2010 - Video Surveillance Online Repository (ViSOR): an integrated framework [Articolo su rivista]
Vezzani, Roberto; Cucchiara, Rita
abstract

The availability of new techniques and tools for Video Surveillance and the capability of storing huge amounts of visual data acquired by hundreds of cameras every day call for a convergence between pattern recognition, computer vision and multimedia paradigms. A clear need for this convergence is shown by new research projects which attempt to exploit both ontology-based retrieval and video analysis techniques also in the field of surveillance. This paper presents the ViSOR (Video Surveillance Online Repository) framework, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Annotations are based on a reference ontology which has been defined integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, which is aimed at identifying the spatial, temporal and domain detail level used. The ViSOR web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.

2010 - Video sorveglianza per l'individuazione di persone e l'analisi comportamentale [Articolo su rivista]
R. Cucchiara
abstract

This article discusses the new frontiers of computer vision in the video surveillance of people in public and private environments, with particular attention to behavioral analysis. Some projects under way at the ImageLab in Modena are then presented.

2009 - A Fast Multi-model Approach for Object Duplicate Extraction [Relazione in Atti di Convegno]
P. Piccinini; A. Prati; R. Cucchiara
abstract

This paper presents an innovative approach for localizing and segmenting duplicate objects for industrial applications. The working conditions are challenging, with complex heavily-occluded objects, arranged at random in the scene. To account for high flexibility and processing speed, this approach exploits SIFT keypoint extraction and mean shift clustering to efficiently partition the correspondences between the object model and the duplicates onto the different object instances. The re-projection (by means of a Euclidean transform) of some delimiting points onto the current image is used to segment the object shapes. This procedure is compared in terms of accuracy with existing homography-based solutions which make use of RANSAC to eliminate outliers in the homography estimation. Moreover, in order to improve the extraction in the case of reflective or transparent objects, multiple object models are used and fused together. Experimental results on different and challenging kinds of objects are reported.
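
The re-projection step can be illustrated with a minimal sketch. It assumes that each matched keypoint carries a position and an orientation, and that the delimiting points are moved by the rotation and translation that map the model keypoint onto the image keypoint. The function name `reproject` and the `(x, y, angle)` tuple layout are hypothetical and are not taken from the paper.

```python
import math

def reproject(points, model_kp, image_kp):
    """Map model-frame points into the image with a Euclidean transform.

    model_kp and image_kp are (x, y, angle_in_radians) tuples describing one
    matched keypoint pair (hypothetical layout, assumed for illustration).
    """
    mx, my, m_angle = model_kp
    ix, iy, i_angle = image_kp
    theta = i_angle - m_angle            # rotation between the two keypoints
    c, s = math.cos(theta), math.sin(theta)
    projected = []
    for x, y in points:
        dx, dy = x - mx, y - my          # offset relative to the model keypoint
        # rotate the offset, then translate to the image keypoint
        projected.append((ix + c * dx - s * dy, iy + s * dx + c * dy))
    return projected
```

Because the transform is Euclidean, distances between the delimiting points are preserved; scale changes between model and image are not handled by this sketch.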

2009 - A Real-Time System for Abnormal Path Detection [Relazione in Atti di Convegno]
S. Calderara; C. Alaimo; A. Prati; R. Cucchiara
abstract

This paper proposes a real-time system capable of extracting and modeling object trajectories from a multi-camera setup with the aim of identifying abnormal paths. The trajectories are modeled as a sequence of positional distributions (2D Gaussians) and clustered in the training phase by exploiting an innovative distance measure based on a global alignment technique and Bhattacharyya distance between Gaussians. An on-line classification procedure is proposed in order to classify new trajectories on the fly as either "normal" or "abnormal" (in the sense of rarely seen before, thus unusual and potentially interesting). Experiments on a real scenario will be presented.
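
As an illustration of one ingredient named above, the following sketch computes the standard closed-form Bhattacharyya distance between two 2D Gaussians. It does not reproduce the paper's full distance measure, which also involves a global alignment of the sequences; the function name and the tuple-based covariance layout are assumptions made for this example.

```python
import math

def bhattacharyya_2d(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 2D Gaussians.

    mu: (x, y); cov: ((a, b), (b, c)), symmetric positive definite.
    """
    def det(m):
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    # average of the two covariance matrices
    s = [[(cov1[i][j] + cov2[i][j]) / 2.0 for j in range(2)] for i in range(2)]
    det_s = det(s)
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # quadratic form d^T S^-1 d, using the explicit inverse of a 2x2 matrix
    quad = (dx * (s[1][1] * dx - s[0][1] * dy)
            + dy * (-s[1][0] * dx + s[0][0] * dy)) / det_s
    # mean term plus covariance term
    return quad / 8.0 + 0.5 * math.log(det_s / math.sqrt(det(cov1) * det(cov2)))
```

The distance is zero for identical Gaussians and grows both when the means move apart and when the covariances differ.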

2009 - An efficient Bayesian framework for on-line action recognition [Relazione in Atti di Convegno]
Vezzani, Roberto; Piccardi, Massimo; Cucchiara, Rita
abstract

On-line action recognition from a continuous stream of actions is still an open problem with fewer solutions proposed compared to time-segmented action recognition. The most challenging task is to classify the current action while finding its time boundaries at the same time. In this paper we propose an approach capable of performing on-line action segmentation and recognition by means of batteries of HMMs, taking into account all the possible time boundaries and action classes. A suitable Bayesian normalization is applied to make observation sequences of different lengths comparable, and computational optimizations are introduced to achieve real-time performance. Results on a well-known action dataset prove the efficacy of the proposed method.

2009 - Automatic Analysis of Historical Manuscripts [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

In this paper a document analysis tool for historical manuscripts is proposed. The goal is to automatically segment the layout components of the page, that is, text, pictures and decorations. We specifically focus on the pictures, proposing a set of visual features able to identify significant pictures and separate them from all the floral and abstract decorations. The analysis is performed by blocks using a limited set of color and texture features, including a new texture descriptor particularly effective for this task, namely the Gradient Spatial Dependency Matrix. The feature vectors are processed by an embedding procedure which allows increased performance in later SVM classification.

2009 - Color features performance comparison for image retrieval [Relazione in Atti di Convegno]
D. Borghesani; C. Grana; R. Cucchiara
abstract

This paper proposes a comparison of color features for image retrieval. In particular the UCID image database has been employed to compare the retrieval capabilities of different color descriptors. The set of descriptors comprises global and spatially related features, and the tests show that HSV based global features provide the best performance at varying brightness and contrast settings.

2009 - Connected component labeling techniques on modern architectures [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

In this paper we present an overview of the historical evolution of connected component labeling algorithms, and in particular the ones applied to images stored in raster scan order. This brief survey aims at providing a comprehensive comparison of their performance on modern architectures, since the high availability of memory and the presence of caches make some solutions more suitable and faster. Moreover, we propose a new strategy for label propagation based on 2x2 blocks, which improves the performance of many existing algorithms. The tests are conducted on high resolution images obtained from digitized historical manuscripts, and a set of transformations is applied in order to show the algorithms' behavior at different image resolutions and with a varying number of labels.

2009 - Covariance Descriptors on Moving Regions for Human Detection in Very Complex Outdoor Scenes [Relazione in Atti di Convegno]
G. Gualdi; A. Prati; R. Cucchiara
abstract

The detection of humans in very complex scenes can be very challenging, due to the performance degradation of classical motion detection and tracking approaches. An alternative approach is the detection of human-like patterns over the whole image. The present paper follows this line by extending Tuzel et al.’s technique [1] based on covariance descriptors and LogitBoost algorithm applied over Riemannian manifolds. Our proposal represents a significant extension of it by: (a) exploiting motion information to focus the attention over areas in which motion is present or was present in the recent past; (b) enriching the human classifier by additional, dedicated cascades trained on positive and negative samples taken from the specific scene; (c) using a rough estimation of the scene perspective, to reduce false detections and improve system performance. This approach is suitable in multi-camera scenarios, since the monolithic block for human-detection remains the same for the whole system, whereas the parameter tuning and set-up of the three proposed extensions (the only camera-dependent parts of the system), are automatically computed for each camera. The approach has been tested on a construction working site in which complexity and dynamics are very high, making human detection a real challenge. The experimental results demonstrate the improvements achieved by the proposed approach.

2009 - Dynamic Pictorially Enriched Ontologies for Digital Video Libraries [Articolo su rivista]
Bertini, M.; Del Bimbo, A.; Serra, Giuseppe; Torniai, C.; Cucchiara, Rita; Grana, Costantino; Vezzani, Roberto
abstract

This article presents a framework for automatic semantic annotation of video streams with an ontology that includes concepts expressed using linguistic terms and visual data.

2009 - Emergent perspectives in artificial intelligence [Monografia/Trattato scientifico]
R. Serra; R. Cucchiara
abstract

Proceedings of the XIth International Conference on Artificial Intelligence

2009 - Fast Block Based Connected Components Labeling [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

In this paper we present a new optimization technique for the neighborhood computation in connected component labeling, focused on images stored in raster scan order. This new technique is based on a 2x2 square block analysis of the image, and it exploits the fact that, when using 8-connection, the pixels of a 2x2 square are all connected to each other. This implies that they will share the same label at the end of the computation. To prove the effectiveness of our proposal, we show a comprehensive comparison of the most used and advanced connected components labeling techniques presented so far. The tests are conducted on high resolution images obtained from digitized historical manuscripts, and a set of transformations is applied in order to show the algorithms' behavior at different image resolutions and with a varying number of labels.
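
The 2x2 property stated above can be checked with a small sketch. The code below is a plain pixel-based two-pass labeling with union-find, used here only as a baseline; it is not the block-based algorithm of the paper. The helper `blocks_share_label` (a hypothetical name) then verifies that all foreground pixels inside any 2x2 square end up with the same label.

```python
def label_8conn(img):
    """Two-pass 8-connected labeling of a binary image (list of lists of 0/1)."""
    h, w = len(img), len(img[0])
    parent = [0]  # union-find forest; index 0 is reserved for background

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            # 8-neighbours already visited in raster order: W, NW, N, NE
            neigh = [labels[ny][nx]
                     for ny, nx in ((y, x - 1), (y - 1, x - 1), (y - 1, x), (y - 1, x + 1))
                     if 0 <= ny and 0 <= nx < w and labels[ny][nx]]
            if not neigh:
                parent.append(len(parent))       # create a new provisional label
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = min(neigh)
                for n in neigh:
                    union(labels[y][x], n)       # record label equivalences
    # second pass: replace each provisional label with its representative
    return [[find(l) if l else 0 for l in row] for row in labels]

def blocks_share_label(labels):
    """True if the foreground pixels of every 2x2 square share one label."""
    h, w = len(labels), len(labels[0])
    for y in range(h - 1):
        for x in range(w - 1):
            fg = {labels[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)} - {0}
            if len(fg) > 1:
                return False
    return True
```

Since the check holds for any 8-connected labeling, a scan can treat each 2x2 square as a single unit, which is the observation the block-based technique builds on.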

2009 - Learning People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a system for people trajectory shape analysis by exploiting a statistical approach which accounts for sequences of both directional (the directions of the trajectory) and linear (the speeds) data. A semi-directional distribution (AWLG - Approximated Wrapped and Linear Gaussian) is used with a mixture to find main directions and speeds. A variational version of the mutual information criterion is proposed to prove the statistical dependency of the data. Then, in order to compare data sequences, we define an inexact method with a Kullback-Leibler-based distance measure and employ a global alignment technique to handle sequences of different lengths and with local shifts or deformations. A comprehensive analysis of variable dependency and parameter estimation techniques is reported and evaluated on both synthetic and real data sets.

2009 - Multiple Object Segmentation for Pick-and-Place Applications [Relazione in Atti di Convegno]
P. Piccinini; A. Prati; R. Cucchiara
abstract

This paper presents a novel approach for detecting multiple instances of the same object for pick-and-place automation. The working conditions are very challenging, with complex objects, arranged at random in the scene, and heavily occluded. This approach exploits SIFT to obtain a set of correspondences between the object model and the current image. In order to segment the multiple instances of the object, the correspondences are clustered among the objects using a voting scheme which determines the best estimate of the object’s center through mean shift. This procedure is compared in terms of accuracy with existing homography-based solutions which make use of RANSAC to eliminate outliers in the homography estimation.
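
The voting idea can be sketched as follows, under stated assumptions: each correspondence is assumed to have already cast a 2D vote for an object center, and a flat-kernel mean shift groups the votes into modes. The bandwidth, the mode-merging rule and the name `mean_shift_modes` are illustrative choices, not details taken from the paper.

```python
import math

def mean_shift_modes(points, bandwidth, iters=50, tol=1e-6):
    """Flat-kernel mean shift over 2D votes.

    Returns a list of (x, y, count): each mode and the number of votes
    that converged to it.
    """
    modes = []
    for px, py in points:
        cx, cy = px, py
        for _ in range(iters):
            # votes inside the window centred on the current estimate
            near = [(x, y) for x, y in points
                    if math.hypot(x - cx, y - cy) <= bandwidth]
            if not near:
                break
            nx = sum(x for x, _ in near) / len(near)
            ny = sum(y for _, y in near) / len(near)
            shift = math.hypot(nx - cx, ny - cy)
            cx, cy = nx, ny                      # move to the window mean
            if shift < tol:
                break
        # merge with an existing mode if the converged point is close to it
        for i, (mx, my, n) in enumerate(modes):
            if math.hypot(mx - cx, my - cy) < bandwidth / 2:
                modes[i] = (mx, my, n + 1)
                break
        else:
            modes.append((cx, cy, 1))
    return modes
```

Each returned mode is a candidate object center, and its count indicates how many correspondences support that instance, so weakly supported modes can be discarded.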

2009 - Pathnodes integration of standalone Particle Filters for people tracking on distributed surveillance systems [Relazione in Atti di Convegno]
R. Vezzani; D. Baltieri; R. Cucchiara
abstract

In this paper, we present a new approach to object tracking based on batteries of particle filters working in multi-camera systems with non-overlapping fields of view. In each view the moving objects are tracked with independent particle filters; each filter exploits a likelihood function based on both color and motion information. The consistent labeling of people exiting from one camera's field of view and entering a neighboring one is obtained by sharing particle information for the initialization of new filtering trackers. The information exchange algorithm is based on path-nodes, which are a graph-based scene representation usually adopted in computer graphics. The approach has been tested even in the case of simultaneous transitions, occlusions, and groups of people. Promising results obtained with a real setup of non-overlapping cameras are presented.

2009 - Picture Extraction from Digitized Historical Manuscripts [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

In this work we propose a system for automatic document segmentation to extract graphical elements from historical manuscripts and then to identify significant pictures from them, removing floral and abstract decorations. The system performs a block based analysis by means of color and texture features. The Gradient Spatial Dependency Matrix, a new texture operator particularly effective for this task, is proposed. The feature vectors are processed by an embedding procedure which allows increased performance in later SVM classification. Results for both feature extraction and embedding based classification are reported, supporting the effectiveness of the proposal.

2009 - Proceedings of International Workshop on Multimedia in forensics [Curatela]
M. Worring; R. Cucchiara
abstract

It is our great pleasure to welcome you to the 1st ACM Workshop on Multimedia in Forensics -- MiFor'09. With the proliferation of multimedia data on the web, surveillance cameras in cities, and mobile phones in everyday life we see an enormous growth in multimedia data that needs to be analyzed by forensic investigators. The sheer volume of such datasets makes manual inspection of all data impossible. Tools are needed to support investigators in their quest for relevant clues and evidence and in their efforts to prevent crime. The multimedia community has developed new solutions for management of large collections of video footage, images, audio and other multimedia content, knowledge extraction and categorization, pattern recognition, indexing and retrieval, searching, browsing and visualization, and modeling and simulation in various domains. Due to the inherent uncertainty and complexity of forensic data, applying those techniques to forensic data is not straightforward. The time is ripe to tailor these results for forensics. Multimedia in Forensics is the workshop whose target is to join the research topics and the applications. The workshop aims at addressing the multimedia toolbox supporting the forensic process from the prevention of crime, capturing and annotation of the crime scene, the investigation of the data in the lab, up to the presentation of the results in court. It is a first attempt at bringing multimedia tools into this exciting application field. The target audience consists of researchers working on innovative technology, representatives from companies developing tools, and forensic investigators in various disciplines. Despite the ambitious objective for the workshop and it being the first edition, it attracted a good number of quality submissions fairly distributed among different countries and among the different topics of the workshop.
The MiFor09 Technical Program Committee includes the most experienced researchers in the related research fields, and thanks to their indispensable effort we were able to select 11 papers for oral presentation. The workshop schedules four oral sessions, named "Detection and Mining", "Multimedia forensics prototypes", "Forgery and Splicing Detection" and "Tracking". In addition, the program includes a keynote address by Professor Mohan Kankanhalli, a distinguished lecturer in the field.

2009 - Statistical Pattern Recognition for Multi-Camera Detection, Tracking and Trajectory Analysis [Capitolo/Saggio]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This chapter will address most of the aspects of modern video surveillance with the reference to the research activity conducted at University of Modena and Reggio Emilia, Italy, within the scopes of the national FREE SURF (FREE SUrveillance in a pRivacy-respectFul way) and NATO-funded BE SAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction) projects. Moving object detection and tracking from a single camera, multi-camera consistent labeling and trajectory shape analysis for path classification will be the main topics of this chapter.

2009 - Video Analysis for Ambient intelligence in Urban Environments [Capitolo/Saggio]
A. Prati; R. Cucchiara
abstract

Ambient Intelligence (AmI) is an emerging field of research that comprises new paradigms, techniques and systems for intelligent processing of distributed sensing. A challenging arena for the AmI framework is represented by urban environments, which are characterized by high complexity, numerous sources of data, and a wide spread of interesting and non-trivial applications. In this context, the project LAICA (Laboratory of Ambient Intelligence for a friendly city) represents a real experiment on the usefulness of AmI for advanced services to citizens. This chapter will address video analysis solutions that can be directly applied in urban AmI. It describes in detail the uniqueness of the LAICA approach, focusing in particular on the use of computer vision techniques for monitoring public parks. People surveillance and web-based video broadcasting will be taken into account.

2009 - Video surveillance and multimedia forensics: an application to trajectory analysis [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper reports an application of trajectory analysis in which forensics and video surveillance techniques are jointly employed to provide a new tool for multimedia forensics. Advanced video surveillance techniques are used to extract from a multi-camera system the trajectories of the moving people, which are then modelled by either their positions (projected on the ground plane) or their directions of movement. Both representations can be very suitable for querying large video repositories, by searching for similar trajectories in terms of either sequences of positions or trajectory shape (encoded as a sequence of angles, where positions do not matter). Preliminary examples of the possible use of this approach are shown.

2008 - "Inside the Bible": Segmentation, Annotation and Retrieval for a New Browsing Experience [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; S. Calderara; R. Cucchiara
abstract

In this paper we present a system for automatic segmentation, annotation and image retrieval based on content, focused on illuminated manuscripts and in particular the Borso D'Este Holy Bible. To enhance the interaction possibilities with this work, full of decorations and illustrations, we exploit some well known document analysis techniques in addition to some new approaches, in order to achieve good segmentation of pages into meaningful visual objects with the relative annotation. We wanted to extend the standard keyword-based retrieval approach in a commentary with a modern visual-based retrieval by appearance similarity: an entire software user interface for exploration and visual search of illuminated manuscripts.

2008 - A Markerless Approach for Consistent Action Recognition in a Multi-camera System [Relazione in Atti di Convegno]
S. Calderara; A. Prati; R. Cucchiara
abstract

This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need of artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments showed the potentiality of the proposed approach.

2008 - AD-HOC: Appearance Driven Human tracking with Occlusion Handling [Relazione in Atti di Convegno]
R. Vezzani; R. Cucchiara
abstract

AD-HOC copes with the problem of multiple people tracking in video surveillance in the presence of large occlusions. The main novelty is the adoption of an appearance-based approach in a formal Bayesian framework: the status of each object is defined at pixel level, where each pixel is characterized by the appearance, i.e. the color (integrated over time) and the likelihood of belonging to the object. With these data at pixel level and a probability of non-occlusion at object level, the problem of occlusions is addressed. The method does not aim only at detecting the presence of an occlusion, but classifies the type of occlusion at a sub-region level and evolves the status of the object in a selective way. The AD-HOC tracking has been tested in many applications for indoor and outdoor surveillance. Results on the PETS2006 test set are reported, where many people and abandoned objects are detected and tracked.

2008 - Action Signature: a Novel Holistic Representation for Action Recognition [Relazione in Atti di Convegno]
Calderara, S.; Cucchiara, R.; Prati, A.
abstract

Recognizing different actions with a unique approach can be a difficult task. This paper proposes a novel holistic representation of actions that we called "action signature". This 1D trajectory is obtained by parsing the 2D image containing the orientations of the gradient calculated on the motion feature map called motion-history image. In this way, the trajectory is a sketch representation of how the object motion varies in time. A robust statistical framework based on mixtures of von Mises distributions and dynamic programming for sequence alignment are used to compare and classify actions/trajectories. The experimental results show a rather high accuracy in distinguishing quite complicated actions, such as drinking, jumping, or abandoning an object.

2008 - Annotation Collection and Online Performance Evaluation for Video Surveillance: the ViSOR Project [Relazione in Atti di Convegno]
Roberto Vezzani; Rita Cucchiara
abstract

This paper presents the ViSOR (VIdeo Surveillance Online Repository) project, designed with the aim of establishing an open platform for collecting, annotating, retrieving and sharing surveillance videos, and for evaluating the performance of automatic surveillance systems. The main idea is to exploit the collaborative paradigm spreading in the web community to join together the ontology-based annotation and retrieval concepts and the requirements of the computer vision and video surveillance communities. The ViSOR open repository is based on a reference ontology which integrates many concepts, some of them coming from the LSCOM and MediaMill ontologies. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, and media downloading and uploading. The repository contains metadata annotations, which can be either manually created as ground truth or automatically generated by video surveillance systems. These automatic annotations can be compared with each other or with the reference ground truth by exploiting an integrated on-line performance evaluator.

2008 - Bayesian-competitive Consistent Labeling for People Surveillance [Articolo su rivista]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper presents a novel and robust approach to consistent labeling for people surveillance in multi-camera systems. A general framework scalable to any number of cameras with overlapped views is devised. An off-line training process automatically computes ground-plane homography and recovers epipolar geometry. When a new object is detected in any one camera, hypotheses for potential matching objects in the other cameras are established. Each of the hypotheses is evaluated using a prior and likelihood value. The prior accounts for the positions of the potential matching objects, while the likelihood is computed by warping the vertical axis of the new object on the field of view of the other cameras and measuring the amount of match. In the likelihood, two contributions (forward and backward) are considered so as to correctly handle the case of groups of people merged into single objects. Eventually, a maximum-a-posteriori approach estimates the best label assignment for the new object. Comparisons with other methods based on homography and extensive outdoor experiments demonstrate that the proposed approach is accurate and robust in coping with segmentation errors and in disambiguating groups.
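
The final maximum-a-posteriori step can be sketched generically: each hypothesis carries a prior and a likelihood, the posterior is proportional to their product, and the label with the highest posterior is selected. How the paper actually computes the prior and the forward and backward likelihood contributions is not reproduced here; the function name and the dictionary layout are hypothetical.

```python
def map_label(hypotheses):
    """Pick the maximum-a-posteriori label.

    hypotheses: dict mapping label -> (prior, likelihood).
    Returns (best_label, posteriors), with posteriors normalized to sum to 1.
    """
    joint = {label: p * l for label, (p, l) in hypotheses.items()}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all hypotheses have zero probability")
    post = {label: v / total for label, v in joint.items()}
    return max(post, key=post.get), post
```

In this toy form a hypothesis with a lower prior can still win if its likelihood is sufficiently higher, which is the behavior a MAP decision rule is meant to provide.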

2008 - Describing Texture Directions with Von Mises Distributions [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
abstract

In this work we describe a new approach for texture characterization. Starting from the autocorrelation matrix, an elegant description through a mixture of Von Mises distributions is proposed. A compact 6-valued descriptor is produced for each block and serves as input to an SVM classifier. Tests are carried out on high resolution images of illuminated manuscripts.
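
For readers unfamiliar with the distribution, the sketch below evaluates a mixture of von Mises densities, f(θ) = Σ w_i · exp(κ_i cos(θ − μ_i)) / (2π I0(κ_i)). It only illustrates the density itself; the fitting procedure and the way the 6-valued descriptor is derived are not reproduced. The modified Bessel function I0 is computed from its power series, which is adequate for moderate κ; all function names are chosen for this example.

```python
import math

def bessel_i0(k, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= (k / 2.0) ** 2 / (n * n)   # ratio of consecutive series terms
        total += term
    return total

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2 * math.pi * bessel_i0(kappa))

def mixture_pdf(theta, components):
    """components: list of (weight, mu, kappa); weights are assumed to sum to 1."""
    return sum(w * von_mises_pdf(theta, mu, k) for w, mu, k in components)
```

With κ = 0 the density reduces to the uniform distribution on the circle, while larger κ concentrates the mass around μ, which is what makes the mixture suitable for describing dominant directions.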

2008 - Enabling Technologies on Hybrid Camera Networks for Behavioral Analysis of Unattended Indoor Environments and Their Surroundings [Relazione in Atti di Convegno]
G. Gualdi; A. Prati; R. Cucchiara; E. Ardizzone; M. La Cascia; L. Lo Presti; M. Morana
abstract

This paper presents a layered network architecture and the enabling technologies for accomplishing vision-based behavioral analysis of unattended environments. Specifically, the vision network covers both the attended environment and its surroundings by means of hybrid cameras. The layer overlooking the surroundings is deployed outdoors and tracks people, monitoring entrance/exit points. It recovers the geometry of the site under surveillance and communicates people's positions to a higher-level layer. The layer monitoring the unattended environment pursues similar goals, with the addition of maintaining a global mosaic of the observed scene for further understanding. Moreover, it merges information coming from non-vision sensors to deepen the understanding or increase the reliability of the system. The behavioral analysis is delegated to a third layer that merges the information received from the two other layers and infers knowledge about what happened, what is happening and what is likely to happen in the environment. The paper also describes a case study that was implemented in the Engineering Campus of the University of Modena and Reggio Emilia, where our surveillance system has been deployed in a computer laboratory which was often inaccessible due to lack of attendance.

2008 - HECOL: Homography and Epipolar-based Consistent Labeling for Outdoor Park Surveillance [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
Outdoor surveillance is one of the most attractive applications of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous consistent labeling of tracked objects among the different views. This work proposes a new approach for coping with this problem in multi-camera systems with overlapped Fields of View (FoVs). The presence of overlapped zones allows the definition of a geometry-based approach to reconstruct correspondences between FoVs, using only homography and epipolar lines (hereinafter HECOL: Homography and Epipolar-based COnsistent Labeling) computed automatically in a training phase. We also propose a complete system that provides segmentation and tracking of people in each camera module. Segmentation is performed by means of the SAKBOT (Statistical and Knowledge Based Object Tracker) approach, suitably modified to cope with multi-modal backgrounds, reflections and other artefacts typical of outdoor scenes. The extracted objects are tracked using a statistical appearance model robust against occlusions and segmentation errors. The main novelty of this paper is the approach to consistent labeling. A specific Camera Transition Graph is adopted to efficiently select the possible correspondence hypotheses between labels. A Bayesian MAP optimization assigns consistent labels to objects detected from several points of view: the object axis is computed from the shape tracked in each camera module, and homography and epipolar lines allow a correct axis warping in the other image planes. 
Both forward and backward probability contributions from the two different warping directions make the approach robust against segmentation errors and capable of disambiguating groups of people. The system has been tested in a real setup in an urban public park, within the Italian LAICA (Laboratory of Ambient Intelligence for a friendly city) project. The experiments show that the system can correctly track and label objects in a distributed system with real-time performance. Comparisons with simpler consistent labeling methods and extensive outdoor experiments with ground truth demonstrate the accuracy and robustness of the proposed approach.
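
As a rough illustration of the axis-warping step mentioned in this abstract, the sketch below maps the two endpoints of an object axis from one image plane to another through a 3x3 homography. It is not the HECOL implementation: the matrix `H_ab`, the function name `warp_points` and the sample coordinates are all hypothetical, and a homography induced by the ground plane is exact only for points lying on that plane (which is why the abstract combines it with epipolar lines).

```python
import numpy as np

def warp_points(H, points):
    """Map 2-D image points through a 3x3 homography H."""
    pts = np.asarray(points, dtype=float)
    # Append a 1 to each point to obtain homogeneous coordinates.
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homogeneous @ H.T
    # Divide by the third coordinate to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical homography from camera A to camera B (a pure translation here).
H_ab = np.array([[1.0, 0.0, 10.0],
                 [0.0, 1.0, -5.0],
                 [0.0, 0.0, 1.0]])

# Hypothetical object axis in camera A: feet point and head point.
axis_in_a = [(100.0, 200.0), (100.0, 80.0)]
axis_in_b = warp_points(H_ab, axis_in_a)
```

In a real setup the homography would be estimated from point correspondences in the overlapped zone rather than written by hand.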

2008 - Pervasive Self-Learning with multi-modal distributed sensors [Relazione in Atti di Convegno]
Bicocchi, Nicola; Mamei, Marco; Prati, Andrea; Cucchiara, Rita; Zambonelli, Franco
Truly ubiquitous computing poses new and significant challenges. One of the key aspects that will condition the impact of these new technologies is how to obtain a manageable representation of the surrounding environment starting from simple sensing capabilities. This will make devices able to adapt their computing activities to an ever-changing environment. This paper presents a framework to promote unsupervised training processes among different sensors. This framework allows different sensors to exchange the knowledge needed to create a model to classify events. In particular we developed, as a case study, a multi-modal multi-sensor classification system combining data from a camera and a body-worn accelerometer to identify the user motion state. The body-worn accelerometer learns a model of the user behavior exploiting the information coming from the camera and uses it later on to classify the user motion in an autonomous way. Experiments demonstrate the accuracy of the proposed approach in different situations.

2008 - Reliable smoke detection system in the domains of image energy and color [Relazione in Atti di Convegno]
Piccinini, Paolo; Calderara, Simone; Cucchiara, Rita
Smoke detection calls for a reliable and fast distinction between background, moving objects and variable shapes that are recognizable as smoke. In our system we propose a stable background suppression module joined with a smoke detection module working on segmented objects. It exploits two features: the energy variation in the wavelet domain and a color model of the smoke. The decrease of the energy ratio in the wavelet domain between background and current image is a clue for detecting smoke, since it represents the variation of the texture level. A mixture of Gaussians models the temporal evolution of this texture ratio. The color model is used as a reference to measure the deviation of the current pixel color from the model. The two features are combined using a Bayesian classifier to detect smoke in the scene. Experiments on real data and a comparison between our background model and the Mixture of Gaussians (MoG) model for smoke detection are presented. © 2008 IEEE.
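
The energy-ratio cue described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: a one-level Haar transform computed by hand stands in for whatever wavelet the authors used, and the function names, the `eps` guard and the synthetic images are hypothetical.

```python
import numpy as np

def haar_detail_energy(image):
    """Energy of the three detail sub-bands of a one-level 2-D Haar transform."""
    img = np.asarray(image, dtype=float)
    # Crop to even dimensions so that 2x2 blocks tile the image exactly.
    img = img[: img.shape[0] // 2 * 2, : img.shape[1] // 2 * 2]
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    detail_cols = (a - b + c - d) / 2.0  # differences between columns
    detail_rows = (a + b - c - d) / 2.0  # differences between rows
    detail_diag = (a - b - c + d) / 2.0  # diagonal differences
    return float(np.sum(detail_cols**2 + detail_rows**2 + detail_diag**2))

def energy_ratio(background, current, eps=1e-9):
    """Ratio of detail energy in the current image to that of the background."""
    return haar_detail_energy(current) / (haar_detail_energy(background) + eps)

# Synthetic example: a textured background, and the same texture attenuated
# and brightened, as semi-transparent smoke might do.
rows, cols = np.indices((8, 8))
background = ((rows + cols) % 2).astype(float)
current = 0.5 * background + 0.3
ratio = energy_ratio(background, current)
```

A ratio well below 1 indicates that texture has been washed out in the current image; in the abstract, the temporal evolution of this ratio is what the mixture of Gaussians models.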

2008 - Smoke detection in video surveillance: A MoG model in the wavelet domain [Capitolo/Saggio]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
The paper presents a new, fast and robust technique for smoke detection in video surveillance images. The approach aims at detecting the onset or the presence of smoke by analyzing color and texture features of moving objects, segmented with background subtraction. The proposal embodies some novelties. First, the temporal behavior of the smoke is modeled by a Mixture of Gaussians (MoG) of the energy variation in the wavelet domain. The MoG takes into account the image energy variation due to either external luminance changes or smoke propagation, and allows it to be distinguished from the energy variation due to the presence of real moving objects such as people and vehicles. Second, this textural analysis is enriched by a color analysis based on the blending function. Third, a Bayesian model is defined in which the texture and color features, detected at block level, contribute to the likelihood, while a global evaluation of the entire image models the prior probability. The resulting approach is very flexible and can be adopted in conjunction with any video surveillance system based on a dynamic background model. Several tests in tens of different contexts, both outdoor and indoor, prove its robustness and precision. © 2008 Springer-Verlag Berlin Heidelberg.

2008 - Smoke detection in videosurveillance: the use of VISOR (Video Surveillance On-line Repository) [Relazione in Atti di Convegno]
Roberto Vezzani; Simone Calderara; Paolo Piccinini; Rita Cucchiara
ViSOR (VIdeo Surveillance Online Repository) is a large video repository, designed for containing annotated video surveillance footage, comparing annotations, evaluating system performance, and performing retrieval tasks. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video preview, and media download and upload. The repository contains metadata annotations, both manually created ground-truth data and automatically obtained outputs of particular systems. An example of application is the collection of videos and annotations for smoke detection, an important video surveillance task. In this paper we present the architecture of ViSOR, the built-in surveillance ontology, which integrates many concepts, some coming from LSCOM and MediaMill, the annotation tools, and the visualization of results for performance evaluation. The annotation is obtained with an automatic smoke detection system, capable of detecting people, moving objects, and smoke in real time.

2008 - Using Dominant Sets for Object Tracking with Freely Moving Camera [Relazione in Atti di Convegno]
G. Gualdi; A. Albarelli; A. Prati; A. Torsello; M. Pelillo; R. Cucchiara
Object tracking with freely moving cameras is an open issue, since background information cannot be exploited for foreground segmentation, and plain feature tracking is not robust enough for target tracking, due to occlusions, distractors and object deformations. In order to deal with such challenging conditions, a traditional approach based on Camshift-like color-based features is augmented by introducing a structural model of the object to be tracked, incorporating previous knowledge about the spatial relations between the parts. Hence, an attributed graph is built on top of the features extracted from each frame and a graph matching technique is used to extract the optimal match with the model. Pixel-wise and object-wise comparisons with other tracking techniques with respect to manually obtained ground truth are presented.

2008 - Using circular statistics for trajectory shape analysis [Relazione in Atti di Convegno]
Prati, Andrea; Calderara, Simone; Cucchiara, Rita
The analysis of patterns of movement is a crucial task for several surveillance applications, for instance to classify normal or abnormal people trajectories on the basis of their occurrence. This paper proposes to model the shape of a single trajectory as a sequence of angles described using a Mixture of Von Mises (MoVM) distribution. A complete EM (Expectation Maximization) algorithm is derived for MoVM parameter estimation, and an on-line version is proposed to meet real-time requirements. Maximum-A-Posteriori estimation is used to encode the trajectory as a sequence of symbols corresponding to the MoVM components. Iterative k-medoids clustering groups trajectories into a variable number of similarity classes. The similarity is computed by aligning two sequences (with dynamic programming) and taking as symbol-to-symbol distance the Bhattacharyya distance between von Mises distributions. Extensive experiments have been performed on both synthetic and real data. ©2008 IEEE.
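
The MAP encoding step described in this abstract can be sketched as follows, assuming the mixture parameters are already known (the EM estimation is not shown). The function names and the two-component mixture are hypothetical; only the von Mises density itself is standard.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density with mean direction mu and concentration kappa."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

def map_encode(angles, weights, mus, kappas):
    """Assign each angle to the mixture component with the largest posterior."""
    theta = np.asarray(angles, dtype=float)[:, None]  # shape (N, 1)
    # Posterior is proportional to weight * likelihood; shape (N, K).
    scores = np.asarray(weights) * von_mises_pdf(
        theta, np.asarray(mus), np.asarray(kappas)
    )
    return np.argmax(scores, axis=1)

# Hypothetical mixture: one component pointing "east", one pointing "north".
weights = [0.5, 0.5]
mus = [0.0, np.pi / 2.0]
kappas = [4.0, 4.0]

# Directions (in radians) sampled along a trajectory.
trajectory_angles = [0.1, 1.5, -0.2, 1.2]
symbols = map_encode(trajectory_angles, weights, mus, kappas)
```

Each trajectory becomes a sequence of component indices, which is the symbolic form that the alignment and clustering steps then operate on.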

2008 - ViSOR: Video Surveillance On-line Repository for Annotation Retrieval [Relazione in Atti di Convegno]
Roberto Vezzani; Rita Cucchiara
The Imagelab Laboratory of the University of Modena and Reggio Emilia has designed a large video repository, aiming at containing annotated video surveillance footage. The web interface, named ViSOR (VIdeo Surveillance Online Repository), allows video browsing, querying by annotated concepts or by keywords, compressed preview, and video download and upload. The repository contains metadata annotations, both manually annotated ground-truth data and automatically obtained outputs of a particular system. In this way, the users of the repository are able to validate their own algorithms as well as carry out comparative activities.

2008 - Video Streaming for Mobile Video Surveillance [Articolo su rivista]
G. GUALDI; A. PRATI; R. CUCCHIARA
Mobile video surveillance represents a new paradigm that encompasses, on the one side, ubiquitous video acquisition and, on the other side, ubiquitous video processing and viewing, addressing both computer-based and human-based surveillance. To this aim, systems must provide efficient video streaming with low latency and low frame skipping, even over limited-bandwidth networks. This work presents MoSES (MObile Streaming for vidEo Surveillance), an effective system for mobile video surveillance for both PC and PDA clients; it relies on H.264/AVC video coding and GPRS/EDGE-GPRS networks. Adaptive control algorithms are employed to achieve the best tradeoff between low latency and good video fluidity. MoSES provides good-quality video streaming that is used as input to computer-based video surveillance applications for people segmentation and tracking. In this paper, new and general-purpose methodologies for streaming performance evaluation are also proposed and used to compare MoSES with existing solutions in terms of different parameters (latency, image quality, video fluidity, and frame losses), as well as in terms of performance in people segmentation and tracking.

2007 - A Distributed Outdoor Video Surveillance System for Detection of Abnormal People Trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
Distributed surveillance systems are nowadays widely adopted to monitor large areas for security purposes. In this paper, we present a complete multicamera system designed for people tracking from multiple partially overlapped views and capable of inferring and detecting abnormal people trajectories. Detection and tracking are performed by means of background suppression and an appearance-based probabilistic approach. Objects' label ambiguities are geometrically solved and the concept of "normality" is learned from data using a robust statistical model based on Von Mises distributions. Abnormal trajectories are detected using a first-order Bayesian network and, for each abnormal event, the appearance of the subject from each view is logged. Experiments demonstrate that our system can process with real-time performance up to three cameras simultaneously in an unsupervised setup and under varying environmental conditions.

2007 - A Dynamic Programming Technique for Classifying Trajectories [Relazione in Atti di Convegno]
Calderara, S.; Cucchiara, R.; Prati, A.
This paper proposes the exploitation of a dynamic programming technique for efficiently comparing people trajectories adopting an encoding scheme that jointly takes into account both the direction and the velocity of movement. With this approach, each pair of trajectories in the training set is compared and the corresponding distance computed. Clustering is achieved by using the k-medoids algorithm and each cluster is modeled with a 1-D Gaussian over the distance from the medoid. A MAP framework is adopted for the testing phase. The reported results are encouraging.
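
A generic dynamic-programming alignment between two symbol sequences, of the kind this abstract relies on, might look like the sketch below. It is not the authors' encoding or cost function: the direction letters, the unit mismatch cost and the gap cost are hypothetical placeholders.

```python
def alignment_distance(seq_a, seq_b, symbol_cost, gap_cost=1.0):
    """Global alignment cost between two symbol sequences."""
    n, m = len(seq_a), len(seq_b)
    # table[i][j] = cheapest alignment of the first i symbols of seq_a
    # with the first j symbols of seq_b.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap_cost
    for j in range(1, m + 1):
        table[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = min(
                # Align the two current symbols with each other.
                table[i - 1][j - 1] + symbol_cost(seq_a[i - 1], seq_b[j - 1]),
                # Leave a symbol of seq_a unmatched.
                table[i - 1][j] + gap_cost,
                # Leave a symbol of seq_b unmatched.
                table[i][j - 1] + gap_cost,
            )
    return table[n][m]

def mismatch(a, b):
    """Hypothetical symbol cost: 0 for equal symbols, 1 otherwise."""
    return 0.0 if a == b else 1.0

# Two trajectories encoded as compass directions (hypothetical encoding).
distance = alignment_distance("NNEES", "NNES", mismatch)
```

The pairwise distances produced this way can be fed directly to k-medoids, which needs only a distance matrix and not a vector representation of the trajectories.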

2007 - A Multi-Camera Vision System for Fall Detection and Alarm Generation [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
In-house video surveillance can represent an excellent support for people with some difficulties (e.g. elderly or disabled people) living alone and with a limited autonomy. New hardware technologies and in particular digital cameras are now affordable and they have recently gained credit as tools for (semi-)automatically assuring people's safety. In this paper a multi-camera vision system for detecting and tracking people and recognizing dangerous behaviours and events such as a fall is presented. In such a situation a suitable alarm can be sent, e.g. by means of an SMS. A novel technique of warping people's silhouette is proposed to exchange visual information between partially overlapped cameras whenever a camera handover occurs. Finally, a multi-client and multi-threaded transcoding video server delivers live video streams to operators/remote users in order to check the validity of a received alarm. Semantic and event-based transcoding algorithms are used to optimize the bandwidth usage. A two-room setup has been created in our laboratory to test the performance of the overall system and some of the results obtained are reported.

2007 - An Open Source Architecture for Low-Latency Video Streaming on PDAs [Relazione in Atti di Convegno]
G. Gualdi; A. Prati; R. Cucchiara
This paper presents an open-source system for low-latency video streaming on PDAs, specifically addressing mobile video surveillance requirements. The system is based on H.264 and suitably modified to obtain the best trade-off between image quality and video fluidity, working also at very limited bandwidths. Moreover, the adopted controls keep the number of lost frames very low. A large set of experiments and comparisons has been carried out, and the achieved results demonstrate the efficacy and efficiency of our system.

2007 - Behavioral lEarning in Surveilled Areas with Feature Extraction [Partecipazione a progetti di ricerca]
R. Cucchiara; A. Prati; S. Calderara; R. Vezzani
The project aims at exploring how visual features can be automatically extracted from video using computer vision techniques and exploited by a classifier (generated by machine learning) to detect and identify suspicious people behavior in public places in real time. In this sense, CV and ML are jointly developed and studied to provide a better mix of innovative techniques.

2007 - Compressed Domain Features Extraction for Shot Characterization [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; D. Borghesani; R. Cucchiara
In this work, we propose a system for shot comparison working directly on the MPEG-1 stream in the compressed domain. It extracts color, texture and motion features considering all frames, with a reasonable computational cost and results comparable to those obtained on uncompressed keyframes. In particular, a summary descriptor for each Group Of Pictures (GOP) is computed and employed for shot characterization and comparison. The Mallows distance allows clips of different lengths to be matched in a unified framework.

2007 - Detection of Abnormal Behaviors using a Mixture of Von Mises Distributions [Relazione in Atti di Convegno]
Calderara, S.; Cucchiara, R.; Prati, A.
This paper proposes the use of a mixture of Von Mises distributions to detect abnormal behaviors of moving people. The mixture is created from an unsupervised training set by exploiting the k-medoids clustering algorithm based on the Bhattacharyya distance between distributions. The extracted medoids are used as modes in the multi-modal mixture, whose weights are the priors of the specific medoids. Given the mixture model, a new trajectory is verified against the model by considering each direction composing it as independent. Experiments over a real scenario composed of multiple, partially overlapped cameras are reported.
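
For two von Mises distributions the Bhattacharyya coefficient has a closed form, obtained by integrating the square root of the product of the two densities. The sketch below uses it; which distance the authors derive from the coefficient is not stated in the abstract, so the `sqrt(1 - BC)` form used here is an assumption (the negative logarithm of the coefficient is another common convention), and the function names are hypothetical.

```python
import numpy as np

def bhattacharyya_coefficient(mu1, kappa1, mu2, kappa2):
    """Closed-form overlap between two von Mises distributions."""
    # k1*cos(t - mu1) + k2*cos(t - mu2) equals R*cos(t - phi) with this R.
    resultant = np.sqrt(
        kappa1**2 + kappa2**2 + 2.0 * kappa1 * kappa2 * np.cos(mu1 - mu2)
    )
    return float(np.i0(resultant / 2.0) / np.sqrt(np.i0(kappa1) * np.i0(kappa2)))

def bhattacharyya_distance(mu1, kappa1, mu2, kappa2):
    """Assumed convention: sqrt(1 - BC), clipped to guard against round-off."""
    bc = bhattacharyya_coefficient(mu1, kappa1, mu2, kappa2)
    return float(np.sqrt(max(0.0, 1.0 - bc)))

# Two direction models with nearby means, and two with opposite means.
near = bhattacharyya_distance(0.0, 4.0, 0.2, 4.0)
far = bhattacharyya_distance(0.0, 4.0, np.pi, 4.0)
```

Because the distance depends only on pairs of distributions, it can serve as the dissimilarity in k-medoids without ever averaging angles.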

2007 - Dynamic Pictorial Ontologies for Video Digital libraries Annotation [Relazione in Atti di Convegno]
M. Bertini; A. Del Bimbo; C. Torniai; C. Grana; R. Cucchiara
In this paper, we present the dynamic pictorial ontology paradigm for video annotation. Ontologies are often used to describe a given domain for different goals, including the description of multimedia data. In the case of video annotation, the visual knowledge cannot be described using only abstract concepts but is more effectively represented in a visual form. To this aim, we introduce visual concepts, elicited from the data set as the most representative prototypes that specialize abstract concepts. The ontology created is intrinsically dynamic since it must embrace the perceptual and visual experience during annotation. Thus visual concepts can change, adapting to the multimedia content analyzed. Motivations for this new ontology paradigm are discussed together with a proposal of a framework for ontology creation, maintenance, and automatic annotation of video. The creation and usage of dynamic pictorial ontologies have been tested in the soccer domain, exploiting low-level perceptual features and higher-level domain features.

2007 - Enhancing HSV Histograms with Achromatic Points Detection for Video Retrieval [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; R. Cucchiara
Color is one of the most meaningful features used in content based retrieval of visual data. In video content based retrieval, color features computed on selected frames are integrated with other low-level features concerning texture, shape and motion in order to find clip similarities. For example, the Scalable Color feature defined in the MPEG-7 standard exploits HSV histograms to create color feature vectors. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in classification of achromatic and low saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection based on a single Hue and Saturation parameter that can correct this limitation. The enhanced histograms have proven to be effective in color analysis and they have been used in a system for automatic clip annotation called PEANO, where pictorial concepts are extracted by a clip clustering and used for similarity based automatic annotation.

2007 - Guest Editorial: Expert environments: machine intelligence methods for ambient intelligence [Articolo su rivista]
P. REMAGNINO; A. PRATI; G.L. FORESTI; R. CUCCHIARA

2007 - Linear Transition Detection as a Unified Shot Detection Approach [Articolo su rivista]
C. Grana; R. Cucchiara
In this paper, we propose an automatic system for video shot segmentation, called Linear Transition Detector (LTD), which uses a single detector for both cuts and linear transitions. A comparison with publicly available shot detection systems is reported on different sports (Formula 1, basketball, soccer and cycling), and TRECVID 2005 results are also reported.

2007 - Mobile Video Surveillance with Low-Bandwidth Low-Latency Video Streaming [Relazione in Atti di Convegno]
G. GUALDI; A. PRATI; R. CUCCHIARA
This paper presents a system for remote live video surveillance. Videos are acquired from a fixed camera at 10 fps and QVGA resolution, compressed at 5 or 20 kbit/s with H.264, and streamed to a remote site, where they are processed by an automatic video surveillance system. The target surveillance application performs moving object segmentation and tracking. Both ends (video acquisition and processing) can be connected through a wireless network, specifically GPRS. The whole system is studied and optimized to maintain low latency. The reported experiments demonstrate that the proposed system is able to send up to four video streams over a GPRS or E-GPRS network without significantly affecting the performance of the automatic video surveillance system. Comparative tests have been performed with other existing streaming solutions.

2007 - Network patterns recognition for automatic dermatoscopic images classification [Relazione in Atti di Convegno]
C. Grana; D. Vanini; S. Seidenari; G. Pellacani; R. Cucchiara
In this paper we focus on the problem of automatic classification of melanocytic lesions, aiming at identifying the presence of reticular patterns. The recognition of reticular lesions is an important step in the description of the pigmented network, in order to obtain meaningful diagnostic information. Parameters like color, size or symmetry could benefit from the knowledge of having a reticular or non-reticular lesion. The detection of network patterns is performed with a three-step procedure. The first step is the localization of line points, by means of the line point detection algorithm first described by Steger. The second step is the linking of such points into a line, considering the direction of the line at its endpoints and the number of line points connected to these. Finally, a third step discards the meshes which could not be closed at the end of the linking procedure and the ones characterized by anomalous values of area or circularity. The number of valid meshes left and their area with respect to the whole area of the lesion are the inputs of a discriminant function which classifies the lesions into reticular and non-reticular. This approach was tested on two balanced training and testing sets (each formed by 50 reticular and 50 non-reticular images). We obtained above 86% correct classification of the reticular and non-reticular lesions on real skin images, with a specificity value never lower than 92%.
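
The third step above, discarding meshes with anomalous area or circularity, can be illustrated with a small filter. The abstract does not give the circularity definition or the thresholds, so the sketch assumes the common definition 4πA/P² (equal to 1 for a circle) and uses purely hypothetical limits and names.

```python
import math

def circularity(area, perimeter):
    """Assumed definition: 4*pi*A / P**2, which is 1.0 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter**2

def is_valid_mesh(area, perimeter, min_area, max_area, min_circularity):
    """Keep a closed mesh only if its area and circularity look plausible."""
    if not (min_area <= area <= max_area):
        return False
    return circularity(area, perimeter) >= min_circularity

# Hypothetical meshes as (area, perimeter) pairs in pixel units.
meshes = [
    (math.pi * 5.0**2, 2.0 * math.pi * 5.0),  # circle of radius 5
    (40.0, 82.0),                             # thin 1 x 40 sliver
    (4.0, 8.0),                               # 2 x 2 square, too small
]
# Hypothetical thresholds, chosen only for this example.
valid = [
    is_valid_mesh(a, p, min_area=10.0, max_area=500.0, min_circularity=0.5)
    for a, p in meshes
]
```

The count and total area of the meshes that survive such a filter are the kind of quantities the abstract feeds to its discriminant function.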

2007 - Proceedings of 14th International Conference on Image Analysis and Processing (ICIAP 2007) [Curatela]
R. Cucchiara

2007 - Prototypes Selection with Context Based Intra-class Clustering for Video Annotation with Mpeg7 Features [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; R. Cucchiara
In this work, we analyze the effectiveness of perceptual features to automatically annotate video clips in domain-specific video digital libraries. Typically, automatic annotation is provided by computing clip similarity with respect to given examples, which constitute the knowledge base, in accordance with a given ontology or a classification scheme. Since the number of training clips is normally very large, we propose to automatically extract some prototypes, or visual concepts, for each class instead of using the whole knowledge base. The prototypes are generated after a Complete Link clustering based on perceptual features with an automatic selection of the number of clusters. Context-based information is used in an intra-class clustering framework to select more discriminative clips. Reducing the number of samples makes the matching process faster and lessens the storage requirements. Clips are annotated following the MPEG-7 directives to provide easier portability. Results are provided on videos taken from sports and news digital libraries.

2007 - Semi-automatic Video Digital Library Annotation Tools [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; R. Vezzani
In this work, we present a general-purpose system for hierarchical structural segmentation and automatic annotation of video clips, by means of standardized low-level features. We propose to automatically extract some prototypes for each class with a context-based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semi-automatic metadata creation are provided.

2007 - Similarity-Based Retrieval with MPEG-7 3D Descriptors: Performance Evaluation on the Princeton Shape Benchmark [Relazione in Atti di Convegno]
C. Grana; M. Davolio; R. Cucchiara
In this work, we describe in detail the new MPEG-7 Perceptual 3D Shape Descriptor and provide a set of tests with different 3D object databases, mainly the Princeton Shape Benchmark. For this purpose we created a function library called Retrieval-3D and fixed some bugs in the MPEG-7 eXperimentation Model (XM). We explain how to match the Attributed Relational Graph (ARG) of every 3D model with the modified nested Earth Mover’s Distance (mnEMD). Finally, we compare our results with the best found in the literature, including the first MPEG-7 3D descriptor, i.e. the Shape Spectrum Descriptor.

2007 - Sports Video Annotation Using Enhanced HSV Histograms in Multimedia Ontologies [Relazione in Atti di Convegno]
M. Bertini; A. Del Bimbo; C. Torniai; C. Grana; R. Vezzani; R. Cucchiara
This paper presents multimedia ontologies, where multimedia data and traditional textual ontologies are merged. A solution for their implementation in the soccer video domain and a method to perform automatic soccer video annotation using these extended ontologies are shown. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in the classification of achromatic and low-saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection based on a single Hue and Saturation parameter that can correct this limitation. The more general concepts of the sport domain (e.g. play/break, crowd, etc.) are put in correspondence with the more general visual features of the video, like color and texture, while the more specific concepts of the soccer domain (e.g. highlights such as attack actions) are put in correspondence with domain-specific visual features like the soccer playfield and the players. Experimental results for annotation of soccer videos using generic concepts are presented.

2007 - Using a Wireless Sensor Network to Enhance Video Surveillance [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto; Benini, L.; Farella, E.; Zappi, P.
To enhance video surveillance systems, multi-modal sensor integration can be a successful strategy. In this work, a computer vision system able to detect and track people from multiple cameras is integrated with a wireless sensor network mounting passive Pyroelectric InfraRed sensors. The two subsystems are briefly described and possible cases in which computer vision algorithms are likely to fail are discussed. Then, simple but reliable outputs from the sensor nodes are exploited to improve the accuracy of the vision system. In particular, two case studies are reported: the first uses the presence detection of sensors to disambiguate between an open door and a moving person, while the second handles motion direction changes during occlusions. Preliminary results are reported and demonstrate the usefulness of the integration of the two subsystems.

2007 - Video Shots Comparison using the Mallows Distance [Relazione in Atti di Convegno]
C. Grana; D. Borghesani; R. Cucchiara
In this work, we focus on two aspects of the comparison of video shots. We present a new approach to extract a variable number of key frames from a shot, by the use of a hierarchical clustering with automatic level selection, in order to provide optimal allocation of features on different parts of the shot. We then employ the Mallows distance as an effective technique to compare the discrete distributions of features, independently from the features selected for the specific application. Results and comparisons on a soccer documentary video are provided.
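
To illustrate how the Mallows distance can compare shots with different numbers of key frames, the sketch below computes its first-order form for one-dimensional features, where it reduces to the area between the two cumulative distributions (and coincides with the earth mover's distance for distributions of equal mass). Real shot descriptors are multi-dimensional and would need a transportation solver; the scalar feature values, weights and function name here are hypothetical.

```python
import numpy as np

def mallows_distance_1d(values_a, weights_a, values_b, weights_b):
    """First-order Mallows distance between two 1-D discrete distributions."""
    values_a = np.asarray(values_a, dtype=float)
    values_b = np.asarray(values_b, dtype=float)
    # Normalise the weights so that both distributions have unit mass.
    weights_a = np.asarray(weights_a, dtype=float) / np.sum(weights_a)
    weights_b = np.asarray(weights_b, dtype=float) / np.sum(weights_b)
    # Evaluate both cumulative distributions on the merged support.
    support = np.unique(np.concatenate([values_a, values_b]))
    cdf_a = np.array([weights_a[values_a <= x].sum() for x in support])
    cdf_b = np.array([weights_b[values_b <= x].sum() for x in support])
    # Integrate |F_a - F_b| over the gaps between consecutive support points.
    return float(np.sum(np.abs(cdf_a[:-1] - cdf_b[:-1]) * np.diff(support)))

# Shot A: two key frames; shot B: three key frames (hypothetical scalar features).
shot_a_values, shot_a_weights = [0.2, 0.8], [0.5, 0.5]
shot_b_values, shot_b_weights = [0.2, 0.5, 0.8], [0.25, 0.5, 0.25]
d = mallows_distance_1d(shot_a_values, shot_a_weights, shot_b_values, shot_b_weights)
```

Because the weights carry the share of the shot that each key frame represents, the two shots need not have the same number of key frames for the comparison to be well defined.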

2007 - VidiVideo [Partecipazione a progetti di ricerca]
Rita Cucchiara; Roberto Vezzani

2007 - Visor: Video Surveillance Online Repository [Relazione in Atti di Convegno]
Roberto Vezzani; Rita Cucchiara

2006 - 3-D Virtual Environments on Mobile Devices for Remote Surveillance [Relazione in Atti di Convegno]
R. VEZZANI; R. CUCCHIARA; A. MALIZIA; L. CINQUE

2006 - A Distributed Domotic Surveillance System [Capitolo/Saggio]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto

2006 - A semi-automatic system for segmentation of cardiac M-mode images [Articolo su rivista]
L. BERTELLI; R. CUCCHIARA; G. PATERNOSTRO; A. PRATI

2006 - A semi-automatic video annotation tool with MPEG-7 content collections [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; D. Bulgarelli; R. Vezzani

2006 - A system for automatic face obscuration for privacy purposes [Articolo su rivista]
R. Cucchiara; A. Prati; R. Vezzani

2006 - Advanced video surveillance with pan tilt zoom cameras [Relazione in Atti di Convegno]
R. CUCCHIARA; A. PRATI; R. VEZZANI

2006 - Comparison of color clustering algorithms for segmentation of dermatological images [Relazione in Atti di Convegno]
R. Melli; C. Grana; R. Cucchiara

2006 - Distance transform for automatic dermatologic images composition [Relazione in Atti di Convegno]
C. Grana; G. Pellacani; S. Seidenari; R. Cucchiara

2006 - Estimating Geospatial Trajectory of a Moving Camera [Relazione in Atti di Convegno]
A. HAKEEM; R. VEZZANI; S. SHAH; R. CUCCHIARA

2006 - FREE Surveillance in a pRivacy respectFul way [Partecipazione a progetti di ricerca]
R. Cucchiara; A. Prati; C. Grana; R. Vezzani
The FREE SURF project aims at proposing new technologies for the next generations of video surveillance systems oriented to the automatic real-time control of the presence and actions undertaken by people in the environment, without the direct control of a human operator. The FREE SURF project is born with a twofold aim: first, innovative scientific research in the field of Computer Vision and Pattern Recognition, second, innovative applied research for the development of new generations of video surveillance systems, both effective and socially acceptable with respect to privacy concerns.The first objective is to conduct a thoughtful research activity in the field of Computer Engineering for video surveillance of people in "structural constraint FREE" systems, that is in systems free from structural and environmental constraints. The automatic visual control of human presence and actions in a given environment is, indeed, one of the most studied problems in the last decade. Nowadays, a very large literature exists, which presents algorithms and robust implementations for the recognition of single persons, in structured environments: closed environments with controlled illumination, open environments with large field of view (in order to consider people as small rigid moving objects), with few people, with only partially occluded fields of view, controlled by fixed cameras (to segment objects as different from the background), and installed with a precise manual calibration (for an exact 3D reconstruction).The final objective of the project is to study innovative methodologies and techniques for going further on: the final targets are environments free from structural constraints, in scenes with more people that live together and interact each other, as in parks or tourist areas. 
The foreseen activities are devoted to the study of new ways to extract visual data from distributed camera systems, from hybrid systems with active cameras capable of automatically moving toward a target, from moving cameras, and in coordination with networks of sensors. New algorithms will be studied and working prototypes developed for people segmentation and tracking in videos acquired by multiple auto-calibrated cameras, by exploiting geometrical information and appearance (color and texture). Approaches for active camera control and mosaicing of the scene from moving cameras will be studied. Moreover, mobile agent systems will be studied to coordinate cameras and sensor networks in large scenes like archaeological sites. These techniques will all be implemented in separate modules by each RU, but they will be coordinated in a single architecture to provide a common interface for the reasoning modules. All the previous modules have the common objective of extracting visual data on the people in the scene. In particular, trajectory computation with invariants independent of the point of view, people posture analysis and soft biometrics are the main data that will be extracted. Unlike projects dealing with biometric analysis, the FREE SURF project is oriented to the automatic visual analysis of the presence and behavior of people independently of their identities, which are not easy to assess in noisy, low-resolution videos with large field of view, like those typical of distributed video surveillance systems. As a further support, hybrid systems with PTZ and mobile cameras can provide, if needed, information with more details, which can be used in "posterity logging" by the experts. The visual data are provided to modules for dual activities: to monitor dangerous situations in real time, and to annotate interesting situations for future off-line queries.
The first is a strategic tool to help the human operator in the prevention and fast responsiveness to facts regarding security, the second provides a valid support to investigations and a-posteriori analysis. These solutions may enable the many existing surveillance systems to provide effective su

2006 - FaceMouse: a Human-Computer Interface for Tetraplegic People [Relazione in Atti di Convegno]
E. Perini; S. Soria; A. Prati; R. Cucchiara

2006 - Fast Dynamic Mosaicing and Person Following [Relazione in Atti di Convegno]
A. PRATI; F. SEGHEDONI; R. CUCCHIARA

2006 - Group Detection at Camera Handoff for Collecting People Appearance in Multi-camera Systems [Relazione in Atti di Convegno]
S. CALDERARA; R. CUCCHIARA; A. PRATI

2006 - Line Detection and Texture Characterization of Network Patterns [Relazione in Atti di Convegno]
C. Grana; R. Cucchiara; G. Pellacani; S. Seidenari

2006 - Low-latency Live Video Streaming over Low-Capacity Networks [Relazione in Atti di Convegno]
G. GUALDI; R. CUCCHIARA; A. PRATI

2006 - MOM: multimedia ontology manager. A framework for automatic annotation and semantic retrieval of video sequences [Relazione in Atti di Convegno]
M. Bertini; A. Del Bimbo; C. Torniai; C. Grana; R. Cucchiara

2006 - MPEG-7 Pictorially Enriched Ontologies for Video Annotation [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; D. Bulgarelli; R. Cucchiara

2006 - Multimedia Surveillance: Content-based Retrieval with Multicamera People Tracking [Relazione in Atti di Convegno]
Calderara, S.; Cucchiara, R.; Prati, A.

2006 - PEANO: Pictorial Enriched Annotation of Video [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; D. Bulgarelli; G. Gualdi; R. Cucchiara; M. Bertini; C. Torniai; A. Del Bimbo

2006 - Performance of the MPEG-7 Shape Spectrum Descriptor for 3D objects retrieval [Relazione in Atti di Convegno]
C. Grana; R. Cucchiara

2006 - Reliable background suppression for complex scenes [Relazione in Atti di Convegno]
Calderara, S.; Melli, R.; Prati, A.; Cucchiara, R.

2006 - Semantic Annotation and Adaptation of Live Sports Videos [Relazione in Atti di Convegno]
M. Bertini; R. Cucchiara; A. Del Bimbo; A. Prati

M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI

2006 - Special Issue on Multimedia Surveillance Systems: Guest Editorial [Articolo su rivista]
Aggarwal JK; Cucchiara R
It is with considerable pride that we present this special issue of ACM multimedia based on the presentations at the third Video Surveillance and Sensor Network workshop, in conjunction with the ACM conference in Singapore 2005. The papers were thoroughly reviewed independently of the review process for the workshop. This special issue consists of eight papers drawn from a number of areas. It appears that we are breaking new ground as explained in this issue. Whenever we say multimedia, we think of systems and services that manage heterogeneous data for human-oriented applications; human users are normally the subjects who access and use multimedia data, multimedia streams, multimedia content, and multimedia interfaces in many different application contexts. Following this abstraction, multimedia surveillance systems would be only a surveillance system able to produce output of the task in a multimedia format, providing distilled video, images and sounds of the monitored environment, which would possibly be annotated in an efficient and standard way or possibly transcoded in another media such as text or animation, to improve further querying to surveillance stored data.

2006 - Sub-Shot Summarization for MPEG-7 based Fast Browsing [Relazione in Atti di Convegno]
C. Grana; R. Cucchiara
In this paper, we propose a system for automatic video summarization at sub-shot level. Our work covers two main aspects: the first is the sub-shot detection, which is performed without a priori constraints on the number or length of the shots. The algorithm is based on color histograms and motion features, and employs fuzzy c-means with a variable number of clusters. The second aspect is an in-depth discussion on the annotation of summaries with the MPEG-7 standard. Results on mixed-genre TV material, from TRECVID videos, are reported.

2006 - The LAICA project: Experiments on Multicamera People Tracking and Logging [Relazione in Atti di Convegno]
S. CALDERARA; R. CUCCHIARA; A. PRATI
Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling that is also capable of fully characterizing groups of people and managing mis-segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects’ principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, mis-segmentations, and groups.

2006 - University of Modena and Reggio Emilia at TRECVID 2006 [Relazione in Atti di Convegno]
C. Grana; R. Vezzani; R. Cucchiara
What approach or combination of approaches did you test in each of your submitted runs? TRECVID2005_UNIMORE_??.xml: the same linear transition detector (LTD) was tested for every run, with ten uniformly spaced thresholds for the detection. What if any significant differences (in terms of what measures) did you find among the runs? The system behaved as expected: the higher the threshold the better the recall. Of course the precision lowered correspondingly. Interestingly enough, it seems that we cannot overcome the overall limit around 80% for recall and 88% for precision, independently of the other parameter. Based on the results, can you estimate the relative contribution of each component of your system/approach to its effectiveness? One of the main objectives of our system was to test the performance of a single algorithm for both cuts and gradual transitions. So all the merits and the demerits are related to our LTD. Overall, what did you learn about runs/approaches and the research question(s) that motivated them? The use of a single algorithm allows the system to be run without training. Just a single parameter may be employed to tune the sensitivity of the system, thus allowing its use in general purpose/user friendly systems.

2006 - Video Clip Clustering for Assisted Creation of MPEG-7 Pictorially Enriched Ontologies [Relazione in Atti di Convegno]
C. Grana; D. Bulgarelli; R. Cucchiara
In this paper, we present a system for the assisted creation of Pictorially Enriched Ontologies, that is ontologies for context-based digital libraries enriched by pictorial concepts for video annotation, summarization and similarity based retrieval. Here we detail the approach for video clips clustering and pictorial concepts extraction together with the approach for storing the ontology within the MPEG-7 framework. The clustering is performed by Complete Link hierarchical clustering on color histograms and motion features. Results on Formula 1 TV material are reported.

2005 - A computer vision system for in-house video surveillance [Articolo su rivista]
R. Cucchiara; C. Grana; A. Prati; R. Vezzani
In-house video surveillance to control the safety of people living in domestic environments is considered. In this context, common problems and general purpose computer vision techniques are discussed and implemented in an integrated solution comprising a robust moving object detection module which is able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. These factors, shadows, large occlusions and people's posture, are the key problems that are encountered with in-house surveillance systems. A distributed system with cameras installed in each room of a house can be used to provide full coverage of people's movements. Tracking is based on a probabilistic approach in which the appearance and probability of occlusions are computed for the current camera and warped in the next camera's view by positioning the cameras to disambiguate the occlusions. The application context is the emerging area of domotics (from the Latin word domus, meaning 'home', and informatics). In particular, indoor video surveillance makes it possible for elderly and disabled people to live with a sufficient degree of autonomy, via interaction with this new technology, which can be distributed in a house at affordable costs and with high reliability.

2005 - Adaptation and Annotation of Formula 1 Sport Videos [Relazione in Atti di Convegno]
C. Grana; G. Tardini; R. Cucchiara
In this paper, we approach the problem of detecting editing features suitable for video annotation, by paying attention to artifacts and effects introduced in video editing. In particular, a linear transition detection algorithm is presented, which can characterize the transition center and length with high precision. The technique works with sub-frame granularity and is able to include both abrupt cuts and longer dissolves in a single approach. Theoretical justification for the algorithm is provided with an optimization technique for real cases. We present results obtained exploiting the editing features on a Formula 1 video digital library, detecting replays and providing pre-classification hints for automatic shot annotation.

2005 - Ambient Intelligence for Security in Public Parks: the LAICA Project [Relazione in Atti di Convegno]
R. CUCCHIARA; A. PRATI; R. VEZZANI
In this paper, we address the exploitation of computer vision techniques to develop multimedia services and automatic monitoring systems related to the security and the privacy in public areas. The research is part of a two-year Italian project called LAICA, intended to provide advanced services for citizens and public officers. Citizens want fast and friendly web access to public places, to see the environment in real-time without violating the privacy laws. Public officers and policy centres want a fast and reactive monitoring system, capable of automatically detecting dangerous situations, given the huge amount of cameras that cannot be monitored simultaneously by human operators. In this work, we describe the project and the defined methodologies in multi-camera video mosaicing, people tracking and consistent labelling, and access to processed data with face obscuration.

2005 - Ambient Intelligence in Urban Environments [Relazione in Atti di Convegno]
R. CUCCHIARA; A. PRATI; C. OSTI; S. PAVANI
This paper reports advances achieved within a project called LAICA (Laboratorio di Ambient Intelligence per una Città Amica) on Ambient Intelligence in urban environments. The overall LAICA architecture is described and the unified operative centre developed by Regulus SpA (partner of the project) to collect and correlate data from different sensors and prototypes is depicted. Moreover, the paper describes the results obtained in developing a system for video surveillance in public parks, devoted to create a mosaic image of the scene and to extract and track moving people. Moreover, the system takes the privacy issues into account, proposing a method for face detection and tracking able to obscure faces in order to protect people’s identity.

2005 - An integrated framework for semantic annotation and adaptation [Articolo su rivista]
M. Bertini; R. Cucchiara; A. Del Bimbo; A. Prati
Tools for the interpretation of significant events from video and video clip adaptation can effectively support automatic extraction and distribution of relevant content from video streams. In fact, adaptation can adjust meaningful content, previously detected and extracted, to the user/client capabilities and requirements. The integration of these two functions is increasingly important, due to the growing demand of multimedia data from remote clients with limited resources (PDAs, HCCs, Smart phones). In this paper we propose a unified framework for event-based and object-based semantic extraction from video and semantic on-line adaptation. Two cases of application, highlight detection and recognition from soccer videos and people behavior detection in domotic applications, are analyzed and discussed.

2005 - Assessing Temporal Coherence for Posture Classification with Large Occlusions [Relazione in Atti di Convegno]
R. CUCCHIARA; R. VEZZANI
In this paper we present a people posture classification approach especially devoted to cope with occlusions. In particular, the approach aims at assessing temporal coherence of visual data over probabilistic models. A mixed predictive and probabilistic tracking is proposed: a probabilistic tracking maintains along time the actual appearance of detected people and evaluates the occlusion probability; an additional tracking with Kalman prediction improves the estimation of the people position inside the room. Probabilistic Projection Maps (PPMs) created with a learning phase are matched against the appearance mask of the track. Finally, a Hidden Markov Model formulation of the posture corrects the frame-by-frame classification uncertainties and makes the system reliable even in presence of occlusions. Results obtained over real indoor sequences are discussed.

2005 - Auto-iris Compensation for Traffic Surveillance Systems [Relazione in Atti di Convegno]
R. CUCCHIARA; R. MELLI; A. PRATI
This paper addresses auto-iris compensation. Auto-iris can be really troublesome for motion detection and tracking techniques based on background or frame differencing, since it can quickly change the average intensity of the current frame. To cope with this, we introduced a two-step auto-iris compensation approach in our traffic monitoring system. First, the auto-iris detection is based on the computation of the average of the luminance difference obtained by background suppression. Then, if an auto-iris is detected, the compensation phase is started. In this phase, the auto-iris’ behaviour is empirically modelled and, thus, compensated. Experimental results demonstrate the accuracy of the proposed approach, with both quantitative measures and visual analysis.

2005 - Consistent labeling for multi-camera object tracking [Relazione in Atti di Convegno]
S. Calderara; A. Prati; R. Vezzani; R. Cucchiara
In this paper, we present a new approach to multi-camera object tracking based on consistent labeling. An automatic and reliable procedure obtains the homographic transformation between two overlapped views, without any manual calibration of the cameras. Objects' positions are matched by using the homography when the object is first detected in one of the two views. The approach has been tested also in the case of simultaneous transitions and in the case in which people are detected as a group during the transition. Promising results are reported over a real setup of overlapped cameras.

2005 - Entry Edge of Field of View for multi-camera tracking in distributed video surveillance [Relazione in Atti di Convegno]
S. Calderara; R. Vezzani; A. Prati; R. Cucchiara
An efficient solution to people tracking in distributed video surveillance is required to monitor crowded and large environments. This paper proposes a novel use of the Entry Edges of Field of View (E2oFoV) to solve the consistent labeling problem between partially overlapped views. An automatic and reliable procedure obtains the homographic transformation between two overlapped views, without any manual calibration of the cameras. Through the homography, the consistent labeling is established each time a new track is detected in one of the cameras. A Camera Transition Graph (CTG) is defined to speed up the establishment process by reducing the search space. Experimental results prove the effectiveness of the proposed solution also in challenging conditions.

2005 - MPEG-7 Compliant Shot Detection in Sport Videos [Relazione in Atti di Convegno]
C. Grana; G. Tardini; R. Cucchiara
In this paper we propose a system for automatic detection of shots in sport videos. Our work covers two main aspects: the first is robust shot detection in presence of fast object motion and camera operations. To this aim we propose a new algorithm, unique for both cuts and linear transitions detection, which only needs the tuning of two parameters. An extended comparison with four transition detection algorithms, representing the state of the art in literature, is reported. Examples with Formula 1, basketball, soccer and cycling videos are analyzed. The second aspect is an in-depth discussion on the annotation of shots and transitions with the MPEG-7 standard.

2005 - Making the home safer and more secure through visual surveillance [Relazione in Atti di Convegno]
R. Cucchiara; A. Prati; R. Vezzani
Video surveillance has a direct application in intelligent home automation or domotics (from the Latin word domus, meaning “home”, and informatics). In particular, in-house video surveillance can provide good support for people with some difficulties (e.g. elderly or disabled people) living alone and with limited autonomy. A key aspect in video surveillance systems for domotics is that of analyzing behaviours of the monitored people. To accomplish this task, people must be detected and tracked, and their posture must be analyzed in order to model behaviours and recognize abrupt changes in them. Problems related to reliable software solutions are not completely solved; in particular, luminance changes, shadows and frequent posture changes must be taken into account. Long-lasting occlusions are common due to the proximity of the cameras and the presence of furniture and doors that can often hide parts of a person’s body. For these reasons, a probabilistic and appearance-based tracking, conceived specifically for people tracking and posture classification, has been developed. However, despite its effectiveness for long-lasting and large occlusions, this approach tends to fail whenever the person is monitored with multiple cameras and he appears in one of them already occluded. Different views provided by multiple cameras can be exploited to solve occlusions by warping known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by HMM for assuring temporal coherence of the posture.

2005 - Multimedia Surveillance Systems [Relazione in Atti di Convegno]
R. CUCCHIARA
The integration of video technology and sensor networks constitutes the fundamental infrastructure for new generations of multimedia surveillance systems, where many different media streams (audio, video, images, textual data, sensor signals) will concur to provide an automatic analysis of the controlled environment and a real-time interpretation of the scene. New solutions can be devised to enlarge the view of traditional surveillance systems by means of distributed architectures with fixed and active cameras, to enhance their view with other sensed data, to explore multi-resolution views with zooming and omnidirectional cameras. Applications regard surveillance of wide indoor and outdoor areas and particularly people surveillance: in this case, multimedia surveillance systems can be enriched with biometric technology; the best views of detected persons and their extracted visual features (e.g. faces, voices, trajectories) can be exploited for people identification. VSSN05 is the third edition of the workshop, co-located at ACM Multimedia Conference, that embraces research reports on video surveillance and, since the edition of 2004, sensor networks. This paper gives a short overview of the hot topics in multimedia surveillance systems and introduces some research activities currently engaged in the world and presented at VSSN05.

2005 - On the usefulness of object shape coding with MPEG-4 [Relazione in Atti di Convegno]
A. PRATI; R. CUCCHIARA
This paper reports the results of an in-depth analysis of the degree of usefulness of object shape coding in video compression. In particular, MPEG-4 is used as reference standard. The influence of different coding parameters on the performance is deeply examined and discussions on the results are provided. Object shape coding is compared with classical (MPEG-2) frame-based coding both at an objective level (by comparing PSNR/quality and bitrate/filesize) and at a subjective level (asking a set of users to express their opinion on overall quality, cognitive effectiveness, and willingness to pay). In conclusion, this paper aims at answering the question of whether or not it is convenient to use object shape coding instead of frame-based coding.

2005 - Posture Classification in a Multi-camera Indoor Environment [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
Posture classification is a key process for analyzing people’s behaviour. Computer vision techniques can be helpful in automating this process, but cluttered environments and consequent occlusions make this task often difficult. Different views provided by multiple cameras can be exploited to solve occlusions by warping known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by HMM for assuring temporal coherence of the posture. The single camera posture classification is then exploited in the multi-camera system to solve the cases in which the occlusions make the classification impossible. Experimental results of the classification from both the single camera and the multi-camera system are provided.

2005 - Predictive and Probabilistic Tracking to Detect Stopped Vehicles [Relazione in Atti di Convegno]
R. MELLI; R. CUCCHIARA; A. PRATI; L. DE COCK
Many techniques and models have been proposed for vehicle surveillance in highways. In the past, tracking algorithms based on Kalman filter have been largely used for their efficiency in the prediction and low computational cost. However, predictive filters cannot solve long-lasting occlusions. In this paper, we propose a new mixed predictive and probabilistic tracking that exploits the advantages of predictive filters for moving vehicles and adopts probabilistic and appearance-based tracking for stopped vehicles. The proposed tracking is part of a complete video surveillance system, oriented to control tunnels and highways from cluttered views, that is implemented in an embedded DSP platform and provides background suppression, a novel shadow detection algorithm, tracking, and scene recognition module. The experimental results are obtained over several hours of videos acquired in pre-existing platforms of CCTV surveillance systems.

2005 - Probabilistic posture classification for human-behavior analysis [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
Computer vision and ubiquitous multimedia access nowadays make feasible the development of a mostly automated system for human-behavior analysis. In this context, our proposal is to analyze human behaviors by classifying the posture of the monitored person and, consequently, detecting corresponding events and alarm situations, like a fall. To this aim, our approach can be divided into two phases: for each frame, the projection histograms (Haritaoglu et al., 1998) of each person are computed and compared with the probabilistic projection maps stored for each posture during the training phase; then, the obtained posture is further validated exploiting the information extracted by a tracking module in order to take into account the reliability of the classification of the first phase. Moreover, the tracking algorithm is used to handle occlusions, making the system particularly robust even in indoor environments. Extensive experimental results demonstrate a promising average accuracy of more than 95% in correctly classifying human postures, even in the case of challenging conditions.

2005 - Real Time Semantic Adaptation of Sports Video with User-centred Performance Analysis [Relazione in Atti di Convegno]
M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI
Semantic video adaptation improves traditional adaptation by taking into account the degree of relevance of the different portions of the content. It employs solutions to detect the significant parts of the video and applies different compression ratios to elements that have different importance. Performance of semantic adaptation heavily depends on the quality and precision of the automatic annotation, whether it operates in strict or nonstrict real time, and the codec which is used to perform adaptation at the event or object level. It should consider the effects of the errors in the automatic extraction of objects and events over the operation of the adaptation subsystem, and relate these effects to the preferences for the objects and events of the video program, that have been decided by the user. In this paper, we present strict real time annotation and adaptation of sports video and introduce two new performance measures: Viewing Quality Loss and Bit-rate Cost Increase, that are obtained from classical PSNR and Bit Ratio, but relate the results of semantic adaptation with the user’s preferences and expectations.

2005 - Shot Detection for Formula 1 Video Digital Libraries [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; G. Tardini
Metadata extraction is one of the first tasks to be performed for automatic Digital Library annotation, and in particular shot detection has been widely explored in literature. While a lot of methods have been proposed for the detection of abrupt cuts, only a small number of them has explicitly addressed the problem of gradual transitions. In this paper we propose an algorithm that exploits a precise model of linear transition. Experimental results on Formula 1 car races videos show the robustness of this method. These test videos are characterized by extreme situations such as fast camera and objects motion and very different kinds of shots. The algorithm is able to estimate the exact length of the transition and an error score is also given as a fitness measure to the linear model, to discriminate true transitions from false detections. The final shot segmentation is delivered as an MPEG7 compliant output.

2005 - Shot detection and motion analysis for automatic MPEG-7 annotation of sports videos [Relazione in Atti di Convegno]
G. Tardini; C. Grana; R. Marchi; R. Cucchiara
In this paper we describe general algorithms that are devised for MPEG-7 automatic annotation of Formula 1 videos, and in particular for camera-car shots detection. We employed a shot detection algorithm suitable for cuts and linear transitions detection, which is able to precisely detect both the transition's center and length. Statistical features based on MPEG motion compensation vectors are then employed to provide motion characterization, using a subset of the motion types defined in MPEG-7, and shot type classification. Results on shot detection and classification are provided.

2005 - T_PARK: Ambient Intelligence for Security in Public Parks [Relazione in Atti di Convegno]
R. CUCCHIARA; A. PRATI; L. BENINI; E. FARELLA
In this paper, we present joint research activities in computer vision and sensor networks for a distributed surveillance of urban parks. Distributed visual surveillance of urban environments is one of the most interesting scenarios in Ambient Intelligence; in addition, the automated monitoring of public parks, often crowded by children and adults, is still a very difficult task due to the number of objects of interest. In this context, integrating the power of low cost sensors with the information provided by cameras can lead to a more reliable solution to people tracking in wide areas. Specifically, the deficiencies of one approach can be (at least partially) covered by the advantages of the other. The goal is to perform people tracking in parks (to achieve trackable parks - T-Parks), both in zones covered by overlapped cameras and also, thanks to sensors, in areas not covered by any camera. In this paper, we propose a new technique for multi-camera people tracking based on a learning phase to automatically calibrate pairs of cameras and to build Areas of Field Views (AoFoVs) in order to establish consistent labelling of people. In addition, sensor networks distributed at the borders of the AoFoV give an estimation of the probability of people overlapping, triggering specific algorithms of face detection or head counting to identify the single person. The research of T-Parks is part of a two-year Italian project called LAICA, intended to provide advanced services for citizens and public officers based on ambient intelligence technologies.

2005 - Video Annotation with Pictorially Enriched Ontologies [Relazione in Atti di Convegno]
C. TORNIAI; A. DEL BIMBO; R. CUCCHIARA; M. BERTINI
Video annotation is typically performed by classifying video elements according to some pre-defined ontology of the video content domain. Ontologies are defined by establishing relationships between linguistic terms, that specify domain concepts at different abstraction levels. However, although linguistic terms are appropriate to distinguish event and object categories, they are inadequate when they must describe specific patterns of events or video entities. Instead, in these cases, pattern specifications are better expressed through visual prototypes that capture the essence of the event or entity. Pictorially enriched ontologies, that include visual concepts together with linguistic keywords, are therefore needed to support video annotation up to the level of detail of pattern specification. This paper presents pictorially enriched ontologies and provides a solution for their implementation in the soccer video domain. The pictorially enriched ontology is used both to directly assign multimedia objects to concepts, providing a more meaningful definition than the linguistic terms, and to extend the initial knowledge of the domain, adding subclasses of highlights or new highlight classes that were not defined in the linguistic ontology. Automatic annotation of soccer clips up to the pattern specification level using a pictorially enriched ontology is discussed.

2005 - Video understanding and content-based retrieval [Relazione in Atti di Convegno]
Y. Zhai; J. Liu; X. Cao; A. Basharat; A. Hakeem; S. Ali; M. Shah; C. Grana; R. Cucchiara
This year, the joint team of UCF and the University of Modena has participated in the following tasks: (1) shot boundary detection, (2) low-level feature extraction, (3) high-level feature extraction, (4) topic search and (5) BBC rushes management. The shot boundary detection was contributed by the Image Lab at the University of Modena. The other tasks were performed by the Computer Vision Team at UCF.

2004 - An Intelligent Surveillance System for Dangerous Situation Detection in Home Environments [Articolo su rivista]
R. Cucchiara; A. Prati; R. Vezzani
In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglou et al. [5] that uses histogram projections to classify people’s posture. Projection histograms are exploited here as the main feature for posture classification but, unlike [5], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to solve perspective problems and to improve the precision of the classification. Furthermore, we make use of a finite state machine to detect dangerous situations such as falls and to activate a suitable alarm generator. The system works on-line on standard workstations with network cameras.

2004 - An image analysis approach for automatically re-orienteering CT images for dental implants [Articolo su rivista]
Cucchiara R; Lamma E; Sansoni T
In the last decade, computerized tomography (CT) has become the most frequently used imaging modality to obtain a correct pre-operative implant planning. In this work, we present an image analysis and computer vision approach able to identify, from the reconstructed 3D data set, the optimal cutting plane specific to each implant to be planned, in order to obtain the best view of the implant site and to have correct measures. If the patient requires several implants, different cutting planes are automatically identified, and the axial and cross-sectional images can be re-oriented according to each of them. In the paper, we describe the algorithms defined to recognize 3D markers (each one aligned with a missing tooth for which an implant has to be planned) in the 3D reconstructed space, and the results in processing real exams, in terms of effectiveness, precision and reproducibility of the measure.

2004 - Automated extraction and description of dark areas in surface microscopy melanocytic lesion images [Articolo su rivista]
G. Pellacani; C. Grana; R. Cucchiara; S. Seidenari
Background: Identification of dark areas inside a melanocytic lesion (ML) is of great importance for melanoma diagnosis, both during clinical examination and employing programs for automated image analysis. Objective: The aim of our study was to compare two different methods for the automated identification and description of dark areas in epiluminescence microscopy images of MLs and to evaluate their diagnostic capability. Methods: Two methods for the automated extraction of 'absolute' (ADAs) and 'relative' dark areas (RDAs) and a set of parameters for their description were developed and tested on 339 images of MLs acquired by means of a polarized-light videomicroscope. Results: Significant differences in dark area distribution between melanomas and nevi were observed employing both methods, permitting a good discrimination of MLs (diagnostic accuracy = 74.6 and 71.2% for ADAs and RDAs, respectively). Conclusions: Both methods for the automated identification of dark areas are useful for melanoma diagnosis and can be implemented in programs for image analysis.

2004 - Color Calibration for a Dermatological Video Camera System [Relazione in Atti di Convegno]
C. Grana; G. Pellacani; S. Seidenari; R. Cucchiara
In this work, we describe a technique to calibrate images for skin analysis in dermatology. Using a common reference, we correct non-uniform illumination effects, estimate the gamma correction and produce an XYZ conversion matrix. The final result is then reverted to a non-standard RGB color space, built from the instrument images. In this way different instruments behave uniformly, allowing colorimetric characterization while improving the results of common algorithms. The proposed techniques should be the initial support for a distributed framework where dermatological images can be consistently compared.

2004 - Content-based Video Adaptation with User's Preference [Relazione in Atti di Convegno]
M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI
In this paper we present an integrated system that has been designed to support automatic semantic extraction of highlights in sports video and automatic video adaptation according to user’s preferences. To analyze the user’s satisfaction, we propose a new performance measure that explicitly takes into account the user’s preferences and considers the number and type of errors produced by the annotation engine and the way in which these errors affect the compressed video quality and bandwidth allocation. We provide experimental results with application to soccer and swimming.

2004 - DELOS: a Network of Excellence on Digital Libraries [Partecipazione a progetti di ricerca]
R. Cucchiara; A. Prati; C. Grana; R. Vezzani
2004 - Introduction to the special section on in vehicle computer vision systems [Articolo su rivista]
R. Cucchiara; D. Lovell; A. Prati; M.M. Trivedi
2004 - Neighbor cache prefetching for multimedia image and video processing [Articolo su rivista]
R. CUCCHIARA; M. PICCARDI; A. PRATI
Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, able to improve exploitation of bidimensional spatial locality. A performance comparison is provided against other established prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results show that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4, compared with 75% for the second-best technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.
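
As an illustration of why bidimensional locality matters, the following minimal sketch simulates a small LRU cache while an image stored in row-major order is scanned column by column. It is not the PRIMA simulator and not the authors' prefetching policy: the function `simulate`, the cache geometry and the rule "prefetch the block directly below" are all assumptions chosen only to contrast a one-block-lookahead scheme with a hypothetical 2D neighbour prefetch.

```python
from collections import OrderedDict

def simulate(width, height, block, capacity, offsets):
    """Count demand misses for a column-wise scan of a row-major image.

    offsets lists the block offsets prefetched after every access:
    [] = no prefetching, [1] = one-block lookahead,
    [width // block] = block directly below (assumed 2D neighbour rule).
    """
    blocks_per_row = width // block
    total_blocks = blocks_per_row * height
    cache = OrderedDict()  # keys are block numbers, order is LRU order
    misses = 0

    def touch(b):
        """Make block b the most recently used entry; return True on a hit."""
        hit = b in cache
        if hit:
            cache.move_to_end(b)
        else:
            cache[b] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
        return hit

    for x in range(width):          # vertical scan: the column index is the outer loop
        for y in range(height):
            b = (y * width + x) // block
            if not touch(b):
                misses += 1
            for off in offsets:
                target = b + off
                if 0 <= target < total_blocks and target not in cache:
                    touch(target)   # prefetched blocks are not counted as misses
    return misses

# Toy configuration (all values are illustrative assumptions)
WIDTH, HEIGHT, BLOCK, CAPACITY = 64, 16, 8, 8
BLOCKS_PER_ROW = WIDTH // BLOCK
misses_none = simulate(WIDTH, HEIGHT, BLOCK, CAPACITY, [])
misses_obl = simulate(WIDTH, HEIGHT, BLOCK, CAPACITY, [1])
misses_2d = simulate(WIDTH, HEIGHT, BLOCK, CAPACITY, [BLOCKS_PER_ROW])
```

In this toy setting the lookahead along the row brings in blocks that are evicted before the scan reaches them, whereas the vertical prefetch anticipates the next access of the scan. Real workloads mix access patterns, so the figures reported in the abstract cannot be reproduced from this sketch.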

2004 - Object-based and Event-based Semantic Video Adaptation [Relazione in Atti di Convegno]
M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI
Semantic video adaptation allows video content to be transmitted with different viewing quality, depending on the relevance of the content from the user’s viewpoint. To this end, an automatic annotation subsystem must be employed that automatically detects relevant objects and events in the video stream. In this paper we present a composite framework that is made of an automatic annotation engine and a semantics-based adaptation module. Three new compression solutions are proposed that work at the object or event level. Their performance is compared according to a new measure that takes into account the user’s satisfaction and the effect on it of errors in the annotation module.

2004 - Objects and Events Recognition for Sport Videos Transcoding [Relazione in Atti di Convegno]
M. BERTINI; A. DEL BIMBO; A. PRATI; R. CUCCHIARA
2004 - Probabilistic People Tracking for Occlusion Handling [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; G. Tardini; R. Vezzani
This work presents a novel people tracking approach, able to cope with frequent shape changes and large occlusions. In particular, the tracks are described by means of probabilistic masks and appearance models. Occlusions due to other tracks or due to background objects and false occlusions are discriminated. The tracking system is general enough to be applied with any motion segmentation module; it can track people interacting with each other and it maintains the pixel-to-track assignment even with large occlusions. At the same time, the update model is very reactive, so as to cope with sudden body motion and changes in silhouette shape. Due to its robustness, it has been used in many experiments of people behavior control in indoor situations.

2004 - Real-time motion segmentation from moving cameras [Articolo su rivista]
R. Cucchiara; A. Prati; R. Vezzani
This paper describes our approach to real-time detection of camera motion and moving object segmentation in videos acquired from moving cameras. As far as we know, none of the proposals reported in the literature are able to meet real-time requirements. In this work, we present an approach based on a color segmentation followed by a region-merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by the work of Gelgon and Bouthemy (Pattern Recognition 33 (2000) 725-40), which has been modified to reduce computational cost in order to achieve a fast segmentation (about 10 frames per second). To this end, a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of the motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.

2004 - Semantic Annotation and Transcoding for Sport Videos [Relazione in Atti di Convegno]
M. BERTINI; A. DEL BIMBO; A. PRATI; R. CUCCHIARA
Telecommunication companies are demonstrating interest in providing mobile video services. The availability of larger bandwidth, and the improvements in the display resolution of third generation mobile phones, let telecom and content provider companies provide new services to their customers. Among these services, users can watch a certain number of sport videos, usually a selection of the best actions that occurred during a game. In order to provide a timely and satisfying service to customers, there is a need for tools and systems that help to detect and recognize the interesting events and optimize the use of bandwidth, coding these events and the most interesting objects within them at the best visual quality/bandwidth ratio.

2004 - Semantic Annotation and Transcoding of Soccer Videos [Relazione in Atti di Convegno]
M. BERTINI; A. DEL BIMBO; R. CUCCHIARA; A. PRATI
2004 - Semantic Transcoding of Videos by using Adaptive Quantization [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea
This paper proposes an approach to video transcoding driven by the video content and provided with the adaptive quantization of MPEG standards. Computer vision techniques can extract semantics from videos according to the user's interests: the video semantics is exploited to adapt the video in order to meet the device's capabilities and the user's requirements and preserve the best quality possible. Well-assessed video analysis techniques are used to segment the video into objects grouped in classes of relevance, to which the user can assign a weight proportional to their relevance. This weight is used to decide the quantization values to be applied in the MPEG-2 encoding to each macroblock. A modified version of the PSNR (Peak Signal-to-Noise Ratio) is used as performance metric, and a comparative evaluation is reported with respect to other coding standards such as JPEG, JPEG 2000, (basic) MPEG-2, and MPEG-4. Experimental results are provided on different situations, one indoor and one outdoor. Keywords: Video transcoding, adaptive quantization, motion detection

2004 - Semantic Video Adaptation based on Automatic Annotation of Sport Videos [Relazione in Atti di Convegno]
M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI
Semantic video adaptation improves traditional adaptation by taking into account the degree of relevance of the different portions of the content. It employs solutions to detect the significant parts of the video and applies different compression ratios to elements that have different importance. Performance of semantic adaptation heavily depends on the precision of the automatic annotation and the way of operation of the codec which is used to perform adaptation at the event or object level. In this paper, we discuss critical factors that affect performance of automatic annotation and define new performance measures of semantic adaptation, Viewing Quality Loss and Bitrate Cost Increase, that are obtained from classical PSNR and Bit Rate, but relate the results of semantic adaptation with the user’s preferences and expectations. The new measures are discussed in detail for a system of sport annotation and adaptation with reference to different user profiles.

2004 - Track-based and object-based occlusion for people tracking refinement in indoor surveillance [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; G. Tardini
People tracking deals with problems of shape changes, self-occlusions and track occlusions due to other interfering tracks and fixed objects that hide parts of a person's shape. These problems are more critical in indoor surveillance and in particular in home automation settings, in which the need to merge information obtained from different cameras distributed around the house calls for the integration of reliable data obtained over time. Therefore, tracking algorithms should be carefully tuned to cope with occlusions and shape changes, working not only at pixel level but also at region level. In this work we provide a novel technique for object tracking, based on probabilistic masks and appearance models. Occlusions due to other tracks or due to background objects and false occlusions are discriminated. The classification of occluded regions of the track is exploited in a selective model update. The tracking system is general enough to be applied with any motion segmentation module; it can track people interacting with each other and it maintains the pixel-to-track assignment even with large occlusions. At the same time, the model update is very reactive, so as to cope with sudden body motion and changes in silhouette shape. Due to its robustness, it has been used in different experiments of people behavior control in indoor situations.

2004 - Using computer vision techniques for dangerous situation detection in domotic applications [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati; G. Tardini; R. Vezzani
We describe an integrated solution devised for in-house video surveillance, to control the safety of people living in a domestic environment. The system is composed of a robust moving object detection module, able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. Shadows, large occlusions and deformable models of people are key features of in-house surveillance. Moreover, the requirement of high-speed reaction to dangerous situations and the need to implement a reliable and low-cost televiewing system led to the introduction of a new multimedia model of semantic transcoding, capable of supporting different users' requests and the constraints of their devices (PDAs, smart phones, ...). Our application context is the emerging area of domotics (from the Latin word domus, meaning "home", and informatics) and, in particular, indoor video surveillance of the house, where people with some difficulties (elderly and disabled people) can now live with a sufficient degree of autonomy, thanks to the strong interaction with new technologies that can be distributed in the house with affordable costs and high reliability.

2003 - A Hough transform-based method for radial lens distortion correction [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati; R. Vezzani
The paper presents an approach for a robust (semi-)automatic correction of radial lens distortion in images and videos. This method, based on the Hough transform, is also applicable to videos from unknown cameras that, consequently, cannot be calibrated a priori. We approximate the lens distortion by considering only the lower-order term of the radial distortion. Thus, the method relies on the assumption that pure radial distortion transforms straight lines into curves. The computation of the best value of the distortion parameter is performed in a multi-resolution way. The precision of the method depends on the scale of the multi-resolution and on the resolution of the Hough space. Experiments are provided for both an outdoor, uncalibrated camera and an indoor, calibrated one. The stability of the value found in different frames of the same video demonstrates the reliability of the proposed method.
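
The idea of a one-parameter radial model tuned by a coarse-to-fine search can be sketched as follows. This is only an illustration under stated assumptions: the model x_u = x_d (1 + k r_d^2) with the distortion centre at the origin, the function names, and the search range are hypothetical, and a total-least-squares straightness error is swapped in for the Hough-space measure used in the paper, whose details the abstract does not give.

```python
import math

def undistort(points, k, centre=(0.0, 0.0)):
    """Apply the assumed one-parameter model p_u = c + (p_d - c) * (1 + k * r_d^2)."""
    cx, cy = centre
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        s = 1.0 + k * (dx * dx + dy * dy)
        out.append((cx + dx * s, cy + dy * s))
    return out

def straightness_error(points):
    """Smallest eigenvalue of the 2x2 covariance matrix: zero for collinear points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * (sxx + syy - math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))

def estimate_k(curves, lo=-0.5, hi=0.5, levels=5, samples=21):
    """Coarse-to-fine grid search for the k that makes the given curves straightest."""
    best = 0.0
    for _ in range(levels):
        step = (hi - lo) / (samples - 1)
        candidates = [lo + i * step for i in range(samples)]
        best = min(
            candidates,
            key=lambda k: sum(straightness_error(undistort(c, k)) for c in curves),
        )
        lo, hi = best - step, best + step  # refine around the current best value
    return best

def distort(points, k, iterations=50):
    """Numerically invert undistort() (centre at the origin) to build synthetic data."""
    out = []
    for xu, yu in points:
        xd, yd = xu, yu
        for _ in range(iterations):
            s = 1.0 + k * (xd * xd + yd * yd)
            xd, yd = xu / s, yu / s
        out.append((xd, yd))
    return out

# Synthetic check: a straight line in normalised coordinates, bent with a known k
line = [(-0.8 + 0.1 * i, 0.5) for i in range(17)]
observed = distort(line, 0.12)
k_hat = estimate_k([observed])
```

On real images the curves would come from detected edge points, and the search range for k would need to be adapted to the coordinate normalisation in use.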

2003 - A machine learning approach for human posture detection in domotics applications [Relazione in Atti di Convegno]
L. Panini; R. Cucchiara
This paper describes an approach for human posture classification that has been devised for indoor surveillance in domotic applications. The approach was initially inspired by a previous work of Haritaoglou et al. [2] that uses histogram projections to classify people’s posture. We modify and improve the generality of the approach by adding a machine learning phase in order to generate probability maps. A statistical classifier is then defined that compares the probability maps with the histogram profiles extracted from each moving person. The approach proves very robust if the initial constraints are satisfied and exhibits a very low computational time, so that it can be used to process live videos on standard platforms.
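
To make the notion of projection histograms concrete, here is a minimal sketch that computes the row and column projections of a binary silhouette and assigns the posture of the closest template. The nearest-template comparison with an L1 distance is an assumption standing in for the probability-map classifier, which the abstract does not specify; the masks, names and labels are hypothetical.

```python
def projection_histograms(mask):
    """Count foreground pixels per row and per column of a binary mask."""
    rows = [sum(row) for row in mask]
    cols = [sum(col) for col in zip(*mask)]
    return rows, cols

def normalise(hist):
    """Scale a histogram so that its bins sum to one (all zeros stay zeros)."""
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)

def distance(mask, template):
    """L1 distance between the normalised projections of two equally sized masks."""
    d = 0.0
    for h_a, h_b in zip(projection_histograms(mask), projection_histograms(template)):
        d += sum(abs(a - b) for a, b in zip(normalise(h_a), normalise(h_b)))
    return d

def classify(mask, templates):
    """Return the label of the template whose projections are closest to the mask."""
    return min(templates, key=lambda label: distance(mask, templates[label]))

# Toy 4x4 silhouettes (illustrative only)
STANDING = [[0, 1, 0, 0],
            [0, 1, 0, 0],
            [0, 1, 0, 0],
            [0, 1, 0, 0]]
LYING = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 1, 1],
         [0, 0, 0, 0]]
QUERY = [[0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 1, 0]]  # a standing figure with one extra pixel

label = classify(QUERY, {"standing": STANDING, "lying": LYING})
```

A learned probability map would replace the single template per posture with per-bin statistics gathered from many training silhouettes.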

2003 - A new algorithm for border description of polarized light surface microscopic images of pigmented skin lesions [Articolo su rivista]
C. Grana; G. Pellacani; R. Cucchiara; S. Seidenari
The aim of this study was to provide mathematical descriptors for the border of pigmented skin lesion images and to assess their efficacy for distinction among different lesion groups. New descriptors such as lesion slope and lesion slope regularity are introduced and mathematically defined. A new algorithm based on the Catmull-Rom spline method and the computation of the gray-level gradient of points extracted by interpolation of normal direction on spline points was employed. The efficacy of these new descriptors was tested on a data set of 510 pigmented skin lesions, composed of 85 melanomas and 425 nevi, by employing statistical methods for discrimination between the two populations.
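
For readers unfamiliar with the Catmull-Rom spline mentioned above, the following minimal sketch evaluates the standard uniform Catmull-Rom segment and samples a closed border through a set of control points. It shows only the interpolation step; the control points, the function names and the sampling density are hypothetical, and the gradient computation along the normals described in the abstract is not reproduced.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Point at parameter t in [0, 1] on the uniform Catmull-Rom segment from p1 to p2."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def sample_closed_border(control, per_segment=4):
    """Sample a closed curve that passes through every control point in order."""
    n = len(control)
    points = []
    for i in range(n):
        # Indices wrap around so that the curve closes on itself
        p0, p1, p2, p3 = (control[(i + j - 1) % n] for j in range(4))
        for s in range(per_segment):
            points.append(catmull_rom(p0, p1, p2, p3, s / per_segment))
    return points

# Four hypothetical border points on a unit square
control = [(0, 0), (1, 0), (1, 1), (0, 1)]
border = sample_closed_border(control)
```

The curve interpolates its control points, which is why this family of splines is convenient for border descriptions built from a few points placed on the lesion contour.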

2003 - Camera-car Video Analysis for Steering Wheel's Tracking [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati; F. Vigetti; M. Piccardi
Monitoring and controlling the driver’s guidance by analyzing the rotation applied to the steering wheel can be a very important task in order to improve safety. This paper proposes a general-purpose method to track the steering wheel’s absolute angle by using a single camera vision system mounted inside the car. The absolute angle is computed by accumulating inter-frame relative rotations, and error propagation is prevented with an alignment process. The approach is based on the modeling of the motion of the steering wheel, as it appears perspectively distorted from the point of view of the uncalibrated camera. We modified the Lucas-Kanade method for an approximately rotational motion model in order to provide the detection and tracking of significant features on the wheel. The experimental results are compared with ground-truth data obtained with different types of sensors.

2003 - Computer Vision Techniques for PDA Accessibility of In-House Video Surveillance [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati; R. Vezzani
In this paper we propose an approach to indoor environment surveillance and, in particular, to people behaviour control in a home automation context. The reference application is a silent and automatic control of the behaviour of people living alone in the house, specially conceived for people with limited autonomy (e.g., elderly or disabled people). The aim is to detect dangerous events (such as a person falling down) and to react to these events by establishing a remote connection with low-performance clients, such as PDAs (Personal Digital Assistants). To this end, we propose an integrated server architecture, typically connected in intranet with network cameras, able to segment and track objects of interest; in the case of objects classified as people, the system must also evaluate the people's posture and infer possible dangerous situations. Finally, the system is equipped with a specifically designed transcoding server to adapt the video content to PDA requirements (display area and bandwidth) and to the user's requests. The main elements of the proposal are a reliable real-time object detection and tracking module, a simple but effective posture classifier improved by a supervised learning phase, and a high-performance transcoding inspired by the MPEG-4 object-level standard, tailored to PDAs. Results on different video sequences and performance analysis are discussed.

2003 - Detecting moving objects, ghosts, and shadows in video streams [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Piccardi, Massimo; Prati, Andrea
Background subtraction methods are widely exploited for moving object detection in videos in many applications, such as traffic monitoring, human motion capture, and video surveillance. How to correctly and efficiently model and update the background model and how to deal with shadows are two of the most distinguishing and challenging aspects of such approaches. This work proposes a general-purpose method that combines statistical assumptions with the object-level knowledge of moving objects, apparent objects (ghosts), and shadows acquired in the processing of the previous frames. Pixels belonging to moving objects, ghosts, and shadows are processed differently in order to supply an object-based selective update. The proposed approach exploits color information for both background subtraction and shadow detection to improve object segmentation and background update. The approach proves fast, flexible, and precise in terms of both pixel accuracy and reactivity to background changes.

2003 - Detecting moving shadows: Algorithms and evaluation [Articolo su rivista]
A. PRATI; I. MIKIC; MM TRIVEDI; R. CUCCHIARA
Moving shadows need careful consideration in the development of robust dynamic scene analysis systems. Moving shadow detection is critical for accurate object detection in video streams since shadow points are often misclassified as object points, causing errors in segmentation and tracking. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, we present a comprehensive survey of moving shadow detection approaches. We organize contributions reported in the literature into four classes: two of them are statistical and two are deterministic. We also present a comparative empirical evaluation of representative algorithms selected from these four classes. Novel quantitative (detection and discrimination rate) and qualitative metrics (scene and object independence, flexibility to shadow situations, and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences. These video sequences and associated ground-truth data are made available at http://cvrr.ucsd.edu/aton/shadow to allow others in the community to experiment with new algorithms and metrics.

2003 - Domotics for disability: smart surveillance and smart video server [Relazione in Atti di Convegno]
R. Cucchiara; A. Prati; R. Vezzani
In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglou et al. [6] that uses histogram projections to classify people’s posture. Projection histograms are exploited here as the main feature for posture classification but, unlike [6], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to resolve perspective problems and improve the precision of classification. Furthermore, we make use of a finite state machine to detect dangerous situations such as falls and to activate a suitable alarm generator. The system works on-line on standard workstations with network cameras.

2003 - Image Representation and Retrieval with Topological Trees [Relazione in Atti di Convegno]
C. Grana; G. Pellacani; S. Seidenari; R. Cucchiara
Typical processes of image representation comprise an initial region segmentation followed by a description of the features of single regions and their relationships. A graph model can then be exploited in order to integrate the knowledge of the specific regions (which are the nodes of the attributed relational graph, ARG) and the regions’ relations (which are the ARG’s edges). In this work we use color features to guide region segmentation, geometric features to characterize regions one by one and topological features (in particular inclusion) to describe regions’ relationships. Guided by the inclusion property, we define the Topological Tree (TT) as an image representation model that, exploiting the transitive property of inclusion, uses the adjacency and inclusion topological features. We propose an approach based on a recursive version of fuzzy c-means to construct the topological tree directly from the initial image, performing both segmentation and TT construction. The TT can be exploited in many applications of image analysis and image retrieval by similarity in those contexts where inclusion is a key feature: we propose an application to the analysis of dermatological images to support melanoma diagnosis. In this paper we describe details of the TT algorithm, including the management of non-ideality and an approximate measure of tree similarity in order to retrieve skin lesions with a similar TT-based description.

2003 - Improving data prefetching efficacy in multimedia applications [Articolo su rivista]
R. Cucchiara; A. Prati; M. Piccardi
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

2003 - Object Segmentation in Videos from Moving Camera with MRFs on Color and Motion Features [Relazione in Atti di Convegno]
R. Cucchiara; A. Prati; R. Vezzani
In this paper we address the problem of fast segmentation of moving objects in video acquired by a moving camera or, more generally, with a moving background. We present an approach based on a color segmentation followed by a region-merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by the work of Gelgon and Bouthemy [6], which has been modified to reduce computational cost in order to achieve a fast segmentation (about ten frames per second). To this end, a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of the motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.

2003 - Object and Event Detection for Semantic Annotation and Transcoding [Relazione in Atti di Convegno]
M. BERTINI; R. CUCCHIARA; A. DEL BIMBO; A. PRATI
Video annotation provides a suitable way to describe, organize, and index stored videos. On the other hand, transcoding aims at adapting content to the capabilities and requirements of the client in use. Both are now mandatory, given the tremendous demand for multimedia access from remote clients, in particular now that new terminals with limited resources (PDAs, HCCs, smartphones) have access to the network. In this paper we propose a unified framework to define event-based and object-based semantic extraction from video, providing both semantic annotation of stored video and semantic on-line transcoding from live cameras. Two case studies (highlight extraction from soccer videos for annotation, and people behavior detection in a domotic application for transcoding) and the corresponding experimental results are reported.

2003 - Proceedings of 1st Workshop on “In-Vehicle (Cognitive) Computer Vision Systems” [Curatela]
R. Cucchiara; M. Trivedi; A. Prati
2003 - Semantic video transcoding using classes of relevance [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea
In this work we present a framework for on-the-fly video transcoding that exploits computer vision-based techniques to adapt the Web access to the user requirements. The proposed transcoding approach aims at coping with both user bandwidth and resources capabilities, and with user interests in the video's content. We propose an object-based semantic transcoding that, according to the user-defined classes of relevance, applies different transcoding techniques to the objects segmented in a scene. Object extraction is provided by on-the-fly video processing, without manual annotation. Multiple transcoding policies are reviewed and a performance evaluation metric based on the Weighted Mean Square Error (and corresponding PSNR), that takes into account the perceptual user requirements by means of classes of relevance, is defined. Results are analyzed by varying transcoding techniques, bandwidth requirements and video types (with indoor and outdoor scenes), showing that the use of semantics can dramatically improve the bandwidth to distortion ratio.
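
A relevance-weighted error metric of this kind can be sketched as below. The abstract does not give the exact formula, so this is one plausible reading: the mean square error is computed separately for each class of relevance and the per-class values are combined with user-chosen weights, normalised over the classes actually present. The function names, the class labels and the weights are assumptions made for illustration.

```python
import math

def weighted_mse(original, transcoded, labels, weights):
    """Combine per-class mean square errors using the weights of the classes present."""
    sq_err = {c: 0.0 for c in weights}
    count = {c: 0 for c in weights}
    for o, t, c in zip(original, transcoded, labels):
        sq_err[c] += (o - t) ** 2
        count[c] += 1
    present = [c for c in weights if count[c]]
    total_weight = sum(weights[c] for c in present)
    return sum(weights[c] * sq_err[c] / count[c] for c in present) / total_weight

def weighted_psnr(original, transcoded, labels, weights, peak=255):
    """PSNR in dB derived from the weighted MSE (infinite for identical signals)."""
    wmse = weighted_mse(original, transcoded, labels, weights)
    return float("inf") if wmse == 0 else 10.0 * math.log10(peak ** 2 / wmse)

# Toy example: four grey-level pixels, two hypothetical classes of relevance
original = [100, 100, 100, 100]
labels = ["person", "person", "background", "background"]
weights = {"person": 0.9, "background": 0.1}

errors_on_background = [100, 100, 110, 90]
errors_on_person = [110, 90, 100, 100]

wmse_background = weighted_mse(original, errors_on_background, labels, weights)
wmse_person = weighted_mse(original, errors_on_person, labels, weights)
```

Both degraded signals have the same plain MSE, but the weighted value penalises errors on the high-relevance class more heavily, which is the behaviour a semantic transcoder is meant to exploit.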

2003 - Steering wheel's angle tracking from camera-car [Relazione in Atti di Convegno]
R. CUCCHIARA; A. PRATI; F. VIGETTI
This paper proposes a general-purpose method to track the steering wheel’s absolute angle by using a single camera vision system mounted inside the car. The approach is based on the modeling of the motion of the steering wheel, as it appears perspectively distorted from the point of view of the uncalibrated camera. We modified the Lucas-Kanade method for an approximately rotational motion model in order to provide the detection and tracking of significant features on the wheel. The experimental results are compared with ground-truth data obtained with different types of sensors.

2003 - Tuning range image segmentation by genetic algorithm [Articolo su rivista]
G. Pignalberi; R. Cucchiara; L. Cinque; S. Levialdi
Several range image segmentation algorithms have been proposed, each one to be tuned by a number of parameters in order to provide accurate results on a given class of images. Segmentation parameters are generally affected by the type of surfaces (e.g., planar versus curved) and the nature of the acquisition system (e.g., laser range finders or structured light scanners). It is impossible to answer the question: which is the best set of parameters, given a range image within a class and a range segmentation algorithm? Systems proposing such a parameter optimization are often based either on careful selection or on solution space-partitioning methods. Their main drawback is that they have to limit their search to a subset of the solution space to provide an answer in acceptable time. In order to provide a different automated method to search a larger solution space, and possibly to answer the above question more effectively, we propose a tuning system based on genetic algorithms. A complete set of tests was performed over a range of different images and with different segmentation algorithms. Our system provided a particularly high degree of effectiveness in terms of segmentation quality and search time.

2002 - A Decision Support System for Range Image Segmentation [Relazione in Atti di Convegno]
L. CINQUE; R. CUCCHIARA; S. LEVIALDI; G. PIGNALBERI

2002 - A Framework for Semantic Video Transcoding [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati
In this work we present a transcoding framework and an object-based technique to adapt live and stored videos to the user's bandwidth and resource capabilities. Multiple transcoding policies are reviewed, and a performance evaluation metric based on the Weighted Mean Square Error that allows different classes of relevance is presented. We present results for different transcoding policies and for different bandwidth requirements, showing that the use of semantics can improve the bandwidth-to-distortion ratio.

2002 - Building the Topological Tree by Recursive FCM Color Clustering [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati; S. Seidenari; G. Pellacani
In this paper we define a Topological Tree (TT) as a knowledge representation method that aims to describe important visual and spatial features of image regions, namely the color similarity, the inclusion and the spatial adjacency. The topological tree exhibits some interesting properties that can be exploited to extract knowledge from images for information retrieval, image understanding and diagnosis purposes. Examples of applications in dermatology are described. The TT can be constructed after segmentation, by computing the spatial relationships of regions or can be generated directly during the segmentation: to this aim we present a novel recursive fuzzy c-means (FCM) clustering algorithm based on the Principal Component Analysis of the color space. The recursive FCM proves to be effective for underlining the adjacency and inclusion property of regions.

2002 - Data-type Dependent Cache Prefetching for MPEG Applications [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
Data cache prefetching is an effective technique to improve the performance of cache memories, provided that the prefetching algorithm is able to correctly predict useful data to be prefetched. To this aim, adequate information on the program’s data locality must be used by the prefetching algorithm. In particular, multimedia applications are characterized by a substantial amount of image and video processing, which exhibits spatial locality in both dimensions of the 2D data structures used for images and frames. However, in multimedia programs many memory references are also made to non-image data, characterized by standard spatial locality. In this work, we explore the adoption of different prefetching techniques depending on the data type (i.e., image and non-image), thus making it possible to tune the prefetching algorithms to the different forms of locality and to achieve overall performance optimization. In order to prevent interference between the two data types, a split cache with two separate caches for image and non-image data is also evaluated as an alternative to a standard unified cache. Results on a multimedia workload (MPEG-2 and MPEG-4 decoders) show that standard prefetching techniques such as One-block-lookahead and the Stride Prediction Table are effective for standard data, while novel 2D prefetching techniques perform best on image data. In addition, at equal size, unified caches generally offer better performance than split caches, thanks to the more flexible allocation of a unified cache space.
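To make the One-block-lookahead idea concrete, here is a minimal sketch of a cache simulator that counts demand misses with and without prefetching. It assumes a fully associative LRU cache and the prefetch-on-miss variant of One-block-lookahead; the function name `count_misses` and the default sizes are illustrative and do not come from the paper.

```python
from collections import OrderedDict

def count_misses(addresses, block_size=4, cache_blocks=8, prefetch=False):
    """Count demand misses for a fully associative LRU cache.

    addresses: iterable of integer memory addresses.
    prefetch: if True, every miss on block b also loads block b + 1.
    """
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU

    def touch(block):
        # Insert or refresh a block, evicting the least recently used one if full.
        if block in cache:
            cache.move_to_end(block)
        else:
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)
            cache[block] = None

    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block not in cache:
            misses += 1
            touch(block)
            if prefetch:
                touch(block + 1)  # one-block lookahead on a miss
        else:
            touch(block)
    return misses
```

On a purely sequential scan the lookahead halves the demand misses, because every second block is already present when it is first referenced. A column-wise walk over a 2D image does not touch consecutive blocks, which is the kind of access pattern that motivates 2D prefetching.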

2002 - Detecting Moving Objects and their Shadows: An Evaluation with the PETS2002 Dataset [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati
This work presents a general-purpose method for moving visual object segmentation in videos and discusses results attained on sequences of the PETS2002 datasets. The proposed approach, called Sakbot, exploits color and motion information to detect objects, shadows and ghosts, i.e. foreground objects with apparent motion. The method is based on background suppression in the color space. The main peculiarity of the approach is the exploitation of motion and shadow information to selectively update the background, improving the statistical background model with the knowledge of detected objects. The approach is able to detect Moving Visual Objects (MVOs), and stopped objects too, since the motion status is maintained at the level of the tracking module. The HSV color space is exploited for shadow detection in order to enhance both segmentation and background update. Time measures and a precision performance analysis in tracking and counting people are provided for surveillance and monitoring purposes.

2002 - Development of a new program for image analysis of digital videomicroscopic images of pigmented skin lesions [Abstract in Rivista]
S. Seidenari; G. Pellacani; C. Grana; R. Cucchiara
Although an improvement of the diagnostic accuracy of pigmented skin lesions (PSL) has been achieved by the epiluminescence technique (ELM), the interpretation of ELM criteria is often confusing, especially for inexperienced observers. To enhance the reproducibility and accuracy of clinical judgement and the training of inexperienced operators, programs for PSL image analysis and algorithms for automatic diagnosis have been developed. The aim of our study was to develop a new program for PSL image analysis, able to describe different aspects of PSLs, and to test its descriptive capability on PSLs acquired by means of a digital videomicroscope (VMS 110A, Scalar Mitsubishi, Japan) using 20-fold magnification. After automatic border identification and barycentre determination, some geometric parameters describing shape characteristics of the lesion were calculated. A mathematical description of the border cut-off was obtained. The texture of the lesion was calculated by applying the co-occurrence matrix at different image resolutions. Dark areas and colour areas, referring to selected colour groups, were obtained and their aspect and distribution were mathematically defined and calculated. 281 common nevi and 117 melanomas were numerically described by our program, and the capability of the mathematical parameters to distinguish between benign and malignant lesions was tested by means of discriminant analysis. Significant differences were observed for most parameters between different PSL populations. The automatic classification enabled the distinction between melanomas and nevi with 100% sensitivity and 82.9% specificity.

2002 - Exploiting color and topological features for region segmentation with recursive fuzzy c-means [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Seidenari, Stefania; Pellacani, Giovanni
In this paper we define a novel approach for image segmentation into regions which focuses on both visual and topological cues, namely color similarity, inclusion and spatial adjacency. Many color clustering algorithms have been proposed in the past for skin lesion images, but none explicitly exploits the inclusion properties between regions. Our algorithm is based on a recursive version of the fuzzy c-means (FCM) clustering algorithm in the 2D color histogram constructed by Principal Component Analysis (PCA) of the color space. The distinctive feature of the proposal is that recursion is guided by the evaluation of adjacency and mutual inclusion properties of extracted regions; the recursive analysis then addresses only included regions or regions with a non-negligible size. This approach allows a coarse-to-fine segmentation which focuses the attention on the inner parts of the images, in order to highlight the internal structure of the object depicted in the image. This could be particularly useful in many applications, especially in biomedical image analysis. In this work we apply the technique to the segmentation of skin lesions in dermatoscopic images. It could be a suitable support for the diagnosis of skin melanoma, since dermatologists are interested in the analysis of the spatial relations, the symmetrical positions and the inclusion of regions.

2002 - Iterative fuzzy clustering for detecting regions of interest in skin lesions [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Piccardi, Massimo
Image analysis tools are spreading in dermatology since the introduction of dermoscopy (epiluminescence microscopy), in the effort of algorithmically reproducing clinical evaluations. Color-based region segmentation of skin lesions is one of the key steps for correctly collecting statistics that can help clinicians in their diagnosis. Nevertheless, an efficient and accurate region segmentation algorithm has not been proposed in the literature yet. This work proposes an iterative fuzzy c-means clustering algorithm based on PCA with the Karhunen-Loève transform of the color space. A topological tree is provided to store the mutual inclusions of the regions and then used to summarize the structural properties of the skin lesion. Preliminary experimental results are presented and discussed.
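For readers unfamiliar with fuzzy c-means, the following is a minimal sketch of the standard algorithm on one-dimensional data. It is not the iterative, PCA-based variant proposed in the paper: the function name `fuzzy_c_means`, the fuzzifier `m = 2` and the fixed iteration count are assumptions made for illustration only.

```python
def fuzzy_c_means(points, centers, m=2.0, iterations=20):
    """Standard fuzzy c-means on 1-D data; returns (centers, memberships).

    memberships[k][i] is the degree to which points[k] belongs to cluster i.
    """
    centers = list(centers)
    memberships = []
    power = 2.0 / (m - 1.0)
    for _ in range(iterations):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
        memberships = []
        for x in points:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # A point lying exactly on a center belongs fully to it.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(row)
                row = [u / total for u in row]
            else:
                row = [1.0 / sum((d_i / d_j) ** power for d_j in dists)
                       for d_i in dists]
            memberships.append(row)
        # Center update: mean of the points weighted by u_ik^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total > 0.0:
                centers[i] = sum(w * x for w, x in zip(weights, points)) / total
    return centers, memberships
```

Unlike hard k-means, every point keeps a graded membership in every cluster, which is what allows pixels on the boundary between two color regions to be treated as ambiguous rather than forced into one region.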

2002 - Performance analysis of MPEG-4 decoder and encoder [Relazione in Atti di Convegno]
F. CAVALLI; R. CUCCHIARA; M. PICCARDI; A. PRATI
In this paper, a performance analysis of MPEG-4 encoder and decoder programs on standard personal computer is presented. The paper first describes the MPEG-4 computational load and discusses related works, then outlines the performance analysis. Experimental results show that while the decoder program can be easily executed in real time, the encoder requires execution times in the order of seconds per frame which call for substantial optimisation to satisfy the real-time constraints.

2002 - Semantic Transcoding for Live Video Server [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; A. Prati
In this paper we present transcoding techniques for a video server architecture that enables the user to access live video streams by using different devices with different capabilities. For live videos, annotation methods cannot be exploited. Instead we propose methods of on-the-fly transcoding that adapt the video content with respect to the user resources and the video semantics. Thus we propose an object-based transcoding with "classes of relevance" (for instance People, Face and Background). To compare the different strategies we propose a metric based on the Weighted Mean Square Error that allows the analysis of different application scenarios by means of a class-wise distortion measure. The obtained results show that the use of semantics can significantly improve the bandwidth-to-distortion ratio.
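One plausible reading of a class-wise Weighted Mean Square Error is sketched below: the MSE is computed separately for the pixels of each relevance class and then combined with per-class weights. The exact definition used in the paper is not given in the abstract, so the normalisation, the function name `weighted_mse` and the weights in the example are assumptions.

```python
def weighted_mse(original, transcoded, class_map, weights):
    """Weighted mean square error over pixels grouped by relevance class.

    original, transcoded: equally sized lists of grey-level pixel values.
    class_map: per-pixel class label (for instance "people" or "background").
    weights: hypothetical relevance weight for each class label.
    Returns (weighted MSE, dictionary of per-class MSE).
    """
    if not (len(original) == len(transcoded) == len(class_map)):
        raise ValueError("inputs must have the same length")
    # Accumulate squared error and pixel count per class.
    sq_err, count = {}, {}
    for o, t, c in zip(original, transcoded, class_map):
        sq_err[c] = sq_err.get(c, 0.0) + (o - t) ** 2
        count[c] = count.get(c, 0) + 1
    # Class-wise MSE, then a weighted sum normalised by the weights in use.
    class_mse = {c: sq_err[c] / count[c] for c in sq_err}
    total_w = sum(weights[c] for c in class_mse)
    wmse = sum(weights[c] * class_mse[c] for c in class_mse) / total_w
    return wmse, class_mse
```

With a higher weight on the People class than on Background, a transcoding policy that degrades only the background scores better than one that spreads the same distortion over the whole frame.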

2002 - Using the Topological Tree for skin lesion structure description [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana
In this work we describe the Topological Tree (TT) as a knowledge representation method that relates some important visual and spatial features of image regions, namely the color similarity, the inclusion and the spatial adjacency. Starting from a color-based segmentation of an image into disjoint regions, their spatial relationships can be derived and described with graph-based methods. We are interested in the region’s property “to be included into” (in the sense of “surrounded by”) another region. This property could be very useful in biomedical imaging and in particular in the diagnosis of skin melanoma. The TT can be constructed after segmentation, by computing the spatial relationships of regions, or can be generated directly during the segmentation: to this aim we present a novel recursive fuzzy c-means (FCM) clustering algorithm based on the PCA of the color space. In the paper, in addition to the TT definition and the construction algorithm description, some results are presented and discussed.

2001 - A Methodology to Award a Score to Range Image Segmentation [Relazione in Atti di Convegno]
L. CINQUE; R. CUCCHIARA; S. LEVIALDI; G. PIGNALBERI

2001 - An Application of Machine Learning and Statistics to Defect Detection [Articolo su rivista]
R. CUCCHIARA; P. MELLO; M. PICCARDI; F. RIGUZZI
We present an application of machine learning and statistics to the problem of distinguishing between defective and non-defective industrial workpieces, where the defect takes the form of a long and thin crack on the surface of the piece. From the images of the pieces a number of features are extracted by using the Hough transform and the Correlated Hough transform. Two datasets are considered, one containing only features related to the Hough transform and the other also containing features related to the Correlated Hough transform. On these datasets we have compared six different learning algorithms: an attribute-value learner (C4.5), a backpropagation neural network (NeuralWorks Predict), a k-nearest neighbour algorithm, and three statistical techniques (linear, logistic and quadratic discriminant). The experiments show that C4.5 performs best for both feature sets and gives an average accuracy of 93.3% for the first dataset and 95.9% for the second dataset.

2001 - Analysis and detection of shadows in video streams: a comparative evaluation [Relazione in Atti di Convegno]
A. PRATI; R. CUCCHIARA; I. MIKIC; MM TRIVEDI
Robustness to changes in illumination conditions as well as viewing perspectives is an important requirement for many computer vision applications. One of the key factors in enhancing the robustness of dynamic scene analysis is an accurate and reliable means of shadow detection. Shadow detection is critical for correct object detection in image sequences. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, the full range of problems underlying shadow detection is identified and discussed. We classify the proposed solutions to this problem using a taxonomy of four main classes: deterministic model-based, deterministic non-model-based, statistical parametric and statistical nonparametric. Novel quantitative metrics (detection and discrimination accuracy) and qualitative metrics (scene and object independence, flexibility to shadow situations and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences.

2001 - Comparative Evaluation of Moving Shadow Detection Algorithms [Relazione in Atti di Convegno]
A. PRATI; I. MIKIC; R. CUCCHIARA; M.M. TRIVEDI
Moving shadows need careful consideration in the development of robust dynamic scene analysis systems. Moving shadow detection is critical for accurate object detection in video streams, since shadow points are often misclassified as object points, causing errors in segmentation and tracking. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, the full range of problems underlying shadow detection is identified and discussed. We present a comprehensive survey of moving shadow detection approaches. We organize contributions reported in the literature in four classes. We also present a comparative empirical evaluation of representative algorithms selected from these four classes. Quantitative metrics (detection and discrimination accuracy) and qualitative metrics (scene and object independence, flexibility to shadow situations and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences. These video sequences and associated “ground-truth” data are made available at http://cvrr.ucsd.edu:88/aton/shadow to allow others in the community to experiment with new algorithms and metrics.

2001 - Detecting objects, shadows and ghosts in video streams by exploiting color and motion information [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; M. Piccardi; A. Prati
Many approaches to moving object detection for traffic monitoring and video surveillance proposed in the literature are based on background suppression methods. How to correctly and efficiently update the background model and how to deal with shadows are two of the most distinguishing and challenging features of such approaches. This work presents a general-purpose method for the segmentation of moving visual objects (MVOs) based on an object-level classification into MVOs, ghosts and shadows. Background suppression needs a background model to be estimated and updated: we use motion and shadow information to selectively exclude MVOs and their shadows from the background model, while retaining ghosts. The color information (in the HSV color space) is exploited for shadow suppression and, consequently, to enhance both MVO segmentation and background update.

2001 - Enhancing Implant Surgery Planning via Computerized Image Processing [Articolo su rivista]
R. Cucchiara; F. Franchini; A. Lamma; E. Lamma; T. Sansoni; E. Sarti
Computerized tomography (CT) and magnetic resonance imaging (MRI) are the medical imaging modalities that deliver cross-sectional images of the human body. In the last decade, CT has become the most frequently used imaging modality for the evaluation of the jaw for dental implants (see [Rot98]). Furthermore, image reformatting software has been developed in order to obtain a correct pre-operative diagnosis and treatment planning regarding osseointegrated implants (see, for instance, [CSIa] and [CSIb]). Previous work (see [Cla90]) has shown that CT images are affected by a distortion ratio from 0 to 6 percent. This might be due to the alignment of the patient during the scanning, to his/her movements and finally to the saturation of pixels composing the image. To address the first cause, intraoral stents can be used for centering the patient’s head perpendicularly to the axis of the implant to be installed. However, when more than one implant has to be installed, possibly with very different axes, it is better not to expose the patient to multiple CT scans, which would be necessary to obtain different CT acquisitions, each one perpendicular to the axis of one of the planned teeth. In this work, we present a software approach for enhancing implant surgical planning in order to get exact morphological measurements of the bone and planned teeth from a single CT acquisition. This is achieved by applying image-processing techniques to the original CT images, in order to produce new CT images lying on different planes, and possibly perpendicular to a different tooth. The resulting software system (named DentalVox) has been implemented in C++ and runs on Intel-based personal computers under the Windows operating system. DentalVox ensures better mechanical results in the design and planning of a dental implant with respect to other similar software tools, being able to reconstruct axial (and panorex and cross-sectional) images once any direction is chosen. This allows a better mechanical and aesthetic fit of the implanted prosthesis to the underlying jaw bone morphology.

2001 - From eager to lazy constrained data acquisition: a general framework [Articolo su rivista]
P. Mello; M. Milano; G. Gavanelli; E. Lamma; M. Piccardi; R. Cucchiara
Constraint Satisfaction Problems (CSPs) (17) are an effective framework for modeling a variety of real-life applications, and many techniques have been proposed for solving them efficiently. CSPs are based on the assumption that all constrained data (values in variable domains) are available at the beginning of the computation. However, many non-toy problems derive their parameters from an external environment. Data retrieval can be a hard task, because data can come from a third-party system that has to convert information encoded with signals (derived from sensors) into symbolic information (exploitable by a CSP solver). Also, data can be provided by the user or have to be queried from a database. For this purpose, we introduce an extension of the widely used CSP model, called the Interactive Constraint Satisfaction Problem (ICSP) model. The variable domain values can be acquired when needed during the resolution process by means of Interactive Constraints, which retrieve (possibly consistent) information. A general framework for constraint propagation algorithms is proposed which is parametric in the number of acquisitions performed at each step. Experimental results show the effectiveness of the proposed approach. Some applications which can benefit from the proposed solution are also discussed.

2001 - Improving shadow suppression in moving object detection with HSV color information [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; M. Piccardi; A. Prati; S. Sirotti
Video-surveillance and traffic analysis systems can be greatly improved using vision-based techniques able to extract, manage and track objects in the scene. However, problems arise due to shadows. In particular, moving shadows can affect the correct localization, measurement and detection of moving objects. This work presents a technique for shadow detection and suppression used in a system for moving visual object detection and tracking. The major novelty of the shadow detection technique is the analysis carried out in the HSV color space to improve the accuracy in detecting shadows. Signal processing and optical motivations of the proposed approach are described. The integration and exploitation of the shadow detection module in the system are outlined, and experimental results are shown and evaluated.
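The intuition behind working in HSV is that a cast shadow mostly darkens the background (lower value) while leaving hue and saturation nearly unchanged. The sketch below encodes that intuition as a per-pixel test. The abstract does not state the actual rule, so the three conditions, the function name `is_shadow` and every threshold value are assumptions for illustration.

```python
import colorsys

def is_shadow(pixel_rgb, background_rgb, alpha=0.4, beta=0.9, tau_s=0.1, tau_h=0.1):
    """Return True if the pixel looks like a shadow cast on the background.

    pixel_rgb, background_rgb: (r, g, b) tuples with components in [0, 1].
    alpha, beta: assumed bounds on the value ratio pixel / background.
    tau_s, tau_h: assumed tolerances on saturation and hue differences.
    """
    h_i, s_i, v_i = colorsys.rgb_to_hsv(*pixel_rgb)
    h_b, s_b, v_b = colorsys.rgb_to_hsv(*background_rgb)
    if v_b == 0:
        return False  # a black background cannot become darker
    ratio = v_i / v_b
    # Hue is an angle, so measure the distance around the circle.
    dh = abs(h_i - h_b)
    dh = min(dh, 1.0 - dh)
    return alpha <= ratio <= beta and abs(s_i - s_b) <= tau_s and dh <= tau_h
```

The upper bound `beta` keeps small brightness fluctuations of the background from being labelled as shadow, while the lower bound `alpha` rejects dark objects that are much darker than any plausible shadow.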

2001 - Iterative fuzzy clustering for detecting regions of interest in skin lesions [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; M. Piccardi
Image analysis tools are spreading in dermatology since the introduction of dermoscopy (epiluminescence microscopy), in the effort of algorithmically reproducing clinical evaluations. Color-based region segmentation of skin lesions is one of the key steps for correctly collecting statistics that can help clinicians in their diagnosis. Nevertheless, an efficient and accurate region segmentation algorithm has not been proposed in the literature yet. This work proposes an iterative fuzzy c-means clustering algorithm based on PCA with the Karhunen-Loève transform of the color space. A topological tree is provided to store the mutual inclusions of the regions and then used to summarize the structural properties of the skin lesion. Preliminary experimental results are presented and discussed.

2001 - Temporal analysis of cache prefetching strategies for multimedia applications [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
Prefetching is a widely adopted technique for improving the performance of cache memories. Performance is typically affected by design parameters, such as cache size and associativity, but also by the type of locality embodied in the programs. In particular, multimedia tools and programs handling images and video are characterized by a bi-dimensional spatial locality that could be greatly exploited by the inclusion of prefetching in the cache architecture. In this paper we compare some prefetching techniques for multimedia programs (such as MPEG compression, image processing, visual object segmentation) by performing a detailed evaluation of the memory access time. The goal is to prove that a significant speedup can be achieved by using either standard prefetching techniques (such as OBL or adaptive prefetching) or some innovative, image-oriented prefetching methods, like the neighbor prefetching described in the paper. Performance is measured with the PRIMA trace-driven simulator.

2001 - The Sakbot system for moving object detection and tracking [Capitolo/Saggio]
Cucchiara, Rita; Grana, Costantino; Neri, Gianni; Piccardi, Massimo; Prati, Andrea
This paper presents Sakbot, a system for moving object detection in traffic monitoring and video surveillance applications. The system is endowed with robust and efficient detection techniques, whose main features are the statistical and knowledge-based background update and the use of HSV color information for shadow suppression. Tracking is provided by a symbolic reasoning module allowing flexible object tracking over a variety of different applications. This system proves effective in many different situations, both from the point of view of the scene appearance and the purpose of the application.

2001 - The Sakbot system for moving object detection and tracking [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; G. Neri; M. Piccardi; A. Prati
This paper presents Sakbot, a system for moving object detection and tracking in traffic monitoring and video surveillance applications. The system is endowed with robust and efficient detection techniques, whose main features are the statistical and knowledge-based background update and the use of HSV color information for shadow suppression. Tracking is performed by means of a flexible tracking module based on symbolic reasoning, which can be tuned to several different applications.

2000 - An Application of Machine Learning and Statistics to Defect Detection [Relazione in Atti di Convegno]
R. CUCCHIARA; P. MELLO; M. PICCARDI; F. RIGUZZI

2000 - Computational Models for Machine Vision on Shared Memory Multiprocessors [Articolo su rivista]
A. CALLIPO; R. CUCCHIARA; M. PICCARDI
Different tasks in image processing exhibit different computational requirements that should be considered with respect to the architecture. This is particularly critical in parallel machines, where many parallelization techniques, such as data partitioning and mapping on processors, use of shared memory space, and exploitation of pipelining with prefetching, dramatically affect performance and depend strongly on algorithm and architectural parameters. The paper defines computational models for tightly-coupled multiprocessors with crossbar architecture, both for data-parallel local algorithms and for global algorithms such as spatial transformations. To overcome the intrinsic memory limitations of low-cost, highly integrated systems, the paper proposes to extend the classical block processing model by also analytically modeling the case of multiple processing stages. The models have been compared in detail and have been efficiently adopted for optimizing performance in block processing on crossbar multiprocessors for low-level computer vision applications.

2000 - Focus based Feature Extraction for Pallets Recognition [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
Visual recognition for object grasping is a well-known challenge for robot automation in industrial applications. A typical example is pallet recognition in an industrial environment for automated pick-and-place processes. The aim of vision and reasoning algorithms is to help robots choose the best location of the pallet holes. This work proposes an application-based approach that fulfils all requirements, dealing with every possible kind of occlusion and lighting situation. Even some “meaning noise” (or “meaning misunderstanding”) is considered. A pallet model, with limited degrees of freedom, is described and, starting from it, a complete approach to pallet recognition is outlined. In the model we define both virtual and real corners, which are geometrical object properties computed by different image analysis operators. Real corners are perceived by processing brightness information directly from the image, while virtual corners are inferred at a higher level of abstraction. A final reasoning stage selects the best solution fitting the model. Experimental results and performance are reported in order to demonstrate the suitability of the proposed approach.

2000 - Hardware prefetching techniques for cache memories in multimedia applications [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches partially miss the potential performance improvement, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including convolutions with kernels, MPEG-2 decoding, and edge chain coding.

2000 - Image Analysis and Rule-Based Reasoning for a Traffic Monitoring [Articolo su rivista]
R. CUCCHIARA; M. PICCARDI; P. MELLO
The paper presents an approach for detecting vehicles in urban traffic scenes by means of rule-based reasoning on visual data. The strength of the approach is its formal separation between the low-level image processing modules (used for extracting visual data under various illumination conditions) and the high-level module, which provides a general-purpose knowledge-based framework for tracking vehicles in the scene. The image-processing modules extract visual data from the scene by spatio-temporal analysis during daytime, and by morphological analysis of headlights at night. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e., vehicles and their attributes (area, pattern, direction, and others), and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high-level module and the low-level image analysis techniques provides the system with flexibility and robustness.

2000 - Improving data prefetching efficacy in multimedia applications [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results (both on efficiency and on efficacy of the proposed approach) are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

2000 - Optimal Range Segmentation Parameters Through Genetic Algorithms [Relazione in Atti di Convegno]
L. CINQUE; R. CUCCHIARA; S. LEVIALDI; S. MARTINZ; G. PIGNALBERI
A wide number of algorithms for surface segmentation in range images have been recently proposed characterized by different approaches (edge filling, region growing, ...), different surface types (either for planar or curved surfaces) and different parameters involved. Optimization of the parameter set is a particularly critical task since the range of parameter variability is often quite large: parameter selection depends on surface type, sensors and the required speed which strongly affect performance. A framework for parameters optimization is proposed based on genetic algorithms. Such algorithms allow a general approach that has been successfully applied on different state-of-the-art segmenters and different range image databases.

2000 - Scuola "La Visione delle Macchine" [Curatela]
R. Cucchiara; M. Piccardi; A. Prati
2000 - Statistic and knowledge-based moving object detection in traffic scenes [Relazione in Atti di Convegno]
R. Cucchiara; C. Grana; M. Piccardi; A. Prati
The most common approach to vision-based traffic surveillance consists of a fast segmentation of moving visual objects (MVOs) in the scene, together with an intelligent reasoning module capable of identifying, tracking and classifying the MVOs depending on the system goal. In this paper we describe our approach to MVO segmentation in an unstructured traffic environment. We consider complex situations with moving people, vehicles and infrastructures that have different aspect and motion models. For this case we define a specific approach based on background subtraction with statistical and knowledge-based background update. We show many results of real-time tracking of traffic MVOs in outdoor traffic scenes such as roads, parking area intersections, and entrances with barriers.
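
As an illustration only, background subtraction with a selective update can be sketched as below. The abstract does not give the update rule, so the running average, the `threshold` and `alpha` parameters, and the function name `detect_and_update` are all assumptions made for this sketch and are not taken from the paper.

```python
def detect_and_update(frame, background, threshold=30, alpha=0.1):
    """Return (mask, new_background) for one grayscale frame.

    frame and background are lists of rows of numeric pixel values.
    Hypothetical rule: a pixel is 'moving' when it differs from the
    background by more than `threshold`; only non-moving pixels are
    blended into the background (selective running average).
    """
    mask, new_background = [], []
    for frame_row, bg_row in zip(frame, background):
        mask_row, new_bg_row = [], []
        for f, b in zip(frame_row, bg_row):
            moving = abs(f - b) > threshold
            mask_row.append(moving)
            # Moving pixels leave the background model untouched,
            # so a passing object is not absorbed into it.
            new_bg_row.append(b if moving else (1 - alpha) * b + alpha * f)
        mask.append(mask_row)
        new_background.append(new_bg_row)
    return mask, new_background
```

Calling the function once per frame and feeding `new_background` back in as the next `background` gives a model that follows slow illumination changes while ignoring pixels flagged as moving.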

1999 - 3D Object Recognition by VC-graphs and Interactive Constraint Satisfaction [Relazione in Atti di Convegno]
R. CUCCHIARA; E. LAMMA; P. MELLO; M. MILANO; M. PICCARDI
We propose a novel approach for recognizing 3D CAD-made objects in complex range images containing several overlapped and different objects. Objects are modeled by a graph whose nodes are surfaces and arcs are surface relations. We propose an object-centered graph model, called Visual Constraint graph (VC-graph), with special visual constraints modeling occlusions between object surfaces. The VC-graph is used for recognizing objects from each possible point of view, instead of evaluating many different single-view graphs. The reasoning engine is based on an original extension of the Constraint Satisfaction Problem (CSP) paradigm, called Interactive CSP (ICSP). CSP requires the acquisition of all surfaces before starting constraint propagation; instead, ICSP guides the acquisition of new surfaces only on-demand, without computing useless information and focussing attention only on significant image parts.

1999 - Constraint Propagation and Value Acquisition: why we should do it Interactively [Relazione in Atti di Convegno]
E. LAMMA; P. MELLO; M. MILANO; R. CUCCHIARA; G. GAVANELLI; M. PICCARDI
In Constraint Satisfaction Problems (CSPs), values belonging to variable domains should be completely known before the constraint propagation process starts. In many applications, however, the acquisition of domain values is a computationally expensive process, or some domain values may not be available at the beginning of the computation. For this purpose, we introduce an Interactive Constraint Satisfaction Problem (ICSP) model as an extension of the widely used CSP model. The variable domain values can be acquired when needed during the resolution process by means of Interactive Constraints, which retrieve (possibly consistent) information. Experimental results on randomly generated CSPs and for 3D object recognition show the effectiveness of the proposed approach.
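
The idea of acquiring domain values only when the search needs them can be illustrated with a small sketch. This is a plain backtracking search with lazily acquired domains, written under assumptions of our own; it is not the ICSP propagation algorithm of the paper, and the names `solve_interactive`, `acquire` and `consistent` are hypothetical.

```python
def solve_interactive(variables, acquire, consistent):
    """Backtracking search that pulls domain values on demand.

    acquire(var) returns an iterator producing values for var one at a time
    (standing in for an expensive acquisition step).
    consistent(assignment) returns True when the partial assignment
    violates no constraint.
    Returns (solution or None, number of values acquired per variable).
    """
    cache = {v: [] for v in variables}            # values acquired so far
    sources = {v: acquire(v) for v in variables}  # lazy producers

    def values(var):
        # Replay already-acquired values, then acquire new ones only if needed.
        i = 0
        while True:
            if i == len(cache[var]):
                try:
                    cache[var].append(next(sources[var]))
                except StopIteration:
                    return
            yield cache[var][i]
            i += 1

    def backtrack(idx, assignment):
        if idx == len(variables):
            return dict(assignment)
        var = variables[idx]
        for val in values(var):
            assignment[var] = val
            if consistent(assignment):
                result = backtrack(idx + 1, assignment)
                if result is not None:
                    return result
            del assignment[var]
        return None

    solution = backtrack(0, {})
    return solution, {v: len(cache[v]) for v in variables}
```

The second return value makes the saving visible: when a solution is found early, most of each domain is never acquired.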

1999 - Eliciting Visual Primitives for Detecting Elongated Shapes [Articolo su rivista]
R. CUCCHIARA; M. PICCARDI
1999 - Exploiting Cache in Multimedia [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI
The paper explores cache strategies for multimedia. Although many architectural improvements have been designed for multimedia, the cache structure and the standard caching policies of general-purpose processors exhibit poor performance in exploiting the 2D spatial locality typical of programs handling and processing images. In this paper we propose a novel caching approach suitably tailored to the requirements of multimedia programs. Our proposal exploits hardware pre-fetching for allocating in cache blocks of data that satisfy the 2D spatial locality requirements. Results refer to a benchmark suite of multimedia programs, including MPEG decoding and image processing programs with different data dependencies and access schemes to image data.

1999 - Extending CLP(FD) with Interactive Data Acquisition for 3D Visual Object Recognition [Relazione in Atti di Convegno]
R. Cucchiara; M. Gavanelli; E. Lamma; P. Mello; M. Milano; M. Piccardi
1999 - Image Analysis and Rule-Based Reasoning for a Traffic Monitoring [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; P. MELLO
The paper describes a system for detecting vehicles in urban traffic scenes in daytime and at night by means of image analysis and rule-based reasoning. The strength of the proposed approach is its formal separation between the low-level image processing modules (detecting moving vehicles under day and night light) and the high-level module, which provides a single framework for tracking vehicles in the scene. The image processing modules perform spatio-temporal analysis on moving templates in daytime images, and morphological analysis of headlight pairs in night images. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e. vehicles and their attributes, and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high level and low-level image analysis techniques provides the system with flexibility and robustness.

1999 - Real-time Detection of Moving Vehicles [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI; A. PRATI; N. SCARABOTTOLO
Computer vision-based traffic flow monitoring is of major importance for enforcing traffic management policies. Information such as the number of vehicles passing on a road per time unit, or vehicles' turning rates at intersections, is exploited by traffic management policies to supervise traffic-light timings. Computer vision-based traffic flow monitoring requires extraction of moving vehicles from traffic scenes in real time. To accomplish this task, efficient algorithms must be used and effective, low-cost hardware implementation must be pursued. This paper first describes the algorithms used in VTTS (Vehicular Traffic Tracking System) to achieve segmentation of moving vehicles. Then, hardware implementation on a re-programmable FPGA-based board is described in detail.

1999 - Rule-based reasoning on visual data for urban traffic monitoring [Relazione in Atti di Convegno]
R. CUCCHIARA; M. GAVANELLI; A. PRATI; M. PICCARDI
The paper describes a system for detecting vehicles in urban traffic scenes by means of rule-based reasoning on visual data. The strength of the proposed approach is its formal separation between low-level image processing modules (able to extract visual data under various illumination conditions) and the high-level module, which provides a single framework for tracking vehicles in the scene. The image processing modules extract visual data from the scene, by spatio-temporal analysis during day-time, and by morphological analysis of headlights at night. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e. vehicles and their attributes (area, pattern, direction...) and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high level and the low-level image analysis techniques provides the system with flexibility and robustness.

1999 - Segmentation of Moving Objects at Frame Rate: A Dedicated Hardware Solution [Relazione in Atti di Convegno]
R. CUCCHIARA; P. ONFIANI; A. PRATI; N. SCARABOTTOLO
Many works in image processing concern segmentation of moving objects in sequences of images. This problem is particularly critical, since it represents the first step of many complex processes of computer vision, for applications like object tracking, video-surveillance, monitoring, and autonomous navigation. In such applications, both real-time and low-cost requirements should be satisfied. To this aim we propose a dedicated hardware solution, based on reconfigurable logic, that provides motion detection and moving object segmentation at frame rate.

1999 - Vehicle Detection under Day and Night Illumination [Relazione in Atti di Convegno]
R. CUCCHIARA; M. PICCARDI
Effective detection of vehicles in urban traffic scenes can be achieved by exploiting image analysis techniques. Nevertheless, vehicle detection in daytime and at night cannot be approached with the same image analysis algorithms, due to the strongly different illumination conditions. This paper describes the two different sets of image analysis algorithms that have been used in the VTTS system (Vehicular Traffic Tracking System) for extracting vehicles from image sequences acquired in daytime and at night. In the system, a supervising level selects the set of algorithms to apply and performs vehicle tracking under control of a rule-based decision module. The paper describes the tracking module, and reports experimental results for both vehicle detection and tracking.

1998 - A real-time hardware implementation of the Hough transform [Articolo su rivista]
R. CUCCHIARA; G. NERI; M. PICCARDI
The paper presents a hardware implementation of algorithms based on the Hough transform (HT) for real-time straight line detection. In particular, the basic HT on the edge points (EHT) and the Gradient-Weighted Hough transform (GWHT) for gray-level images are analyzed in detail and implemented on a pipelined architecture using Field Programmable Gate Arrays (FPGA). Algorithm execution times are compared with other hardware and software based systems in order to assess the efficiency of the presented approach. The paper shows how the achievable performance can meet the real-time requirements of an industrial inspection application.
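
For readers unfamiliar with the basic Hough transform on edge points, a minimal software sketch follows. It uses the standard normal parameterization rho = x cos(theta) + y sin(theta); the function names, the one-degree angular resolution and the integer rounding of rho are assumptions of this sketch and say nothing about the pipelined FPGA design described in the paper.

```python
import math

def hough_accumulate(edge_points, width, height, n_theta=180):
    """Vote in (rho, theta) space for every edge point (x, y)."""
    max_rho = int(math.ceil(math.hypot(width, height)))
    # rho lies in [-max_rho, max_rho]; shift by max_rho to index the rows.
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    for x, y in edge_points:
        for t, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1
    return acc, max_rho, thetas

def strongest_line(acc, max_rho, thetas):
    """Return (rho, theta, votes) of the accumulator cell with most votes."""
    votes, r, t = max(
        (v, r, t) for r, row in enumerate(acc) for t, v in enumerate(row)
    )
    return r - max_rho, thetas[t], votes
```

Because rho is rounded to whole pixels, neighbouring theta cells can tie for the maximum, so the returned angle should be read as accurate to within a few degrees rather than exact.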

1998 - Genetic algorithms for clustering in machine vision [Articolo su rivista]
R. CUCCHIARA
The paper presents a genetic algorithm for clustering objects in images based on their visual features. In particular, a novel solution code (named Boolean Matching Code) and a corresponding reproduction operator (the Single Gene Crossover) are defined specifically for clustering and are compared with other standard genetic approaches. The paper describes the clustering algorithm in detail, in order to show the suitability of the genetic paradigm and underline the importance of effective tuning of algorithm parameters to the application. The algorithm is evaluated on some test sets and an example of its application in automated visual inspection is presented.

1998 - The Vector-Gradient Hough Transform [Articolo su rivista]
R. CUCCHIARA; F. FILICORI
The paper presents a new transform, called vector-gradient Hough transform, for identifying elongated shapes in gray-scale images. This goal is achieved not only by collecting information on the edges of the objects, but also by reconstructing their transversal profile of luminosity. The main features of the new approach are related to its vector space formulation and the associated capability of exploiting all the vector information of the luminosity gradient.