Roberto VEZZANI

Associate Professor
Department of Engineering "Enzo Ferrari"




Publications

2022 - Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps [Journal article]
Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Knowing the exact 3D location of workers and robots in a collaborative environment enables several real applications, such as the detection of unsafe situations or the study of mutual interactions for statistical and social purposes. In this paper, we propose a non-invasive and light-invariant framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera. The method can be applied to any robot without requiring hardware access to its internal states. We introduce a novel representation of the predicted pose, namely Semi-Perspective Decoupled Heatmaps (SPDH), to accurately compute 3D joint locations in world coordinates, adapting efficient deep networks designed for 2D Human Pose Estimation. The proposed approach, which takes as input a depth representation based on XYZ coordinates, can be trained on synthetic depth data and applied to real-world settings without the need for domain adaptation techniques. To this end, we present the SimBa dataset, based on both synthetic and real depth images, and use it for the experimental evaluation. Results show that the proposed approach, made of a specific depth map representation and the SPDH, overcomes the current state of the art.
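The XYZ input representation mentioned in the abstract can be obtained by back-projecting each depth pixel through the pinhole camera model. A minimal sketch (the intrinsic values and the exact representation used in the paper are hypothetical placeholders):

```python
import numpy as np

def depth_to_xyz(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into an HxWx3 map of XYZ camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy 2x2 depth map at 2 m, with made-up intrinsics
depth = np.full((2, 2), 2.0)
xyz = depth_to_xyz(depth, fx=500.0, fy=500.0, cx=0.5, cy=0.5)
print(xyz.shape)  # (2, 2, 3)
```

The third channel keeps the original depth, so a network can still read range information directly.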


2022 - Unsupervised Detection of Dynamic Hand Gestures from Leap Motion Data [Conference paper]
D'Eusanio, A.; Pini, S.; Borghi, G.; Simoni, A.; Vezzani, R.

The effective and reliable detection and classification of dynamic hand gestures is a key element for building Natural User Interfaces, systems that allow the users to interact using free movements of their body instead of traditional mechanical tools. However, methods that temporally segment and classify dynamic gestures usually rely on a great amount of labeled data, including annotations regarding the class and the temporal segmentation of each gesture. In this paper, we propose an unsupervised approach to train a Transformer-based architecture that learns to detect dynamic hand gestures in a continuous temporal sequence. The input data is represented by the 3D position of the hand joints, along with their speed and acceleration, collected through a Leap Motion device. Experimental results show promising accuracy on both the detection and the classification tasks while requiring only limited computational power, confirming that the proposed method can be applied in real-world applications.
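The speed and acceleration features described above can be derived from raw 3D joint positions with finite differences; a sketch under the assumption of uniformly spaced frames (the paper's exact feature computation may differ):

```python
import numpy as np

def motion_features(joints):
    """joints: (T, J, 3) array of 3D joint positions over T frames.
    Returns per-frame speed and acceleration via finite differences."""
    velocity = np.diff(joints, axis=0)            # (T-1, J, 3) displacement per frame
    speed = np.linalg.norm(velocity, axis=-1)     # (T-1, J) scalar speed
    acceleration = np.diff(velocity, axis=0)      # (T-2, J, 3) change of velocity
    return speed, acceleration

# A single joint moving 0.1 units per frame along x
traj = np.zeros((4, 1, 3))
traj[:, 0, 0] = [0.0, 0.1, 0.2, 0.3]
speed, acc = motion_features(traj)
print(speed.ravel())  # [0.1 0.1 0.1]
```

For real sensor timestamps the differences would be divided by the per-frame time deltas instead of assuming unit spacing.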


2021 - A Systematic Comparison of Depth Map Representations for Face Recognition [Journal article]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Maltoni, Davide; Cucchiara, Rita


2021 - Extracting accurate long-term behavior changes from a large pig dataset [Conference paper]
Bergamini, L.; Pini, S.; Simoni, A.; Vezzani, R.; Calderara, S.; D'Eath, R. B.; Fisher, R. B.

Visual observation of uncontrolled real-world behavior leads to noisy observations, complicated by occlusions, ambiguity, variable motion rates, detection and tracking errors, slow transitions between behaviors, etc. We show in this paper that reliable estimates of long-term trends can be extracted given enough data, even though estimates from individual frames may be noisy. We validate this concept using a new public dataset of more than 20 million daytime pig observations over 6 weeks of their main growth stage, and we provide annotations for various tasks, including 5 individual behaviors. Our pipeline chains detection, tracking and behavior classification, combining deep and shallow computer vision techniques. While individual detections may be noisy, we show that long-term behavior changes can still be extracted reliably, and we validate these results qualitatively on the full dataset. Ultimately, starting from raw RGB video data, we are able both to tell what the pigs' main daily activities are and how these change over time.


2021 - Improving Car Model Classification through Vehicle Keypoint Localization [Conference paper]
Simoni, Alessandro; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

In this paper, we present a novel multi-task framework which aims to improve the performance of car model classification by leveraging visual features and pose information extracted from single RGB images. In particular, we merge the visual features obtained through an image classification network and the features computed by a model able to predict the pose in terms of 2D car keypoints. We show that this approach considerably improves the performance on the model classification task by testing our framework on the subset of the Pascal3D dataset containing the car classes. Finally, we conduct an ablation study to demonstrate the performance improvement obtained with respect to a single visual classifier network.


2021 - Multi-Category Mesh Reconstruction From Image Collections [Conference paper]
Simoni, Alessandro; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a single RGB image. However, current methods are trained on image collections of a single category in order to exploit specific priors, and they often make use of category-specific 3D templates. In this paper, we present an alternative approach that infers the textured mesh of objects combining a series of deformable 3D models and a set of instance-specific deformation, pose, and texture. Differently from previous works, our method is trained with images of multiple object categories using only foreground masks and rough camera poses as supervision. Without specific 3D templates, the framework learns category-level models which are deformed to recover the 3D shape of the depicted object. The instance-specific deformations are predicted independently for each vertex of the learned 3D mesh, enabling the dynamic subdivision of the mesh during the training process. Experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner. Predicted shapes are smooth and can benefit from multiple steps of subdivision during the training process, obtaining comparable or state-of-the-art results on two public datasets. Models and code are publicly released.


2021 - RefiNet: 3D Human Pose Refinement with Depth Maps [Conference paper]
D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely precise 3D human pose from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we present a new dataset, called Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure, which largely improves the human pose estimates of off-the-shelf 2D methods.


2021 - SHREC 2021: Skeleton-based hand gesture recognition in the wild [Journal article]
Caputo, Ariel; Giachetti, Andrea; Soso, Simone; Pintani, Deborah; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Simoni, Alessandro; Vezzani, Roberto; Cucchiara, Rita; Ranieri, Andrea; Giannini, Franca; Lupinetti, Katia; Monti, Marina; Maghoumi, Mehran; LaViola Jr, Joseph; Le, Minh-Quan; Nguyen, Hai-Dang; Tran, Minh-Triet



2021 - Video Frame Synthesis combining Conventional and Event Cameras [Journal article]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output a corresponding stream of asynchronous events. Despite having multiple advantages with respect to conventional cameras, their use is limited due to the scarce compatibility of asynchronous event streams with traditional data processing and vision algorithms. In this regard, we present a framework that synthesizes RGB frames from the output stream of an event camera and an initial or a periodic set of color key-frames. The deep learning-based frame synthesis framework consists of an adversarial image-to-image architecture and a recurrent module. Two public event-based datasets, DDD17 and MVSEC, are used to obtain qualitative and quantitative per-pixel and perceptual results. In addition, we converted two additional well-known datasets, namely Kitti and Cityscapes, into event frames in order to present semantic results in terms of object detection and semantic segmentation accuracy. Extensive experimental evaluation confirms the quality of the proposed approach and its capability to synthesize frame sequences from color key-frames and sequences of intermediate events.


2020 - A Transformer-Based Network for Dynamic Hand Gesture Recognition [Conference paper]
D'Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that employing a single active depth sensor, specifically depth maps and the surface normals estimated from them, achieves state-of-the-art results, overcoming all the methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available with common RGB-D devices, such as infrared and color data. We also assess the performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.
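The surface normals used as input above can be approximated directly from a depth map with image gradients; a simplified sketch that ignores camera intrinsics (which the actual method may take into account):

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map via image gradients."""
    dz_dv, dz_du = np.gradient(depth)  # derivatives along rows (v) and columns (u)
    # The normal of the surface z = f(u, v) is proportional to (-dz/du, -dz/dv, 1)
    n = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    n /= np.linalg.norm(n, axis=-1, keepdims=True)  # unit length
    return n

# A fronto-parallel plane has normals pointing straight at the camera
flat = np.full((4, 4), 1.5)
n = normals_from_depth(flat)
print(n[0, 0])  # [0. 0. 1.]
```

Stacking the three normal components as channels gives a dense, image-like input that convolutional or transformer backbones can consume alongside the raw depth.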


2020 - Baracca: a Multimodal Dataset for Anthropometric Measurements in Automotive [Conference paper]
Pini, Stefano; D'Eusanio, Andrea; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita


2020 - Face-from-Depth for Head Pose Estimation on Depth Images [Journal article]
Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita

Depth cameras enable reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make common RGB sensors unusable. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method overcomes several recent state-of-the-art works based on both intensity and depth input data, running in real time at more than 30 frames per second.


2020 - Learn to See by Events: Color Frame Synthesis from Event and RGB Cameras [Conference paper]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output a corresponding stream of asynchronous events. Despite having multiple advantages with respect to traditional cameras, their use is partially prevented by the limited applicability of traditional data processing and vision algorithms. To this end, we present a framework which exploits the output stream of event cameras to synthesize RGB frames, relying on an initial or a periodic set of color key-frames and the sequence of intermediate events. Differently from existing work, we propose a deep learning-based frame synthesis method, consisting of an adversarial architecture combined with a recurrent module. Qualitative results and quantitative per-pixel, perceptual, and semantic evaluation on four public datasets confirm the quality of the synthesized images.


2020 - Mercury: a vision-based framework for Driver Monitoring [Conference paper]
Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

In this paper, we propose a complete framework, namely Mercury, that combines Computer Vision and Deep Learning algorithms to continuously monitor the driver during the driving activity. The proposed solution complies with the requirements imposed by the challenging automotive context. First, light invariance is needed, in order to have a system able to work regardless of the time of day and the weather conditions; therefore, infrared-based images, i.e. depth maps (in which each pixel corresponds to the distance between the sensor and that point in the scene), have been exploited in conjunction with traditional intensity images. Second, the non-invasivity of the system is required, since the driver's movements must not be impeded during the driving activity: in this context, the use of cameras and vision-based algorithms is one of the best solutions. Finally, real-time performance is needed, since a monitoring system must immediately react as soon as a situation of potential danger is detected.


2020 - Multimodal Hand Gesture Classification for the Human-Car Interaction [Journal article]
D'Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita


2019 - Driver Face Verification with Depth Maps [Journal article]
Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

Face verification is the task of checking if two provided images contain the face of the same person or not. In this work, we propose a fully-convolutional Siamese architecture to tackle this task, achieving state-of-the-art results on three publicly-released datasets, namely Pandora, High-Resolution Range-based Face Database (HRRFaceD), and CurtinFaces. The proposed method takes depth maps as input, since depth cameras have been proven to be more reliable in different illumination conditions. Thus, the system is able to work even in the case of the total or partial absence of external light sources, which is a key feature for automotive applications. From the algorithmic point of view, we propose a fully-convolutional architecture with a limited number of parameters, capable of dealing with the small amount of depth data available for training and able to run in real time even on a CPU and embedded boards. The experimental results show accuracy sufficient to allow exploitation in real-world applications with in-board cameras. Finally, exploiting the presence of faces occluded by various head garments and extreme head poses available in the Pandora dataset, we successfully test the proposed system also during strong visual occlusions. The excellent results obtained confirm the efficacy of the proposed method.
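The verification step of a Siamese pipeline like the one above reduces to thresholding a distance between the embeddings produced by the shared branches. A minimal sketch, with hypothetical embeddings and threshold (the paper's network and decision rule may differ):

```python
import numpy as np

def verify(embedding_a, embedding_b, threshold=1.0):
    """Siamese-style verification: same identity iff the embedding distance
    falls below a threshold tuned on a validation set."""
    distance = np.linalg.norm(np.asarray(embedding_a) - np.asarray(embedding_b))
    return distance < threshold

# Made-up embeddings standing in for the output of the shared convolutional branch
anchor = np.array([0.1, 0.9, 0.3])
same = np.array([0.15, 0.85, 0.32])
other = np.array([0.9, 0.1, 0.8])
print(verify(anchor, same), verify(anchor, other))  # True False
```

In practice the threshold is chosen on held-out pairs to balance false accepts against false rejects.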


2019 - Face Verification from Depth using Privileged Information [Conference paper]
Borghi, Guido; Pini, Stefano; Grazioli, Filippo; Vezzani, Roberto; Cucchiara, Rita

In this paper, a deep Siamese architecture for depth-based face verification is presented. The proposed approach efficiently verifies if two face images belong to the same person while handling a great variety of head poses and occlusions. The architecture, namely JanusNet, consists of a combination of a depth, an RGB and a hybrid Siamese network. During the training phase, the hybrid network learns to extract complementary mid-level convolutional features which mimic the features of the RGB network, while leveraging the light invariance of depth images. At testing time, the model, relying only on depth data, achieves state-of-the-art results and real-time performance, despite the lack of deep-oriented depth-based datasets.


2019 - Hand Gestures for the Human-Car Interaction: the Briareo dataset [Conference paper]
Manganaro, Fabio; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Natural User Interfaces can be an effective way to reduce the driver's inattention during the driving activity. To this end, in this paper we propose a new dataset, called Briareo, specifically collected for the hand gesture recognition task in the automotive context. The dataset is acquired from an innovative point of view, exploiting different kinds of cameras, i.e. RGB, infrared stereo, and depth, that provide various types of images and 3D hand joints. Moreover, the dataset contains a significant amount of hand gesture samples, performed by several subjects, allowing the use of deep learning-based approaches. Finally, a framework for hand gesture segmentation and classification is presented and used to assess the quality of the proposed dataset.


2019 - Manual Annotations on Depth Maps for Human Pose Estimation [Conference paper]
D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Few works tackle Human Pose Estimation on depth maps. Moreover, these methods usually rely on automatically annotated datasets, and these annotations are often imprecise and unreliable, limiting the achievable accuracy when using this data as ground truth. For this reason, in this paper we propose a tool for refining human pose annotations, expressed as body joints, and a novel set of fine joint annotations for the Watch-n-Patch dataset, collected with the proposed tool. Furthermore, we present a fully-convolutional architecture that performs body pose estimation directly on depth maps. The extensive evaluation shows that the proposed architecture outperforms the competitors in different training scenarios and is able to run in real time.
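Fully-convolutional pose networks of this kind typically output one heatmap per body joint, with the joint location read off as the heatmap peak. A sketch of that decoding step (the paper's exact output format is an assumption here):

```python
import numpy as np

def joints_from_heatmaps(heatmaps):
    """heatmaps: (J, H, W) array, one map per joint.
    Returns the (row, col) peak location of each joint heatmap."""
    j, h, w = heatmaps.shape
    flat = heatmaps.reshape(j, -1).argmax(axis=1)  # flattened index of each peak
    return np.stack([flat // w, flat % w], axis=1)

hm = np.zeros((1, 4, 5))
hm[0, 2, 3] = 1.0  # synthetic peak for a single joint
print(joints_from_heatmaps(hm))  # [[2 3]]
```

Sub-pixel accuracy is commonly recovered by fitting a local quadratic or taking a soft-argmax around the peak.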


2019 - SHREC 2019 Track: Online Gesture Recognition [Conference paper]
Caputo, F. M.; Burato, S.; Pavan, G.; Voillemin, T.; Wannous, H.; Vandeborre, J. P.; Maghoumi, M.; Taranta, E. M.; Razmjoo, A.; LaViola Jr., J. J.; Manganaro, F.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.; Nguyen, H.; Tran, M. T.; Giachetti, A.

This paper presents the results of the Eurographics 2019 SHape Retrieval Contest track on online gesture recognition. The goal of this contest was to test state-of-the-art methods that can detect command gestures online from tracked hand movements on a basic benchmark where simple gestures are performed interleaved with other actions. Unlike previous contests and benchmarks on trajectory-based gesture recognition, we proposed an online gesture recognition task, not providing pre-segmented gestures, but asking the participants to find gestures within recorded trajectories. The results submitted by the participants show that online detection and recognition of sets of very simple gestures from 3D trajectories captured with a cheap sensor can be performed effectively. The best methods proposed could therefore be directly exploited to design effective gesture-based interfaces to be used in different contexts, from Virtual and Mixed reality applications to the remote control of home devices.


2019 - Video synthesis from Intensity and Event Frames [Conference paper]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Event cameras, neuromorphic devices that naturally respond to brightness changes, have multiple advantages with respect to traditional cameras. However, the difficulty of applying traditional computer vision algorithms on event data limits their usability. Therefore, in this paper we investigate the use of a deep learning-based architecture that combines an initial grayscale frame and a series of event data to estimate the subsequent intensity frames. In particular, a fully-convolutional encoder-decoder network is employed and evaluated for the frame synthesis task on an automotive event-based dataset. Performance obtained with pixel-wise metrics confirms the quality of the images synthesized by the proposed architecture.
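Before being fed to a convolutional encoder-decoder, asynchronous events are commonly rasterized into dense "event frames"; a simplified sketch of that accumulation step (the paper's exact event representation may differ):

```python
import numpy as np

def accumulate_events(events, height, width):
    """Accumulate a list of (x, y, polarity) events into a signed event frame,
    a common dense representation consumable by convolutional networks."""
    frame = np.zeros((height, width))
    for x, y, polarity in events:  # polarity: +1 brightness up, -1 brightness down
        frame[y, x] += polarity
    return frame

events = [(1, 0, +1), (1, 0, +1), (2, 1, -1)]
frame = accumulate_events(events, height=2, width=3)
print(frame)
# [[ 0.  2.  0.]
#  [ 0.  0. -1.]]
```

Variants split the events into fixed time windows or separate positive and negative polarities into different channels.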


2018 - Deep Head Pose Estimation from Depth Data for In-car Automotive Applications [Conference paper]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Recently, deep learning approaches have achieved promising results in various fields of computer vision. In this paper, we tackle the problem of head pose estimation through a Convolutional Neural Network (CNN). Differently from other proposals in the literature, the described system works directly and only on raw depth data. Moreover, the head pose estimation is solved as a regression problem and does not rely on visual facial features such as facial landmarks. We tested our system on a well-known public dataset, Biwi Kinect Head Pose, showing that our approach achieves state-of-the-art results and is able to meet real-time performance requirements.


2018 - Deep Learning-Based Method for Vision-Guided Robotic Grasping of Unknown Objects [Conference paper]
Bergamini, Luca; Sposato, Mario; Peruzzini, Margherita; Vezzani, Roberto; Pellicciari, Marcello

Collaborative robots must operate safely and efficiently in ever-changing unstructured environments, grasping and manipulating many different objects. Artificial vision has proved to be the ideal sensing technology for collaborative robots and it is widely used for identifying the objects to manipulate and for detecting their optimal grasping. One of the main drawbacks of state-of-the-art robotic vision systems is the long training needed for teaching the identification and optimal grasps of each object, which leads to a strong reduction of the robot's productivity and overall operating flexibility. To overcome this limit, we propose an engineering method, based on deep learning techniques, for the detection of robotic grasps of unknown objects in an unstructured environment, which should enable collaborative robots to autonomously generate grasping strategies without the need for training and programming. A novel loss function for the training of the grasp prediction network has been developed and proved to work well also with low-resolution 2-D images, thus allowing the use of a single, smaller and low-cost camera that can be better integrated in robotic end-effectors. Despite the availability of less information (resolution and depth), an accuracy of 75% has been achieved on the Cornell dataset, and it is shown that our implementation of the loss function does not suffer from the common problems reported in the literature. The system has been implemented using the ROS framework and tested on a Baxter collaborative robot.


2018 - Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face [Conference paper]
Fabbri, Matteo; Borghi, Guido; Lanzi, Fabio; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita

Can faces acquired by low-cost depth sensors be useful to see some characteristic details of the faces? Typically, the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features of unknown individual faces, it is capable of reconstructing plausible faces, whose appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with some Perceptual Probes, such as face aspect classification or landmark detection. Depth faces can be used in place of the corresponding RGB images, which are often unavailable due to darkness or difficult lighting conditions. Experimental results are very promising and far better than previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications.


2018 - Fully Convolutional Network for Head Detection with Depth Images [Conference paper]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Head detection and localization are among the most investigated and demanding tasks of the Computer Vision community. They are also a key element for many disciplines, like Human Computer Interaction, Human Behavior Understanding, Face Analysis and Video Surveillance. In recent decades, many efforts have been devoted to developing accurate and reliable head or face detectors on standard RGB images, but only a few solutions concern other types of images, such as depth maps. In this paper, we propose a novel method for head detection on depth images, based on a deep learning approach. In particular, the presented system overcomes the classic sliding-window approach, which is often the main computational bottleneck of many object detectors, through a Fully Convolutional Network. Two public datasets, namely Pandora and Watch-n-Patch, are exploited to train and test the proposed network. Experimental results confirm the effectiveness of the method, which is able to exceed all the state-of-the-art works based on depth images and to run with real-time performance.


2018 - Hands on the wheel: a Dataset for Driver Hand Detection and Tracking [Conference paper]
Borghi, Guido; Frigieri, Elia; Vezzani, Roberto; Cucchiara, Rita

The ability to detect, localize and track the hands is crucial in many applications requiring the understanding of a person's behavior, attitude and interactions. This is particularly true in the automotive context, in which hand analysis allows the prediction of preparatory movements for maneuvers or the investigation of the driver's attention level. Moreover, due to the recent diffusion of cameras inside new car cockpits, it is feasible to use hand gestures to develop new Human-Car Interaction systems that are more user-friendly and safe. In this paper, we propose a new dataset, called Turms, that consists of infrared images of the driver's hands, collected from the back of the steering wheel, an innovative point of view. The Leap Motion device has been selected for the recordings, thanks to its stereo capabilities and wide view angle. In addition, we introduce a method to detect the presence and location of the driver's hands on the steering wheel during driving activity tasks.


2018 - Head Detection with Depth Images in the Wild [Conference paper]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Head detection and localization is a demanding task and a key element for many computer vision applications, like video surveillance, Human Computer Interaction and face analysis. The stunning amount of work done for detecting faces on RGB images, together with the availability of huge face datasets, has allowed very effective systems to be set up in that domain. However, due to illumination issues, infrared or depth cameras may be required in real applications. In this paper, we introduce a novel method for head detection on depth images that exploits the classification ability of deep learning approaches. Besides reducing the dependency on external illumination, depth images implicitly embed useful information to deal with the scale of the target objects. Two public datasets have been exploited: the first one, called Pandora, is used to train a deep binary classifier with face and non-face images. The second one, collected by Cornell University, is used to perform a cross-dataset test during daily activities in unconstrained environments. Experimental results show that the proposed method overcomes the performance of state-of-the-art methods working on depth images.


2018 - Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World [Conference paper]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Palazzi, Andrea; Vezzani, Roberto; Cucchiara, Rita

Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for non-visible joints. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time-linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the vastest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame: about 500,000 frames and almost 10 million body poses. Our architecture, trained on virtual data, exhibits good generalization capabilities also on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.


2018 - Learning to Generate Facial Depth Maps [Conference paper]
Pini, Stefano; Grazioli, Filippo; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

In this paper, an adversarial architecture for facial depth map estimation from monocular intensity images is presented. By following an image-to-image approach, we combine the advantages of supervised learning and adversarial training, proposing a conditional Generative Adversarial Network that effectively learns to translate intensity face images into the corresponding depth maps. Two public datasets, namely Biwi database and Pandora dataset, are exploited to demonstrate that the proposed model generates high-quality synthetic depth images, both in terms of visual appearance and informative content. Furthermore, we show that the model is capable of predicting distinctive facial details by testing the generated depth maps through a deep model trained on authentic depth maps for the face verification task.


2017 - Embedded Recurrent Network for Head Pose Estimation in Car [Relazione in Atti di Convegno]
Borghi, Guido; Gasparini, Riccardo; Vezzani, Roberto; Cucchiara, Rita
abstract

An accurate and fast estimation of the driver's head pose is a rich source of information, particularly in the automotive context. Head pose is a key element for the investigation of driver behavior, pose analysis and attention monitoring, and a useful component to improve the efficacy of Human-Car Interaction systems. In this paper, a Recurrent Neural Network is exploited to tackle the problem of driver head pose estimation, working directly and only on depth images to be more reliable in the presence of varying or insufficient illumination. Experimental results, obtained on two public datasets, namely Biwi Kinect Head Pose and the ICT-3DHP Database, prove the efficacy of the proposed method, which outperforms state-of-the-art works. Besides, the entire system is implemented and tested on two embedded boards with real-time performance.


2017 - Fast and Accurate Facial Landmark Localization in Depth Images for In-car Applications [Relazione in Atti di Convegno]
Frigieri, Elia; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

A correct and reliable localization of facial landmarks enables several applications in many fields, ranging from Human Computer Interaction to video surveillance. For instance, it can provide a valuable input to monitor the driver's physical state and attention level in the automotive context. In this paper, we tackle the problem of facial landmark localization through a deep approach. The developed system runs in real time and, thanks to the use of depth images as input, is more reliable than state-of-the-art competitors, especially in the presence of light changes and poor illumination. We also collected and shared a new realistic dataset acquired inside a car, called MotorMark, to train and test the system. In addition, we exploited the public Eurecom Kinect Face Dataset for the evaluation phase, achieving promising results both in terms of accuracy and computational speed.


2017 - From Depth Data to Head Pose Estimation: a Siamese approach [Relazione in Atti di Convegno]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

The correct estimation of the head pose is a problem of great importance for many applications. For instance, it is an enabling technology for driver attention monitoring in the automotive field. In this paper, we tackle the pose estimation problem through a deep learning network working in a regression manner. Traditional methods usually rely on visual facial features, such as facial landmarks or the nose tip position. In contrast, we exploit a Convolutional Neural Network (CNN) to perform head pose estimation directly from depth data. We exploit a Siamese architecture and propose a novel loss function to improve the learning of the regression network layer. The system has been tested on two public datasets, Biwi Kinect Head Pose and the ICT-3DHP database. The reported results demonstrate the improvement in accuracy with respect to current state-of-the-art approaches and the real-time capabilities of the overall framework.


2017 - POSEidon: Face-from-Depth for Driver Pose Estimation [Relazione in Atti di Convegno]
Borghi, Guido; Venturelli, Marco; Vezzani, Roberto; Cucchiara, Rita
abstract

Fast and accurate upper-body and head pose estimation is a key task for the automatic monitoring of driver attention, a challenging context characterized by severe illumination changes, occlusions and extreme poses. In this work, we present a new deep learning framework for head localization and pose estimation on depth images. The core of the proposal is a regression neural network, called POSEidon, which is composed of three independent convolutional nets followed by a fusion layer, specially conceived for understanding pose from depth. In addition, to recover the intrinsic value of face appearance for understanding head position and orientation, we propose a new Face-from-Depth approach for reconstructing face images from depth maps. Results in face reconstruction are qualitatively impressive. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Results show that our method outperforms all recent state-of-the-art works, running in real time at more than 30 frames per second.


2016 - Fast gesture recognition with Multiple Stream Discrete HMMs on 3D Skeletons [Relazione in Atti di Convegno]
Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

HMMs are widely used in action and gesture recognition due to their implementation simplicity, low computational requirements, scalability and high parallelism. They achieve respectable performance even with a limited training set. All these characteristics are hard to find together in other, even more accurate, methods. In this paper, we propose a novel double-stage classification approach, based on Multiple Stream Discrete Hidden Markov Models (MSD-HMM) and 3D skeleton joint data, able to reach high performance while maintaining all the advantages listed above. The approach allows both to quickly classify pre-segmented gestures (offline classification) and to perform temporal segmentation on streams of gestures (online classification) faster than real time. We test our system on three public datasets, MSRAction3D, UTKinect-Action and MSRDailyAction, and on a new dataset, the Kinteract Dataset, explicitly created for Human Computer Interaction (HCI). We obtain state-of-the-art performance on all of them.


2016 - Shot, scene and keyframe ordering for interactive video re-use [Relazione in Atti di Convegno]
Baraldi, L.; Grana, C.; Borghi, G.; Vezzani, R.; Cucchiara, R.
abstract

This paper presents a complete system for shot and scene detection in broadcast videos, as well as a method to select the best representative key-frames, which could be used in new interactive interfaces for accessing large collections of edited videos. The final goal is to enable an improved access to video footage and the re-use of video content with the direct management of user-selected video-clips.


2016 - YACCLAB - Yet Another Connected Components Labeling Benchmark [Relazione in Atti di Convegno]
Grana, Costantino; Bolelli, Federico; Baraldi, Lorenzo; Vezzani, Roberto
abstract

The problem of labeling the connected components (CCL) of a binary image is well-defined and several proposals have been presented in the past. Since an exact solution to the problem exists and must be provided as output, algorithms mainly differ in their execution speed. In this paper, we propose and describe YACCLAB, Yet Another Connected Components Labeling Benchmark. Together with a rich and varied dataset, YACCLAB contains an open-source platform to test new proposals and to compare them with publicly available competitors. Textual and graphical outputs are automatically generated for three kinds of tests, which analyze the methods from different perspectives. The fairness of the comparisons is guaranteed by running all algorithms on the same system and over the same datasets. Examples of usage and the corresponding comparisons among state-of-the-art techniques are reported to confirm the potential of the benchmark.
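For readers unfamiliar with the task, the exact problem that YACCLAB benchmarks can be illustrated with a minimal two-pass, 4-connectivity labeling routine based on union-find. This is a toy Python sketch of the classical algorithm, not one of the benchmarked implementations; the function name and structure are our own:

```python
import numpy as np

def label_components(img):
    """Two-pass 4-connectivity connected components labeling (union-find)."""
    img = np.asarray(img, dtype=bool)
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]  # parent[0] is the background sentinel

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    next_label = 1
    for y in range(h):
        for x in range(w):
            if not img[y, x]:
                continue
            up = labels[y - 1, x] if y > 0 else 0
            left = labels[y, x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(next_label)          # fresh provisional label
                labels[y, x] = next_label
                next_label += 1
            elif up and left:
                ru, rl = find(up), find(left)
                parent[max(ru, rl)] = min(ru, rl)  # record equivalence
                labels[y, x] = min(ru, rl)
            else:
                labels[y, x] = up or left
    # second pass: flatten equivalences into consecutive labels
    remap = {0: 0}
    for y in range(h):
        for x in range(w):
            root = find(labels[y, x])
            if root not in remap:
                remap[root] = len(remap)
            labels[y, x] = remap[root]
    return labels
```

Real CCL algorithms differ from this sketch only in speed (block-based scans, decision trees, smarter equivalence handling), never in the labeling they produce, which is what makes a pure speed benchmark well-posed.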


2015 - A General-Purpose Sensing Floor Architecture for Human-Environment Interaction [Articolo su rivista]
Vezzani, Roberto; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Cucchiara, Rita
abstract

Smart environments are now designed as natural interfaces to capture and understand human behavior without a need for explicit human-computer interaction. In this paper, we present a general-purpose architecture that acquires and understands human behaviors through a sensing floor. The pressure field generated by moving people is captured and analyzed. Specific actions and events are then detected by a low-level processing engine and sent to high-level interfaces providing different functions. The proposed architecture and sensors are modular, general-purpose, cheap, and suitable for both small- and large-area coverage. Some sample entertainment and virtual reality applications that we developed to test the platform are presented.


2015 - Automatic configuration and calibration of modular sensing floors [Relazione in Atti di Convegno]
Vezzani, Roberto; Lombardi, Martino; Cucchiara, Rita
abstract

Sensing floors are becoming an emerging solution for many privacy-compliant and large-area surveillance systems. Many research and even commercial technologies have been proposed in recent years. As with distributed camera networks, the problem of calibration is crucial, especially for installations over wide areas. This paper addresses the general problem of automatic calibration and configuration of modular and scalable sensing floors. Working on training data only, the system automatically finds the spatial placement of each sensor module and estimates the threshold parameters needed for people detection. Tests on several training sequences captured with a commercial sensing floor are provided to validate the method.


2015 - Detection of Human Movements with Pressure Floor Sensors [Relazione in Atti di Convegno]
Lombardi, Martino; Vezzani, Roberto; Cucchiara, Rita
abstract

Following the recent Internet of Everything (IoE) trend, several general-purpose devices have been proposed to acquire as much information as possible from the environment and from the people interacting with it. Among the others, sensing floors have recently been attracting the interest of the research community. In this paper, we propose a new model to store and process floor data. The model does not assume a regular grid distribution of the sensing elements and is based on the ground reaction force (GRF) concept, widely used in biomechanics. It allows the correct detection and tracking of people, outperforming the common background subtraction schema adopted in the past. Several tests on a real sensing floor prototype are reported and discussed.


2015 - Mapping Appearance Descriptors on 3D Body Models for People Re-identification [Articolo su rivista]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

People Re-identification aims at associating multiple instances of a person's appearance, acquired from different points of view, from different cameras, or after a spatial or a limited temporal gap, to the same identifier. The basic hypothesis is that the person's appearance is mostly constant. Many appearance descriptors have been adopted in the past, but they are often subject to severe perspective and view-point issues. In this paper, we propose a complete re-identification framework which exploits non-articulated 3D body models to spatially map appearance descriptors (color and gradient histograms) onto the vertices of a regularly sampled 3D body surface. The matching and the shot integration steps are directly handled on the 3D body model, reducing the effects of occlusions, partial views or pose changes, which normally afflict 2D descriptors. A fast and effective model-to-image alignment is also proposed, which allows operation with common surveillance cameras or image collections. A comprehensive experimental evaluation is presented using the benchmark suite 3DPeS.


2014 - 3D Hough transform for sphere recognition on point clouds [Articolo su rivista]
Camurri, Marco; Vezzani, Roberto; Cucchiara, Rita
abstract

Three-dimensional object recognition on range data and 3D point clouds is becoming more important nowadays. Since many real objects have a shape that could be approximated by simple primitives, robust pattern recognition can be used to search for primitive models. For example, the Hough transform is a well-known technique which is largely adopted in 2D image space. In this paper, we systematically analyze different probabilistic/randomized Hough transform algorithms for spherical object detection in dense point clouds. In particular, we study and compare four variants which are characterized by the number of points drawn together for surface computation into the parametric space and we formally discuss their models. We also propose a new method that combines the advantages of both single-point and multi-point approaches for a faster and more accurate detection. The methods are tested on synthetic and real datasets.
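The multi-point variants analyzed above share one geometric core: the unique sphere through four non-coplanar points, voted into a quantized parameter space. The following Python sketch is an illustrative reconstruction of that idea, not the paper's implementation; the function names, binning scheme and all parameters are our assumptions:

```python
import numpy as np

def sphere_from_points(pts):
    """Sphere (center, radius) through four non-coplanar 3D points.

    Subtracting the sphere equation |p - c|^2 = r^2 for point pairs
    removes the quadratic terms, leaving a 3x3 linear system in c.
    """
    p0, p1, p2, p3 = np.asarray(pts, dtype=float)
    A = 2.0 * np.array([p1 - p0, p2 - p0, p3 - p0])
    b = np.array([p1 @ p1 - p0 @ p0,
                  p2 @ p2 - p0 @ p0,
                  p3 @ p3 - p0 @ p0])
    center = np.linalg.solve(A, b)
    return center, np.linalg.norm(p0 - center)

def randomized_sphere_hough(cloud, iterations=200, bin_size=0.05, rng=None):
    """Accumulate votes in a quantized (cx, cy, cz, r) space."""
    rng = np.random.default_rng(rng)
    votes = {}
    for _ in range(iterations):
        sample = cloud[rng.choice(len(cloud), 4, replace=False)]
        try:
            c, r = sphere_from_points(sample)
        except np.linalg.LinAlgError:
            continue  # degenerate (coplanar) sample
        key = tuple(np.round(np.append(c, r) / bin_size).astype(int))
        votes[key] = votes.get(key, 0) + 1
    best = max(votes, key=votes.get)
    return np.array(best[:3]) * bin_size, best[3] * bin_size
```

Single-point variants replace the 4-point sample with one point plus its estimated surface normal, trading fewer samples for a dependency on normal estimation quality.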


2014 - Benchmarking for Person Re-identification [Capitolo/Saggio]
Vezzani, Roberto; Cucchiara, Rita
abstract

The evaluation of computer vision and pattern recognition systems is usually a burdensome and time-consuming activity. In this chapter all the benchmarks publicly available for re-identification will be reviewed and compared, starting from the ancestors VIPeR and Caviar to the most recent datasets for 3D modeling such as SARC3d (with calibrated cameras) and RGBD-ID (with range sensors). Specific requirements and constraints are highlighted and reported for each of the described collections. In addition, details on the metrics that are mostly used to test and evaluate the re-identification systems are provided.


2014 - Detection of static groups and crowds gathered in open spaces by texture classification [Articolo su rivista]
Manfredi, Marco; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

A surveillance system specifically developed to manage crowded scenes is described in this paper. In particular, we focus on static crowds, composed of groups of people who gather and stay in the same place for a while. The detection and spatial localization of static crowd situations is performed by means of a One-Class Support Vector Machine working on texture features extracted at the patch level. Spatial regions containing crowds are identified and filtered using motion information to prevent noise and false alarms due to moving flows of people. By means of one-class classification and inner texture descriptors, we are able to obtain, from a single training set, a sufficiently general crowd model that can be used for all scenarios that share a similar viewpoint. Tests on public datasets and real setups validate the proposed system.


2014 - Substrate for a sensitive floor and method for displaying loads on the substrate [Brevetto]
Lucchese, Claudio; Cucchiara, Rita; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto
abstract

The substrate (1; 50) for making a sensitive floor comprises: a first frame made of high-conductivity sensing means (2a-2d) having a first orientation; a second frame made of high-conductivity sensing means (3a-3d) which is adapted to be laid on said first frame and has a second orientation, other than said first orientation, said second frame (3a-3d) forming a support layer for floor finishing products; an element (4) made of a conductive material, which comprises: an elastically compressible thickness (S1), two opposite faces (104, 204) contacting said two first and second frames (2a-2d), (3a-3d), an electric resistor whose resistance is proportional to said thickness (S1).


2014 - Welcome message from the technical program committee chairs [Relazione in Atti di Convegno]
Micheloni, C.; Velipasalar, S.; Vezzani, R.
abstract


2013 - Editorial to the 'pattern recognition and artificial intelligence for human behaviour analysis' special section [Articolo su rivista]
Iocchi, L.; Prati, A.; Vezzani, R.
abstract


2013 - Human Behavior Understanding with Wide Area Sensing Floors [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
abstract

The research on innovative and natural interfaces aims at developing devices able to capture and understand human behavior without the need for direct interaction. In this paper we propose and describe a framework based on a sensing floor device. The pressure field generated by people or objects standing on the floor is captured and analyzed. Local and global features are computed by a low-level processing unit and sent to high-level interfaces. The framework can be used in different applications, such as entertainment, education or surveillance. A detailed description of the sensing element and of the processing architectures is provided, together with some sample applications developed to test the device's capabilities.


2013 - Intelligent video surveillance as a service [Capitolo/Saggio]
Prati, A.; Vezzani, R.; Fornaciari, M.; Cucchiara, R.
abstract

Nowadays, intelligent video surveillance has become an essential tool of the greatest importance for several security-related applications. With the growth of installed cameras and the increasing complexity of required algorithms, in-house self-contained video surveillance systems become a chimera for most institutions and (small) companies. The paradigm of Video Surveillance as a Service (VSaaS) helps distributing not only storage space in the cloud (necessary for handling large amounts of video data), but also infrastructures and computational power. This chapter will briefly introduce the motivations and the main characteristics of a VSaaS system, providing a case study where research-lab computer vision algorithms are integrated in a VSaaS platform. The lessons learnt and some future directions on this topic will be also highlighted.


2013 - Learning articulated body models for people re-identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

People re-identification is a challenging problem in surveillance and forensics and it aims at associating multiple instances of the same person which have been acquired from different points of view and after a temporal gap. Image-based appearance features are usually adopted but, in addition to their intrinsically low discriminability, they are subject to perspective and view-point issues. We propose to completely change the approach by mapping local descriptors extracted from RGB-D sensors on a 3D body model for creating a view-independent signature. An original bone-wise color descriptor is generated and reduced with PCA to compute the person signature. The virtual bone set used to map appearance features is learned using a recursive splitting approach. Finally, people matching for re-identification is performed using the Relaxed Pairwise Metric Learning, which simultaneously provides feature reduction and weighting. Experiments on a specific dataset created with the Microsoft Kinect sensor and the OpenNi libraries prove the advantages of the proposed technique with respect to state of the art methods based on 2D or non-articulated 3D body models.


2013 - People reidentification in surveillance and forensics: a Survey [Articolo su rivista]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

The field of surveillance and forensics research is currently shifting focus and is now showing an ever increasing interest in the task of people reidentification. This is the task of assigning the same identifier to all instances of a particular individual captured in a series of images or videos, even after the occurrence of significant gaps over time or space. People reidentification can be a useful tool for people analysis in security as a data association method for long-term tracking in surveillance. However, current identification techniques being utilized present many difficulties and shortcomings. For instance, they rely solely on the exploitation of visual cues such as color, texture, and the object's shape. Despite the many advances in this field, reidentification is still an open problem. This survey aims to tackle all the issues and challenging aspects of people reidentification while simultaneously describing the previously proposed solutions for the encountered problems. This begins with the first attempts of holistic descriptors and progresses to the more recently adopted 2D and 3D model-based approaches. The survey also includes an exhaustive treatise of all the aspects of people reidentification, including available datasets, evaluation metrics, and benchmarking.


2013 - Sensing floors for privacy-compliant surveillance of wide areas [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
abstract

Surveillance systems can really benefit from the integration of multiple and heterogeneous sensors. In this paper we describe an innovative sensing floor. Thanks to its low cost and ease of installation, the floor is suitable for both private and public environments, from narrow zones to wide areas. The floor is made by adding a sensing layer below commercial floating tiles. The sensor is scalable, reliable, and completely invisible to the users. The temporal and spatial resolution of the data is high enough to identify the presence of people, to recognize their behavior and to detect events in a privacy-compliant way. Experimental results on a real prototype implementation confirm the potential of the framework.


2013 - Video surveillance online repository (ViSOR) [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

This paper describes ViSOR (Video Surveillance Online Repository), a repository designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. The repository is free and researchers can collaborate by sharing their own videos or datasets. Most of the included videos are annotated. Annotations are based on a reference ontology which has been defined by integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, aimed at identifying the spatial, temporal and domain detail level used. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, and media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.


2012 - Intelligent Video Surveillance [Capitolo/Saggio]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

Safety and security reasons are pushing the growth of surveillance systems, for both prevention and forensic tasks. Unfortunately, most of the installed systems have recording capability only, with quality so poor that it makes them completely unhelpful. This chapter introduces the concepts of modern systems for Intelligent Video Surveillance (IVS), with the aim of providing neither a complete treatment nor a technical description of the topic, but a simple and concise panorama of the motivations, components, and trends of these systems. Differently from CCTV systems, IVS should be able, for instance, to monitor people in public areas and smart homes, to control urban traffic, and to assess identity for the security and safety of critical infrastructures.


2012 - People Orientation Recognition by Mixtures of Wrapped Distributions on Random Trees [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

The recognition of people orientation in single images is still an open issue in several real cases, when the image resolution is poor, body parts cannot be distinguished and localized or motion cannot be exploited. However, the estimation of a person orientation, even an approximated one, could be very useful to improve people tracking and re-identification systems, or to provide a coarse alignment of body models on the input images. In these situations, holistic features seem to be more effective and faster than model based 3D reconstructions. In this paper we propose to describe the people appearance with multi-level HoG feature sets and to classify their orientation using an array of Extremely Randomized Trees classifiers trained on quantized directions. The outputs of the classifiers are then integrated into a global continuous probability density function using a Mixture of Approximated Wrapped Gaussian distributions. Experiments on the TUD Multiview Pedestrians, the Sarc3D, and the 3DPeS datasets confirm the efficacy of the method and the improvement with respect to state of the art approaches.
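The continuous density described above is built from wrapped distributions. As a minimal illustration, an approximated wrapped Gaussian can be evaluated by summing a few shifted copies of the ordinary Gaussian; the sketch below (function names and the truncation at ±3 wraps are our own illustrative choices, not the paper's code) shows one component and a simple mixture:

```python
import math

def approx_wrapped_gaussian(theta, mu, sigma, wraps=3):
    """Approximated wrapped Gaussian density on the circle [0, 2*pi).

    The exact wrapped normal sums infinitely many shifted Gaussians;
    for moderate sigma a few terms (k = -wraps..wraps) are enough.
    """
    total = 0.0
    for k in range(-wraps, wraps + 1):
        z = theta - mu + 2.0 * math.pi * k   # unwrap by k full turns
        total += math.exp(-0.5 * (z / sigma) ** 2)
    return total / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(theta, components):
    """Mixture of approximated wrapped Gaussians: [(weight, mu, sigma), ...]."""
    return sum(w * approx_wrapped_gaussian(theta, mu, s)
               for w, mu, s in components)
```

For small sigma a single wrap term already dominates, which is what makes the approximation cheap enough to evaluate densely over all candidate orientations.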


2011 - 3DPeS: 3D People Dataset for Surveillance and Forensics [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

The interest of the research community in creating reference datasets for performance analysis is always very high. Although new datasets collecting large amounts of video footage are spreading in surveillance and forensics, few benchmarks with annotation data are available for testing specific tasks, especially for 3D/multi-view analysis. In this paper we present 3DPeS, a new dataset for 3D/multi-view surveillance and forensic applications. It has been designed for discussing and evaluating research results in people re-identification and other related activities (people detection, people segmentation and people tracking). The new assessed version of the dataset contains hundreds of video sequences of 200 people taken from a multi-camera distributed surveillance system over several days, with different light conditions; each person is detected multiple times and from different points of view. In surveillance scenarios, the dataset can be exploited to evaluate people reacquisition, 3D body models and people activity reconstruction algorithms. It can be adopted in forensics too, by relaxing some constraints (e.g. real time) and neglecting some information (e.g. calibration). Some results on this new dataset are presented using state-of-the-art methods for people re-identification, as a benchmark for future comparisons.


2011 - Multi-view people surveillance using 3D information [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita; A., Utasi; C., Benedek; T., Sziranyi
abstract

In this paper we introduce a novel surveillance system, which uses 3D information extracted from multiple cameras to detect, track and re-identify people. The detection method is based on a 3D Marked Point Process model using two pixel-level features extracted from multi-plane projections of binary foreground masks, and uses a stochastic optimization framework to estimate the position and the height of each person. We apply rule-based Kalman-filter tracking on the detection results to find the object-to-object correspondence between consecutive time steps. Finally, a 3D body model based long-term tracking module connects broken tracks and is also used to re-identify people.
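As an illustration of the tracking stage, a standard constant-velocity Kalman filter on ground-plane detections looks as follows. The state layout, matrices and noise levels are textbook choices for a sketch, not the paper's tuned values:

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1e-1):
    """Constant-velocity Kalman matrices for 2D ground-plane tracking.

    State [x, y, vx, vy]; only positions are observed, as when a
    detection is projected onto the floor plane. q and r are
    illustrative process/measurement noise levels.
    """
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt                 # position += velocity * dt
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0
    Q = q * np.eye(4)
    R = r * np.eye(2)
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict + update cycle; z is the observed (x, y) detection."""
    x = F @ x                               # predict state
    P = F @ P @ F.T + Q                     # predict covariance
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (z - H @ x)                 # correct with measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

The rule-based part of such a tracker then sits on top of this core, gating which detection updates which track and spawning or terminating tracks.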


2011 - Probabilistic people tracking with appearance models and occlusion classification: The AD-HOC system [Articolo su rivista]
Vezzani, Roberto; Grana, Costantino; Cucchiara, Rita
abstract

AD-HOC (Appearance Driven Human tracking with Occlusion Classification) is a complete framework for multiple people tracking in video surveillance applications in presence of large occlusions. The appearance-based approach allows the estimation of the pixel-wise shape of each tracked person even during the occlusion. This peculiarity can be very useful for higher level processes, such as action recognition or event detection. A first step predicts the position of all the objects in the new frame while a MAP framework provides a solution for best placement. A second step associates each candidate foreground pixel to an object according to mutual object position and color similarity. A novel definition of non-visible regions accounts for the parts of the objects that are not detected in the current frame, classifying them as dynamic, scene or apparent occlusions. Results on surveillance videos are reported, using in-house produced videos and the PETS2006 test set.


2011 - SARC3D: a new 3D body model for People Tracking and Re-identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

We propose a new simplified 3D body model (called Sarc3D) for surveillance applications that can be created, updated and compared in real time. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. For each vertex of the model, a signature (color features, reliability and saliency) is computed from the 2D appearance images and exploited for matching. This approach achieves robustness against partial occlusions, pose and viewpoint changes. The complete proposal and a full experimental evaluation are presented, using a new benchmark suite and the PETS2009 dataset.


2010 - 3D Body Model Construction and Matching for Real Time People Re-Identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

Wide-area video surveillance always requires extracting and integrating information coming from different cameras and views. The re-identification of people captured from different cameras or different views is one of the most challenging problems. In this paper, we present a novel approach for people matching with vertex-based 3D human models. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. Colour features are computed from the 2D appearance images and mapped to the 3D model vertices, generating the 3D model of each tracked person. A distance function between 3D models is defined in order to find matches among models belonging to the same person. This approach achieves robustness against partial occlusions, pose and viewpoint changes. A first experimental evaluation is conducted using images extracted from a real camera setup.


2010 - Event Driven Software Architecture for Multi-camera and Distributed Surveillance Research Systems [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

The surveillance of wide areas with several connected cameras integrated into the same automatic system is no longer a chimera, but modular, scalable and flexible architectures are mandatory to manage them. This paper points out the main issues in the development of distributed surveillance systems and proposes an integrated framework particularly suitable for research purposes. First, exploiting a computer architecture analogy, a three-layer tracking system is proposed, which copes with the integration of both overlapping and non-overlapping cameras. Then, a static service-oriented architecture is adopted to collect and manage the plethora of high-level modules, such as face detection and recognition, posture and action classification, and so on. Finally, the overall architecture is controlled by an event-driven communication infrastructure, which assures the scalability and the flexibility of the system.


2010 - Fast Background Initialization with Recursive Hadamard Transform [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new and fast technique for background estimation from cluttered image sequences. Most of the background initialization approaches developed so far collect a number of initial frames and then require a slow estimation step which introduces a delay whenever it is applied. Conversely, the proposed technique redistributes the computational load among all the frames by means of a patch-by-patch preprocessing, which makes the overall algorithm more suitable for real-time applications. For each patch location a prototype set is created and maintained. The background is then iteratively estimated by choosing from each set the most appropriate candidate patch, which should verify a sort of frequency coherence with its neighbors. To this aim, the Hadamard transform has been adopted, which requires less computation time than the commonly used DCT. Finally, a refinement step exploits spatial continuity constraints along the patch borders to prevent erroneous patch selections. The approach has been compared with the state of the art on videos from available datasets (ViSOR and CAVIAR), showing a speed-up of about 10 times and an improved accuracy.
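The frequency-coherence test relies on the Walsh-Hadamard transform, which needs only additions and subtractions of pixel values. A minimal sketch of a 2D transform of a square patch (the paper's exact incremental formulation is not reproduced here):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix of order n (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def wht2d(patch):
    """2D Walsh-Hadamard transform of a square patch.

    The transform involves only +/- of pixel values, which is why it is
    cheaper than the DCT for comparing candidate background patches."""
    H = hadamard(patch.shape[0])
    return H @ patch @ H.T
```

Since the Sylvester matrix is symmetric and satisfies H Hᵀ = nI, applying `wht2d` twice returns the patch scaled by n², so coefficients can be compared directly without normalization.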


2010 - HMM Based Action Recognition with Projection Histogram Features [Relazione in Atti di Convegno]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

Hidden Markov Models (HMMs) have been widely used for action recognition, since they allow to easily model the temporal evolution of a single numeric feature, or a set of them, extracted from the data. The selection of the feature set and of the related emission probability function are the key issues to be defined. In particular, if the training set is not sufficiently large, a manual or automatic feature selection and reduction is mandatory. In this paper we propose to model the emission probability function as a Mixture of Gaussians, while the feature set is obtained from the projection histograms of the foreground mask. The projection histograms contain the number of moving pixels for each row and each column of the frame, and they provide sufficient information to infer the instantaneous posture of the person. The HMM framework then recovers the temporal evolution of the postures, recognizing in such a manner the global action. The proposed method has been successfully tested on the UT-Tower and Weizmann datasets.
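The projection-histogram features described above reduce a binary foreground mask to one count per row and per column; a minimal sketch (the HMM and Mixture-of-Gaussians machinery is omitted):

```python
import numpy as np

def projection_histograms(mask):
    """Row/column counts of foreground pixels in a binary mask.

    These are the features named in the abstract: the number of moving
    pixels for each row and for each column of the frame."""
    mask = mask.astype(bool)
    rows = mask.sum(axis=1)   # one count per image row
    cols = mask.sum(axis=0)   # one count per image column
    return rows, cols
```

The concatenated `(rows, cols)` vector, one per frame, would then be the observation sequence fed to the per-action HMMs.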


2010 - Video Surveillance Online Repository (ViSOR): an integrated framework [Articolo su rivista]
Vezzani, Roberto; Cucchiara, Rita
abstract

The availability of new techniques and tools for Video Surveillance and the capability of storing huge amounts of visual data acquired by hundreds of cameras every day call for a convergence between pattern recognition, computer vision and multimedia paradigms. A clear need for this convergence is shown by new research projects which attempt to exploit both ontology-based retrieval and video analysis techniques also in the field of surveillance. This paper presents the ViSOR (Video Surveillance Online Repository) framework, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Annotations are based on a reference ontology which has been defined by integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, aimed at identifying the spatial, temporal and domain detail level used. The ViSOR web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, and media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.


2009 - An efficient Bayesian framework for on-line action recognition [Relazione in Atti di Convegno]
Vezzani, Roberto; Piccardi, Massimo; Cucchiara, Rita
abstract

On-line action recognition from a continuous stream of actions is still an open problem, with fewer solutions proposed compared to time-segmented action recognition. The most challenging task is to classify the current action while finding its time boundaries at the same time. In this paper we propose an approach capable of performing on-line action segmentation and recognition by means of batteries of HMMs taking into account all the possible time boundaries and action classes. A suitable Bayesian normalization is applied to make observation sequences of different length comparable, and computational optimizations are introduced to achieve real-time performance. Results on a well-known action dataset prove the efficacy of the proposed method.
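The abstract does not spell out the normalization; one common way to make log-likelihoods of sequences of different length comparable is to score each candidate (class, boundary) hypothesis by its per-frame log-likelihood. The sketch below assumes precomputed HMM log-likelihoods and may differ from the paper's exact Bayesian formulation:

```python
def normalized_loglik(loglik, length):
    """Per-frame log-likelihood: dividing by the sequence length puts
    candidate segmentations of different durations on a common scale.
    A common normalization; the paper's exact rule may differ."""
    return loglik / max(length, 1)

def best_segment(scores):
    """Pick the (class, start_frame) hypothesis with the highest
    normalized score.

    scores: dict mapping (class_id, start_frame) -> (loglik, length),
    i.e. one entry per HMM in the battery and candidate boundary."""
    return max(scores, key=lambda k: normalized_loglik(*scores[k]))
```

For example, a 5-frame hypothesis with log-likelihood -4.0 (-0.8 per frame) would outrank a 10-frame hypothesis with log-likelihood -10.0 (-1.0 per frame), even though its raw score is lower.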


2009 - Dynamic Pictorially Enriched Ontologies for Digital Video Libraries [Articolo su rivista]
M., Bertini; A., Del Bimbo; Serra, Giuseppe; C., Torniai; Cucchiara, Rita; Grana, Costantino; Vezzani, Roberto
abstract

This article presents a framework for automatic semantic annotation of video streams with an ontology that includes concepts expressed using linguistic terms and visual data.


2009 - Pathnodes integration of standalone Particle Filters for people tracking on distributed surveillance systems [Relazione in Atti di Convegno]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

In this paper, we present a new approach to object tracking based on batteries of particle filters working in multi-camera systems with non-overlapping fields of view. In each view the moving objects are tracked with independent particle filters; each filter exploits a likelihood function based on both color and motion information. The consistent labeling of people exiting from a camera field of view and entering a neighboring one is obtained by sharing particle information for the initialization of new filtering trackers. The information exchange algorithm is based on path-nodes, which are a graph-based scene representation usually adopted in computer graphics. The approach has been tested even in cases of simultaneous transitions, occlusions, and groups of people. Promising results have been obtained and are presented here using a real setup of non-overlapping cameras.


2009 - Statistical Pattern Recognition for Multi-Camera Detection, Tracking and Trajectory Analysis [Capitolo/Saggio]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This chapter will address most of the aspects of modern video surveillance with the reference to the research activity conducted at University of Modena and Reggio Emilia, Italy, within the scopes of the national FREE SURF (FREE SUrveillance in a pRivacy-respectFul way) and NATO-funded BE SAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction) projects. Moving object detection and tracking from a single camera, multi-camera consistent labeling and trajectory shape analysis for path classification will be the main topics of this chapter.


2009 - Statistical pattern recognition for multi-camera detection, tracking, and trajectory analysis [Capitolo/Saggio]
Calderara, S.; Cucchiara, R.; Vezzani, R.; Prati, A.
abstract


2008 - AD-HOC: Appearance Driven Human tracking with Occlusion Handling [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

AD-HOC copes with the problem of tracking multiple people in video surveillance in the presence of large occlusions. The main novelty is the adoption of an appearance-based approach in a formal Bayesian framework: the status of each object is defined at pixel level, where each pixel is characterized by its appearance, i.e. its color (integrated along time) and the likelihood of belonging to the object. With these data at pixel level and a probability of non-occlusion at object level, the problem of occlusions is addressed. The method does not aim at detecting the presence of an occlusion only, but classifies the type of occlusion at a sub-region level and evolves the status of the object in a selective way. The AD-HOC tracker has been tested in many applications for indoor and outdoor surveillance. Results on the PETS2006 test set are reported, where many people and abandoned objects are detected and tracked.


2008 - Annotation Collection and Online Performance Evaluation for Video Surveillance: the ViSOR Project [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

This paper presents the ViSOR (VIdeo Surveillance Online Repository) project, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, and of evaluating the performance of automatic surveillance systems. The main idea is to exploit the collaborative paradigm spreading in the web community to join together ontology-based annotation and retrieval concepts and the requirements of the computer vision and video surveillance communities. The ViSOR open repository is based on a reference ontology which integrates many concepts, also coming from the LSCOM and MediaMill ontologies. The web interface allows video browsing, query by annotated concepts or by keywords, compressed video preview, and media download and upload. The repository contains metadata annotations, which can be either manually created as ground truth or automatically generated by video surveillance systems. Automatic annotations can be compared with each other or with the reference ground truth exploiting an integrated on-line performance evaluator.


2008 - Smoke detection in videosurveillance: the use of VISOR (Video Surveillance On-line Repository) [Relazione in Atti di Convegno]
Vezzani, Roberto; Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

ViSOR (VIdeo Surveillance Online Repository) is a large video repository designed for containing annotated video surveillance footage, comparing annotations, evaluating system performance, and performing retrieval tasks. The web interface allows video browsing, query by annotated concepts or by keywords, compressed video preview, and media download and upload. The repository contains metadata annotations, both manually created ground-truth data and automatically obtained outputs of particular systems. An example of application is the collection of videos and annotations for smoke detection, an important video surveillance task. In this paper we present the architecture of ViSOR, the built-in surveillance ontology, which integrates many concepts also coming from LSCOM and MediaMill, the annotation tools, and the visualization of results for performance evaluation. The annotation is obtained with an automatic smoke detection system, capable of detecting people, moving objects, and smoke in real time.


2008 - ViSOR: Video Surveillance On-line Repository for Annotation Retrieval [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

The Imagelab Laboratory of the University of Modena and Reggio Emilia has designed a large video repository aimed at containing annotated video surveillance footage. The web interface, named ViSOR (VIdeo Surveillance Online Repository), allows video browsing, query by annotated concepts or by keywords, compressed preview, and video download and upload. The repository contains metadata annotations, both manually annotated ground-truth data and automatically obtained outputs of a particular system. In such a manner, the users of the repository are able to perform validation tasks on their own algorithms as well as comparative activities.


2007 - A Multi-Camera Vision System for Fall Detection and Alarm Generation [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In-house video surveillance can represent an excellent support for people with some difficulties (e.g. elderly or disabled people) living alone and with a limited autonomy. New hardware technologies and in particular digital cameras are now affordable and they have recently gained credit as tools for (semi-)automatically assuring people's safety. In this paper a multi-camera vision system for detecting and tracking people and recognizing dangerous behaviours and events such as a fall is presented. In such a situation a suitable alarm can be sent, e.g. by means of an SMS. A novel technique of warping people's silhouette is proposed to exchange visual information between partially overlapped cameras whenever a camera handover occurs. Finally, a multi-client and multi-threaded transcoding video server delivers live video streams to operators/remote users in order to check the validity of a received alarm. Semantic and event-based transcoding algorithms are used to optimize the bandwidth usage. A two-room setup has been created in our laboratory to test the performance of the overall system and some of the results obtained are reported.


2007 - Compressed Domain Features Extraction for Shot Characterization [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Borghesani, Daniele; Cucchiara, Rita
abstract

In this work, we propose a system for shot comparison working directly on the MPEG-1 stream in the compressed domain, extracting color, texture and motion features from all frames at a reasonable computational cost, with results comparable to those obtained on uncompressed keyframes. In particular, a summary descriptor for each Group Of Pictures (GOP) is computed and employed for shot characterization and comparison. The Mallows distance allows matching clips of different lengths in a unified framework.


2007 - Enhancing HSV Histograms with Achromatic Points Detection for Video Retrieval [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

Color is one of the most meaningful features used in content-based retrieval of visual data. In video content-based retrieval, color features computed on selected frames are integrated with other low-level features concerning texture, shape and motion in order to find clip similarities. For example, the Scalable Color feature defined in the MPEG-7 standard exploits HSV histograms to create color feature vectors. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in the classification of achromatic and low-saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection, based on a single Hue and Saturation parameter, that can correct this limitation. The enhanced histograms have proven to be effective in color analysis and have been used in a system for automatic clip annotation called PEANO, where pictorial concepts are extracted by clip clustering and used for similarity-based automatic annotation.
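The idea of routing achromatic pixels away from the hue histogram could be sketched as below; the threshold `tau`, the s*v rule and the bin counts are illustrative assumptions, not the paper's exact parameterization:

```python
import colorsys

def is_achromatic(r, g, b, tau=0.12):
    """Flag a pixel as achromatic when its saturation, weighted by
    value, falls below a single threshold, so that dark or gray pixels
    do not pollute the hue bins.  tau is an assumed value."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return s * v < tau

def hsv_histogram(pixels, hue_bins=16, gray_bins=4, tau=0.12):
    """Hue histogram for chromatic pixels plus a small gray-level
    histogram for achromatic ones (a sketch of the enhanced histogram).

    pixels: iterable of (r, g, b) tuples with components in [0, 1]."""
    hist = [0] * (hue_bins + gray_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        if s * v < tau:  # achromatic: bin by value (gray level) only
            hist[hue_bins + min(int(v * gray_bins), gray_bins - 1)] += 1
        else:            # chromatic: bin by hue
            hist[min(int(h * hue_bins), hue_bins - 1)] += 1
    return hist
```

With a plain hue histogram, a mid-gray pixel (s = 0) would land in the red bin at h = 0; here it is diverted to a dedicated gray bin instead, which is the kind of misclassification the enhanced histogram is meant to avoid.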


2007 - Prototypes Selection with Context Based Intra-class Clustering for Video Annotation with Mpeg7 Features [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

In this work, we analyze the effectiveness of perceptual features to automatically annotate video clips in domain-specific video digital libraries. Typically, automatic annotation is provided by computing clip similarity with respect to given examples, which constitute the knowledge base, in accordance with a given ontology or classification scheme. Since the amount of training clips is normally very large, we propose to automatically extract some prototypes, or visual concepts, for each class instead of using the whole knowledge base. The prototypes are generated by a Complete Link clustering based on perceptual features with an automatic selection of the number of clusters. Context-based information is used in an intra-class clustering framework to select more discriminative clips. Reducing the number of samples makes the matching process faster and lessens the storage requirements. Clips are annotated following the MPEG-7 directives to provide easier portability. Results are provided on videos taken from sports and news digital libraries.


2007 - Semi-automatic Video Digital Library Annotation Tools [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Vezzani, Roberto
abstract

In this work, we present a general-purpose system for hierarchical structural segmentation and automatic annotation of video clips by means of standardized low-level features. We propose to automatically extract some prototypes for each class with a context-based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semi-automatic metadata creation are provided.


2007 - Sports Video Annotation Using Enhanced HSV Histograms in Multimedia Ontologies [Relazione in Atti di Convegno]
M., Bertini; A., Del Bimbo; C., Torniai; Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

This paper presents multimedia ontologies, where multimedia data and traditional textual ontologies are merged. A solution for their implementation in the soccer video domain and a method to perform automatic soccer video annotation using these extended ontologies are shown. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in the classification of achromatic and low-saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection, based on a single Hue and Saturation parameter, that can correct this limitation. The more general concepts of the sport domain (e.g. play/break, crowd, etc.) are put in correspondence with the more general visual features of the video, like color and texture, while the more specific concepts of the soccer domain (e.g. highlights such as attack actions) are put in correspondence with domain-specific visual features like the soccer playfield and the players. Experimental results for annotation of soccer videos using generic concepts are presented.


2007 - Using a Wireless Sensor Network to Enhance Video Surveillance [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto; L., Benini; E., Farella; P., Zappi
abstract

To enhance video surveillance systems, multi-modal sensor integration can be a successful strategy. In this work, a computer vision system able to detect and track people from multiple cameras is integrated with a wireless sensor network mounting passive Pyroelectric InfraRed sensors. The two subsystems are briefly described, and possible cases in which computer vision algorithms are likely to fail are discussed. Then, simple but reliable outputs from the sensor nodes are exploited to improve the accuracy of the vision system. In particular, two case studies are reported: the first uses the presence detection of the sensors to disambiguate between an open door and a moving person, while the second handles motion direction changes during occlusions. Preliminary results are reported and demonstrate the usefulness of the integration of the two subsystems.


2007 - Visor: Video Surveillance Online Repository [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

The aim of the ViSOR project [1] is to gather and make freely available a repository of surveillance and video footage for the research community on pattern recognition and multimedia retrieval. The goal is to create an open forum and a free repository to exchange, compare and discuss results on many problems in video surveillance and retrieval. Together with the videos, the repository contains metadata annotations, both manually annotated as ground truth and automatically obtained by video surveillance systems. Annotation refers to a large ontology of concepts on surveillance- and security-related objects and events. The ontology has been defined including concepts from the LSCOM and MediaMill ontologies. As well as videos and annotations, ViSOR provides tools for enriching the ontology, annotating new videos, searching by textual queries, and composing and downloading videos.


2006 - 3-D Virtual Environments on Mobile Devices for Remote Surveillance [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita; A., Malizia; L., Cinque
abstract

In this paper we present a distributed video surveillance framework. Our aim is the remote monitoring of the behavior of people moving in a scene, exploiting a virtual reconstruction on low-capability devices, like PDAs and cell phones. The main novelty of this system is the effective integration of the computer vision and computer graphics modules. The first, using a probabilistic framework, can detect the position, the trajectory and the posture of people moving in the scene. The second exploits the new possibilities of both standard 3D graphics libraries on mobile devices (namely JSR184 and the M3G graphic format) and the processing capability of new PDAs in order to reconstruct the remote surveillance data in real time.


2006 - A Distributed Domotic Surveillance System [Capitolo/Saggio]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

Distributed video surveillance has a direct application in intelligent home automation or domotics (from the Latin word domus, that means “home”, and informatics); in particular, in-house video surveillance can provide good support for people with some difficulties (e.g., elderly or disabled people) living alone and with a limited autonomy. New hardware technologies for surveillance are now affordable and provide high reliability. Problems related to reliable software solutions are not completely solved, especially concerning the application of general-purpose computer vision techniques in indoor environments. Indeed, assuming the objective is to detect the presence of people, track them, and recognize dangerous behaviours by means of abrupt changes in their posture, robust techniques must cope with non-trivial difficulties. In particular, luminance changes and shadows must be taken into account, frequent posture changes must be faced, and large and long-lasting occlusions are common due to the vicinity of the cameras and the presence of furniture and doors that can often hide parts of the person’s body. These problems are analyzed and solutions based on background suppression, appearance-based probabilistic tracking, and probabilistic reasoning for posture recognition are described.


2006 - A semi-automatic video annotation tool with MPEG-7 content collections [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; D., Bulgarelli; Vezzani, Roberto
abstract

In this work, we present a general-purpose system for hierarchical structural segmentation and automatic annotation of video clips by means of standardized low-level features. We propose to automatically extract some prototypes for each class with a context-based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semi-automatic metadata creation are provided.


2006 - A system for automatic face obscuration for privacy purposes [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This work proposes a method for automatic face obscuration capable of protecting people's identity. Since face detection heavily benefits from the possibility to exploit tracking, multi-camera people tracking has been integrated with a face detector based on colour clustering and Hough transform. Moreover, the multiple viewpoints provided by multiple cameras are exploited in order to always obtain a good-quality image of the face. The identity of people in different views is kept consistent by means of a geometrical, uncalibrated approach based on homographies. Experimental results show the accuracy of the proposed approach. (c) 2006 Elsevier B.V. All rights reserved.


2006 - Advanced video surveillance with pan tilt zoom cameras [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper an advanced video surveillance system is proposed. Our goal is the detection of people’s heads to allow their obscuration for privacy issues or to perform recognition tasks. We propose a system based on active PTZ (Pan-Tilt-Zoom) cameras that produces head images of a large enough size and can cover an area larger than still cameras. Since conventional approaches are not suitable for PTZ cameras, the proposed approach is based on so-called direction histograms to compute the ego-motion and on frame differencing for detecting moving objects. It exploits post-processing and active contours to extract the precise shape of moving objects to be fed to a probabilistic algorithm that tracks moving people in the scene. Person following, instead, is based on simple heuristic rules that move the camera as soon as the selected person is close to the border of the field of view. Finally, a color- and shape-based head detection that takes advantage of the people tracking is presented. Experimental results on a live active camera demonstrate the feasibility of real-time person following and of the consecutive head detection phase.


2006 - Estimating Geospatial Trajectory of a Moving Camera [Relazione in Atti di Convegno]
A., Hakeem; Vezzani, Roberto; S., Shah; Cucchiara, Rita
abstract

This paper proposes a novel method for estimating the geospatial trajectory of a moving camera. The proposed method uses a set of reference images with known GPS (global positioning system) locations to recover the trajectory of a moving camera using geometric constraints. The proposed method has three main steps. First, scale-invariant feature transform (SIFT) features are detected and matched between the reference images and the video frames to calculate a weighted adjacency matrix (WAM) based on the number of SIFT matches. Second, using the estimated WAM, the maximum-matching reference image is selected for the current video frame, which is then used to estimate the relative position (rotation and translation) of the video frame using the fundamental matrix constraint. The relative position is recovered up to a scale factor, and a triangulation among the video frame and two reference images is performed to resolve the scale ambiguity. Third, an outlier rejection and trajectory smoothing (using B-splines) post-processing step is employed, because the estimated camera locations may be noisy due to bad point correspondences or degenerate estimates of the fundamental matrices. Results of recovering the camera trajectory are reported for real sequences.
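The selection of the maximum-matching reference image via the WAM can be sketched as follows, assuming SIFT match counts have already been computed by a feature matcher (the pose estimation itself is omitted):

```python
import numpy as np

def build_wam(match_counts):
    """Weighted adjacency matrix: entry (i, j) is the number of SIFT
    matches between video frame i and reference image j (counts are
    assumed to come from a separate feature-matching step)."""
    return np.asarray(match_counts, dtype=float)

def pick_reference(wam, frame_idx):
    """Select the maximum-matching reference image for one frame; its
    known GPS location then anchors the relative-pose estimate."""
    return int(np.argmax(wam[frame_idx]))
```

For example, if frame 0 has 3, 10, and 1 matches against three reference images, reference 1 is chosen and the fundamental matrix is estimated between frame 0 and that image.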


2006 - MPEG-7 Pictorially Enriched Ontologies for Video Annotation [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Bulgarelli, Daniele; Cucchiara, Rita
abstract

A system for the automatic creation of Pictorially Enriched Ontologies is presented, that is, ontologies for context-based video digital libraries, enriched by pictorial concepts for video annotation, summarization and similarity-based retrieval. Extraction of pictorial concepts with video clip clustering, ontology storing with MPEG-7, and the use of the ontology for stored video annotation are described. Results on sport videos and TRECVID2005 video material are reported.


2006 - PEANO: Pictorial Enriched Annotation of Video [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Bulgarelli, Daniele; Gualdi, Giovanni; Cucchiara, Rita; M., Bertini; C., Torniai; A., Del Bimbo
abstract

In this DEMO, we present a tool set for video digital library management that allows: i) structural annotation of edited videos in MPEG-7 by automatically extracting shots and clips; ii) automatic semantic annotation based on perceptual similarity against a taxonomy enriched with pictorial concepts; iii) video clip access and hierarchical summarization with stand-alone and web interfaces; iv) access to clips from mobile platforms via GPRS-UMTS video streaming. The tools can be applied in different domain-specific Video Digital Libraries. The main novelty is the possibility to enrich the annotation with pictorial concepts that are added to a textual taxonomy in order to make the automatic annotation process faster and often more effective. The resulting multimedia ontology is described in the MPEG-7 framework. The PEANO (Perceptual Annotation of Video) tool has been tested on video art, sport (soccer, Olympic Games 2006, Formula 1) and news clips.


2006 - University of Modena and Reggio Emilia at TRECVID 2006 [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

What approach or combination of approaches did you test in each of your submitted runs?
TRECVID2005_UNIMORE_??.xml: the same linear transition detector (LTD) was tested for every run, with ten uniformly spaced thresholds for the detection.
What if any significant differences (in terms of what measures) did you find among the runs?
The system behaved as expected: the higher the threshold, the better the recall. Of course the precision lowered correspondingly. Interestingly enough, it seems that we cannot overcome the overall limit of around 80% for recall and 88% for precision, independently of the other parameter.
Based on the results, can you estimate the relative contribution of each component of your system/approach to its effectiveness?
One of the main objectives of our system was to test the performance of a single algorithm for both cuts and gradual transitions. So all the merits and the demerits are related to our LTD.
Overall, what did you learn about runs/approaches and the research question(s) that motivated them?
The use of a single algorithm allows the system to be run without training. Just a single parameter may be employed to tune the sensitivity of the system, thus allowing its use in general-purpose/user-friendly systems.


2005 - Ambient Intelligence for Security in Public Parks: the LAICA Project [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper, we address the exploitation of computer vision techniques to develop multimedia services and automatic monitoring systems related to security and privacy in public areas. The research is part of a two-year Italian project called LAICA, intended to provide advanced services for citizens and public officers. Citizens want fast and friendly web access to public places, to see the environment in real time without violating the privacy laws. Public officers and policy centres want a fast and reactive monitoring system, capable of automatically detecting dangerous situations, given the huge number of cameras that cannot be monitored simultaneously by human operators. In this work, we describe the project and the defined methodologies in multi-camera video mosaicing, people tracking and consistent labelling, and access to processed data with face obscuration.


2005 - An Integrated Multi-Modal Sensor Network for Video Surveillance [Relazione in Atti di Convegno]
Prati, Andrea; Vezzani, Roberto; L., Benini; E., Farella; P., Zappi
abstract

To enhance video surveillance systems, multi-modal sensor integration can be a successful strategy. In this work, a computer vision system able to detect and track people from multiple cameras is integrated with a wireless sensor network mounting PIR (Passive InfraRed) sensors. The two subsystems are briefly described, and possible cases in which computer vision algorithms are likely to fail are discussed. Then, simple but reliable outputs from the PIR sensor nodes are exploited to improve the accuracy of the vision system. In particular, two case studies are reported: the first uses the presence detection of the PIR sensors to disambiguate between an open door and a moving person, while the second handles motion direction changes during occlusions. Preliminary results are reported and demonstrate the usefulness of the integration of the two subsystems.


2005 - Assessing Temporal Coherence for Posture Classification with Large Occlusions [Relazione in Atti di Convegno]
Cucchiara, Rita; Vezzani, Roberto
abstract

In this paper we present a people posture classification approach especially devoted to coping with occlusions. In particular, the approach aims at assessing the temporal coherence of visual data over probabilistic models. A mixed predictive and probabilistic tracking is proposed: a probabilistic tracker maintains over time the actual appearance of detected people and evaluates the occlusion probability, while an additional tracker with Kalman prediction improves the estimation of the people's position inside the room. Probabilistic Projection Maps (PPMs), created in a learning phase, are matched against the appearance mask of the track. Finally, a Hidden Markov Model formulation of the posture corrects the frame-by-frame classification uncertainties and makes the system reliable even in the presence of occlusions. Results obtained over real indoor sequences are discussed.
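The HMM correction the abstract mentions can be illustrated with a standard Viterbi decoding over per-frame posture likelihoods. This is a generic sketch of the idea, not the paper's code; the two postures, likelihoods, and transition probabilities below are all made up:

```python
import numpy as np

def viterbi(obs_lik, trans, prior):
    """Most likely state sequence given per-frame likelihoods (T x S),
    a transition matrix (S x S) and a prior (S,). Log-domain for stability."""
    T, S = obs_lik.shape
    log_t = np.log(trans)
    delta = np.log(prior) + np.log(obs_lik[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_t          # (from-state, to-state) scores
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(obs_lik[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two postures (0 = standing, 1 = crouched); frame 2 alone prefers posture 1.
obs = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.4, 0.6],   # noisy frame-by-frame classification
                [0.9, 0.1]])
trans = np.array([[0.95, 0.05],    # postures change rarely between frames
                  [0.05, 0.95]])
prior = np.array([0.5, 0.5])
print(viterbi(obs, trans, prior))  # -> [0, 0, 0, 0]
```

The sticky transition model overrides the isolated outlier at frame 2, which is exactly the kind of frame-by-frame uncertainty the abstract says the HMM corrects.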


2005 - Computer vision system for in-house video surveillance [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

In-house video surveillance to control the safety of people living in domestic environments is considered. In this context, common problems and general-purpose computer vision techniques are discussed and implemented in an integrated solution comprising a robust moving object detection module which is able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. These factors, shadows, large occlusions and people's posture, are the key problems encountered by in-house surveillance systems. A distributed system with cameras installed in each room of a house can be used to provide full coverage of people's movements. Tracking is based on a probabilistic approach in which the appearance and probability of occlusions are computed for the current camera and warped into the next camera's view, with the cameras positioned so as to disambiguate the occlusions. The application context is the emerging area of domotics (from the Latin word domus, meaning 'home', and informatics), and in particular indoor video surveillance, which makes it possible for elderly and disabled people to live with a sufficient degree of autonomy via interaction with this new technology, which can be distributed in a house at affordable costs and with high reliability.


2005 - Consistent labeling for multi-camera object tracking [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new approach to multi-camera object tracking based on consistent labeling. An automatic and reliable procedure makes it possible to obtain the homographic transformation between two overlapped views without any manual calibration of the cameras. Objects' positions are matched through the homography when an object is first detected in one of the two views. The approach has also been tested in the case of simultaneous transitions and in the case in which people are detected as a group during the transition. Promising results are reported over a real setup of overlapped cameras.
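The core operation this consistent-labeling scheme relies on is mapping a point from one view into the other through a homography. A minimal numpy sketch of that mapping; the matrix values are illustrative (here a pure translation on the common ground plane), not taken from the paper:

```python
import numpy as np

def warp_point(H, point):
    """Map a 2D image point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    q = H @ np.array([x, y, 1.0])
    return q[:2] / q[2]           # divide by the homogeneous coordinate

# Illustrative homography between two overlapped views: a made-up
# translation of (50, -20) pixels on the shared ground plane.
H = np.array([[1.0, 0.0,  50.0],
              [0.0, 1.0, -20.0],
              [0.0, 0.0,   1.0]])

print(warp_point(H, (100.0, 200.0)))  # -> [150. 180.]
```

When a new track appears in one camera, warping its position into the overlapped view and taking the nearest existing track is the matching step the abstract describes.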


2005 - Entry Edge of Field of View for multi-camera tracking in distributed video surveillance [Relazione in Atti di Convegno]
Calderara, Simone; Vezzani, Roberto; Prati, Andrea; Cucchiara, Rita
abstract

Efficient solutions to people tracking in distributed video surveillance are required to monitor crowded and large environments. This paper proposes a novel use of the Entry Edges of Field of View (E2oFoV) to solve the consistent labeling problem between partially overlapped views. An automatic and reliable procedure makes it possible to obtain the homographic transformation between two overlapped views without any manual calibration of the cameras. Through the homography, the consistent labeling is established each time a new track is detected in one of the cameras. A Camera Transition Graph (CTG) is defined to speed up the establishment process by reducing the search space. Experimental results prove the effectiveness of the proposed solution even in challenging conditions.


2005 - Making the home safer and more secure through visual surveillance [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

Video surveillance has a direct application in intelligent home automation or domotics (from the Latin word domus, meaning “home”, and informatics). In particular, in-house video surveillance can provide good support for people with some difficulties (e.g. elderly or disabled people) living alone and with limited autonomy. A key aspect of video surveillance systems for domotics is analyzing the behaviours of the monitored people. To accomplish this task, people must be detected and tracked, and their posture must be analyzed in order to model behaviours and recognize abrupt changes in them. Problems related to reliable software solutions are not completely solved: in particular, luminance changes, shadows and frequent posture changes must be taken into account. Long-lasting occlusions are common due to the proximity of the cameras and the presence of furniture and doors that can often hide parts of a person’s body. For these reasons, a probabilistic and appearance-based tracking approach, particularly suitable for people tracking and posture classification, has been developed. However, despite its effectiveness for long-lasting and large occlusions, this approach tends to fail whenever the person is monitored by multiple cameras and appears in one of them already occluded. The different views provided by multiple cameras can be exploited to solve occlusions by warping the known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by an HMM to ensure temporal coherence of the posture.


2005 - Posture Classification in a Multi-camera Indoor Environment [Relazione in Atti di Convegno]
Cucchiara, R.; Prati, A.; Vezzani, R.
abstract

Posture classification is a key process for analyzing people’s behaviour. Computer vision techniques can be helpful in automating this process, but cluttered environments and the consequent occlusions often make this task difficult. The different views provided by multiple cameras can be exploited to solve occlusions by warping the known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by an HMM to ensure temporal coherence of the posture. The single-camera posture classification is then exploited in the multi-camera system to solve the cases in which occlusions make the classification impossible. Experimental results of the classification from both the single camera and the multi-camera system are provided.


2005 - Probabilistic posture classification for human-behavior analysis [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

Computer vision and ubiquitous multimedia access nowadays make feasible the development of a mostly automated system for human-behavior analysis. In this context, our proposal is to analyze human behaviors by classifying the posture of the monitored person and, consequently, detecting corresponding events and alarm situations, such as a fall. To this aim, our approach can be divided into two phases: for each frame, the projection histograms (Haritaoglu et al., 1998) of each person are computed and compared with the probabilistic projection maps stored for each posture during the training phase; then, the obtained posture is further validated by exploiting the information extracted by a tracking module, in order to take into account the reliability of the classification of the first phase. Moreover, the tracking algorithm is used to handle occlusions, making the system particularly robust even in indoor environments. Extensive experimental results demonstrate a promising average accuracy of more than 95% in correctly classifying human postures, even under challenging conditions.
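The first phase described above, projection histograms matched against stored per-posture maps, can be sketched in a few lines. Everything here is a toy illustration under assumed data shapes (a 6x6 binary mask, two hypothetical posture templates), not the authors' implementation:

```python
import numpy as np

def projection_histograms(mask):
    """Vertical and horizontal projection histograms of a binary silhouette mask."""
    return mask.sum(axis=0), mask.sum(axis=1)

def classify_posture(mask, templates):
    """Pick the stored posture template whose projection maps best correlate
    with the observed projection histograms."""
    v, h = projection_histograms(mask)
    return max(templates,
               key=lambda p: np.dot(v, templates[p][0]) + np.dot(h, templates[p][1]))

# Toy 6x6 silhouette: a tall, thin blob (a "standing" person).
mask = np.zeros((6, 6))
mask[1:6, 3] = 1.0

# Hypothetical per-posture probability maps (here, normalized projections).
templates = {
    "standing": (np.array([0, 0, 0, 1, 0, 0.]), np.array([0, .2, .2, .2, .2, .2])),
    "lying":    (np.ones(6) / 6.0,              np.array([1, 0, 0, 0, 0, 0.])),
}
print(classify_posture(mask, templates))  # -> standing
```

The second phase in the abstract (validation via the tracker's reliability estimate) would then gate or re-weight this per-frame decision.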


2004 - An Intelligent Surveillance System for Dangerous Situation Detection in Home Environments [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglu et al. [5] that uses histogram projections to classify people’s posture. Projection histograms are exploited here as the main feature for posture classification but, unlike [5], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to solve perspective problems and to improve the precision of the classification. Furthermore, we make use of a finite state machine to detect dangerous situations, such as falls, and to activate a suitable alarm generator. The system works on-line on standard workstations with network cameras.


2004 - Probabilistic People Tracking for Occlusion Handling [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Tardini, Giovanni; Vezzani, Roberto
abstract

This work presents a novel people tracking approach, able to cope with frequent shape changes and large occlusions. In particular, the tracks are described by means of probabilistic masks and appearance models. Occlusions due to other tracks, occlusions due to background objects, and false occlusions are discriminated. The tracking system is general enough to be applied with any motion segmentation module; it can track people interacting with each other, and it maintains the pixel-to-track assignment even with large occlusions. At the same time, the update model is very reactive, so as to cope with sudden body motion and changes in silhouette shape. Due to its robustness, it has been used in many experiments of people behavior control in indoor situations.


2004 - Real-time motion segmentation from moving cameras [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This paper describes our approach to real-time detection of camera motion and moving object segmentation in videos acquired from moving cameras. As far as we know, none of the proposals reported in the literature is able to meet real-time requirements. In this work, we present an approach based on a color segmentation followed by a region merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by a work of Gelgon and Bouthemy (Pattern Recognition 33 (2000) 725-40), which has been modified to reduce the computational cost in order to achieve a fast segmentation (about 10 frames per second). To this aim, a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.


2004 - Using computer vision techniques for dangerous situation detection in domotic applications [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Tardini, Giovanni; Vezzani, Roberto
abstract

We describe an integrated solution devised for in-house video surveillance, to control the safety of people living in a domestic environment. The system is composed of a robust moving object detection module, able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. Shadows, large occlusions and deformable models of people are key features of in-house surveillance. Moreover, the requirement of a fast reaction to dangerous situations and the need to implement a reliable and low-cost televiewing system led to the introduction of a new multimedia model of semantic transcoding, capable of supporting different users' requests and the constraints of their devices (PDAs, smart phones, ...). Our application context is the emerging area of domotics (from the Latin word domus, meaning "home", and informatics) and, in particular, indoor video surveillance of the house, where people with some difficulties (elderly and disabled people) can now live with a sufficient degree of autonomy, thanks to the strong interaction with the new technologies that can be distributed in the house at affordable costs and with high reliability.


2003 - A Hough transform-based method for radial lens distortion correction [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati; Vezzani, Roberto
abstract

The paper presents an approach for a robust (semi-)automatic correction of radial lens distortion in images and videos. This method, based on the Hough transform, can also be applied to videos from unknown cameras that, consequently, cannot be calibrated a priori. We approximate the lens distortion by considering only the lower-order term of the radial distortion. Thus, the method relies on the assumption that pure radial distortion transforms straight lines into curves. The computation of the best value of the distortion parameter is performed in a multi-resolution way. The method's precision depends on the scale of the multi-resolution scheme and on the resolution of the Hough space. Experiments are provided for both an outdoor, uncalibrated camera and an indoor, calibrated one. The stability of the value found in different frames of the same video demonstrates the reliability of the proposed method.
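The "lower-order term of the radial distortion" is the standard one-parameter model x_d = x_u (1 + k r_u²), with radii measured from the principal point. A generic sketch of inverting that model once k has been estimated; the iteration scheme, k value and principal point below are illustrative assumptions, not details from the paper:

```python
def undistort_point(xd, yd, k, cx, cy):
    """Invert the one-parameter radial model x_d = x_u * (1 + k * r_u^2)
    around the principal point (cx, cy) by fixed-point iteration."""
    x, y = xd - cx, yd - cy   # distorted coordinates, centered
    xu, yu = x, y             # initial guess: no distortion
    for _ in range(20):       # iterate x_u = x_d / (1 + k * r_u^2)
        r2 = xu * xu + yu * yu
        xu, yu = x / (1.0 + k * r2), y / (1.0 + k * r2)
    return xu + cx, yu + cy

# Round-trip check: synthetically distort a point, then recover it.
k, cx, cy = 1e-6, 320.0, 240.0          # made-up distortion and principal point
xu, yu = 200.0 - cx, 100.0 - cy
f = 1.0 + k * (xu * xu + yu * yu)
xd, yd = xu * f + cx, yu * f + cy       # forward model
print(undistort_point(xd, yd, k, cx, cy))  # close to (200.0, 100.0)
```

Finding k itself is the paper's contribution: it searches the value that makes the Hough transform of the undistorted image show the strongest straight-line responses.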


2003 - Computer Vision Techniques for PDA Accessibility of In-House Video Surveillance [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati; Vezzani, Roberto
abstract

In this paper we propose an approach to indoor environment surveillance and, in particular, to people behaviour control in a home automation context. The reference application is the silent and automatic monitoring of the behaviour of people living alone in the house, specially conceived for people with limited autonomy (e.g., elderly or disabled people). The aim is to detect dangerous events (such as a person falling down) and to react to these events by establishing a remote connection with low-performance clients, such as PDAs (Personal Digital Assistants). To this aim, we propose an integrated server architecture, typically connected over an intranet with network cameras, able to segment and track objects of interest; in the case of objects classified as people, the system must also evaluate the person's posture and infer possible dangerous situations. Finally, the system is equipped with a specifically designed transcoding server to adapt the video content to PDA requirements (display area and bandwidth) and to the user's requests. The main contributions of the proposal are a reliable real-time object detection and tracking module, a simple but effective posture classifier improved by a supervised learning phase, and a high-performance transcoding inspired by the MPEG-4 object-level standard, tailored to PDAs. Results on different video sequences and a performance analysis are discussed.


2003 - Domotics for disability: smart surveillance and smart video server [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglu et al. [6] that uses histogram projections to classify people’s posture. Projection histograms are exploited here as the main feature for posture classification but, unlike [6], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to resolve perspective problems and improve the precision of the classification. Furthermore, we make use of a finite state machine to detect dangerous situations, such as falls, and to activate a suitable alarm generator. The system works on-line on standard workstations with network cameras.


2003 - Object Segmentation in Videos from Moving Camera with MRFs on Color and Motion Features [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of quickly segmenting moving objects in video acquired by a moving camera or, more generally, with a moving background. We present an approach based on a color segmentation followed by a region merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by a work of Gelgon and Bouthemy [6], which has been modified to reduce the computational cost in order to achieve a fast segmentation (about ten frames per second). To this aim, a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.