Rita CUCCHIARA - UniMoRe staff

Rita CUCCHIARA

Full Professor
Department of Engineering "Enzo Ferrari"

Publications

- A feature-based separation method, for separating a plurality of loosely-arranged duplicate articles and a system for actuating the method for supplying a packaging machine [Patent]
Monti, G.; Prati, Andrea; Piccinini, Paolo; Cucchiara, Rita
abstract

The invention relates to a feature-based segmentation method for segmenting a plurality of loosely arranged duplicate articles (3), which comprises the stages of: acquiring an image (M) of a sample article (30); calculating keypoint-descriptor pairs of the image (M); defining an identifying figure (Z) on the image (M); acquiring a first image (I1) of a plurality of duplicate articles; performing a matching of the thus-defined keypoint-descriptor pairs; acquiring a position and an orientation of the identifying figure (Z) with respect to a first keypoint-descriptor pair of the image (M) having a match with a second keypoint-descriptor pair of the first image (I1); defining, in the first image (I1), a projected identifying figure as a Euclidean transformation of the identifying figure (Z), with reference to the first and second pairs; applying the two preceding stages to a plurality of keypoint-descriptor pairs of the image (M) having a match with a keypoint-descriptor pair of the first image (I1); grouping together projected identifying figures that have a predetermined degree of overlap with one another; defining a representative figure for each group that is formed by a predetermined minimum number of projected identifying figures, which representative figure has the same shape and dimensions as a projected identifying figure and is selected in order to estimate the position of the corresponding article shown in the first image (I1). The invention also relates to a method for picking up articles (3) arranged loosely in a storage zone of articles (5) and for positioning the articles (3) in an outlet station (SU), and a group for actuating the method.
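
The projection and grouping stages described in this abstract can be illustrated with a small sketch. It is not the patented implementation: it assumes each keypoint carries a position and an orientation in radians, it represents the identifying figure as a list of polygon vertices, and it uses the distance between figure centroids as a simplified stand-in for the overlap test. All function and parameter names are hypothetical.

```python
import math

def project_figure(figure, model_kp, scene_kp):
    """Map figure vertices from the sample image into the scene image.

    Keypoints are (x, y, angle_in_radians). The transform is Euclidean:
    a rotation by the angle difference followed by a translation, no scaling.
    """
    mx, my, m_angle = model_kp
    sx, sy, s_angle = scene_kp
    c = math.cos(s_angle - m_angle)
    s = math.sin(s_angle - m_angle)
    return [(sx + c * (x - mx) - s * (y - my),
             sy + s * (x - mx) + c * (y - my)) for x, y in figure]

def centroid(figure):
    n = len(figure)
    return (sum(x for x, _ in figure) / n, sum(y for _, y in figure) / n)

def group_projections(figures, max_center_dist, min_votes):
    """Greedy grouping of projected figures.

    Assumption: centroid distance replaces the abstract's overlap criterion.
    Groups with fewer than min_votes members are discarded; the first member
    of each surviving group is returned as its representative figure.
    """
    groups = []
    for fig in figures:
        cx, cy = centroid(fig)
        for group in groups:
            gx, gy = centroid(group[0])
            if math.hypot(cx - gx, cy - gy) <= max_center_dist:
                group.append(fig)
                break
        else:
            groups.append([fig])
    return [group[0] for group in groups if len(group) >= min_votes]
```

Each matched keypoint pair casts one vote by calling `project_figure`; positions supported by at least `min_votes` consistent votes are kept as article estimates.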

- Feature-based segmentation method, for segmenting a plurality of loosely-arranged duplicate articles and a group for actuating the method for supplying a packaging machine [Patent]
Monti, G.; Prati, Andrea; Piccinini, Paolo; Cucchiara, Rita

- METODO DI SEGMENTAZIONE BASATO SULLE CARATTERISTICHE PER SEGMENTARE UNA PLURALITA’ DI ARTICOLI DUPLICATI DISPOSTI ALLA RINFUSA E GRUPPO CHE ATTUA TALE METODO PER ALIMENTARE UNA MACCHINA CONFEZIONATRICE [Patent]
Piccinini, Paolo; Prati, Andrea; Cucchiara, Rita
abstract

A feature-based segmentation method is disclosed for segmenting a plurality of loosely arranged duplicate articles (3), comprising the steps of: acquiring an image (M) of a sample article (30); calculating keypoint-descriptor pairs of the image (M); defining an identifying figure (Z) on the image (M); acquiring a first image (I1) of a plurality of articles; calculating keypoint-descriptor pairs of the first image (I1); matching the keypoint-descriptor pairs thus defined; acquiring the relative position and orientation of the identifying figure (Z) with respect to a first keypoint-descriptor pair of the image (M) that has a match with a second keypoint-descriptor pair of the first image (I1); defining in the first image (I1) a projected identifying figure as a Euclidean transformation of the identifying figure (Z) with reference to said first and second pairs; applying the two preceding steps to a plurality of keypoint-descriptor pairs of the image (M) that have a match with a keypoint-descriptor pair of the first image (I1); grouping together projected identifying figures that have a predetermined degree of overlap with one another; defining a representative figure for each group of projected identifying figures that is formed by a predetermined minimum number of projected identifying figures, which representative figure has the same shape and dimensions as a projected identifying figure and is chosen to estimate the position of a corresponding article shown in the first image (I1). Also disclosed are a method for picking up articles (3) arranged loosely in an article storage zone (5) and for positioning said articles (3) in an outlet station (SU), and a group that implements said method.

2024 - AIGeN: An Adversarial Approach for Instruction Generation in VLN [Conference Paper]
Rawal, Niyati; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation [Conference Paper]
Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization [Journal Article]
Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [Conference Paper]
Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [Conference Paper]
Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

2024 - FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval [Conference Paper]
Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Fluent and Accurate Image Captioning with a Self-Trained Reward Model [Conference Paper]
Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [Journal Article]
Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita

2024 - Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [Conference Paper]
Bigazzi, Roberto; Baraldi, Lorenzo; Kousik, Shreyas; Cucchiara, Rita; Pavone, Marco

2024 - Multi-Class Unlearning for Image Classification via Weight Filtering [Journal Article]
Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.

2024 - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [Journal Article]
Amoroso, Roberto; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Del Bimbo, Alberto; Cucchiara, Rita

2024 - Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization [Conference Paper]
Moratelli, Nicholas; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [Conference Paper]
Poppi, Samuele; Poppi, Tobia; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - The Revolution of Multimodal Large Language Models: A Survey [Conference Paper]
Caffagni, Davide; Cocchi, Federico; Barsellotti, Luca; Moratelli, Nicholas; Sarto, Sara; Baraldi, Lorenzo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

2024 - Towards Retrieval-Augmented Architectures for Image Captioning [Journal Article]
Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

2024 - Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [Conference Paper]
Barsellotti, Luca; Amoroso, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2024 - Trends, Applications, and Challenges in Human Attention Modelling [Conference Paper]
Cartella, Giuseppe; Cornia, Marcella; Cuculo, Vittorio; D'Amelio, Alessandro; Zanca, Dario; Boccignone, Giuseppe; Cucchiara, Rita
abstract

Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges.

2024 - Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images [Journal Article]
Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Cucchiara, Rita
abstract

Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we investigate whether human semantic knowledge can be included in fake image detection frameworks. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.

2024 - What’s Outside the Intersection? Fine-grained Error Analysis for Semantic Segmentation Beyond IoU [Conference Paper]
Bernhard, Maximilian; Amoroso, Roberto; Kindermann, Yannic; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker; Schubert, Matthias

2024 - Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs [Conference Paper]
Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

2023 - Consistency-Based Self-supervised Learning for Temporal Anomaly Localization [Conference Paper]
Panariello, A.; Porrello, A.; Calderara, S.; Cucchiara, R.

2023 - Deep Learning and Large Scale Models for Bank Transactions [Conference Paper]
Garuti, Fabrizio; Luetto, Simone; Cucchiara, Rita; Sangineto, Enver
abstract

The success of Artificial Intelligence (AI) in different research and application areas has increased the interest in adopting Deep Learning techniques in the financial field as well. Particularly interesting is the case of financial transactional data, which represent one of the most valuable sources of information for banks and other financial institutions. However, the heterogeneity of the data, composed of both numerical and categorical attributes, makes the use of standard Deep Learning methods difficult. In this paper, we present UniTTAB, a Transformer network for transactional time series, which can uniformly represent heterogeneous time-dependent data, and which is trained on a very large scale of real transactional data. As far as we know, the dataset we used for training is the largest real bank transactions dataset used for Deep Learning methods in this field, as all the other common datasets are either much smaller or synthetically generated. The use of this very large real training dataset makes our UniTTAB the first foundation model for transactional data.

2023 - Depth-based 3D human pose refinement: Evaluating the refinet framework [Journal Article]
D'Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.
abstract

In recent years, Human Pose Estimation has achieved impressive results on RGB images. The advent of deep learning architectures and large annotated datasets have contributed to these achievements. However, little has been done towards estimating the human pose using depth maps, and especially towards obtaining a precise 3D body joint localization. To fill this gap, this paper presents RefiNet, a depth-based 3D human pose refinement framework. Given a depth map and an initial coarse 2D human pose, RefiNet regresses a fine 3D pose. The framework is composed of three modules, based on different data representations, i.e. 2D depth patches, 3D human skeletons, and point clouds. An extensive experimental evaluation is carried out to investigate the impact of the model hyper-parameters and to compare RefiNet with off-the-shelf 2D methods and literature approaches. Results confirm the effectiveness of the proposed framework and its limited computational requirements.

2023 - Embodied Agents for Efficient Exploration and Smart Scene Description [Conference Paper]
Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

2023 - Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval [Conference Paper]
Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

2023 - Evaluating synthetic pre-Training for handwriting processing tasks [Journal Article]
Pippi, V.; Cascianelli, S.; Baraldi, L.; Cucchiara, R.
abstract

In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objective. The vector representations of the images obtained from the pre-trained ConvNet can then be considered as encodings of the handwriting style. We exploit such representations for Writer Retrieval, Writer Identification, Writer Verification, and Writer Classification and demonstrate that our pre-training strategy allows extracting rich representations of the writers' style that enable the aforementioned tasks with competitive results with respect to task-specific State-of-the-Art approaches.

2023 - Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates [Journal Article]
Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently-proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.

2023 - From Show to Tell: A Survey on Deep Learning-based Image Captioning [Journal Article]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

2023 - Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning [Journal Article]
Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita

2023 - Input Perturbation Reduces Exposure Bias in Diffusion Models [Conference Paper]
Ning, M.; Sangineto, E.; Porrello, A.; Calderara, S.; Cucchiara, R.
abstract

Denoising Diffusion Probabilistic Models have shown an impressive generation quality although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is available at https://github.com/forever208/DDPM-IP.
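
The training regularization described in this abstract can be sketched as follows. This is an illustration under assumptions, not the authors' code (their implementation is in the repository linked above): it assumes a noise-prediction objective, assumes the perturbation is added to the noise used to build the network input, and uses a hypothetical strength parameter `gamma`. Plain Python lists stand in for image tensors.

```python
import math
import random

def noisy_sample(x0, alpha_bar_t, gamma=0.1, rng=random):
    """Build a perturbed training input for one diffusion timestep.

    x0          : clean sample as a list of floats
    alpha_bar_t : cumulative noise-schedule product at step t, in (0, 1)
    gamma       : perturbation strength (assumed name); 0 gives the
                  standard, unperturbed forward sample
    Returns (y_t, eps): the network input and the noise it should predict.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # regression target
    xi = [rng.gauss(0.0, 1.0) for _ in x0]    # extra perturbation noise
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    y_t = [a * x + b * (e + gamma * z) for x, e, z in zip(x0, eps, xi)]
    return y_t, eps
```

The target remains the unperturbed `eps`, so the network sees slightly corrupted inputs during training, which is the mismatch it would otherwise only meet at sampling time.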

2023 - LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On [Conference Paper]
Morelli, Davide; Baldrati, Alberto; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita
abstract

The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.

2023 - Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation [Conference Paper]
Betti, Federico; Staiano, Jacopo; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

2023 - Let's stay close: An examination of the effects of imagined contact on behavior toward children with disability [Journal Article]
Cocco, V. M.; Bisagno, E.; Bernardo, G. A. D.; Bicocchi, N.; Calderara, S.; Palazzi, A.; Cucchiara, R.; Zambonelli, F.; Cadamuro, A.; Stathi, S.; Crisp, R.; Vezzali, L.
abstract

In line with current developments in indirect intergroup contact literature, we conducted a field study using the imagined contact paradigm among high-status (Italian children) and low-status (children with foreign origins) group members (N = 122; 53 females, mean age = 7.52 years). The experiment aimed to improve attitudes and behavior toward a different low-status group, children with disability. To assess behavior, we focused on an objective measure that captures the physical distance between participants and a child with disability over the course of a five-minute interaction (i.e., while playing together). Results from a 3-week intervention revealed that, in the case of high-status children, imagined contact, relative to a no-intervention control condition, improved outgroup attitudes and behavior, and strengthened helping and contact intentions. These effects, however, did not emerge among low-status children. The results are discussed in the context of intergroup contact literature, with emphasis on the implications of imagined contact for educational settings.

2023 - Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [Conference Paper]
Baldrati, Alberto; Morelli, Davide; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

2023 - OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data [Conference Paper]
Cartella, Giuseppe; Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita
abstract

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined supervised learning approaches that generalize poorly or proposed more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.

2023 - Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [Conference Paper]
Sarto, Sara; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language models. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. We publicly release our source code and trained models.

2023 - Predicting gene and protein expression levels from DNA and protein sequences with Perceiver [Journal Article]
Stefanini, Matteo; Lovino, Marta; Cucchiara, Rita; Ficarra, Elisa
abstract

Background and objective: The functions of an organism and its biological processes result from the expression of genes and proteins. Therefore, quantifying and predicting mRNA and protein levels is a crucial aspect of scientific research. Concerning the prediction of mRNA levels, the available approaches use the sequence upstream and downstream of the Transcription Start Site (TSS) as input to neural networks. State-of-the-art models (e.g., Xpresso and Basenji) predict mRNA levels exploiting Convolutional (CNN) or Long Short Term Memory (LSTM) Networks. However, CNN prediction depends on convolutional kernel size, and LSTM struggles to capture long-range dependencies in the sequence. Concerning the prediction of protein levels, as far as we know, there is no model for predicting protein levels by exploiting the gene or protein sequences. Methods: Here, we exploit a new model type (called Perceiver) for mRNA and protein level prediction, exploiting a Transformer-based architecture with an attention module to attend to long-range interactions in the sequences. In addition, the Perceiver model overcomes the quadratic complexity of the standard Transformer architectures. This work's contributions are: (1) the DNAPerceiver model to predict mRNA levels from the sequence upstream and downstream of the TSS; (2) the ProteinPerceiver model to predict protein levels from the protein sequence; (3) the Protein&DNAPerceiver model to predict protein levels from TSS and protein sequences. Results: The models are evaluated on cell lines, mice, glioblastoma, and lung cancer tissues. The results show the effectiveness of the Perceiver-type models in predicting mRNA and protein levels. Conclusions: This paper presents a Perceiver architecture for mRNA and protein level prediction. In the future, inserting regulatory and epigenetic information into the model could improve mRNA and protein level predictions. The source code is freely available at https://github.com/MatteoStefanini/DNAPerceiver.

2023 - Superpixel Positional Encoding to Improve ViT-based Semantic Segmentation Models [Conference Paper]
Amoroso, Roberto; Tomei, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

2023 - SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning [Conference Paper]
Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate descriptive and accurate textual descriptions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.

2023 - Towards Explainable Navigation and Recounting [Conference Paper]
Poppi, Samuele; Rawal, Niyati; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Explainability and interpretability of deep neural networks have become of crucial importance over the years in Computer Vision, concurrently with the need to understand increasingly complex models. This necessity has fostered research on approaches that facilitate human comprehension of neural methods. In this work, we propose an explainable setting for visual navigation, in which an autonomous agent needs to explore an unseen indoor environment while portraying and explaining interesting scenes with natural language descriptions. We combine recent advances in ongoing research fields, employing an explainability method on images generated through agent-environment interaction. Our approach uses explainable maps to visualize model predictions and highlight the correlation between the observed entities and the generated words, to focus on prominent objects encountered during the environment exploration. The experimental section demonstrates that our approach can identify the regions of the images that the agent concentrates on to describe its point of view, improving explainability.

2023 - TrackFlow: Multi-Object Tracking with Normalizing Flows [Conference Paper]
Mancusi, Gianluca; Panariello, Aniello; Porrello, Angelo; Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
abstract

The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information, e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they assume these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
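
The idea of scoring a candidate association by a negative log-likelihood can be illustrated with a toy example. The sketch below deliberately swaps the deep density estimator for a two-dimensional Gaussian with full covariance, fitted to cue pairs from known-correct associations; the cue choice and all names are assumptions. The off-diagonal covariance term is what lets the cost account for dependence between the two cues.

```python
import math

def fit_gaussian_2d(samples):
    """Fit mean and covariance to (cue_a, cue_b) pairs of correct associations."""
    n = len(samples)
    mx = sum(a for a, _ in samples) / n
    my = sum(b for _, b in samples) / n
    sxx = sum((a - mx) ** 2 for a, _ in samples) / n
    syy = sum((b - my) ** 2 for _, b in samples) / n
    sxy = sum((a - mx) * (b - my) for a, b in samples) / n
    return (mx, my), (sxx, sxy, syy)

def association_cost(x, mean, cov):
    """Negative log-likelihood of cue pair x under the fitted Gaussian."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy ** 2          # must be positive
    dx, dy = x[0] - mx, x[1] - my
    # Mahalanobis term, using the closed-form inverse of a 2x2 matrix
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return 0.5 * maha + 0.5 * math.log(det) + math.log(2.0 * math.pi)
```

In a tracker, such costs would fill the matrix handed to the assignment step, replacing a hand-tuned weighted sum of individual cues.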

2023 - Unveiling the Impact of Image Transformations on Deepfake Detection: An Experimental Analysis [Conference Paper]
Cocchi, Federico; Baraldi, Lorenzo; Poppi, Samuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

With the recent explosion of interest in visual Generative AI, the field of deepfake detection has gained a lot of attention. In fact, deepfake detection might be the only measure to counter the potential proliferation of generated media in support of fake news and its consequences. While many of the available works limit the detection to a pure and direct classification of fake versus real, this does not translate well to a real-world scenario. Indeed, malevolent users can easily apply post-processing techniques to generated content, changing the underlying distribution of fake data. In this work, we provide an in-depth analysis of the robustness of a deepfake detection pipeline, considering different image augmentations, transformations, and other pre-processing steps. These transformations are only applied in the evaluation phase, thus simulating a practical situation in which the detector is not trained on all the possible augmentations that can be used by the attacker. In particular, we analyze the performance of a k-NN and a linear probe detector on the COCOFake dataset, using image features extracted from pre-trained models, like CLIP and DINO. Our results demonstrate that while the CLIP visual backbone outperforms DINO in deepfake detection with no augmentation, its performance varies significantly in the presence of any transformation, favoring the robustness of DINO.
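
A k-NN detector over pre-extracted image features, as mentioned in this abstract, can be sketched in a few lines. The example is illustrative only: cosine similarity, the value of `k`, and the toy two-dimensional features are assumptions, and real features would come from a pre-trained backbone.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, feats, labels, k=3):
    """Label a query by majority vote among its k most similar features."""
    ranked = sorted(range(len(feats)),
                    key=lambda i: cosine(query, feats[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy reference set: features of images with known labels.
feats = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
labels = ["real", "real", "real", "fake", "fake", "fake"]
```

To mimic the evaluation protocol described above, the reference features would be extracted from untouched images, while the query features would be extracted after applying a transformation to the test image.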

2023 - Video Surveillance and Privacy: A Solvable Paradox? [Articolo su rivista]
Cucchiara, Rita; Baraldi, Lorenzo; Cornia, Marcella; Sarto, Sara
abstract

2023 - With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [Relazione in Atti di Convegno]
Barraco, Manuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - A Computational Approach for Progressive Architecture Shrinkage in Action Recognition [Articolo su rivista]
Tomei, Matteo; Baraldi, Lorenzo; Fiameni, Giuseppe; Bronzin, Simone; Cucchiara, Rita
abstract

2022 - ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [Relazione in Atti di Convegno]
Messina, Nicola; Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Falchi, Fabrizio; Amato, Giuseppe; Cucchiara, Rita
abstract

2022 - Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [Articolo su rivista]
Cascianelli, Silvia; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typically couple recurrent structures for sequence modeling with Convolutional Neural Networks for visual feature extraction. Since convolutional kernels are defined on fixed grids and focus on all input pixels independently while moving over the input image, this strategy disregards the fact that handwritten characters can vary in shape, scale, and orientation even within the same document and that the ink pixels are more relevant than the background ones. To cope with these specific HTR difficulties, we propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text. We design two deformable architectures and conduct extensive experiments on both modern and historical datasets. Experimental results confirm the suitability of deformable convolutions for the HTR task.

2022 - CaMEL: Mean Teacher Learning for Image Captioning [Relazione in Atti di Convegno]
Barraco, Manuele; Stefanini, Matteo; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - Dress Code: High-Resolution Multi-Category Virtual Try-On [Relazione in Atti di Convegno]
Morelli, Davide; Fincato, Matteo; Cornia, Marcella; Landi, Federico; Cesari, Fabio; Cucchiara, Rita
abstract

Image-based virtual try-on strives to transfer the appearance of a clothing item onto the image of a target person. Existing literature focuses mainly on upper-body clothes (e.g. t-shirts, shirts, and tops) and neglects full-body or lower-body items. This shortcoming arises from a main factor: current publicly available datasets for image-based virtual try-on do not account for this variety, thus limiting progress in the field. In this research activity, we introduce Dress Code, a novel dataset which contains images of multi-category clothes. Dress Code is more than 3x larger than publicly available datasets for image-based virtual try-on and features high-resolution paired images (1024 x 768) with front-view, full-body reference models. To generate HD try-on images with high visual quality and rich in details, we propose to learn fine-grained discriminating features. Specifically, we leverage a semantic-aware discriminator that makes predictions at pixel-level instead of image- or patch-level. The Dress Code dataset is publicly available at https://github.com/aimagelab/dress-code.

2022 - Dual-Branch Collaborative Transformer for Virtual Try-On [Relazione in Atti di Convegno]
Fenocchi, Emanuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cesari, Fabio; Cucchiara, Rita
abstract

Image-based virtual try-on has recently gained a lot of attention in both the scientific and fashion industry communities due to its challenging setting and practical real-world applications. While pure convolutional approaches have been explored to solve the task, Transformer-based architectures have not received significant attention yet. Following the intuition that self- and cross-attention operators can deal with long-range dependencies and hence improve the generation, in this paper we extend a Transformer-based virtual try-on model by adding a dual-branch collaborative module that can exploit cross-modal information at generation time. We perform experiments on the VITON dataset, which is the standard benchmark for the task, and on a recently collected virtual try-on dataset with multi-category clothing, Dress Code. Experimental results demonstrate the effectiveness of our solution over previous methods and show that Transformer-based architectures can be a viable alternative for virtual try-on.

2022 - Embodied Navigation at the Art Gallery [Relazione in Atti di Convegno]
Bigazzi, Roberto; Landi, Federico; Cascianelli, Silvia; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: that of a complete art museum. We name this environment ArtGallery3D (AG3D). Compared with existing 3D scenes, the collected space is ampler, richer in visual features, and provides very sparse occupancy information. This feature is challenging for occupancy-based agents which are usually trained in crowded domestic environments with plenty of occupancy information. Additionally, we annotate the coordinates of the main points of interest inside the museum, such as paintings, statues, and other items. Thanks to this manual process, we deliver a new benchmark for PointGoal navigation inside this new space. Trajectories in this dataset are far more complex and lengthy than existing ground-truth paths for navigation in Gibson and Matterport3D. We carry out an extensive experimental evaluation using our new space and show that existing methods hardly adapt to this scenario. As such, we believe that the availability of this 3D model will foster future research and help improve existing solutions.

2022 - Explaining Transformer-based Image Captioning Models: An Empirical Analysis [Articolo su rivista]
Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - Fine-Grained Human Analysis Under Occlusions and Perspective Constraints in Multimedia Surveillance [Articolo su rivista]
Cucchiara, Rita; Fabbri, Matteo
abstract

2022 - First Steps Towards 3D Pedestrian Detection and Tracking from Single Image [Relazione in Atti di Convegno]
Mancusi, G.; Fabbri, M.; Egidi, S.; Verasani, M.; Scarabelli, P.; Calderara, S.; Cucchiara, R.
abstract

For decades, the problem of multiple people tracking has been tackled leveraging 2D data only. However, people move and interact in a three-dimensional space. For this reason, using only 2D data might be limiting and overly challenging, especially due to occlusions and multiple overlapping people. In this paper, we take advantage of 3D synthetic data from the novel MOTSynth dataset, to train our proposed 3D people detector, whose observations are fed to a tracker that works in the corresponding 3D space. Compared to conventional 2D trackers, we show an overall improvement in performance with a reduction of identity switches on both real and synthetic data. Additionally, we propose a tracker that jointly exploits 3D and 2D data, showing an improvement over the proposed baselines. Our experiments demonstrate that 3D data can be beneficial, and we believe this paper will pave the road for future efforts in leveraging 3D data for tackling multiple people tracking. The code is available at https://github.com/GianlucaMancusi/LoCO-Det.

2022 - Focus on Impact: Indoor Exploration with Intrinsic Motivation [Articolo su rivista]
Bigazzi, Roberto; Landi, Federico; Cascianelli, Silvia; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
abstract

2022 - How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting [Relazione in Atti di Convegno]
Monti, A.; Porrello, A.; Calderara, S.; Coscia, P.; Ballan, L.; Cucchiara, R.
abstract

Accurate prediction of future human positions is an essential task for modern video-surveillance systems. Current state-of-the-art models usually rely on a "history" of past tracked locations (e.g., 3 to 5 seconds) to predict a plausible sequence of future locations (e.g., up to the next 5 seconds). We feel that this common schema neglects critical traits of realistic applications: as the collection of input trajectories involves machine perception (i.e., detection and tracking), incorrect detection and fragmentation errors may accumulate in crowded scenes, leading to tracking drifts. On this account, the model would be fed with corrupted and noisy input data, thus fatally affecting its prediction performance. In this regard, we focus on delivering accurate predictions when only a few input observations are used, thus potentially lowering the risks associated with automatic perception. To this end, we conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one, the latter fed with fewer observations (just two). We show that a properly defined teacher supervision allows a student network to perform comparably to state-of-the-art approaches that demand more observations. Besides, extensive experiments on common trajectory forecasting datasets highlight that our student network better generalizes to unseen scenarios.

2022 - Information fusion as an integrative cross-cutting enabler to achieve robust, explainable, and trustworthy medical artificial intelligence [Articolo su rivista]
Holzinger, A.; Dehmer, M.; Emmert-Streib, F.; Cucchiara, R.; Augenstein, I.; Del Ser, J.; Samek, W.; Jurisica, I.; Diaz-Rodriguez, N.
abstract

Medical artificial intelligence (AI) systems have been remarkably successful, even outperforming human performance at certain tasks. There is no doubt that AI is important to improve human health in many ways and will disrupt various medical workflows in the future. To use AI to solve problems in medicine beyond the lab, in routine environments, we need to do more than just improve the performance of existing AI methods. Robust AI solutions must be able to cope with imprecision, missing and incorrect information, and explain both the result and the process of how it was obtained to a medical expert. Using conceptual knowledge as a guiding model of reality can help to develop more robust, explainable, and less biased machine learning models that can ideally learn from less data. Achieving these goals will require an orchestrated effort that combines three complementary Frontier Research Areas: (1) Complex Networks and their Inference, (2) Graph causal models and counterfactuals, and (3) Verification and Explainability methods. The goal of this paper is to describe these three areas from a unified view and to motivate how information fusion in a comprehensive and integrative manner can not only help bring these three areas together, but also have a transformative role by bridging the gap between research and practical applications in the context of future trustworthy medical AI. This makes it imperative to include ethical and legal aspects as a cross-cutting discipline, because all future solutions must not only be ethically responsible, but also legally compliant.

2022 - Investigating Bidimensional Downsampling in Vision Transformer Models [Relazione in Atti di Convegno]
Bruno, Paolo; Amoroso, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Vision Transformers (ViT) and other Transformer-based architectures for image classification have achieved promising performances in the last two years. However, ViT-based models require large datasets, memory, and computational power to obtain state-of-the-art results compared to more traditional architectures. The generic ViT model, indeed, maintains a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. With the goal of increasing the efficiency of Transformer-based models, we explore the application of a 2D max-pooling operator on the outputs of Transformer encoders. We conduct extensive experiments on the CIFAR-100 dataset and the large ImageNet dataset and consider both accuracy and efficiency metrics, with the final goal of reducing the token sequence length without affecting the classification performance. Experimental results show that bidimensional downsampling can outperform previous classification approaches while requiring relatively limited computation resources.
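
To make the notion of bidimensional downsampling of a token sequence concrete, the following is a minimal sketch under stated assumptions: the patch tokens are taken to be stored in row-major order over a grid, any class token is ignored, and the function name max_pool_tokens and the 2x2 window are choices made for this example rather than details taken from the paper.

```python
def max_pool_tokens(tokens, grid_h, grid_w, k=2):
    """Apply non-overlapping k x k max-pooling to a row-major grid of tokens.

    tokens: list of grid_h * grid_w feature vectors (lists of floats).
    Returns (grid_h // k) * (grid_w // k) pooled vectors, still row-major.
    """
    assert len(tokens) == grid_h * grid_w
    dim = len(tokens[0])
    pooled = []
    for r in range(0, grid_h - k + 1, k):
        for c in range(0, grid_w - k + 1, k):
            # Gather the k*k tokens that are spatial neighbours in the 2D grid.
            window = [tokens[(r + dr) * grid_w + (c + dc)]
                      for dr in range(k) for dc in range(k)]
            # Element-wise maximum over the window, one value per feature.
            pooled.append([max(vec[d] for vec in window) for d in range(dim)])
    return pooled

# Toy input: a 4x4 grid of 2-dimensional tokens.
tokens = [[float(i), float(-i)] for i in range(16)]
pooled = max_pool_tokens(tokens, 4, 4)
print(len(tokens), "->", len(pooled))
```

With a 2x2 window the sequence length drops by a factor of four, which is where the efficiency gain in later layers would come from.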

2022 - Matching Faces and Attributes Between the Artistic and the Real Domain: the PersonArt Approach [Articolo su rivista]
Cornia, Marcella; Tomei, Matteo; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - Retrieval-Augmented Transformer for Image Captioning [Relazione in Atti di Convegno]
Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - SeeFar: Vehicle Speed Estimation and Flow Analysis from a Moving UAV [Relazione in Atti di Convegno]
Ning, M.; Ma, X.; Lu, Y.; Calderara, S.; Cucchiara, R.
abstract

Visual perception from drones has recently been widely investigated for Intelligent Traffic Monitoring Systems (ITMS). In this paper, we introduce SeeFar to achieve vehicle speed estimation and traffic flow analysis based on YOLOv5 and DeepSORT from a moving drone. SeeFar differs from previous works in three key ways: the speed estimation and flow analysis components are integrated into a unified framework; our method of predicting car speed imposes the fewest constraints while maintaining high accuracy; our flow analyzer is direction-aware and outlier-aware. Specifically, we design the speed estimator only using the camera imaging geometry, where the transformation between world space and image space is completed by the variable Ground Sampling Distance. Besides, previous papers do not evaluate their speed estimators at scale due to the difficulty of obtaining the ground truth; we therefore propose a simple yet efficient approach to estimate the true speeds of vehicles via the prior size of the road signs. We evaluate SeeFar on our ten videos that contain 929 vehicle samples. Experiments on these sequences demonstrate the effectiveness of SeeFar by achieving 98.0% accuracy of speed estimation and 99.1% accuracy of traffic volume prediction, respectively.
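
The role of the Ground Sampling Distance (GSD) in turning image motion into a metric speed can be sketched as follows. This is only an illustration of the standard GSD relation, not the SeeFar estimator: it assumes a camera looking straight down from a hovering drone, and all parameter values and function names are hypothetical.

```python
def ground_sampling_distance(altitude_m, focal_length_mm, sensor_width_mm, image_width_px):
    """Metres of ground covered by one pixel for a nadir-pointing camera."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)

def speed_kmh(pixel_displacement, gsd_m_per_px, dt_s):
    """Convert a displacement in pixels observed over dt_s seconds into km/h."""
    metres = pixel_displacement * gsd_m_per_px
    return metres / dt_s * 3.6

# Assumed example values: 100 m altitude, 10 mm lens, 10 mm sensor, 4000 px wide frame.
gsd = ground_sampling_distance(altitude_m=100.0, focal_length_mm=10.0,
                               sensor_width_mm=10.0, image_width_px=4000)
# A tracked vehicle moves 20 px between two frames 0.04 s apart (25 fps).
v = speed_kmh(pixel_displacement=20.0, gsd_m_per_px=gsd, dt_s=0.04)
print(round(gsd, 4), "m/px;", round(v, 1), "km/h")
```

Because the GSD scales with altitude, it has to be recomputed as the drone climbs or descends, and a moving drone additionally requires compensating for the camera's own motion, which this sketch does not attempt.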

2022 - Special Section on AI-empowered Multimedia Data Analytics for Smart Healthcare [Articolo su rivista]
Hossain, M. S.; Cucchiara, R.; Muhammad, G.; Tobon, D. P.; El Saddik, A.
abstract

2022 - Spot the Difference: A Novel Task for Embodied Agents in Changing Environments [Relazione in Atti di Convegno]
Landi, Federico; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition [Relazione in Atti di Convegno]
Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita
abstract

2022 - The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis [Relazione in Atti di Convegno]
Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2022 - Transform, Warp, and Dress: A New Transformation-Guided Model for Virtual Try-On [Articolo su rivista]
Fincato, Matteo; Cornia, Marcella; Landi, Federico; Cesari, Fabio; Cucchiara, Rita
abstract

Virtual try-on has recently emerged in computer vision and multimedia communities with the development of architectures that can generate realistic images of a target person wearing a custom garment. This research interest is motivated by the large role played by e-commerce and online shopping in our society. Indeed, the virtual try-on task can offer many opportunities to improve the efficiency of preparing fashion catalogs and to enhance the online user experience. The problem is far from being solved: current architectures do not reach sufficient accuracy with respect to manually generated images and can only be trained on image pairs with a limited variety. Existing virtual try-on datasets have two main limits: they contain only female models, and all the images are available only in low resolution. This not only affects the generalization capabilities of the trained architectures but makes the deployment to real applications impractical. To overcome these issues, we present Dress Code, a new dataset for virtual try-on that contains high-resolution images of a large variety of upper-body clothes and both male and female models. Leveraging this enriched dataset, we propose a new model for virtual try-on capable of generating high-quality and photo-realistic images using a three-stage pipeline. The first two stages perform two different geometric transformations to warp the desired garment and make it fit into the target person's body pose and shape. Then, we generate the new image of that same person wearing the try-on garment using a generative network. We test the proposed solution on the most widely used dataset for this task as well as on our newly collected dataset and demonstrate its effectiveness when compared to current state-of-the-art methods. Through extensive analyses on our Dress Code dataset, we show the adaptability of our model, which can generate try-on images even with a higher resolution.

2022 - Warp and Learn: Novel Views Generation for Vehicles and Other Objects [Articolo su rivista]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
abstract

In this work we introduce a new self-supervised, semi-parametric approach for synthesizing novel views of a vehicle starting from a single monocular image. Differently from parametric (i.e. entirely learning-based) methods, we show how a-priori geometric knowledge about the object and the 3D world can be successfully integrated into a deep learning based image generation framework. As this geometric component is not learnt, we call our approach semi-parametric. In particular, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) is then trained to generate a realistic image starting from this geometric guidance. This blend between parametric and non-parametric components allows us to i) operate in a real-world scenario, ii) preserve high-frequency visual information such as textures, iii) handle truly arbitrary 3D roto-translations of the input and iv) perform shape transfer to completely different 3D models. Finally, we show that our approach can be easily complemented with synthetic data and extended to other rigid objects with completely different topology, even in the presence of concave structures and holes. A comprehensive experimental analysis against state-of-the-art competitors shows the efficacy of our method both from a quantitative and a perceptive point of view.

2022 - Wind Turbine Power Curve Monitoring Based on Environmental and Operational Data [Articolo su rivista]
Cascianelli, S.; Astolfi, D.; Castellani, F.; Cucchiara, R.; Fravolini, M. L.
abstract

The power produced by a wind turbine depends on environmental conditions, working parameters, and interactions with nearby turbines. However, these aspects are often neglected in the design of data-driven models for wind farms' performance analysis. In this article, we propose to predict the active power and to provide reliable prediction intervals via ensembles of multivariate polynomial regression models that exploit a higher number of inputs (compared to most approaches in the literature), including operational and thermal variables. We present two main strategies: the former considers the environmental measurements collected at the other wind turbines in the farm as additional modeling information for the turbine under analysis; the latter combines multiple models relative to different operative conditions. We validate our approach on real data from the SCADA system of a wind farm in Italy and obtain a MAE of the order of 1.0% of the rated power of the turbine. Moreover, due to the structure of our approach, we can gain quantitative insights on the covariates most frequently selected depending on the working region of the wind turbines.

2021 - A Novel Attention-based Aggregation Function to Combine Vision and Language [Relazione in Atti di Convegno]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

2021 - A Systematic Comparison of Depth Map Representations for Face Recognition [Articolo su rivista]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Maltoni, Davide; Cucchiara, Rita
abstract

2021 - AC-VRNN: Attentive Conditional-VRNN for multi-future trajectory prediction [Articolo su rivista]
Bertugli, A.; Calderara, S.; Coscia, P.; Ballan, L.; Cucchiara, R.
abstract

Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications. A key aspect of this task is the inherently multi-modal nature of human paths, which admits multiple socially acceptable futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing most likely moving directions and forcing the model to consider past observed dynamics in generating future positions. Human interactions are modelled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods.

2021 - Assessing the Role of Boundary-level Objectives in Indoor Semantic Segmentation [Relazione in Atti di Convegno]
Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Providing fine-grained and accurate segmentation maps of indoor scenes is a challenging task with relevant applications in the fields of augmented reality, image retrieval, and personalized robotics. While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been partially under-investigated. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we focus on the analysis of boundary-level objectives, which foster the generation of fine-grained boundaries between different semantic classes and which have never been explored in the case of indoor segmentation. In particular, we test and devise variants of both the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries. Through experiments on the NYUDv2 dataset, we quantify the role of such losses in terms of accuracy and quality of boundary prediction and demonstrate the accuracy gain of the proposed variants.

2021 - DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting [Relazione in Atti di Convegno]
Monti, Alessio; Bertugli, Alessia; Calderara, Simone; Cucchiara, Rita
abstract

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people's activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address the aforementioned aspects by proposing a new recurrent generative model that considers both single agents' future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and to integrate it with data about agents' possible future objectives. Our proposal is general enough to be applied to different scenarios: the model achieves state-of-the-art results in both urban environments and sports applications.

2021 - Estimating (and fixing) the Effect of Face Obfuscation in Video Recognition [Relazione in Atti di Convegno]
Tomei, Matteo; Baraldi, Lorenzo; Bronzin, Simone; Cucchiara, Rita
abstract

2021 - Explore and Explain: Self-supervised Navigation and Recounting [Relazione in Atti di Convegno]
Bigazzi, Roberto; Landi, Federico; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

2021 - FashionSearch++: Improving Consumer-to-Shop Clothes Retrieval with Hard Negatives [Relazione in Atti di Convegno]
Morelli, Davide; Cornia, Marcella; Cucchiara, Rita
abstract

Consumer-to-shop clothes retrieval has recently emerged in computer vision and multimedia communities with the development of architectures that can find similar in-shop clothing images given a query photo. Due to its nature, the main challenge lies in the domain gap between user-acquired and in-shop images. In this paper, we follow the most recent successful research in this area employing convolutional neural networks as feature extractors and propose to enhance the training supervision through a modified triplet loss that takes into account hard negative examples. We test the proposed approach on the Street2Shop dataset, achieving results comparable to state-of-the-art solutions and demonstrating good generalization properties when dealing with different settings and clothing categories.

2021 - Foreword by general chairs [Relazione in Atti di Convegno]
Cucchiara, R.; Del Bimbo, A.; Sclaroff, S.
abstract

2021 - Foreword by general chairs [Relazione in Atti di Convegno]
Cucchiara, R.; Del Bimbo, A.; Sclaroff, S.
abstract

2021 - Foreword by general chairs [Relazione in Atti di Convegno]
Cucchiara, R.; Del Bimbo, A.; Sclaroff, S.
abstract

2021 - Foreword by general chairs [Relazione in Atti di Convegno]
Cucchiara, R.; Del Bimbo, A.; Sclaroff, S.
abstract

2021 - Future Urban Scenes Generation Through Vehicles Synthesis [Relazione in Atti di Convegno]
Simoni, Alessandro; Bergamini, Luca; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm; i.e. generating a synthetic representation of an object undergoing a geometrical roto-translation in the 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.

2021 - Improving Indoor Semantic Segmentation with Boundary-level Objectives [Relazione in Atti di Convegno]
Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita
abstract

While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been partially under-investigated, despite being a relevant task with applications in augmented reality, image retrieval, and personalized robotics. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we develop and propose two novel boundary-level training objectives, which foster the generation of accurate boundaries between different semantic classes. In particular, we take inspiration from the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries, and propose modified geometric distance functions that improve predictions at the boundary level. Through experiments on the NYUDv2 dataset, we assess the appropriateness of our proposal in terms of accuracy and quality of boundary prediction and demonstrate its accuracy gain.

2021 - L'intelligenza non è artificiale. La rivoluzione tecnologica che sta già cambiando il mondo [Monografia/Trattato scientifico]
Cucchiara, Rita
abstract

2021 - Learning to Read L'Infinito: Handwritten Text Recognition with Synthetic Training Data [Relazione in Atti di Convegno]
Cascianelli, Silvia; Cornia, Marcella; Baraldi, Lorenzo; Piazzi, Maria Ludovica; Schiuma, Rosiana; Cucchiara, Rita
abstract

Deep learning-based approaches to Handwritten Text Recognition (HTR) have shown remarkable results on publicly available large datasets, both modern and historical. However, it is often the case that historical manuscripts are preserved in small collections, most of the time with unique characteristics in terms of paper support, author handwriting style, and language. State-of-the-art HTR approaches struggle to obtain good performance on such small manuscript collections, for which few training samples are available. In this paper, we focus on HTR on small historical datasets and propose a new historical dataset, which we call Leopardi, with the typical characteristics of small manuscript collections, consisting of letters by the poet Giacomo Leopardi, and devise strategies to deal with the training data scarcity scenario. In particular, we explore the use of carefully designed but cost-effective synthetic data for pre-training HTR models to be applied to small single-author manuscripts. Extensive experiments validate the suitability of the proposed approach, and both the Leopardi dataset and synthetic data will be available to favor further research in this direction.

2021 - Learning to Select: A Fully Attentive Approach for Novel Object Captioning [Conference Proceedings Paper]
Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2021 - MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking? [Conference Proceedings Paper]
Fabbri, Matteo; Braso, Guillem; Maugeri, Gianluca; Cetintas, Orcun; Gasparini, Riccardo; Osep, Aljosa; Calderara, Simone; Leal-Taixe, Laura; Cucchiara, Rita
abstract

2021 - Multi-Category Mesh Reconstruction From Image Collections [Conference Proceedings Paper]
Simoni, Alessandro; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita
abstract

Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a single RGB image. However, current methods are trained on image collections of a single category in order to exploit specific priors, and they often make use of category-specific 3D templates. In this paper, we present an alternative approach that infers the textured mesh of objects by combining a series of deformable 3D models and a set of instance-specific deformation, pose, and texture. Unlike previous work, our method is trained with images of multiple object categories using only foreground masks and rough camera poses as supervision. Without specific 3D templates, the framework learns category-level models which are deformed to recover the 3D shape of the depicted object. The instance-specific deformations are predicted independently for each vertex of the learned 3D mesh, enabling the dynamic subdivision of the mesh during the training process. Experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner. Predicted shapes are smooth and can benefit from multiple steps of subdivision during the training process, obtaining comparable or state-of-the-art results on two public datasets. Models and code are publicly released.

2021 - Multimodal Attention Networks for Low-Level Vision-and-Language Navigation [Journal Article]
Landi, Federico; Baraldi, Lorenzo; Cornia, Marcella; Corsini, Massimiliano; Cucchiara, Rita
abstract

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities -- natural language, images, and low-level actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and achieves good performance in the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.

2021 - Out of the Box: Embodied Navigation in the Real World [Conference Proceedings Paper]
Bigazzi, Roberto; Landi, Federico; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
abstract

The research field of Embodied AI has witnessed substantial progress in visual navigation and exploration thanks to powerful simulating platforms and the availability of 3D data of indoor and photorealistic environments. These two factors have opened the doors to a new generation of intelligent agents capable of achieving nearly perfect PointGoal Navigation. However, such architectures are commonly trained with millions, if not billions, of frames and tested in simulation. Together with great enthusiasm, these results yield a question: how many researchers will effectively benefit from these advances? In this work, we detail how to transfer the knowledge acquired in simulation into the real world. To that end, we describe the architectural discrepancies that damage the Sim2Real adaptation ability of models trained on the Habitat simulator and propose a novel solution tailored towards the deployment in real-world scenarios. We then deploy our models on a LoCoBot, a Low-Cost Robot equipped with a single Intel RealSense camera. Different from previous work, our testing scene is unavailable to the agent in simulation. The environment is also inaccessible to the agent beforehand, so it cannot count on scene-specific semantic priors. In this way, we reproduce a setting in which a research group (potentially from other fields) needs to employ the agent visual navigation capabilities as-a-Service. Our experiments indicate that it is possible to achieve satisfying results when deploying the obtained model in the real world.

2021 - RMS-Net: Regression and Masking for Soccer Event Spotting [Conference Proceedings Paper]
Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita
abstract

2021 - RefiNet: 3D Human Pose Refinement with Depth Maps [Conference Proceedings Paper]
D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we present a new dataset, called Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.

2021 - Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [Conference Proceedings Paper]
Poppi, Samuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2021 - SHREC 2021: Skeleton-based hand gesture recognition in the wild [Journal Article]
Caputo, Ariel; Giacchetti, Andrea; Soso, Simone; Pintani, Deborah; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Simoni, Alessandro; Vezzani, Roberto; Cucchiara, Rita; Ranieri, Andrea; Giannini, Franca; Lupinetti, Katia; Monti, Marina; Maghoumi, Mehran; LaViola Jr, Joseph; Le, Minh-Quan; Nguyen, Hai-Dang; Tran, Minh-Triet
abstract

This paper presents the results of the Eurographics 2019 SHape Retrieval Contest track on online gesture recognition. The goal of this contest was to test state-of-the-art methods that can be used to online detect command gestures from hands' movements tracking on a basic benchmark where simple gestures are performed interleaving them with other actions. Unlike previous contests and benchmarks on trajectory-based gesture recognition, we proposed an online gesture recognition task, not providing pre-segmented gestures, but asking the participants to find gestures within recorded trajectories. The results submitted by the participants show that an online detection and recognition of sets of very simple gestures from 3D trajectories captured with a cheap sensor can be effectively performed. The best methods proposed could be, therefore, directly exploited to design effective gesture-based interfaces to be used in different contexts, from Virtual and Mixed reality applications to the remote control of home devices.

2021 - Unifying tensor factorization and tensor nuclear norm approaches for low-rank tensor completion [Journal Article]
Du, S.; Xiao, Q.; Shi, Y.; Cucchiara, R.; Ma, Y.
abstract

Low-rank tensor completion (LRTC) has gained significant attention due to its powerful capability of recovering missing entries. However, it has to repeatedly calculate the time-consuming singular value decomposition (SVD). To address this drawback, we propose a new LRTC method based on the tensor-tensor product (t-product), the unified tensor factorization (UTF), for 3-way tensor completion. We first integrate the tensor factorization (TF) and the tensor nuclear norm (TNN) regularization into a framework that inherits the benefits of both TF and TNN: fast calculation and convex optimization. The conditions under which TF and TNN are equivalent are analyzed. Then, UTF for tensor completion is presented, and an efficient iterative update algorithm based on the alternating direction method of multipliers (ADMM) is used for the UTF optimization. The solution of the proposed alternating minimization algorithm is also proven to converge to a Karush–Kuhn–Tucker (KKT) point. Finally, numerical experiments on synthetic data completion and image/video inpainting tasks demonstrate the effectiveness of our method over other state-of-the-art tensor completion methods.
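As a small illustration of the tensor-tensor product (t-product) on which the abstract builds, the sketch below uses the common definition as a tube-wise circular convolution along the third dimension. It is not the UTF algorithm itself, and the helper names `t_product` and `identity_tensor` are hypothetical. Practical implementations usually compute the same product slice by slice in the Fourier domain, whereas this version follows the definition directly for clarity.

```python
def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3), given as nested lists."""
    n1, n2, n3 = len(A), len(B), len(A[0][0])
    n4 = len(B[0])
    C = [[[0.0] * n3 for _ in range(n4)] for _ in range(n1)]
    for i in range(n1):
        for j in range(n4):
            for k in range(n2):
                # Circular convolution of tube A[i][k] with tube B[k][j].
                for t in range(n3):
                    for s in range(n3):
                        C[i][j][t] += A[i][k][s] * B[k][j][(t - s) % n3]
    return C

def identity_tensor(n, n3):
    """Identity for the t-product: first frontal slice is I, the others are zero."""
    return [[[1.0 if (i == j and t == 0) else 0.0 for t in range(n3)]
             for j in range(n)] for i in range(n)]

# A hypothetical 2 x 2 x 3 tensor; multiplying by the identity tensor returns it unchanged.
A = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[7.0, 8.0, 9.0], [0.0, 1.0, 2.0]]]
I = identity_tensor(2, 3)
unchanged = t_product(A, I)
```

When the third dimension has size 1, the t-product reduces to ordinary matrix multiplication. This makes it a natural setting for factorizing a 3-way tensor into two smaller tensors.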

2021 - VITON-GT: An Image-based Virtual Try-On Model with Geometric Transformations [Conference Proceedings Paper]
Fincato, Matteo; Landi, Federico; Cornia, Marcella; Cesari, Fabio; Cucchiara, Rita
abstract

2021 - Video action detection by learning graph-based spatio-temporal interactions [Journal Article]
Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita
abstract

Action Detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, a deeper focus has been added on relationship modelling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure which can connect entities from consecutive clips, thus considering long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.

2021 - Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions [Conference Proceedings Paper]
Cojocaru, Iulian; Cascianelli, Silvia; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
abstract

Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task which aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition. This type of convolution deforms the convolution kernel according to the content of the neighborhood, and can therefore be more adaptable to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that the use of deformable convolutions is a promising direction for the design of novel architectures for handwritten text recognition.

2021 - Working Memory Connections for LSTM [Journal Article]
Landi, Federico; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
abstract

2020 - 25th International Conference on Pattern Recognition [Edited Volume]
Cucchiara, R.; Bimbo, A. D.; Sclaroff, S.
abstract

2020 - A Transformer-Based Network for Dynamic Hand Gesture Recognition [Conference Proceedings Paper]
D'Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that the use of a single active depth sensor, specifically of depth maps and the surface normals estimated from them, achieves state-of-the-art results, outperforming all the methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available with common RGB-D devices, such as infrared and color data. We also assess the performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.

2020 - A Unified Cycle-Consistent Neural Model for Text and Image Retrieval [Journal Article]
Cornia, Marcella; Baraldi, Lorenzo; Tavakoli, Hamed R.; Cucchiara, Rita
abstract

2020 - Anomaly Detection for Vision-based Railway Inspection [Conference Proceedings Paper]
Gasparini, Riccardo; Pini, Stefano; Borghi, Guido; Scaglione, Giuseppe; Calderara, Simone; Fedeli, Eugenio; Cucchiara, Rita
abstract

2020 - Anomaly Detection, Localization and Classification for Railway Inspection [Conference Proceedings Paper]
Gasparini, Riccardo; D'Eusanio, Andrea; Borghi, Guido; Pini, Stefano; Scaglione, Giuseppe; Calderara, Simone; Fedeli, Eugenio; Cucchiara, Rita
abstract

2020 - Baracca: a Multimodal Dataset for Anthropometric Measurements in Automotive [Conference Proceedings Paper]
Pini, Stefano; D'Eusanio, Andrea; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

2020 - Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation [Conference Proceedings Paper]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene.

2020 - Conditional Channel Gated Networks for Task-Aware Continual Learning [Conference Proceedings Paper]
Abati, Davide; Tomczak, Jakub; Blankevoort, Tijmen; Calderara, Simone; Cucchiara, Rita; Bejnordi, Babak Ehteshami
abstract

2020 - Explaining Digital Humanities by Aligning Images and Textual Descriptions [Journal Article]
Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
abstract

Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and the Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, when realistic images and sentences with simple semantics are employed and when paired training data is provided. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only advertises more complex visual and semantic structures but also features a significant lack of training data which makes the use of fully-supervised approaches infeasible. With this aim, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision. This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.

2020 - Face-from-Depth for Head Pose Estimation on Depth Images [Journal Article]
Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

Depth cameras make it possible to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make common RGB sensors unusable. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method outperforms several recent state-of-the-art works based on both intensity and depth input data, running in real time at more than 30 frames per second.

2020 - Mercury: a vision-based framework for Driver Monitoring [Conference Proceedings Paper]
Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we propose a complete framework, namely Mercury, that combines Computer Vision and Deep Learning algorithms to continuously monitor the driver during the driving activity. The proposed solution complies with the requirements imposed by the challenging automotive context. First, light invariance is required, in order to have a system able to work regardless of the time of day and the weather conditions. Therefore, infrared-based images, i.e. depth maps (in which each pixel corresponds to the distance between the sensor and that point in the scene), have been exploited in conjunction with traditional intensity images. Second, the non-invasivity of the system is required, since the driver's movements must not be impeded during the driving activity: in this context, the use of cameras and vision-based algorithms is one of the best solutions. Finally, real-time performance is needed since a monitoring system must immediately react as soon as a situation of potential danger is detected.

2020 - Meshed-Memory Transformer for Image Captioning [Conference Proceedings Paper]
Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.

2020 - Multimodal Hand Gesture Classification for the Human-Car Interaction [Journal Article]
D'Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

2020 - SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability [Conference Proceedings Paper]
Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

2020 - Welcome message from the general chairs [Conference Proceedings Paper]
Chen, C. W.; Cucchiara, R.; Hua, X. -S.
abstract

2019 - Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation [Conference Proceedings Paper]
Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.

2019 - Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain [Conference Proceedings Paper]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
abstract

As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.

2019 - Can adversarial networks hallucinate occluded people with a plausible aspect? [Journal Article]
Fulgeri, F.; Fabbri, Matteo; Alletto, Stefano; Calderara, S.; Cucchiara, R.
abstract

When you see a person in a crowd, occluded by other persons, you miss visual information that can be used to recognize, re-identify or simply classify him or her. You can imagine their appearance given your experience, nothing more. Similarly, AI solutions can try to hallucinate missing information with specific deep learning architectures, suitably trained with people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version as input, that should be a) without occlusion, b) similar at pixel level to a completely visible person shape, and c) able to preserve similar visual attributes (e.g. male/female) of the original one. For this purpose, we propose a new approach that integrates state-of-the-art neural network architectures, namely U-nets and GANs, as well as discriminative attribute classification nets, with an architecture specifically designed to de-occlude people shapes. The network is trained to optimize a loss function which takes into account the aforementioned objectives. We also propose two datasets for testing our solution: the first one, occluded RAP, created automatically by occluding real shapes of the RAP dataset created by Li et al. (2016) (which also collects attributes of people's appearance); the second is a large synthetic dataset, AiC, generated in computer graphics with data extracted from the GTA video game, that contains 3D data of occluded objects by construction. Results are impressive and outperform any other previous proposal. This result could be an initial step towards much further research on recognizing people and their behavior in an open crowded world.

2019 - Classifying Signals on Irregular Domains via Convolutional Cluster Pooling [Conference Proceedings Paper]
Porrello, Angelo; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.

2019 - Driver Face Verification with Depth Maps [Journal Article]
Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita
abstract

Face verification is the task of checking if two provided images contain the face of the same person or not. In this work, we propose a fully-convolutional Siamese architecture to tackle this task, achieving state-of-the-art results on three publicly-released datasets, namely Pandora, High-Resolution Range-based Face Database (HRRFaceD), and CurtinFaces. The proposed method takes depth maps as the input, since depth cameras have been proven to be more reliable in different illumination conditions. Thus, the system is able to work even in the case of the total or partial absence of external light sources, which is a key feature for automotive applications. From the algorithmic point of view, we propose a fully-convolutional architecture with a limited number of parameters, capable of dealing with the small amount of depth data available for training and able to run in real time even on a CPU and embedded boards. The experimental results show acceptable accuracy to allow exploitation in real-world applications with in-board cameras. Finally, exploiting the presence of faces occluded by various head garments and extreme head poses available in the Pandora dataset, we successfully test the proposed system also during strong visual occlusions. The excellent results obtained confirm the efficacy of the proposed method.

2019 - Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters [Conference Proceedings Paper]
Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
abstract

In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination guided only by a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotate, before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visual information and the lingual description in an efficient way. Unlike some previous works that abstract from the agent perspective and use high-level navigation spaces, we design a policy which decodes the information provided by dynamic convolution into a series of low-level, agent-friendly actions. Results show that our model exploiting dynamic filters performs better than other architectures with traditional convolution, being the new state of the art for embodied VLN in the low-level action space. Additionally, we attempt to categorize recent work on VLN depending on their architectural choices and distinguish two main groups: we call them low-level actions and high-level actions models. To the best of our knowledge, we are the first to propose this analysis and categorization for VLN.

2019 - End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization [Conference Proceedings Paper]
Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita
abstract

Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object onto the image according to its predicted pose; in this way the alignment error between the observed and the re-projected object silhouette can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Finally, we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently, and novel viewpoints in which both the arrangement of the objects and their mutual pose are preserved can be rendered. Differentiable renderer code is available at: https://github.com/ndrplz/tensorflow-mesh-renderer.

2019 - Face Verification from Depth using Privileged Information [Conference Proceedings Paper]
Borghi, Guido; Pini, Stefano; Grazioli, Filippo; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, a deep Siamese architecture for depth-based face verification is presented. The proposed approach efficiently verifies whether two face images belong to the same person while handling a great variety of head poses and occlusions. The architecture, named JanusNet, consists of a combination of a depth, an RGB and a hybrid Siamese network. During the training phase, the hybrid network learns to extract complementary mid-level convolutional features which mimic the features of the RGB network, while simultaneously leveraging the light invariance of depth images. At testing time, the model, relying only on depth data, achieves state-of-the-art results and real-time performance, despite the lack of deep-learning-oriented depth-based datasets.

2019 - Hand Gestures for the Human-Car Interaction: the Briareo dataset [Conference Proceedings Paper]
Manganaro, Fabio; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Natural User Interfaces can be an effective way to reduce driver inattention while driving. To this end, in this paper we propose a new dataset, called Briareo, specifically collected for the hand gesture recognition task in the automotive context. The dataset is acquired from an innovative point of view, exploiting different kinds of cameras, i.e. RGB, infrared stereo, and depth, that provide various types of images and 3D hand joints. Moreover, the dataset contains a significant number of hand gesture samples, performed by several subjects, allowing the use of deep learning-based approaches. Finally, a framework for hand gesture segmentation and classification is presented, exploiting a method introduced to assess the quality of the proposed dataset.

2019 - Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis [Conference Proceedings Paper]
Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

State-of-the-art Computer Vision pipelines perform poorly on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic images and the real images on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results in the case of the translation of landscapes, portraits and paintings coming from four different styles using automatic distance metrics. We also analyze the response of pre-trained architectures for classification, detection and segmentation, both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we also provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.

2019 - Latent Space Autoregression for Novelty Detection [Conference Proceedings Paper]
Abati, Davide; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita
abstract

Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity. Despite its importance in different application settings, designing a novelty detector is utterly complex due to the unpredictable nature of novelties and their inaccessibility during the training procedure, factors which expose the unsupervised nature of the problem. In our proposal, we design a general framework where we equip a deep autoencoder with a parametric density estimator that learns the probability distribution underlying its latent representations through an autoregressive procedure. We show that a maximum likelihood objective, optimized in conjunction with the reconstruction of normal samples, effectively acts as a regularizer for the task at hand, by minimizing the differential entropy of the distribution spanned by latent vectors. In addition to providing a very general formulation, extensive experiments with our model on publicly available datasets deliver on-par or superior performance compared to state-of-the-art methods in one-class and video anomaly detection settings. Unlike prior works, our proposal does not make any assumption about the nature of the novelties, making our work readily applicable to diverse contexts.

2019 - M-VAD Names: a Dataset for Video Captioning with Naming [Journal Article]
Pini, Stefano; Cornia, Marcella; Bolelli, Federico; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Current movie captioning architectures are not capable of mentioning characters by their proper names, replacing them with a generic "someone" tag. The lack of movie description datasets with visual annotations of characters surely plays a relevant role in this shortcoming. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.

2019 - Manual Annotations on Depth Maps for Human Pose Estimation [Conference Proceedings Paper]
D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Few works tackle Human Pose Estimation on depth maps. Moreover, these methods usually rely on automatically annotated datasets, whose annotations are often imprecise and unreliable, limiting the accuracy achievable when this data is used as ground truth. For this reason, in this paper we propose a refinement tool for human pose annotations, expressed as body joints, and a novel set of fine joint annotations for the Watch-n-Patch dataset, collected with the proposed tool. Furthermore, we present a fully-convolutional architecture that performs body pose estimation directly on depth maps. The extensive evaluation shows that the proposed architecture outperforms the competitors in different training scenarios and is able to run in real time.

2019 - Predicting the Driver's Focus of Attention: the DR(eye)VE Project [Journal Article]
Palazzi, Andrea; Abati, Davide; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

Submitted on 10 May 2017 (v1); last revised 6 Jun 2018 (v3). In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from a roof-mounted camera), further enriched by other sensor measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.

2019 - Recognizing social relationships from an egocentric vision perspective [Book Chapter/Essay]
Alletto, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, and their head pose and 3D location are estimated. Following the formalism of the f-formation, we use orientation and distance to define an inherently social pairwise feature capable of describing how two people stand in relation to one another. We present a Structural SVM based approach that learns how to weight each component of the feature vector depending on the social situation it is applied to. To better understand the social dynamics, we also estimate what we call the social relevance of each subject in a group using a saliency attentive model. Extensive tests on two publicly available datasets show that our solution achieves encouraging results when detecting social groups and their relevant subjects in challenging egocentric scenarios.
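
To make the idea of an orientation-and-distance pairwise feature concrete, here is a minimal sketch. It assumes 2D ground-plane positions and head yaw angles in radians; the three-component layout and the function name are illustrative choices, not the feature definition used in the chapter.

```python
import math

def pairwise_feature(pos_a, yaw_a, pos_b, yaw_b):
    """Return (distance, angle_a, angle_b) for two people on the ground plane.

    angle_a is the absolute angle between A's gaze direction and the
    direction from A to B (0 means A looks straight at B); likewise angle_b.
    """
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    distance = math.hypot(dx, dy)
    bearing_ab = math.atan2(dy, dx)      # direction from A to B
    bearing_ba = math.atan2(-dy, -dx)    # direction from B to A

    def angular_gap(yaw, bearing):
        # Wrap the difference into [-pi, pi) before taking the magnitude.
        diff = (yaw - bearing + math.pi) % (2 * math.pi) - math.pi
        return abs(diff)

    return distance, angular_gap(yaw_a, bearing_ab), angular_gap(yaw_b, bearing_ba)
```

Two people standing close and facing each other yield a small distance and two angles near zero, while people standing back to back yield angles near pi. A learned weighting of these components could then separate the two situations.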

2019 - SHREC 2019 Track: Online Gesture Recognition [Conference Proceedings Paper]
Caputo, F. M.; Burato, S.; Pavan, G.; Voillemin, T.; Wannous, H.; Vandeborre, J. P.; Maghoumi, M.; Taranta, E. M.; Razmjoo, A.; LaViola Jr., J. J.; Manganaro, Fabio; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.; Nguyen, H.; Tran, M. T.; Giachetti, A.
abstract

2019 - Self-Supervised Optical Flow Estimation by Projective Bootstrap [Journal Article]
Alletto, Stefano; Abati, Davide; Calderara, Simone; Cucchiara, Rita; Rigazio, Luca
abstract

Dense optical flow estimation is complex and time consuming, with state-of-the-art methods relying either on large synthetic data sets or on pipelines requiring up to a few minutes per frame pair. In this paper, we address the problem of optical flow estimation in the automotive scenario in a self-supervised manner. We argue that optical flow can be cast as a geometrical warping between two successive video frames and devise a deep architecture to estimate such a transformation in two stages. First, a dense pixel-level flow is computed with a projective bootstrap on rigid surfaces. We show how such a global transformation can be approximated with a homography and extend spatial transformer layers so that they can be employed to compute the flow field implied by the transformation. Subsequently, we refine the prediction by feeding it to a second, deeper network that accounts for moving objects. A final reconstruction loss compares the warping of frame Xₜ with the subsequent frame Xₜ₊₁ and guides both estimates. The model has the speed advantages of end-to-end deep architectures while achieving competitive performance, both outperforming recent unsupervised methods and showing good generalization capabilities on new automotive data sets.
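
The flow field implied by a homography can be written down directly: each pixel is mapped through the 3x3 matrix and the displacement is the difference between the mapped and the original coordinates. The sketch below shows this computation in NumPy; it illustrates the geometry only and is not the spatial transformer extension described in the abstract. The function name and the (dx, dy) output layout are assumptions.

```python
import numpy as np

def homography_flow(H: np.ndarray, height: int, width: int) -> np.ndarray:
    """Return a (height, width, 2) array of (dx, dy) displacements induced by H."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    ones = np.ones_like(xs)
    pts = np.stack([xs, ys, ones], axis=-1)          # homogeneous pixel coordinates
    warped = pts @ H.T                               # apply H to every pixel
    warped_xy = warped[..., :2] / warped[..., 2:3]   # back to Cartesian coordinates
    return warped_xy - pts[..., :2]
```

With the identity matrix the flow is zero everywhere, and with a pure translation every pixel receives the same displacement, which gives two quick sanity checks for the function.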

2019 - Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions [Conference Proceedings Paper]
Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Current captioning approaches can describe images using black-box architectures whose behavior is hard to control and explain from the outside. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.

2019 - Towards Cycle-Consistent Models for Text and Image Retrieval [Conference Proceedings Paper]
Cornia, Marcella; Baraldi, Lorenzo; Rezazadegan Tavakoli, Hamed; Cucchiara, Rita
abstract

Cross-modal retrieval has recently become a research hot spot, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space into which text and images can be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and which regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.
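
A cycle-consistency criterion penalizes the difference between a feature and its reconstruction after a round trip through the other domain. The sketch below shows the idea with two linear translators; the linear form, the matrix names `W_t2i` and `W_i2t`, and the use of mean squared error are assumptions made for illustration, not the model of the paper.

```python
import numpy as np

def cycle_consistency_loss(text_feats, image_feats, W_t2i, W_i2t):
    """Mean squared round-trip error, summed over both translation directions."""
    text_cycle = (text_feats @ W_t2i) @ W_i2t      # text -> image space -> text
    image_cycle = (image_feats @ W_i2t) @ W_t2i    # image -> text space -> image
    loss_text = np.mean((text_cycle - text_feats) ** 2)
    loss_image = np.mean((image_cycle - image_feats) ** 2)
    return float(loss_text + loss_image)
```

The loss vanishes when the two translators are exact inverses of each other, and grows as the round trip drifts away from the starting features.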

2019 - Video synthesis from Intensity and Event Frames [Conference Proceedings Paper]
Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Event cameras, neuromorphic devices that naturally respond to brightness changes, have multiple advantages with respect to traditional cameras. However, the difficulty of applying traditional computer vision algorithms to event data limits their usability. Therefore, in this paper we investigate the use of a deep learning-based architecture that combines an initial grayscale frame and a series of event data to estimate the following intensity frames. In particular, a fully-convolutional encoder-decoder network is employed and evaluated for the frame synthesis task on an automotive event-based dataset. Performance obtained with pixel-wise metrics confirms the quality of the images synthesized by the proposed architecture.

2019 - Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach [Conference Proceedings Paper]
Carraggi, Angelo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected into a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performance in fully supervised settings, its application to semi-supervised scenarios has rarely been investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred to a target dataset in which the pairing between images and sentences is not known, or not useful for training due to the limited size of the set. Experiments are performed on two unsupervised target scenarios, related to the fashion and cultural heritage domains respectively. Results show that our model is able to effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.

2019 - What was Monet seeing while painting? Translating artworks to photo-realistic images [Conference Proceedings Paper]
Tomei, Matteo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
abstract

State-of-the-art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This leads to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images to realistic copies. In this paper, we present a model capable of translating paintings to photo-realistic images, trained without paired examples. The idea is to enforce a patch-level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on Monet, Cezanne and Van Gogh painting translation tasks, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.

2018 - Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities [Conference Proceedings Paper]
Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita
abstract

While several approaches to bringing vision and language together are emerging, none of them has yet addressed the digital humanities domain, which is nevertheless a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts from the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and outline future work along the same line.

2018 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Journal Article]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, which outperform all earlier approaches on public datasets. In this paper, we discuss the effectiveness of convolutional neural network (CNN) models in saliency prediction. We present a set of deep learning architectures that we developed, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We show how these deep networks closely recall the early saliency models, although improved with the semantics learned from the human ground truth. Finally, we present a use case in which saliency prediction is used to improve the automatic description of images.

2018 - Automatic Image Cropping and Selection using Saliency: an Application to Historical Manuscripts [Conference Proceedings Paper]
Cornia, Marcella; Pini, Stefano; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Automatic image cropping techniques are particularly important to improve the visual quality of cropped images and can be applied to a wide range of applications such as photo-editing, image compression, and thumbnail selection. In this paper, we propose a saliency-based image cropping method which produces meaningful crops by relying only on the corresponding saliency maps. Experiments on standard image cropping datasets demonstrate the benefit of the proposed solution with respect to other cropping methods. Moreover, we present an image selection method that can be effectively applied to automatically select the most representative pages of historical manuscripts, thus improving the navigation of historical digital libraries.
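
One simple way to derive a crop from a saliency map alone is to threshold the map and take the tightest box around the pixels that survive. The sketch below does exactly that; the relative threshold and the function name are assumptions, and this is a generic baseline rather than the cropping method proposed in the paper.

```python
import numpy as np

def saliency_crop_box(saliency: np.ndarray, rel_threshold: float = 0.5):
    """Return (top, left, bottom, right), the tightest box around salient pixels.

    A pixel counts as salient when its value is at least rel_threshold times
    the map's maximum. Bottom and right are exclusive, so the crop is
    image[top:bottom, left:right].
    """
    mask = saliency >= rel_threshold * saliency.max()
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing salient pixels
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing salient pixels
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1
```

Because the pixel holding the maximum always passes the threshold for non-negative maps, the box is never empty; a uniform map simply yields the whole image.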

2018 - “Objective” intergroup nonverbal behavior: a replication of the study by Dovidio, Kawakami and Gaertner (2002) [Abstract in Conference Proceedings]
Di Bernardo, Gian Antonio; Vezzali, Loris; Giovannini, Dino; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia; Cocco, Veronica Margherita
abstract

A long research tradition has examined nonverbal behavior, including in intergroup relations. These studies usually rely on ratings by external coders, which are nonetheless subjective and open to bias. We conducted a study modeled on the well-known study by Dovidio, Kawakami and Gaertner (2002), with some modifications, considering the relationship between White and Black people. White participants, after completing measures of explicit and implicit prejudice, met (in counterbalanced order) a White confederate and a Black confederate. With each of them, they talked for three minutes about a neutral topic and about a topic salient to the group distinction (in counterbalanced order). These interactions were recorded with a Kinect camera, which can capture the three-dimensional component of movement. The results revealed several points of interest. First, objective indices were created, starting from an analysis of the literature, some of which cannot be detected by external coders, such as interpersonal distance and the volume of space between people. The results highlighted some relevant aspects: (1) implicit attitude is associated with various indices of nonverbal behavior, which mediate the effect on the evaluations of the participants given by the confederates; (2) interactions should be considered dynamically, taking into account that they develop over time; (3) what may matter is overall nonverbal behavior, rather than a few specific indices pre-determined by the experimenters.

2018 - Deep Head Pose Estimation from Depth Data for In-car Automotive Applications [Conference Proceedings Paper]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Recently, deep learning approaches have achieved promising results in various fields of computer vision. In this paper, we tackle the problem of head pose estimation through a Convolutional Neural Network (CNN). Unlike other proposals in the literature, the described system works directly on raw depth data alone. Moreover, head pose estimation is solved as a regression problem and does not rely on visual facial features like facial landmarks. We tested our system on a well-known public dataset, Biwi Kinect Head Pose, showing that our approach achieves state-of-the-art results and is able to meet real-time performance requirements.

2018 - Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face [Conference Proceedings Paper]
Fabbri, Matteo; Borghi, Guido; Lanzi, Fabio; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

Can faces acquired by low-cost depth sensors be useful to see some characteristic details of the faces? Typically the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features of unknown individual faces, it is capable of reconstructing plausible faces; their appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with some Perceptual Probes, such as face aspect classification or landmark detection. Depth faces can be used in place of the corresponding RGB images, which are often unavailable because of darkness or difficult lighting conditions. Experimental results are very promising and are far better than previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications.

2018 - Fully Convolutional Network for Head Detection with Depth Images [Conference Proceedings Paper]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Head detection and localization are among the most investigated and demanding tasks of the Computer Vision community. They are also a key element for many disciplines, like Human Computer Interaction, Human Behavior Understanding, Face Analysis and Video Surveillance. In recent decades, much effort has been devoted to developing accurate and reliable head or face detectors on standard RGB images, but only a few solutions concern other types of images, such as depth maps. In this paper, we propose a novel method for head detection on depth images, based on a deep learning approach. In particular, the presented system overcomes the classic sliding-window approach, which is often the main computational bottleneck of many object detectors, through a Fully Convolutional Network. Two public datasets, namely Pandora and Watch-n-Patch, are exploited to train and test the proposed network. Experimental results confirm the effectiveness of the method, which outperforms all the state-of-the-art works based on depth images and runs in real time.

2018 - Guest editorial: Special section on “multimedia understanding via multimodal analytics” [Journal Article]
Yan, Y.; Nie, L.; Cucchiara, R.
abstract

2018 - Hands on the wheel: a Dataset for Driver Hand Detection and Tracking [Conference Proceedings Paper]
Borghi, Guido; Frigieri, Elia; Vezzani, Roberto; Cucchiara, Rita
abstract

The ability to detect, localize and track the hands is crucial in many applications requiring an understanding of a person's behavior, attitude and interactions. In particular, this is true for the automotive context, in which hand analysis makes it possible to predict preparatory movements for maneuvers or to investigate the driver’s attention level. Moreover, due to the recent diffusion of cameras inside new car cockpits, it is feasible to use hand gestures to develop new Human-Car Interaction systems that are more user-friendly and safer. In this paper, we propose a new dataset, called Turms, that consists of infrared images of the driver’s hands, collected from the back of the steering wheel, an innovative point of view. The Leap Motion device has been selected for the recordings, thanks to its stereo capabilities and wide viewing angle. In addition, we introduce a method to detect the presence and the location of the driver’s hands on the steering wheel during driving tasks.

2018 - Head Detection with Depth Images in the Wild [Conference Proceedings Paper]
Ballotta, Diego; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

Head detection and localization is a demanding task and a key element for many computer vision applications, like video surveillance, Human Computer Interaction and face analysis. The stunning amount of work done on detecting faces in RGB images, together with the availability of huge face datasets, has made it possible to set up very effective systems in that domain. However, due to illumination issues, infrared or depth cameras may be required in real applications. In this paper, we introduce a novel method for head detection on depth images that exploits the classification ability of deep learning approaches. In addition to reducing the dependency on external illumination, depth images implicitly embed useful information to deal with the scale of the target objects. Two public datasets have been exploited: the first one, called Pandora, is used to train a deep binary classifier with face and non-face images. The second one, collected by Cornell University, is used to perform a cross-dataset test during daily activities in unconstrained environments. Experimental results show that the proposed method outperforms state-of-the-art methods working on depth images.

2018 - LAMV: Learning to align and match videos with kernelized temporal layers [Conference Proceedings Paper]
Baraldi, Lorenzo; Douze, Matthijs; Cucchiara, Rita; Jégou, Hervé
abstract

This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.
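
To illustrate what finding an alignment by maximizing a score over offsets means, the sketch below scores every temporal offset between two sequences of frame descriptors with a plain cross-correlation computed through the FFT. This is a fixed, non-learned baseline written under our own assumptions (function name, zero padding, dot-product similarity); it is not the kernelized temporal layer of the paper.

```python
import numpy as np

def best_temporal_offset(x: np.ndarray, y: np.ndarray) -> int:
    """Return the offset d maximizing sum_t <x[t], y[t + d]>.

    x has shape (T1, dim) and y has shape (T2, dim); a positive d means the
    content of x appears d frames later in y.
    """
    n = x.shape[0] + y.shape[0] - 1                 # pad to avoid wrap-around
    X = np.fft.rfft(x, n=n, axis=0)
    Y = np.fft.rfft(y, n=n, axis=0)
    # Circular cross-correlation per dimension, summed over dimensions.
    scores = np.fft.irfft(np.conj(X) * Y, n=n, axis=0).sum(axis=1)
    best = int(np.argmax(scores))
    # Indices past T2 - 1 encode negative offsets.
    return best if best < y.shape[0] else best - n
```

Working in the Fourier domain evaluates all offsets at once instead of looping over them, which is the practical reason such scores are often computed this way.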

2018 - Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World [Conference Proceedings Paper]
Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Palazzi, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for non-visible joints. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the largest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is to date the largest dataset (about 500,000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data also exhibits good generalization capabilities on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.
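
Heatmap branches such as the ones listed above are typically read out by locating peaks. The sketch below decodes a single peak per joint channel; the function name, the score threshold and the single-person assumption are ours. A multi-person system would instead extract several local maxima per channel and group them, for example with the part affinity fields the abstract mentions.

```python
import numpy as np

def decode_joints(heatmaps: np.ndarray, min_score: float = 0.1):
    """Return one (x, y, score) tuple per joint, or None when below min_score.

    heatmaps has shape (num_joints, height, width).
    """
    joints = []
    for hm in heatmaps:
        # Position of the strongest response in this joint's heatmap.
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        score = float(hm[y, x])
        joints.append((int(x), int(y), score) if score >= min_score else None)
    return joints
```

Returning None for weak responses keeps undetected joints explicit instead of reporting an arbitrary location.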

2018 - Learning to Generate Facial Depth Maps [Relazione in Atti di Convegno]
Pini, Stefano; Grazioli, Filippo; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, an adversarial architecture for facial depth map estimation from monocular intensity images is presented. By following an image-to-image approach, we combine the advantages of supervised learning and adversarial training, proposing a conditional Generative Adversarial Network that effectively learns to translate intensity face images into the corresponding depth maps. Two public datasets, namely Biwi database and Pandora dataset, are exploited to demonstrate that the proposed model generates high-quality synthetic depth images, both in terms of visual appearance and informative content. Furthermore, we show that the model is capable of predicting distinctive facial details by testing the generated depth maps through a deep model trained on authentic depth maps for the face verification task.

2018 - Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention [Articolo su rivista]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Although saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, no model has yet succeeded in effectively incorporating these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We demonstrate, through extensive quantitative and qualitative experiments on large scale datasets, that our model achieves superior performance with respect to different image captioning baselines with and without saliency. Finally, we also show that the trained model can focus on salient and contextual regions during the generation of the caption in an appropriate way.

2018 - Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model [Articolo su rivista]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations. In this paper we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a Convolutional LSTM that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. Additionally, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state of the art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness on different scenarios.
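
A center-bias prior of the kind mentioned above can be pictured as a 2D Gaussian evaluated over the image grid. The sketch below only evaluates such a map for fixed parameters. In the model described in the abstract the priors are learned, and how they are parametrized and combined with the network output is not stated here, so the function name, the normalized coordinates and the parameter values are all illustrative assumptions.

```python
import math


def gaussian_prior(height, width, mu_y, mu_x, sigma_y, sigma_x):
    """Evaluate a 2D Gaussian on a height x width grid.

    Coordinates are normalized to [0, 1], so (0.5, 0.5) is the image center.
    Requires height >= 2 and width >= 2. Parameters are fixed here; in a
    trainable model they would be learned (assumption for illustration).
    """
    prior = []
    for i in range(height):
        y = i / (height - 1)
        row = []
        for j in range(width):
            x = j / (width - 1)
            exponent = ((y - mu_y) ** 2) / (2 * sigma_y ** 2) \
                + ((x - mu_x) ** 2) / (2 * sigma_x ** 2)
            row.append(math.exp(-exponent))
        prior.append(row)
    return prior


# A prior centered on the image, wider horizontally than vertically.
center_prior = gaussian_prior(5, 5, 0.5, 0.5, 0.2, 0.3)
```

The map peaks at the center and decays towards the borders, which is the shape a center bias in human fixations would suggest. Several such maps with different means and variances could be stacked as extra input channels.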

2018 - SAM: Pushing the Limits of Saliency Prediction Models [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

The prediction of human eye fixations has been recently gaining a lot of attention thanks to the improvements shown by deep architectures. In our work, we go beyond classical feed-forward networks to predict saliency maps and propose a Saliency Attentive Model which incorporates neural attention mechanisms to iteratively refine predictions. Experiments demonstrate that the proposed strategy outperforms the state of the art by a considerable margin on the largest dataset available for saliency prediction. Here, we provide experimental results on other popular saliency datasets to confirm the effectiveness and the generalization capabilities of our model, which enable us to reach the state of the art on all considered datasets.

2018 - Unsupervised vehicle re-identification using triplet networks [Relazione in Atti di Convegno]
Marin-Reyes, P. A.; Bergamini, L.; Lorenzo-Navarro, J.; Palazzi, A.; Calderara, S.; Cucchiara, R.
abstract

Vehicle re-identification plays a major role in modern smart surveillance systems. Specifically, the task requires the capability to predict the identity of a given vehicle, given a dataset of known associations, collected from different views and surveillance cameras. Generally, it can be cast as a ranking problem: given a probe image of a vehicle, the model needs to rank all database images based on their similarities w.r.t. the probe image. In line with recent research, we devise a metric learning model that employs a supervision based on local constraints. In particular, we leverage pairwise and triplet constraints for training a network capable of assigning a high degree of similarity to samples sharing the same identity, while keeping different identities distant in feature space. Finally, we show how vehicle tracking can be exploited to automatically generate a weakly labelled dataset that can be used to train the deep network for the task of vehicle re-identification. Learning and evaluation are carried out on the NVIDIA AI city challenge videos.
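
The two ingredients named above, a triplet constraint and ranking by similarity to a probe, can be sketched in a few lines. This is a generic hinge-style triplet loss on hand-written 2D embeddings, not the network or the exact loss used in the paper. The margin value, the function names and the toy gallery are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (margin is an assumed value)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank_gallery(probe, gallery):
    """Return gallery identifiers sorted from most to least similar to the probe."""
    return sorted(gallery, key=lambda name: euclidean(probe, gallery[name]))


# Hypothetical embeddings of three gallery vehicles and one probe image.
gallery = {"car_a": [0.0, 0.0], "car_b": [3.0, 4.0], "car_c": [1.0, 0.0]}
probe = [0.1, 0.0]
ranking = rank_gallery(probe, gallery)

# A satisfied triplet gives zero loss, a violated one gives a positive loss.
loss_easy = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 4.0])
loss_hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [1.0, 0.0])
```

In this toy setup the probe ranks `car_a` first, and only the violated triplet contributes to the loss. In a training loop the embeddings would come from the network and the loss would be averaged over many triplets.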

2018 - Using Kinect camera for investigating intergroup non-verbal human interactions [Abstract in Atti di Convegno]
Vezzali, Loris; Di Bernardo, Gian Antonio; Cadamuro, Alessia; Cocco, Veronica Margherita; Crapolicchio, Eleonora; Bicocchi, Nicola; Calderara, Simone; Giovannini, Dino; Zambonelli, Franco; Cucchiara, Rita
abstract

A long tradition in social psychology focused on nonverbal behaviour displayed during dyadic interactions, generally relying on evaluations from external coders. However, in addition to the fact that external coders may be biased, they may not capture certain types of behavioural indices. We designed three studies examining explicit and implicit prejudice as predictors of nonverbal behaviour as reflected in objective indices provided by Kinect cameras. In the first study, we considered White-Black relations from the perspective of 36 White participants. Results revealed that implicit prejudice was associated with a reduction in interpersonal distance and in the volume of space between Whites and Blacks (vs. Whites and Whites), which in turn were associated with evaluations by collaborators taking part in the interaction. In the second study, 37 non-HIV participants interacted with HIV individuals. We found that implicit prejudice was associated with reduced volume of space between interactants over time (a process of bias overcorrection) only when they tried hard to control their behaviour (as captured by a Stroop test). In the third study, 35 non-disabled children interacted with disabled children. Results revealed that implicit prejudice was associated with reduced interpersonal distance over time.

2017 - A Video Library System Using Scene Detection and Automatic Tagging [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

We present a novel video browsing and retrieval system for edited videos, in which videos are automatically decomposed into meaningful and storytelling parts (i.e. scenes) and tagged according to their transcript. The system relies on a Triplet Deep Neural Network which exploits multimodal features, and has been implemented as a set of extensions to the eXo Platform Enterprise Content Management System (ECMS). This set of extensions enables the interactive visualization of a video, its automatic and semi-automatic annotation, as well as a keyword-based search inside the video collection. The platform also allows a natural integration with third-party add-ons, so that automatic annotations can be exploited outside the proposed platform.

2017 - A new era in the study of intergroup nonverbal behaviour: Studying intergroup dyadic interactions “online” [Abstract in Atti di Convegno]
DI BERNARDO, GIAN ANTONIO; Vezzali, Loris; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia
abstract

We examined predictors and consequences of intergroup nonverbal behaviour by relying on new technologies and new objective indices. In three studies, both in the laboratory and in the field with children, behaviour was a function of implicit prejudice.

2017 - Affective level design for a role-playing videogame evaluated by a brain–computer interface and machine learning methods [Articolo su rivista]
Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
abstract

Game science has become a research field that attracts industry attention thanks to a rich worldwide market. To understand the player experience, concepts like flow or boredom mental states require formalization and empirical investigation, taking advantage of the objective data that psychophysiological methods like electroencephalography (EEG) can provide. This work studies affective ludology and presents two different game levels for Neverwinter Nights 2 developed with the aim of manipulating emotions; two sets of affective design guidelines are presented, with a rigorous formalization that considers the characteristics of the role-playing genre and its specific gameplay. An empirical investigation with a brain–computer interface headset has been conducted: by extracting numerical data features, machine learning techniques classify the different activities of the gaming sessions (tasks and events) to verify whether their design differentiation coincides with the affective one. The observed results, also supported by subjective questionnaire data, confirm the validity of the proposed guidelines, suggesting that this evaluation methodology could be extended to other evaluation tasks.

2017 - Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era [Relazione in Atti di Convegno]
Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, which outperform any early approach when compared on public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which can extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amounts of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned from human ground truth.

2017 - Editorial Message from the Program Chairs [Relazione in Atti di Convegno]
Cucchiara, R.; Matsushita, Y.; Sebe, N.; Soatto, S.
abstract

2017 - Embedded Recurrent Network for Head Pose Estimation in Car [Relazione in Atti di Convegno]
Borghi, Guido; Gasparini, Riccardo; Vezzani, Roberto; Cucchiara, Rita
abstract

Accurate and fast estimation of the driver's head pose is a rich source of information, in particular in the automotive context. Head pose is a key element for driver's behavior investigation, pose analysis, attention monitoring and also a useful component to improve the efficacy of Human-Car Interaction systems. In this paper, a Recurrent Neural Network is exploited to tackle the problem of driver head pose estimation, working directly and only on depth images to be more reliable in the presence of varying or insufficient illumination. Experimental results, obtained from two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP Database, demonstrate the efficacy of the proposed method, which outperforms state-of-the-art works. Besides, the entire system is implemented and tested on two embedded boards with real time performance.

2017 - Fast and Accurate Facial Landmark Localization in Depth Images for In-car Applications [Relazione in Atti di Convegno]
Frigieri, Elia; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

A correct and reliable localization of facial landmarks enables several applications in many fields, ranging from Human Computer Interaction to video surveillance. For instance, it can provide a valuable input to monitor the driver's physical state and attention level in the automotive context. In this paper, we tackle the problem of facial landmark localization through a deep approach. The developed system runs in real time and, in particular, is more reliable than state-of-the-art competitors, especially in the presence of light changes and poor illumination, thanks to the use of depth images as input. We also collected and shared a new realistic dataset inside a car, called MotorMark, to train and test the system. In addition, we exploited the public Eurecom Kinect Face Dataset for the evaluation phase, achieving promising results both in terms of accuracy and computational speed.

2017 - From Depth Data to Head Pose Estimation: a Siamese approach [Relazione in Atti di Convegno]
Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

The correct estimation of the head pose is a problem of great importance for many applications. For instance, it is an enabling technology in automotive for driver attention monitoring. In this paper, we tackle the pose estimation problem through a deep learning network working in a regression manner. Traditional methods usually rely on visual facial features, such as facial landmarks or nose tip position. In contrast, we exploit a Convolutional Neural Network (CNN) to perform head pose estimation directly from depth data. We exploit a Siamese architecture and we propose a novel loss function to improve the learning of the regression network layer. The system has been tested on two public datasets, Biwi Kinect Head Pose and ICT-3DHP database. The reported results demonstrate the improvement in accuracy with respect to current state-of-the-art approaches and the real time capabilities of the overall framework.

2017 - From Groups to Leaders and Back. Exploring Mutual Predictability Between Social Groups and Their Leaders [Capitolo/Saggio]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Recently, social theories and empirical observations identified small groups and leaders as the basic elements which shape a crowd. This leads to an intermediate level of abstraction that is placed between the crowd as a flow of people, and the crowd as a collection of individuals. Consequently, automatic analysis of crowds in computer vision is also experiencing a shift in focus from individuals to groups and from small groups to their leaders. In this chapter, we present state-of-the-art solutions to the groups and leaders detection problem, which are able to account for physical factors as well as for sociological evidence observed over short time windows. The presented algorithms are framed as structured learning problems over the set of individual trajectories. However, the way trajectories are exploited to predict the structure of the crowd is not fixed but rather learned from recorded and annotated data, enabling the method to adapt these concepts to different scenarios, densities, cultures, and other unobservable complexities. Additionally, we investigate the relation between leaders and their groups and propose the first attempt to exploit leadership as prior knowledge for group detection.

2017 - Generative Adversarial Models for People Attribute Recognition in Surveillance [Relazione in Atti di Convegno]
Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we propose a deep architecture for detecting people attributes (e.g. gender, race, clothing ...) in surveillance contexts. Our proposal explicitly deals with poor resolution and occlusion issues that often occur in surveillance footage by enhancing the images by means of Deep Convolutional Generative Adversarial Networks (DCGAN). Experiments show that by combining both our Generative Reconstruction and Deep Attribute Classification Network we can effectively extract attributes even when resolution is poor and in the presence of strong occlusions up to 80% of the whole person figure.

2017 - Guest Editorial Special Issue on Wearable and Ego-Vision Systems for Augmented Experience [Articolo su rivista]
Serra, G.; Cucchiara, R.; Kitani, K. M.; Civera, J.
abstract

2017 - Hierarchical Boundary-Aware Neural Encoder for Video Captioning [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell, which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state of the art results on movie description datasets.

2017 - Layout analysis and content classification in digitized books [Relazione in Atti di Convegno]
Corbelli, Andrea; Baraldi, Lorenzo; Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
abstract

Automatic layout analysis has proven to be extremely important in the process of digitization of large amounts of documents. In this paper we present a mixed approach to layout analysis, introducing an SVM-aided layout segmentation process and a classification process based on local and geometrical features. The final output of the automatic analysis algorithm is a complete and structured annotation in JSON format, which contains the digitized text as well as all the references to the illustrations of the input page, and which can be used by visualization interfaces as well as annotation interfaces. We evaluate our algorithm on a large dataset built upon the first volume of the “Enciclopedia Treccani”.

2017 - Learning Where to Attend Like a Human Driver [Relazione in Atti di Convegno]
Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita
abstract

Despite the advent of autonomous cars, it's likely - at least in the near future - that human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task. In this paper we study the dynamics of the driver's gaze and use it as a proxy to understand related attentional mechanisms. First, we build our analysis upon two questions: where and what is the driver looking at? Second, we model the driver's gaze by training a coarse-to-fine convolutional network on short sequences extracted from the DR(eye)VE dataset. Experimental comparison against different baselines reveals that the driver's gaze can indeed be learnt to some extent, despite i) being highly subjective and ii) having only one driver's gaze available for each sequence due to the irreproducibility of the scene. Finally, we advocate for a new assisted driving paradigm which suggests to the driver, with no intervention, where she should focus her attention.

2017 - Learning to Map Vehicles into Bird's Eye View [Relazione in Atti di Convegno]
Palazzi, Andrea; Borghi, Guido; Abati, Davide; Calderara, Simone; Cucchiara, Rita
abstract

Awareness of the road scene is an essential component for both autonomous vehicles and Advanced Driver Assistance Systems and is gaining importance both for academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M couples of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that it is able to generalize on real-world data despite having been trained solely on synthetic ones.

2017 - Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild [Relazione in Atti di Convegno]
Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit
abstract

In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities in one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data.

2017 - NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

In recent years video has been swamping the Internet: websites, social networks, and business multimedia systems are adopting video as the most important form of communication and information. Videos are normally accessed as a whole and are not indexed by their visual content. Thus, they are often uploaded as short, manually cut clips with user-provided annotations, keywords and tags for retrieval. In this paper, we propose a prototype multimedia system which addresses these two limitations: it overcomes the need for human intervention in the video setting, thanks to fully deep learning-based solutions, and decomposes the storytelling structure of the video into coherent parts. These parts can be shots, key-frames, scenes and semantically related stories, and are exploited to provide an automatic annotation of the visual content, so that parts of video can be easily retrieved. This also allows a principled re-use of the video itself: users of the platform can indeed produce new storytelling by means of multi-modal presentations, add text and other media, and propose a different visual organization of the content. We present the overall solution, and some experiments on the re-use capability of our platform in edutainment by conducting an extensive user evaluation.

2017 - POSEidon: Face-from-Depth for Driver Pose Estimation [Relazione in Atti di Convegno]
Borghi, Guido; Venturelli, Marco; Vezzani, Roberto; Cucchiara, Rita
abstract

Fast and accurate upper-body and head pose estimation is a key task for automatic monitoring of driver attention, a challenging context characterized by severe illumination changes, occlusions and extreme poses. In this work, we present a new deep learning framework for head localization and pose estimation on depth images. The core of the proposal is a regression neural network, called POSEidon, which is composed of three independent convolutional nets followed by a fusion layer, specially conceived for understanding the pose by depth. In addition, to recover the intrinsic value of face appearance for understanding head position and orientation, we propose a new Face-from-Depth approach for learning image faces from depth. Results in face reconstruction are qualitatively impressive. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Results show that our method outperforms all recent state-of-the-art works, running in real time at more than 30 frames per second.

2017 - Personalized Egocentric Video Summarization of Cultural Tour on User Preferences Input [Articolo su rivista]
Varini, P.; Serra, G.; Cucchiara, R.
abstract

In this paper, we propose a new method for customized summarization of egocentric videos according to specific user preferences, so that different users can extract different summaries from the same stream. Our approach, tailored to a cultural heritage scenario, relies on creating a short synopsis of the original video focused on key shots, in which concepts relevant to user preferences can be visually detected and the chronological flow of the original video is preserved. Moreover, we release a new dataset, composed of egocentric streams taken in uncontrolled scenarios, capturing tourists' cultural visits in six art cities, with geolocalization information. Our experimental results show that the proposed approach is able to leverage user's preferences with an accent on storyline chronological flow and on visual smoothness.

2017 - Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks [Articolo su rivista]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: 1) semantic feature extraction which builds video specific concept detectors; 2) multimodal feature embedding learning, that maps the feature vector of a shot to a space in which the Euclidean distance has task specific semantic properties. The proposed method is able to decompose the video in annotated temporal segments which allow for a query specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted and a strategy to overcome the problem is suggested.

2017 - Segmentation models diversity for object proposals [Articolo su rivista]
Manfredi, Marco; Grana, Costantino; Cucchiara, Rita; Smeulders, Arnold W. M.
abstract

In this paper we present a segmentation proposal method which employs a box-hypotheses generation step followed by a lightweight segmentation strategy. Inspired by interactive segmentation, for each automatically placed bounding-box we compute a precise segmentation mask. We introduce diversity in segmentation strategies enhancing a generic model performance exploiting class-independent regional appearance features. Foreground probability scores are learned from groups of objects with peculiar characteristics to specialize segmentation models. We demonstrate results comparable to the state-of-the-art on PASCAL VOC 2012 and a further improvement by merging our proposals with those of a recent solution. The ability to generalize to unseen object categories is demonstrated on Microsoft COCO 2014.

2017 - Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach [Relazione in Atti di Convegno]
Pini, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
abstract

Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: firstly, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Moreover, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.

2017 - Tracking social groups within and across cameras [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita
abstract

We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the problem as a supervised clustering problem where a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set on the problem it comes with the proposal of a suitable evaluation measure. Results of adopting learning for the task are encouraging, scoring a +15% improvement in F1 measure over a non-learning based clustering baseline. To our knowledge this is the first proposal of this kind dealing with multi-camera group tracking.

2017 - Video registration in egocentric vision under day and night illumination changes [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

With the spread of wearable devices and head mounted cameras, a wide range of applications requiring precise user localization is now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on a matching process of local keypoints extracted on the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the actual quality of this 2D-3D matching, and can degrade if environmental conditions such as steep changes in lighting like the ones between day and night occur. To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated using unconstrained egocentric video sequences both in terms of matching quality and resulting registration performance using different 3D models of historical landmarks. The results show that the proposed method can outperform state of the art registration algorithms, in particular when dealing with the challenges of night and day sequences.

2017 - Visual Saliency for Image Captioning in New Multimedia Services [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives makes this task still far from being solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, some forms of unsupervised machine attention mechanisms have been spreading, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to provide captions in images, which can be exploited for many services on cloud and on multimedia data. Experimental evaluations are conducted on the SALICON dataset, which provides groundtruths for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used for image captioning.

2016 - A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita
abstract

This paper presents a novel video access and retrieval system for edited videos. The key element of the proposal is that videos are automatically decomposed into semantically coherent parts (called scenes) to provide a more manageable unit for browsing, tagging and searching. The system features an automatic annotation pipeline, with which videos are tagged by exploiting both the transcript and the video itself. Scenes can also be retrieved with textual queries; the best thumbnail for a query is selected according to both semantics and aesthetics criteria.

2016 - A Deep Multi-Level Network for Saliency Prediction [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

This paper presents a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network that weights low- and high-level feature maps, and a prior learning network. We compare our solution with state-of-the-art saliency models on two public benchmark datasets. Results show that our model outperforms them under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.

2016 - A location-aware architecture for an IoT-based smart museum [Articolo su rivista]
Fiore, Giuseppe Del; Mainetti, Luca; Mighali, Vincenzo; Patrono, Luigi; Alletto, Stefano; Cucchiara, Rita; Serra, Giuseppe
abstract

The Internet of Things, whose main goal is to automatically predict users' desires, can find very interesting opportunities in the art and culture field, as tourism is one of the main driving engines of modern society. Currently, the innovation process in this field is growing at a slower pace, so cultural heritage is a prerogative of a restricted category of users. To address this issue, a significant technological improvement is necessary in culture-dedicated locations, which do not usually allow the installation of hardware infrastructures. In this paper, we design and validate a non-invasive indoor location-aware architecture able to enhance the user experience in a museum. The system relies on the user's smartphone and a wearable device (with image recognition and localization capabilities) to automatically deliver personalized cultural contents related to the observed artworks. The proposal was validated in the MUST museum in Lecce (Italy).

2016 - An Indoor Location-aware System for an IoT-based Smart Museum [Articolo su rivista]
Alletto, Stefano; Cucchiara, Rita; Del Fiore, Giuseppe; Mainetti, Luca; Mighali, Vincenzo; Patrono, Luigi; Serra, Giuseppe
abstract

The new technologies characterizing the Internet of Things (IoT) make it possible to realize truly smart environments able to provide advanced services to the users. Recently, these smart environments are also being exploited to renew users' interest in cultural heritage, by guaranteeing real interactive cultural experiences. In this paper, we design and validate an indoor location-aware architecture able to enhance the user experience in a museum. In particular, the proposed system relies on a wearable device that combines image recognition and localization capabilities to automatically provide the users with cultural contents related to the observed artworks. The localization information is obtained by a Bluetooth low energy (BLE) infrastructure installed in the museum. Moreover, the system interacts with the Cloud to store multimedia contents produced by the user and to share environment-generated events on his/her social networks. Finally, several location-aware services, running in the system, control the environment status also according to users' movements. These services interact with physical devices through a multiprotocol middleware. The system has been designed to be easily extensible to other IoT technologies and its effectiveness has been evaluated in the MUST museum, Lecce, Italy.

2016 - Analysis and Re-use of Videos in Educational Digital Libraries with Automatic Scene Detection [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

The advent of modern approaches to education, like Massive Open Online Courses (MOOC), made video the basic media for educating and transmitting knowledge. However, IT tools are still not adequate to allow video content re-use, tagging, annotation and personalization. In this paper we analyze the problem of identifying coherent sequences, called scenes, in order to provide the users with a more manageable editing unit. A simple spectral clustering technique is proposed and compared with state-of-the-art results. We also discuss correct ways to evaluate the performance of automatic scene detection algorithms.

2016 - Body Part Based Re-identification from an Egocentric Perspective [Relazione in Atti di Convegno]
Fergnani, Federica; Alletto, Stefano; Serra, Giuseppe; De Mira, Joaquim; Cucchiara, Rita
abstract

With the spread of wearable cameras, many consumer applications ranging from social tagging to video summarization would greatly benefit from people re-identification methods capable of dealing with the egocentric perspective. In this regard, first-person camera views present such a unique setting that traditional re-identification methods result in poor performance when applied to this scenario. In this paper, we present a simple but effective solution that overcomes the limitations of traditional approaches by dividing people images into meaningful body parts. Furthermore, by taking into account human gaze information concerning where people look when trying to recognize a person, we devise a meaningful way to weight the contributions of different body parts. Experimental results validate the proposal on a novel egocentric re-identification dataset, the first of its kind, showing that the performance increase over the current state of the art on egocentric sequences is significant.

2016 - Bridging the experiential gap in cultural visits with computer vision [Relazione in Atti di Convegno]
Cucchiara, R.; Del Bimbo, A.
abstract

This paper discusses the role of computer vision in bridging the experiential gap between the cultural and emotional experience of the visitors in museums or cultural heritage sites. We don't argue against the use of multiple sensors to provide a more complete cultural experience but claim the primary role of computer vision for such a task. Although many research challenges are still far from being solved effectively, especially for detection, re-identification, tracking and recognition, we believe that technology can be deployed already in real contexts and support concrete applications with interesting results that will open the door to valuable future applications.

2016 - Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager [Relazione in Atti di Convegno]
Paci, Francesco; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita; Benini, Luca
abstract

With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to finding new efficient, stable and long-running solutions; however, devices are too power-hungry for truly always-on operation, and are aggressively duty-cycled to achieve acceptable lifetimes. In this paper we present a wearable system for context change detection based on an egocentric camera with ultra-low power consumption that can collect data 24/7. Although the resolution of the captured images is low, experimental results in real scenarios demonstrate how our approach, based on Siamese Neural Networks, can achieve visual context awareness. In particular, we compare our solution with hand-crafted features and with state-of-the-art techniques and propose a novel and challenging dataset composed of roughly 30000 low-resolution images.

2016 - DR(eye)VE: a Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving [Relazione in Atti di Convegno]
Alletto, Stefano; Palazzi, Andrea; Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Autonomous and assisted driving are undoubtedly hot topics in computer vision. However, the driving task is extremely complex and a deep understanding of drivers' behavior is still lacking. Several researchers are now investigating the attention mechanism in order to define computational models for detecting salient and interesting objects in the scene. Nevertheless, most of these models only refer to bottom-up visual saliency and are focused on still images. Instead, during the driving experience the temporal nature and peculiarity of the task influence the attention mechanisms, leading to the conclusion that real-life driving data is mandatory. In this paper we propose a novel and publicly available dataset acquired during actual driving. Our dataset, composed of more than 500,000 frames, contains drivers' gaze fixations and their temporal integration providing task-specific saliency maps. Geo-referenced locations, driving speed and course complete the set of released data. To the best of our knowledge, this is the first publicly available dataset of this kind and can foster new discussions on better understanding, exploiting and reproducing the driver's attention process in the autonomous and assisted cars of future generations.

2016 - Exploring Architectural Details Through a Wearable Egocentric Vision Device [Articolo su rivista]
Alletto, Stefano; Abati, Davide; Serra, Giuseppe; Cucchiara, Rita
abstract

Augmented user experiences in the cultural heritage domain are in increasing demand by the new digital native tourists of the 21st century. In this paper, we propose a novel solution that aims at assisting the visitor during an outdoor tour of a cultural site using the unique first-person perspective of wearable cameras. In particular, the approach exploits computer vision techniques to retrieve the details by proposing a robust descriptor based on the covariance of local features. Using a lightweight wearable board, the solution can localize the user with respect to the 3D point cloud of the historical landmark and provide him with information about the details he is currently looking at. Experimental results validate the method both in terms of accuracy and computational effort. Furthermore, user evaluation based on real-world experiments shows that the proposal is deemed effective in enriching a cultural experience.

2016 - Eyewear Computing – Augmenting the Human with Head-Mounted Wearable Assistants [Recensione in Volume]
Cucchiara, Rita; Bulling, Andreas; Kunze, Kai; Rehg, James
abstract

The seminar was composed of workshops and tutorials on head-mounted eye tracking, egocentric vision, optics, and head-mounted displays. The seminar welcomed 30 academic and industry researchers from Europe, the US, and Asia with a diverse background, including wearable and ubiquitous computing, computer vision, developmental psychology, optics, and human-computer interaction. In contrast to several previous Dagstuhl seminars, we used an ignite talk format to reduce the time of talks to one half-day and to leave the rest of the week for hands-on sessions, group work, general discussions, and socialising. The key results of this seminar are 1) the identification of key research challenges and summaries of breakout groups on multimodal eyewear computing, egocentric vision, security and privacy issues, skill augmentation and task guidance, eyewear computing for gaming, as well as prototyping of VR applications, 2) a list of datasets and research tools for eyewear computing, 3) three small-scale datasets recorded during the seminar, 4) an article in ACM Interactions entitled “Eyewear Computers for Human-Computer Interaction”, as well as 5) two follow-up workshops on “Egocentric Perception, Interaction, and Computing” at the European Conference on Computer Vision (ECCV) as well as “Eyewear Computing” at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp).

2016 - Fast gesture recognition with Multiple Stream Discrete HMMs on 3D Skeletons [Relazione in Atti di Convegno]
Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita
abstract

HMMs are widely used in action and gesture recognition due to their implementation simplicity, low computational requirements, scalability and high parallelism. They perform well even with a limited training set. All these characteristics are hard to find together in other, even more accurate, methods. In this paper, we propose a novel double-stage classification approach, based on Multiple Stream Discrete Hidden Markov Models (MSD-HMM) and 3D skeleton joint data, able to reach high performance while maintaining all the advantages listed above. The approach allows both to quickly classify pre-segmented gestures (offline classification), and to perform temporal segmentation on streams of gestures (online classification) faster than real time. We test our system on three public datasets, MSRAction3D, UTKinect-Action and MSRDailyAction, and on a new dataset, Kinteract Dataset, explicitly created for Human Computer Interaction (HCI). We obtain state-of-the-art performance on all of them.

2016 - Historical Document Digitization through Layout Analysis and Deep Content Classification [Relazione in Atti di Convegno]
Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the process of digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network merged in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.

2016 - Layout analysis and content enrichment of digitized books [Articolo su rivista]
Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Coppi, Dalia; Cucchiara, Rita
abstract

In this paper we describe a system for automatically analyzing old documents and creating hyperlinks between different epochs, thus opening ancient documents to young people and making them available on the web with old and current content. We propose a supervised learning approach to segment text and illustrations of digitized old documents using a texture feature based on local correlation, aimed at detecting the repeating patterns of text regions and differentiating them from pictorial elements. Moreover, we present a solution to help the user in finding contemporary content connected to what is automatically extracted from the ancient documents.

2016 - Motion Segmentation using Visual and Bio-mechanical Features [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

Nowadays, egocentric wearable devices are becoming increasingly widespread among both the academic community and the general public. For this reason, methods capable of automatically segmenting the video based on the recorder's motion patterns are gaining attention. These devices present the unique opportunity of both high-quality video recordings and multimodal sensor readings. Significant efforts have been made in analyzing either the video stream recorded by these devices or the bio-mechanical sensor information. So far, the integration between these two realities has not been fully addressed, and the real capabilities of these devices are not yet exploited. In this paper, we present a solution to segment a video sequence into motion activities by introducing a novel data fusion technique based on the covariance of visual and bio-mechanical features. The experimental results are promising and show that the proposed integration strategy outperforms the results achieved focusing solely on a single source.

2016 - Multi-Level Net: a Visual Saliency Prediction Model [Relazione in Atti di Convegno]
Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
abstract

State-of-the-art approaches for saliency prediction are based on Fully Convolutional Networks, in which saliency maps are built using the last layer. In contrast, we here present a novel model that predicts saliency maps exploiting a non-linear combination of features coming from different layers of the network. We also present a new loss function to deal with the imbalance issue on saliency masks. Extensive results on three public datasets demonstrate the robustness of our solution. Our model outperforms the state of the art on SALICON, which is the largest and unconstrained dataset available, and obtains competitive results on MIT300 and CAT2000 benchmarks.

2016 - Optimizing image registration for interactive applications [Relazione in Atti di Convegno]
Gasparini, Riccardo; Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

With the spread of wearable and mobile devices, the demand for interactive augmented reality applications is in constant growth. Among the different possibilities, we focus on the cultural heritage domain, where a key step in the development of applications for augmented cultural experiences is to obtain a precise localization of the user, i.e. the 6 degree-of-freedom pose of the camera acquiring the images used by the application. Current state-of-the-art methods perform this task by extracting local descriptors from a query and exhaustively matching them to a sparse 3D model of the environment. While this procedure obtains good localization performance, due to the vast search space involved in the retrieval of 2D-3D correspondences this is often not feasible in real-time and interactive environments. In this paper we hence propose to perform descriptor quantization to reduce the search space and employ multiple KD-Trees combined with a principal component analysis dimensionality reduction to enable an efficient search. We experimentally show that our solution can halve the computational requirements of the correspondence search with regard to the state of the art while maintaining similar accuracy levels.

2016 - Performance measures and a data set for multi-target, multi-camera tracking [Relazione in Atti di Convegno]
Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C.
abstract

To help accelerate progress in multi-target, multi-camera tracking systems, we present (i) a new pair of precision-recall measures of performance that treats errors of all types uniformly and emphasizes correct identification over sources of error; (ii) the largest fully-annotated and calibrated data set to date with more than 2 million frames of 1080p, 60 fps video taken by 8 cameras observing more than 2,700 identities over 85 min; and (iii) a reference software system as a comparison baseline. We show that (i) our measures properly account for bottom-line identity match performance in the multi-camera setting; (ii) our data set poses realistic challenges to current trackers; and (iii) the performance of our system is comparable to the state of the art.

2016 - Quick, accurate, smart: 3D computer vision technology helps assessing confined animals' behaviour [Articolo su rivista]
Barnard, Shanis; Calderara, Simone; Pistocchi, Simone; Cucchiara, Rita; Podaliri Vulpiani, Michele; Messori, Stefano; Ferri, Nicola
abstract

Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment. Behavioural indicators are thus among the preferred parameters to assess welfare. However, behavioural recording (usually from video) can be very time consuming and the accuracy and reliability of the output rely on the experience and background of the observers. The outburst of new video technology and computer image processing gives the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can be later labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour in time or between individuals could be assessed. The software accuracy in correctly detecting the dogs' behaviour was checked through a validation process. An automatic behaviour recognition system, independent from human subjectivity, could add scientific knowledge on animals' quality of life in confinement as well as saving time and resources. This 3D framework was designed to be invariant to the dog's shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. 
The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.

2016 - Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and represent them with thumbnails which are at the same time semantically meaningful and aesthetically remarkable. Videos are first segmented into coherent and story-telling scenes, then a retrieval algorithm based on deep learning is proposed to retrieve the most significant scenes for a textual query. A ranking strategy based on deep features is finally used to tackle the problem of visualizing the best thumbnail. Qualitative and quantitative experiments are conducted on a collection of edited videos to demonstrate the effectiveness of our approach.

2016 - Shot, scene and keyframe ordering for interactive video re-use [Relazione in Atti di Convegno]
Baraldi, L.; Grana, C.; Borghi, G.; Vezzani, R.; Cucchiara, R.
abstract

This paper presents a complete system for shot and scene detection in broadcast videos, as well as a method to select the best representative key-frames, which could be used in new interactive interfaces for accessing large collections of edited videos. The final goal is to enable an improved access to video footage and the re-use of video content with the direct management of user-selected video-clips.

2016 - Socially Constrained Structural Learning for Groups Detection in Crowd [Articolo su rivista]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on both ground truth trajectories and tracklets previously extracted by available detector/tracker systems.

2016 - Spotting prejudice with nonverbal behaviours [Relazione in Atti di Convegno]
Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Vezzali, Loris; DI BERNARDO, GIAN ANTONIO; Zambonelli, Franco; Cucchiara, Rita
abstract

Although prejudice cannot be directly observed, nonverbal behaviours provide profound hints on people's inclinations. In this paper, we use recent sensing technologies and machine learning techniques to automatically infer the results of psychological questionnaires frequently used to assess implicit prejudice. In particular, we recorded 32 students discussing with both white and black collaborators. Then, we identified a set of features allowing automatic extraction and measured their degree of correlation with psychological scores. Results confirmed that automated analysis of nonverbal behaviour is actually possible, thus paving the way for innovative clinical tools and eventually more secure societies.

2016 - Transductive People Tracking in Unconstrained Surveillance [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Long-term tracking of people in unconstrained scenarios is still an open problem due to the absence of constant elements in the problem setting. The camera, when active, may move and both the background and the target appearance may change abruptly, leading to the inadequacy of most standard tracking techniques. We propose to exploit a learning approach that considers the tracking task as a semi-supervised learning (SSL) problem. Given few target samples, the aim is to search the target occurrences in the video stream, re-interpreting the problem as label propagation on a similarity graph. We propose a solution based on graph transduction that works iteratively frame by frame. Additionally, in order to avoid drifting, we introduce an update strategy based on an evolutionary clustering technique that chooses the visual templates that better describe target appearance, evolving the model during the processing of the video. Since we model people appearance by means of covariance matrices on color and gradient information, our framework is directly related to structure learning on Riemannian manifolds. Tests on publicly available datasets and comparisons with state-of-the-art techniques allow us to conclude that our solution exhibits interesting performance in terms of tracking precision and recall in most of the considered scenarios.

2015 - A Deep Siamese Network for Scene Detection in Broadcast Videos [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and expected results, and propose and release a new benchmark dataset.

2015 - A General-Purpose Sensing Floor Architecture for Human-Environment Interaction [Articolo su rivista]
Vezzani, Roberto; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Cucchiara, Rita
abstract

Smart environments are now designed as natural interfaces to capture and understand human behavior without a need for explicit human-computer interaction. In this paper, we present a general-purpose architecture that acquires and understands human behaviors through a sensing floor. The pressure field generated by moving people is captured and analyzed. Specific actions and events are then detected by a low-level processing engine and sent to high-level interfaces providing different functions. The proposed architecture and sensors are modular, general-purpose, cheap, and suitable for both small- and large-area coverage. Some sample entertainment and virtual reality applications that we developed to test the platform are presented.

2015 - Active query process for digital video surveillance forensic applications [Articolo su rivista]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Multimedia forensics is a new emerging discipline regarding the analysis and exploitation of digital data as support for investigation to extract probative elements. Among them, visual data about people and people activities, extracted from videos in an efficient way, are becoming day by day more appealing for forensics, due to the availability of large video-surveillance footage. Thus, many research studies and prototypes investigate the analysis of soft biometrics data, such as people appearance and people trajectories. In this work, we propose new solutions for querying and retrieving visual data in an interactive and active fashion for soft biometrics in forensics. The innovative proposal joins the capability of transductive learning for semi-supervised search by similarity and a typical multimedia methodology based on user-guided relevance feedback to allow an active interaction with the visual data of people, appearance and trajectory in large surveillance areas. The proposed approaches are very general and can be exploited independently of the surveillance setting and the type of video analytic tools.

2015 - Automatic configuration and calibration of modular sensing floors [Relazione in Atti di Convegno]
Vezzani, Roberto; Lombardi, Martino; Cucchiara, Rita
abstract

Sensing floors are becoming an emerging solution for many privacy-compliant and large-area surveillance systems. Many research and even commercial technologies have been proposed in recent years. Similarly to distributed camera networks, the problem of calibration is crucial, especially when installed in wide areas. This paper addresses the general problem of automatic calibration and configuration of modular and scalable sensing floors. Working on training data only, the system automatically finds the spatial placement of each sensor module and estimates threshold parameters needed for people detection. Tests on several training sequences captured with a commercial sensing floor are provided to validate the method.

2015 - Classification of Affective Data to Evaluate the Level Design in a Role-Playing Videogame [Relazione in Atti di Convegno]
Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita
abstract

This paper presents a novel approach to evaluate game level design strategies, applied to role playing games. Following a set of well defined guidelines, two game levels were designed for Neverwinter Nights 2 to manipulate particular emotions like boredom or flow, and tested by 13 subjects wearing a brain computer interface helmet. A set of features was extracted from the affective data logs and used to classify different parts of the gaming sessions, to verify the correspondence between the original level aims and the actual effects on people's emotions. The correlations observed suggest that the technique is extensible to other similar evaluation tasks.

2015 - Detection of Human Movements with Pressure Floor Sensors [Relazione in Atti di Convegno]
Lombardi, Martino; Vezzani, Roberto; Cucchiara, Rita
abstract

Following the recent Internet of Everything (IoE) trend, several general-purpose devices have been proposed to acquire as much information as possible from the environment and from people interacting with it. Among others, sensing floors have recently been attracting the interest of the research community. In this paper, we propose a new model to store and process floor data. The model does not assume a regular grid distribution of the sensing elements and is based on the ground reaction force (GRF) concept, widely used in biomechanics. It allows the correct detection and tracking of people, outperforming the common background subtraction schema adopted in the past. Several tests on a real sensing floor prototype are reported and discussed.

2015 - Egocentric Object Tracking: An Odometry-Based Solution [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

Tracking objects moving around a person is one of the key steps in human visual augmentation: we could estimate their locations when they are out of our field of view, know their position, distance or velocity just to name a few possibilities. This is no easy task: in this paper, we show how current state-of-the-art visual tracking algorithms fail if challenged with a first-person sequence recorded from a wearable camera attached to a moving user. We propose an evaluation that highlights these algorithms' limitations and, accordingly, develop a novel approach based on visual odometry and 3D localization that overcomes many issues typical of egocentric vision. We implement our algorithm on a wearable board and evaluate its robustness, showing in our preliminary experiments an increase in tracking performance of nearly 20% if compared to current state-of-the-art techniques.

2015 - Egocentric Video Summarization of Cultural Tour based on User Preferences [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
abstract

In this paper, we propose a new method to obtain customized video summarization according to specific user preferences. Our approach is tailored on Cultural Heritage scenario and is designed on identifying candidate shots, selecting from the original streams only the scenes with behavior patterns related to the presence of relevant experiences, and further filtering them in order to obtain a summary matching the requested user's preferences. Our preliminary results show that the proposed approach is able to leverage user's preferences in order to obtain a customized summary, so that different users may extract from the same stream different summaries.

2015 - Egocentric video personalization in cultural experiences scenarios [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
abstract

In this paper we propose a novel approach for egocentric video personalization in a cultural experience scenario, based on automatic shot labelling according to different semantic dimensions, such as web leveraged knowledge of the surrounding cultural Points Of Interest, information about stops and moves, both relying on geolocalization, and the camera wearer's behaviour. Moreover we present a video personalization web system based on multi-dimensional semantic classification of shots, which is designed to help the visitor browse and retrieve relevant information to obtain a customized video. Experimental results show that the proposed techniques for video analysis achieve good performance in unconstrained scenarios, and user evaluation tests confirm that our solution is useful and effective.

2015 - GOLD: Gaussians of Local Descriptors for Image Representation [Articolo su rivista]
Serra, Giuseppe; Grana, Costantino; Manfredi, Marco; Cucchiara, Rita
abstract

The Bag of Words paradigm has been the baseline from which several successful image classification solutions were developed in the last decade. These represent images by quantizing local descriptors and summarizing their distribution. The quantization step introduces a dependency on the dataset that, even if in some contexts it significantly boosts the performance, severely limits its generalization capabilities. Differently, in this paper, we propose to model the local features distribution with a multivariate Gaussian, without any quantization. The full rank covariance matrix, which lies on a Riemannian manifold, is projected on the tangent Euclidean space and concatenated to the mean vector. The resulting representation, a Gaussian of local descriptors (GOLD), allows the dot product to be used to closely approximate a distance between distributions without the need for expensive kernel computations. We describe an image by an improved spatial pyramid, which avoids boundary effects with soft assignment: local descriptors contribute to neighboring Gaussians, forming a weighted spatial pyramid of GOLD descriptors. In addition, we extend the model leveraging dataset characteristics in a mixture of Gaussian formulation, further improving the classification accuracy. To deal with large scale datasets and high dimensional feature spaces the Stochastic Gradient Descent solver is adopted. Experimental results on several publicly available datasets show that the proposed method obtains state-of-the-art performance.

2015 - Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences [Articolo su rivista]
Baraldi, Lorenzo; Paci, Francesco; Serra, Giuseppe; Cucchiara, Rita
abstract

We introduce a novel approach to cultural heritage experience: by means of ego-vision embedded devices we develop a system, which offers a more natural and entertaining way of accessing museum knowledge. Our method is based on distributed self-gesture and artwork recognition, and does not need fixed cameras nor radio-frequency identifications sensors. We propose the use of dense trajectories sampled around the hand region to perform self-gesture recognition, understanding the way a user naturally interacts with an artwork, and demonstrate that our approach can benefit from distributed training. We test our algorithms on publicly available data sets and we extend our experiments to both virtual and real museum scenarios, where our method shows robustness when challenged with real-world data. Furthermore, we run an extensive performance analysis on our ARM-based wearable device.

2015 - Innovative IoT-aware Services for a Smart Museum [Relazione in Atti di Convegno]
Mighali, Vincenzo; Del Fiore, Giuseppe; Patrono, Luigi; Mainetti, Luca; Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide the users with added-value services through low-power and low-cost smart objects is very attractive in many fields. Among these, art and culture represent very interesting examples, as tourism is one of the main driving engines of modern society. In this paper, we propose an IoT-aware architecture to improve the cultural experience of the user, by involving the most important recent innovations in the ICT field. The main components of the proposed architecture are: (i) an indoor localization service based on the Bluetooth Low Energy technology, (ii) a wearable device able to capture and process images related to the user's point of view, (iii) the user's mobile device useful to display customized cultural contents and to share multimedia data in the Cloud, and (iv) a processing center that manages the core of the whole business logic. In particular, it interacts with both wearable and mobile devices, and communicates with the outside world to retrieve contents from the Cloud and to provide services also to external users. The proposal is currently under development and it will be validated in the MUST museum in Lecce.

2015 - Learning to Divide and Conquer for Online Multi-Target Tracking [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are first extracted independently in each frame and then object trajectories are built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in the presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem into local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlight a significant improvement of tracking performance (MOTA +10%) over the state of the art.

2015 - Learning to identify leaders in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Leader identification is a crucial task in social analysis, crowd management and emergency planning. In this paper, we investigate a computational model for the identification of leaders in crowded scenes. We deal with the lack of a formal definition of leadership by learning, in a supervised fashion, a metric space based exclusively on people's spatiotemporal information. Based on Tarde's work on crowd psychology, individuals are modeled as nodes of a directed graph and leaders inherit their relevance thanks to other members' references. We note this is analogous to the way websites are ranked by the PageRank algorithm. During experiments, we observed different feature weights depending on the specific type of crowd, highlighting the impossibility of providing a unique interpretation of leadership. To our knowledge, this is the first attempt to study leader identification as a metric learning problem.
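
The PageRank analogy mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' learned model: the toy graph, the names of the people, the "follows" edges and the damping value are all assumptions made for illustration, and the metric-learning step described in the abstract is not shown. The sketch only shows how a node can inherit relevance from the nodes that point to it.

```python
def pagerank(edges, nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small uniform share (the "teleport" term).
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edge spreads its rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy crowd: an edge (a, b) means "a follows b".
people = ["anna", "bruno", "carla", "dario"]
follows = [("bruno", "anna"), ("carla", "anna"), ("dario", "carla")]

scores = pagerank(follows, people)
leader = max(scores, key=scores.get)
print(leader, scores)
```

In this toy graph the person referenced by the most followers, directly or through a chain, ends up with the highest score, which is the intuition the abstract borrows from web ranking.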

2015 - Mapping Appearance Descriptors on 3D Body Models for People Re-identification [Articolo su rivista]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

People Re-identification aims at associating multiple instances of a person’s appearance acquired from different points of view, different cameras, or after a spatial or a limited temporal gap to the same identifier. The basic hypothesis is that the person’s appearance is mostly constant. Many appearance descriptors have been adopted in the past, but they are often subject to severe perspective and view-point issues. In this paper, we propose a complete re-identification framework which exploits non-articulated 3D body models to spatially map appearance descriptors (color and gradient histograms) into the vertices of a regularly sampled 3D body surface. The matching and the shot integration steps are directly handled in the 3D body model, reducing the effects of occlusions, partial views or pose changes, which normally afflict 2D descriptors. A fast and effective model-to-image alignment is also proposed. It allows operation on common surveillance cameras or image collections. A comprehensive experimental evaluation is presented using the benchmark suite 3DPeS.

2015 - Measuring scene detection performance [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

In this paper we evaluate the performance of scene detection techniques, starting from the classic precision/recall approach, moving to the better designed coverage/overflow measures, and finally proposing an improved metric, in order to solve frequently observed cases in which the numeric interpretation is different from the expected results. Numerical evaluation is performed on two recent proposals for automatic scene detection, and comparing them with a simple but effective novel approach. Experimental results are conducted to show how different measures may lead to different interpretations.

2015 - Personalized Egocentric Video Summarization for Cultural Experience [Relazione in Atti di Convegno]
Varini, Patrizia; Serra, Giuseppe; Cucchiara, Rita
abstract

Recent egocentric video summarization approaches have dealt with motion analysis and social interaction without considering that the user may be interested in preserving only the part of the video related to his interests. In this paper we propose a new method for personalized video summarization of cultural experiences with the goal of extracting from the streams only the scenes corresponding to a user's specific topic request, chosen among the shots in which it is possible to deduce that the visitor was focusing on a point of interest. Preliminary experiments show that our approach is promising and allows the visitor to better customize the summary of his experience.

2015 - Scene segmentation using temporal clustering for accessing and re-using broadcast video [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

Scene detection is a fundamental tool for allowing effective video browsing and re-using. In this paper we present a model that automatically divides videos into coherent scenes, which is based on a novel combination of local image descriptors and temporal clustering techniques. Experiments are performed to demonstrate the effectiveness of our approach, by comparing our algorithm against two recent proposals for automatic scene segmentation. We also propose improved performance measures that aim to reduce the gap between numerical evaluation and expected results.

2015 - Shot and Scene Detection via Hierarchical Clustering for Re-using Broadcast Video [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
abstract

Video decomposition techniques are fundamental tools for allowing effective video browsing and re-using. In this work, we consider the problem of segmenting broadcast videos into coherent scenes, and propose a scene detection algorithm based on hierarchical clustering, along with a very fast state-of-the-art shot segmentation approach. Experiments are performed to demonstrate the effectiveness of our algorithms, by comparing against recent proposals for automatic shot and scene segmentation.

2015 - Towards the evaluation of reproducible robustness in tracking-by-detection [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Conventional experiments on MTT are built upon the belief that feeding the same fixed detections to different trackers is sufficient to obtain a fair comparison. In this work we argue that the true behavior of a tracker is exposed when it is evaluated by varying the input detections rather than by fixing them. We propose a systematic and reproducible protocol and a MATLAB toolbox for generating synthetic data starting from ground truth detections, a proper set of metrics to understand and compare trackers' peculiarities, and respective visualization solutions.

2015 - Understanding social relationships in egocentric vision [Articolo su rivista]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

The understanding of mutual people interaction is a key component for recognizing people social behavior, but it strongly relies on a personal point of view resulting difficult to be a-priori modeled. We propose the adoption of the unique head mounted cameras first person perspective (ego-vision) to promptly detect people interaction in different social contexts. The proposal relies on a complete and reliable system that extracts people's head pose combining landmarks and shape descriptors in a temporal smoothed HMM framework. Finally, interactions are detected through supervised clustering on mutual head orientation and people distances exploiting a structural learning framework that specifically adjusts the clustering measure according to a peculiar scenario. Our solution provides the flexibility to capture the interactions disregarding the number of individuals involved and their level of acquaintance in context with a variable degree of social involvement. The proposed system shows competitive performances on both publicly available ego-vision datasets and ad hoc benchmarks built with real life situations.

2015 - Wearable Vision for Retrieving Architectural Details in Augmented Tourist Experiences [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita
abstract

The interest in cultural cities is in constant growth, and so is the demand for new multimedia tools and applications that enrich their fruition. In this paper we propose an egocentric vision system to enhance tourists' cultural heritage experience. Exploiting a wearable board and a glass-mounted camera, the visitor can retrieve architectural details of the historical building he is observing and receive related multimedia contents. To obtain an effective retrieval procedure we propose a visual descriptor based on the covariance of local features. Differently than the common Bag of Words approaches our feature vector does not rely on a generated visual vocabulary, removing the dependence from a specific dataset and obtaining a reduction of the computational cost. 3D modeling is used to achieve a precise visitor's localization that allows browsing visible relevant details that the user may otherwise miss. Experimental results conducted on a publicly available cultural heritage dataset show that the proposed feature descriptor outperforms Bag of Words techniques.

2014 - 3D Hough transform for sphere recognition on point clouds [Articolo su rivista]
Camurri, Marco; Vezzani, Roberto; Cucchiara, Rita
abstract

Three-dimensional object recognition on range data and 3D point clouds is becoming more important nowadays. Since many real objects have a shape that could be approximated by simple primitives, robust pattern recognition can be used to search for primitive models. For example, the Hough transform is a well-known technique which is largely adopted in 2D image space. In this paper, we systematically analyze different probabilistic/randomized Hough transform algorithms for spherical object detection in dense point clouds. In particular, we study and compare four variants which are characterized by the number of points drawn together for surface computation into the parametric space and we formally discuss their models. We also propose a new method that combines the advantages of both single-point and multi-point approaches for a faster and more accurate detection. The methods are tested on synthetic and real datasets.
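
As an illustration of the multi-point idea in this abstract, the sketch below draws four points at a time, computes the sphere through them, and votes in a quantized (centre, radius) space. It is one plausible variant written under stated assumptions, not the combined method proposed in the paper: the function names, the bin size, the number of trials and the synthetic point cloud are all hypothetical.

```python
import math
import random
from collections import Counter


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def sphere_from_four_points(p0, p1, p2, p3):
    """Return (cx, cy, cz, r) of the sphere through four points, or None if they are nearly coplanar."""
    # Subtracting the sphere equation at p0 from the one at p_i gives a linear system:
    # 2 (p_i - p0) . c = |p_i|^2 - |p0|^2
    sq0 = sum(v * v for v in p0)
    rows = [[2.0 * (p[i] - p0[i]) for i in range(3)] for p in (p1, p2, p3)]
    rhs = [sum(v * v for v in p) - sq0 for p in (p1, p2, p3)]
    d = det3(rows)
    if abs(d) < 1e-9:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    return (*center, math.dist(center, p0))


def randomized_hough_sphere(points, trials=500, bin_size=0.1, seed=0):
    """Vote for quantized sphere parameters computed from random 4-point samples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        sphere = sphere_from_four_points(*rng.sample(points, 4))
        if sphere is not None:
            votes[tuple(round(v / bin_size) for v in sphere)] += 1
    best_bin, _ = votes.most_common(1)[0]
    return tuple(b * bin_size for b in best_bin)


# Synthetic cloud (assumed data): 60 points on a sphere plus 10 uniform outliers.
rng = random.Random(42)


def point_on_sphere(center, radius):
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(c + radius * x / norm for c, x in zip(center, v))


cloud = [point_on_sphere((1.0, 2.0, 3.0), 2.0) for _ in range(60)]
cloud += [tuple(rng.uniform(-5.0, 5.0) for _ in range(3)) for _ in range(10)]

estimate = randomized_hough_sphere(cloud)
print(estimate)
```

Samples made only of inliers all vote for the same accumulator cell, while samples contaminated by outliers scatter their votes, so the most voted cell identifies the sphere. A single-point variant would instead cast many votes per point, trading more accumulator work for fewer degenerate samples.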

2014 - A complete system for garment segmentation and color classification [Articolo su rivista]
Manfredi, Marco; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

In this paper, we propose a general approach for automatic segmentation, color-based retrieval and classification of garments in fashion store databases, exploiting shape and color information. The garment segmentation is automatically initialized by learning geometric constraints and shape cues, then it is performed by modeling both skin and accessory colors with Gaussian Mixture Models. For color similarity retrieval and classification, to adapt the color description to the users’ perception and the company marketing directives, a color histogram with an optimized binning strategy, learned on the given color classes, is introduced and combined with HOG features for garment classification. Experiments validating the proposed strategy, and a free-to-use dataset publicly available for scientific purposes, are finally detailed.

2014 - A fast and effective ellipse detector for embedded vision applications [Articolo su rivista]
Fornaciari, M.; Prati, A.; Cucchiara, R.
abstract

Several papers addressed ellipse detection as a first step for several computer vision applications, but most of the proposed solutions are too slow to be applied in real time on large images or with limited hardware resources. This paper presents a novel algorithm for fast and effective ellipse detection and demonstrates its superior speed performance on large and challenging datasets. The proposed algorithm relies on an innovative selection strategy of arcs which are candidate to form ellipses and on the use of Hough transform to estimate parameters in a decomposed space. The final aim of this solution is to represent a building block for new generation of smart-phone applications which need fast and accurate ellipse detection also with limited computational resources.

2014 - Benchmarking for Person Re-identification [Capitolo/Saggio]
Vezzani, Roberto; Cucchiara, Rita
abstract

The evaluation of computer vision and pattern recognition systems is usually a burdensome and time-consuming activity. In this chapter all the benchmarks publicly available for re-identification will be reviewed and compared, starting from the ancestors VIPeR and Caviar to the most recent datasets for 3D modeling such as SARC3d (with calibrated cameras) and RGBD-ID (with range sensors). Specific requirements and constraints are highlighted and reported for each of the described collections. In addition, details on the metrics that are mostly used to test and evaluate the re-identification systems are provided.

2014 - Covariance of Covariance Features for Image Classification [Relazione in Atti di Convegno]
Serra, Giuseppe; Grana, Costantino; Manfredi, Marco; Cucchiara, Rita
abstract

In this paper we propose a novel image descriptor built by computing the covariance of pixel level features on densely sampled patches and encoding them using their covariance. Appropriate projections to the Euclidean space and feature normalizations are employed in order to provide a strong descriptor usable with linear classifiers. In order to remove border effects, we further enhance the Spatial Pyramid representation with bilinear interpolation. Experimental results conducted on two common datasets for object and texture classification show that the performance of our method is comparable with state of the art techniques, but removing any dataset specific dependency in the feature encoding step.

2014 - Detection of static groups and crowds gathered in open spaces by texture classification [Articolo su rivista]
Manfredi, Marco; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita
abstract

A surveillance system specifically developed to manage crowded scenes is described in this paper. In particular we focused on static crowds, composed of groups of people who gather and stay in the same place for a while. The detection and spatial localization of static crowd situations is performed by means of a One Class Support Vector Machine, working on texture features extracted at patch level. Spatial regions containing crowds are identified and filtered using motion information to prevent noise and false alarms due to moving flows of people. By means of one class classification and inner texture descriptors, we are able to obtain, from a single training set, a sufficiently general crowd model that can be used for all the scenarios that share a similar viewpoint. Tests on public datasets and real setups validate the proposed system.

2014 - From Ego to Nos-Vision: Detecting Social Relationships in First-Person Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Solera, Francesco; Cucchiara, Rita
abstract

In this paper we present a novel approach to detect groups in ego-vision scenarios. People in the scene are tracked through the video sequence and their head pose and 3D location are estimated. Based on the concept of f-formation, we define with the orientation and distance an inherently social pairwise feature that describes the affinity of a pair of people in the scene. We apply a correlation clustering algorithm that merges pairs of people into socially related groups. Due to the very shifting nature of social interactions and the different meanings that orientations and distances can assume in different contexts, we learn the weight vector of the correlation clustering using Structural SVMs. We extensively test our approach on two publicly available datasets showing encouraging results when detecting groups from first-person camera views.

2014 - Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation [Relazione in Atti di Convegno]
Baraldi, Lorenzo; Paci, Francesco; Serra, Giuseppe; Benini, Luca; Cucchiara, Rita
abstract

We present a novel method for monocular hand gesture recognition in ego-vision scenarios that deals with static and dynamic gestures and can achieve high accuracy results using a few positive samples. Specifically, we use and extend the dense trajectories approach that has been successfully introduced for action recognition. Dense features are extracted around regions selected by a new hand segmentation technique that integrates superpixel classification, temporal and spatial coherence. We extensively test our gesture recognition and segmentation algorithms on public datasets and propose a new dataset shot with a wearable camera. In addition, we demonstrate that our solution can work in near real-time on a wearable device.

2014 - Head Pose Estimation in First-Person Camera Views [Relazione in Atti di Convegno]
Alletto, Stefano; Serra, Giuseppe; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a new method for real-time head pose estimation in ego-vision scenarios, which is a key step in the understanding of social interactions. In order to robustly detect heads under changing aspect ratio, scale and orientation, we use and extend the Hough-Based Tracker, which allows each subject in the scene to be followed simultaneously. In an ego-vision scenario where a group interacts in a discussion, each subject's head orientation will be more likely to remain focused for a while on the person who has the floor. In order to encode this behavior we include a stateful Hidden Markov Model technique that enforces the predicted pose with the temporal coherence from a video sequence. We extensively test our approach on several indoor and outdoor ego-vision videos with high illumination variations, showing its validity and outperforming other recent related state of the art approaches.

2014 - Human Behavior Understanding: 5th International Workshop, HBU 2014 Zurich, Switzerland, September 12, 2014 Proceedings [Relazione in Atti di Convegno]
Park, H. S.; Salah, A. A.; Lee, Y. J.; Morency, L. -P.; Sheikh, Y.; Cucchiara, R.
abstract

2014 - Illustrations Segmentation in Digitized Documents Using Local Correlation Features [Relazione in Atti di Convegno]
Coppi, Dalia; Grana, Costantino; Cucchiara, Rita
abstract

In this paper we propose an approach for Document Layout Analysis based on local correlation features. We identify and extract illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions. The proposal has been demonstrated to be effective on historical datasets and to outperform the state-of-the-art in presence of challenging documents with a large variety of pictorial elements.

2014 - Kernelized Structural Classification for 3D Dogs Body Parts Detection [Relazione in Atti di Convegno]
Pistocchi, Simone; Calderara, Simone; Barnard, S.; Ferri, N.; Cucchiara, Rita
abstract

Although pattern recognition methods for human behavioral analysis have flourished in the last decade, animal behavioral analysis has been almost neglected. The few existing approaches are mostly focused on preserving livestock economic value, while attention to the welfare of companion animals, like dogs, is now emerging as a social need. In this work, following the analogy with human behavior recognition, we propose a system for recognizing body parts of dogs kept in pens. We decide to adopt both 2D and 3D features in order to obtain a rich description of the dog model. Images are acquired using the Microsoft Kinect to capture the depth map images of the dog. On the depth maps, a Structural Support Vector Machine (SSVM) is employed to identify the body parts using both 3D features and 2D images. The proposal relies on a kernelized discriminative structural classifier specifically tailored for dogs, independently of size and breed. The classification is performed in an online fashion using the LaRank optimization technique to obtain real-time performance. Promising results have emerged during the experimental evaluation carried out at a dog shelter, managed by IZSAM, in Teramo, Italy.

2014 - Learning Graph Cut Energy Functions for Image Segmentation [Relazione in Atti di Convegno]
Manfredi, Marco; Grana, Costantino; Cucchiara, Rita
abstract

In this paper we address the task of learning how to segment a particular class of objects, by means of a training set of images and their segmentations. In particular we propose a method to overcome the extremely high training time of a previously proposed solution to this problem, Kernelized Structural Support Vector Machines. We employ a one-class SVM working with joint kernels to robustly learn significant support vectors (representative image-mask pairs) and accordingly weight them to build a suitable energy function for the graph cut framework. We report results obtained on two public datasets and a comparison of training times on different training set sizes.

2014 - Learning Superpixel Relations for Supervised Image Segmentation [Relazione in Atti di Convegno]
Manfredi, Marco; Grana, Costantino; Cucchiara, Rita
abstract

In this paper we propose to extend the well known graph cut segmentation framework by learning superpixel relations and use them to weight superpixel-to-superpixel edges in a superpixel graph. Adjacent superpixel-pairs are analyzed to build an object boundary model, able to discriminate between superpixel-pairs belonging to the same object or placed on the edge between the foreground object and the background. Several superpixel-pair features are investigated and exploited to build a non-linear SVM to learn object boundary appearance. The adoption of this modified graph cut enhances the performance of a previously proposed segmentation method on two publicly available datasets, reaching state-of-the-art results.

2014 - Miniature illustrations retrieval and innovative interaction for digital illuminated manuscripts [Articolo su rivista]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

In this paper we propose a multimedia solution for the interactive exploration of illuminated manuscripts. We leveraged on the joint exploitation of content-based image retrieval and relevance feedback to provide an effective mechanism to navigate through the manuscript and add custom knowledge in the form of tags. The similarity retrieval between miniature illustrations is based on covariance descriptors, integrating color, spatial and gradient information. The proposed relevance feedback technique, namely Query Remapping Feature Space Warping, accounts for the user’s opinions by accordingly warping the data points. This is obtained by means of a remapping strategy (from the Riemannian space where covariance matrices lie, referring back to Euclidean space) useful to boost the retrieval performance. Experiments are reported to show the quality of the proposal. Moreover, the complete prototype with user interaction, as already showcased at museums and exhibitions, is presented.

2014 - On detection of novel categories and subcategories of images using incongruence [Relazione in Atti di Convegno]
Coppi, D.; De Campos, T.; Yan, F.; Kittler, J.; Cucchiara, R.
abstract

Novelty detection is a crucial task in the development of autonomous vision systems. It aims at detecting whether samples fail to conform with the learnt models. In this paper, we consider the problem of detecting novelty in object recognition problems in which the object classes are grouped to form a semantic hierarchy. We follow the idea that, within a semantic hierarchy, novel samples can be defined as samples whose categorization at a specific level contrasts with the categorization at a more general level. This measure indicates if a sample is novel and, in that case, if it is likely to belong to a novel broad category or to a novel sub-category. We present an evaluation of this approach on two hierarchical subsets of the Caltech256 objects dataset and on the SUN scenes dataset, with different classification schemes. We obtain an improvement over Weinshall et al. and show that it is possible to bypass their normalisation heuristic. We demonstrate that this approach achieves good novelty detection rates as long as the conceptual taxonomy is congruent with the visual hierarchy, but tends to fail if this assumption is not satisfied. Copyright 2014 ACM.

2014 - Pattern recognition and crowd analysis [Articolo su rivista]
Bandini, S.; Calderara, S.; Cucchiara, R.
abstract

2014 - Preface [Relazione in Atti di Convegno]
Park, H. S.; Salah, A. A.; Lee, Y. J.; Morency, L. -P.; Sheikh, Y.; Cucchiara, R.
abstract

2014 - Substrate for a sensitive floor and method for displaying loads on the substrate [Brevetto]
Lucchese, Claudio; Cucchiara, Rita; Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto
abstract

The substrate (1; 50) for making a sensitive floor comprises: a first frame made of high-conductivity sensing means (2a-2d) having a first orientation; a second frame made of high-conductivity sensing means (3a-3d) which is adapted to be laid on said first frame and has a second orientation, other than said first orientation, said second frame (3a-3d) forming a support layer for floor finishing products; an element (4) made of a conductive material, which comprises: an elastically compressible thickness (S1), two opposite faces (104, 204) contacting said two first and second frames (2a-2d), (3a-3d), an electric resistor whose resistance is proportional to said thickness (S1).

2014 - Visions for augmented cultural heritage experience [Articolo su rivista]
Cucchiara, R.; Del Bimbo, A.
abstract

Museum visitor experiences differ from person to person, from cognitive to affective experiences. Progress in information technology has provided us with the opportunity to improve both the quantity and personalization of cultural information, privileging the cognitive experience over the affective. Computer vision promises to be an extraordinary enabling technology for augmenting visitor experiences, bridging the affective gap by understanding the visitor's individual cognitive needs and interests and his or her situational affective state. © 2014 IEEE.

2014 - Visual Tracking: An Experimental Survey [Articolo su rivista]
A. W. M., Smeulders; D. M., Chu; Cucchiara, Rita; Calderara, Simone; A., Dehghan; M., Shah
abstract

There is a large variety of trackers, which have been proposed in the literature during the last two decades with mixed success. Object tracking in realistic scenarios is a difficult problem, therefore it remains a most active area of research in Computer Vision. A good tracker should perform well in a large number of videos involving illumination changes, occlusion, clutter, camera motion, low contrast, specularities and at least six more aspects. However, the performance of proposed trackers has typically been evaluated on fewer than ten videos, or on special-purpose datasets. In this paper, we aim to evaluate trackers systematically and experimentally on 315 video fragments covering the above aspects. We selected a set of nineteen trackers to include a wide variety of algorithms often cited in the literature, supplemented with trackers appearing in 2010 and 2011 for which the code was publicly available. We demonstrate that trackers can be evaluated objectively by survival curves, Kaplan-Meier statistics, and Grubbs testing. We find that in the evaluation practice the F-score is as effective as the object tracking accuracy (OTA) score. The analysis under a large variety of circumstances provides objective insight into the strengths and weaknesses of trackers.

2013 - A Fast Approach for Integrating ORB Descriptors in the Bag of Words Model [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Manfredi, Marco; Cucchiara, Rita
abstract

In this paper we propose to integrate the recently introduced ORB descriptors in the currently favored approach for image classification, that is the Bag of Words model. In particular, the problem to be solved is to provide a clustering method able to deal with the binary string nature of the ORB descriptors. We suggest using a k-means-like approach, called k-majority, which replaces the Euclidean distance with the Hamming distance and takes the majority-selected vector as the new cluster center. Results combining this new approach with other features are provided over the ImageCLEF 2011 dataset.
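The clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: descriptors are assumed to be packed into Python integers, and the function names (`hamming`, `majority_center`, `k_majority`), the tie-breaking rule, the random initialization and the iteration limit are all choices made for this example.

```python
import random

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def majority_center(members, n_bits):
    """Bitwise majority vote over the members (ties resolve to 0, an assumption)."""
    center = 0
    for bit in range(n_bits):
        ones = sum((m >> bit) & 1 for m in members)
        if 2 * ones > len(members):
            center |= 1 << bit
    return center

def k_majority(descriptors, k, n_bits, iters=20, seed=0):
    """k-means-like loop: Hamming assignment, majority-vote center update."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, k)  # hypothetical initialization
    labels = []
    for _ in range(iters):
        # Assign each descriptor to the closest center in Hamming distance.
        labels = [min(range(k), key=lambda c: hamming(d, centers[c]))
                  for d in descriptors]
        new_centers = []
        for c in range(k):
            members = [d for d, lab in zip(descriptors, labels) if lab == c]
            # Keep the old center if a cluster becomes empty.
            new_centers.append(majority_center(members, n_bits) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Because both the distance and the center update work on bits, the centers remain valid binary strings, which is the property that a plain Euclidean k-means would not preserve.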

2013 - A mobile vision system for fast and accurate ellipse detection [Relazione in Atti di Convegno]
Fornaciari, M.; Cucchiara, R.; Prati, A.
abstract

Several papers have addressed ellipse detection as a first step for computer vision applications, but most of the proposed solutions are too slow to be applied in real time on large images or with limited hardware resources, as in the case of mobile devices. This demo is based on a novel algorithm for fast and accurate ellipse detection. The proposed algorithm relies on a careful selection of arcs which are candidates to form ellipses and on the use of the Hough transform to estimate parameters in a decomposed space. The demo will show it working on a commercial smartphone. © 2013 IEEE.

2013 - A people counting system for business analytics [Relazione in Atti di Convegno]
Pane, C.; Gasparini, M.; Prati, A.; Gualdi, G.; Cucchiara, R.
abstract

This paper deals with people counting in stores for business analytics using stereo vision. Among the several problems in this type of application, two are the most relevant for our purposes: the management of occlusions and the distinction between adult people (potential customers) and other objects (children, trolleys, strollers, animals, etc.). The proposed solution uses a novel approach for object detection (based on background suppression on a so-called 'depth bird-eye view' and clustering of the 3D point cloud by means of mean shift with a cylindrical kernel) followed by an adult people classifier which exploits a fitness measure with respect to a cylindrical human body model. The fitness is computed using Monte Carlo sampling to estimate the volume occupation. Experiments are conducted on two real setups (including a store on a normal day of activity) and compared with a previous work. The results demonstrate the accuracy of the proposed solution. © 2013 IEEE.

2013 - AN AUTOMATED PICKING WORKSTATION FOR HEALTHCARE APPLICATIONS [Articolo su rivista]
Piccinini, P.; Gamberini, Rita; Prati, A.; Rimini, Bianca; Cucchiara, Rita
abstract

The costs associated with the management of healthcare systems have been subject to continuous scrutiny for some time now, with a view to reducing them without affecting the quality as perceived by final users. A number of different solutions have arisen based on centralisation of healthcare services and investments in Information Technology (IT). One such example is centralised management of pharmaceuticals among a group of hospitals which is then incorporated into the different steps of the automation supply chain. This paper focuses on a new picking workstation available for insertion in automated pharmaceutical distribution centres and which is capable of replacing manual workstations and bringing about improvements in working time. The workstation described uses a sophisticated computer vision algorithm to allow picking of very diverse and complex objects randomly available on a belt or in bins. The algorithm exploits state-of-the-art feature descriptors for an approach that is robust against occlusions and distracting objects, and invariant to scale, rotation or illumination changes. Finally, the performance of the designed picking workstation is tested in a large experimentation focused on the management of pharmaceutical items.

2013 - Automatic Single-Image People Segmentation and Removal for Cultural Heritage Imaging [Relazione in Atti di Convegno]
Manfredi, Marco; Grana, Costantino; Cucchiara, Rita
abstract

In this paper, the problem of automatic people removal from digital photographs is addressed. Removing unintended people from a scene can be very useful to focus further steps of image analysis only on the object of interest. A supervised segmentation algorithm is presented and tested in several scenarios.

2013 - Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives [Relazione in Atti di Convegno]
Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Cucchiara, Rita
abstract

Several local features have become quite popular for concept detection and search, due to their ability to capture distinctive details. Typically a Bag of Words approach is followed, where a codebook is built by quantizing the local features. In this paper, we propose to represent SIFT local features extracted from an image as a multivariate Gaussian distribution, obtaining a mean vector and a covariance matrix. Unlike common techniques based on the Bag of Words model, our solution does not rely on the construction of a visual vocabulary, thus removing the dependence of the image descriptors on the specific dataset and allowing the features to be immediately retargeted to different classification and search problems. Experiments are conducted on two very different Cultural Heritage image archives, composed of illuminated manuscript miniatures and pictures of architectural elements collected from the web, on which the proposed approach outperforms the Bag of Words technique both in classification and retrieval.

2013 - Hand Segmentation for Gesture Recognition in EGO-Vision [Relazione in Atti di Convegno]
Serra, Giuseppe; Camurri, Marco; Baraldi, Lorenzo; Michela, Benedetti; Cucchiara, Rita
abstract

Portable devices for first-person camera views will play a central role in future interactive systems. One necessary step for feasible human-computer guided activities is gesture recognition, preceded by a reliable hand segmentation from egocentric vision. In this work we provide a novel hand segmentation algorithm based on Random Forest superpixel classification that integrates light, time and space consistency. We also propose a gesture recognition method based on Exemplar SVMs, since it requires only a small set of positive samples and is therefore well suited to egocentric video applications. Furthermore, this method is enhanced by using segmented images instead of full frames during the test phase. Experimental results show that our hand segmentation algorithm outperforms the state-of-the-art approaches and improves the gesture recognition accuracy on both the publicly available EDSH dataset and our dataset designed for cultural heritage applications.

2013 - Human Behavior Understanding with Wide Area Sensing Floors [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
abstract

The research on innovative and natural interfaces aims at developing devices able to capture and understand the human behavior without the need of a direct interaction. In this paper we propose and describe a framework based on a sensing floor device. The pressure field generated by people or objects standing on the floor is captured and analyzed. Local and global features are computed by a low level processing unit and sent to high level interfaces. The framework can be used in different applications, such as entertainment, education or surveillance. A detailed description of the sensing element and the processing architectures is provided, together with some sample applications developed to test the device capabilities.

2013 - Image Classification with Multivariate Gaussian Descriptors [Relazione in Atti di Convegno]
Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Cucchiara, Rita
abstract

Techniques based on the Bag of Words approach represent images by quantizing local descriptors and summarizing their distribution in a histogram. Differently, in this paper we describe an image as a multivariate Gaussian distribution, estimated over the extracted local descriptors. The estimated distribution is mapped to a high-dimensional descriptor, by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. To deal with large scale datasets and high dimensional feature spaces the Stochastic Gradient Descent solver is adopted. The experimental results on Caltech-101 and ImageCLEF2011 show that the method obtains competitive performance with state-of-the-art approaches.

2013 - Intelligent video surveillance as a service [Capitolo/Saggio]
Prati, A.; Vezzani, R.; Fornaciari, M.; Cucchiara, R.
abstract

Nowadays, intelligent video surveillance has become an essential tool for several security-related applications. With the growth of installed cameras and the increasing complexity of required algorithms, in-house self-contained video surveillance systems become a chimera for most institutions and (small) companies. The paradigm of Video Surveillance as a Service (VSaaS) helps distribute not only storage space in the cloud (necessary for handling large amounts of video data), but also infrastructures and computational power. This chapter will briefly introduce the motivations and the main characteristics of a VSaaS system, providing a case study where research-lab computer vision algorithms are integrated in a VSaaS platform. The lessons learnt and some future directions on this topic will also be highlighted.

2013 - Learning articulated body models for people re-identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

People re-identification is a challenging problem in surveillance and forensics, and it aims at associating multiple instances of the same person which have been acquired from different points of view and after a temporal gap. Image-based appearance features are usually adopted but, in addition to their intrinsically low discriminability, they are subject to perspective and view-point issues. We propose to completely change the approach by mapping local descriptors extracted from RGB-D sensors on a 3D body model for creating a view-independent signature. An original bone-wise color descriptor is generated and reduced with PCA to compute the person signature. The virtual bone set used to map appearance features is learned using a recursive splitting approach. Finally, people matching for re-identification is performed using Relaxed Pairwise Metric Learning, which simultaneously provides feature reduction and weighting. Experiments on a specific dataset created with the Microsoft Kinect sensor and the OpenNI libraries prove the advantages of the proposed technique with respect to state-of-the-art methods based on 2D or non-articulated 3D body models.

2013 - Lightweight Sign Recognition for Mobile Devices [Relazione in Atti di Convegno]
Fornaciari, Michele; Prati, Andrea; Grana, Costantino; Cucchiara, Rita
abstract

The diffusion of powerful mobile devices has laid the basis for new applications that implement sophisticated computer vision and pattern recognition algorithms directly on these embedded devices. This paper describes the implementation of a complete system for automatic recognition of places localized on a map through the recognition of significant signs by means of the camera of a mobile device (smartphone, tablet, etc.). The paper proposes a novel classification algorithm based on the innovative use of bag-of-words on ORB features. The recognition is achieved using a simple yet effective search scheme which exploits GPS localization to limit the possible matches. This simple solution brings several advantages, such as speed even on limited-resource devices, usability even with limited training samples, and ease of adaptation to new training samples and classes. The overall architecture of the system is based on a REST-JSON client-server architecture. The experiments have been conducted in a real scenario, evaluating the different parameters which influence the performance.

2013 - Modeling Local Descriptors with Multivariate Gaussians for Object and Scene Recognition [Relazione in Atti di Convegno]
Serra, Giuseppe; Grana, Costantino; Manfredi, Marco; Cucchiara, Rita
abstract

Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description and compare its capabilities to histogram-based approaches. We use the multivariate Gaussian distribution, applied over the SIFT descriptors, extracted with dense sampling on a spatial pyramid. Every distribution is converted to a high-dimensional descriptor, by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. Experiments on Caltech-101 and ImageCLEF2011 are performed using the Stochastic Gradient Descent solver, which makes it possible to deal with large scale datasets and high dimensional feature spaces.

2013 - On the design of embedded solutions to banknote recognition [Articolo su rivista]
Rashid, A.; Prati, A.; Cucchiara, R.
abstract

Banknote recognition systems have many applications in the modern world of automatic monetary transaction machines. They are traditionally based on simple classifiers applied over manually selected areas. A new solution in this field, borrowed from content-based image retrieval (CBIR), which is based on dense scale-invariant feature transform features in a bag-of-words framework followed by a support vector machine (SVM) classifier, is explored. The proposed computer vision system for banknote recognition, on one hand, enables recognition at high accuracy and speed, and, on the other hand, provides a basis for further applications, e.g., counterfeit detection and fitness test. This approach makes the system robust to various defects, which may occur during image acquisition or during the circulation life of the banknote. We implemented and tested on an embedded platform three state-of-the-art classification techniques [SVM, artificial neural network (ANN), and hidden Markov model (HMM)]. The comparative results are reported for accuracy with different sizes of the training datasets and with various types of datasets. In this framework, the SVM classifier outperforms ANN and HMM on the basis of speed and accuracy on our embedded platform. © 2013 Society of Photo-Optical Instrumentation Engineers.

2013 - People reidentification in surveillance and forensics: a Survey [Articolo su rivista]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

The field of surveillance and forensics research is currently shifting focus and is now showing an ever increasing interest in the task of people reidentification. This is the task of assigning the same identifier to all instances of a particular individual captured in a series of images or videos, even after the occurrence of significant gaps over time or space. People reidentification can be a useful tool for people analysis in security as a data association method for long-term tracking in surveillance. However, current identification techniques being utilized present many difficulties and shortcomings. For instance, they rely solely on the exploitation of visual cues such as color, texture, and the object's shape. Despite the many advances in this field, reidentification is still an open problem. This survey aims to tackle all the issues and challenging aspects of people reidentification while simultaneously describing the previously proposed solutions for the encountered problems. This begins with the first attempts of holistic descriptors and progresses to the more recently adopted 2D and 3D model-based approaches. The survey also includes an exhaustive treatise of all the aspects of people reidentification, including available datasets, evaluation metrics, and benchmarking.

2013 - Sensing floors for privacy-compliant surveillance of wide areas [Relazione in Atti di Convegno]
Lombardi, Martino; Pieracci, Augusto; Santinelli, Paolo; Vezzani, Roberto; Cucchiara, Rita
abstract

Surveillance systems can really benefit from the integration of multiple and heterogeneous sensors. In this paper we describe an innovative sensing floor. Thanks to its low cost and ease of installation, the floor is suitable for both private and public environments, from narrow zones to wide areas. The floor is made by adding a sensing layer below commercial floating tiles. The sensor is scalable, reliable, and completely invisible to the users. The temporal and spatial resolutions of the data are high enough to identify the presence of people, to recognize their behavior and to detect events in a privacy-compliant way. Experimental results on a real prototype implementation confirm the potential of the framework.

2013 - Structured learning for detection of social groups in crowd [Relazione in Atti di Convegno]
Solera, Francesco; Calderara, Simone; Cucchiara, Rita
abstract

Group detection in crowds will play a key role in future behavior analysis surveillance systems. In this work we build a new Structural SVM-based learning framework able to solve the group detection task by exploiting annotated video data to deduce a sociologically motivated distance measure founded on Hall's proxemics and Granger's causality. We improve over state-of-the-art results even in the most crowded test scenarios, while keeping the classification time affordable for quasi-real time applications. A new scoring scheme specifically designed for the group detection task is also proposed.

2013 - UNIMORE at ImageCLEF 2013: Scalable Concept Image Annotation [Relazione in Atti di Convegno]
Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Cucchiara, Rita; Martoglia, Riccardo; Mandreoli, Federica
abstract

In this paper we propose a large-scale image annotation system for the Scalable Concept Image Annotation task. For each concept to be detected a separate classifier is built using the provided textual annotation. Images are represented as a Multivariate Gaussian distribution of a set of local features extracted over a dense regular grid. Textual analysis, on the web pages containing training images, is performed to retrieve a relevant set of samples for learning each concept classifier. An online SVM solver based on Stochastic Gradient Descent is used to manage the large amount of training data. Experimental results show that the combination of different kinds of local features encoded with our strategy achieves very competitive performance both in terms of mAP and mean F-measure.

2013 - Video surveillance online repository (ViSOR) [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

This paper describes ViSOR (Video Surveillance Online Repository), designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. The repository is free and researchers can collaborate by sharing their own videos or datasets. Most of the included videos are annotated. Annotations are based on a reference ontology which has been defined by integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, which is aimed at identifying the spatial, temporal and domain detail level used. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.

2012 - 2D Images Map Warping for Improved User Interaction [Relazione in Atti di Convegno]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

In this paper, we suggest an interaction model designed to fit users' expectations in front of an image retrieval system. A lightweight relevance feedback strategy, working directly on the 2D projection of image features, allows the user to spatially navigate the media collection maintaining the real-time constraint. A preliminary evaluation of this relevance feedback strategy shows good performance compared with other known approaches.

2012 - A human vs. machine challenge in fashion color classification [Relazione in Atti di Convegno]
Grana, C.; Borghesani, D.; Cucchiara, R.
abstract

For this demo, we present a set of stark applications designed to evaluate the performance of a color similarity retrieval system against human operators' performance in the same tasks. The proposed series of tests gives some interesting insights about the perception of color classes and the reliability of manual annotation in the fashion context. © 2012 Springer-Verlag.

2012 - Class-based color bag of words for fashion retrieval [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

Color signatures, histograms and bag of colors are basic and effective strategies for describing the color content of images, for retrieving images by their color appearance or providing color annotation. In some domains, colors assume a specific meaning for users and the color-based classification and retrieval should mirror the initial suggestions given by users in the training set. For instance, in the fashion world, the names given to the dominant color of a garment or a dress reflect the fashion dictates and not a uniform division of the color space. In this paper we propose a general approach to implement a color signature as a trained bag of words, defined on the basis of user-defined color classes. The novel Class-based Color Bag of Words is an easily computable bag of words of color, constructed following an approach similar to the Median Cut algorithm, but biased by the color distribution in the trained classes. Moreover, to dramatically reduce the computational effort we propose 3D integral histograms, a 3D extension of integral images, easily extensible to many histogram-based signatures in 3D color space. Several comparisons on large fashion datasets confirm the discriminant power of this signature.
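The idea of extending integral images to a 3D color histogram can be sketched as a summed-volume table, so that the number of pixels falling inside any axis-aligned box of color bins is obtained with eight lookups. This is only an illustration under stated assumptions, not the paper's code: the RGB color space, the uniform binning, and the function names `build_integral_histogram` and `box_count` are hypothetical.

```python
def build_integral_histogram(pixels, bins):
    """Return table I where I[i][j][k] counts pixels with bin indices < (i, j, k)."""
    hist = [[[0] * bins for _ in range(bins)] for _ in range(bins)]
    for r, g, b in pixels:
        # Map 8-bit channel values to bin indices in [0, bins).
        hist[r * bins // 256][g * bins // 256][b * bins // 256] += 1
    size = bins + 1  # extra zero layer simplifies the border cases
    I = [[[0] * size for _ in range(size)] for _ in range(size)]
    for i in range(1, size):
        for j in range(1, size):
            for k in range(1, size):
                # 3D inclusion-exclusion over the already computed neighbours.
                I[i][j][k] = (hist[i - 1][j - 1][k - 1]
                              + I[i - 1][j][k] + I[i][j - 1][k] + I[i][j][k - 1]
                              - I[i - 1][j - 1][k] - I[i - 1][j][k - 1] - I[i][j - 1][k - 1]
                              + I[i - 1][j - 1][k - 1])
    return I

def box_count(I, lo, hi):
    """Pixels whose bin indices lie in the half-open box [lo, hi) on each axis."""
    (r0, g0, b0), (r1, g1, b1) = lo, hi
    return (I[r1][g1][b1]
            - I[r0][g1][b1] - I[r1][g0][b1] - I[r1][g1][b0]
            + I[r0][g0][b1] + I[r0][g1][b0] + I[r1][g0][b0]
            - I[r0][g0][b0])
```

A box-splitting procedure in the spirit of Median Cut has to evaluate the population of many candidate boxes; with the table precomputed, each evaluation costs constant time regardless of the box size.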

2012 - Integrate tool for online analysis and offline mining of people trajectories [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

In the past literature, online alarm-based video-surveillance and offline forensic-based data mining systems are often treated separately, even by different scientific communities. However, the founding techniques are almost the same and, despite some examples in commercial systems, the cases in which an integrated approach is followed are limited. For this reason, this study describes an integrated tool capable of putting together these two subsystems in an effective way. Despite its generality, the proposal is here reported in the case of people trajectory analysis, both in real time and offline. Trajectories are modelled based on either their spatial location or their shape, and proper similarity measures are proposed. Special solutions to meet real-time requirements in both cases are also presented, and the trade-off between efficiency and efficacy is analysed by comparing results with and without a statistical model. Examples of results on large datasets acquired in the University campus are reported as a preliminary evaluation of the system.

2012 - Intelligent Video Surveillance [Capitolo/Saggio]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

Safety and security reasons are pushing the growth of surveillance systems, for both prevention and forensic tasks. Unfortunately, most of the installed systems have recording capability only, with quality so poor that it makes them completely unhelpful. This chapter will introduce the concepts of modern systems for Intelligent Video Surveillance (IVS), with the aim of providing neither a complete treatment nor a technical description of this topic, but of offering a simple and concise panorama of the motivations, components, and trends of these systems. Unlike CCTV systems, IVS should be able, for instance, to monitor people in public areas and smart homes, to control urban traffic, and to perform identity assessment for the security and safety of critical infrastructure.

2012 - Learning Non-Target Items for Interesting Clothes Segmentation in Fashion Images [Relazione in Atti di Convegno]
Grana, Costantino; Calderara, Simone; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we propose a color-based approach for skin detection and interest garment selection aimed at an automatic segmentation of pieces of clothing. For both purposes, the color description is extracted by an iterative energy minimization approach and an automatic initialization strategy is proposed by learning geometric constraints and shape cues. Experiments confirm the good performance of this technique both in the context of skin removal and in the context of classification of garments.

2012 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface [Relazione in Atti di Convegno]
Fusiello, A.; Murino, V.; Cucchiara, R.
abstract

2012 - Multimedia for Cultural Heritage: Key Issues [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Borghesani, Daniele; M., Agosti; A. D., Bagdanov
abstract

Multimedia technologies have recently created the conditions for a true revolution in the Cultural Heritage domain, particularly in reference to the study, exploitation, and fruition of artistic works. New opportunities are arising for researchers in the field of multimedia to share their research results with people coming from the field of art and culture, and vice versa. This paper gathers together opinions and ideas shared during the final discussion session at the 1st International Workshop on Multimedia for Cultural Heritage, as a summary of the problems and possible directions for solving them.

2012 - Multistage Particle Windows for Fast and Accurate Object Detection [Articolo su rivista]
G., Gualdi; A., Prati; Cucchiara, Rita
abstract

The common paradigm employed for object detection is the sliding window (SW) search. This approach generates grid-distributed patches, at all possible positions and sizes, which are evaluated by a binary classifier: the trade-off between computational burden and detection accuracy is the real critical point of sliding windows; several methods have been proposed to speed up the search such as adding complementary features. We propose a paradigm that differs from any previous approach, since it casts object detection into a statistical-based search using a Monte Carlo sampling for estimating the likelihood density function with Gaussian kernels. The estimation relies on a multi-stage strategy where the proposal distribution is progressively refined by taking into account the feedback of the classifiers. The method can be easily plugged in a Bayesian-recursive framework to exploit the temporal coherency of the target objects in videos. Several tests on pedestrian and face detection, both on images and videos, with different types of classifiers (cascade of boosted classifiers, soft cascades and SVM) and features (covariance matrices, Haar-like features, integral channel features and histogram of oriented gradients) demonstrate that the proposed method provides higher detection rates and accuracy as well as a lower computational burden w.r.t. sliding window detection.

2012 - People Orientation Recognition by Mixtures of Wrapped Distributions on Random Trees [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

The recognition of people's orientation in single images is still an open issue in several real cases, when the image resolution is poor, body parts cannot be distinguished and localized, or motion cannot be exploited. However, the estimation of a person's orientation, even an approximate one, could be very useful to improve people tracking and re-identification systems, or to provide a coarse alignment of body models on the input images. In these situations, holistic features seem to be more effective and faster than model-based 3D reconstructions. In this paper we propose to describe people's appearance with multi-level HoG feature sets and to classify their orientation using an array of Extremely Randomized Trees classifiers trained on quantized directions. The outputs of the classifiers are then integrated into a global continuous probability density function using a Mixture of Approximated Wrapped Gaussian distributions. Experiments on the TUD Multiview Pedestrians, the Sarc3D, and the 3DPeS datasets confirm the efficacy of the method and the improvement with respect to state-of-the-art approaches.

2012 - Preface [Relazione in Atti di Convegno]
Grana, C.; Cucchiara, R.
abstract

2012 - Real-time object detection and localization with SIFT-based clustering [Articolo su rivista]
Piccinini, P.; Prati, A.; Cucchiara, R.
abstract

This paper presents an innovative approach for detecting and localizing duplicate objects in pick-and-place applications under extreme conditions of occlusion, where standard appearance-based approaches are likely to be ineffective. The approach exploits SIFT keypoint extraction and mean shift clustering to partition the correspondences between the object model and the image onto different potential object instances with real-time performance. Then, the hypotheses of the object shape are validated by a projection with a fast Euclidean transform of some delimiting points onto the current image. Moreover, in order to improve the detection in the case of reflective or transparent objects, multiple object models (of both the same and different faces of the object) are used and fused together. Many measures of efficacy and efficiency are provided on random disposals of heavily-occluded objects, with a specific focus on real-time processing. Experimental results on different and challenging kinds of objects are reported. © 2012 Elsevier B.V. All rights reserved.

2012 - Relevance Feedback as an Interactive Navigation Tool [Relazione in Atti di Convegno]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

Image collections are searched in common retrieval systems in many different ways, but the typical presentation is by means of a grid styled view. In this paper we try to suggest a novel use of relevance feedback as a tool to warp the view and allow the user to spatially navigate the image collection, and at the same time focus on his retrieval aim. This is obtained by the use of a distance based space warping on the 2D projection of the distance matrix.
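
As a hedged illustration of the "2D projection of the distance matrix" mentioned above, the sketch below uses classical multidimensional scaling (MDS). The choice of classical MDS and the function names `pairwise_distances` and `classical_mds_2d` are assumptions made for this example; the abstract does not state which projection or warping rule the authors use, and the warping step is not shown.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    p = np.asarray(points, dtype=float)
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)

def classical_mds_2d(dist):
    """Project items onto 2D from their distance matrix (classical MDS).

    Assumption: this is one standard projection, not necessarily the
    one used in the paper.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    # Keep the two largest eigenvalues and their eigenvectors.
    top = np.argsort(vals)[::-1][:2]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Usage: five items whose distances come from known 2D positions.
items = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [1.0, 1.0], [-2.0, 5.0]])
layout = classical_mds_2d(pairwise_distances(items))
```

The resulting layout reproduces the input distances up to rotation and reflection, which is what a navigation view built on the distance matrix would need as a starting point.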

2012 - Special Issue: Recent Achievements in Multimedia for Cultural Heritage - Guest Editorial [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino
abstract

For quite some time, libraries, document and historical centers from opposite corners of the world have been the caretakers of our rich and assorted social legacy. They have protected and furnished access to the testimonies of knowledge, beauty and inspiration, such as sculptures, paintings, music and literature. The new information technologies have created unbelievable opportunities to make this common heritage more accessible for all. Culture is following the digital path and “memory institutions” are adapting the way in which they communicate with their public. Multimedia technologies have recently created the conditions for a true revolution in the cultural heritage area, with reference to the study, valorization, and fruition of artistic works. New multimedia technologies shall be able to be utilized to plan unique approaches to the perception and fulfillment of the masterful legacy, for instance, through smart cultural objects and new interfaces with the backing of items such as story-telling, gaming and learning. All the plurality of masterpieces (paintings, books, manuscripts, even photos of sculptures and architecture) can be effectively embedded into a unique “paradigm” through digitization. This allows a significant reduction in costs, an enormous expansion of public accessibility (and therefore income), and at the same time a tremendous freedom for data elaboration. In brief, digitization enhances pleasure for the public and usefulness to experts on cultural heritage assets.

2012 - Towards Artistic Collections Navigation Tools based on Relevance Feedback [Relazione in Atti di Convegno]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

Artistic image collections are usually managed via textual metadata in standard content management systems. More sophisticated searches can be performed using image retrieval technologies based on visual content. Nevertheless, the problem of information presentation remains. In this paper we try to move beyond the classic grid-styled presentation model, suggesting a novel use of relevance feedback as a navigation tool. Relevance feedback is therefore used to warp the view and allow the user to spatially navigate the image collection, and at the same time focus on his retrieval aim. This is obtained by exploiting a distance based space warping on the 2D projection of the distance matrix. Multitouch gestures are employed to provide feedback by natural interaction with the system.

2012 - Understanding dyadic interactions applying proxemic theory on videosurveillance trajectories [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

Understanding social and collective people behaviour in open spaces is one of the frontiers of modern video surveillance. Many sociological theories, and proxemics in particular, have proved their validity as a support for classifying and interpreting human behaviour. Proxemics suggests some simple but effective behavioural rules, useful to understand what people are doing and their social involvement with other individuals. In this paper we propose to extend the proxemics analysis along time and provide a solution for analysing sequences of proxemic states computed between trajectories of people pairs (dyads). Trajectories, computed from videosurveillance videos, are first analysed and converted to a sequence of symbols according to proxemic theory. Then an elastic measure for comparing those sequences is introduced. Finally, interactions are classified both in an off-line unsupervised way and in an on-line fashion. Results on videosurveillance data demonstrate that sequences of proxemic states can be effective in characterizing mutual interactions. Experiments are proposed on capturing the most frequent dyad interactions and on classifying them on-line when a labelled training set is available.

2012 - Veiling Luminance estimation on FPGA-based embedded smart camera [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Santinelli, Paolo; Cucchiara, Rita
abstract

This paper describes the design and development of a Veiling Luminance estimation system based on the use of a CMOS image sensor, fully implemented on FPGA. The system is composed of the CMOS image sensor, FPGA, DDR SDRAM, USB controller and SPI (Serial Peripheral Interface) Flash. The FPGA is used to build a system-on-chip integrating a soft processor (Xilinx MicroBlaze) and all the hardware blocks needed to handle the external peripherals and memory. The soft processor is used to handle image acquisition and all computational tasks needed to compute the Veiling Luminance value. The advantages of this single chip FPGA implementation include the reduction of the hardware requirements, power consumption, and system complexity. The problem of high dynamic range images has been addressed with multiple acquisitions at different exposure times. Vignetting, radial distortion and angular weighting, as required by the veiling luminance definition, are handled by a single integer look-up table (LUT) access. Results are compared with a state-of-the-art certified instrument.

2011 - 3DPes: 3D People Dataset for Surveillance and Forensics [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

The interest of the research community in creating reference datasets for performance analysis is always very high. Although new datasets collecting large amounts of video footage are spreading in surveillance and forensics, few benchmarks with annotation data are available for testing specific tasks and especially for 3D/multi-view analysis. In this paper we present 3DPeS, a new dataset for 3D/multi-view surveillance and forensic applications. This has been designed for discussing and evaluating research results in people re-identification and other related activities (people detection, people segmentation and people tracking). The new assessed version of the dataset contains hundreds of video sequences of 200 people taken from a multi-camera distributed surveillance system over several days, with different light conditions; each person is detected multiple times and from different points of view. In surveillance scenarios, the dataset can be exploited to evaluate people reacquisition, 3D body models and people activity reconstruction algorithms. In forensics it can be adopted too, by relaxing some constraints (e.g. real time) and neglecting some information (e.g. calibration). Some results on this new dataset are presented using state-of-the-art methods for people re-identification as a benchmark for future comparisons.

2011 - A Real-Time Embedded Solution for Skew Correction in Banknote Analysis [Relazione in Atti di Convegno]
Rashid, Adnan; Prati, Andrea; Cucchiara, Rita
abstract

Several industrial applications require embedded solutions both for compacting the hardware occupation and reducing energy consumption, and for achieving high speed performance. This paper presents a computer vision system developed for correcting image skew in applications for banknote analysis and classification. The system must be very efficient and run on a fixed-point DSP with limited computational resources. Consequently, we propose three innovative improvements to basic and general-purpose image processing techniques that can be helpful in other computer vision applications on embedded devices. In particular, we address: a) an efficient labeling with a union-find approach for hole filling, b) a fast Hough transform implementation, and c) a very high-speed estimation of the affine transformation for skew correction. The reported results demonstrate both the accuracy and the efficiency of the system, also in the presence of severe skew. In terms of efficiency, the computational time is reduced by about two orders of magnitude.
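
To make point a) more concrete, the following is a minimal sketch of two-pass connected-component labeling with a union-find structure, followed by a simple hole-filling rule. It is a textbook formulation written for illustration, not the optimized fixed-point DSP implementation described in the paper; the function names, the 4-connectivity, and the rule "a hole is a background region that does not touch the image border" are assumptions.

```python
import numpy as np

def find(parent, x):
    """Return the root of x, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def label_components(mask):
    """Two-pass, 4-connected labeling of a boolean mask using union-find."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    parent = [0]  # index 0 is reserved for "no label"
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            up = int(labels[y - 1, x]) if y > 0 else 0
            left = int(labels[y, x - 1]) if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))      # new provisional label
                labels[y, x] = len(parent) - 1
            elif up and left:
                ru, rl = find(parent, up), find(parent, left)
                root = min(ru, rl)
                parent[max(ru, rl)] = root      # union of the two sets
                labels[y, x] = root
            else:
                labels[y, x] = up or left
    # Second pass: replace provisional labels by compact final labels.
    final = {}
    for y in range(h):
        for x in range(w):
            if labels[y, x]:
                root = find(parent, int(labels[y, x]))
                labels[y, x] = final.setdefault(root, len(final) + 1)
    return labels, len(final)

def fill_holes(mask):
    """Fill background regions that are not connected to the image border."""
    mask = np.asarray(mask, dtype=bool)
    bg, _ = label_components(~mask)
    border = np.unique(np.concatenate([bg[0], bg[-1], bg[:, 0], bg[:, -1]]))
    holes = (bg > 0) & ~np.isin(bg, border)
    return mask | holes
```

Labeling the background rather than the foreground is what turns the labeling step into a hole-filling step: every background label that never reaches the border is enclosed by foreground and can be filled.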

2011 - A Reasoning Engine for Intruders' Localization in Wide Open Areas using a Network of Cameras and RFIDs [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Haider, Razia; Mandreoli, Federica; Martoglia, Riccardo; Prati, Andrea; Sassatelli, Simona
abstract

Wide open areas represent challenging scenarios for surveillance systems, since sensory data can be affected by noise, uncertainty, and distractors. Therefore, the tasks of localizing and identifying targets (e.g., people) in such environments suggest to go beyond the use of camera-only deployments. In this paper, we propose an innovative system relying on the joint use of cameras and RFIDs, allowing us to “map” RFID tags to people detected by cameras and, thus, highlighting potential intruders. To this end, sophisticated filtering techniques preserve the uncertainty of data and overcome the heterogeneity of sensors, while an evidential fusion architecture, based on Transferable Belief Model, combines the two sources of information and manages conflict between them. The conducted experimental evaluation shows very promising results.

2011 - A low-cost system and calibration method for veiling luminance measurement [Relazione in Atti di Convegno]
Cattini, Stefano; Grana, Costantino; Cucchiara, Rita; Rovati, Luigi
abstract

A CCD-based measuring instrument aimed at veiling luminance estimation and the related low-cost calibration method are described. The system may allow the estimation of the optimum luminance levels in road-tunnel lighting, thus both increasing drivers' safety and avoiding wasted energy and hence unjustified higher lighting costs.

2011 - A multi-stage pedestrian detection using monolithic classifiers [Relazione in Atti di Convegno]
Gualdi, G.; Prati, A.; Cucchiara, R.
abstract

Despite the many efforts in finding effective feature sets or accurate classifiers for people detection, few works have addressed ways of reducing the computational burden introduced by the sliding window paradigm. This paper proposes a multi-stage procedure for refining the search for pedestrians using the HOG features and the monolithic SVM classifier. The multi-stage procedure is based on particle-based estimation of pdfs and exploits the margin provided by the classifier to draw more particles on the areas where the classifier's response is higher. This iterative algorithm achieves the same accuracy as sliding window using fewer particles (and thus being more efficient) and, conversely, is more accurate when configured to work at the same computational load. Experimental results on publicly available datasets demonstrate that this method, previously proposed for boosted classifiers only, can be successfully applied to monolithic classifiers. © 2011 IEEE.

2011 - An evidential fusion architecture for people surveillance in wide open areas [Relazione in Atti di Convegno]
Fornaciari, M.; Sottara, D.; Prati, A.; Mello, P.; Cucchiara, R.
abstract

A new evidential fusion architecture is proposed to build a hybrid artificial intelligent system for people surveillance in wide open areas. Authorized people and intruders are identified and localized thanks to the joint employment of cameras and RFID tags. Complex Event Processing and Transferable Belief Model are exploited for handling noisy data and uncertainty propagation. Experimental results on complex synthetic scenarios demonstrate the accuracy of the proposed solution.

2011 - Appearance tracking by transduction in surveillance scenarios [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

We propose a formulation of the people tracking problem as a Transductive Learning (TL) problem. TL is an effective semi-supervised learning technique by which many classification problems have been recently reinterpreted as learning labels from incomplete datasets. In our proposal the joint exploitation of spectral graph theory and Riemannian manifold learning tools leads to the formulation of a robust approach for appearance based tracking in Video Surveillance scenarios. The key advantage of the presented method is a continuously updated model of the tracked target, used in the TL process, that allows the target visual appearance to be learned on-line and consequently improves the tracker accuracy. Experiments on public datasets show an encouraging advancement over alternative state-of-the-art techniques.

2011 - Automatic segmentation of digitalized historical manuscripts [Articolo su rivista]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

The artistic content of historical manuscripts provides a lot of challenges in terms of automatic text extraction, picture segmentation and retrieval by similarity. In particular this work addresses the problem of automatic extraction of meaningful pictures, distinguishing them from handwritten text and floral and abstract decorations. The proposed solution firstly employs a circular statistics description of a directional histogram in order to extract text. Then visual descriptors are computed over the pictorial regions of the page: the semantic content is distinguished from the decorative parts using color histograms and a novel texture feature called Gradient Spatial Dependency Matrix. The feature vectors are finally processed using an embedding procedure which allows increased performance in later SVM classification. Results for both feature extraction and embedding based classification are reported, supporting the effectiveness of the proposal on high resolution replicas of artistic manuscripts.

2011 - Contextual Information and Covariance Descriptors for People Surveillance: An Application for Safety of Construction Workers [Articolo su rivista]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

In computer science, contextual information can be used both to reduce computations and to increase accuracy. This paper discusses how it can be exploited for people surveillance in very cluttered environments in terms of perspective (i.e., weak scene calibration) and appearance of the objects of interest (i.e., relevance feedback on the training of a classifier). These techniques are applied to a pedestrian detector that uses a LogitBoost classifier, appropriately modified to work with covariance descriptors which lie on Riemannian manifolds. On each detected pedestrian, a similar classifier is employed to obtain a precise localization of the head. Two novelties on the algorithms are proposed in this case: polar image transformations to better exploit the circular feature of the head appearance and multispectral image derivatives that catch not only luminance but also chrominance variations. The complete approach has been tested on the surveillance of a construction site to detect workers that do not wear the hard hat: in such scenarios, the complexity and dynamics are very high, making pedestrian detection a real challenge.

2011 - Detecting Anomalies in People’s Trajectories using Spectral Graph Analysis [Articolo su rivista]
Calderara, Simone; Uri, Heinemann; Prati, Andrea; Cucchiara, Rita; Naftali, Tishby
abstract

Video surveillance is becoming the technology of choice for monitoring crowded areas for security threats. While video provides ample information for human inspectors, there is a great need for robust automated techniques that can efficiently detect anomalous behavior in streaming video from single or multiple cameras. In this work we synergistically combine two state-of-the-art methodologies. The first is the ability to track and label single person trajectories in a crowded area using multiple video cameras, and the second is a new class of novelty detection algorithms based on spectral analysis of graphs. By representing the trajectories as sequences of transitions between nodes in a graph, shared individual trajectories capture only a small subspace of the possible trajectories on the graph. This subspace is characterized by large connected components of the graph, which are spanned by the eigenvectors with the low eigenvalues of the graph Laplacian matrix. Using this technique, we develop robust invariant distance measures for detecting anomalous trajectories, and demonstrate their application on real video data.
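
The link between connected components and the low end of the Laplacian spectrum can be checked on a toy example. The sketch below is only an illustration of that standard property: the node numbering, the toy trajectories, and the function names are assumptions, and the invariant distance measures developed in the paper are not reproduced here.

```python
import numpy as np

def transition_graph(trajectories, n_nodes):
    """Symmetric weight matrix counting transitions between graph nodes."""
    weights = np.zeros((n_nodes, n_nodes))
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            if a != b:
                weights[a, b] += 1.0
                weights[b, a] += 1.0
    return weights

def laplacian_spectrum(weights):
    """Eigenvalues (ascending) and eigenvectors of L = D - W."""
    laplacian = np.diag(weights.sum(axis=1)) - weights
    return np.linalg.eigh(laplacian)

# Toy data: trajectories as node sequences; nodes 0-2 and 3-5 never mix.
trajectories = [[0, 1, 2], [2, 1, 0], [3, 4, 5], [5, 4, 3]]
W = transition_graph(trajectories, n_nodes=6)
eigenvalues, eigenvectors = laplacian_spectrum(W)

# The multiplicity of the (near-)zero eigenvalue equals the number of
# connected components, and the matching eigenvectors span their indicators.
n_components = int(np.sum(eigenvalues < 1e-9))
```

In this toy graph the two groups of nodes form two components, so two eigenvalues are numerically zero; a trajectory that jumps between the groups would fall outside the subspace spanned by the corresponding eigenvectors.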

2011 - Energy-efficient Feedback Tracking on Embedded Smart Cameras by Hardware-level Optimization [Relazione in Atti di Convegno]
M., Casares; Santinelli, Paolo; S., Velipasalar; Prati, Andrea; Cucchiara, Rita
abstract

Embedded systems have limited processing power, memory and energy. When camera sensors are added to an embedded system, the problem of limited resources becomes even more pronounced. In this paper, we introduce two methodologies to increase the energy-efficiency and battery-life of an embedded smart camera by hardware-level operations when performing object detection and tracking. The CITRIC platform is employed as our embedded smart camera. First, down-sampling is performed at hardware level on the micro-controller of the image sensor rather than performing software-level down-sampling at the main microprocessor of the camera board. In addition, instead of performing object detection and tracking on the whole image, we first estimate the location of the target in the next frame, form a search region around it, then crop the next frame by using the HREF and VSYNC signals at the micro-controller of the image sensor, and perform detection and tracking only in the cropped search region. Thus, the amount of data that is moved from the image sensor to the main memory at each frame is optimized. Also, we can adaptively change the size of the cropped window during tracking depending on the object size. Reducing the amount of transferred data, better use of the memory resources, and delegating image down-sampling and cropping tasks to the micro-controller on the image sensor result in a significant decrease in energy consumption and increase in battery-life. Experimental results show that hardware-level down-sampling and cropping, and performing detection and tracking in cropped regions provide a 41.24% decrease in energy consumption, and a 107.2% increase in battery-life. Compared to performing software-level down-sampling and processing whole frames, the proposed methodology provides an additional 8 hours of continuous processing on 4 AA batteries, increasing the lifetime of the camera to 15.5 hours.

2011 - Energy-efficient Object Detection and Tracking on Embedded Smart Cameras by Hardware-level Operations at the Image Sensor [Relazione in Atti di Convegno]
M., Casares; Santinelli, Paolo; S., Velipasalar; Prati, Andrea; Cucchiara, Rita
abstract

Embedded smart cameras have limited processing power, memory and energy. In this paper, we introduce two methodologies to increase the energy-efficiency and the battery-life of an embedded smart camera by hardware-level operations when performing object detection and tracking. We use the CITRIC platform as our embedded smart camera. We first perform down-sampling at hardware-level on the microcontroller of the image sensor rather than performing software-level down-sampling at the main microprocessor of the camera board. In addition, instead of performing object detection on the whole image, we first estimate the location of the target in the next frame, form a search region around it, then crop the next frame by using the HREF and VSYNC signals at the microcontroller of the image sensor, and perform detection and tracking only in the cropped search region. Thus, the amount of data that is moved from the image sensor to the main memory at each frame is greatly reduced. Thanks to reduced data transfer, better use of the memory resources and not occupying the main microprocessor with image down-sampling and cropping tasks, we obtain significant savings in energy consumption and battery-life. Experimental results show that hardware-level down-sampling and cropping, and performing detection in cropped regions provide a 54.14% decrease in energy consumption, and a 121.25% increase in battery-life compared to performing software-level down-sampling and processing whole frames.

2011 - Energy-efficient foreground object detection on embedded smart cameras by hardware-level operations [Relazione in Atti di Convegno]
Casares, M.; Santinelli, P.; Velipasalar, S.; Prati, A.; Cucchiara, R.
abstract

Embedded smart cameras have limited processing power, memory and energy. In this paper, we introduce two methodologies to increase the energy-efficiency and the battery-life of an embedded smart camera by hardware-level operations when performing foreground object detection. We use the CITRIC platform as our embedded smart camera. We first perform down-sampling at hardware level on the micro-controller of the image sensor rather than performing software-level down-sampling at the main microprocessor of the camera board. In addition, we crop an image frame at hardware level by using the HREF and VSYNC signals at the micro-controller of the image sensor to perform foreground object detection only in the cropped search region instead of the whole image. Thus, the amount of data that is moved from the image sensor to the main memory at each frame, is greatly reduced. Thanks to reduced data transfer, better use of the memory resources and not occupying the main microprocessor with image down-sampling and cropping tasks, we obtain significant savings in energy consumption and battery-life. Experimental results show that hardware-level down-sampling and cropping, and performing detection in cropped regions provide 54.14% decrease in energy consumption, and 121.25% increase in battery-life compared to performing software-level down-sampling and processing whole frames. © 2011 IEEE.

2011 - Feature Space Warping Relevance Feedback with Transductive Learning [Relazione in Atti di Convegno]
Borghesani, Daniele; Coppi, Dalia; Grana, Costantino; Calderara, Simone; Cucchiara, Rita
abstract

Relevance feedback is a widely adopted approach to improve content-based information retrieval systems by keeping the user in the retrieval loop. Among the fundamental relevance feedback approaches, feature space warping has been proposed as an effective approach for bridging the gap between high-level semantics and the low-level features. Recently, combination of feature space warping and query point movement techniques has been proposed in contrast to learning based approaches, showing good performance under different data distributions. In this paper we propose to merge feature space warping and transductive learning, in order to benefit from both the ability of adapting data to the user hints and the information coming from unlabeled samples. Experimental results on an image retrieval task reveal significant performance improvements from the proposed method.

2011 - Identification of Intruders in Groups of People using Cameras and RFIDs [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Haider, Razia; Mandreoli, Federica; Prati, Andrea
abstract

The identification of intruders in groups of people moving in wide open areas represents a challenging scenario where coordination between cameras can be certainly used but this solution is not enough. In this paper, we propose to go beyond pure vision-based approaches by integrating the use of distributed cameras with the RFID technology. To this end, we introduce a system that “maps” RFID tags to people detected by cameras by using sophisticated techniques to filter the singular modalities and an evidential fusion architecture, based on Transferable Belief Model, to combine the two sources of information and manage conflict between them. The conducted experimental evaluation shows very promising results, especially in treating groups of people.

2011 - Iterative active querying for surveillance data retrieval in crime detection and forensics [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Large sets of visual data are now available, both in real time and off line, at time of investigation in multimedia forensics; however, passive querying systems often encounter difficulties in retrieving significant results. In this paper we propose an iterative active querying system for video surveillance and forensic applications based on the continuous interaction between the user and the system. The positive and negative user feedback is exploited as the input of a graph based transductive procedure for iteratively refining the initial query results. Experiments are shown using people trajectories and people appearance as distance metrics.

2011 - Joint ACM workshop on human gesture and behavior understanding (J-HGBU'11) [Relazione in Atti di Convegno]
Pantic, M.; Pentland, A.; Vinciarelli, A.; Cucchiara, R.; Daoudi, M.; Del Bimbo, A.
abstract

The ability to understand social signals of a person we are communicating with is the core of social intelligence. Social Intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. At the same time, human-centric multimedia applications for humans and about humans are becoming increasingly important. 3D modeled human-objects, like bodies, heads and faces are exploited for animation, security, and human computer interaction, while three dimensional motion of arms, legs and local body features is used for more complete human gesture, activity and behavior analysis. The Joint Human Gesture and Behavior Understanding (J-HGBU) workshop event consists of two parts focusing on these complementary challenges: the Workshop on Multimedia Access to 3D Human Objects (MA3HO'11) and the Workshop on Social Signal Processing (SSPW'11). © 2011 ACM.

2011 - MA3HO'11 foreword [Relazione in Atti di Convegno]
Cucchiara, R.; Daoudi, M.; Del Bimbo, A.
abstract

2011 - Markerless Body Part Tracking for Action Recognition [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognising human actions by tracking body parts without using artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians (MoG) is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The evolution of the mixture in time is analysed by transforming it in a sequence of symbols (corresponding to a MoG). The similarity between actions is computed by applying global alignment and dynamic programming techniques to the corresponding sequences and using a variational approximation of the Kullback-Leibler divergence to measure the dissimilarity between two MoGs. Experiments on publicly available datasets and comparison with existing methods are provided.
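
The global alignment step can be sketched with a standard dynamic programming recurrence (Needleman-Wunsch style). This is a generic illustration under stated assumptions: the function name `global_alignment_cost`, the fixed gap penalty, and the 0/1 symbol dissimilarity used in the usage lines are placeholders. In the paper the symbol dissimilarity is a variational approximation of the Kullback-Leibler divergence between MoGs, which is not implemented here.

```python
def global_alignment_cost(seq_a, seq_b, dist, gap=1.0):
    """Minimum cost of globally aligning two symbol sequences.

    dist(a, b) is the dissimilarity between two symbols (placeholder for
    a divergence between mixtures); gap is the cost of skipping a symbol.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + dist(seq_a[i - 1], seq_b[j - 1]),  # match
                cost[i - 1][j] + gap,                                   # skip in a
                cost[i][j - 1] + gap,                                   # skip in b
            )
    return cost[n][m]

# Usage with a hypothetical 0/1 symbol dissimilarity.
zero_one = lambda a, b: 0.0 if a == b else 1.0
same = global_alignment_cost("abc", "abc", zero_one)
one_gap = global_alignment_cost("abc", "ac", zero_one)
```

Because the alignment tolerates insertions and deletions, two executions of the same action performed at different speeds can still obtain a low cost.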

2011 - Mixtures of von Mises Distributions for People Trajectory Shape Analysis [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

People trajectory analysis is a recurrent task in many pattern recognition applications, such as surveillance, behavior analysis, video annotation, and many others. In this paper we propose a new framework for analyzing trajectory shape, invariant to spatial shifts of the people motion in the scene. In order to cope with the noise and the uncertainty of the trajectory samples, we propose to describe the trajectories as a sequence of angles modelled by distributions of circular statistics, i.e. a mixture of von Mises (MovM) distributions. To deal with MovM, we define a new specific EM algorithm for estimating the parameters and derive a closed form of the Bhattacharyya distance between single vM pdfs. Trajectories are then modelled with a sequence of symbols, corresponding to the most suitable distribution in the mixture, and compared with each other after a global alignment procedure to cope with trajectories of different lengths. The trajectories in the training set are clustered according to their shape similarity in an off-line phase, and testing trajectories are then classified with a specific on-line EM, based on sufficient statistics. The approach is particularly suitable for classifying people trajectories in video surveillance, searching for abnormal (i.e. infrequent) paths. Tests on synthetic and real data are provided, along with a complete comparison with other circular statistical and alignment methods.
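
A small sketch can show why a sequence of angles is invariant to spatial shifts and what a single von Mises density looks like. Everything here is illustrative: the function names and the toy trajectory are assumptions, and neither the EM algorithm for the mixture nor the closed-form Bhattacharyya distance from the paper is reproduced.

```python
import numpy as np

def direction_angles(points):
    """Angles of the displacement vectors between consecutive 2D points."""
    pts = np.asarray(points, dtype=float)
    steps = np.diff(pts, axis=0)
    return np.arctan2(steps[:, 1], steps[:, 0])

def von_mises_pdf(theta, mu, kappa):
    """von Mises density with mean direction mu and concentration kappa."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

def circular_mean(angles):
    """Mean direction, computed on the unit circle rather than on the line."""
    angles = np.asarray(angles, dtype=float)
    return np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())

# Toy trajectory: moving the whole path elsewhere in the scene leaves
# the angle sequence unchanged.
path = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
angles = direction_angles(path)
shifted_angles = direction_angles(path + np.array([10.0, -3.0]))
```

Working with angles also explains the need for circular statistics: directions of 179 and -179 degrees are nearly identical, which an ordinary arithmetic mean would miss and the circular mean handles correctly.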

2011 - Multi-view people surveillance using 3D information [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita; A., Utasi; C., Benedek; T., Sziranyi
abstract

In this paper we introduce a novel surveillance system, which uses 3D information extracted from multiple cameras to detect, track and re-identify people. The detection method is based on a 3D Marked Point Process model using two pixel-level features extracted from multi-plane projections of binary foreground masks, and uses a stochastic optimization framework to estimate the position and the height of each person. We apply a rule based Kalman-filter tracking on the detection results to find the object-to-object correspondence between consecutive time steps. Finally, a 3D body model based long-term tracking module connects broken tracks and is also used to re-identify people.

2011 - Optimal Decision Trees Generation from OR-Decision Tables [Relazione in Atti di Convegno]
Grana, Costantino; Montangero, Manuela; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we present a novel dynamic programming algorithm to synthesize an optimal decision tree from OR-decision tables, an extension of standard decision tables that allows choosing between several alternative actions in the same rule. Experiments are reported, showing the computational time improvements over state-of-the-art implementations of connected components labeling obtained with this modelling technique.

2011 - People appearance tracing in video by spectral graph transduction [Relazione in Atti di Convegno]
Coppi, Dalia; Calderara, Simone; Cucchiara, Rita
abstract

Following people in different video sources is a challenging task: variations in the type of camera, in the lighting conditions, in the scene settings (e.g. crowd or occlusions) and in the point of view must be accounted for. In this paper we propose a system based only on appearance information that, disregarding temporal and spatial information, can be flexibly applied to both moving and static cameras. We exploit the joint use of transductive learning and spectral properties of graph Laplacians, proposing a formulation of the people tracing problem as a semi-supervised classification. The knowledge encoded in two labeled input sets of positive and negative samples of the target person and the continuous spectral update of these models allow us to obtain a robust approach for people tracing in surveillance video sequences. Experiments on publicly available datasets show satisfactory results and good robustness in dealing with short- and long-term occlusions.

2011 - Probabilistic people tracking with appearance models and occlusion classification: The AD-HOC system [Articolo su rivista]
Vezzani, Roberto; Grana, Costantino; Cucchiara, Rita
abstract

AD-HOC (Appearance Driven Human tracking with Occlusion Classification) is a complete framework for multiple people tracking in video surveillance applications in the presence of large occlusions. The appearance-based approach allows the estimation of the pixel-wise shape of each tracked person even during the occlusion. This feature can be very useful for higher-level processes, such as action recognition or event detection. A first step predicts the position of all the objects in the new frame while a MAP framework provides a solution for best placement. A second step associates each candidate foreground pixel to an object according to mutual object position and color similarity. A novel definition of non-visible regions accounts for the parts of the objects that are not detected in the current frame, classifying them as dynamic, scene or apparent occlusions. Results on surveillance videos are reported, using in-house produced videos and the PETS2006 test set.

2011 - Relevance feedback strategies for artistic image collections tagging [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

This paper provides an analysis of relevance feedback techniques in a multimedia system designed for the interactive exploration and annotation of artistic collections, in particular illuminated manuscripts. Relevance feedback is presented not only as a very effective technique to improve the performance of the system, but also as a clever way to enhance the user experience, mixing interactive surfing through the artistic content with the possibility of gathering valuable information from the user, and consequently improving the user's retrieval satisfaction. We compare a modification of the Mean-Shift Feature Space Warping algorithm, as representative of the standard RF procedures, and a learning-based technique based on transduction, considered in order to overcome some limitations of the previous technique. Experiments are reported regarding the adopted visual features based on covariance matrices.

2011 - SARC3D: a new 3D body model for People Tracking and Re-identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

We propose a new simplified 3D body model (called Sarc3D) for surveillance applications, which can be created, updated and compared in real-time. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. For each vertex of the model a signature (color features, reliability and saliency) is computed from the 2D appearance images and exploited for matching. This approach achieves robustness against partial occlusions, pose and viewpoint changes. The complete proposal and a full experimental evaluation are presented, using a new benchmark suite and the PETS2009 dataset.

2011 - Using Monolithic Classifiers On Multi-stage Pedestrian Detection [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

Despite the many efforts in finding effective feature sets or accurate classifiers for people detection, few works have addressed ways of reducing the computational burden introduced by the sliding window paradigm. This paper proposes a multi-stage procedure for refining the search for pedestrians using the HOG features and the monolithic SVM classifier. The multi-stage procedure is based on particle-based estimation of pdfs and exploits the margin provided by the classifier to draw more particles on the areas where the classifier’s response is higher. This iterative algorithm achieves the same accuracy as sliding window using fewer particles (and thus being more efficient) and, conversely, is more accurate when configured to work at the same computational load. Experimental results on publicly available datasets demonstrate that this method, previously proposed for boosted classifiers only, can be successfully applied to monolithic classifiers.

2011 - Vision based smoke detection system using image energy and color information [Articolo su rivista]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

Smoke detection is a crucial task in many video surveillance applications and could have a great impact on raising the level of safety of urban areas. Many commercial smoke detection sensors exist, but most of them cannot be applied in open space or outdoor scenarios. With this aim, the paper presents a smoke detection system that uses a common CCD camera sensor to detect smoke in images and trigger alarms. First, a proper background model is proposed to reliably extract smoke regions and avoid over-segmentation and false positives in outdoor scenarios where many distractors are present, such as moving trees or light reflections. A novel Bayesian approach is adopted to detect smoke regions in the scene, analyzing image energy by means of the Wavelet Transform coefficients and color information. A statistical model of image energy is built, using a temporal Gaussian Mixture, to analyze the energy decay that typically occurs when smoke covers the scene; the detection is then strengthened by evaluating the color blending between a reference smoke color and the input frame. The proposed system is capable of rapidly detecting smoke events in both night and day conditions with a reduced number of false alarms, and hence is particularly suitable for monitoring large outdoor scenarios where common sensors would fail. An extensive experimental campaign on both recorded videos and live cameras evaluates the efficacy and efficiency of the system in many real-world scenarios, such as outdoor storage areas and forests.

2010 - 3D Body Model Construction and Matching for Real Time People Re-Identification [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

Wide area video surveillance always requires extracting and integrating information coming from different cameras and views. Re-identification of people captured from different cameras or different views is one of the most challenging problems. In this paper, we present a novel approach for people matching with vertices-based 3D human models. People are detected and tracked in each calibrated camera, and their silhouette, appearance, position and orientation are extracted and used to place, scale and orientate a 3D body model. Colour features are computed from the 2D appearance images and mapped to the 3D model vertices, generating the 3D model for each tracked person. A distance function between 3D models is defined in order to find matches among models belonging to the same person. This approach achieves robustness against partial occlusions, pose and viewpoint changes. A first experimental evaluation is conducted using images extracted from a real camera set-up.

2010 - A Videosurveillance data browsing software architecture for forensics: From trajectories similarities to video fragments [Relazione in Atti di Convegno]
Aravecchia, M.; Calderara, S.; Chiossi, S.; Cucchiara, R.
abstract

The information contained in digital video surveillance repositories can provide relevant hints, if not legal evidence, during investigations. As the amount of video data often rules out manual search, some tools have been developed in recent years to aid investigators in the look-up process. We propose an application for forensic video analysis which aims at analysing the activities in a given scenario, particularly focusing on trajectories followed by people and their visual appearances. The recorded videos can be browsed by investigators thanks to a user-friendly interface, allowing easy information retrieval through the choice of the best mining strategy. The underlying application architecture implements different feature and query models as well as query optimization strategies in order to return the best response in terms of both efficacy and efficiency.

2010 - Alignment-based Similarity of People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for comparing people trajectories for video surveillance applications, based on semi-directional statistics. In fact, modelling a trajectory as a sequence of angles, speeds and time lags requires a statistical tool capable of jointly considering periodic and linear variables. Our statistical method is compared with two state-of-the-art methods.

2010 - Bag-Of-Words Classification of Miniature Illustrations [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Gualdi, Giovanni; Cucchiara, Rita
abstract

In this paper a system for the analysis of illuminated manuscript images is presented. In particular the bag-of-keypoints strategy, commonly adopted for object recognition, image classification and scene recognition, is applied to the classification of automatically extracted miniatures. Pictures are characterized by SURF descriptors, and a classification procedure is performed, comparing the results of Naive Bayes and histogram intersection distance measures.

2010 - Decision Trees for Fast Thinning Algorithms [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

We propose a new efficient approach for neighborhood exploration, optimized with decision tables and decision trees, suitable for local algorithms in image processing. In this work, it is employed to speed up two widely used thinning techniques. The performance gain is shown over a large freely available dataset of scanned document images.

2010 - Event Driven Software Architecture for Multi-camera and Distributed Surveillance Research Systems [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

Surveillance of wide areas with several connected cameras integrated in the same automatic system is no longer a chimera, but modular, scalable and flexible architectures are mandatory to manage them. This paper points out the main issues in the development of distributed surveillance systems and proposes an integrated framework particularly suitable for research purposes. First, exploiting a computer architecture analogy, a three-layer tracking system is proposed, which copes with the integration of both overlapping and non-overlapping cameras. Then, a static service-oriented architecture is adopted to collect and manage the plethora of high-level modules, such as face detection and recognition, posture and action classification, and so on. Finally, the overall architecture is controlled by an event-driven communication infrastructure, which assures the scalability and the flexibility of the system.

2010 - Fast Background Initialization with Recursive Hadamard Transform [Relazione in Atti di Convegno]
Baltieri, Davide; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new and fast technique for background estimation from cluttered image sequences. Most of the background initialization approaches developed so far collect a number of initial frames and then require a slow estimation step which introduces a delay whenever it is applied. Conversely, the proposed technique redistributes the computational load among all the frames by means of a patch-by-patch preprocessing, which makes the overall algorithm more suitable for real-time applications. For each patch location a prototype set is created and maintained. The background is then iteratively estimated by choosing from each set the most appropriate candidate patch, which should verify a sort of frequency coherence with its neighbors. To this aim, the Hadamard transform has been adopted, which requires less computation time than the commonly used DCT. Finally, a refinement step exploits spatial continuity constraints along the patch borders to prevent erroneous patch selections. The approach has been compared with the state of the art on videos from available datasets (ViSOR and CAVIAR), showing a speed-up of about 10 times and an improved accuracy.
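
To make the computational argument in the abstract above concrete, here is a minimal sketch of an unnormalized fast Walsh-Hadamard transform on a one-dimensional list. It is only an illustration of why the transform is cheap (it uses additions and subtractions only); the function name `fwht` is hypothetical, and the patch handling and recursive update scheme of the paper are not reproduced here.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a
```

Applying `fwht` twice returns the input scaled by its length, so the same routine also serves as the inverse up to a division by `len(values)`.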

2010 - HMM Based Action Recognition with Projection Histogram Features [Relazione in Atti di Convegno]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

Hidden Markov Models (HMM) have been widely used for action recognition, since they make it easy to model the temporal evolution of a single numeric feature or a set of numeric features extracted from the data. The selection of the feature set and the related emission probability function are the key issues to be defined. In particular, if the training set is not sufficiently large, a manual or automatic feature selection and reduction is mandatory. In this paper we propose to model the emission probability function as a Mixture of Gaussians, and the feature set is obtained from the projection histograms of the foreground mask. The projection histograms contain the number of moving pixels for each row and for each column of the frame and they provide sufficient information to infer the instantaneous posture of the person. Then, the HMM framework recovers the temporal evolution of the postures, thereby recognizing the global action. The proposed method has been successfully tested on the UT-Tower and on the Weizmann Datasets.
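
The projection histograms described in the abstract above can be sketched in a few lines. This is an illustrative example only: the function name and the representation of the foreground mask as a list of rows of 0/1 values are assumptions, and any normalization or resampling the authors may apply is not shown.

```python
def projection_histograms(mask):
    """Count foreground (non-zero) pixels per row and per column of a binary mask."""
    rows = [sum(1 for v in row if v) for row in mask]
    cols = [sum(1 for v in col if v) for col in zip(*mask)]
    return rows, cols


# Tiny 3x3 foreground mask used as an example.
example_mask = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
row_hist, col_hist = projection_histograms(example_mask)
```

For the example mask, `row_hist` is `[2, 1, 0]` and `col_hist` is `[0, 2, 1]`; the two lists together would form the per-frame feature vector fed to the HMM.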

2010 - High Performance Connected Components Labeling on FPGA [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Santinelli, Paolo; Cucchiara, Rita
abstract

This paper proposes a comparison of the two most advanced algorithms for connected components labeling, highlighting how they perform on a soft-core SoC architecture based on FPGA. In particular we test our block-based connected components labeling algorithm, optimized with decision tables and decision trees. The embedded system is composed of a CMOS image sensor, FPGA, DDR SDRAM, USB controller and SPI Flash. Results highlight the importance of caching and of instruction and data cache sizes for high performance image processing tasks.

2010 - Improving classification and retrieval of illuminated manuscripts with semantic information [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we detail a proposal for exploiting expert-made commentaries in a unified system for the analysis of illuminated manuscript images. In particular we explore the possibility of improving the automatic segmentation of meaningful pictures, as well as the similarity-based retrieval search engine, using clusters of keywords extracted from commentaries as semantic information.

2010 - Mobile video surveillance systems: An architectural overview [Capitolo/Saggio]
Cucchiara, R.; Gualdi, G.
abstract

The term mobile is now added to most computer-based systems as a synonym for several different concepts, ranging from ubiquity and wireless connection to portability, and so on. In a similar manner, the name mobile video surveillance is also spreading, even though it is often misinterpreted through limited views of it, such as front-end mobile monitoring, wireless video streaming, moving cameras, or distributed systems. This chapter presents an overview of mobile video surveillance systems, focusing in particular on architectural aspects (sensors, functional units and sink modules). A short survey of the state of the art is presented. The chapter also tackles some problems of video streaming and video tracking specifically designed and optimized for mobile video surveillance systems, giving an idea of the best results that can be achieved in these two foundation layers. © 2010 Springer-Verlag.

2010 - Moving pixels in static cameras: detecting dangerous situations due to environment or people [Capitolo/Saggio]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Dangerous situations arise in everyday life and many efforts have been devoted to exploiting technology to increase the level of safety in urban areas. Video analysis is one of the most important emerging technologies for security purposes. Automatic video surveillance systems commonly analyze the scene searching for moving objects. Well-known techniques exist to cope with this problem, which is commonly referred to as "change detection". Every time a difference against a reference model is sensed, it should be analyzed to allow the system to discriminate between a usual situation and a possible threat. When the sensor is a camera, motion is the key element to detect changes, and moving objects must be correctly classified according to their nature. In this context we can distinguish between two different kinds of threat that can lead to dangerous situations in a video-surveilled environment. The first one is due to environmental changes such as rain, fog or smoke present in the scene. These phenomena are sensed by the camera as moving pixels and, subsequently, as moving objects in the scene. Threats of this kind share some common characteristics such as texture, shape and color information and can be detected by observing the evolution of the features over time. The second situation arises when people are directly responsible for the dangerous situation. In this case a subject is acting in an unusual way, leading to an abnormal situation. From the sensor's point of view, moving pixels are still observed, but specific features and time-dependent statistical models should be adopted to learn and then correctly detect unusual and dangerous behaviors. With these premises, this chapter presents two different case studies.
The first one describes the detection of environmental changes in the observed scene and details the problem of reliably detecting smoke in outdoor environments using both motion information and global image features, such as color information and texture energy computed by means of the Wavelet transform. The second refers to the problem of detecting suspicious or abnormal people behaviors by means of people trajectory analysis in a multiple-camera video-surveillance scenario. Specifically, a technique to infer and learn the concept of normality is proposed jointly with a suitable statistical tool to model and robustly compare people trajectories.

2010 - Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

Many works address the problem of object detection by means of machine learning with boosted classifiers. They exploit sliding window search, spanning the whole image: the patches, at all possible positions and sizes, are sent to the classifier. Several methods have been proposed to speed up the search (adding complementary features or using specialized hardware). In this paper we propose a statistical-based search approach for object detection which uses a Monte Carlo sampling approach for estimating the likelihood density function with Gaussian kernels. The estimation relies on a multi-stage strategy where the proposal distribution is progressively refined by taking into account the feedback of the classifier (i.e. its response). For videos, this approach is plugged into a Bayesian-recursive framework which exploits the temporal coherency of the pedestrians. Several tests on both still images and videos from common datasets are provided in order to demonstrate the relevant speedup and the increased localization accuracy with respect to the sliding window strategy, using a pedestrian classifier based on covariance descriptors and a cascade of Logitboost classifiers.

2010 - Mutual Calibration of Camera Motes and RFIDs for People Localization and Identification [Relazione in Atti di Convegno]
Cucchiara, Rita; Fornaciari, Michele; Prati, Andrea; Santinelli, Paolo
abstract

Achieving both localization and identification of people in a wide open area using only cameras can be a challenging task, which imposes cross-cutting requirements: high resolution for identification, but low resolution to obtain wide coverage for localization. Consequently, this paper proposes the joint use of cameras (only devoted to localization) and RFID sensors (devoted to identification) with the final objective of detecting and localizing intruders. To ground the observations on a common coordinate system, a calibration procedure is defined. This procedure only demands a training phase with a single person moving in the scene holding an RFID tag. Although preliminary, the results demonstrate that this calibration is sufficiently accurate to be applied in different scenarios where the area of overlap between the field of view (FoV) of a camera and the "Field of sense" (FoS) of a (blind) sensor must be efficiently determined.

2010 - Optimized Block-based Connected Components Labeling with Decision Trees [Articolo su rivista]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we define a new paradigm for 8-connection labeling, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses. Firstly we exploit and extend the decision table formalism, introducing OR-decision tables, in which multiple alternative actions are managed. An automatic procedure to synthesize the optimal decision tree from the decision table is used, providing the most effective order for evaluating the conditions. Secondly we propose a new scanning technique that moves on a 2x2 pixel grid over the image, which is optimized by the automatically generated decision tree. An extensive comparison with the state-of-the-art approaches is proposed, on both synthetic and real datasets. The synthetic dataset is composed of random images of different sizes and densities, while the real datasets are an artistic image analysis dataset, a document analysis dataset for text detection and recognition, and finally a standard resolution dataset for picture segmentation tasks. The algorithm provides an impressive speedup over the state-of-the-art algorithms.

2010 - People trajectory mining with statistical pattern recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita
abstract

The analysis of people's social interactions is a complex and interesting problem that can be approached from several points of view depending on the application context. In video surveillance contexts many indicators of people's habits and relations exist and, among these, the analysis of people trajectories can reveal many aspects of the way people behave in social environments. We propose a statistical framework for trajectory mining that analyzes, in an integrated solution, several aspects of the trajectories such as location, shape and speed properties. Three different models are proposed to deal with non-idealities of the selected features in conjunction with a robust inexact-matching similarity measure for comparing sequences with different lengths. Experimental results in a real scenario demonstrate the efficacy of the framework in clustering people trajectories with the purpose of analyzing frequent behaviors in complex environments.

2010 - Perspective and Appearance Context for People Surveillance in Open Areas [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

Contextual information can be used both to reduce computations and to increase accuracy and this paper presents how it can be exploited for people surveillance in terms of perspective (i.e. weak scene calibration) and appearance of the objects of interest (i.e. relevance feedback on the training of a classifier). These techniques are applied to a pedestrian detector that exploits covariance descriptors through a LogitBoost classifier on Riemannian manifolds. The approach has been tested on a construction working site where complexity and dynamics are very high, making human detection a real challenge. The experimental results demonstrate the improvements achieved by the proposed approach.

2010 - Polar Representation of Covariance Descriptors for Circular Features [Articolo su rivista]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

The use of polar representation of covariance descriptors, suitable for the classification of circular feature sets, is proposed. It overcomes the implicit limits of state-of-the-art methods based on axis-oriented rectangular patches. The suitability of the proposed solution is verified on two case studies, namely head detection and polymer classification in photomicrograph contexts.

2010 - Rerum Novarum: Interactive Exploration of Illuminated Manuscripts [Relazione in Atti di Convegno]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

This paper describes an interactive application for the exploration and annotation of illuminated manuscripts, which typically contain thousands of pictures, used to comment or embellish the manuscript Gothic text. The system is composed of a modern user interface for browsing, surfing and querying, an automatic segmentation module, to ease the initial picture extraction task, and a similarity-based retrieval engine, used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results.

2010 - Surfing on Artistic Documents with Visually Assisted Tagging [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

This paper describes a complete architecture for the interactive exploration and annotation of artistic collections. In particular the focus is on Renaissance illuminated manuscripts, which typically contain thousands of pictures, used to comment or embellish the manuscript Gothic text. The final aim is to create a human-centered multimedia application allowing non-practitioners to enjoy these masterpieces and expert users to share their knowledge. The system is composed of a modern user interface for browsing, surfing and querying, an automatic segmentation module, to ease the initial picture extraction task, and a similarity-based retrieval engine, used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results. Experiments are reported regarding the adopted visual features based on covariance matrices and the Mean Shift Feature Space Warping relevance feedback. Finally some hints on the user interface for museum installations are discussed.

2010 - Unsupervised Learning in Body-area Networks [Relazione in Atti di Convegno]
Bicocchi, Nicola; Lasagni, Matteo; Mamei, Marco; Prati, Andrea; Cucchiara, Rita; Zambonelli, Franco
abstract

Pattern recognition is becoming a key application in body-area networks. This paper presents a framework promoting unsupervised training for multi-modal, multi-sensor classification systems. Specifically, it enables sensors provided with pattern-recognition capabilities to autonomously supervise the learning process of other sensors. The approach is discussed using a case study combining a smart camera and a body-worn accelerometer. The body-worn accelerometer sensor is trained to recognize four user activities by pairing accelerometer data with labels coming from the camera. Experimental results illustrate the applicability of the approach in different conditions.

2010 - Video Surveillance Online Repository (ViSOR): an integrated framework [Articolo su rivista]
Vezzani, Roberto; Cucchiara, Rita
abstract

The availability of new techniques and tools for Video Surveillance and the capability of storing huge amounts of visual data acquired by hundreds of cameras every day call for a convergence between pattern recognition, computer vision and multimedia paradigms. A clear need for this convergence is shown by new research projects which attempt to exploit both ontology-based retrieval and video analysis techniques also in the field of surveillance. This paper presents the ViSOR (Video Surveillance Online Repository) framework, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Annotations are based on a reference ontology which has been defined integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, which is aimed at identifying the spatial, temporal and domain detail level used. The ViSOR web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.

2010 - Video sorveglianza per l'individuazione di persone e l'analisi comportamentale [Articolo su rivista]
Cucchiara, Rita
abstract

This article discusses the new frontiers of computer vision in the video surveillance of people in public and private environments, and in particular behavioral analysis. Some projects under way at the ImageLab in Modena are then presented.

2009 - 1st ACM Workshop on Multimedia in Forensics - MiFor 09, Co-located with the 2009 ACM International Conference on Multimedia, MM 09: Foreword [Relazione in Atti di Convegno]
Cucchiara, R.; Worring, M.
abstract

2009 - A Fast Multi-model Approach for Object Duplicate Extraction [Relazione in Atti di Convegno]
Piccinini, Paolo; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents an innovative approach for localizing and segmenting duplicate objects for industrial applications. The working conditions are challenging, with complex heavily-occluded objects, arranged at random in the scene. To account for high flexibility and processing speed, this approach exploits SIFT keypoint extraction and mean shift clustering to efficiently partition the correspondences between the object model and the duplicates onto the different object instances. The re-projection (by means of a Euclidean transform) of some delimiting points onto the current image is used to segment the object shapes. This procedure is compared in terms of accuracy with existing homography-based solutions which make use of RANSAC to eliminate outliers in the homography estimation. Moreover, in order to improve the extraction in the case of reflective or transparent objects, multiple object models are used and fused together. Experimental results on different and challenging kinds of objects are reported.

2009 - A Real-Time System for Abnormal Path Detection [Relazione in Atti di Convegno]
Calderara, Simone; C., Alaimo; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a real-time system capable of extracting and modeling object trajectories from a multi-camera setup with the aim of identifying abnormal paths. The trajectories are modeled as a sequence of positional distributions (2D Gaussians) and clustered in the training phase by exploiting an innovative distance measure based on a global alignment technique and Bhattacharyya distance between Gaussians. An on-line classification procedure is proposed in order to classify new trajectories on the fly as either “normal” or “abnormal” (in the sense of rarely seen before, thus unusual and potentially interesting). Experiments on a real scenario will be presented.
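
As an illustration of the distance between positional distributions mentioned above, the following minimal sketch computes the standard closed-form Bhattacharyya distance between two 2D Gaussians. The function name and the tuple-based representation of means and covariances are assumptions made for this example; how the paper combines this distance with global alignment is not reproduced here.

```python
import math


def bhattacharyya_2d(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 2D Gaussians.

    mu1, mu2: (x, y) means; cov1, cov2: 2x2 covariance matrices (nested lists).
    The function name and data layout are illustrative assumptions.
    """
    # Average covariance S = (S1 + S2) / 2
    s = [[(cov1[i][j] + cov2[i][j]) / 2.0 for j in range(2)] for i in range(2)]

    def det(m):
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    det_s, det_1, det_2 = det(s), det(cov1), det(cov2)
    # Closed-form inverse of the 2x2 average covariance
    inv = [[s[1][1] / det_s, -s[0][1] / det_s],
           [-s[1][0] / det_s, s[0][0] / det_s]]
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Quadratic form d^T S^-1 d
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return quad / 8.0 + 0.5 * math.log(det_s / math.sqrt(det_1 * det_2))
```

With identity covariances, the distance reduces to one eighth of the squared Euclidean distance between the means.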

2009 - AI*IA 2009: Emergent Perspectives in Artificial Intelligence, XIth International Conference of the Italian Association for Artificial Intelligence [Curatela]
Serra, Roberto; Cucchiara, Rita
abstract

Proceedings of the XIth International Conference on Artificial Intelligence

2009 - An efficient Bayesian framework for on-line action recognition [Relazione in Atti di Convegno]
Vezzani, Roberto; Piccardi, Massimo; Cucchiara, Rita
abstract

On-line action recognition from a continuous stream of actions is still an open problem with fewer solutions proposed compared to time-segmented action recognition. The most challenging task is to classify the current action while finding its time boundaries at the same time. In this paper we propose an approach capable of performing on-line action segmentation and recognition by means of batteries of HMMs taking into account all the possible time boundaries and action classes. A suitable Bayesian normalization is applied to make observation sequences of different length comparable and computational optimizations are introduced to achieve real-time performance. Results on a well known action dataset prove the efficacy of the proposed method.

2009 - Automatic Analysis of Historical Manuscripts [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper a document analysis tool for historical manuscripts is proposed. The goal is to automatically segment layout components of the page, that is text, pictures and decorations. We specifically focused on the pictures, proposing a set of visual features able to identify significant pictures and separating them from all the floral and abstract decorations. The analysis is performed by blocks using a limited set of color and texture features, including a new texture descriptor particularly effective for this task, namely Gradient Spatial Dependency Matrix. The feature vectors are processed by an embedding procedure which allows increased performance in later SVM classification.

2009 - Color features performance comparison for image retrieval [Relazione in Atti di Convegno]
Borghesani, Daniele; Grana, Costantino; Cucchiara, Rita
abstract

This paper proposes a comparison of color features for image retrieval. In particular the UCID image database has been employed to compare the retrieval capabilities of different color descriptors. The set of descriptors comprises global and spatially related features, and the tests show that HSV based global features provide the best performance at varying brightness and contrast settings.
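
A global HSV feature of the kind compared above can be sketched as a normalized HSV histogram. In this minimal example the bin counts, the function names and the use of histogram intersection as the similarity score are assumptions for illustration; the abstract does not state which descriptors or measures were actually evaluated.

```python
import colorsys


def hsv_histogram(pixels, bins=(8, 4, 4)):
    """Normalized global HSV histogram of (r, g, b) pixels with values in 0..255.

    The bin layout is an illustrative assumption.
    """
    hb, sb, vb = bins
    hist = [0.0] * (hb * sb * vb)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 falls in the last bin
        hi = min(int(h * hb), hb - 1)
        si = min(int(s * sb), sb - 1)
        vi = min(int(v * vb), vb - 1)
        hist[(hi * sb + si) * vb + vi] += 1.0
        count += 1
    return [c / count for c in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Two images would then be ranked by `histogram_intersection(hsv_histogram(a), hsv_histogram(b))`, where `a` and `b` are their pixel lists.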

2009 - Connected component labeling techniques on modern architectures [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we present an overview of the historical evolution of connected component labeling algorithms, and in particular the ones applied on images stored in raster scan order. This brief survey aims at providing a comprehensive comparison of their performance on modern architectures, since the high availability of memory and the presence of caches make some solutions more suitable and fast. Moreover we propose a new strategy for label propagation based on 2x2 blocks, which improves the performance of many existing algorithms. The tests are conducted on high resolution images obtained from digitized historical manuscripts and a set of transformations is applied in order to show the algorithms' behavior at different image resolutions and with a varying number of labels.

2009 - Covariance Descriptors on Moving Regions for Human Detection in Very Complex Outdoor Scenes [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

The detection of humans in very complex scenes can be very challenging, due to the performance degradation of classical motion detection and tracking approaches. An alternative approach is the detection of human-like patterns over the whole image. The present paper follows this line by extending Tuzel et al.’s technique [1] based on covariance descriptors and LogitBoost algorithm applied over Riemannian manifolds. Our proposal represents a significant extension of it by: (a) exploiting motion information to focus the attention over areas in which motion is present or was present in the recent past; (b) enriching the human classifier by additional, dedicated cascades trained on positive and negative samples taken from the specific scene; (c) using a rough estimation of the scene perspective, to reduce false detections and improve system performance. This approach is suitable in multi-camera scenarios, since the monolithic block for human-detection remains the same for the whole system, whereas the parameter tuning and set-up of the three proposed extensions (the only camera-dependent parts of the system), are automatically computed for each camera. The approach has been tested on a construction working site in which complexity and dynamics are very high, making human detection a real challenge. The experimental results demonstrate the improvements achieved by the proposed approach.

2009 - Dynamic Pictorially Enriched Ontologies for Digital Video Libraries [Articolo su rivista]
M., Bertini; A., Del Bimbo; Serra, Giuseppe; C., Torniai; Cucchiara, Rita; Grana, Costantino; Vezzani, Roberto
abstract

This article presents a framework for automatic semantic annotation of video streams with an ontology that includes concepts expressed using linguistic terms and visual data.

2009 - Fast Block Based Connected Components Labeling [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this paper we present a new optimization technique for the neighborhood computation in connected component labeling focused on images stored in raster scan order. This new technique is based on a 2x2 square block analysis of the image, and it exploits the fact that, when using 8-connection, the pixels of a 2x2 square are all connected to each other. This implies that they will share the same label at the end of the computation. To prove the effectiveness of our proposal, we show a comprehensive comparison of the most used and advanced connected components labeling techniques presented so far. The tests are conducted on high resolution images obtained from digitized historical manuscripts and a set of transformations is applied in order to show the algorithms' behavior at different image resolutions and with a varying number of labels.
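
The 2x2 block property described above can be illustrated with a deliberately simple sketch: connectivity is resolved between blocks with a union-find structure, and every foreground pixel of a block receives the block's label. This is an unoptimized illustration under assumed names and data layout (a binary image as a list of rows), not the optimized neighborhood computation proposed in the paper.

```python
def label_components_by_blocks(img):
    """Label 8-connected foreground components of a binary image (rows of 0/1).

    Connectivity is resolved between 2x2 blocks rather than between pixels.
    Returns a label image of the same size, with 0 for background.
    """
    h, w = len(img), len(img[0])
    bh, bw = (h + 1) // 2, (w + 1) // 2  # block grid, covering odd sizes too

    def fg_pixels(by, bx):
        # Foreground pixel coordinates inside block (by, bx)
        return [(y, x)
                for y in (2 * by, 2 * by + 1)
                for x in (2 * bx, 2 * bx + 1)
                if y < h and x < w and img[y][x]]

    parent = {}  # union-find forest over blocks that contain foreground

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path halving
            b = parent[b]
        return b

    def touching(p, q):
        # True if some pixel of p is 8-adjacent to some pixel of q
        return any(abs(y1 - y2) <= 1 and abs(x1 - x2) <= 1
                   for y1, x1 in p for y2, x2 in q)

    pixels = {}
    for by in range(bh):
        for bx in range(bw):
            cur = fg_pixels(by, bx)
            if not cur:
                continue
            pixels[(by, bx)] = cur
            parent[(by, bx)] = (by, bx)
            # Only blocks already visited in raster order need checking
            for nb in ((by, bx - 1), (by - 1, bx - 1),
                       (by - 1, bx), (by - 1, bx + 1)):
                if nb in pixels and touching(cur, pixels[nb]):
                    ra, rb = find((by, bx)), find(nb)
                    if ra != rb:
                        parent[max(ra, rb)] = min(ra, rb)

    labels = [[0] * w for _ in range(h)]
    ids = {}  # block root -> consecutive label
    for block, pts in pixels.items():
        lab = ids.setdefault(find(block), len(ids) + 1)
        for y, x in pts:
            labels[y][x] = lab
    return labels
```

Because all foreground pixels inside a 2x2 block are mutually 8-connected, the union-find structure holds at most one entry per block instead of one per pixel.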

2009 - Learning People Trajectories using Semi-directional Statistics [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a system for people trajectory shape analysis by exploiting a statistical approach which accounts for sequences of both directional (the directions of the trajectory) and linear (the speeds) data. A semi-directional distribution (AWLG - Approximated Wrapped and Linear Gaussian) is used with a mixture to find main directions and speeds. A variational version of the mutual information criterion is proposed to prove the statistical dependency of the data. Then, in order to compare data sequences, we define an inexact method with a Kullback-Leibler-based distance measure and employ a global alignment technique to handle sequences of different lengths and with local shifts or deformations. A comprehensive analysis of variable dependency and parameter estimation techniques is reported and evaluated on both synthetic and real data sets.
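
The role of global alignment in comparing sequences of different lengths can be sketched with a Needleman-Wunsch-style dynamic program. Note the substitutions: this example aligns raw direction angles and uses a plain circular difference as the local cost, in place of the Kullback-Leibler-based measure between distributions used in the paper; the gap penalty and all names are assumptions.

```python
import math


def angular_diff(a, b):
    """Smallest absolute difference between two angles (radians), in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def global_alignment_cost(seq_a, seq_b, gap=math.pi / 2):
    """Minimum total cost of globally aligning two sequences of angles.

    Matching two symbols costs their angular difference; skipping a symbol
    in either sequence costs `gap` (an assumed, tunable penalty).
    """
    n, m = len(seq_a), len(seq_b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + angular_diff(seq_a[i - 1], seq_b[j - 1]),
                cost[i - 1][j] + gap,   # skip a symbol of seq_a
                cost[i][j - 1] + gap,   # skip a symbol of seq_b
            )
    return cost[n][m]
```

Gaps absorb local shifts, so a trajectory with one extra sample differs from the original by a single gap penalty rather than by a cascade of mismatches.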

2009 - Multimedia in forensics [Relazione in Atti di Convegno]
Worring, M.; Cucchiara, R.
abstract

2009 - Multiple Object Segmentation for Pick-and-Place Applications [Relazione in Atti di Convegno]
Piccinini, Paolo; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a novel approach for detecting multiple instances of the same object for pick-and-place automation. The working conditions are very challenging, with complex objects, arranged at random in the scene, and heavily occluded. This approach exploits SIFT to obtain a set of correspondences between the object model and the current image. In order to segment the multiple instances of the object, the correspondences are clustered among the objects using a voting scheme which determines the best estimate of the object’s center through mean shift. This procedure is compared in terms of accuracy with existing homography-based solutions which make use of RANSAC to eliminate outliers in the homography estimation.
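
The voting step can be sketched as follows, assuming each matched keypoint has already cast a 2D vote for the center of the object it belongs to. A flat-kernel mean shift then moves every vote to its local mode, and votes that converge to the same mode are grouped into one instance. The bandwidth, the merging radius and the function name are assumptions; the SIFT matching that produces the votes is outside this example.

```python
import math


def mean_shift_modes(votes, bandwidth, iters=50, tol=1e-3):
    """Flat-kernel mean shift over 2D center votes.

    Returns a list of ((x, y), support) pairs, one per detected mode, where
    support is the number of votes that converged to that mode.
    """
    modes = []  # each entry is [x, y, support]
    for vx, vy in votes:
        x, y = vx, vy
        for _ in range(iters):
            near = [(px, py) for px, py in votes
                    if math.hypot(px - x, py - y) <= bandwidth]
            if not near:
                break
            nx = sum(p[0] for p in near) / len(near)
            ny = sum(p[1] for p in near) / len(near)
            moved = math.hypot(nx - x, ny - y)
            x, y = nx, ny
            if moved < tol:
                break
        # Merge with an already found mode if close enough (assumed radius)
        for m in modes:
            if math.hypot(m[0] - x, m[1] - y) <= bandwidth / 2:
                m[2] += 1
                break
        else:
            modes.append([x, y, 1])
    return [((m[0], m[1]), m[2]) for m in modes]
```

Modes with low support can be discarded as spurious matches, leaving one estimated center per object instance.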

2009 - Multiple object detection for pick-and-place applications [Relazione in Atti di Convegno]
Piccinini, P.; Prati, A.; Cucchiara, R.
abstract

This paper presents a novel approach for detecting multiple instances of the same object for pick-and-place automation. The working conditions are very challenging, with complex objects, arranged at random in the scene, and heavily occluded. This approach exploits SIFT to obtain a set of correspondences between the object model and the current image. In order to segment the multiple instances of the object, the correspondences are clustered among the objects using a voting scheme which determines the best estimate of the object's center through mean shift. This procedure is compared in terms of accuracy with existing homography-based solutions which make use of RANSAC to eliminate outliers in the homography estimation.

2009 - Pathnodes integration of standalone Particle Filters for people tracking on distributed surveillance systems [Relazione in Atti di Convegno]
Vezzani, Roberto; Baltieri, Davide; Cucchiara, Rita
abstract

In this paper, we present a new approach to object tracking based on batteries of particle filters working in multi-camera systems with non-overlapped fields of view. In each view the moving objects are tracked with independent particle filters; each filter exploits a likelihood function based on both color and motion information. The consistent labeling of people exiting from a camera field of view and entering a neighboring one is obtained by sharing particle information for the initialization of new filtering trackers. The information exchange algorithm is based on path-nodes, which are a graph-based scene representation usually adopted in computer graphics. The approach has been tested even in case of simultaneous transitions, occlusions, and groups of people. Promising results obtained using a real setup of non-overlapped cameras are presented here.

2009 - Picture Extraction from Digitized Historical Manuscripts [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this work we propose a system for automatic document segmentation to extract graphical elements from historical manuscripts and then to identify significant pictures from them, removing floral and abstract decorations. The system performs a block based analysis by means of color and texture features. The Gradient Spatial Dependency Matrix, a new texture operator particularly effective for this task, is proposed. The feature vectors are processed by an embedding procedure which allows increased performance in later SVM classification. Results for both feature extraction and embedding based classification are reported, supporting the effectiveness of the proposal.

2009 - Proceedings of International Workshop on Multimedia in forensics [Curatela]
M., Worring; Cucchiara, Rita
abstract

It is our great pleasure to welcome you to the 1st ACM Workshop on Multimedia in Forensics -- MiFor'09. With the proliferation of multimedia data on the web, surveillance cameras in cities, and mobile phones in everyday life we see an enormous growth in multimedia data that needs to be analyzed by forensic investigators. The sheer volume of such datasets makes manual inspection of all data impossible. Tools are needed to support investigators in their quest for relevant clues and evidence and in their striving towards preventing crime. The multimedia community has developed new solutions for management of large collections of video footage, images, audio and other multimedia content, knowledge extraction and categorization, pattern recognition, indexing and retrieval, searching, browsing and visualization, and modeling and simulation in various domains. Due to the inherent uncertainty and complexity of forensic data, applying those techniques to forensic data is not straightforward. The time is ripe to tailor these results for forensics. Multimedia in forensics is the workshop whose target is to join the research topics and the applications. The workshop aims at addressing the multimedia toolbox supporting the forensic process from the prevention of crime, capturing and annotation of the crime scene, the investigation of the data in the lab, up to the presentation of the results in court. It is a first attempt at bringing multimedia tools into this exciting application field. The target audience consists of researchers working on innovative technology, representatives from companies developing tools, and forensic investigators in various disciplines. Despite the ambitious objective for the workshop and it being the first edition, it attracted a good number of quality submissions fairly distributed among different countries and among the different topics of the workshop.
The MiFor09 Technical Program Committee includes the most experienced researchers in the related research fields, and thanks to their indispensable effort we were able to select 11 papers for oral presentation. The workshop schedules four oral sessions, named "Detection and Mining", "Multimedia forensics prototypes", "Forgery and Splicing Detection" and "Tracking". In addition, the program includes a keynote address by Professor Mohan Kankanhalli, a distinguished lecturer in the field.

2009 - Statistical Pattern Recognition for Multi-Camera Detection, Tracking and Trajectory Analysis [Capitolo/Saggio]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This chapter will address most of the aspects of modern video surveillance with the reference to the research activity conducted at University of Modena and Reggio Emilia, Italy, within the scopes of the national FREE SURF (FREE SUrveillance in a pRivacy-respectFul way) and NATO-funded BE SAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction) projects. Moving object detection and tracking from a single camera, multi-camera consistent labeling and trajectory shape analysis for path classification will be the main topics of this chapter.

2009 - Statistical pattern recognition for multi-camera detection, tracking, and trajectory analysis [Capitolo/Saggio]
Calderara, S.; Cucchiara, R.; Vezzani, R.; Prati, A.
abstract

2009 - Video Analysis for Ambient intelligence in Urban Environments [Capitolo/Saggio]
Prati, Andrea; Cucchiara, Rita
abstract

Ambient Intelligence (AmI) is an emerging field of research that comprises new paradigms, techniques and systems for intelligent processing of distributed sensing. A challenging arena for the AmI framework is represented by urban environments, which are characterized by high complexity, numerous sources of data, and a spread of interesting and non-trivial applications. In this context, the project LAICA (Laboratory of Ambient Intelligence for a friendly city) represents a real experiment of the usefulness of AmI for advanced services to citizens. This chapter will address solutions of video analysis that can be directly applied in urban AmI. It describes in detail the uniqueness of the LAICA approach, focusing in particular on the use of computer vision techniques for monitoring public parks. People surveillance and web-based video broadcasting will be taken into account.

2009 - Video surveillance and multimedia forensics: an application to trajectory analysis [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper reports an application of trajectory analysis in which forensics and video surveillance techniques are jointly employed for providing a new tool of multimedia forensics. Advanced video surveillance techniques are used to extract from a multi-camera system the trajectories of the moving people, which are then modelled by either their positions (projected on the ground plane) or their directions of movement. Both representations can be very suitable for querying large video repositories, by searching for similar trajectories in terms of either sequences of positions or trajectory shape (encoded as a sequence of angles, where positions do not matter). Preliminary examples of the possible use of this approach are shown.

2008 - "Inside the Bible": Segmentation, Annotation and Retrieval for a New Browsing Experience [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Calderara, Simone; Cucchiara, Rita
abstract

In this paper we present a system for automatic segmentation, annotation and image retrieval based on content, focused on illuminated manuscripts and in particular the Borso D'Este Holy Bible. To enhance the interaction possibilities with this work, full of decorations and illustrations, we exploit some well known document analysis techniques in addition to some new approaches, in order to achieve good segmentation of pages into meaningful visual objects with the relative annotation. We wanted to extend the standard keyword-based retrieval approach in a commentary with a modern visual-based retrieval by appearance similarity: an entire software user interface for exploration and visual search of illuminated manuscripts.

2008 - A Markerless Approach for Consistent Action Recognition in a Multi-camera System [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need of artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments showed the potentiality of the proposed approach.

2008 - AD-HOC: Appearance Driven Human tracking with Occlusion Handling [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

AD-HOC copes with the problem of multiple people tracking in video surveillance in the presence of large occlusions. The main novelty is the adoption of an appearance-based approach in a formal Bayesian framework: the status of each object is defined at pixel level, where each pixel is characterized by the appearance, i.e. the color (integrated over time) and the likelihood to belong to the object. With these data at pixel level and a probability of non-occlusion at object level, the problem of occlusions is addressed. The method does not aim only at detecting the presence of an occlusion, but classifies the type of occlusion at a sub-region level and evolves the status of the object in a selective way. The AD-HOC tracking has been tested in many applications for indoor and outdoor surveillance. Results on the PETS2006 test set are reported, where many people and abandoned objects are detected and tracked.

2008 - Action Signature: a Novel Holistic Representation for Action Recognition [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Recognizing different actions with a unique approach can be a difficult task. This paper proposes a novel holistic representation of actions that we called "action signature". This 1D trajectory is obtained by parsing the 2D image containing the orientations of the gradient calculated on the motion feature map called motion-history image. In this way, the trajectory is a sketch representation of how the object motion varies in time. A robust statistical framework based on mixtures of von Mises distributions and dynamic programming for sequence alignment are used to compare and classify actions/trajectories. The experimental results show a rather high accuracy in distinguishing quite complicated actions, such as drinking, jumping, or abandoning an object.

2008 - Annotation Collection and Online Performance Evaluation for Video Surveillance: the ViSOR Project [Relazione in Atti di Convegno]
Vezzani, Roberto; Cucchiara, Rita
abstract

This paper presents the ViSOR (VIdeo Surveillance Online Repository) project, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, and for evaluating the performance of automatic surveillance systems. The main idea is to exploit the collaborative paradigm spreading in the web community to join together the ontology based annotation and retrieval concepts and the requirements of the computer vision and video surveillance communities. The ViSOR open repository is based on a reference ontology which integrates many concepts, also coming from the LSCOM and MediaMill ontologies. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. The repository contains metadata annotations, which can be either manually created as ground truth or automatically generated by video surveillance systems. The automatic annotations can be compared with each other or with the reference ground truth by exploiting an integrated on-line performance evaluator.

2008 - Artificial vision for the surveillance video [Articolo su rivista]
Cucchiara, R.
abstract

2008 - Bayesian-competitive Consistent Labeling for People Surveillance [Articolo su rivista]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper presents a novel and robust approach to consistent labeling for people surveillance in multi-camera systems. A general framework scalable to any number of cameras with overlapped views is devised. An off-line training process automatically computes ground-plane homography and recovers epipolar geometry. When a new object is detected in any one camera, hypotheses for potential matching objects in the other cameras are established. Each of the hypotheses is evaluated using a prior and likelihood value. The prior accounts for the positions of the potential matching objects, while the likelihood is computed by warping the vertical axis of the new object on the field of view of the other cameras and measuring the amount of match. In the likelihood, two contributions (forward and backward) are considered so as to correctly handle the case of groups of people merged into single objects. Eventually, a maximum-a-posteriori approach estimates the best label assignment for the new object. Comparisons with other methods based on homography and extensive outdoor experiments demonstrate that the proposed approach is accurate and robust in coping with segmentation errors and in disambiguating groups.
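
The final maximum-a-posteriori step can be sketched as follows, assuming the prior and the likelihood of each matching hypothesis have already been computed. The function name, the dictionary layout and the example values are hypothetical; how the paper derives the prior from positions and the likelihood from axis warping is not reproduced here.

```python
def map_label(hypotheses):
    """Pick the maximum-a-posteriori label among matching hypotheses.

    hypotheses: dict mapping a candidate label to a (prior, likelihood) pair.
    Returns (best_label, posteriors), with posteriors normalized to sum to 1.
    """
    scores = {label: prior * likelihood
              for label, (prior, likelihood) in hypotheses.items()}
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("no hypothesis has non-zero support")
    posteriors = {label: s / total for label, s in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors
```

For example, `map_label({"A": (0.5, 0.2), "B": (0.3, 0.9)})` selects `"B"`: its higher likelihood outweighs the lower prior.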

2008 - Describing Texture Directions with Von Mises Distributions [Relazione in Atti di Convegno]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this work we describe a new approach for texture characterization. Starting from the autocorrelation matrix an elegant description through a mixture of Von Mises distributions is proposed. A compact 6 valued descriptor is produced for each block and served as input to an SVM classifier. Tests are carried out on high resolution illuminated manuscripts images.

2008 - Enabling Technologies on Hybrid Camera Networks for Behavioral Analysis of Unattended Indoor Environments and Their Surroundings [Relazione in Atti di Convegno]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita; E., Ardizzone; M., La Cascia; L., Lo Presti; M., Morana
abstract

This paper presents a layered network architecture and the enabling technologies for accomplishing vision-based behavioral analysis of unattended environments. Specifically the vision network covers both the attended environment and its surroundings by means of hybrid cameras. The layer overlooking the surroundings is deployed outdoors and tracks people, monitoring entrance/exit points. It recovers the geometry of the site under surveillance and communicates people positions to a higher level layer. The layer monitoring the unattended environment undertakes similar goals, with the addition of maintaining a global mosaic of the observed scene for further understanding. Moreover, it merges information coming from sensors beyond vision to deepen the understanding or increase the reliability of the system. The behavioral analysis is delegated to a third layer that merges the information received from the two other layers and infers knowledge about what happened, what is happening and what is likely to happen in the environment. The paper also describes a case study that was implemented in the Engineering Campus of the University of Modena and Reggio Emilia, where our surveillance system has been deployed in a computer laboratory which was often inaccessible due to lack of attendance.

2008 - HECOL: Homography and Epipolar-based Consistent Labeling for Outdoor Park Surveillance [Articolo su rivista]
Calderara, Simone; Prati, Andrea; Cucchiara, Rita
abstract

Outdoor surveillance is one of the most attractive applications of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous consistent labeling of tracked objects among the different views. This work proposes a new approach for coping with this problem in multi-camera systems with overlapped Fields of View (FoVs). The presence of overlapped zones allows the definition of a geometry-based approach to reconstruct correspondences between FoVs, using only homography and epipolar lines (hereinafter HECOL: Homography and Epipolar-based COnsistent Labeling) computed automatically with a training phase. We also propose a complete system that provides segmentation and tracking of people in each camera module. Segmentation is performed by means of the SAKBOT (Statistical and Knowledge Based Object Tracker) approach, suitably modified to cope with multi-modal backgrounds, reflections and other artefacts, typical of outdoor scenes. The extracted objects are tracked using a statistical appearance model robust against occlusions and segmentation errors. The main novelty of this paper is the approach to consistent labeling. A specific Camera Transition Graph is adopted to efficiently select the possible correspondence hypotheses between labels. A Bayesian MAP optimization assigns consistent labels to objects detected from several points of view: the object axis is computed from the shape tracked in each camera module, and homography and epipolar lines allow a correct axis warping in other image planes.
Both forward and backward probability contributions from the two different warping directions make the approach robust against segmentation errors, and capable of disambiguating groups of people. The system has been tested in a real setup of an urban public park, within the Italian LAICA (Laboratory of Ambient Intelligence for a friendly city) project. The experiments show how the system can correctly track and label objects in a distributed system with real-time performance. Comparisons with simpler consistent labeling methods and extensive outdoor experiments with ground truth demonstrate the accuracy and robustness of the proposed approach.

2008 - Pervasive Self-Learning with multi-modal distributed sensors [Relazione in Atti di Convegno]
Bicocchi, Nicola; Mamei, Marco; Prati, Andrea; Cucchiara, Rita; Zambonelli, Franco
abstract

Truly ubiquitous computing poses new and significant challenges. One of the key aspects that will condition the impact of these new technologies is how to obtain a manageable representation of the surrounding environment starting from simple sensing capabilities. This will make devices able to adapt their computing activities to an ever-changing environment. This paper presents a framework to promote unsupervised training processes among different sensors. This framework allows different sensors to exchange the knowledge needed to create a model to classify events. In particular we developed, as a case study, a multi-modal multi-sensor classification system combining data from a camera and a body-worn accelerometer to identify the user motion state. The body-worn accelerometer learns a model of the user behavior exploiting the information coming from the camera and uses it later on to classify the user motion in an autonomous way. Experiments demonstrate the accuracy of the proposed approach in different situations.

2008 - Reliable smoke detection system in the domains of image energy and color [Relazione in Atti di Convegno]
Piccinini, Paolo; Calderara, Simone; Cucchiara, Rita
abstract

Smoke detection calls for a reliable and fast distinction between background, moving objects and variable shapes that are recognizable as smoke. In our system we propose a stable background suppression module joined with a smoke detection module working on segmented objects. It exploits two features: the energy variation in the wavelet domain and a color model of the smoke. The decrease of the energy ratio in the wavelet domain between background and current image is a clue to detect smoke, representing the variations of texture level. A mixture of Gaussians models the temporal evolution of this texture ratio. The color model is used as a reference to measure the deviation of the current pixel color from the model. The two features have been combined using a Bayesian classifier to detect smoke in the scene. Experiments on real data and a comparison between our background model and the Gaussian Mixture (MOG) model for smoke detection are presented. © 2008 IEEE.
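
The energy-ratio clue can be sketched with a one-level 2D Haar transform: smoke blurs texture, so the energy in the detail sub-bands of the current block drops relative to the same block in the background. The choice of the Haar wavelet, the block representation (a list of rows of gray levels) and the function names are assumptions; the abstract does not state which wavelet or block size the system uses.

```python
def haar_detail_energy(block):
    """Energy of the three detail sub-bands of a one-level 2D Haar transform.

    block: list of rows of gray levels; a trailing odd row or column is ignored.
    """
    h, w = len(block), len(block[0])
    energy = 0.0
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            a, b = block[y][x], block[y][x + 1]
            c, d = block[y + 1][x], block[y + 1][x + 1]
            lh = (a - b + c - d) / 2.0  # horizontal variation
            hl = (a + b - c - d) / 2.0  # vertical variation
            hh = (a - b - c + d) / 2.0  # diagonal variation
            energy += lh * lh + hl * hl + hh * hh
    return energy


def energy_ratio(current, background, eps=1e-9):
    """Detail energy of the current block relative to the background block.

    Values well below 1 indicate a loss of texture; eps avoids division by zero.
    """
    return haar_detail_energy(current) / (haar_detail_energy(background) + eps)
```

In a full system this ratio would be tracked over time for each block, which is where the mixture of Gaussians described above comes in.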

2008 - Smoke detection in video surveillance: A MoG model in the wavelet domain [Relazione in Atti di Convegno]
Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

The paper presents a new fast and robust technique of smoke detection in video surveillance images. The approach aims at detecting the onset or the presence of smoke by analyzing color and texture features of moving objects, segmented with background subtraction. The proposal embodies some novelties: first, the temporal behavior of the smoke is modeled by a Mixture of Gaussians (MoG) of the energy variation in the wavelet domain. The MoG takes into account the image energy variation due to either external luminance changes or the smoke propagation. It allows these to be distinguished from the energy variation due to the presence of real moving objects such as people and vehicles. Second, this textural analysis is enriched by a color analysis based on the blending function. Third, a Bayesian model is defined where the texture and color features, detected at block level, contribute to model the likelihood while a global evaluation of the entire image models the prior probability contribution. The resulting approach is very flexible and can be adopted in conjunction with any video surveillance system based on a dynamic background model. Several tests on tens of different contexts, both outdoor and indoor, prove its robustness and precision. © 2008 Springer-Verlag Berlin Heidelberg.

2008 - Smoke detection in videosurveillance: the use of VISOR (Video Surveillance On-line Repository) [Conference Proceedings Paper]
Vezzani, Roberto; Calderara, Simone; Piccinini, Paolo; Cucchiara, Rita
abstract

ViSOR (VIdeo Surveillance Online Repository) is a large video repository, designed for containing annotated video surveillance footage, comparing annotations, evaluating system performance, and performing retrieval tasks. The web interface allows video browsing, querying by annotated concepts or by keywords, compressed video preview, and media download and upload. The repository contains metadata annotations, both manually created ground-truth data and automatically obtained outputs of particular systems. An example of application is the collection of videos and annotations for smoke detection, an important video surveillance task. In this paper we present the architecture of ViSOR, the built-in surveillance ontology which integrates many concepts, also coming from LSCOM and MediaMill, the annotation tools and the visualization of results for performance evaluation. The annotation is obtained with an automatic smoke detection system, capable of detecting people, moving objects, and smoke in real time.

2008 - Using Dominant Sets for Object Tracking with Freely Moving Camera [Conference Proceedings Paper]
Gualdi, Giovanni; A., Albarelli; Prati, Andrea; A., Torsello; M., Pelillo; Cucchiara, Rita
abstract

Object tracking with freely moving cameras is an open issue, since background information cannot be exploited for foreground segmentation, and plain feature tracking is not robust enough for target tracking, due to occlusions, distractors and object deformations. In order to deal with such challenging conditions a traditional approach, based on Camshift-like color-based features, is augmented by introducing a structural model of the object to be tracked incorporating previous knowledge about the spatial relations between the parts. Hence, an attributed graph is built on top of the features extracted from each frame and a graph matching technique is used to extract the optimal match with the model. Pixel-wise and object-wise comparison with other tracking techniques with respect to manually obtained ground truth are presented.

2008 - Using circular statistics for trajectory shape analysis [Conference Proceedings Paper]
Prati, Andrea; Calderara, Simone; Cucchiara, Rita
abstract

The analysis of patterns of movement is a crucial task for several surveillance applications, for instance to classify normal or abnormal people trajectories on the basis of their occurrence. This paper proposes to model the shape of a single trajectory as a sequence of angles described using a Mixture of Von Mises (MoVM) distribution. A complete EM (Expectation Maximization) algorithm is derived for MoVM parameter estimation and an on-line version is proposed to meet real-time requirements. Maximum-A-Posteriori is used to encode the trajectory as a sequence of symbols corresponding to the MoVM components. Iterative k-medoids clustering groups trajectories in a variable number of similarity classes. The similarity is computed by aligning two sequences (with dynamic programming) and taking the Bhattacharyya distance between von Mises distributions as the symbol-to-symbol distance. Extensive experiments have been performed on both synthetic and real data. ©2008 IEEE.

2008 - ViSOR: Video Surveillance On-line Repository for Annotation Retrieval [Conference Proceedings Paper]
Vezzani, Roberto; Cucchiara, Rita
abstract

The Imagelab Laboratory of the University of Modena and Reggio Emilia has designed a large video repository, aimed at containing annotated video surveillance footage. The web interface, named ViSOR (VIdeo Surveillance Online Repository), allows video browsing, querying by annotated concepts or by keywords, compressed preview, and video download and upload. The repository contains metadata annotation, both manually annotated ground-truth data and automatically obtained outputs of a particular system. In such a manner, the users of the repository are able to perform validation tasks of their own algorithms as well as comparative activities.

2008 - Video Streaming for Mobile Video Surveillance [Journal Article]
Gualdi, Giovanni; A., Prati; Cucchiara, Rita
abstract

Mobile video surveillance represents a new paradigm that encompasses, on the one side, ubiquitous video acquisition and, on the other side, ubiquitous video processing and viewing, addressing both computer-based and human-based surveillance. To this aim, systems must provide efficient video streaming with low latency and low frame skipping, even over limited bandwidth networks. This work presents MoSES (MObile Streaming for vidEo Surveillance), an effective system for mobile video surveillance for both PC and PDA clients; it relies on H.264/AVC video coding and GPRS/EDGE-GPRS networks. Adaptive control algorithms are employed to achieve the best tradeoff between low latency and good video fluidity. MoSES provides good-quality video streaming that is used as input to computer-based video surveillance applications for people segmentation and tracking. In this paper new and general-purpose methodologies for streaming performance evaluation are also proposed and used to compare MoSES with existing solutions in terms of different parameters (latency, image quality, video fluidity, and frame losses), as well as in terms of performance in people segmentation and tracking.

2007 - A Distributed Outdoor Video Surveillance System for Detection of Abnormal People Trajectories [Conference Proceedings Paper]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Distributed surveillance systems are nowadays widely adopted to monitor large areas for security purposes. In this paper, we present a complete multicamera system designed for people tracking from multiple partially overlapped views and capable of inferring and detecting abnormal people trajectories. Detection and tracking are performed by means of background suppression and an appearance-based probabilistic approach. Objects' label ambiguities are geometrically solved and the concept of "normality" is learned from data using a robust statistical model based on Von Mises distributions. Abnormal trajectories are detected using a first-order Bayesian network and, for each abnormal event, the appearance of the subject from each view is logged. Experiments demonstrate that our system can process with real-time performance up to three cameras simultaneously in an unsupervised setup and under varying environmental conditions.

2007 - A Dynamic Programming Technique for Classifying Trajectories [Conference Proceedings Paper]
Calderara, Simone; Cucchiara, Rita; Prati, A.
abstract

This paper proposes the exploitation of a dynamic programming technique for efficiently comparing people trajectories adopting an encoding scheme that jointly takes into account both the direction and the velocity of movement. With this approach, each pair of trajectories in the training set is compared and the corresponding distance computed. Clustering is achieved by using the k-medoids algorithm and each cluster is modeled with a 1-D Gaussian over the distance from the medoid. A MAP framework is adopted for the testing phase. The reported results are encouraging.

2007 - A Multi-Camera Vision System for Fall Detection and Alarm Generation [Journal Article]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In-house video surveillance can represent an excellent support for people with some difficulties (e.g. elderly or disabled people) living alone and with a limited autonomy. New hardware technologies and in particular digital cameras are now affordable and they have recently gained credit as tools for (semi-)automatically assuring people's safety. In this paper a multi-camera vision system for detecting and tracking people and recognizing dangerous behaviours and events such as a fall is presented. In such a situation a suitable alarm can be sent, e.g. by means of an SMS. A novel technique of warping people's silhouette is proposed to exchange visual information between partially overlapped cameras whenever a camera handover occurs. Finally, a multi-client and multi-threaded transcoding video server delivers live video streams to operators/remote users in order to check the validity of a received alarm. Semantic and event-based transcoding algorithms are used to optimize the bandwidth usage. A two-room setup has been created in our laboratory to test the performance of the overall system and some of the results obtained are reported.

2007 - An Open Source Architecture for Low-Latency Video Streaming on PDAs [Conference Proceedings Paper]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents an open-source system for low-latency video streaming on PDAs, specifically addressing mobile video surveillance requirements. The system is based on H.264 and suitably modified to obtain the best trade-off between image quality and video fluidity, working also at very limited bandwidths. Moreover, the controls used make it possible to keep the number of lost frames very low. A large set of experiments and comparisons have been carried out and the achieved results demonstrate the efficacy and efficiency of our system.

2007 - Compressed Domain Features Extraction for Shot Characterization [Conference Proceedings Paper]
Grana, Costantino; Vezzani, Roberto; Borghesani, Daniele; Cucchiara, Rita
abstract

In this work, we propose a system for shot comparison working directly on the MPEG-1 stream in the compressed domain. It extracts color, texture and motion features from all frames at a reasonable computational cost, with results comparable to those obtained on uncompressed keyframes. In particular, a summary descriptor for each Group Of Pictures (GOP) is computed and employed for shot characterization and comparison. The Mallows distance makes it possible to match clips of different lengths in a unified framework.

2007 - Detection of Abnormal Behaviors using a Mixture of Von Mises Distributions [Conference Proceedings Paper]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

This paper proposes the use of a mixture of Von Mises distributions to detect abnormal behaviors of moving people. The mixture is created from an unsupervised training set by exploiting k-medoids clustering algorithm based on Bhattacharyya distance between distributions. The extracted medoids are used as modes in the multi-modal mixture whose weights are the priors of the specific medoid. Given the mixture model a new trajectory is verified on the model by considering each direction composing it as independent. Experiments over a real scenario composed of multiple, partially-overlapped cameras are reported.

2007 - Dynamic Pictorial Ontologies for Video Digital libraries Annotation [Conference Proceedings Paper]
M., Bertini; A., Del Bimbo; C., Torniai; Grana, Costantino; Cucchiara, Rita
abstract

In this paper, we present the dynamic pictorial ontology paradigm for video annotation. Ontologies are often used to describe a given domain for different goals, including description of multimedia data. In the case of video annotation, the visual knowledge cannot be described using only abstract concepts but is more effectively represented in a visual form. To this aim, we introduce visual concepts, elicited from the data set as the most representative prototypes that specialize abstract concepts. The ontology created is intrinsically dynamic since it must embrace the perceptual and visual experience during annotation. Thus visual concepts can change, adapting to the multimedia content analyzed. Motivations for this new ontology paradigm are discussed together with a proposal of a framework for ontology creation, maintenance, and automatic annotation of video. The creation and usage of dynamic pictorial ontologies have been tested for the soccer domain, exploiting low level perceptual features and higher level domain features.

2007 - Efficient stereo vision for obstacle detection and AGV navigation [Conference Proceedings Paper]
Cucchiara, R.; Perini, E.; Pistoni, G.
abstract

2007 - Enhancing HSV Histograms with Achromatic Points Detection for Video Retrieval [Conference Proceedings Paper]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

Color is one of the most meaningful features used in content based retrieval of visual data. In video content based retrieval, color features computed on selected frames are integrated with other low-level features concerning texture, shape and motion in order to find clip similarities. For example, the Scalable Color feature defined in the MPEG-7 standard exploits HSV histograms to create color feature vectors. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in classification of achromatic and low saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection based on a single Hue and Saturation parameter that can correct this limitation. The enhanced histograms have proven to be effective in color analysis and they have been used in a system for automatic clip annotation called PEANO, where pictorial concepts are extracted by a clip clustering and used for similarity based automatic annotation.

2007 - Expert environments: Machine intelligence methods for ambient intelligence [Journal Article]
P., Remagnino; Prati, Andrea; G. L., Foresti; Cucchiara, Rita
abstract

2007 - International workshop on Visual and Multimedia Digital Libraries (VMDL07) [Conference Proceedings Paper]
Del Bimbo, A.; Boujemaa, N.; Cucchiara, R.
abstract

2007 - Linear Transition Detection as a Unified Shot Detection Approach [Journal Article]
Grana, Costantino; Cucchiara, Rita
abstract

In this paper, we propose an automatic system for video shot segmentation, called Linear Transition Detector (LTD), unique for both cuts and linear transitions detection. Comparison with publicly available shot detection systems is reported on different sports (Formula 1, basket, soccer and cycling) and TRECVID 2005 results are also reported.

2007 - Mobile Video Surveillance with Low-Bandwidth Low-Latency Video Streaming [Conference Proceedings Paper]
Gualdi, Giovanni; Prati, Andrea; Cucchiara, Rita
abstract

This paper presents a system for remote live video surveillance. Videos are acquired from a fixed camera at 10 fps and QVGA resolution, compressed at 5 or 20 kbit/s with H.264, and streamed to a remote site, where they get processed by an automatic video surveillance system. The target surveillance application performs moving object segmentation and tracking. Both ends (video acquisition and processing) could be connected through a wireless network, specifically GPRS. The whole system is studied and optimized to maintain low latency. The reported experiments demonstrate that the proposed system is able to send up to four video streams over GPRS or E-GPRS network, without significantly affecting the performance of the automatic video surveillance system. Comparative tests have been performed with other existing streaming solutions.

2007 - Network patterns recognition for automatic dermatologie images classification [Conference Proceedings Paper]
Grana, C.; Daniele, V.; Pellacani, G.; Seidenari, S.; Cucchiara, R.
abstract

In this paper we focus on the problem of automatic classification of melanocytic lesions, aiming at identifying the presence of reticular patterns. The recognition of reticular lesions is an important step in the description of the pigmented network, in order to obtain meaningful diagnostic information. Parameters like color, size or symmetry could benefit from the knowledge of having a reticular or non-reticular lesion. The detection of network patterns is performed with a three-step procedure. The first step is the localization of line points, by means of the line points detection algorithm first described by Steger. The second step is the linking of such points into a line considering the direction of the line at its endpoints and the number of line points connected to these. Finally, a third step discards the meshes which couldn't be closed at the end of the linking procedure and the ones characterized by anomalous values of area or circularity. The number of the valid meshes left and their area with respect to the whole area of the lesion are the inputs of a discriminant function which classifies the lesions into reticular and non-reticular. This approach was tested on two balanced training and testing sets (each formed by 50 reticular and 50 non-reticular images). We obtained above 86% correct classification of the reticular and non-reticular lesions on real skin images, with a specificity value never lower than 92%.

2007 - Network patterns recognition for automatic dermatoscopic images classification [Conference Proceedings Paper]
Grana, Costantino; Vanini, Daniele; Seidenari, Stefania; Pellacani, Giovanni; Cucchiara, Rita
abstract

In this paper we focus on the problem of automatic classification of melanocytic lesions, aiming at identifying the presence of reticular patterns. The recognition of reticular lesions is an important step in the description of the pigmented network, in order to obtain meaningful diagnostic information. Parameters like color, size or symmetry could benefit from the knowledge of having a reticular or non-reticular lesion. The detection of network patterns is performed with a three-step procedure. The first step is the localization of line points, by means of the line points detection algorithm first described by Steger. The second step is the linking of such points into a line considering the direction of the line at its endpoints and the number of line points connected to these. Finally, a third step discards the meshes which couldn’t be closed at the end of the linking procedure and the ones characterized by anomalous values of area or circularity. The number of the valid meshes left and their area with respect to the whole area of the lesion are the inputs of a discriminant function which classifies the lesions into reticular and non-reticular. This approach was tested on two balanced training and testing sets (each formed by 50 reticular and 50 non-reticular images). We obtained above 86% correct classification of the reticular and non-reticular lesions on real skin images, with a specificity value never lower than 92%.

2007 - Proceedings - 14th International conference on Image Analysis and Processing, ICIAP 2007: Foreword [Conference Proceedings Paper]
Cucchiara, R.
abstract

2007 - Proceedings of 14th International Conference on Image Analysis and Processing (ICIAP 2007) [Edited Volume]
Cucchiara, Rita
abstract

2007 - Prototypes Selection with Context Based Intra-class Clustering for Video Annotation with Mpeg7 Features [Conference Proceedings Paper]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

In this work, we analyze the effectiveness of perceptual features to automatically annotate video clips in domain-specific video digital libraries. Typically, automatic annotation is provided by computing clip similarity with respect to given examples, which constitute the knowledge base, in accordance with a given ontology or a classification scheme. Since the amount of training clips is normally very large, we propose to automatically extract some prototypes, or visual concepts, for each class instead of using the whole knowledge base. The prototypes are generated after a Complete Link clustering based on perceptual features with an automatic selection of the number of clusters. Context based information is used in an intra-class clustering framework to provide selection of more discriminative clips. Reducing the number of samples makes the matching process faster and lessens the storage requirements. Clips are annotated following the MPEG-7 directives to provide easier portability. Results are provided on videos taken from sports and news digital libraries.

2007 - Semi-automatic Video Digital Library Annotation Tools [Conference Proceedings Paper]
Cucchiara, Rita; Grana, Costantino; Vezzani, Roberto
abstract

In this work, we present a general purpose system for hierarchical structural segmentation and automatic annotation of video clips, by means of standardized low level features. We propose to automatically extract some prototypes for each class with a context based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semiautomatic metadata creation are provided.

2007 - Similarity-Based Retrieval with MPEG-7 3D Descriptors: Performance Evaluation on the Princeton Shape Benchmark [Conference Proceedings Paper]
Grana, Costantino; M., Davolio; Cucchiara, Rita
abstract

In this work, we describe in detail the new MPEG-7 Perceptual 3D Shape Descriptor and provide a set of tests with different 3D objects databases, mainly with the Princeton Shape Benchmark. With this purpose we created a function library called Retrieval-3D and fixed some bugs of the MPEG-7 eXperimentation Model (XM). We explain how to match the Attributed Relational Graph (ARG) of every 3D model with the modified nested Earth Mover’s Distance (mnEMD). Finally we compare our results with the best found in literature, including the first MPEG-7 3D descriptor, i.e. the Shape Spectrum Descriptor.

2007 - Sports Video Annotation Using Enhanced HSV Histograms in Multimedia Ontologies [Conference Proceedings Paper]
M., Bertini; A., Del Bimbo; C., Torniai; Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

This paper presents multimedia ontologies, where multimedia data and traditional textual ontologies are merged. A solution for their implementation for the soccer video domain and a method to perform automatic soccer video annotation using these extended ontologies are shown. HSV is a widely adopted space in image and video retrieval, but its quantization for histogram generation can create misleading errors in classification of achromatic and low saturated colors. In this paper we propose an Enhanced HSV Histogram with achromatic point detection based on a single Hue and Saturation parameter that can correct this limitation. The more general concepts of the sport domain (e.g. play/break, crowd, etc.) are put in correspondence with the more general visual features of the video like color and texture, while the more specific concepts of the soccer domain (e.g. highlights such as attack actions) are put in correspondence with domain specific visual features like the soccer playfield and the players. Experimental results for annotation of soccer videos using generic concepts are presented.

2007 - Using a Wireless Sensor Network to Enhance Video Surveillance [Journal Article]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto; L., Benini; E., Farella; P., Zappi
abstract

To enhance video surveillance systems, multi-modal sensor integration can be a successful strategy. In this work, a computer vision system able to detect and track people from multiple cameras is integrated with a wireless sensor network mounting passive Pyroelectric InfraRed sensors. The two subsystems are briefly described and possible cases in which computer vision algorithms are likely to fail are discussed. Then, simple but reliable outputs from the sensor nodes are exploited to improve the accuracy of the vision system. In particular, two case studies are reported: the first uses the presence detection of sensors to disambiguate between an open door and a moving person, while the second handles motion direction changes during occlusions. Preliminary results are reported and demonstrate the usefulness of the integration of the two subsystems.

2007 - Video Shots Comparison using the Mallows Distance [Conference Proceedings Paper]
Grana, Costantino; Borghesani, Daniele; Cucchiara, Rita
abstract

In this work, we focus on two aspects of the comparison of video shots. We present a new approach to extract a variable number of key frames from a shot, by the use of a hierarchical clustering with automatic level selection, in order to provide optimal allocation of features on different parts of the shot. We then employ the Mallows distance as an effective technique to compare the discrete distributions of features, independently from the features selected for the specific application. Results and comparisons on a soccer documentary video are provided.

2007 - Video transcoding and streaming for mobile applications [Conference Proceedings Paper]
Gualdi, G.; Prati, A.; Cucchiara, R.
abstract

The present work shows a system for compressing and streaming live videos over networks with low bandwidths (radio mobile networks), with the objective of designing an effective solution for mobile video access. We present a mobile ready-to-use streaming system that encodes video using the H.264 codec (offering good quality and frame rate at very low bit-rates) and streams it over the network using the UDP protocol. A dynamic frame rate control has been implemented in order to obtain the best trade-off between playback fluency and latency. © Springer-Verlag Berlin Heidelberg 2007.

2007 - Visor: Video Surveillance Online Repository [Conference Proceedings Paper]
Vezzani, Roberto; Cucchiara, Rita
abstract

The aim of the Visor Project [1] is to gather and make freely available a repository of surveillance and video footage for the research community on pattern recognition and multimedia retrieval. The goal is to create an open forum and a free repository to exchange, compare and discuss results of many problems in video surveillance and retrieval. Together with the videos, the repository contains metadata annotation, both manually annotated as ground-truth and automatically obtained by video surveillance systems. Annotation refers to a large ontology of concepts on surveillance and security related objects and events. The ontology has been defined including concepts from LSCOM and MediaMill ontologies. As well as videos and annotations, Visor provides tools for enriching the ontology, annotating new videos, searching by textual queries, composing and downloading videos.

2006 - 3-D Virtual Environments on Mobile Devices for Remote Surveillance [Conference Proceedings Paper]
Vezzani, Roberto; Cucchiara, Rita; A., Malizia; L., Cinque
abstract

In this paper we present a distributed video surveillance framework. Our goal is the remote monitoring of the behavior of people moving in a scene, exploiting a virtual reconstruction on low-capability devices, like PDAs and cell phones. The main novelty of this system is the effective integration of the computer vision and computer graphics modules. The first, using a probabilistic framework, can detect the position, the trajectory and the posture of people moving in the scene. The second exploits the new possibilities of both standard 3D graphics libraries on mobile devices (namely JSR184 and the M3G graphic format) and new PDAs' processing capability in order to reconstruct the remote surveillance data in real time.

2006 - A Distributed Domotic Surveillance System [Book Chapter/Essay]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

Distributed video surveillance has a direct application in intelligent home automation or domotics (from the Latin word domus, that means “home”, and informatics); in particular, in-house video surveillance can provide good support for people with some difficulties (e.g., elderly or disabled people) living alone and with a limited autonomy. New hardware technologies for surveillance are now affordable and provide high reliability. Problems related to reliable software solutions are not completely solved, especially concerning the application of general-purpose computer vision techniques in indoor environments. Indeed, assuming the objective is to detect the presence of people, track them, and recognize dangerous behaviours by means of abrupt changes in their posture, robust techniques must cope with non-trivial difficulties. In particular, luminance changes and shadows must be taken into account, frequent posture changes must be faced, and large and long-lasting occlusions are common due to the vicinity of the cameras and the presence of furniture and doors that can often hide parts of the person’s body. These problems are analyzed and solutions based on background suppression, appearance-based probabilistic tracking, and probabilistic reasoning for posture recognition are described.

2006 - A semi-automatic system for segmentation of cardiac M-mode images [Journal Article]
L., Bertelli; Cucchiara, Rita; G., Paternostro; Prati, Andrea
abstract

Pixel classifiers are often adopted in pattern recognition as a suitable method for image segmentation. A common approach to the performance evaluation of classifier systems is based on the measurement of the classification errors and, at the same time, on the computational time. In general, multiclassifiers have proven to be more precise in the classification in many applications, but at the cost of a higher computational load. This paper analyzes different classifiers and proposes an evaluation of the classifiers in the case of semi-automatic processes with human interaction. Medical imaging is a typical application, where automatic or semi-automatic segmentation can be a valuable support to the diagnosis. The paper focuses on the segmentation of cardiac images of fruit flies (genetic model for analyzing human heart's diseases). Analysis is based on M-modes, that are gray-level images derived from mono-dimensional projections of the video frames on a line. Segmentation of the M-mode images is provided by classifiers and integrated in a multiclassifier. A neural network classifier, a Bayesian classifier, and a classifier based on hidden Markov chains are joined by means of a Behavior Knowledge Space fusion rule. The comparative evaluation is discussed in terms of both accuracy and required time, in which the time to correct the classifier errors by means of human intervention is also taken into account.

2006 - A semi-automatic video annotation tool with MPEG-7 content collections [Conference Proceedings Paper]
Cucchiara, Rita; Grana, Costantino; D., Bulgarelli; Vezzani, Roberto
abstract

In this work, we present a general purpose system for hierarchical structural segmentation and automatic annotation of video clips, by means of standardized low level features. We propose to automatically extract some prototypes for each class with a context based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semiautomatic metadata creation are provided.

2006 - A system for automatic face obscuration for privacy purposes [Journal Article]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This work proposes a method for automatic face obscuration capable of protecting people's identity. Since face detection heavily benefits from the possibility to exploit tracking, multi-camera people tracking has been integrated with a face detector based on colour clustering and Hough transform. Moreover, the multiple viewpoints provided by multiple cameras are exploited in order to always obtain a good-quality image of the face. The identity of people in different views is kept consistent by means of a geometrical, uncalibrated approach based on homographies. Experimental results show the accuracy of the proposed approach. (c) 2006 Elsevier B.V. All rights reserved.

2006 - Advanced video surveillance with pan tilt zoom cameras [Conference Proceedings Paper]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper an advanced video surveillance system is proposed. Our goal is the detection of people’s heads to allow their obscuration for privacy reasons or to perform recognition tasks. We propose a system based on active PTZ (Pan-Tilt-Zoom) cameras that produce head images of a large enough size and can cover an area larger than still cameras. Since conventional approaches are not suitable for PTZ cameras, the proposed approach is based on the so-called direction histograms to compute the ego-motion and on frame differencing for detecting moving objects. It exploits post-processing and active contours to extract the precise shape of moving objects to be fed to a probabilistic algorithm that tracks moving people in the scene. Person following, instead, is based on simple heuristic rules that move the camera as soon as the selected person is close to the border of the field of view. Finally, a color and shape based head detection that takes advantage of the people tracking is presented. Experimental results on a live active camera demonstrate the feasibility of real-time person following and of the subsequent head detection phase.

2006 - Comparison of color clustering algorithms for segmentation of dermatological images [Relazione in Atti di Convegno]
Melli, Rudy Mirko; Grana, Costantino; Cucchiara, Rita
abstract

Automatic segmentation of skin lesions in clinical images is a very challenging task; it is necessary for visual analysis of the edges, shape and colors of the lesions to support the melanoma diagnosis, but, at the same time, it is cumbersome since lesions (both naevi and melanomas) do not have regular shape, uniform color, or univocal structure. Most of the approaches adopt unsupervised color clustering. This work compares the most widespread color clustering algorithms, namely median cut, k-means, fuzzy c-means and mean shift, applied to a method for automatic border extraction, providing an evaluation of the upper bound in accuracy that can be reached with these approaches. Different tests have been performed to examine the influence of the choice of the parameter settings on the performance of the algorithms. Then a new supervised learning phase is proposed to select the best number of clusters and to segment the lesion automatically. Experiments have been carried out on a large database of medical images, manually segmented by dermatologists. In these experiments mean shift proved to be the best technique in terms of sensitivity and specificity. Finally, a qualitative evaluation of the goodness of segmentation has been validated by the human experts too, confirming the results of the quantitative comparison.

2006 - Distance transform for automatic dermatologic images composition [Relazione in Atti di Convegno]
Grana, Costantino; Pellacani, Giovanni; Seidenari, Stefania; Cucchiara, Rita
abstract

In this paper we focus on the problem of automatically registering dermatological images, because even if different products are available, most of them share the problem of a limited field of view on the skin. A possible solution is then the composition of multiple takes of the same lesion with digital software, such as that for panorama images creation. In this work, to perform an automatic selection of matching points the Harris Corner Detector is used, and to cope with outlier couples we employed the RANSAC method. Projective mapping is then used to match the two images. Given a set of correspondence points, Singular Value Decomposition was used to compute the transform parameters. At this point the two images need to be blended together. One initial assumption is often implicitly made: the aim is to merge two rectangular images. But when merging occurs between more than two images iteratively, this assumption will fail. To cope with differently shaped images, we employed the Distance Transform and provided a weighted merging of images. Different tests were conducted with dermatological images, both with standard rectangular frame and with atypical shapes, for example a ring due to the objective and lens selection. The successive composition of different circular images with other blending functions, such as the Hat function, doesn’t correctly get rid of the border, and residuals of the circular mask are still visible. By applying Distance Transform blending, the result produced is insensitive to the outer shape of the image.
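As an illustration of the distance-transform weighted merging mentioned in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes the two images are already registered on the same pixel grid, that each comes with a binary validity mask, and that a city-block distance is acceptable; the names `distance_transform` and `blend` are hypothetical.

```python
def distance_transform(mask):
    """City-block distance from each valid pixel to the nearest invalid one.

    Pixels outside the image are treated as invalid, so a valid pixel on
    the image border gets distance 1. (Assumed convention for this sketch.)
    """
    h, w = len(mask), len(mask[0])
    big = h + w
    d = [[big if mask[y][x] else 0 for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(h):
        for x in range(w):
            if d[y][x]:
                up = d[y - 1][x] if y > 0 else 0
                left = d[y][x - 1] if x > 0 else 0
                d[y][x] = min(d[y][x], up + 1, left + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in reversed(range(h)):
        for x in reversed(range(w)):
            if d[y][x]:
                down = d[y + 1][x] if y < h - 1 else 0
                right = d[y][x + 1] if x < w - 1 else 0
                d[y][x] = min(d[y][x], down + 1, right + 1)
    return d


def blend(img_a, mask_a, img_b, mask_b):
    """Blend two registered grey-level images, weighting each pixel by its
    distance from the border of the image it comes from."""
    wa = distance_transform(mask_a)
    wb = distance_transform(mask_b)
    out = []
    for y in range(len(img_a)):
        row = []
        for x in range(len(img_a[0])):
            total = wa[y][x] + wb[y][x]
            if total == 0:
                row.append(0.0)  # no image covers this pixel
            else:
                row.append((wa[y][x] * img_a[y][x] + wb[y][x] * img_b[y][x]) / total)
        out.append(row)
    return out
```

Because the weights come from the mask alone, the same code applies to rectangular, circular, or ring-shaped images: only the mask changes.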

2006 - Estimating Geospatial Trajectory of a Moving Camera [Relazione in Atti di Convegno]
A., Hakeem; Vezzani, Roberto; S., Shah; Cucchiara, Rita
abstract

This paper proposes a novel method for estimating the geospatial trajectory of a moving camera. The proposed method uses a set of reference images with known GPS (global positioning system) locations to recover the trajectory of a moving camera using geometric constraints. The proposed method has three main steps. First, scale invariant features transform (SIFT) are detected and matched between the reference images and the video frames to calculate a weighted adjacency matrix (WAM) based on the number of SIFT matches. Second, using the estimated WAM, the maximum matching reference image is selected for the current video frame, which is then used to estimate the relative position (rotation and translation) of the video frame using the fundamental matrix constraint. The relative position is recovered up to a scale factor and a triangulation among the video frame and two reference images is performed to resolve the scale ambiguity. Third, an outlier rejection and trajectory smoothing (using b-spline) post processing step is employed. This is because the estimated camera locations may be noisy due to bad point correspondence or degenerate estimates of fundamental matrices. Results of recovering camera trajectory are reported for real sequences.

2006 - FaceMouse: A human-computer interface for tetraplegic people [Relazione in Atti di Convegno]
Perini, Emanuele; S., Soria; Prati, Andrea; Cucchiara, Rita
abstract

This paper proposes a new human-machine interface particularly conceived for people with severe disabilities (specifically tetraplegic people), that allows them to interact with the computer in their everyday life by means of the mouse pointer. In this system, called FaceMouse, instead of the classical pointer paradigm that requires the user to look at the point where to move, we propose to use a paradigm called derivative paradigm, where the user does not indicate the precise position, but the direction along which the mouse pointer must be moved. The proposed system is composed of a common, low-cost webcam and a set of computer vision techniques developed to identify the parts of the user's face (the only body part that a tetraplegic person can move) and exploit them for moving the pointer. Specifically, the implemented algorithm is based on template matching to track the nose of the user and on cross-correlation to calculate the best match. Finally, several real applications of the system are described and experimental results carried out by disabled people are reported.
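The derivative paradigm described in this abstract can be illustrated with a minimal sketch. This is not the FaceMouse implementation: the rest position, the dead zone, the gain, and the function names `pointer_step` and `run` are all assumptions introduced here, and the nose positions are taken as already tracked.

```python
def pointer_step(nose_xy, rest_xy, dead_zone=5.0, gain=0.2):
    """Return the (dx, dy) pointer displacement for one frame.

    The nose offset from its rest position gives a direction, not a
    target: the pointer keeps moving while the nose stays off-centre.
    """
    off_x = nose_xy[0] - rest_xy[0]
    off_y = nose_xy[1] - rest_xy[1]
    # Ignore small offsets so that tracking jitter does not move the pointer.
    if (off_x ** 2 + off_y ** 2) ** 0.5 < dead_zone:
        return (0.0, 0.0)
    return (gain * off_x, gain * off_y)


def run(pointer_xy, rest_xy, nose_track):
    """Integrate the per-frame displacements over a sequence of nose positions."""
    x, y = pointer_xy
    for nose_xy in nose_track:
        dx, dy = pointer_step(nose_xy, rest_xy)
        x, y = x + dx, y + dy
    return (x, y)
```

For example, holding the nose 10 pixels to the right of the rest position for three frames moves the pointer right on every frame, whereas a positional paradigm would place it once and stop.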

2006 - Fast Dynamic Mosaicing and Person Following [Relazione in Atti di Convegno]
Prati, Andrea; F., Seghedoni; Cucchiara, Rita
abstract

A system for video surveillance purposes in wide areas based on active cameras, also capable of following a person in the scene by keeping him framed, is presented. The proposed approach is based on the so-called direction histograms to compute the ego-motion and on frame differencing for detecting moving objects. It exploits post-processing and active contours to extract the precise shape of moving objects to be fed to a probabilistic algorithm to track moving people in the scene. Person following, instead, is based on simple heuristic rules that move the camera as soon as the selected person is close to the border of the field of view. Experimental results on a live active camera demonstrate the feasibility of real-time person following.

2006 - Group Detection at Camera Handoff for Collecting People Appearance in Multi-camera Systems [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable of fully characterizing groups of people and of managing miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects' principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.

2006 - Line Detection and Texture Characterization of Network Patterns [Relazione in Atti di Convegno]
Grana, Costantino; Cucchiara, Rita; Pellacani, Giovanni; Seidenari, Stefania
abstract

This paper describes a complete approach to detect, localize and describe network patterns. Such texture is automatically detected with Gaussian derivative kernels and Fisher linear discriminant analysis; line closure and thinning is provided by morphological masking and line luminance profile fitting provides width estimation. Detection results on dermatological images are reported and discussed.

2006 - Low-latency Live Video Streaming over Low-Capacity Networks [Relazione in Atti di Convegno]
Gualdi, Giovanni; Cucchiara, Rita; Prati, Andrea
abstract

This paper presents an effective system for streaming over low-capacity networks (such as GPRS and EGPRS) of live videos with low latency. Existing solutions are either too complex or not suitable to our scope. For this reason, we developed a complete, ready-to-use streaming system based on H.264/AVC codec and UDP/IP stack. The system employs adaptive controls to achieve the best tradeoff between low latency and good video fluency, by keeping the UDP buffer occupancy at the decoder side between two given levels. Our experiments demonstrate that this system is able to transmit live videos at CIF format and 10 fps over GPRS/EGPRS with very low latency (1.73 sec on average, basically due to the network delay), good fluency and average quality, measured with PSNR, of 31 dB on GPRS at 23 kbps at 10 fps.
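The idea of keeping the decoder-side buffer occupancy between two levels can be sketched as a simple two-threshold controller. This is only an illustration under stated assumptions: the abstract does not say which quantity the adaptive controls act on, so adjusting the playback rate, the 20% step, and the function name `playback_rate` are hypothetical choices made here.

```python
def playback_rate(occupancy, low, high, nominal_fps=10.0, step=0.2):
    """Choose a playback rate from the current buffer occupancy.

    Below `low` the buffer is draining, so playback slows down to
    preserve fluency; above `high` data is piling up, so playback
    speeds up to cut latency. (Assumed control action.)
    """
    if occupancy < low:
        return nominal_fps * (1.0 - step)
    if occupancy > high:
        return nominal_fps * (1.0 + step)
    return nominal_fps
```

Between the two levels the rate is left at its nominal value, which keeps the controller from oscillating on every small fluctuation of the network delay.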

2006 - MOM: multimedia ontology manager. A framework for automatic annotation and semantic retrieval of video sequences [Relazione in Atti di Convegno]
M., Bertini; A., Del Bimbo; C., Torniai; Grana, Costantino; Cucchiara, Rita
abstract

Effective usage of multimedia digital libraries has to deal with the problem of building efficient content annotation and retrieval tools. MOM (Multimedia Ontology Manager) is a complete system that allows the creation of multimedia ontologies, supports automatic annotation and creation of extended text (and audio) commentaries of video sequences, and permits complex queries by reasoning on the ontology.

2006 - MPEG-7 Pictorially Enriched Ontologies for Video Annotation [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Bulgarelli, Daniele; Cucchiara, Rita
abstract

A system for the automatic creation of Pictorially Enriched Ontologies is presented, that is ontologies for context-based video digital libraries, enriched by pictorial concepts for video annotation, summarization and similarity-based retrieval. Extraction of pictorial concepts with video clips clustering, ontology storing with MPEG-7, and the use of the ontology for stored video annotation are described. Results on sport videos and TRECVID2005 video material are reported.

2006 - Multimedia Surveillance: Content-based Retrieval with Multicamera People Tracking [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Multimedia surveillance relates to the exploitation of multimedia tools for retrieving information from surveillance data, for emerging applications such as video post-analysis for forensic purposes. Searching for all the sequences in which a certain person was present is a typical query that is carried out by means of example images. Unfortunately, surveillance cameras often have low resolution, making retrieval based on appearance difficult. This paper proposes to exploit a two-step retrieval process that merges similarity-based retrieval with multicamera tracking-based retrieval able to create consistent traces of a person from different views and, thus, different resolutions. A mixture model is used to summarize these traces into a single prototype on which retrieval is performed. Experimental results demonstrate the accuracy of the retrieval process also in the case of varying illumination conditions.

2006 - PEANO: Pictorial Enriched Annotation of Video [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Bulgarelli, Daniele; Gualdi, Giovanni; Cucchiara, Rita; M., Bertini; C., Torniai; A., Del Bimbo
abstract

In this DEMO, we present a tool set for video digital library management that allows i) structural annotation of edited videos in MPEG-7 by automatically extracting shots and clips; ii) automatic semantic annotation based on perceptual similarity against a taxonomy enriched with pictorial concepts; iii) video clip access and hierarchical summarization with stand-alone and web interface; iv) access to clips from mobile platform in GPRS-UMTS videostreaming. The tools can be applied in different domain-specific Video Digital Libraries. The main novelty is the possibility to enrich the annotation with pictorial concepts that are added to a textual taxonomy in order to make the automatic annotation process faster and often more effective. The resulting multimedia ontology is described in the MPEG-7 framework. The PEANO (Perceptual Annotation of Video) tool has been tested over video art, sport (Soccer, Olympic Games 2006, Formula 1) and news clips.

2006 - Performance of the MPEG-7 Shape Spectrum Descriptor for 3D objects retrieval [Relazione in Atti di Convegno]
Grana, Costantino; Cucchiara, Rita
abstract

In this work, we describe in detail the MPEG-7 Shape Spectrum Descriptor and provide a set of tests with different 3D objects databases. To verify whether the low performance of this descriptor reported in the literature was due to the comparison employed, we also used the Earth Mover's Distance, which allows much more detailed histogram comparisons. Finally we compare our outcomes with the best results in related work.

2006 - Reliable background suppression for complex scenes [Relazione in Atti di Convegno]
Calderara, Simone; Melli, Rudy Mirko; Prati, Andrea; Cucchiara, Rita
abstract

This paper describes a system for motion detection based on background suppression, specifically conceived for working in complex scenes with vacillating background, camouflage, illumination changes, etc. The system contains proper techniques for background bootstrapping, shadow removal, ghost suppression and selective updating of the background model. The results on the challenging videos provided in VSSN '06 Open Source Algorithm Competition dataset demonstrate that the proposed system outperforms the widely-used mixture-of-Gaussians approach.

2006 - Semantic Annotation and Adaptation of Live Sports Videos [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., Del Bimbo; Prati, Andrea
abstract

This paper addresses multimedia tools for universal multimedia access to sports videos by means of automatic annotation and content-based adaptation. The goal is to provide boosting technologies to allow the new generations of mobile devices (phones and PDAs) to better exploit the available bandwidth and to achieve a reasonable cost/quality trade-off in remote access to long-lasting live events, such as sport competitions. Although the available bandwidth for mobile communication has increased thanks to new telecommunication standards such as GPRS and UMTS, it is still insufficient for high quality video transmission. The limited resources of low-cost terminals and the high costs of data transfer hinder de-facto many possible multimedia services. First, the quality is limited by the small display size and memory available on many mobile devices. Second, the limited bandwidth may affect user satisfaction either because of the time spent waiting for the download or the latency in streaming a live video. Moreover, even if the user is willing to wait for the download or accepts frame dropping, a reduction of data to send would be unavoidable in order to bring down the costs of the service. As a matter of fact, most telecommunication companies charge a fee proportional to the number of bytes transferred. Hence, the cost of accessing a long-lasting live video, such as a 90-minute soccer competition, is still too high for most users.

2006 - Semantic adaptation of sport videos with user-centred performance analysis [Articolo su rivista]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

In semantic video adaptation, measures of performance must consider the impact of the errors in the automatic annotation over the adaptation in relationship with the preferences and expectations of the user. In this paper, we define two new performance measures, Viewing Quality Loss and Bit-rate Cost Increase, that are obtained from classical peak signal-to-noise ratio (PSNR) and bit rate, and relate the results of semantic adaptation to the errors in the annotation of events and objects and the user's preferences and expectations. We present and discuss results obtained with a system that performs automatic annotation of soccer sport video highlights and applies different coding strategies to different parts of the video according to their relative importance for the end user. With reference to this framework, we analyze how highlights' statistics and the errors of the annotation engine influence the performance of semantic adaptation and reflect into the quality of the video displayed at the user's client and the increase of transmission costs.
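For reference, the classical PSNR on which the two measures are built has the standard definition below. The symbols are notation introduced here, not taken from the paper: I is the original M x N frame, \hat{I} the decoded frame, and L the peak pixel value (255 for 8-bit video). The definitions of Viewing Quality Loss and Bit-rate Cost Increase themselves are not given in this abstract and are not reproduced.

```latex
\mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\bigl(I(i,j)-\hat{I}(i,j)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{L^{2}}{\mathrm{MSE}}\ \text{dB}
```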

2006 - Special Issue on Multimedia Surveillance Systems: Guest Editorial [Articolo su rivista]
Aggarwal, Jk; Cucchiara, Rita
abstract

It is with considerable pride that we present this special issue of ACM multimedia based on the presentations at the third Video Surveillance and Sensor Network workshop, in conjunction with the ACM conference in Singapore 2005. The papers were thoroughly reviewed independently of the review process for the workshop. This special issue consists of eight papers drawn from a number of areas. It appears that we are breaking new ground as explained in this issue. Whenever we say multimedia, we think of systems and services that manage heterogeneous data for human-oriented applications; human users are normally the subjects who access and use multimedia data, multimedia streams, multimedia content, and multimedia interfaces in many different applications contexts. Following this abstraction, multimedia surveillance systems would be only a surveillance system able to produce output of the task in a multimedia format, providing distilled video, images and sounds of the monitored environment, which would possibly be annotated in an efficient and standard way or possibly transcoded in another media such as text or animation, to improve further querying to surveillance stored data.

2006 - Sub-Shot Summarization for MPEG-7 based Fast Browsing [Relazione in Atti di Convegno]
Grana, Costantino; Cucchiara, Rita
abstract

In this paper, we propose a system for automatic video summarization at sub-shot level. Our work covers two main aspects: the first is the sub-shot detection, which is performed without a priori constraints on the number or length of the shots. The algorithm is based on color histograms and motion features, and employs fuzzy c-means with variable number of clusters. The second aspect is an in depth discussion on the annotation of summaries with the MPEG-7 standard. Results on mixed genres TV material, from TRECVID videos, are reported.

2006 - The LAICA project: Experiments on Multicamera People Tracking and Logging [Relazione in Atti di Convegno]
Calderara, Simone; Cucchiara, Rita; Prati, Andrea
abstract

Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling also capable of fully characterizing groups of people and of managing miss segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects’ principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, miss segmentations, and groups.

2006 - University of Modena and Reggio Emilia at TRECVID 2006 [Relazione in Atti di Convegno]
Grana, Costantino; Vezzani, Roberto; Cucchiara, Rita
abstract

What approach or combination of approaches did you test in each of your submitted runs? TRECVID2005_UNIMORE_??.xml: the same linear transition detector (LTD) was tested for every run, with ten uniformly spaced thresholds for the detection. What if any significant differences (in terms of what measures) did you find among the runs? The system behaved as expected: the higher the threshold the better the recall. Of course the precision lowered correspondingly. Interestingly enough, it seems that we cannot overcome the overall limit around 80% for recall and 88% for precision, independently of the other parameter. Based on the results, can you estimate the relative contribution of each component of your system/approach to its effectiveness? One of the main objectives of our system was to test the performance of a single algorithm for both cuts and gradual transitions. So all the merits and the demerits are related to our LTD. Overall, what did you learn about runs/approaches and the research question(s) that motivated them? The use of a single algorithm allows the system to be run without training. Just a single parameter may be employed to tune the sensitivity of the system, thus allowing its use in general purpose/user friendly systems.

2006 - Video Clip Clustering for Assisted Creation of MPEG-7 Pictorially Enriched Ontologies [Relazione in Atti di Convegno]
Grana, Costantino; Bulgarelli, Daniele; Cucchiara, Rita
abstract

In this paper, we present a system for the assisted creation of Pictorially Enriched Ontologies, that is ontologies for context-based digital libraries enriched by pictorial concepts for video annotation, summarization and similarity based retrieval. Here we detail the approach for video clips clustering and pictorial concepts extraction together with the approach for storing the ontology within the MPEG-7 framework. The clustering is performed by Complete Link hierarchical clustering on color histograms and motion features. Results on Formula 1 TV material are reported.
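Complete Link hierarchical clustering, named in this abstract, can be sketched in a few lines. The version below is a generic illustration, not the system described: it assumes each clip is already summarised by a numeric feature vector, uses Euclidean distance, and stops at a requested number of clusters; the function name `complete_link` and these choices are assumptions made here.

```python
def complete_link(points, num_clusters):
    """Agglomerative clustering with the complete-link criterion.

    Returns a list of clusters, each a list of indices into `points`.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: the distance between two clusters is the
                # distance between their two farthest members.
                d = max(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Using the farthest pair as the cluster distance favours compact groups, which suits the goal of picking one representative visual prototype per cluster.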

2005 - Adaptation and Annotation of Formula 1 Sport Videos [Relazione in Atti di Convegno]
Grana, Costantino; Tardini, Giovanni; Cucchiara, Rita
abstract

In this paper, we approach the problem of detecting editing features suitable for video annotation, by paying attention to artifacts and effects introduced in video editing. In particular, a linear transition detection algorithm is presented, which can characterize the transition center and length with high precision. The technique works with sub-frame granularity and is able to include both abrupt cuts and longer dissolves in a single approach. Theoretical justification for the algorithm is provided with an optimization technique for real cases. We present results obtained exploiting the editing features on a Formula 1 video digital library, detecting replays and providing pre classification hints for automatic shot annotation.

2005 - Ambient Intelligence for Security in Public Parks: the LAICA Project [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper, we address the exploitation of computer vision techniques to develop multimedia services and automatic monitoring systems related to the security and the privacy in public areas. The research is part of a two-year Italian project called LAICA, intended to provide advanced services for citizens and public officers. Citizens want fast and friendly web access to public places, to see the environment in real-time without violating the privacy laws. Public officers and policy centres want a fast and reactive monitoring system, capable to automatically detect dangerous situations, given the huge amount of cameras that cannot be monitored simultaneously by human operators. In this work, we describe the project and the defined methodologies in multi-camera video mosaicing, people tracking and consistent labelling, and access to processed data with face obscuration.

2005 - Ambient Intelligence in Urban Environments [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; C., Osti; S., Pavani
abstract

This paper reports advances achieved within a project called LAICA (Laboratorio di Ambient Intelligence per una Città Amica) on Ambient Intelligence in urban environments. The overall LAICA architecture is described and the unified operative centre developed by Regulus SpA (partner of the project) to collect and correlate data from different sensors and prototypes is depicted. Moreover, the paper describes the results obtained in developing a system for video surveillance in public parks, devoted to create a mosaic image of the scene and to extract and track moving people. Moreover, the system takes the privacy issues into account, proposing a method for face detection and tracking able to obscure faces in order to protect people’s identity.

2005 - An integrated framework for semantic annotation and adaptation [Articolo su rivista]
M., Bertini; Cucchiara, Rita; A., Del Bimbo; Prati, Andrea
abstract

Tools for the interpretation of significant events from video and video clip adaptation can effectively support automatic extraction and distribution of relevant content from video streams. In fact, adaptation can adjust meaningful content, previously detected and extracted, to the user/client capabilities and requirements. The integration of these two functions is increasingly important, due to the growing demand of multimedia data from remote clients with limited resources (PDAs, HCCs, Smart phones). In this paper we propose a unified framework for event-based and object-based semantic extraction from video and semantic on-line adaptation. Two cases of application, highlight detection and recognition from soccer videos and people behavior detection in domotic applications, are analyzed and discussed.

2005 - Assessing Temporal Coherence for Posture Classification with Large Occlusions [Relazione in Atti di Convegno]
Cucchiara, Rita; Vezzani, Roberto
abstract

In this paper we present a people posture classification approach especially devoted to cope with occlusions. In particular, the approach aims at assessing temporal coherence of visual data over probabilistic models. A mixed predictive and probabilistic tracking is proposed: a probabilistic tracking maintains along time the actual appearance of detected people and evaluates the occlusion probability; an additional tracking with Kalman prediction improves the estimation of the people position inside the room. Probabilistic Projection Maps (PPMs) created with a learning phase are matched against the appearance mask of the track. Finally, a Hidden Markov Model formulation of the posture corrects the frame-by-frame classification uncertainties and makes the system reliable even in presence of occlusions. Results obtained over real indoor sequences are discussed.

2005 - Auto-iris Compensation for Traffic Surveillance Systems [Relazione in Atti di Convegno]
Cucchiara, Rita; Melli, Rudy Mirko; Prati, Andrea
abstract

This paper addresses auto-iris compensation. Auto-iris can be really troublesome for motion detection and tracking techniques based on background or frame differencing, since it can change quickly the average intensity of the current frame. To cope with this, we introduced a two-step auto-iris compensation approach in our traffic monitoring system. First, the auto-iris detection is based on the computation of the average of the luminance difference obtained by background suppression. Then, if an auto-iris is detected, the compensation phase is started. In this phase, the auto-iris’ behaviour is empirically modelled and, thus, compensated. Experimental results demonstrate the accuracy of the proposed approach, with both quantitative measures and visual analysis.

2005 - Computer vision system for in-house video surveillance [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

In-house video surveillance to control the safety of people living in domestic environments is considered. In this context, common problems and general purpose computer vision techniques are discussed and implemented in an integrated solution comprising a robust moving object detection module which is able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. These factors, shadows, large occlusions and people's posture, are the key problems that are encountered with in-house surveillance systems. A distributed system with cameras installed in each room of a house can be used to provide full coverage of people's movements. Tracking is based on a probabilistic approach in which the appearance and probability of occlusions are computed for the current camera and warped in the next camera's view by positioning the cameras to disambiguate the occlusions. The application context is the emerging area of domotics (from the Latin word domus, meaning 'home', and informatics). In particular, indoor video surveillance makes it possible for elderly and disabled people to live with a sufficient degree of autonomy, via interaction with this new technology, which can be distributed in a house at affordable costs and with high reliability.

2005 - Consistent labeling for multi-camera object tracking [Relazione in Atti di Convegno]
Calderara, Simone; Prati, Andrea; Vezzani, Roberto; Cucchiara, Rita
abstract

In this paper, we present a new approach to multi-camera object tracking based on consistent labeling. An automatic and reliable procedure obtains the homographic transformation between two overlapped views, without any manual calibration of the cameras. Objects' positions are matched by using the homography when the object is first detected in one of the two views. The approach has also been tested in the case of simultaneous transitions and in the case in which people are detected as a group during the transition. Promising results are reported over a real setup of overlapped cameras.

2005 - Domain knowledge extension with pictorially enriched ontologies [Relazione in Atti di Convegno]
Bertini, M.; Cucchiara, R.; Del Bimbo, A.; Torniai, C.
abstract

Classifying video elements according to some pre-defined ontology of the video content is the typical way to perform video annotation. Ontologies are built by defining relationship between linguistic terms that describe domain concepts at different abstraction levels. Linguistic terms are appropriate to distinguish specific events and object categories but they are inadequate when they must describe video entities or specific patterns of events. In these cases visual prototypes can better express pattern specifications and the diversity of visual events. To support video annotation up to the level of pattern specification enriched ontologies, that include visual concepts together with linguistic keywords, are needed. This paper presents Pictorially Enriched ontologies and provides a solution for their implementation in the soccer video domain. The pictorially enriched ontology created is used both to directly assign multimedia objects to concepts, providing a more meaningful definition than the linguistics terms, and to extend the initial knowledge of the domain, adding subclasses of highlights or new highlight classes that were not defined in the linguistic ontology. Automatic annotation of soccer clips up to the pattern specification level using a pictorially enriched ontology is discussed. © Springer-Verlag Berlin Heidelberg 2005.

2005 - Entry Edge of Field of View for multi-camera tracking in distributed video surveillance [Relazione in Atti di Convegno]
Calderara, Simone; Vezzani, Roberto; Prati, Andrea; Cucchiara, Rita
abstract

An efficient solution to people tracking in distributed video surveillance is required to monitor large, crowded environments. This paper proposes a novel use of the Entry Edges of Field of View (E2oFoV) to solve the consistent labeling problem between partially overlapped views. An automatic and reliable procedure obtains the homographic transformation between two overlapped views, without any manual calibration of the cameras. Through the homography, the consistent labeling is established each time a new track is detected in one of the cameras. A Camera Transition Graph (CTG) is defined to speed up the establishment process by reducing the search space. Experimental results prove the effectiveness of the proposed solution also in challenging conditions.
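As a rough illustration of how a homography can carry a label from one view to another, the sketch below maps a point through a 3x3 matrix and picks the nearest known track in the second view. It is not the authors' implementation: the function names, the `max_dist` threshold, and the nearest-neighbour rule are assumptions made only for this example, and the homography is taken as already estimated.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to image coordinates.
    return (xh / w, yh / w)

def match_label(H, new_point, tracks, max_dist=20.0):
    """Return the label of the closest track in the other view, or None.

    tracks: dict mapping a label to an (x, y) position in the other view.
    max_dist: hypothetical acceptance threshold, in pixels.
    """
    px, py = apply_homography(H, new_point)
    best_label, best_dist = None, max_dist
    for label, (tx, ty) in tracks.items():
        d = math.hypot(px - tx, py - ty)
        if d <= best_dist:
            best_label, best_dist = label, d
    return best_label
```

A new track whose warped position falls near an existing track inherits that track's label; otherwise `None` signals that a fresh label is needed.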

2005 - Foreword [Relazione in Atti di Convegno]
Aggarwal, J. K.; Cucchiara, R.; Chang, E.; Wang, Y. -F.
abstract

2005 - Guest editorial: Special issue on video segmentation for semantic annotation and transcoding [Articolo su rivista]
Cucchiara, R.; Del Bimbo, A.
abstract

2005 - MPEG-7 Compliant Shot Detection in Sport Videos [Relazione in Atti di Convegno]
Grana, Costantino; Tardini, Giovanni; Cucchiara, Rita
abstract

In this paper we propose a system for automatic detection of shots in sport videos. Our work covers two main aspects: the first is robust shot detection in the presence of fast object motion and camera operations. To this aim we propose a new algorithm, unique for both cuts and linear transitions detection, which only needs the tuning of two parameters. An extended comparison with four transition detection algorithms, representing the state of the art in literature, is reported. Examples with Formula 1, basketball, soccer and cycling videos are analyzed. The second aspect is an in-depth discussion on the annotation of shots and transitions with the MPEG-7 standard.

2005 - Making the home safer and more secure through visual surveillance [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

Video surveillance has a direct application in intelligent home automation or domotics (from the Latin word domus, that means “home”, and informatics). In particular, in-house video surveillance can provide good support for people with some difficulties (e.g. elderly or disabled people) living alone and with limited autonomy. A key aspect in video surveillance systems for domotics is that of analyzing behaviours of the monitored people. To accomplish this task, people must be detected and tracked, and their posture must be analyzed in order to model behaviours recognizing abrupt changes in it. Problems related to reliable software solutions are not completely solved; in particular, luminance changes, shadows and frequent posture changes must be taken into account. Long-lasting occlusions are common due to the proximity of the cameras and the presence of furniture and doors that can often hide parts of a person’s body. For these reasons, a probabilistic and appearance-based tracking, particularly suitable for people tracking and posture classification, has been developed. However, despite its effectiveness for long-lasting and large occlusions, this approach tends to fail whenever the person is monitored with multiple cameras and appears already occluded in one of them. Different views provided by multiple cameras can be exploited to solve occlusions by warping known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by HMM for assuring temporal coherence of the posture.

2005 - Multimedia Surveillance Systems [Relazione in Atti di Convegno]
Cucchiara, Rita
abstract

The integration of video technology and sensor networks constitutes the fundamental infrastructure for new generations of multimedia surveillance systems, where many different media streams (audio, video, images, textual data, sensor signals) will concur to provide an automatic analysis of the controlled environment and a real-time interpretation of the scene. New solutions can be devised to enlarge the view of traditional surveillance systems by means of distributed architectures with fixed and active cameras, to enhance their view with other sensed data, to explore multi-resolution views with zooming and omnidirectional cameras. Applications regard surveillance of wide indoor and outdoor area and particularly people surveillance: in this case, multimedia surveillance systems can be enriched with biometric technology; the best views of detected persons and their extracted visual features (e.g. faces, voices, trajectories) can be exploited for people identification. VSSN05 is the third edition of the workshop, co-located at ACM Multimedia Conference, that embraces research reports on video surveillance and, since the edition of 2004, sensor networks. This paper gives a short overview of the hot topics in multimedia surveillance systems and introduces some research activities currently engaged in the world and presented at VSSN05.

2005 - On the usefulness of object shape coding with MPEG-4 [Relazione in Atti di Convegno]
Prati, Andrea; Cucchiara, Rita
abstract

This paper reports the results of an in-depth analysis of the degree of usefulness of object shape coding in video compression. In particular, MPEG-4 is used as reference standard. The influence of different coding parameters on the performance is examined in depth and discussions on the results are provided. Object shape coding is compared with classical (MPEG-2) frame-based coding both at an objective level (by comparing PSNR/quality and bitrate/filesize) and at a subjective level (asking a set of users to express their opinion on overall quality, cognitive effectiveness, and willingness to pay). In conclusion, this paper aims at answering the question of whether or not it is convenient to use object shape coding instead of frame-based coding.

2005 - Posture Classification in a Multi-camera Indoor Environment [Relazione in Atti di Convegno]
Cucchiara, R.; Prati, A.; Vezzani, R.
abstract

Posture classification is a key process for analyzing the people’s behaviour. Computer vision techniques can be helpful in automating this process, but cluttered environments and consequent occlusions make this task often difficult. Different views provided by multiple cameras can be exploited to solve occlusions by warping known object appearance into the occluded view. To this aim, this paper describes an approach to posture classification based on projection histograms, reinforced by HMM for assuring temporal coherence of the posture. The single camera posture classification is then exploited in the multi-camera system to solve the cases in which the occlusions make the classification impossible. Experimental results of the classification from both the single camera and the multi-camera system are provided.

2005 - Predictive and Probabilistic Tracking to Detect Stopped Vehicles [Relazione in Atti di Convegno]
Melli, Rudy Mirko; Cucchiara, Rita; Prati, Andrea; L., DE COCK
abstract

Many techniques and models have been proposed for vehicles surveillance in highways. In the past, tracking algorithms based on Kalman filter have been largely used for their efficiency in the prediction and low computational cost. However, predictive filters can not solve long-lasting occlusions. In this paper, we propose a new mixed predictive and probabilistic tracking that exploits the advantages of predictive filters for moving vehicles and adopts probabilistic and appearance-based tracking for stopped vehicles. The proposed tracking is part of a complete video surveillance system, oriented to control tunnels and highways from cluttered views, that is implemented in an embedded DSP platform and provides background suppression, a novel shadow detection algorithm, tracking, and scene recognition module. The experimental results are obtained over several hours of videos acquired in pre-existing platforms of CCTV surveillance systems.

2005 - Probabilistic posture classification for human-behavior analysis [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Vezzani, Roberto
abstract

Computer vision and ubiquitous multimedia access nowadays make feasible the development of a mostly automated system for human-behavior analysis. In this context, our proposal is to analyze human behaviors by classifying the posture of the monitored person and, consequently, detecting corresponding events and alarm situations, like a fall. To this aim, our approach can be divided into two phases: for each frame, the projection histograms (Haritaoglu et al., 1998) of each person are computed and compared with the probabilistic projection maps stored for each posture during the training phase; then, the obtained posture is further validated exploiting the information extracted by a tracking module in order to take into account the reliability of the classification of the first phase. Moreover, the tracking algorithm is used to handle occlusions, making the system particularly robust even in indoor environments. Extensive experimental results demonstrate a promising average accuracy of more than 95% in correctly classifying human postures, even in the case of challenging conditions.
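To make the idea of projection histograms concrete, the following minimal sketch counts foreground pixels per row and per column of a binary silhouette and assigns the posture whose stored histograms are closest. It is only an illustration under stated assumptions: the paper compares against probabilistic projection maps learned in training, whereas this example substitutes a plain L1 distance to a single template per posture, and all function names are hypothetical. Masks are assumed to be already scaled to a common size.

```python
def projection_histograms(mask):
    """Return (per_row, per_column) foreground counts of a binary mask.

    mask: list of equally long rows containing 0 (background) or 1 (person).
    """
    per_row = [sum(row) for row in mask]
    per_column = [sum(col) for col in zip(*mask)]
    return per_row, per_column

def histogram_distance(a, b):
    """L1 distance between two histograms after normalising each to sum 1."""
    sa, sb = sum(a) or 1, sum(b) or 1
    return sum(abs(x / sa - y / sb) for x, y in zip(a, b))

def classify_posture(mask, templates):
    """Pick the posture whose template histograms are closest to the mask.

    templates: dict mapping a posture name to a (per_row, per_column) pair.
    This nearest-template rule is a stand-in for the probabilistic maps.
    """
    rows, cols = projection_histograms(mask)

    def score(name):
        t_rows, t_cols = templates[name]
        return histogram_distance(rows, t_rows) + histogram_distance(cols, t_cols)

    return min(templates, key=score)
```

A tall, narrow silhouette concentrates its mass in few columns and spreads it over many rows, while a lying silhouette does the opposite, which is what lets the two histograms separate postures.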

2005 - Real Time Semantic Adaptation of Sports Video with User-centred Performance Analysis [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

Semantic video adaptation improves traditional adaptation by taking into account the degree of relevance of the different portions of the content. It employs solutions to detect the significant parts of the video and applies different compression ratios to elements that have different importance. Performance of semantic adaptation heavily depends on the quality and precision of the automatic annotation, whether it operates in strict or nonstrict real time, and the codec which is used to perform adaptation at the event or object level. It should consider the effects of the errors in the automatic extraction of objects and events over the operation of the adaptation subsystem, and relate these effects to the preferences for the objects and events of the video program that have been decided by the user. In this paper, we present strict real time annotation and adaptation of sports video and introduce two new performance measures: Viewing Quality Loss and Bit-rate Cost Increase, that are obtained from classical PSNR and Bit Rate, but relate the results of semantic adaptation with the user’s preferences and expectations.

2005 - Shot Detection for Formula 1 Video Digital Libraries [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Tardini, Giovanni
abstract

Metadata extraction is one of the first tasks to be performed for automatic Digital Library annotation, and in particular shot detection has been widely explored in literature. While a lot of methods have been proposed for the detection of abrupt cuts, only a small number of them have explicitly addressed the problem of gradual transitions. In this paper we propose an algorithm that exploits a precise model of linear transition. Experimental results on Formula 1 car race videos show the robustness of this method. These test videos are characterized by extreme situations such as fast camera and object motion and very different kinds of shots. The algorithm is able to estimate the exact length of the transition and an error score is also given as a fitness measure to the linear model, to discriminate true transitions from false detections. The final shot segmentation is delivered as an MPEG-7 compliant output.
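The abstract does not spell out the linear model or its error score, so the sketch below shows one generic way such a fitness measure could look: each intermediate frame of a candidate transition is compared with a linear blend of the first and last frames, and the mean absolute deviation is returned. The function name, the choice of endpoints as references, and the use of mean absolute deviation are assumptions for illustration only.

```python
def linear_transition_error(frames):
    """Mean absolute deviation of frames from a linear blend of the endpoints.

    frames: list of frames, each a flat list of pixel intensities of equal
    length. A value near 0 means the sequence is well explained by a linear
    transition from frames[0] to frames[-1].
    """
    n = len(frames) - 1
    if n < 1:
        raise ValueError("need at least two frames")
    first, last = frames[0], frames[-1]
    total, count = 0.0, 0
    for t, frame in enumerate(frames):
        alpha = t / n  # blend weight grows linearly from 0 to 1
        for a, b, observed in zip(first, last, frame):
            expected = (1 - alpha) * a + alpha * b
            total += abs(observed - expected)
            count += 1
    return total / count
```

Under this assumption, a true dissolve gives a score close to zero, while an abrupt cut placed inside the same window gives a clearly larger score, so a threshold on the score can reject false detections.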

2005 - Shot detection and motion analysis for automatic MPEG-7 annotation of sports videos [Relazione in Atti di Convegno]
Tardini, Giovanni; Grana, Costantino; R., Marchi; Cucchiara, Rita
abstract

In this paper we describe general algorithms that are devised for MPEG-7 automatic annotation of Formula 1 videos, and in particular for camera-car shot detection. We employed a shot detection algorithm suitable for cuts and linear transitions detection, which is able to precisely detect both the transition's center and length. Statistical features based on MPEG motion compensation vectors are then employed to provide motion characterization, using a subset of the motion types defined in MPEG-7, and shot type classification. Results on shot detection and classification are provided.

2005 - T_PARK: Ambient Intelligence for Security in Public Parks [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; L., Benini; E., Farella
abstract

In this paper, we present joint research activities in computer vision and sensor networks for a distributed surveillance of urban parks. Distributed visual surveillance of urban environments is one of the most interesting scenarios in Ambient Intelligence; in addition, the automated monitoring of public parks, often crowded by children and adults, is still a very difficult task due to the number of objects of interest. In this context, integrating the power of low cost sensors with the information provided by cameras can lead to a more reliable solution to people tracking in wide areas. Specifically, the deficiencies of one approach can be (at least partially) covered by the advantages of the other. The goal is to perform people tracking in parks (to achieve trackable parks - T-Parks), both in zones covered by overlapped cameras and also, thanks to sensors, in areas not covered by any camera. In this paper, we propose a new technique for multi-camera people tracking based on a learning phase to automatically calibrate pairs of cameras and to build Areas of Field Views (AoFoVs) in order to establish consistent labelling of people. In addition, sensor networks distributed at the borders of the AoFoV give an estimation of the probability of people overlapping, triggering specific algorithms of face detection or head counting to identify the single person. The research of T-Parks is part of a two-year Italian project called LAICA, intended to provide advanced services for citizens and public officers based on ambient intelligence technologies.

2005 - Video Annotation with Pictorially Enriched Ontologies [Relazione in Atti di Convegno]
C., Torniai; A., DEL BIMBO; Cucchiara, Rita; M., Bertini
abstract

Video annotation is typically performed by classifying video elements according to some pre-defined ontology of the video content domain. Ontologies are defined by establishing relationships between linguistic terms that specify domain concepts at different abstraction levels. However, although linguistic terms are appropriate to distinguish event and object categories, they are inadequate when they must describe specific patterns of events or video entities. Instead, in these cases, pattern specifications are better expressed through visual prototypes that capture the essence of the event or entity. Pictorially enriched ontologies, which include visual concepts together with linguistic keywords, are therefore needed to support video annotation up to the level of detail of pattern specification. This paper presents pictorially enriched ontologies and provides a solution for their implementation in the soccer video domain. The pictorially enriched ontology is used both to directly assign multimedia objects to concepts, providing a more meaningful definition than the linguistic terms, and to extend the initial knowledge of the domain, adding subclasses of highlights or new highlight classes that were not defined in the linguistic ontology. Automatic annotation of soccer clips up to the pattern specification level using a pictorially enriched ontology is discussed.

2005 - Video understanding and content-based retrieval [Relazione in Atti di Convegno]
Y., Zhai; J., Liu; X., Cao; A., Basharat; A., Hakeem; S., Ali; M., Shah; Grana, Costantino; Cucchiara, Rita
abstract

This year, the joint team of UCF and the University of Modena has participated in the following tasks: (1) shot boundary detection, (2) low-level feature extraction, (3) high-level feature extraction, (4) topic search and (5) BBC rushes management. The shot boundary detection was contributed by the Image Lab at the University of Modena. The other tasks were performed by the Computer Vision Team at UCF.

2004 - An Intelligent Surveillance System for Dangerous Situation Detection in Home Environments [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglu et al. [5] that uses histogram projections to classify people’s posture. Projection histograms are here exploited as the main feature for the posture classification, but, differently from [5], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to solve perspective problems and to improve the precision of the classification. Furthermore, we make use of a finite state machine to detect dangerous situations such as falls and to activate a suitable alarm generator. The system works on-line on standard workstations with network cameras.

2004 - An image analysis approach for automatically re-orienteering CT images for dental implants [Articolo su rivista]
Cucchiara, Rita; Lamma, E; Sansoni, T.
abstract

In the last decade, computerized tomography (CT) has become the most frequently used imaging modality to obtain a correct pre-operative implant planning. In this work, we present an image analysis and computer vision approach able to identify, from the reconstructed 3D data set, the optimal cutting plane specific to each implant to be planned, in order to obtain the best view of the implant site and to have correct measures. If the patient requires more implants, different cutting planes are automatically identified, and the axial and cross-sectional images can be re-oriented according to each of them. In the paper, we describe the algorithms defined to recognize 3D markers (each one aligned with a missing tooth for which an implant has to be planned) in the 3D reconstructed space, and the results in processing real exams, in terms of effectiveness and precision and reproducibility of the measure.

2004 - Automated extraction and description of dark areas in surface microscopy melanocytic lesion images [Articolo su rivista]
Pellacani, Giovanni; Grana, Costantino; Cucchiara, Rita; Seidenari, Stefania
abstract

Background: Identification of dark areas inside a melanocytic lesion (ML) is of great importance for melanoma diagnosis, both during clinical examination and employing programs for automated image analysis. Objective: The aim of our study was to compare two different methods for the automated identification and description of dark areas in epiluminescence microscopy images of MLs and to evaluate their diagnostic capability. Methods: Two methods for the automated extraction of 'absolute' (ADAs) and 'relative' dark areas (RDAs) and a set of parameters for their description were developed and tested on 339 images of MLs acquired by means of a polarized-light videomicroscope. Results: Significant differences in dark area distribution between melanomas and nevi were observed employing both methods, permitting a good discrimination of MLs (diagnostic accuracy = 74.6 and 71.2% for ADAs and RDAs, respectively). Conclusions: Both methods for the automated identification of dark areas are useful for melanoma diagnosis and can be implemented in programs for image analysis.

2004 - Color Calibration for a Dermatological Video Camera System [Relazione in Atti di Convegno]
Grana, Costantino; Pellacani, Giovanni; Seidenari, Stefania; Cucchiara, Rita
abstract

In this work, we describe a technique to calibrate images for skin analysis in dermatology. Using a common reference we correct non-uniform illumination effects, give an estimation of the gamma correction and produce an XYZ conversion matrix. The final result is then reverted to a non-standard RGB color space, built from the instrument images. In this way different instruments behave uniformly, allowing colorimetric characterization while improving the results of common algorithms. The proposed techniques should be the initial support for a distributed framework where dermatological images can be consistently compared.

2004 - Content-based Video Adaptation with User's Preference [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

In this paper we present an integrated system that has been designed to support automatic semantic extraction of highlights in sports video and automatic video adaptation according to user’s preferences. To analyze the user’s satisfaction, we propose a new performance measure that explicitly takes into account the user’s preferences and considers the number and type of errors produced by the annotation engine and the way in which these errors affect the compressed video quality and bandwidth allocation. We provide experimental results with application to soccer and swimming.

2004 - Introduction to the special section on in-vehicle computer vision systems [Articolo su rivista]
Cucchiara, Rita; D., Lovell; Prati, Andrea; M. M., Trivedi
abstract

2004 - Neighbor cache prefetching for multimedia image and video processing [Articolo su rivista]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, able to improve exploitation of bidimensional spatial locality. A performance comparison is provided against other assessed prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results prove that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4 with respect to 75% of the second performing technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.

2004 - Object-based and Event-based Semantic Video Adaptation [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

Semantic video adaptation makes it possible to transmit video content with different viewing quality, depending on the relevance of the content from the user’s viewpoint. To this end, an automatic annotation subsystem must be employed that automatically detects relevant objects and events in the video stream. In this paper we present a composite framework that is made of an automatic annotation engine and a semantics-based adaptation module. Three new different compression solutions are proposed that work at the object or event level. Their performance is compared according to a new measure that takes into account the user’s satisfaction and the effects on it of the errors in the annotation module.

2004 - Objects and Events Recognition for Sport Videos Transcoding [Relazione in Atti di Convegno]
M., Bertini; A., DEL BIMBO; Prati, Andrea; Cucchiara, Rita
abstract

2004 - Probabilistic People Tracking for Occlusion Handling [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Tardini, Giovanni; Vezzani, Roberto
abstract

This work presents a novel people tracking approach, able to cope with frequent shape changes and large occlusions. In particular, the tracks are described by means of probabilistic masks and appearance models. Occlusions due to other tracks or due to background objects and false occlusions are discriminated. The tracking system is general enough to be applied with any motion segmentation module, it can track people interacting with each other and it maintains the pixel assignment to track even with large occlusions. At the same time, the update model is very reactive, so as to cope with sudden body motion and silhouette's shape changes. Due to its robustness, it has been used in many experiments of people behavior control in indoor situations.

2004 - Real-time motion segmentation from moving cameras [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

This paper describes our approach to real-time detection of camera motion and moving object segmentation in videos acquired from moving cameras. As far as we know, none of the proposals reported in the literature are able to meet real-time requirements. In this work, we present an approach based on a color segmentation followed by a region-merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by a work of Gelgon and Bouthemy (Pattern Recognition 33 (2000) 725-40), that has been modified to reduce computational cost in order to achieve a fast segmentation (about 10 frames per second). To this aim a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of the motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.

2004 - Semantic Annotation and Transcoding for Sport Videos [Relazione in Atti di Convegno]
M., Bertini; A., DEL BIMBO; Prati, Andrea; Cucchiara, Rita
abstract

Telecommunication companies are demonstrating interest in providing mobile video services. The availability of larger bandwidth, and the improvements in terms of resolution of the displays of third generation mobile phones, let telecom and content provider companies provide new services to their customers. Among these services users can watch a certain number of sport videos, usually a selection of the best actions that occurred during a play. In order to provide a timely and satisfying service to customers there is need of tools and systems that help to detect and recognize the interesting events, and optimize the use of bandwidth, coding these events and the most interesting objects within them at the best visual quality/bandwidth ratio.

2004 - Semantic Annotation and Transcoding of Soccer Videos [Relazione in Atti di Convegno]
M., Bertini; A., DEL BIMBO; Cucchiara, Rita; Prati, Andrea
abstract

2004 - Semantic Transcoding of Videos by using Adaptive Quantization [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea
abstract

This paper proposes the use of an approach of video transcoding driven by the video content and provided with the adaptive quantization of MPEG standards. Computer vision techniques can extract semantics from videos according with user's interests: the video semantics is exploited to adapt the video in order to meet the device's capabilities and the user's requirements and preserve the best quality possible. Well assessed video analysis techniques are used to segment the video into objects grouped in classes of relevance to which the user can assign a weight proportional to their relevance. This weight is used to decide the quantization values to be applied in the MPEG-2 encoding to each macroblock. A modified version of the PSNR (Peak Signal-to-Noise Ratio) is used as performance metric and comparative evaluation is reported with respect to other coding standards such as JPEG, JPEG 2000, (basic) MPEG-2, and MPEG-4. Experimental results are provided on different situations, one indoor and one outdoor. Keywords: video transcoding, adaptive quantization, motion detection

2004 - Semantic Video Adaptation based on Automatic Annotation of Sport Videos [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

Semantic video adaptation improves traditional adaptation by taking into account the degree of relevance of the different portions of the content. It employs solutions to detect the significant parts of the video and applies different compression ratios to elements that have different importance. Performance of semantic adaptation heavily depends on the precision of the automatic annotation and the way of operation of the codec which is used to perform adaptation at the event or object level. In this paper, we discuss critical factors that affect performance of automatic annotation and define new performance measures of semantic adaptation, Viewing Quality Loss and Bitrate Cost Increase, that are obtained from classical PSNR and Bit Rate, but relate the results of semantic adaptation with the user’s preferences and expectations. The new measures are discussed in detail for a system of sport annotation and adaptation with reference to different user profiles.

2004 - Track-based and object-based occlusion for people tracking refinement in indoor surveillance [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Tardini, Giovanni
abstract

People tracking deals with problems of shape changes, self-occlusions and track occlusions due to other interfering tracks and fixed objects that hide parts of the people shape. These problems are more critical in indoor surveillance and in particular in home automation settings, in which the need to merge information obtained from different cameras distributed around the house calls for the integration of reliable data obtained over time. Therefore, tracking algorithms should be carefully tuned to cope with occlusions and shape changes, working not only at pixel level but also at region level. In this work we provide a novel technique for object tracking, based on probabilistic masks and appearance models. Occlusions due to other tracks or due to background objects and false occlusions are discriminated. The classification of occluded regions of the track is exploited in a selective model update. The tracking system is general enough to be applied with any motion segmentation module, it can track people interacting with each other and it maintains the pixel-to-track assignment even with large occlusions. At the same time, the model update is very reactive, so as to cope with sudden body motion and silhouette's shape changes. Due to its robustness, it has been used in different experiments of people behavior control in indoor situations.

2004 - Using computer vision techniques for dangerous situation detection in domotic applications [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Tardini, Giovanni; Vezzani, Roberto
abstract

We describe an integrated solution devised for in-house video surveillance, to control the safety of people living in a domestic environment. The system is composed of a robust moving object detection module, able to disregard shadows, a tracking module designed to handle large occlusions, and a posture detector. Shadows, large occlusions and deformable model of people are key features of in-house surveillance. Moreover, the requirements of high speed reaction to dangerous situations and the need to implement a reliable and low cost televiewing system led to the introduction of a new multimedia model of semantic transcoding, capable of supporting different user's requests and constraints of their devices (PDA, smart phones, ...). Our application context is the emerging area of domotics (from the Latin word domus that means "home" and informatics) and, in particular, indoor video surveillance of the house where people with some difficulties (elders and disabled people) can now live with a sufficient degree of autonomy, thanks to the strong interaction with the new technologies that can be distributed in the house with affordable costs and high reliability.

2003 - A Hough transform-based method for radial lens distortion correction [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati; Vezzani, Roberto
abstract

The paper presents an approach for a robust (semi-)automatic correction of radial lens distortion in images and videos. This method, based on the Hough transform, is also applicable to videos from unknown cameras that, consequently, cannot be calibrated a priori. We approximated the lens distortion by considering only the lower-order term of the radial distortion. Thus, the method relies on the assumption that pure radial distortion transforms straight lines into curves. The computation of the best value of the distortion parameter is performed in a multi-resolution way. The precision of the method depends on the scale of the multi-resolution and on the resolution of the Hough space. Experiments are provided for both an outdoor, uncalibrated camera and an indoor, calibrated one. The stability of the value found in different frames of the same video demonstrates the reliability of the proposed method.
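
To make the idea in this abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-parameter correction model p_u = c + (p_d - c)(1 + k r_d^2) and, as a deliberate simplification, scores straightness with a least-squares line fit instead of a Hough accumulator. The function names, the search range and the grid sizes are all hypothetical.

```python
import math

def correct_radial(points, k, center=(0.0, 0.0)):
    """Apply the assumed model p_u = c + (p_d - c) * (1 + k * r_d^2)."""
    cx, cy = center
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        scale = 1.0 + k * (dx * dx + dy * dy)
        out.append((cx + dx * scale, cy + dy * scale))
    return out

def straightness_error(points):
    """Smallest eigenvalue of the 2x2 scatter matrix (zero for collinear points)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    half_trace = (sxx + syy) / 2.0
    return half_trace - math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)

def search_k(points, k_min, k_max, levels=4, steps=11):
    """Coarse-to-fine grid search for the k that makes the points straightest."""
    best_k = k_min
    for _ in range(levels):
        step = (k_max - k_min) / (steps - 1)
        candidates = [k_min + i * step for i in range(steps)]
        best_k = min(
            candidates,
            key=lambda k: straightness_error(correct_radial(points, k)),
        )
        # Narrow the interval around the current best value and refine.
        k_min, k_max = best_k - step, best_k + step
    return best_k
```

Each refinement level shrinks the search interval around the best candidate, which is one simple way to read "multi-resolution" here; the paper may organise the search differently.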

2003 - A machine learning approach for human posture detection in domotics applications [Relazione in Atti di Convegno]
L., Panini; Cucchiara, Rita
abstract

This paper describes an approach for human posture classification that has been devised for indoor surveillance in domotic applications. The approach was initially inspired by a previous work of Haritaoglou et al. [2] that uses histogram projections to classify people’s posture. We modify and improve the generality of the approach by adding a machine learning phase in order to generate probability maps. A statistical classifier is then defined that compares the probability maps and the histogram profiles extracted from each moving person. The approach proves to be very robust if the initial constraints are satisfied and exhibits a very low computational time, so that it can be used to process live videos on standard platforms.
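
As a hedged illustration of the histogram projections mentioned in this abstract, the sketch below computes row and column projections of a binary silhouette mask. The function name and the mask representation (a list of rows of 0/1 values) are assumptions made for the example; the learning of probability maps and the classifier itself are not shown.

```python
def projection_histograms(mask):
    """Return (row_counts, col_counts) for a binary silhouette mask.

    row_counts[r] is the number of foreground pixels in row r;
    col_counts[c] is the number of foreground pixels in column c.
    """
    row_counts = [sum(row) for row in mask]
    col_counts = [sum(col) for col in zip(*mask)]
    return row_counts, col_counts
```

A standing silhouette tends to give a tall, narrow column profile and a lying one a wide, flat profile, which is the kind of cue such projections expose.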

2003 - A new algorithm for border description of polarized light surface microscopic images of pigmented skin lesions [Articolo su rivista]
Grana, Costantino; Pellacani, Giovanni; Cucchiara, Rita; Seidenari, Stefania
abstract

The aim of this study was to provide mathematical descriptors for the border of pigmented skin lesion images and to assess their efficacy for distinction among different lesion groups. New descriptors such as lesion slope and lesion slope regularity are introduced and mathematically defined. A new algorithm based on the Catmull-Rom spline method and the computation of the gray-level gradient of points extracted by interpolation of normal direction on spline points was employed. The efficacy of these new descriptors was tested on a data set of 510 pigmented skin lesions, comprising 85 melanomas and 425 nevi, by employing statistical methods for discrimination between the two populations.

2003 - Camera-car Video Analysis for Steering Wheel's Tracking [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati; F., Vigetti; M., Piccardi
abstract

Monitoring and controlling the driver’s guidance by analyzing the rotation applied to the steering wheel can be a very important task in order to improve safety. This paper proposes a general-purpose method to track the steering wheel’s absolute angle by using a single camera vision system mounted inside the car. The absolute angle is computed by accumulating inter-frame relative rotations, and error propagation is prevented with an alignment process. The approach is based on the modeling of the motion of the steering wheel, as it appears perspectively distorted from the point of view of the uncalibrated camera. We modified the Lucas-Kanade method for an approximately rotational motion model in order to provide the detection and tracking of significant features on the wheel. The experimental results are compared with ground-truthed data obtained with different types of sensors.

2003 - Computer Vision Techniques for PDA Accessibility of In-House Video Surveillance [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati; Vezzani, Roberto
abstract

In this paper we propose an approach to indoor environment surveillance and, in particular, to people behaviour control in a home automation context. The reference application is a silent and automatic control of the behaviour of people living alone in the house, specially conceived for people with limited autonomy (e.g., elders or disabled people). The aim is to detect dangerous events (such as a person falling down) and to react to these events by establishing a remote connection with low-performance clients, such as PDAs (Personal Digital Assistants). To this aim, we propose an integrated server architecture, typically connected in intranet with network cameras, able to segment and track objects of interest; in the case of objects classified as people, the system must also evaluate the people's posture and infer possible dangerous situations. Finally, the system is equipped with a specifically designed transcoding server to adapt the video content to PDA requirements (display area and bandwidth) and to the user's requests. The main features of the proposal are a reliable real-time object detector and tracking module, a simple but effective posture classifier improved by a supervised learning phase, and a high-performance transcoding inspired by the MPEG-4 object-level standard, tailored to PDAs. Results on different video sequences and performance analysis are discussed.

2003 - Detecting moving objects, ghosts, and shadows in video streams [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Piccardi, Massimo; Prati, Andrea
abstract

Background subtraction methods are widely exploited for moving object detection in videos in many applications, such as traffic monitoring, human motion capture, and video surveillance. How to correctly and efficiently model and update the background model and how to deal with shadows are two of the most distinguishing and challenging aspects of such approaches. This work proposes a general-purpose method that combines statistical assumptions with the object-level knowledge of moving objects, apparent objects (ghosts), and shadows acquired in the processing of the previous frames. Pixels belonging to moving objects, ghosts, and shadows are processed differently in order to supply an object-based selective update. The proposed approach exploits color information for both background subtraction and shadow detection to improve object segmentation and background update. The approach proves fast, flexible, and precise in terms of both pixel accuracy and reactivity to background changes.
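
The object-based selective update described in this abstract can be sketched roughly as follows. This is an illustrative assumption rather than the published algorithm: the per-pixel label names, the use of a median over recent samples, and the choice to freeze the model under moving objects and shadows while refreshing it elsewhere (ghosts included) are all hypothetical.

```python
from statistics import median

def update_background(background, history, labels):
    """Selectively update a per-pixel background model.

    background[p]: current background value of pixel p
    history[p]:    recent samples observed at pixel p
    labels[p]:     'background', 'ghost', 'moving' or 'shadow' (assumed labels)
    """
    updated = []
    for bg, samples, label in zip(background, history, labels):
        if label in ("moving", "shadow"):
            # Do not let real objects or their shadows corrupt the model.
            updated.append(bg)
        else:
            # Background and ghost pixels are refreshed from recent samples.
            updated.append(median(samples))
    return updated
```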

2003 - Detecting moving shadows: Algorithms and evaluation [Articolo su rivista]
Prati, Andrea; I., Mikic; Mm, Trivedi; Cucchiara, Rita
abstract

Moving shadows need careful consideration in the development of robust dynamic scene analysis systems. Moving shadow detection is critical for accurate object detection in video streams since shadow points are often misclassified as object points, causing errors in segmentation and tracking. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, we present a comprehensive survey of moving shadow detection approaches. We organize contributions reported in the literature in four classes: two of them are statistical and two are deterministic. We also present a comparative empirical evaluation of representative algorithms selected from these four classes. Novel quantitative (detection and discrimination rate) and qualitative metrics (scene and object independence, flexibility to shadow situations, and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences. These video sequences and associated ground-truth data are made available at http://cvrr.ucsd.edu/aton/shadow to allow others in the community to experiment with new algorithms and metrics.
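
The abstract names the two quantitative metrics without defining them. The sketch below assumes one reading: the detection rate is the fraction of ground-truth shadow pixels labelled as shadow, and the discrimination rate is the fraction of ground-truth foreground pixels not mislabelled as shadow. The label characters 'S', 'F' and 'B' and the function name are hypothetical, and the paper should be consulted for the exact definitions.

```python
def shadow_rates(ground_truth, predicted):
    """Return (detection_rate, discrimination_rate) from per-pixel labels.

    Labels are 'S' (shadow), 'F' (foreground object) or 'B' (background).
    """
    pairs = list(zip(ground_truth, predicted))
    gt_shadow = [p for g, p in pairs if g == "S"]
    gt_foreground = [p for g, p in pairs if g == "F"]
    # Ground-truth shadow pixels correctly labelled as shadow.
    detection = sum(p == "S" for p in gt_shadow) / len(gt_shadow)
    # Ground-truth foreground pixels that were not swallowed by the shadow class.
    discrimination = sum(p != "S" for p in gt_foreground) / len(gt_foreground)
    return detection, discrimination
```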

2003 - Domotics for disability: smart surveillance and smart video server [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of human posture classification, focusing in particular on an indoor surveillance application. The approach was initially inspired by a previous work of Haritaoglou et al. [6] that uses histogram projections to classify people’s posture. Projection histograms are here exploited as the main feature for the posture classification but, differently from [6], we propose a supervised statistical learning phase to create probability maps adopted as posture templates. Moreover, camera calibration and homography are included to resolve perspective problems and improve the precision of classification. Furthermore, we make use of a finite state machine to detect dangerous situations such as falls and to activate a suitable alarm generator. The system works on-line on a standard workstation with network cameras.
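
A finite state machine of the kind mentioned in this abstract could look like the sketch below. The states, the posture labels and the persistence threshold are hypothetical choices for illustration only; the abstract does not describe the actual machine.

```python
def detect_fall(postures, min_lying_frames=3):
    """Return the frame index at which an alarm would fire, or None.

    postures: per-frame posture labels such as 'standing' or 'lying'.
    An alarm fires once 'lying' persists for min_lying_frames frames.
    """
    state, lying_count = "upright", 0
    for i, posture in enumerate(postures):
        if state == "upright":
            if posture == "lying":
                state, lying_count = "possible_fall", 1
        elif state == "possible_fall":
            if posture == "lying":
                lying_count += 1
            else:
                # The person got up again: treat it as a false alarm.
                state, lying_count = "upright", 0
        if state == "possible_fall" and lying_count >= min_lying_frames:
            return i
    return None
```

Requiring the lying posture to persist for a few frames is one simple way to filter out single-frame classification errors.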

2003 - Image Representation and Retrieval with Topological Trees [Relazione in Atti di Convegno]
Grana, Costantino; Pellacani, Giovanni; Seidenari, Stefania; Cucchiara, Rita
abstract

Typical processes of image representation comprise an initial region segmentation followed by a description of the single regions’ features and their relationships. Then a graph model can be exploited in order to integrate the knowledge of the specific regions (that are the attributed relational graph’s (ARG) nodes) and the regions’ relations (that are the ARG’s edges). In this work we use color features to guide region segmentation, geometric features to characterize regions one by one and topological features (and in particular inclusion) to describe regions’ relationships. Guided by the inclusion property we define the Topological Tree (TT) as an image representation model that, exploiting the transitive property of inclusion, uses the adjacency and inclusion topological features. We propose an approach based on a recursive version of fuzzy c-means to construct the topological tree directly from the initial image, performing both segmentation and TT construction. The TT can be exploited in many applications of image analysis and image retrieval by similarity in those contexts where inclusion is a key feature: we propose an applicative case of analysis of dermatological images to support melanoma diagnosis. In this paper we describe details of the TT algorithm, including the management of non-ideal cases and an approximate measure of tree similarity in order to retrieve skin lesions with a similar TT-based description.

2003 - Improving data prefetching efficacy in multimedia applications [Articolo su rivista]
Cucchiara, Rita; Prati, Andrea; M., Piccardi
abstract

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

2003 - Object Segmentation in Videos from Moving Camera with MRFs on Color and Motion Features [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; Vezzani, Roberto
abstract

In this paper we address the problem of fast segmentation of moving objects in video acquired by a moving camera or, more generally, with a moving background. We present an approach based on a color segmentation followed by a region merging on motion through Markov Random Fields (MRFs). The technique we propose is inspired by a work of Gelgon and Bouthemy [6], which has been modified to reduce computational cost in order to achieve a fast segmentation (about ten frames per second). To this aim a modified region matching algorithm (namely Partitioned Region Matching) and an innovative arc-based MRF optimization algorithm with a suitable definition of the motion reliability are proposed. Results on both synthetic and real sequences are reported to confirm the validity of our solution.

2003 - Object and Event Detection for Semantic Annotation and Transcoding [Relazione in Atti di Convegno]
M., Bertini; Cucchiara, Rita; A., DEL BIMBO; Prati, Andrea
abstract

Video annotation provides a suitable way to describe, organize, and index stored videos. On the other hand, transcoding aims at adapting content to the used client capabilities and requirements. Both cues are now mandatory, given the tremendous demand of multimedia access from remote clients, in particular nowadays that new terminals with limited resources (PDAs, HCCs, Smart phones) have access to the network. In this paper we propose a unified framework to define event-based and object-based semantic extraction from video to provide both semantic video annotation for stored video and semantic on-line transcoding from live cameras. Two case studies (highlights’ extraction from soccer videos for the annotation and people behavior detection in domotic application for transcoding) and corresponding experimental results are reported.

2003 - Proceedings of 1st Workshop on “In-Vehicle (Cognitive) Computer Vision Systems” [Curatela]
Cucchiara, Rita; M., Trivedi; Prati, Andrea
abstract

2003 - Semantic video transcoding using classes of relevance [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea
abstract

In this work we present a framework for on-the-fly video transcoding that exploits computer vision-based techniques to adapt the Web access to the user requirements. The proposed transcoding approach aims at coping with both user bandwidth and resources capabilities, and with user interests in the video's content. We propose an object-based semantic transcoding that, according to the user-defined classes of relevance, applies different transcoding techniques to the objects segmented in a scene. Object extraction is provided by on-the-fly video processing, without manual annotation. Multiple transcoding policies are reviewed and a performance evaluation metric based on the Weighted Mean Square Error (and corresponding PSNR), that takes into account the perceptual user requirements by means of classes of relevance, is defined. Results are analyzed by varying transcoding techniques, bandwidth requirements and video types (with indoor and outdoor scenes), showing that the use of semantics can dramatically improve the bandwidth to distortion ratio.
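
To illustrate a class-weighted error measure of the kind this abstract describes, here is a minimal sketch. The exact weighting used in the paper is not given in the abstract, so the formula below (a weighted sum of per-class mean square errors, with PSNR computed against a peak of 255) is an assumption, and all names are hypothetical.

```python
import math

def weighted_mse(original, transcoded, class_map, weights):
    """Weighted sum over classes of the MSE restricted to each class.

    original, transcoded: flat sequences of pixel intensities
    class_map:            class label of each pixel
    weights:              relevance weight per class (assumed to sum to 1)
    """
    sq_err = {c: 0.0 for c in weights}
    count = {c: 0 for c in weights}
    for o, t, c in zip(original, transcoded, class_map):
        sq_err[c] += (o - t) ** 2
        count[c] += 1
    # Classes with no pixels in this frame contribute nothing.
    return sum(weights[c] * sq_err[c] / count[c] for c in weights if count[c])

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given (weighted) MSE."""
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

With a high weight on a relevant class, errors on that class dominate the score, so a policy that degrades only the background is penalised less than one that degrades the objects of interest.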

2003 - Steering wheel's angle tracking from camera-car [Relazione in Atti di Convegno]
Cucchiara, Rita; Prati, Andrea; F., Vigetti
abstract

This paper proposes a general-purpose method to track the steering wheel’s absolute angle by using a single camera vision system mounted inside the car. The approach is based on the modeling of the motion of the steering wheel, as it appears perspectively distorted from the point of view of the uncalibrated camera. We modified the Lucas-Kanade method for an approximately rotational motion model in order to provide the detection and tracking of significant features on the wheel. The experimental results are compared with ground-truthed data obtained with different types of sensors.

2003 - Tuning range image segmentation by genetic algorithm [Articolo su rivista]
G., Pignalberi; Cucchiara, Rita; L., Cinque; S., Levialdi
abstract

Several range image segmentation algorithms have been proposed, each one to be tuned by a number of parameters in order to provide accurate results on a given class of images. Segmentation parameters are generally affected by the type of surfaces (e.g., planar versus curved) and the nature of the acquisition system (e.g., laser range finders or structured light scanners). It is impossible to answer the question, which is the best set of parameters given a range image within a class and a range segmentation algorithm? Systems proposing such a parameter optimization are often based either on careful selection or on solution space-partitioning methods. Their main drawback is that they have to limit their search to a subset of the solution space to provide an answer in acceptable time. In order to provide a different automated method to search a larger solution space, and possibly to answer more effectively the above question, we propose a tuning system based on genetic algorithms. A complete set of tests was performed over a range of different images and with different segmentation algorithms. Our system provided a particularly high degree of effectiveness in terms of segmentation quality and search time.

2002 - A Decision Support System for Range Image Segmentation [Relazione in Atti di Convegno]
L., Cinque; Cucchiara, Rita; S., Levialdi; G., Pignalberi
abstract

2002 - A Framework for Semantic Video Transcoding [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati
abstract

In this work we present a transcoding framework and an object-based technique to adapt live and stored videos to the user bandwidth and resources capabilities. Multiple transcoding policies are reviewed and a performance evaluation metric based on the Weighted Mean Square Error that allows different classes of relevance is presented. We present results for different transcoding policies and for different bandwidth requirements, showing that the use of semantics can improve the bandwidth to distortion ratio.

2002 - Building the Topological Tree by Recursive FCM Color Clustering [Abstract in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; Prati, Andrea; Seidenari, Stefania; Pellacani, Giovanni
abstract

In this paper we define a Topological Tree (TT) as a knowledge representation method that aims to describe important visual and spatial features of image regions, namely the color similarity, the inclusion and the spatial adjacency. The topological tree exhibits some interesting properties that can be exploited to extract knowledge from images for information retrieval, image understanding and diagnosis purposes. Examples of applications in dermatology are described. The TT can be constructed after segmentation, by computing the spatial relationships of regions or can be generated directly during the segmentation: to this aim we present a novel recursive fuzzy c-means (FCM) clustering algorithm based on the Principal Component Analysis of the color space. The recursive FCM proves to be effective for underlining the adjacency and inclusion property of regions.

2002 - Data-type Dependent Cache Prefetching for MPEG Applications [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

Data cache prefetching is an effective technique to improve performance of cache memories, whenever the prefetching algorithm is able to correctly predict useful data to be prefetched. To this aim, adequate information on the program’s data locality must be used by the prefetching algorithm. In particular, multimedia applications are characterized by a substantial amount of image and video processing, which exhibits spatial locality in both the dimensions of the 2D data structures used for images and frames. However, in multimedia programs many memory references are made also to non-image data, characterized by standard spatial locality. In this work, we explore the adoption of different prefetching techniques depending on the data type (i.e., image and non-image), thus making it possible to tune the prefetching algorithms to the different forms of locality, and achieving overall performance optimization. In order to prevent interference between the two different data types, a split cache with two separate caches for image and non-image data is also evaluated as an alternative to a standard unified cache. Results on a multimedia workload (MPEG-2 and MPEG-4 decoders) show that standard prefetching techniques such as One-block-lookahead and the Stride Prediction Table are effective for standard data, while novel 2D prefetching techniques perform best on image data. In addition, for the same size, unified caches offer in general better performance than split caches, thanks to the more flexible allocation of a unified cache space.
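
One-block-lookahead, mentioned in this abstract, can be illustrated with a toy simulation. The sketch below is not the evaluation setup of the paper: it assumes a fully associative LRU cache, prefetches the next sequential block on every reference, and uses hypothetical block and cache sizes.

```python
from collections import OrderedDict

def count_misses(addresses, block_size=32, capacity=64, one_block_lookahead=False):
    """Count demand misses for a fully associative LRU cache of `capacity` blocks."""
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU
    misses = 0

    def insert(block):
        # Bring a block into the cache, evicting the LRU block if needed.
        if block not in cache:
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)

    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)  # refresh recency on a hit
        else:
            misses += 1
            insert(block)
        if one_block_lookahead:
            insert(block + 1)  # prefetch the next sequential block
    return misses
```

On a purely sequential scan the prefetcher hides every miss after the first, which is the standard spatial locality case; the 2D techniques in the abstract target accesses that also move across image rows.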

2002 - Detecting Moving Objects and their Shadows: An Evaluation with the PETS2002 Dataset [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati
abstract

This work presents a general-purpose method for moving visual object segmentation in videos and discusses results attained on sequences of PETS2002 datasets. The proposed approach, called Sakbot, exploits color and motion information to detect objects, shadows and ghosts, i.e. foreground objects with apparent motion. The method is based on background suppression in the color space. The main peculiarity of the approach is the exploitation of motion and shadow information to selectively update the background, improving the statistical background model with the knowledge of detected objects. The approach is able to detect Moving Visual Objects (MVOs), and stopped objects too, since the motion status is maintained at the level of tracking module. HSV color space is exploited for shadow detection in order to enhance both segmentation and background update. Time measures and precision performance analysis in tracking and counting people is provided for surveillance and monitoring purposes.
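
The abstract states that the HSV color space is exploited for shadow detection but gives no rule. The sketch below shows one plausible per-pixel test, assuming that a shadow darkens the background by a bounded factor while leaving hue and saturation nearly unchanged. The rule, the function name and every threshold value are assumptions for illustration, not the parameters used by Sakbot.

```python
def is_shadow(pixel_hsv, background_hsv, alpha=0.4, beta=0.9, tau_s=0.1, tau_h=30.0):
    """Classify a pixel as shadow by comparing it with the background in HSV.

    H is in degrees [0, 360); S and V are in [0, 1]. Thresholds are hypothetical.
    """
    h_i, s_i, v_i = pixel_hsv
    h_b, s_b, v_b = background_hsv
    if v_b <= 0:
        return False
    ratio = v_i / v_b
    # Hue is circular, so take the shorter way around the color wheel.
    hue_diff = abs(h_i - h_b) % 360.0
    hue_diff = min(hue_diff, 360.0 - hue_diff)
    return alpha <= ratio <= beta and abs(s_i - s_b) <= tau_s and hue_diff <= tau_h
```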

2002 - Development of a new program for image analysis of digital videomicroscopic images of pigmented skin lesions [Abstract in Rivista]
Seidenari, Stefania; Pellacani, Giovanni; Grana, Costantino; Cucchiara, Rita
abstract

Although an improvement of the diagnostic accuracy of pigmented skin lesions (PSL) has been achieved by the epiluminescence technique (ELM), the interpretation of ELM criteria is often confusing, especially for inexperienced observers. To enhance the reproducibility and accuracy of clinical judgement and the training of inexperienced operators, programs for PSL image analysis and algorithms for automatic diagnosis have been developed. The aim of our study was to develop a new program for PSL image analysis, able to describe different aspects of PSLs, and to test its descriptive capability on PSLs acquired by means of a digital videomicroscope (VMS 110A, Scalar Mitsubishi, Japan) using 20-fold magnification. After automatic border identification and barycentre determination, some geometric parameters, describing shape characteristics of the lesion, were calculated. A mathematical description of the border cut-off was obtained. The texture of the lesion was calculated by applying the co-occurrence matrix at different image resolutions. Dark areas and colour areas, referring to selected colour groups, were obtained and their aspect and distribution were mathematically defined and calculated. 281 common nevi and 117 melanomas were numerically described by our program and the capability of the mathematical parameters to distinguish between benign and malignant lesions was tested by means of discriminant analysis. Significant differences were observed for most parameters between different PSL populations. The automatic classification enabled the distinction between melanomas and nevi with 100% sensitivity and 82.9% specificity.

2002 - Exploiting color and topological features for region segmentation with recursive fuzzy c-means [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Seidenari, Stefania; Pellacani, Giovanni
abstract

In this paper we define a novel approach for image segmentation into regions which focuses on both visual and topological cues, namely color similarity, inclusion and spatial adjacency. Many color clustering algorithms have been proposed in the past for skin lesion images but none exploits explicitly the inclusion properties between regions. Our algorithm is based on a recursive version of fuzzy c-means (FCM) clustering algorithm in the 2D color histogram constructed by Principal Component Analysis (PCA) of the color space. The distinctive feature of the proposal is that recursion is guided by the evaluation of adjacency and mutual inclusion properties of extracted regions; then, the recursive analysis addresses only included regions or regions with a not-negligible size. This approach allows a coarse-to-fine segmentation which focuses the attention on the inner parts of the images, in order to highlight the internal structure of the object depicted in the image. This could be particularly useful in many applications, especially in the biomedical image analysis. In this work we apply the technique to the segmentation of skin lesions in dermatoscopic images. It could be a suitable support for the diagnosis of skin melanoma, since dermatologists are interested in the analysis of the spatial relations, the symmetrical positions and the inclusion of regions.
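
For readers unfamiliar with the clustering step this abstract builds on, the sketch below implements plain (non-recursive) fuzzy c-means on one-dimensional data using the standard membership and center updates. It is not the recursive, PCA-based algorithm of the paper; the function name, the fuzzifier m = 2 and the fixed iteration count are assumptions.

```python
def fuzzy_c_means(values, centers, m=2.0, iterations=50):
    """Plain 1-D fuzzy c-means.

    Returns (centers, memberships), where memberships[i][j] is the degree to
    which values[i] belongs to cluster j, as computed in the final iteration.
    """
    centers = list(centers)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # x coincides with a center: give it full membership there.
                zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [z / sum(zeros) for z in zeros]
            else:
                row = [
                    1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                    for d_j in dists
                ]
            memberships.append(row)
        # Each center becomes the membership-weighted mean of the data.
        centers = [
            sum((u[j] ** m) * x for u, x in zip(memberships, values))
            / sum(u[j] ** m for u in memberships)
            for j in range(len(centers))
        ]
    return centers, memberships
```

A recursive variant in the spirit of the abstract would re-run such a clustering inside selected regions, but the region selection criteria are specific to the paper and are not reproduced here.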

2002 - Improvement in range segmentation parameters tuning [Relazione in Atti di Convegno]
Cinque, L.; Corzani, F.; Levialdi, S.; Cucchiara, R.; Pignalberi, G.
abstract

Great effort has been made in recent years to improve range image segmentation results. The efficacy of the algorithms is affected by parameter tuning. In this work two well-known search techniques have been applied to this task: genetic algorithms and simulated annealing. These techniques are adopted in cascade: the former to obtain a rough seed point set and the latter to obtain a more precise refinement of suitable solutions. We directed our efforts towards the range segmenter proposed by the University of Bern, which seems to be the best in terms of versatility, being able to segment planar and curved surfaces, and in terms of speed and quality of the performed segmentations. © 2002 IEEE.

2002 - Iterative fuzzy clustering for detecting regions of interest in skin lesions [Articolo su rivista]
Cucchiara, Rita; Grana, Costantino; Piccardi, Massimo
abstract

Image analysis tools have been spreading in dermatology since the introduction of dermoscopy (epiluminescence microscopy), in an effort to algorithmically reproduce clinical evaluations. Color-based region segmentation of skin lesions is one of the key steps for correctly collecting statistics that can help clinicians in their diagnosis. Nevertheless, an efficient and accurate region segmentation algorithm has not been proposed in the literature yet. This work proposes an iterative fuzzy c-means clustering algorithm based on PCA with the Karhunen-Loève transform of the color space. A topological tree is provided to store the mutual inclusions of the regions and then used to summarize the structural properties of the skin lesion. Preliminary experimental results are presented and discussed.

2002 - Performance analysis of MPEG-4 decoder and encoder [Relazione in Atti di Convegno]
F., Cavalli; Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

In this paper, a performance analysis of MPEG-4 encoder and decoder programs on a standard personal computer is presented. The paper first describes the MPEG-4 computational load and discusses related works, then outlines the performance analysis. Experimental results show that while the decoder program can be easily executed in real time, the encoder requires execution times in the order of seconds per frame, which calls for substantial optimisation to satisfy the real-time constraints.

2002 - Semantic Transcoding for Live Video Server [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; A., Prati
abstract

In this paper we present transcoding techniques for a video server architecture that enables the user to access live video streams by using different devices with different capabilities. For live videos, annotation methods cannot be exploited. Instead we propose methods of on-the-fly transcoding that adapt the video content with respect to the user resources and the video semantics. Thus we propose an object-based transcoding with "classes of relevance" (for instance People, Face and Background). To compare the different strategies we propose a metric based on the Weighted Mean Square Error that allows the analysis of different application scenarios by means of a class-wise distortion measure. The obtained results show that the use of semantics can improve the bandwidth to distortion ratio significantly.

2002 - Semantic transcoding for live video server [Relazione in Atti di Convegno]
Cucchiara, R.; Grana, C.; Prati, A.
abstract

In this paper we present transcoding techniques for a video server architecture that enables the user to access live video streams by using different devices with different capabilities. For live videos, annotation methods cannot be exploited. Instead we propose methods of on-the-fly transcoding that adapt the video content with respect to the user resources and the video semantics. Thus we propose an object-based transcoding with "classes of relevance" (for instance People, Face and Background). To compare the different strategies we propose a metric based on the Weighted Mean Square Error that allows the analysis of different application scenarios by means of a class-wise distortion measure. The obtained results show that the use of semantics can improve the bandwidth to distortion ratio significantly.

2002 - Using the Topological Tree for skin lesion structure description [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino
abstract

In this work we describe the Topological Tree (TT) as a knowledge representation method that relates some important visual and spatial features of image regions, namely the color similarity, the inclusion and the spatial adjacency. Starting from color-based region segmentation of an image into disjoint regions, their spatial relationships can be derived and described with graph-based methods. We are interested in the region’s property “to be included into” (in the sense of “surrounded by”) another region. This property could be very useful in biomedical imaging and in particular in the diagnosis of skin melanoma. The TT can be constructed after segmentation, by computing the spatial relationships of regions, or can be generated directly during the segmentation: to this aim we present a novel recursive fuzzy c-means (FCM) clustering algorithm based on the PCA of the color space. In the paper, in addition to the TT definition and the construction algorithm description, some results are presented and discussed.
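
The "included into" property lends itself to a tree because inclusion is transitive. The sketch below builds a parent map from a precomputed inclusion relation by choosing, for each region, its innermost surrounding region. It is a simplified illustration with hypothetical names: it ignores adjacency and color similarity, and it is not the recursive FCM construction described in the abstract.

```python
def build_inclusion_tree(containers):
    """Build a parent map from a transitively closed inclusion relation.

    containers[r] is the set of all regions that surround region r.
    Returns parent[r]: the innermost surrounding region, or None for a root.
    """
    parent = {}
    for region, outer in containers.items():
        # The innermost container is the one that is itself nested deepest.
        parent[region] = max(outer, key=lambda o: len(containers[o]), default=None)
    return parent
```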

2001 - A Methodology to Award a Score to Range Image Segmentation [Relazione in Atti di Convegno]
L., Cinque; Cucchiara, Rita; S., Levialdi; G., Pignalberi
abstract

2001 - An Application of Machine Learning and Statistics to Defect Detection [Articolo su rivista]
Cucchiara, Rita; P., Mello; M., Piccardi; F., Riguzzi
abstract

We present an application of machine learning and statistics to the problem of distinguishing between defective and non-defective industrial workpieces, where the defect takes the form of a long and thin crack on the surface of the piece. From the images of pieces a number of features are extracted by using the Hough transform and the Correlated Hough transform. Two datasets are considered, one containing only features related to the Hough transform and the other containing also features related to the Correlated Hough transform. On these datasets we have compared six different learning algorithms: an attribute-value learner (C4.5), a backpropagation neural network (NeuralWorks Predict), a k-nearest neighbour algorithm, and three statistical techniques (linear, logistic and quadratic discriminant). The experiments show that C4.5 performs best for both feature sets and gives an average accuracy of 93.3% for the first dataset and 95.9% for the second dataset.

2001 - Analysis and detection of shadows in video streams: a comparative evaluation [Relazione in Atti di Convegno]
Prati, Andrea; Cucchiara, Rita; I., Mikic; Mm, Trivedi
abstract

Robustness to changes in illumination conditions as well as viewing perspectives is an important requirement for many computer vision applications. One of the key factors in enhancing the robustness of dynamic scene analysis is that of accurate and reliable means for shadow detection. Shadow detection is critical for correct object detection in image sequences. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, the full range of problems underlying the shadow detection are identified and discussed. We classify the proposed solutions to this problem using a taxonomy of four main classes, called deterministic model and non-model based and statistical parametric and nonparametric. Novel quantitative (detection and discrimination accuracy) and qualitative metrics (scene and object independence, flexibility to shadow situations and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences.

2001 - Comparative Evaluation of Moving Shadow Detection Algorithms [Relazione in Atti di Convegno]
Prati, Andrea; I., Mikic; Cucchiara, Rita; M. M., Trivedi
abstract

Moving shadows need careful consideration in the development of robust dynamic scene analysis systems. Moving shadow detection is critical for accurate object detection in video streams, since shadow points are often misclassified as object points causing errors in segmentation and tracking. Many algorithms have been proposed in the literature that deal with shadows. However, a comparative evaluation of the existing approaches is still lacking. In this paper, the full range of problems underlying the shadow detection are identified and discussed. We present a comprehensive survey of moving shadow detection approaches. We organize contributions reported in the literature in four classes. We also present a comparative empirical evaluation of representative algorithms selected from these four classes. Quantitative (detection and discrimination accuracy) and qualitative metrics (scene and object independence, flexibility to shadow situations and robustness to noise) are proposed to evaluate these classes of algorithms on a benchmark suite of indoor and outdoor video sequences. These video sequences and associated “ground-truth” data are made available at http://cvrr.ucsd.edu:88/aton/shadow to allow for others in the community to experiment with new algorithms and metrics.

2001 - Detecting objects, shadows and ghosts in video streams by exploiting color and motion information [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; M., Piccardi; A., Prati
abstract

Many approaches to moving object detection for traffic monitoring and video surveillance proposed in the literature are based on background suppression methods. How to correctly and efficiently update the background model and how to deal with shadows are two of the most distinguishing and challenging features of such approaches. This work presents a general-purpose method for segmentation of moving visual objects (MVOs) based on an object-level classification into MVOs, ghosts and shadows. Background suppression needs a background model to be estimated and updated: we use motion and shadow information to selectively exclude MVOs and their shadows from the background model, while retaining ghosts. The color information (in the HSV color space) is exploited for shadow suppression and, consequently, to enhance both MVO segmentation and background update.

2001 - Enhancing Implant Surgery Planning via Computerized Image Processing [Articolo su rivista]
Cucchiara, Rita; F., Franchini; A., Lamma; E., Lamma; T., Sansoni; E., Sarti
abstract

Computerized tomography (CT) and magnetic resonance imaging (MRI) are the medical imaging modalities that deliver cross-sectional images of the human body. In the last decade, CT has become the most frequently used imaging modality for the evaluation of the jaw for dental implants (see [Rot98]). Furthermore, image reformatting software has been developed in order to obtain a correct pre-operative diagnosis and treatment planning regarding osseointegrated implants (see, for instance, [CSIa] and [CSIb]). Previous work (see [Cla90]) has shown that CT images are affected by a distortion ratio from 0 to 6 percent. This might be due to the alignment of the patient during the scanning, to his/her movements and possibly to the saturation of pixels composing the image. To solve the first cause, intraoral stents can be used for centering the patient’s head perpendicularly to the axis of the implant to be installed. However, when more than one implant has to be installed, possibly with very different axes, it is better not to expose the patient to multiple CT scans, which would be necessary to have different CT acquisitions each one perpendicular to the axis of one of the planned teeth. In this work, we present a software approach for enhancing implant surgical planning in order to get exact morphological measurements of the bone and planned teeth from a single CT acquisition. This is achieved by applying image-processing techniques to the original CT images, in order to produce new CT images lying on different planes, and possibly perpendicular to a different tooth. The resulting software system (named DentalVox) has been implemented in C++ and runs on Intel-based personal computers under the Windows operating system. DentalVox ensures better mechanical results in the design and planning of a dental implant with respect to other similar software tools, being able to reconstruct axial (and panorex and cross-sectional) images once any direction is chosen.
This makes it possible to obtain a mechanically and aesthetically better prosthesis implanted in the underlying jaw bone morphology.

2001 - From eager to lazy constrained data acquisition: a general framework [Articolo su rivista]
P., Mello; M., Milano; G., Gavanelli; E., Lamma; M., Piccardi; Cucchiara, Rita
abstract

Constraint Satisfaction Problems (CSPs) are an effective framework for modeling a variety of real life applications and many techniques have been proposed for solving them efficiently. CSPs are based on the assumption that all constrained data (values in variable domains) are available at the beginning of the computation. However, many non-toy problems derive their parameters from an external environment. Data retrieval can be a hard task, because data can come from a third-party system that has to convert information encoded with signals (derived from sensors) into symbolic information (exploitable by a CSP solver). Also, data can be provided by the user or have to be queried to a database. For this purpose, we introduce an extension of the widely used CSP model, called Interactive Constraint Satisfaction Problem (ICSP) model. The variable domain values can be acquired when needed during the resolution process by means of Interactive Constraints, which retrieve (possibly consistent) information. A general framework for constraint propagation algorithms is proposed which is parametric in the number of acquisitions performed at each step. Experimental results show the effectiveness of the proposed approach. Some applications which can benefit from the proposed solution are also discussed.

2001 - Improving shadow suppression in moving object detection with HSV color information [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; M., Piccardi; A., Prati; S., Sirotti
abstract

Video-surveillance and traffic analysis systems can be heavily improved using vision-based techniques able to extract, manage and track objects in the scene. However, problems arise due to shadows. In particular, moving shadows can affect the correct localization, measurements and detection of moving objects. This work aims to present a technique for shadow detection and suppression used in a system for moving visual object detection and tracking. The major novelty of the shadow detection technique is the analysis carried out in the HSV color space to improve the accuracy in detecting shadows. Signal processing and optic motivations of the approach proposed are described. The integration and exploitation of the shadow detection module into the system are outlined and experimental results are shown and evaluated

2001 - Iterative fuzzy clustering for detecting regions of interest in skin lesions [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; M., Piccardi
abstract

Image analysis tools are spreading in dermatology since the introduction of dermoscopy (epiluminescence microscopy), in the effort of algorithmically reproducing clinical evaluations. Color-based region segmentation of skin lesions is one of the key steps for correctly collecting statistics that can help clinicians in their diagnosis. Nevertheless, an efficient and accurate region segmentation algorithm has not been proposed in the literature yet. This work proposes an iterative fuzzy c-means clustering algorithm based on PCA with the Karhunen-Loève transform of the color space. A topological tree is provided to store the mutual inclusions of the regions and then used to summarize the structural properties of the skin lesion. Preliminary experimental results are presented and discussed.

2001 - Temporal analysis of cache prefetching strategies for multimedia applications [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

Prefetching is a widely adopted technique for improving performance of cache memories. Performance is typically affected by the design parameters, such as cache size and associativity, but also by the type of locality embodied in the programs. In particular, multimedia tools and programs handling images and video are characterized by a bi-dimensional spatial locality that could be greatly exploited by the inclusion of prefetching in the cache architecture. In this paper we compare some prefetching techniques for multimedia programs (such as MPEG compression, image processing, visual object segmentation) by performing a detailed evaluation of the memory access time. The goal is to prove that a significant speedup can be achieved by using either standard prefetching techniques (such as OBL or adaptive prefetching) or some innovative and image-oriented prefetching methods, like the neighbor prefetching described in the paper. Performance is measured with the PRIMA trace-driven simulator.

2001 - The Sakbot system for moving object detection and tracking [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; G., Neri; M., Piccardi; Prati, Andrea
abstract

This paper presents Sakbot, a system for moving object detection and tracking in traffic monitoring and video surveillance applications. The system is endowed with robust and efficient detection techniques, whose main features are the statistical and knowledge-based background update and the use of HSV color information for shadow suppression. Tracking is performed by means of a flexible tracking module based on symbolic reasoning, which can be tuned to several different applications.

2001 - The Sakbot system for moving object detection and tracking [Capitolo/Saggio]
Cucchiara, Rita; Grana, Costantino; Neri, Gianni; Piccardi, Massimo; Prati, Andrea
abstract

This paper presents Sakbot, a system for moving object detection in traffic monitoring and video surveillance applications. The system is endowed with robust and efficient detection techniques, whose main features are the statistical and knowledge-based background update and the use of HSV color information for shadow suppression. Tracking is provided by a symbolic reasoning module allowing flexible object tracking over a variety of different applications. This system proves effective in many different situations, both from the point of view of the scene appearance and the purpose of the application.

2000 - An Application of Machine Learning and Statistics to Defect Detection [Relazione in Atti di Convegno]
Cucchiara, Rita; P., Mello; M., Piccardi; F., Riguzzi
abstract

2000 - Computational models for image processing for shared-memory multiprocessors [Articolo su rivista]
A., Callipo; Cucchiara, Rita; M., Piccardi
abstract

Different tasks in image processing exhibit different computational requirements that should be considered with respect to the architecture. This is particularly critical in parallel machines, where many parallelization techniques, such as data partitioning and mapping on processors, use of shared memory space, and exploitation of pipelining with pre-fetching, dramatically affect the performance, with a strong relation to algorithm and architectural parameters. The paper defines computational models for tightly-coupled multiprocessors with crossbar architecture, both for data-parallel local algorithms and for global algorithms such as spatial transformations. To solve the intrinsic memory limitations of low-cost, highly integrated systems, the paper proposes to extend the classical block processing model by analytically modeling also the case of multiple processing stages. The models have been compared in detail and have been efficiently adopted for optimizing performance in block processing on crossbar multiprocessors for low-level computer vision applications.

2000 - Focus based Feature Extraction for Pallets Recognition [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

Visual recognition for object grasping is a well-known challenge for robot automation in industrial applications. A typical example is pallet recognition in an industrial environment for an automated pick-and-place process. The aim of vision and reasoning algorithms is to help robots in choosing the best pallet hole locations. This work proposes an application-based approach, which fulfils all requirements, dealing with every kind of occlusion and lighting situation possible. Even some “meaning noise” (or “meaning misunderstanding”) is considered. A pallet model, with limited degrees of freedom, is described and, starting from it, a complete approach to pallet recognition is outlined. In the model we define both virtual and real corners, which are geometrical object properties computed by different image analysis operators. Real corners are perceived by processing brightness information directly from the image, while virtual corners are inferred at a higher level of abstraction. A final reasoning stage selects the best solution fitting the model. Experimental results and performance are reported in order to demonstrate the suitability of the proposed approach.

2000 - Hardware prefetching techniques for cache memories in multimedia applications [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely explored approach to improve cache performance is hardware prefetching that allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches partially miss the potential performance improvement, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including convolutions with kernels, MPEG-2 decoding, and edge chain coding.

2000 - Image Analysis and Rule-Based Reasoning for a Traffic Monitoring [Articolo su rivista]
Cucchiara, Rita; M., Piccardi; P., Mello
abstract

The paper presents an approach for detecting vehicles in urban traffic scenes by means of rule-based reasoning on visual data. The strength of the approach is its formal separation between the low-level image processing modules (used for extracting visual data under various illumination conditions) and the high-level module, which provides a general-purpose knowledge-based framework for tracking vehicles in the scene. The image-processing modules extract visual data from the scene by spatio-temporal analysis during daytime, and by morphological analysis of headlights at night. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e., vehicles and their attributes (area, pattern, direction, and others) and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high-level and the low-level image analysis techniques provides the system with flexibility and robustness.

2000 - Improving data prefetching efficacy in multimedia applications [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results (both on efficiency and on efficacy of the proposed approach) are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

2000 - Optimal Range Segmentation Parameters Through Genetic Algorithms [Relazione in Atti di Convegno]
L., Cinque; Cucchiara, Rita; S., Levialdi; S., Martinz; G., Pignalberi
abstract

A wide number of algorithms for surface segmentation in range images have been recently proposed characterized by different approaches (edge filling, region growing, ...), different surface types (either for planar or curved surfaces) and different parameters involved. Optimization of the parameter set is a particularly critical task since the range of parameter variability is often quite large: parameter selection depends on surface type, sensors and the required speed which strongly affect performance. A framework for parameters optimization is proposed based on genetic algorithms. Such algorithms allow a general approach that has been successfully applied on different state-of-the-art segmenters and different range image databases.

2000 - Scuola "La Visione delle Macchine" [Curatela]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

2000 - Statistic and knowledge-based moving object detection in traffic scenes [Relazione in Atti di Convegno]
Cucchiara, Rita; Grana, Costantino; M., Piccardi; A., Prati
abstract

The most common approach used for vision-based traffic surveillance consists of a fast segmentation of moving visual objects (MVOs) in the scene together with an intelligent reasoning module capable of identifying, tracking and classifying the MVOs depending on the system goal. In this paper we describe our approach for MVO segmentation in an unstructured traffic environment. We consider complex situations with moving people, vehicles and infrastructures that have different aspect models and motion models. In this case we define a specific approach based on background subtraction with statistic and knowledge-based background update. We show many results of real-time tracking of traffic MVOs in outdoor traffic scenes such as roads, parking area intersections, and entrances with barriers.

1999 - 3D Object Recognition by VC-graphs and Interactive Constraint Satisfaction [Relazione in Atti di Convegno]
Cucchiara, Rita; E., Lamma; P., Mello; M., Milano; M., Piccardi
abstract

We propose a novel approach for recognizing 3D CAD-made objects in complex range images containing several overlapped and different objects. Objects are modeled by a graph whose nodes are surfaces and arcs are surface relations. We propose an object-centered graph model, called Visual Constraint graph (VC-graph), with special visual constraints modeling occlusions between object surfaces. The VC-graph is used for recognizing objects from each possible point of view, instead of evaluating many different single-view graphs. The reasoning engine is based on an original extension of the Constraint Satisfaction Problem (CSP) paradigm, called Interactive CSP (ICSP). CSP requires the acquisition of all surfaces before starting constraint propagation; instead, ICSP guides the acquisition of new surfaces only on-demand, without computing useless information and focussing attention only on significant image parts.

1999 - Constraint Propagation and Value Acquisition: why we should do it Interactively [Relazione in Atti di Convegno]
E., Lamma; P., Mello; M., Milano; Cucchiara, Rita; G., Gavanelli; M., Piccardi
abstract

In Constraint Satisfaction Problems (CSPs) values belonging to variable domains should be completely known before the constraint propagation process starts. In many applications, however, the acquisition of domain values is a computationally expensive process or some domain values could not be available at the beginning of the computation. For this purpose, we introduce an Interactive Constraint Satisfaction Problem (ICSP) model as extension of the widely used CSP model. The variable domain values can be acquired when needed during the resolution process by means of Interactive Constraints, which retrieve (possibly consistent) information. Experimental results on randomly generated CSPs and for 3D object recognition show the effectiveness of the proposed approach.

1999 - Eliciting Visual Primitives for Detecting Elongated Shapes [Articolo su rivista]
Cucchiara, Rita; M., Piccardi
abstract

Elsevier eds

1999 - Eliciting visual primitives for detection of elongated shapes [Articolo su rivista]
Cucchiara, R.; Piccardi, M.
abstract

This paper deals with the problem of eliciting visual primitives for visual search with the aim of detecting 2D objects characterized, primarily, by an elongated shape. The paper proposes a new visual primitive obtained by combining, in a suitable correlation, a basic set of standard local features. This primitive is able to synthesize the information associated with local features and, as a more effective ensemble of properties of the considered model, enhance detection. The paper discusses the approach, presents the new primitive and evaluates its robustness in the case of non-ideal and noisy images. Finally an application to the context of visual inspection is presented.

1999 - Exploiting Cache in Multimedia [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea
abstract

The paper explores cache strategies for multimedia. Although many architectural improvements have been designed for multimedia, the cache structure and the standard caching policies of general-purpose processors exhibit poor performance in exploiting the 2D spatial locality typical of programs handling and processing images. In this paper we propose a novel caching approach suitably tailored to the requirements of multimedia programs. Our proposal exploits hardware pre-fetching for allocating in cache blocks of data that satisfy the 2D spatial locality requirements. Results refer to a benchmark suite of multimedia programs including MPEG decoding and image processing programs with different data dependencies and access schemes to image data.

1999 - Extending CLP(FD) with Interactive Data Acquisition for 3D Visual Object Recognition [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Gavanelli; E., Lamma; P., Mello; M., Milano; M., Piccardi
abstract

1999 - Image Analysis and Rule-Based Reasoning for a Traffic Monitoring [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; P., Mello
abstract

The paper describes a system for detecting vehicles in urban traffic scenes in daytime and at night by means of image analysis and rule-based reasoning. The strength of the proposed approach is its formal separation between the low-level image processing modules (detecting moving vehicles under day and night light) and the high-level module, which provides a single framework for tracking vehicles in the scene. The image processing modules perform spatio-temporal analysis on moving templates in daytime images, and morphological analysis of headlight pairs in night images. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e. vehicles and their attributes, and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high level and low-level image analysis techniques provides the system with flexibility and robustness.

1999 - Real-time Detection of Moving Vehicles [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi; Prati, Andrea; Scarabottolo, Nello
abstract

Computer vision-based traffic flow monitoring is of major importance for enforcing traffic management policies. Information such as the number of vehicles passing on a road per time unit, or vehicles' turning rates at intersections, is exploited by traffic management policies to supervise traffic-light timings. Computer vision-based traffic flow monitoring requires extraction of moving vehicles from traffic scenes in real time. To accomplish this task, efficient algorithms must be used and effective, low-cost hardware implementation must be pursued. This paper first describes the algorithms used in VTTS (Vehicular Traffic Tracking System) to achieve segmentation of moving vehicles. Then, hardware implementation on a re-programmable FPGA-based board is described in detail.

1999 - Rule-based reasoning on visual data for urban traffic monitoring [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Gavanelli; Prati, Andrea; M., Piccardi
abstract

The paper describes a system for detecting vehicles in urban traffic scenes by means of rule-based reasoning on visual data. The strength of the proposed approach is its formal separation between low-level image processing modules (able to extract visual data under various illumination conditions) and the high-level module, which provides a single framework for tracking vehicles in the scene. The image processing modules extract visual data from the scene, by spatio-temporal analysis during day-time, and by morphological analysis of headlights at night. The high-level module is designed as a forward chaining production rule system, working on symbolic data, i.e. vehicles and their attributes (area, pattern, direction...) and exploiting a set of heuristic rules tuned to urban traffic conditions. The synergy between the artificial intelligence techniques of the high level and the low-level image analysis techniques provides the system with flexibility and robustness.

1999 - Segmentation of Moving Objects at Frame Rate: A Dedicated Hardware Solution [Relazione in Atti di Convegno]
Cucchiara, Rita; P., Onfiani; Prati, Andrea; Scarabottolo, Nello
abstract

Many works in image processing concern segmentation of moving objects in sequences of images. This problem is particularly critical, since it represents the first step of many complex processes of computer vision, for applications like object tracking, video-surveillance, monitoring, and autonomous navigation. In such applications, both real-time and low-cost requirements should be satisfied. To this aim we propose a dedicated hardware solution, based on reconfigurable logic, that provides motion detection and moving object segmentation at frame rate.

1999 - Vehicle Detection under Day and Night Illumination [Relazione in Atti di Convegno]
Cucchiara, Rita; M., Piccardi
abstract

Effective detection of vehicles in urban traffic scenes can be achieved by exploiting image analysis techniques. Nevertheless, vehicle detection in daytime and at night can’t be approached with the same image analysis algorithms, due to the strongly different illumination conditions. This paper describes the two different sets of image analysis algorithms that have been used in the VTTS system (Vehicular Traffic Tracking System) for extracting vehicles from image sequences acquired in daytime and at night. In the system, a supervising level selects the set of algorithms to apply and performs vehicle tracking under control of a rule-based decision module. The paper describes the tracking module, and reports experimental results for both vehicle detection and tracking.

1998 - A Rule-based Vehicular Traffic Tracking System [Relazione in Atti di Convegno]
Barattin, M.; Cucchiara, R.; Piccardi, M.
abstract

The paper presents a computer vision-based approach to the problem of vehicular traffic monitoring. The approach associates a high-level tracking system with a low-level system that performs moving vehicle detection. The high-level module is based on a large set of rules and is able to keep track of all moving or stopped vehicles along the image sequence.

1998 - A real-time hardware implementation of the hough transform [Articolo su rivista]
Cucchiara, Rita; G., Neri; M., Piccardi
abstract

The paper presents a hardware implementation of algorithms based on the Hough transform (HT) for real-time straight line detection. In particular, the basic HT on the edge points (EHT) and the Gradient-Weighted Hough transform (GWHT) for gray-level images are analyzed in detail and implemented on a pipelined architecture using Field Programmable Gate Arrays (FPGA). Algorithm execution times are compared with other hardware and software based systems in order to assess the efficiency of the presented approach. The paper shows how the achievable performance can meet the real-time requirements of an industrial inspection application.

1998 - Exploiting image processing locality in cache pre-fetching [Relazione in Atti di Convegno]
Cucchiara, R.; Piccardi, M.
abstract

Emerging trends in computer design attempt to include specific solutions for handling images also in general-purpose computers, because of the current spread of multimedia, image processing and computer graphics applications. In this context, we propose hardware pre-fetching techniques specific for caching images: The main issue we state is that most algorithms working on images exhibit a 2D spatial locality that is not taken into account in current cache organization and data access strategies. To this aim we propose an adaptive local pre-fetching for the image data type; this technique, mirroring the two-dimensional spatial locality of image processing algorithms, results in being more efficient than other approaches, such as sequential pre-fetching and adaptive pre-fetching. Performance is evaluated on different classes of image processing algorithms, namely raster-scan and propagative algorithms, common in computer vision and multimedia applications.

1998 - Genetic algorithms for clustering in machine vision [Articolo su rivista]
Cucchiara, Rita
abstract

The paper presents a genetic algorithm for clustering objects in images based on their visual features. In particular, a novel solution code (named Boolean Matching Code) and a correspondent reproduction operator (the Single Gene Crossover) are defined specifically for clustering and are compared with other standard genetic approaches. The paper describes the clustering algorithm in detail, in order to show the suitability of the genetic paradigm and underline the importance of effective tuning of algorithm parameters to the application. The algorithm is evaluated on some test sets and an example of its application in automated visual inspection is presented.

1998 - The Vector-Gradient Hough Transform [Journal article]
Cucchiara, Rita; Filicori, F.
abstract

The paper presents a new transform, called vector-gradient Hough transform, for identifying elongated shapes in gray-scale images. This goal is achieved not only by collecting information on the edges of the objects, but also by reconstructing their transversal profile of luminosity. The main features of the new approach are related to its vector space formulation and the associated capability of exploiting all the vector information of the luminosity gradient.

1997 - An interactive constraint-based system for selective attention in visual search [Conference proceedings paper]
Cucchiara, R.; Lamma, E.; Mello, P.; Milano, M.
abstract

In this paper, we address the problem of model-based object recognition in a scene. Computer vision techniques usually separate the extraction of visual information from the scene from the reasoning on the symbolic data. We propose to interactively intertwine the two parts: the reasoning task on visual information is based on constraint satisfaction techniques. Objects are modeled by means of constraints and constraint propagation recognizes an object in the scene. To this end, we extend the classical Constraint Satisfaction Problem (CSP) approach, which is not suitable for coping with undefined information. We thus propose an Interactive CSP model for reasoning on partially defined data, generating new constraints which can be used to guide the search and to incrementally process newly acquired knowledge.

1997 - Block processing on multiprocessor DSPs for multimedia applications [Conference proceedings paper]
Cucchiara, R; Callipo, A; Piccardi, M
abstract

The paper explores software development for multiprocessor DSPs for data parallel local algorithms. These algorithms, such as filtering and compression, are very common in multimedia applications. Multiprocessor DSPs are very attractive for this application since they offer performance typical of parallel machines together with limited cost. The paper provides performance analysis and software design issues according to different data partitioning models. As a case study, performance evaluation has been carried out on the Multimedia Video Processor from Texas Instruments.

1997 - Exploiting symbolic learning in visual inspection [Conference proceedings paper]
Piccardi, M.; Cucchiara, R.; Bariani, M.; Mello, P.
abstract

The paper describes the use of data analysis techniques in the computer-vision inspection of industrial workpieces. Computer-vision inspection aims at accomplishing quality verification of fabricated parts by means of automated visual procedures. Gathering the visual information into models proves a critical task, especially when subjective judgement is involved in quality verification. In this work, intelligent data analysis techniques based on symbolic learning by examples have been explored in order to automatically devise and parametrize effective quantitative models. The paper reports and discusses the experimental results achieved in an industrial application.

1997 - Learning for feature selection and shape detection [Conference proceedings paper]
Cucchiara, R.; Piccardi, M.; Bariani, M.; Mello, P.
abstract

The paper proposes a general framework for shape detection based on supervised symbolic learning. Unlike other visual systems exploiting machine learning, the proposed architecture does not follow the approach of object segmentation, feature extraction and (learning-based) classification. Instead, an initial data-driven processing selects points of interest in the scene by means of complex features which hypothesize the presence of the target shape; hypotheses are validated by a classifier defined by a machine learning algorithm. Learning is exploited not only for defining the model, i.e. the description of the target for the classifier, but also for defining the description language, i.e. the feature set useful in generating reliable object hypotheses. The proposed visual system architecture has been implemented for an industrial application of unstructured shape detection: examples and results are reported in the paper.

1997 - The GIOTTO system: A parallel computer for image processing [Journal article]
Cucchiara, R.; Di Stefano, L.; Piccardi, M.; Cinotti, T. S.
abstract

This paper presents the GIOTTO system, a parallel computer based on a scalable single instruction, multiple data (SIMD) array of processors specially conceived for image processing. The system is characterized by a reduced-size array and a novel organization of the memory subsystem, designed to support transparent processing of images larger than the array. The system is designed to meet the computational requirements of machine vision, together with the compactness, ease of integration and flexibility called for by industrial robotic environments. The paper describes the system architecture in detail, focusing on original solutions conceived to endow the system with flexibility and performance. As proof of GIOTTO's suitability for robotic application, its use in a robot vision experiment is presented, showing the approach to the vision problem, the parallel algorithms, and performance analysis. © 1997 Academic Press Limited.

1996 - DARPA benchmark image processing on SIMD parallel machines [Conference proceedings paper]
Cucchiara, Rita; Piccardi, Massimo
abstract

The aim of this paper is to present a comparative analysis of the execution times of low-level vision algorithms on two different SIMD parallel machines. The set of algorithms is part of the DARPA Image Understanding benchmark, a widely-accepted platform for performance comparison of parallel systems in the field of computer vision. The considered computer architectures represent two opposite solutions in terms of granularity in approaching the SIMD paradigm, one with a coarse-grain array of floating-point processors and the other with a fine-grain array of single-bit processing elements. For these reasons, the set of algorithms was implemented on both systems taking into account machine specificities. In this work some insights into implementation issues and a comparative analysis of the assessed execution times are presented.

1996 - Detection of luminosity profiles of elongated shapes [Conference proceedings paper]
Cucchiara, Rita; Piccardi, Massimo
abstract

In this work, a novel technique for identifying elongated shapes in grey-scale images is presented. The method provides detection and identification of elongated shapes not only modelling their principal direction, but also reconstructing the transversal luminosity profile. The approach is proposed starting from the gradient-weighted Hough transform, endowing the Hough space with more complete information about the luminance gradient of the image. This paper presents and discusses the algorithm devised to implement the method on discretized data. As an example of application, we present results on images from mechanical pieces, where real and false defects are discriminated through effective reconstruction of their luminosity profile.

1996 - The vector-gradient Hough transform for identifying straight-translation generated shapes [Conference proceedings paper]
Cucchiara, R.; Filicori, F.
abstract

The paper introduces the vector-gradient Hough transform (VGHT), a modified version of the gradient weighted Hough transform (GWHT), defined in vector space and able to exploit all the vector information of the gradient of luminosity. The new formulation, directly derived from the Radon transform, is analyzed and compared with the GWHT, in order to point out the improvement in selectivity provided by the VGHT in a strictly polar parametric space, without any relevant increase in computational complexity. This approach can be very suitable for identifying a specifically defined model of shapes in gray level images, ideally generated by a translation in the 2D space of a 1D luminosity profile. Finally, the suitability of the VGHT in real applications is shown with examples in the area of defect identification for automated visual inspection. © 1996 IEEE.

1995 - A highly selective HT based algorithm for detecting extended, almost rectilinear shapes [Conference proceedings paper]
Cucchiara, R.; Filicori, F.
abstract

The paper presents a highly selective algorithm for detecting extended and almost rectilinear shapes in digital images, in the presence of structured and unstructured noise; it exploits the Gradient-based Hough Transform, followed by a special purpose correlation process in the parameter space. The paper discusses the algorithm and its application in a quality inspection task for detecting fabrication defects in mechanical pieces.

1995 - Detection of circular objects by wave propagation on a mesh-connected computer [Journal article]
Cucchiara, R.; Di Stefano, L.; Piccardi, M.
abstract

Circular objects can be detected in low-contrast and/or blurred images by propagating intensity values according to a two-dimensional wave equation and then finding peaks generated by constructive interference. The paper proposes a parallel algorithm for SIMD mesh-connected computers (MCCs) that is based on this approach; the algorithm presents a fast difference scheme and a search for peaks based on spatio-temporal extremes, and also incorporates absorbing boundary conditions. We describe the parallel algorithm using machine-independent pseudo-code since our goal is to provide detailed guidelines for implementation on MCCs. Pseudo-code statements are expanded with respect to a basic model's instruction set in order to evaluate costs associated with the algorithm's subtasks and discuss implementation choices. © 1995 Academic Press, Inc.
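
To give a feel for the propagation step, here is a minimal sequential sketch of one explicit finite-difference (leapfrog) update of the 2D wave equation. It is not the paper's parallel algorithm: the name `wave_step`, the parameter `c2` and the zero-valued border are assumptions of this example, and the fixed border reflects waves rather than absorbing them as the paper's boundary conditions do.

```python
def wave_step(u_prev, u_curr, c2=0.25):
    """One leapfrog step of the 2D wave equation on a grid of floats.

    Hypothetical helper for illustration only. c2 stands for
    (c * dt / dx) ** 2; values up to 0.5 keep this explicit scheme
    stable. Border cells are held at zero (reflecting, not absorbing).
    """
    rows, cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five-point discrete Laplacian using the four mesh neighbours.
            lap = (u_curr[i - 1][j] + u_curr[i + 1][j]
                   + u_curr[i][j - 1] + u_curr[i][j + 1]
                   - 4.0 * u_curr[i][j])
            u_next[i][j] = 2.0 * u_curr[i][j] - u_prev[i][j] + c2 * lap
    return u_next
```

Each cell only reads its four neighbours, which is why this kind of update maps naturally onto a mesh-connected array. Iterating the step with the image as the initial field lets wavefronts from a circular contour meet near its centre, where a peak search can pick them up.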

1994 - Parallel vision subsystem for robotic inspection and manipulation [Conference proceedings paper]
Cucchiara, R.; Di Stefano, L.; Monacelli, M.; Piccardi, M.; Rustichelli, G.
abstract

The paper describes an SIMD massively parallel computer conceived for robot vision, and presents an application where the system is installed on a mobile robot to support inspection and manipulation tasks. The system consists of a scalable array of up to 4K single-bit processors controlled by a general-purpose microcomputer through a dedicated interface. The paper gives an overview of the system's architecture, addressing the available prototype as well as the functional units currently under design. We provide examples of image analysis goals that can be efficiently reached using spatially-organized parallel architectures such as SIMD arrays and give measurements of current performance. Using code profiles we also evaluate the speedup associated with the configuration being designed.

1993 - Processing of variable size images on a cellular array: Performance analysis with the Abingdon Cross Benchmark [Conference proceedings paper]
Piccardi, M.; Di Stefano, L.; Cucchiara, R.; Cinotti, T. S.
abstract

Handling a continuous flow of variable size images is a requirement for real time computer vision machines. A modular system based on a small size SIMD cellular array of 1-bit processing elements has been developed with this goal in mind and it is now evaluated against the Abingdon Cross Benchmark specifications. The benchmark tests the combination of algorithms and architecture and generates a quality factor expressed as the ratio of the image lateral size and the processing time. The examined machine supports an efficient means to automatically partition, process and reconstruct images larger than the array size. The authors briefly describe the system, discuss the selected algorithms and present performance results and estimates for several system configurations.

1993 - Reconfiguring the boundaries of a mesh-connected array of processors with run-time programmable logic [Journal article]
Cucchiara, R.; Salmon Cinotti, T.; Neri, G.; Rustichelli, G.
abstract

Fine grain mesh-connected arrays of processors with a SIMD architecture are considered an attractive solution for many important and emerging real-time data handling applications that require high speed processing of dynamic data sampled over a bidimensional region. Nevertheless, even if the interconnections between the arrays' adjacent nodes are well suited to those applications that may be handled by neighbour based algorithms, the SIMD computational model is, in general terms, still not flexible. Furthermore, the limited size of economically viable arrays calls for a severe overhead in data transfer and array boundary data handling that may impair the system's efficiency. Without modifying the arrays' internal structure, most algorithms could overcome some of their implementation drawbacks with a flexible, fast switching facility on the array boundary. This article presents a 'boundary processor' based on programmable gate arrays whose aim is to dynamically activate the required boundary interconnection pattern either under software control or through an on-line hardware reconfiguration facility. The device has been designed and implemented at the University of Bologna as part of a computer vision machine for robotic applications. © 1993.

1992 - Analysis of design methodology with logic cell arrays [Conference proceedings paper]
Cucchiara, R.; Neri, G.; Rustichelli, G.; Cinotti, T. S.
abstract

Field Programmable Gate Arrays (FPGA) are becoming popular as an alternative to ASIC because of their good price/performance ratio in terms of speed and flexibility and ease of design implementation, especially in the prototype phase. This paper analyses FPGA design both from the methodology and optimization techniques points of view. As an example the paper describes in detail the design optimization adopted for an array processor control unit which is characterized by a very large number of signals and repetitive architecture. The logic complexity and the corresponding simulation led to the choice of an LCA (Xilinx) instead of a gate-array based design. All design phases are discussed and a comparison with other design methodologies is presented.

1992 - Cluster partitioning in image analysis classification: A genetic algorithm approach [Conference proceedings paper]
Alippi, C.; Cucchiara, R.
abstract

A classification of data by using the genetic algorithm computational paradigm is proposed. The best data partition is defined to be the one minimizing the sum of Pythagorean distances between each datum in a cluster and the relative center of class or center of mass. Background is given, and the relevant genetic algorithm description is provided. The model for the genetic application is presented. Simulation results confirm genetic algorithms to be powerful tools for the solution of optimization problems.
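
The partition criterion described above can be sketched as a cost function that a genetic algorithm would minimize. This is only an illustration under stated assumptions: the name `partition_cost` and the encoding of a candidate partition as a list of cluster labels are hypothetical, and "Pythagorean distance" is read here as ordinary Euclidean distance.

```python
import math

def partition_cost(data, labels):
    """Sum of Euclidean distances from each datum to its cluster's center of mass.

    Hypothetical helper for illustration only: labels[i] is the cluster
    assigned to data[i]; lower cost means a tighter partition.
    """
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Center of mass: per-coordinate mean of the cluster members.
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            cost += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
    return cost
```

For the four points (0, 0), (2, 0), (10, 0), (12, 0), the labeling [0, 0, 1, 1] costs 4.0 while [0, 1, 0, 1] costs 20.0, so a search that minimizes this cost prefers the partition that groups nearby points.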
