Enver SANGINETO

Fixed-term Researcher (Art. 24, para. 3, letter B)
Dipartimento di Ingegneria "Enzo Ferrari"


Publications

2024 - Semantic Residual Prompts for Continual Learning [Conference Proceedings]
Menabue, Martin; Frascaroli, Emanuele; Boschini, Matteo; Sangineto, Enver; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone
abstract

Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and train a few parameter vectors termed prompts. Most of these methods organize these vectors in a pool of key-value pairs and use the input image as query to retrieve the prompts (values). However, as keys are learned while tasks progress, the prompting selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we leverage a foundation model (CLIP) to select our prompts within a two-level adaptation mechanism. Specifically, the first level leverages a standard textual prompt pool for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets. The codebase is available at https://github.com/aimagelab/mammoth.
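
As a concrete illustration of the key-value lookup described above, the sketch below shows a minimal single-level prompt retrieval step in PyTorch. All names and shapes are hypothetical and not taken from the released codebase; the paper's method additionally stabilizes the keys through CLIP-derived prototypes and a second pool.

```python
import torch
import torch.nn.functional as F

def select_prompts(query_feat, keys, prompt_pool, top_k=5):
    """Retrieve the top-k prompts whose keys best match the query image.

    query_feat:  (D,) feature of the input image (e.g. from a frozen CLIP
                 image encoder, used here as the query).
    keys:        (P, D) one key per prompt in the pool.
    prompt_pool: (P, L, E) pool of prompts, each made of L tokens of size E.
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), keys, dim=-1)  # (P,)
    idx = sims.topk(top_k).indices
    return prompt_pool[idx], idx  # prompts to prepend to the ViT tokens
```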


2024 - SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective [Conference Proceedings]
Xu, Z.; Xing, S.; Sangineto, E.; Sebe, N.
abstract


2023 - Deep Learning and Large Scale Models for Bank Transactions [Conference Proceedings]
Garuti, Fabrizio; Luetto, Simone; Cucchiara, Rita; Sangineto, Enver
abstract

The success of Artificial Intelligence (AI) in different research and application areas has increased the interest in adopting Deep Learning techniques also in the financial field. Particularly interesting is the case of financial transactional data, which represent one of the most valuable sources of information for banks and other financial institutes. However, the heterogeneity of the data, composed of both numerical and categorical attributes, makes the use of standard Deep Learning methods difficult. In this paper, we present UniTTAB, a Transformer network for transactional time series, which can uniformly represent heterogeneous time-dependent data and which is trained on a very large corpus of real transactional data. As far as we know, the dataset we used for training is the largest real bank-transaction dataset used for Deep Learning methods in this field, the other common datasets being either much smaller or synthetically generated. The use of this very large real training dataset makes UniTTAB the first foundation model for transactional data.
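
The core representational difficulty mentioned above (mixing numerical and categorical attributes) is usually solved by mapping every field to a token of a common width before the Transformer. The module below is a minimal sketch of such a tokenizer, assuming per-field embeddings for categorical attributes and a linear projection for numerical ones; it is illustrative only and does not reproduce the actual UniTTAB architecture.

```python
import torch
import torch.nn as nn

class FieldTokenizer(nn.Module):
    """Map heterogeneous transaction fields into a shared embedding space.

    Categorical fields use lookup tables; numerical fields are projected
    with a small linear layer, so every field becomes one d_model token.
    """
    def __init__(self, cat_cardinalities, num_numeric, d_model=128):
        super().__init__()
        self.cat_embs = nn.ModuleList(
            [nn.Embedding(c, d_model) for c in cat_cardinalities])
        self.num_proj = nn.Linear(1, d_model)
        self.num_numeric = num_numeric

    def forward(self, cat_fields, num_fields):
        # cat_fields: (B, T, n_cat) long, num_fields: (B, T, n_num) float
        toks = [emb(cat_fields[..., i]) for i, emb in enumerate(self.cat_embs)]
        toks += [self.num_proj(num_fields[..., i:i + 1])
                 for i in range(self.num_numeric)]
        return torch.stack(toks, dim=2)  # (B, T, n_fields, d_model)
```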


2023 - Input Perturbation Reduces Exposure Bias in Diffusion Models [Conference Proceedings]
Ning, M.; Sangineto, E.; Porrello, A.; Calderara, S.; Cucchiara, R.
abstract

Denoising Diffusion Probabilistic Models have shown an impressive generation quality although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is available at https://github.com/forever208/DDPM-IP.
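
In code, the proposed regularization amounts to a small change in the standard DDPM training step: the network input is built from a perturbed noise eps + gamma * eps', while the regression target remains the clean eps. Below is a minimal sketch assuming an epsilon-prediction network with signature model(x_t, t); the gamma value is only indicative of the scale reported as effective.

```python
import torch
import torch.nn.functional as F

def ddpm_ip_loss(model, x0, alphas_cumprod, gamma=0.1):
    """One DDPM training step with input perturbation (sketch).

    The target is still the clean noise eps, but the network sees a state
    built from a perturbed noise eps + gamma * eps', which simulates the
    prediction errors accumulated along the sampling chain at test time.
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    eps_pert = eps + gamma * torch.randn_like(x0)   # input perturbation
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps_pert
    return F.mse_loss(model(x_t, t), eps)           # regress the clean eps
```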


2023 - StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [Conference Proceedings]
Xu, Z.; Sangineto, E.; Sebe, N.
abstract

Despite the progress made in the style transfer task, most previous work focuses on transferring only relatively simple features like color or texture, while missing more abstract concepts such as overall art expression or painter-specific traits. However, these abstract semantics can be captured by models like DALL-E or CLIP, which have been trained using huge datasets of images and textual documents. In this paper, we propose StylerDALLE, a style transfer method that exploits both of these models and uses natural language to describe abstract art styles. Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation, i.e., from input content image to output stylized image, in the discrete latent space of a large-scale pretrained vector-quantized tokenizer, e.g., the discrete variational auto-encoder (dVAE) of DALL-E. To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision that ensures stylization and content preservation simultaneously. Experimental results demonstrate the superiority of our method, which can effectively transfer art styles using language instructions at different granularities. Code is available at https://github.com/zipengxuc/StylerDALLE.
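
The RL supervision hinges on scoring the stylized output in CLIP space. The function below is a toy illustration of such a CLIP-based reward, combining a style term (similarity to the textual style prompt) and a content term (similarity to the input image); it is an assumption-laden simplification, not the paper's exact reward.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

model, preprocess = clip.load("ViT-B/32")

@torch.no_grad()
def clip_reward(stylized, content, style_text):
    """Toy CLIP-based reward: stylization plus content preservation.

    stylized/content: preprocessed image batches; style_text: e.g.
    "a painting in the style of Van Gogh". Both terms are cosine
    similarities in CLIP space; their weighting is a free design choice.
    """
    txt = clip.tokenize([style_text]).to(stylized.device)
    img_s = F.normalize(model.encode_image(stylized), dim=-1)
    img_c = F.normalize(model.encode_image(content), dim=-1)
    txt_f = F.normalize(model.encode_text(txt), dim=-1)
    style_term = (img_s @ txt_f.t()).squeeze(-1)   # match the style prompt
    content_term = (img_s * img_c).sum(-1)         # stay close to the content
    return style_term + content_term
```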


2022 - 3D-Aware Semantic-Guided Generative Model for Human Synthesis [Conference Proceedings]
Zhang, J.; Sangineto, E.; Tang, H.; Siarohin, A.; Zhong, Z.; Sebe, N.; Wang, W.
abstract

Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid/semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of great interest for many computer graphics applications. This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations while allowing photo-realistic, controllable generation. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines. The code is available at https://github.com/zhangqianhui/3DSGAN.


2022 - Temporal Alignment for History Representation in Reinforcement Learning [Conference Proceedings]
Ermolov, A.; Sangineto, E.; Sebe, N.
abstract

Environments in Reinforcement Learning are usually only partially observable. To address this problem, a possible solution is to provide the agent with information about the past. However, providing complete observations of numerous steps can be excessive. Inspired by human memory, we propose to represent history through only the important changes in the environment and, in our approach, to obtain this representation automatically using self-supervision. Our method (TempAl) aligns temporally-close frames, revealing a general, slowly varying state of the environment. This procedure is based on a contrastive loss, which pulls embeddings of nearby observations towards each other while pushing away other samples from the batch. It can be interpreted as a metric that captures the temporal relations of observations. We combine the common instantaneous representation with our history representation and evaluate TempAl on all available Atari games from the Arcade Learning Environment. TempAl surpasses the instantaneous-only baseline in 35 environments out of 49. The source code of the method and of all the experiments is available at https://github.com/htdt/tempal.
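
The contrastive alignment described above is essentially an InfoNCE objective in which the positive pair is a pair of temporally-close frames. A minimal sketch, with hypothetical shapes:

```python
import torch
import torch.nn.functional as F

def temporal_infonce(z_t, z_tk, temperature=0.1):
    """Contrastive alignment of temporally-close frames (sketch).

    z_t, z_tk: (B, D) embeddings of frames at time t and t+k from the same
    episodes. Row i of z_tk is the positive for row i of z_t; all other
    rows in the batch act as negatives, as in standard InfoNCE.
    """
    z_t = F.normalize(z_t, dim=-1)
    z_tk = F.normalize(z_tk, dim=-1)
    logits = z_t @ z_tk.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)
```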


2022 - Unsupervised High-Resolution Portrait Gaze Correction and Animation [Journal Article]
Zhang, J.; Chen, J.; Tang, H.; Sangineto, E.; Wu, P.; Yan, Y.; Sebe, N.; Wang, W.
abstract

This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without gaze angle and head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze and head pose information. Solving this problem using an unsupervised method remains an open problem, especially for high-resolution face images in the wild, which are not easy to annotate with gaze and head pose labels. To address this issue, we first create two new portrait datasets: CelebGaze (256 × 256) and high-resolution CelebHQGaze (512 × 512). Second, we formulate the gaze correction task as an image inpainting problem, addressed using a Gaze Correction Module (GCM) and a Gaze Animation Module (GAM). Moreover, we propose an unsupervised training strategy, i.e., Synthesis-As-Training, to learn the correlation between the eye region features and the gaze angle. As a result, we can use the learned latent space for gaze animation with semantic interpolation in this space. Finally, to alleviate both the memory and the computational costs in the training and the inference stage, we propose a Coarse-to-Fine Module (CFM) integrated with GCM and GAM. Extensive experiments validate the effectiveness of our method for both the gaze correction and the gaze animation tasks on both low- and high-resolution face datasets in the wild and demonstrate the superiority of our method with respect to the state of the art.


2021 - A Unified Objective for Novel Class Discovery [Conference Proceedings]
Fini, Enrico; Sangineto, Enver; Lathuilière, Stéphane; Zhong, Zhun; Nabi, Moin; Ricci, Elisa
abstract


2021 - Appearance and Pose-Conditioned Human Image Generation using Deformable GANs [Journal Article]
Siarohin, Aliaksandr; Lathuilière, Stéphane; Sangineto, Enver; Sebe, Nicu
abstract

In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image xa of a person and a target pose P(xb), extracted from an image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa. In order to deal with pixel-to-pixel misalignments caused by the pose differences between P(xa) and P(xb), we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. Quantitative and qualitative results, using common datasets and protocols recently proposed for this task, show that our approach is competitive with respect to the state of the art. Moreover, we conduct an extensive evaluation using off-the-shelf person re-identification (Re-ID) systems trained with person-generation based augmented data, which is one of the main applications for this task. Our experiments show that our Deformable GANs can significantly boost the Re-ID accuracy and are even better than data-augmentation methods specifically trained using Re-ID losses.
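
The nearest-neighbour loss can be sketched as follows: each generated pixel is compared against a small window of the target image, and only the best match contributes to the error, which tolerates small residual misalignments that plain L1/L2 would penalize. The implementation below is an illustrative reconstruction, not the authors' code.

```python
import torch
import torch.nn.functional as F

def nearest_neighbour_loss(gen, target, window=5):
    """Sketch of a nearest-neighbour reconstruction loss.

    For every pixel of the generated image, the error is taken w.r.t. the
    best-matching pixel inside a small window of the target image.
    """
    pad = window // 2
    # (B, C*window*window, H*W): every target neighbourhood, unfolded.
    patches = F.unfold(F.pad(target, [pad] * 4), kernel_size=window)
    b, c, h, w = gen.shape
    patches = patches.view(b, c, window * window, h * w)
    gen_flat = gen.view(b, c, 1, h * w)
    dists = (patches - gen_flat).abs().sum(dim=1)   # (B, window*window, H*W)
    return dists.min(dim=1).values.mean()           # best match per pixel
```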


2021 - Coarse-to-fine gaze redirection with numerical and pictorial guidance [Conference Proceedings]
Chen, J.; Zhang, J.; Sangineto, E.; Chen, T.; Fan, J.; Sebe, N.
abstract

Gaze redirection aims at manipulating the gaze of a given face image with respect to a desired direction (i.e., a reference angle) and it can be applied to many real-life scenarios, such as video-conferencing or taking group photos. However, previous work on this topic mainly suffers from two limitations: (1) low-quality image generation and (2) low redirection precision. In this paper, we propose to alleviate these problems by means of a novel gaze redirection framework which exploits both a numerical and a pictorial direction guidance, jointly with a coarse-to-fine learning strategy. Specifically, the coarse branch learns the spatial transformation which warps the input image according to the desired gaze. On the other hand, the fine-grained branch consists of a generator network with conditional residual image learning and a multi-task discriminator. This second branch reduces the gap between the previously warped image and the ground-truth image and recovers finer texture details. Moreover, we propose a numerical and pictorial guidance module (NPG) which uses a pictorial gazemap description and numerical angles as an extra guide to further improve the precision of gaze redirection. Extensive experiments on a benchmark dataset show that the proposed method outperforms the state-of-the-art approaches in terms of both image quality and redirection precision. The code is available at https://github.com/jingjingchen777/CFGR


2021 - Efficient Training of Visual Transformers with Small-Size Datasets [Conference Proceedings]
Liu, Yahui; Sangineto, Enver; Bi, Wei; Sebe, Nicu; Lepri, Bruno; De Nadai, Marco
abstract


2021 - Metric-Learning-Based Deep Hashing Network for Content-Based Retrieval of Remote Sensing Images [Journal Article]
Roy, Subhankar; Sangineto, Enver; Demir, Begum; Sebe, Nicu
abstract

Hashing methods have recently been shown to be very effective in the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. Common hashing methods in RS are based on hand-crafted features on top of which they learn a hash function, which provides the final binary codes. However, these features are not optimized for the final task (i.e., retrieval using binary codes). On the other hand, modern deep neural networks (DNNs) have shown an impressive success in learning optimized features for a specific task in an end-to-end fashion. Unfortunately, typical RS data sets are composed of only a small number of labeled samples, which make the training (or fine-tuning) of big DNNs problematic and prone to overfitting. To address this problem, in this letter, we introduce a metric-learning-based hashing network, which: 1) implicitly uses a big, pretrained DNN as an intermediate representation step without the need of retraining or fine-tuning; 2) learns a semantic-based metric space where the features are optimized for the target retrieval task; and 3) computes compact binary hash codes for fast search. Experiments carried out on two RS benchmarks highlight that the proposed network significantly improves the retrieval performance under the same retrieval time when compared to the state-of-the-art hashing methods in RS.
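
The second and third ingredients above (a learned metric space plus binarization-friendly activations) can be sketched with two loss terms and a sign-based code extraction. The exact losses and weighting in the paper differ, so treat this as an assumption-labeled illustration.

```python
import torch
import torch.nn.functional as F

def hashing_losses(feats, labels, margin=1.0, lam=0.1):
    """Sketch of a metric + binarization objective for deep hashing.

    feats: (B, K) real-valued codes from the hashing head; labels: (B,).
    A contrastive-style metric term clusters same-class samples, while a
    push-to-corner term drives activations toward ±1 so that binarizing
    at retrieval time loses little information.
    """
    d = torch.cdist(feats, feats)                       # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    metric = (d[same].pow(2).mean()
              + F.relu(margin - d[~same]).pow(2).mean())
    binarization = (1.0 - feats.tanh().abs()).mean()    # push toward ±1
    return metric + lam * binarization

def hash_codes(feats):
    return (feats >= 0).to(torch.uint8)   # binary codes for Hamming search
```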


2021 - Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation [Conference Proceedings]
Liu, Yahui; Sangineto, Enver; Chen, Yajing; Bao, Linchao; Zhang, Haoxian; Sebe, Nicu; Lepri, Bruno; Wang, Wei; De Nadai, Marco
abstract


2021 - TriGAN: image-to-image translation for multi-source domain adaptation [Journal Article]
Roy, S.; Siarohin, A.; Sangineto, E.; Sebe, N.; Ricci, E.
abstract

Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single-source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for multi-source domain adaptation (MSDA) based on generative adversarial networks. Our method is inspired by the observation that the appearance of a given image depends on three factors: the domain, the style (characterized in terms of low-level feature variations) and the content. For this reason, we propose to project the source image features onto a space where only the dependence on the content is kept, and then re-project this invariant representation onto the pixel space using the target domain and style. In this way, new labeled images can be generated which are used to train a final target classifier. We test our approach using common MSDA benchmarks, showing that it outperforms state-of-the-art methods.


2021 - Whitening for Self-Supervised Representation Learning [Conference Proceedings]
Ermolov, A.; Siarohin, A.; Sangineto, E.; Sebe, N.
abstract


2020 - Attention-based Fusion for Multi-source Human Image Generation [Conference Proceedings]
Lathuiliere, Stephane; Sangineto, Enver; Siarohin, Aliaksandr; Sebe, Nicu
abstract

We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which selects relevant information from different source image regions, avoiding the necessity to build specific generators for each specific cardinality of X. The empirical evaluation of our method shows the practical interest of addressing the person-image generation problem in a multi-source setting.
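
The cardinality-agnostic fusion can be sketched as a per-pixel softmax over the source axis: each source feature map is scored, and the scores weight a sum over sources, so the same module handles any size of X. Names and shapes below are hypothetical.

```python
import torch
import torch.nn as nn

class SourceAttentionFusion(nn.Module):
    """Fuse features of a variable number of source images (sketch).

    A 1x1 conv scores each source feature map per location; a softmax over
    the source axis then yields pixel-wise attention weights.
    """
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                 # feats: (B, N, C, H, W)
        b, n, c, h, w = feats.shape
        s = self.score(feats.view(b * n, c, h, w)).view(b, n, 1, h, w)
        att = torch.softmax(s, dim=1)         # weights over the N sources
        return (att * feats).sum(dim=1)       # (B, C, H, W) fused map
```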


2020 - Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild [Conference Proceedings]
Zhang, Jichao; Chen, Jingjing; Tang, Hao; Wang, Wei; Yan, Yan; Sangineto, Enver; Sebe, Nicu
abstract

We address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose. We created a new dataset called CelebAGaze consisting of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our method consists of three novel modules: the Gaze Correction Module (GCM), the Gaze Animation Module (GAM), and the Pretrained Autoencoder Module (PAM). Specifically, GCM and GAM separately train a dual in-painting network using data from the domain X for gaze correction and data from the domain Y for gaze animation. Additionally, a Synthesis-As-Training method is proposed when training GAM to encourage the features encoded from the eye region to be correlated with the angle information, resulting in gaze animation achieved by interpolation in the latent space. To further preserve the identity information (e.g., eye shape, iris color), we propose the PAM with an Autoencoder, which is based on Self-Supervised mirror learning, where the bottleneck features are angle-invariant and which works as an extra input to the dual in-painting models. Extensive experiments validate the effectiveness of the proposed method for gaze correction and gaze animation in the wild and demonstrate the superiority of our approach in producing more compelling results than state-of-the-art baselines. Our code, the pretrained models and supplementary results are available at: https://github.com/zhangqianhui/GazeAnimation.


2020 - Online Continual Learning under Extreme Memory Constraints [Conference Proceedings]
Fini, Enrico; Lathuilière, Stéphane; Sangineto, Enver; Nabi, Moin; Ricci, Elisa
abstract


2019 - Self Paced Deep Learning for Weakly Supervised Object Detection [Journal Article]
Sangineto, E.; Nabi, M.; Culibrk, D.; Sebe, N.
abstract

In a weakly-supervised scenario, object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative, Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each image, which are treated as pseudo-ground truth in the next training iteration. However, the errors of an immature classifier can make the process drift, usually introducing many false positives in the training dataset. To alleviate this problem, we propose in this paper a training protocol based on the self-paced learning paradigm. The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training. While in the past few years similar strategies have been adopted for SVMs and other classifiers, we are the first to show that a self-paced approach can be used with deep-network-based classifiers in an end-to-end training pipeline. The method we propose is built on the fully-supervised Fast-RCNN architecture and can be applied to similar architectures which represent the input image as a bag of boxes. We show state-of-the-art results on Pascal VOC 2007, Pascal VOC 2010 and ILSVRC 2013. On ILSVRC 2013, our results based on a low-capacity AlexNet network outperform even those weakly-supervised approaches which are based on much higher-capacity networks.
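
The self-paced protocol reduces to a curriculum loop: rank training images by the confidence of the current detector, keep a growing fraction of the most reliable ones, use their top-scoring boxes as pseudo ground truth, and retrain. The sketch below uses hypothetical helpers (score_image, select_boxes, train_one_epoch) standing in for the detector-specific steps; only the curriculum logic is shown.

```python
def self_paced_train(detector, train_images, num_epochs,
                     start_frac=0.2, growth=0.1):
    """Self-paced training loop (sketch with hypothetical helpers)."""
    for epoch in range(num_epochs):
        # Rank images by the confidence of their current best detections.
        ranked = sorted(train_images,
                        key=lambda im: score_image(detector, im),
                        reverse=True)
        # Keep a growing fraction of the most reliable images.
        frac = min(1.0, start_frac + growth * epoch)
        subset = ranked[:max(1, int(frac * len(ranked)))]
        # Highest-confidence boxes act as pseudo ground truth.
        pseudo_gt = {im: select_boxes(detector, im) for im in subset}
        detector = train_one_epoch(detector, subset, pseudo_gt)
    return detector
```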


2019 - Training adversarial discriminators for cross-channel abnormal event detection in crowds [Conference Proceedings]
Ravanbakhsh, M.; Sangineto, E.; Nabi, M.; Sebe, N.
abstract

Abnormal crowd behaviour detection attracts large interest due to its importance in video surveillance scenarios. However, the ambiguity and the lack of sufficient abnormal ground truth data make end-to-end training of large deep networks hard in this domain. In this paper we propose to use Generative Adversarial Nets (GANs), which are trained to generate only the normal distribution of the data. During the adversarial GAN training, a discriminator (D) is used as a supervisor for the generator network (G) and vice versa. At testing time we use D to solve our discriminative task (abnormality detection), where D has been trained without the need of manually-annotated abnormal data. Moreover, in order to prevent G from learning a trivial identity function, we use a cross-channel approach, forcing G to transform raw-pixel data into motion information and vice versa. The quantitative results on standard benchmarks show that our method outperforms previous state-of-the-art methods in both the frame-level and the pixel-level evaluation.
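
At test time, the method boils down to reading D's output as a normality score on cross-channel input. A minimal sketch, assuming a PatchGAN-style discriminator that returns a spatial score map:

```python
import torch

@torch.no_grad()
def abnormality_map(discriminator, frame, flow):
    """Use a GAN discriminator trained on normal data as a detector (sketch).

    The discriminator is assumed to output a per-patch "real/normal" score
    map; low scores flag regions that do not fit the learned normality,
    without any abnormal training labels.
    """
    x = torch.cat([frame, flow], dim=1)            # cross-channel input
    normality = torch.sigmoid(discriminator(x))   # (B, 1, H', W') scores
    return 1.0 - normality                         # high = likely abnormal
```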


2019 - Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss [Conference Proceedings]
Roy, Subhankar; Siarohin, Aliaksandr; Sangineto, Enver; Bulo, Samuel Rota; Sebe, Nicu; Ricci, Elisa
abstract

A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions. Additionally, we leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, which regularizes training while avoiding the adoption of many user-defined hyper-parameters. We report results on publicly available datasets, considering both digit classification and object recognition tasks. We show that, in most of our experiments, our approach improves upon previous methods, setting new state-of-the-art performances.
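
The Min-Entropy Consensus idea can be sketched as follows for paired predictions on unlabeled target samples: reward the single class that maximizes the average log-probability of the two views, which enforces confident agreement without explicit pseudo-labels. This is an illustrative reconstruction of the loss, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def min_entropy_consensus(logits1, logits2):
    """Sketch of a min-entropy consensus-style loss for unlabeled targets.

    logits1/logits2: (B, K) predictions for two perturbed views of the
    same target samples. For each pair, the class maximizing the average
    log-probability of the two views is rewarded, pushing both predictions
    to agree confidently.
    """
    lp1 = F.log_softmax(logits1, dim=-1)
    lp2 = F.log_softmax(logits2, dim=-1)
    joint = 0.5 * (lp1 + lp2)                # (B, K) average log-probs
    return -joint.max(dim=-1).values.mean()  # consensus on the best class
```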


2019 - Whitening and coloring batch transform for GANs [Conference Proceedings]
Siarohin, A.; Sangineto, E.; Sebe, N.
abstract

Batch Normalization (BN) is a common technique used to speed up and stabilize training. On the other hand, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks (cGANs) for representing class-specific information using conditional Batch Normalization (cBN). In this paper we propose to generalize both BN and cBN using a Whitening and Coloring based batch normalization. We show that our conditional Coloring can represent categorical conditioning information which significantly improves the cGAN qualitative results. Moreover, we show that full-feature whitening is important in a general GAN scenario in which the training process is known to be highly unstable. We test our approach on different datasets and using different GAN networks and training protocols, showing a consistent improvement in all the tested frameworks. Our conditional CIFAR-10 results are higher than those of all previous works on this dataset.
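
A minimal version of the whitening-and-coloring transform for a (B, dim) feature batch: decorrelate with a ZCA transform computed from batch statistics, then re-project with a learnable coloring matrix and bias. In the conditional case the paper keeps class-specific coloring parameters; the sketch below shows only the unconditional path and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class WhiteningColoring(nn.Module):
    """Batch whitening followed by a learned coloring (sketch)."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.color_w = nn.Parameter(torch.eye(dim))   # learned coloring
        self.color_b = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                  # x: (B, dim)
        xc = x - x.mean(dim=0)
        cov = xc.t() @ xc / (x.size(0) - 1)
        eye = torch.eye(x.size(1), device=x.device)
        eigval, eigvec = torch.linalg.eigh(cov + self.eps * eye)
        zca = eigvec @ torch.diag(eigval.rsqrt()) @ eigvec.t()  # ZCA matrix
        return (xc @ zca) @ self.color_w + self.color_b
```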


2018 - Deep Metric and Hash-Code Learning for Content-Based Retrieval of Remote Sensing Images [Conference Proceedings]
Roy, S; Sangineto, E; Demir, B; Sebe, N
abstract

The growing volume of Remote Sensing (RS) image archives demands feature learning techniques and hashing functions which can: (1) accurately represent the semantics in the RS images; and (2) have quasi real-time performance during retrieval. This paper aims to address both challenges at the same time, by learning a semantic-based metric space for content-based RS image retrieval while simultaneously producing binary hash codes for an efficient archive search. This double goal is achieved by training a deep network using a combination of different loss functions which, on the one hand, aim at clustering semantically similar samples (i.e., images), and, on the other hand, encourage the network to produce final activation values (i.e., descriptors) that can be easily binarized. Moreover, since RS annotated training images are too few to train a deep network from scratch, we propose to split the image representation problem into two different phases. In the first, we use a general-purpose, pre-trained network to produce an intermediate representation, and in the second, we train our hashing network using a relatively small set of training images. Experiments on two aerial benchmark archives show that the proposed method outperforms previous state-of-the-art hashing approaches by up to 5.4% using the same number of hash bits per image.


2018 - Deformable GANs for Pose-Based Human Image Generation [Conference Proceedings]
Siarohin, Aliaksandr; Sangineto, Enver; Lathuiliere, Stephane; Sebe, Nicu
abstract

In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. We test our approach using photos of persons in different poses and we compare our method with previous work in this area, showing state-of-the-art results on two benchmarks. Our method can be applied to the wider field of deformable object generation, provided that the pose of the articulated object can be extracted using a keypoint detector.


2018 - Plug-and-play CNN for crowd motion analysis: An application in abnormal event detection [Conference Proceedings]
Ravanbakhsh, M.; Nabi, M.; Mousavi, H.; Sangineto, E.; Sebe, N.
abstract

Most of the crowd abnormal event detection methods rely on complex hand-crafted features to represent the crowd motion and appearance. Convolutional Neural Networks (CNN) have been shown to be a powerful instrument with excellent representational capacities, which can alleviate the need for hand-crafted features. In this paper, we show that keeping track of the changes in the CNN features across time can be used to effectively detect local anomalies. Specifically, we propose to measure local abnormality by combining semantic information (inherited from existing CNN models) with low-level optical flow. One of the advantages of this method is that it can be used without the fine-tuning phase. The proposed method is validated on challenging abnormality detection datasets and the results show the superiority of our approach compared with the state-of-the-art methods.


2018 - Semantic-Fusion GANs for Semi-Supervised Satellite Image Classification [Conference Proceedings]
Roy, Subhankar; Sangineto, E.; Demir, B.; Sebe, N.
abstract

Most of the public satellite image datasets contain only a small number of annotated images. The lack of a sufficient quantity of labeled data for training is a bottleneck for the use of modern deep-learning based classification approaches in this domain. In this paper we propose a semi-supervised approach to deal with this problem. We use the discriminator (D) of a Generative Adversarial Network (GAN) as the final classifier, and we train D using both labeled and unlabeled data. The main novelty we introduce is the representation of the visual information fed to D by means of two different channels: the original image and its “semantic” representation, the latter being obtained by means of an external network trained on ImageNet. The two channels are fused in D and jointly used to classify fake images, real labeled and real unlabeled images. We show that using only 100 labeled images, the proposed approach achieves an accuracy close to 69% and a significant improvement with respect to other GAN-based semi-supervised methods. Although we have tested our approach only on satellite images, we do not use any domain-specific knowledge. Thus, our method can be applied to other semi-supervised domains.


2017 - Abnormal event detection in videos using generative adversarial nets [Conference Proceedings]
Ravanbakhsh, M.; Nabi, M.; Sangineto, E.; Marcenaro, L.; Regazzoni, C.; Sebe, N.
abstract

In this paper we address the abnormality detection problem in crowded scenes. We propose to use Generative Adversarial Nets (GANs), which are trained using normal frames and corresponding optical-flow images in order to learn an internal representation of the scene normality. Since our GANs are trained with only normal data, they are not able to generate abnormal events. At testing time the real data are compared with both the appearance and the motion representations reconstructed by our GANs and abnormal areas are detected by computing local differences. Experimental results on challenging abnormality detection datasets show the superiority of the proposed method compared to the state of the art in both frame-level and pixel-level abnormality detection tasks.


2017 - FOIL it! Find One mismatch between Image and Language caption [Conference Proceedings]
Shekhar, Ravi; Pezzelle, Sandro; Klimovich, Yauhen; Herbelot, Aurelie; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella
abstract

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.


2017 - Vision and language integration: Moving beyond objects [Conference Proceedings]
Shekhar, R.; Pezzelle, S.; Herbelot, A.; Nabi, M.; Sangineto, E.; Bernardi, R.
abstract

The last years have seen an explosion of work on the integration of vision and language data. New tasks like Image Captioning and Visual Question Answering have been proposed and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of those models. To this end, several datasets have been proposed to try and challenge the state-of-the-art. Those datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations somewhat fare better, but only in attention models.


2016 - Bad teacher or unruly student: Can deep learning say something in Image Forensics analysis? [Conference Proceedings]
Rota, P.; Sangineto, E.; Conotter, V.; Pramerdorfer, C.
abstract

The pervasive availability of the Internet, coupled with the development of increasingly powerful technologies, has made digital images the primary source of visual information in today's society. However, their reliability as a true representation of reality cannot be taken for granted, due to affordable and powerful graphics editing software that can easily alter the original content while leaving no visual trace of the modification, making such images potentially dangerous. This motivates the development of technological solutions able to detect media manipulations without prior knowledge or extra information regarding the given image. At the same time, the huge amount of available data has also led to tremendous advances in data-hungry learning models, which have already proven successful in image classification in the last few years. In this work we propose a deep learning approach for tampered-image classification. To the best of our knowledge, this is the first attempt to use the deep learning paradigm in an image forensics scenario. In particular, we propose a new blind deep learning approach based on Convolutional Neural Networks (CNN), able to learn invisible discriminative artifacts from manipulated images that can be exploited to automatically discriminate between forged and authentic images. The proposed approach not only detects forged images but can also be extended to localize the tampered regions within the image. The method outperforms the state-of-the-art in terms of accuracy on the CASIA TIDE v2.0 dataset. The capability of automatically crafting discriminative features can lead to surprising results, for instance detecting the image compression filters used to create the dataset, an issue we also discuss in this paper.


2016 - Learning Personalized Models for Facial Expression Analysis and Gesture Recognition [Journal Article]
Zen, Gloria; Porzi, Lorenzo; Sangineto, Enver; Ricci, Elisa; Sebe, Niculae
abstract

Facial expression and gesture recognition algorithms are key enabling technologies for human-computer interaction (HCI) systems. State of the art approaches for automatic detection of body movements and analyzing emotions from facial features heavily rely on advanced machine learning algorithms. Most of these methods are designed for the average user, but the assumption “one-size-fits-all” ignores diversity in cultural background, gender, ethnicity, and personal behavior, and limits their applicability in real-world scenarios. A possible solution is to build personalized interfaces, which practically implies learning person-specific classifiers and usually collecting a significant amount of labeled samples for each novel user. As data annotation is a tedious and time-consuming process, in this paper we present a framework for personalizing classification models which does not require labeled target data. Personalization is achieved by devising a novel transfer learning approach. Specifically, we propose a regression framework which exploits auxiliary (source) annotated data to learn the relation between person-specific sample distributions and parameters of the corresponding classifiers. Then, when considering a new target user, the classification model is computed by simply feeding the associated (unlabeled) sample distribution into the learned regression function. We evaluate the proposed approach in different applications: pain recognition and action unit detection using visual data and gestures classification using inertial measurements, demonstrating the generality of our method with respect to different input data types and basic classifiers. We also show the advantages of our approach in terms of accuracy and computational time both with respect to user-independent approaches and to previous personalization techniques.
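
The transfer scheme can be illustrated in a few lines of scikit-learn: summarize each source user's sample distribution with a simple descriptor (here just the feature mean, a crude stand-in for the distribution kernels used in the paper), fit a regression from descriptors to person-specific classifier parameters, and predict a classifier for a new user from unlabeled data alone. All helper names are hypothetical.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import LinearSVC

def fit_parameter_transfer(source_user_data):
    """Sketch of distribution-to-classifier parameter transfer.

    source_user_data: list of (X, y) arrays, one pair per annotated source
    user. Kernel ridge regression maps each user's distribution summary to
    the weights of that user's person-specific classifier.
    """
    descriptors, params = [], []
    for X, y in source_user_data:
        clf = LinearSVC().fit(X, y)
        descriptors.append(X.mean(axis=0))
        params.append(np.hstack([clf.coef_.ravel(), clf.intercept_]))
    return KernelRidge(kernel="rbf").fit(np.array(descriptors),
                                         np.array(params))

def personalize(reg, X_target_unlabeled):
    """Predict a classifier for a new user from unlabeled data only."""
    w_b = reg.predict(X_target_unlabeled.mean(axis=0, keepdims=True))[0]
    return w_b[:-1], w_b[-1]    # weight vector and bias
```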


2015 - FaceCept3D: Real Time 3D Face Tracking and Analysis [Conference Proceedings]
Tulyakov, S; Vieriu, R; Sangineto, E; Sebe, N
abstract


2015 - Facial expression recognition under a wide range of head poses [Conference Proceedings]
Vieriu, Radu Laurentiu; Tulyakov, Sergey; Semeniuta, Stanislau; Sangineto, Enver; Sebe, Niculae
abstract


2015 - Unsupervised Tube Extraction Using Transductive Learning and Dense Trajectories [Conference Proceedings]
Puscas, Mihai-Marian; Sangineto, Enver; Culibrk, Dubravko; Sebe, Niculae
abstract

We address the problem of automatic extraction of foreground objects from videos. The goal is to provide a method for unsupervised collection of samples which can be further used for object detection training without any human intervention. We use the well-known Selective Search approach to produce an initial still-image-based segmentation of the video frames. This initial set of proposals is pruned and temporally extended using optical flow and transductive learning. Specifically, we propose to use Dense Trajectories in order to robustly match and track candidate boxes over different frames. The obtained box tracks are used to collect samples for unsupervised training of track-specific detectors. Finally, the detectors are run on the videos to extract the final tubes. The combination of appearance-based static "objectness" (Selective Search), motion information (Dense Trajectories) and transductive learning (detectors are forced to "overfit" on the unsupervised data used for training) makes the proposed approach extremely robust. We outperform state-of-the-art systems by a large margin on common benchmarks used for tube proposal evaluation.


2015 - Video Classification with Densely Extracted HOG/HOF/MBH Features: An Evaluation of the Accuracy/Computational Efficiency Trade-off [Journal Article]
Uijlings, J.; Duta, Ionut Cosmin; Sangineto, Enver; Sebe, Niculae
abstract

The current state-of-the-art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive. This paper addresses the problem of computational efficiency. Specifically: (1) we propose several speed-ups for densely sampled HOG, HOF and MBH descriptors and release Matlab code; (2) we investigate the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of Optical Flow method; (3) we investigate the trade-off between accuracy and computational efficiency for computing the feature vocabulary, using and comparing most of the commonly adopted vector quantization techniques: k-means, hierarchical k-means, Random Forests, Fisher Vectors and VLAD.


2014 - Statistical and Spatial Consensus Collection for Detector Adaptation [Conference Proceedings]
Sangineto, E
abstract


2014 - Unsupervised Domain Adaptation for Personalized Facial Emotion Recognition [Conference Proceedings]
Zen, Gloria; Sangineto, Enver; Ricci, E.; Sebe, Niculae
abstract


2014 - We are not All Equal: Personalizing Models for Facial Expression Analysis with Transductive Parameter Transfer [Conference Proceedings]
Sangineto, Enver; Zen, Gloria; Ricci, Elisa; Sebe, Niculae
abstract


2013 - Anomaly detection in crowded scenes: a novel framework based on Swarm Optimization and Social Force Modeling [Book Chapter]
Raghavendra, R; Cristani, M; Del Bue, A; Sangineto, E; Murino, V
abstract


2013 - Pose and Expression Independent Facial Landmark Localization Using Dense-SURF and the Hausdorff Distance [Journal Article]
Sangineto, E
abstract


2012 - Learning Discriminative Spatial Relations for Detector Dictionaries: An Application to Pedestrian Detection [Conference Proceedings]
Sangineto, E; Cristani, M; Del Bue, A; Murino, V
abstract


2012 - Real-time viewpoint-invariant hand localization with cluttered backgrounds [Journal Article]
Sangineto, E; Cupelli, M
abstract


2010 - A framework for hand-based interaction [Conference Proceedings]
Bottoni, P; Cinque, L; Cupelli, M; Di Filippo, L; Labella, A; Pierro, M; Sangineto, E
abstract


2010 - Face recognition using SIFT features and a region-based ranking [Journal Article]
Cinque, L.; Iovane, G.; Manzo, M.; Sangineto, E.
abstract

Two of the most important state-of-the-art challenges in face recognition are dealing with image acquisition conditions that differ greatly between the gallery and the probe set, and dealing with large datasets of individuals. In this paper we address both aspects, presenting a method which is able to work in “real life” scenarios, in which face images are differently illuminated, can be partially occluded or can show different facial expressions or noise levels. Our proposed system has been tested with datasets of 1000 different individuals, showing performances usually obtained with much smaller gallery sets and much better images. The approach we propose is based on SIFT descriptors, which are known to be robust to different illumination conditions and noise levels. SIFTs are used to automatically detect face regions (mouth area, eye area, etc.). Such regions are then independently compared with the corresponding regions of the gallery images for computing a similarity-based ranking of the system's database.
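
Region-wise SIFT matching of the kind described above can be sketched with OpenCV: descriptors are computed inside each facial region and matched only against the same region of a gallery face, with Lowe's ratio test filtering unreliable correspondences. The region boxes and scoring rule below are hypothetical simplifications.

```python
import cv2

def region_similarity(img_query, img_gallery, region_boxes):
    """Sketch of region-wise SIFT matching for face ranking.

    region_boxes: hypothetical (x, y, w, h) boxes for face regions (eyes,
    mouth, ...). Each query region is matched only against the same region
    of the gallery face; the ratio test keeps reliable correspondences.
    """
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    score = 0
    for (x, y, w, h) in region_boxes:
        _, dq = sift.detectAndCompute(img_query[y:y + h, x:x + w], None)
        _, dg = sift.detectAndCompute(img_gallery[y:y + h, x:x + w], None)
        if dq is None or dg is None:
            continue
        matches = matcher.knnMatch(dq, dg, k=2)
        score += sum(1 for pair in matches if len(pair) == 2
                     and pair[0].distance < 0.75 * pair[1].distance)
    return score   # higher = more similar; used to rank the gallery
```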


2009 - Improved statistical techniques for multi-part face detection and recognition [Conference Proceedings]
Micheloni, Christian; Sangineto, Enver; Cinque, Luigi; Foresti, Gian Luca
abstract

In this paper we propose an integrated system for face detection and face recognition based on improved versions of state-of-the-art statistical learning techniques such as Boosting and LDA. Both the detection and the recognition processes are performed on facial features (e.g., the eyes, the nose, the mouth, etc.) in order to improve the recognition accuracy and to exploit their statistical independence in the training phase. Experimental results on real images show the superiority of our proposed techniques with respect to the existing ones in both the detection and the recognition phase.


2008 - Adaptive Course Generation through Learning Styles Representation [Journal Article]
Sangineto, E.; Capuano, N.; Gaeta, M.; Micarelli, A.
abstract


2008 - Comparing SIFT and LDA based face recognition approaches [Journal Article]
Cinque, L.; Iovane, G.; Sangineto, E.
abstract

In this paper we present a face recognition system based on the Scale Invariant Feature Transform (SIFT) image descriptors recently proposed by Lowe [6] and largely used in generic object recognition tasks. We show how SIFT descriptors can be used in a robust face recognition system coupled with some simple image normalization processes and geometric constraints on SIFT matchings. In the paper we present an extensive experimental evaluation of the proposed SIFT-based face recognition approach, comparing it with a “standard” Linear Discriminant Analysis (LDA)-based method, commonly considered one of the best-performing face recognition techniques. The two systems have been tested using images collected from different face recognition benchmarks in order to simulate real-life applications in which image acquisition parameters largely vary from one query image to the other. In all our tests the SIFT-based method clearly outperformed the LDA-based one, showing that the a priori knowledge embedded in the SIFT local description of image appearances is more robust than training-based systems in which appearance variability factors need to be learned off-line.


2008 - Detecting Attention through Telepresence [Conference Proceedings]
Levialdi, S.; Malizia, A.; Onorati, T.; Sangineto, E.; Sebe, Niculae
abstract


2008 - Fast viewpoint-invariant articulated hand detection combining curve and graph matching [Conference Proceedings]
Cinque, Luigi; Cupelli, M.; Sangineto, Enver
abstract


2008 - Identifying elephant photos by multi-curve matching [Journal Article]
Ardovini, A.; Cinque, Luigi; Sangineto, Enver
abstract

We present in this paper an elephant photo identification system based on the shape comparison of the nicks characterizing the elephants' ears. The method we propose can deal with very cluttered and noisy images, such as the ones commonly used by zoologists for wild elephant photo identification. Difficult segmentation problems are solved using rough position information input by the user. Such information is used by the system as a basis for a set of segmentation and normalization hypotheses aiming at comparing a query photo Q with different photos of the system's database possibly representing the same individual as Q. The proposed shape comparison method, based on matching multiple, non-connected curves, can be applied to different retrieval-by-shape problems. Examples with real wild elephant photos are shown.


2008 - LICoSL: Landmark-Based Identikit Composition and Suspect Retrieval [Journal Article]
Iovane, G.; Sangineto, E.
abstract


2007 - A Statistical Method for People Counting in Crowded Environments [Conference Proceedings]
Bozzoli, M.; Cinque, Luigi; Sangineto, Enver
abstract


2007 - A semi-automatic approach to photo identification of wild elephants [Conference Proceedings]
Ardovini, A.; Cinque, Luigi; Della Rocca, F.; Sangineto, Enver
abstract


2007 - An adaptive e-learning platform for personalized course generation [Book Chapter]
Sangineto, E
abstract


2007 - Deformation tolerant generalized Hough transform for sketch-based image retrieval in complex scenes [Journal Article]
Anelli, M.; Cinque, Luigi; Sangineto, Enver
abstract

Sketch-based image retrieval systems need to handle two main problems. First of all, they have to recognize shapes similar but not necessarily identical to the user’s query. Hence, exact object identification techniques do not fit in this case. The second problem is the selection of the image features to compare with the user’s sketch. In domain-independent visual repositories, real-life images with non-uniform background and possible occluding objects make this second task particularly hard. We address the second problem proposing a variant of the well-known Generalized Hough Transform (GHT), which is a robust object identification technique for unsegmented images. Moreover, we solve the first problem modifying the GHT to deal with an inexact matching problem. In this paper, we show how this idea can be efficiently and accurately realized. Experimental results are shown with two different databases of real, unsegmented images.
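
For reference, the classical GHT voting step that the paper generalizes looks as follows; the deformation-tolerant variant relaxes this exact R-table matching. The snippet is a plain GHT sketch, not the paper's algorithm.

```python
import numpy as np

def ght_vote(edge_points, orientations, r_table, img_shape, n_bins=64):
    """Minimal Generalized Hough Transform voting step (sketch).

    r_table maps a quantized gradient orientation to a list of integer
    (dx, dy) offsets from model edge points to the model's reference point,
    built offline from the sketch/template. Each image edge point casts
    votes for candidate reference-point locations; accumulator peaks are
    object hypotheses.
    """
    acc = np.zeros(img_shape, dtype=np.int32)
    for (x, y), theta in zip(edge_points, orientations):
        b = int(theta / (2 * np.pi) * n_bins) % n_bins
        for dx, dy in r_table.get(b, []):
            cx, cy = int(x + dx), int(y + dy)
            if 0 <= cx < img_shape[1] and 0 <= cy < img_shape[0]:
                acc[cy, cx] += 1
    return acc
```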


2007 - Detecting attention through Telepresence [Conference Proceedings]
Levialdi Ghiron, S.; Malizia, A.; Onorati, T.; Sangineto, E.; Sebe, N.
abstract


2007 - Shape Based Image Retrieval by Alignment [Book Chapter]
Sangineto, E
abstract


2006 - A 3D geometric approach to face detection and facial expression recognition [Journal Article]
Gaeta, Matteo; Iovane, Gerardo; Sangineto, Enver
abstract

Face detection and facial expression recognition are research areas with important application possibilities. Although the two problems are usually addressed with different approaches, we show in this paper how the same recognition process can be used to recognize both a generic "class-face" in a given, possibly complex image, and a specific facial expression. The approach we propose is based on two steps. In the former we use alignment techniques in order to overlap the 3D representations of the main face components with the 2D image elements. In the latter we compare the candidate groups of localized components with a set of structural models, each of which represents a facial expression. Expression-independent face detection is achieved using the same approach with a model built by generalizing over a set of face examples with different expressions.


2006 - Articulated Object Recognition: A General Framework and a Case Study [Conference Proceedings]
Cinque, Luigi; Sangineto, Enver; Tanimoto, S. L.
abstract


2006 - Computer-aided wild elephant identification [Conference Proceedings]
Petriccione, D; Ardovini, A; Cinque, L; Della Rocca, F; Massaro, L; Ortolani, A; Sangineto, E
abstract


2006 - Recognition of articulated robots in the RoboCup domain [Journal Article]
Cinque, L.; Sangineto, E.; Tanimoto, S.
abstract


2005 - A survey on nonrigid object recognition approaches and their applications to face detection and human body detection [Journal Article]
Gaeta, M.; Iovane, G.; Sangineto, E.
abstract


2005 - Automatic Student Personalization in Preferred Learning Categories [Conference Proceedings]
Capuano, N; Gaeta, M; Micarelli, A; Sangineto, E
abstract


2005 - Environment Topological Structure Recognition for Robot Navigation [Conference Proceedings]
Sangineto, E.; Iarusso, M. R.
abstract


2004 - Diogene: a Semantic Web-Based Automatic Brokering System [Journal Article]
Capuano, N.; Gaeta, M.; Micarelli, A.; Sangineto, E.
abstract


2004 - Shape Similarity Image Retrieval by Hypothesis and Test [Conference Proceedings]
Sangineto, E
abstract


2004 - Visual Retrieval through Geometric Voting [Book Chapter]
Sangineto, E.; Anelli, M.; Micarelli, A.
abstract


2003 - A New Content Based Image Retrieval Method Based on a Sketch-Driven Interpretation of Line Segments [Conference Proceedings]
Anelli, M.; Micarelli, A.; Sangineto, E.
abstract


2003 - An Abstract Representation of Geometric Knowledge For Object Classification [Journal Article]
Sangineto, E
abstract


2003 - An Intelligent Web Teacher System for Learning Personalization and Semantic Web Compatibility [Conference Proceedings]
Capuano, N.; M., Gaeta; Micarelli, A.; Sangineto, E.
abstract

We propose a Web tutoring system in which Artificial Intelligence techniques and Semantic Web approaches are integrated in order to provide an automatic tool able both to fully customize learning to the student's needs and to exchange learning material with other Web systems. IWT (Intelligent Web Teacher) is based on an ad hoc knowledge representation which describes the didactic domain by means of an Ontology. The student can select the concepts of the Ontology she/he is interested in. The system's planning mechanism then builds the most suitable Learning Path for that student.


2003 - Content Based Image Retrieval for Unsegmented Images [Conference Proceedings]
Anelli, M.; Micarelli, A.; Sangineto, E.
abstract


2003 - Diogene: a Training Web Broker for ICT Professionals [Conference Proceedings]
Vergara, M.; Capuano, N.; Sangineto, E.
abstract

The purpose of this paper is to describe the work in progress related to the design, the implementation and the evaluation of an innovative e-learning platform for ICT individual training in the framework of an EC funded project named Diogene. The present e-learning solution includes several state-of-the-art technologies and methodologies such as: metadata and ontologies for knowledge manipulation, fuzzy learner modelling, intelligent course tailoring, co-operative and online training support. The proposed solution is based on the distribution of working tasks among content provider services, content discovery services, content brokering services, training services, curriculum vitae searching services and collaboration services.


2003 - Recognition of office-like environments through the extraction of the perspective structure [Conference Proceedings]
Iarusso, M. R.; Micarelli, A.; Sangineto, E.
abstract

In this paper we propose a vision-based system that lets a robot recognize an observed environment through the construction of a perspective structure which characterizes it. The identification of the most significant characteristics of the perspective structure is performed by a geometric method that, using the information given by the image, represents the scene through elementary geometric forms (such as straight lines) and, on the basis of this geometric representation, detects the elements of the perspective structure (e.g., the vanishing point). The method returns results that can help the robot establish whether it is actually inside a corridor or in another place.
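
A rough modern equivalent of the geometric step described above: extract straight segments with a probabilistic Hough transform and take a robust consensus of their pairwise intersections as the dominant vanishing point. This is an illustrative sketch, not the original system.

```python
import cv2
import numpy as np
from itertools import combinations

def vanishing_point(gray):
    """Estimate a vanishing point from line intersections (sketch)."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=40, maxLineGap=5)
    if lines is None:
        return None
    pts = []
    for (x1, y1, x2, y2), (x3, y3, x4, y4) in combinations(
            (l[0] for l in lines), 2):
        d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        if abs(d) < 1e-6:
            continue                       # (nearly) parallel segments
        a = x1 * y2 - y1 * x2
        b = x3 * y4 - y3 * x4
        pts.append(((a * (x3 - x4) - (x1 - x2) * b) / d,
                    (a * (y3 - y4) - (y1 - y2) * b) / d))
    # The median of all intersections is a robust consensus estimate.
    return np.median(np.array(pts), axis=0)  # (x, y)
```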


2002 - A Deformation Tolerant Version of the Generalized Hough Transform for Image Retrieval [Conference Proceedings]
Anelli, M.; Micarelli, A.; Sangineto, E.
abstract


2002 - A Geometric Approach to Natural Indoor Landmark Recognition for Mobile Robots [Conference Proceedings]
Micarelli, A; Panzieri, S; Sangineto, E; Sansonetti, G
abstract


2002 - An Architecture for Video Content-Based Retrieval [Journal Article]
Degli Esposti, A.; Micarelli, A.; Neri, A.; Sangineto, E.; Sansonetti, G.
abstract


2002 - An Integrated Architecture for Automatic Course Generation [Conference Proceedings]
Capuano, N.; Gaeta, M.; Micarelli, A.; Sangineto, E.
abstract


2002 - Annotazione Automatica di Tennis Video (Automatic Annotation of Tennis Videos) [Conference Proceedings]
Calvo, C.; Micarelli, A.; Sangineto, E.
abstract


2002 - Automatic Annotation of Tennis Video Sequences [Conference Proceedings]
Calvo, C.; Micarelli, A.; Sangineto, E.
abstract


2002 - Recupero di Immagini tramite Trasformata di Hough (Image Retrieval via the Hough Transform) [Conference Proceedings]
Anelli, M.; Micarelli, A.; Sangineto, E.
abstract


2002 - Two Different Approaches to Natural Indoor Landmark Recognition for Robot Navigation [Journal Article]
Micarelli, A.; Panzieri, S.; Sangineto, E.; Sansonetti, G.
abstract


2001 - Natural Indoor Landmark Recognition for Robot Navigation [Conference Proceedings]
Micarelli, A.; Sangineto, E.; Sansonetti, G.
abstract


1998 - Local Reasoning to Play Diplomacy [Conference Proceedings]
Sangineto, E
abstract


1997 - Multi-Agent Negotiation and Planning Through Knowledge Contextualization [Conference Proceedings]
Sangineto, E
abstract