
MATTEO STEFANINI

Research Fellow
Department of Engineering "Enzo Ferrari"




Publications

2023 - From Show to Tell: A Survey on Deep Learning-based Image Captioning [Journal article]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita


2023 - Predicting gene and protein expression levels from DNA and protein sequences with Perceiver [Journal article]
Stefanini, Matteo; Lovino, Marta; Cucchiara, Rita; Ficarra, Elisa

Background and objective: The functions of an organism and its biological processes result from the expression of genes and proteins. Quantifying and predicting mRNA and protein levels is therefore a crucial aspect of scientific research. For mRNA-level prediction, the available approaches use the sequence upstream and downstream of the Transcription Start Site (TSS) as input to neural networks. State-of-the-art models (e.g., Xpresso and Basenji) predict mRNA levels with Convolutional Neural Networks (CNNs) or Long Short-Term Memory (LSTM) networks. However, CNN predictions depend on the convolutional kernel size, and LSTMs struggle to capture long-range dependencies in the sequence. For protein-level prediction, to the best of our knowledge, no existing model predicts protein levels from gene or protein sequences.

Methods: Here, we apply a new model type, the Perceiver, to mRNA and protein level prediction, exploiting a Transformer-based architecture with an attention module that attends to long-range interactions in the sequences. In addition, the Perceiver overcomes the quadratic complexity of standard Transformer architectures. The contributions of this work are: (1) the DNAPerceiver model, which predicts mRNA levels from the sequence upstream and downstream of the TSS; (2) the ProteinPerceiver model, which predicts protein levels from the protein sequence; (3) the Protein&DNAPerceiver model, which predicts protein levels from the TSS and protein sequences.

Results: The models are evaluated on cell lines, mice, glioblastoma, and lung cancer tissues. The results show the effectiveness of Perceiver-type models in predicting mRNA and protein levels.

Conclusions: This paper presents a Perceiver architecture for mRNA and protein level prediction. In the future, adding regulatory and epigenetic information to the model could further improve mRNA and protein level predictions. The source code is freely available at https://github.com/MatteoStefanini/DNAPerceiver.
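
As a concrete illustration of the mechanism described above, here is a minimal sketch (assuming PyTorch) of a Perceiver-style block: a small set of learned latent vectors cross-attends to a long input sequence, so the attention cost grows linearly with sequence length instead of quadratically as in a standard Transformer. All names and hyperparameters below are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    def __init__(self, dim=256, num_latents=64, num_heads=4):
        super().__init__()
        # Learned latent array: the fixed-size bottleneck of the Perceiver.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Cross-attention: latents are the queries; the input sequence provides keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention among the few latents is cheap (independent of sequence length).
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1)  # scalar expression-level prediction

    def forward(self, seq_emb):  # seq_emb: (B, L, dim), with L possibly very large
        B = seq_emb.size(0)
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        lat = self.norm1(lat + self.cross_attn(lat, seq_emb, seq_emb)[0])
        lat = self.norm2(lat + self.self_attn(lat, lat, lat)[0])
        return self.head(lat.mean(dim=1))  # (B, 1) predicted mRNA or protein level

The fixed-size latent array is what decouples the computation from the sequence length; a deeper model simply stacks more such blocks over the same latents.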


2022 - ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [Conference paper]
Messina, Nicola; Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Falchi, Fabrizio; Amato, Giuseppe; Cucchiara, Rita


2022 - CaMEL: Mean Teacher Learning for Image Captioning [Conference paper]
Barraco, Manuele; Stefanini, Matteo; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita


2021 - A Novel Attention-based Aggregation Function to Combine Vision and Language [Conference paper]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements, such as regions and words, proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality through a novel variant of cross-attention, and performs a learnable, cross-modal reduction that can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, making fair comparisons with other reduction choices on the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Furthermore, we conduct ablation studies to validate the role of each component of the approach.
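
As a rough illustration of this kind of learnable, cross-modal reduction (a sketch in the spirit of the description above, not the paper's exact operator), the snippet below scores the elements of one modality against a summary of the other and replaces fixed poolings such as mean or max with the resulting weighted sum; all names are hypothetical, assuming PyTorch.

import torch
import torch.nn as nn

class CrossModalReduction(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the other-modality summary
        self.k_proj = nn.Linear(dim, dim)  # projects the elements to be reduced
        self.scale = dim ** -0.5

    def forward(self, elems, other):
        # elems: (B, N, dim), e.g. image regions; other: (B, M, dim), e.g. words.
        query = self.q_proj(other.mean(dim=1, keepdim=True))  # (B, 1, dim)
        keys = self.k_proj(elems)                             # (B, N, dim)
        scores = (query @ keys.transpose(1, 2)) * self.scale  # (B, 1, N)
        weights = scores.softmax(dim=-1)                      # learned reduction weights
        return (weights @ elems).squeeze(1)                   # (B, dim) single response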


2021 - Learning to Select: A Fully Attentive Approach for Novel Object Captioning [Conference paper]
Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita


2020 - Explaining Digital Humanities by Aligning Images and Textual Descriptions [Journal article]
Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, when realistic images and sentences with simple semantics are employed and when paired training data is provided. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only presents more complex visual and semantic structures, but also features a significant lack of training data, which makes fully-supervised approaches infeasible. To this end, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision. This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.
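
For context, joint visual-semantic embeddings of this kind are commonly trained on the paired source datasets with a hinge-based ranking loss over matching and non-matching image-text pairs; the sketch below (assuming PyTorch) shows that standard formulation under illustrative assumptions, not the authors' exact objective or transfer procedure.

import torch
import torch.nn.functional as F

def ranking_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, dim); row i of each forms a matching (image, text) pair.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                       # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                     # matching-pair similarities
    cost_txt = (margin + sim - pos).clamp(min=0)      # image-to-text direction
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text-to-image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()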


2020 - Meshed-Memory Transformer for Image Captioning [Conference paper]
Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions, integrating learned a priori knowledge, and uses mesh-like connectivity at the decoding stage to exploit both low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and of different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
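
A minimal sketch (assuming PyTorch) of the memory-augmented attention underlying the M² encoder: learned memory slots are concatenated to the keys and values so that attention can also retrieve a priori knowledge that is not present in the input regions. Sizes and names here are illustrative; the authors' implementation is in the repository linked above.

import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_memories=40):
        super().__init__()
        # Learned memory slots, appended to keys and values only; queries stay the regions.
        self.mem_k = nn.Parameter(torch.randn(num_memories, dim))
        self.mem_v = nn.Parameter(torch.randn(num_memories, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regions):  # regions: (B, N, dim) encoded image regions
        B = regions.size(0)
        k = torch.cat([regions, self.mem_k.unsqueeze(0).expand(B, -1, -1)], dim=1)
        v = torch.cat([regions, self.mem_v.unsqueeze(0).expand(B, -1, -1)], dim=1)
        out, _ = self.attn(regions, k, v)
        return out  # (B, N, dim), regions enriched with retrieved prior knowledge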


2019 - A Deep-Learning-based Approach to VM Behavior Identification in Cloud Systems [Conference paper]
Stefanini, M.; Lancellotti, R.; Baraldi, L.; Calderara, S.


2019 - Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain [Conference paper]
Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.
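
As an illustration of scoring image-sentence pairs through a latent alignment between visual and textual chunks (a sketch in the spirit of the description above, not the authors' exact model, assuming PyTorch), each word can be matched to its most similar image region and the pooled similarities used as the image-sentence score:

import torch
import torch.nn.functional as F

def alignment_score(regions, words):
    # regions: (N, dim) image-region embeddings; words: (M, dim) word embeddings.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()       # (M, N) cosine similarities
    best = sim.max(dim=1).values    # best-matching region for each word
    return best.mean()              # scalar image-sentence alignment score

A low score for a sentence that otherwise belongs to the image is also a signal that the sentence is contextual rather than visual.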