
NICHOLAS MORATELLI

PhD Student
Department of Engineering "Enzo Ferrari"




Publications

2024 - Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization [Journal article]
Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


2024 - Fluent and Accurate Image Captioning with a Self-Trained Reward Model [Conference paper]
Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


2024 - Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization [Conference paper]
Moratelli, Nicholas; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


2024 - The Revolution of Multimodal Large Language Models: A Survey [Conference paper]
Caffagni, Davide; Cocchi, Federico; Barsellotti, Luca; Moratelli, Nicholas; Sarto, Sara; Baraldi, Lorenzo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita


2024 - Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs [Conference paper]
Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


2023 - Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates [Journal article]
Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Abstract

Research related to the fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a Transformer-based captioning model was designed that integrates an external textual memory accessed through k-nearest-neighbor (kNN) searches. Architecturally, the proposed model reads and retrieves items from the external memory through cross-attention operations, and tunes the flow of information coming from the memory via a novel fully attentive gate. Experiments were carried out on the fashion captioning dataset (FACAD), which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and architectural choices against carefully designed baselines and state-of-the-art methods. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
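The retrieval-and-gating mechanism described in the abstract can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the paper's implementation: the function names (`knn_retrieve`, `gated_cross_attention`) and the gate parameterization `W_g` are assumptions introduced here for clarity, and the "fully attentive gate" is approximated by a simple sigmoid gate over the concatenated decoder state and retrieved context.

```python
import numpy as np

def knn_retrieve(query, memory, k):
    """Retrieve the k most similar memory entries by cosine similarity.

    `memory` stands in for the external textual memory (one embedding
    per row); `query` is a single decoder-side embedding.
    """
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every entry
    idx = np.argsort(-sims)[:k]      # indices of the k nearest entries
    return memory[idx]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(h, retrieved, W_g):
    """Cross-attend from decoder state h over retrieved memory items,
    then gate how much of the memory context flows into the output.

    W_g is a hypothetical gate projection of shape (d, 2d).
    """
    d = h.shape[0]
    scores = retrieved @ h / np.sqrt(d)   # scaled dot-product scores
    attn = softmax(scores)                # attention over memory items
    context = attn @ retrieved            # weighted memory context
    # sigmoid gate modulating the flow of information from the memory
    g = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([h, context]))))
    return h + g * context
```

In the actual model this would sit inside a Transformer decoder layer, with learned projections and multiple heads; the sketch only shows the data flow: retrieve with kNN, attend with cross-attention, and gate the retrieved context before it is fused with the decoder state.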