2024 - Determining the Largest Overlap between Tables [Journal Article]
Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix

Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
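
The problem can be made concrete with a tiny brute-force sketch (for illustration only; this is not the Sloth algorithm, which is designed to be efficient): enumerate column mappings between the two tables, count the rows that agree under each mapping, and keep the mapping that maximizes the overlap area (matched rows × mapped columns).

```python
from collections import Counter
from itertools import combinations, permutations

def largest_overlap(table_a, table_b):
    """Find the largest common subtable of two tables (lists of row
    tuples), allowing rows and columns to be permuted.
    Brute force, exponential in the number of columns: toy sizes only."""
    best_area, best_map = 0, None
    for k in range(min(len(table_a[0]), len(table_b[0])), 0, -1):
        for sub_a in combinations(range(len(table_a[0])), k):
            for sub_b in permutations(range(len(table_b[0])), k):
                # project both tables on the mapped columns and count
                # common rows as a multiset intersection
                proj_a = Counter(tuple(r[c] for c in sub_a) for r in table_a)
                proj_b = Counter(tuple(r[c] for c in sub_b) for r in table_b)
                area = sum((proj_a & proj_b).values()) * k
                if area > best_area:
                    best_area, best_map = area, dict(zip(sub_a, sub_b))
    return best_area, best_map

a = [("alice", 30, "it"), ("bob", 25, "de")]
b = [(30, "alice"), (27, "carol")]
print(largest_overlap(a, b))  # (2, {0: 1, 1: 0}): one row over two columns
```

Note how the overlap between `a` and `b` only emerges after swapping the column order, which is exactly why contiguous-block comparisons miss it.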

2023 - A big data platform exploiting auditable tokenization to promote good practices inside local energy communities [Journal Article]
Gagliardelli, Luca; Zecchini, Luca; Ferretti, Luca; Beneventano, Domenico; Simonini, Giovanni; Bergamaschi, Sonia; Orsini, Mirko; Magnotta, Luca; Mescoli, Emma; Livaldi, Andrea; Gessa, Nicola; De Sabbata, Piero; D’Agosta, Gianluca; Paolucci, Fabrizio; Moretti, Fabio

The Energy Community Platform (ECP) is a modular system conceived to promote a conscious use of energy by the users inside local energy communities. It is composed of two integrated subsystems: the Energy Community Data Platform (ECDP), a middleware platform designed to support the collection and the analysis of big data about the energy consumption inside local energy communities, and the Energy Community Tokenization Platform (ECTP), which focuses on tokenizing processed source data to enable incentives through smart contracts hosted on a decentralized infrastructure possibly governed by multiple authorities. We illustrate the overall design of our system, conceived considering some real-world projects (dealing with different types of local energy community, different amounts and nature of incoming data, and different types of users), analyzing in detail the key aspects of the two subsystems. In particular, the ECDP acquires data of a different nature in a heterogeneous format from multiple sources and supports a data integration workflow and a data lake workflow, designed for different uses of the data. We motivate our technological choices and present the alternatives taken into account, both in terms of software and of architectural design. On the other hand, the ECTP operates a tokenization process via smart contracts to promote good behaviors of users within the local energy community. The peculiarity of this platform is to allow external parties to audit the correct behavior of the whole tokenization process while protecting the confidentiality of the data and the performance of the platform. The main strengths of the presented system are flexibility and scalability (guaranteed by its modular architecture), which allow its applicability to any type of local energy community.

2023 - BrewER: Entity Resolution On-Demand [Journal Article]
Zecchini, L.; Simonini, G.; Bergamaschi, S.; Naumann, F.

The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in datasets and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet, the traditional approach to ER requires cleaning the entire data before being able to run consistent queries on it; hence, users struggle to tackle common scenarios with limited time or resources (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task). We previously introduced BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.

2023 - Bridging the Gap between Buyers and Sellers in Data Marketplaces with Personalized Datasets [Conference Paper]
Firmani, Donatella; Mathew, Jerin George; Santoro, Donatello; Simonini, Giovanni; Zecchini, Luca

Sharing, discovering, and integrating data are crucial tasks that pose many challenges and open research directions. Data owners need to know what data consumers want, and data consumers need to find datasets that are satisfactory for their tasks. Several data market platforms, or data marketplaces (DMs), have been used so far to facilitate data transactions between data owners and customers. However, current DMs are mostly shop windows, where customers have to rely on metadata that owners manually curate to discover useful datasets, and there is no automated mechanism for owners to determine if their data could be merged with other datasets to satisfy customers’ desiderata. The availability of novel artificial intelligence techniques for data management has sparked a renewed interest in proposing new DMs that stray from this conventional paradigm and overcome its limitations. This paper envisions a conceptual framework called DataStreet where DMs can create personalized datasets by combining available datasets and presenting summarized statistics to help users make informed decisions. In our framework, owners share some of their data with a trusted DM, and customers provide a dataset template to fuel content-based (rather than metadata-based) search queries. Upon each query, the DM creates a preview of the personalized dataset through a flexible use of dataset discovery, integration, and value measurement, while ensuring owners’ fair treatment and preserving privacy. The previewed datasets might not be pre-defined in the DM and are materialized only upon a successful transaction.

2023 - Entity Resolution On-Demand for Querying Dirty Datasets [Conference Paper]
Simonini, Giovanni; Zecchini, Luca; Naumann, Felix; Bergamaschi, Sonia

Entity Resolution (ER) is the process of identifying and merging records that refer to the same real-world entity. ER is usually applied as an expensive cleaning step on the entire data before consuming it, yet the relevance of cleaned entities ultimately depends on the user’s specific application, which may only require a small portion of the entities. We introduce BrewER, a framework designed to evaluate SQL SP queries on unclean data while progressively providing results as if they were obtained from cleaned data. BrewER aims at cleaning a single entity at a time, adhering to an ORDER BY predicate, thus it inherently supports top-k queries and stop-and-resume execution. This approach can save a significant amount of resources for various applications. BrewER has been implemented as an open-source Python library and can be seamlessly employed with existing ER tools and algorithms. We thoroughly demonstrated its efficiency through its evaluation on four real-world datasets.

2023 - Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests [Journal Article]
De Angelis, Andrea; Mazzei, Maurizio; Piai, Federico; Merialdo, Paolo; Firmani, Donatella; Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Chu, Xu; Li, Peng; Wu, Renzhi

We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event for students to engage in solving exciting data management problems. During this period, we had the opportunity to introduce participants to the entity resolution task, which is of paramount importance in the data integration community. We aim to share the executive decisions made by the people co-authoring this report, along with the lessons learned.

2022 - Big Data Integration & Data-Centric AI for eHealth [Conference Paper]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Simonini, Giovanni; Zecchini, Luca

Big data integration, i.e., the integration of large amounts of data coming from multiple sources, represents one of the main challenges for the use of techniques and tools based on artificial intelligence in the medical field (eHealth). In this context, it is also of primary importance to guarantee the quality of the data on which such tools and techniques operate (Data-Centric AI), as they now play a central role in the sector. The research activities of the Database Group (DBGroup) of the "Enzo Ferrari" Department of Engineering of the University of Modena and Reggio Emilia move in this direction. We therefore present the main research projects of the DBGroup in the field of eHealth, which take place within collaborations in several application areas.

2022 - Big Data Integration for Data-Centric AI [Conference Abstract]
Bergamaschi, Sonia; Beneventano, Domenico; Simonini, Giovanni; Gagliardelli, Luca; Aslam, Adeel; De Sabbata, Giulio; Zecchini, Luca

Big data integration represents one of the main challenges for the use of techniques and tools based on Artificial Intelligence (AI) in several crucial areas: eHealth, energy management, enterprise data, etc. In this context, Data-Centric AI plays a primary role in guaranteeing the quality of the data on which these tools and techniques operate. The activities of the Database Research Group (DBGroup) of the “Enzo Ferrari” Engineering Department of the University of Modena and Reggio Emilia are thus moving in this direction. We therefore present the main research projects of the DBGroup, which are part of collaborations in various application sectors.

2022 - ECDP: A Big Data Platform for the Smart Monitoring of Local Energy Communities [Conference Paper]
Gagliardelli, Luca; Zecchini, Luca; Beneventano, Domenico; Simonini, Giovanni; Bergamaschi, Sonia; Orsini, Mirko; Magnotta, Luca; Mescoli, Emma; Livaldi, Andrea; Gessa, Nicola; De Sabbata, Piero; D’Agosta, Gianluca; Paolucci, Fabrizio; Moretti, Fabio

2022 - Entity Resolution On-Demand [Journal Article]
Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner---a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
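
The emission mechanism described above can be sketched with a toy priority queue (a simplified illustration under assumed inputs, not BrewER's actual implementation): records are seeded into a heap by the ORDER BY attribute, and a candidate block is cleaned only when one of its records reaches the head, so resolved entities surface in query order.

```python
import heapq

def query_on_demand(records, blocks, key):
    """On-demand ER sketch: emit resolved entities in ORDER BY `key` ASC,
    cleaning one candidate block at a time. `blocks` maps each record id
    to the ids of its candidate duplicates (output of a blocking step)."""
    by_id = {r["id"]: r for r in records}
    heap = [(r[key], r["id"]) for r in records]  # seed with dirty values
    heapq.heapify(heap)
    resolved = set()
    while heap:
        _, rid = heapq.heappop(heap)
        if rid in resolved:
            continue  # already merged into an emitted entity
        group = [by_id[g] for g in blocks[rid]]
        resolved.update(blocks[rid])
        # "merge" step: here the entity keeps the minimum key, so the
        # first record of a block to surface yields the final value
        yield {"ids": sorted(blocks[rid]), key: min(g[key] for g in group)}

recs = [{"id": "a1", "price": 10}, {"id": "a2", "price": 12},
        {"id": "b1", "price": 11}]
blocks = {"a1": {"a1", "a2"}, "a2": {"a1", "a2"}, "b1": {"b1"}}
for entity in query_on_demand(recs, blocks, "price"):
    print(entity)  # cheapest entity first; top-k = stop after k entities
```

Because the generator yields one entity at a time, stopping after k iterations gives a top-k result without ever cleaning the remaining blocks, which is where the resource savings come from.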

2022 - Progressive Entity Resolution with Node Embeddings [Conference Paper]
Simonini, Giovanni; Gagliardelli, Luca; Rinaldi, Michele; Zecchini, Luca; De Sabbata, Giulio; Aslam, Adeel; Beneventano, Domenico; Bergamaschi, Sonia

Entity Resolution (ER) is the task of finding records that refer to the same real-world entity, which are called matches. ER is a fundamental pre-processing step when dealing with dirty and/or heterogeneous datasets; however, it can be very time-consuming when employing complex machine learning models to detect matches, as state-of-the-art ER methods do. Thus, when time is a critical component and having a partial ER result is better than having no result at all, progressive ER methods are employed to try to maximize the number of detected matches as a function of time. In this paper, we study how to perform progressive ER by exploiting graph embeddings. The basic idea is to represent candidate matches in a graph: each node is a record and each edge is a possible comparison to check—we build that on top of a well-known, established graph-based ER framework. We experimentally show that our method performs better than existing state-of-the-art progressive ER methods on real-world benchmark datasets.
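
As a simplified illustration of the idea (with hypothetical embeddings, not the paper's actual method): given precomputed node embeddings for the records, the candidate comparisons (the graph edges) can be ranked by embedding similarity so that likely matches are checked first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def progressive_order(edges, emb):
    """Rank candidate comparisons (edges between record nodes) by the
    cosine similarity of their node embeddings, most promising first."""
    return sorted(edges, key=lambda e: cosine(emb[e[0]], emb[e[1]]), reverse=True)

emb = {"r1": [1.0, 0.0], "r2": [0.9, 0.1], "r3": [0.0, 1.0]}
edges = [("r1", "r3"), ("r1", "r2"), ("r2", "r3")]
print(progressive_order(edges, emb))  # ("r1", "r2") surfaces first
```

Feeding the expensive matcher edges in this order front-loads the likely matches, which is the essence of maximizing detected matches as a function of time.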

2022 - Task-Driven Big Data Integration [Conference Paper]
Zecchini, Luca

Data integration aims at combining data acquired from different autonomous sources to provide the user with a unified view of this data. One of the main challenges in data integration processes is entity resolution, whose goal is to detect the different representations of the same real-world entity across the sources, in order to produce a unique and consistent representation for it. The advent of big data has challenged traditional data integration paradigms, making the offline batch approach to entity resolution no longer suitable for several scenarios (e.g., when performing data exploration or dealing with datasets that change with a high frequency). Therefore, it becomes of primary importance to produce new solutions capable of operating effectively in such situations. In this paper, I present some contributions made during the first half of my PhD program, mainly focusing on the design of a framework to perform entity resolution in an on-demand fashion, building on the results achieved by the progressive and query-driven approaches to this task. Moreover, I also briefly describe two projects in which I took part as a member of my research group, touching on some real-world applications of big data integration techniques, to conclude with some ideas on the future directions of my research.

2021 - Progressive Query-Driven Entity Resolution [Conference Paper]
Zecchini, Luca

Entity Resolution (ER) aims to detect in a dirty dataset the records that refer to the same real-world entity, playing a fundamental role in data cleaning and integration tasks. Often, a data scientist is only interested in a portion of the dataset (e.g., in data exploration); this interest can be expressed through a query. The traditional batch approach is far from optimal, since it requires performing ER on the whole dataset before executing a query on its cleaned version, carrying out a huge number of useless comparisons. This wastes time, resources, and money. Existing solutions to this problem follow either a query-driven approach (perform ER only on the useful data) or a progressive one (the entities in the result are emitted as soon as they are resolved), but these two aspects have never been reconciled. This paper introduces the BrewER framework, which makes it possible to execute clean queries on dirty datasets in a query-driven and progressive way, thanks to a preliminary filtering step and an iteratively managed sorted list that defines the emission priority. Early results obtained by the first BrewER prototype on real-world datasets from different domains confirm the benefits of this combined solution, paving the way for a new and more comprehensive approach to ER.

2021 - The Case for Multi-task Active Learning Entity Resolution [Conference Paper]
Simonini, Giovanni; Saccani, Henrique; Gagliardelli, Luca; Zecchini, Luca; Beneventano, Domenico; Bergamaschi, Sonia

2020 - Entity resolution on camera records without machine learning [Conference Paper]
Zecchini, L.; Simonini, G.; Bergamaschi, S.

This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not reaching a 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans and reaches a 0.99 F-score. In our experience, our approach was not more expensive, in terms of human effort, than the labeling of match/non-match pairs required by ML-based methods.
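
A minimal sketch of this rule-based idea (the brand list and regular expression below are hypothetical examples, not the contest solution): extract a brand from a curated list and a model code with a regular expression, then match records whose (brand, model) signatures coincide.

```python
import re

# hypothetical curated brand list and model-code pattern
BRANDS = ["canon", "nikon", "sony", "olympus"]
MODEL_RE = re.compile(r"\b([a-z]{0,3}\d{2,4}[a-z]{0,2})\b")

def signature(title):
    """Extract a (brand, model) signature from a product title."""
    t = title.lower()
    brand = next((b for b in BRANDS if b in t), None)
    model = MODEL_RE.search(t)
    return (brand, model.group(1) if model else None)

def same_entity(title_a, title_b):
    """Two records match if both signatures are complete and identical."""
    sig_a, sig_b = signature(title_a), signature(title_b)
    return None not in sig_a and sig_a == sig_b

print(same_entity("Canon EOS 600D DSLR Kit", "CANON 600d body only"))  # True
```

The labeling effort shifts from annotating match/non-match pairs to curating the brand list and model patterns, which in this domain proved comparably cheap.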