

Research fellow at: Centro Interdipartimentale di Ricerca Artificial Intelligence Research and Innovation center (AIRI)



2021 - Automated Machine Learning for Entity Matching Tasks [Conference proceedings paper]
Paganelli, Matteo; Del Buono, Francesco; Pevarello, Marco; Guerra, Francesco; Vincini, Maurizio

The paper studies the application of automated machine learning (AutoML) approaches to the problem of Entity Matching (EM). This would make the existing, highly effective Machine Learning (ML) and Deep Learning based approaches for EM usable by non-expert users, who lack the expertise to train and tune such complex systems. Our experiments show that the direct application of AutoML systems to this scenario does not provide high-quality results. To address this issue, we introduce a new component, the EM adapter, to be pipelined with standard AutoML systems, that preprocesses the EM datasets to make them usable by automated approaches. The experimental evaluation shows that our proposal obtains the same effectiveness as the state-of-the-art EM systems, but does not require any ML expertise to tune it.

2021 - Transforming ML Predictive Pipelines into SQL with MASQ [Conference proceedings paper]
Del Buono, F.; Paganelli, M.; Sottovia, P.; Interlandi, M.; Guerra, F.

Inference of Machine Learning (ML) models, i.e. the process of obtaining predictions from trained models, is an often overlooked problem. Model inference is, however, one of the main contributors to both technical debt in ML applications and infrastructure complexity. MASQ is a framework able to run inference of ML models directly on DBMSs. MASQ not only averts expensive data movements for those predictive scenarios where data resides on a database, but it also naturally exploits all the "Enterprise-grade" features, such as governance, security and auditability, which make DBMSs the cornerstone of many businesses. MASQ compiles trained models and ML pipelines implemented in scikit-learn directly into standard SQL: neither UDFs nor vendor-specific syntax is used, and therefore queries can be readily executed on any DBMS. In this demo, we will showcase MASQ's capabilities through a GUI allowing attendees to: (1) train ML pipelines composed of data featurizers and ML models; (2) compile the trained pipelines into SQL, and deploy them on different DBMSs (MySQL and SQLServer in the demo); and (3) compare the related performance under different configurations (e.g., the original pipeline on the ML framework against the SQL implementations).
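
As a rough illustration of what compiling a trained model into standard SQL can look like, the sketch below renders a linear model as a plain SELECT expression and runs it on an in-memory SQLite database. It is not MASQ's implementation: the function `linear_model_to_sql`, the `houses` table, and the coefficients are hypothetical, and SQLite only stands in for the DBMSs named in the abstract.

```python
import sqlite3


def linear_model_to_sql(feature_names, coefficients, intercept, table):
    """Render a linear model as a standard SQL SELECT (no UDFs).

    Hypothetical helper for illustration only. Column and table names are
    inlined into the query, so they are assumed to come from a trusted source.
    """
    if len(feature_names) != len(coefficients):
        raise ValueError("one coefficient per feature is required")
    # One "(coefficient * column)" term per feature, then the intercept.
    terms = [f"({coef!r} * {name})" for name, coef in zip(feature_names, coefficients)]
    expression = " + ".join(terms + [repr(intercept)])
    return f"SELECT id, {expression} AS prediction FROM {table} ORDER BY id"


def demo():
    """Score two made-up rows inside the database instead of in Python."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, rooms REAL, area REAL)")
    conn.executemany(
        "INSERT INTO houses VALUES (?, ?, ?)",
        [(1, 3.0, 70.0), (2, 5.0, 120.0)],
    )
    # Made-up parameters standing in for a trained model's coefficients.
    query = linear_model_to_sql(["rooms", "area"], [10.0, 2.0], 5.0, "houses")
    rows = conn.execute(query).fetchall()
    conn.close()
    return query, rows
```

Calling `demo()` returns the generated query together with the rows `[(1, 175.0), (2, 295.0)]`, computed entirely by the database engine. A real pipeline with featurizers and non-linear models would need a far richer translation than this single arithmetic expression.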

2021 - Using Landmarks for Explaining Entity Matching Models [Conference proceedings paper]
Baraldi, Andrea; Del Buono, Francesco; Paganelli, Matteo; Guerra, Francesco

The state-of-the-art approaches for performing Entity Matching (EM) rely on machine and deep learning models for inferring pairs of matching / non-matching entities. Although the experimental evaluations demonstrate that these approaches are effective, their adoption in real scenarios is limited by the fact that they are difficult to interpret. Explainable AI systems have been recently proposed for complementing deep learning approaches. Their application to the EM scenario is still new and requires addressing the specifics of this task, which is characterized by particular dataset schemas, describing a pair of entities, and by imbalanced classes. This paper introduces Landmark Explanation, a generic and extensible framework that extends the capabilities of a post-hoc perturbation-based explainer to the EM scenario. Landmark Explanation generates perturbations that take advantage of the particular schemas of the EM datasets, thus generating explanations that are more accurate and more interesting for the users than the ones generated by competing approaches.

2020 - Explaining data with descriptions [Journal article]
Paganelli, Matteo; Sottovia, Paolo; Maccioni, Antonio; Interlandi, Matteo; Guerra, Francesco

2020 - Unsupervised Evaluation of Data Integration Processes [Conference proceedings paper]
Paganelli, M.; Buono, F. D.; Guerra, F.; Ferro, N.

Evaluation of the quality of data integration processes is usually performed via onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all the tuples infeasible and the frequent updates, i.e. changes in the sources and/or new sources, require repeating the evaluation over and over. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, which quantifies how representative a dataset is of another dataset, giving an indication of how good the integration process is and whether deviations are occurring and a manual inspection is needed. We also conducted some preliminary experiments, using shared datasets, that show the effectiveness of the proposed measure in typical data integration scenarios.
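
The abstract does not spell out the measure itself, so the following is only a sketch of one possible word-frequency-based score, not the one proposed in the paper. It assumes cosine similarity between the word-frequency vectors of two datasets as a stand-in; the function names `word_frequencies` and `representativeness` and the record format (one string per tuple) are hypothetical.

```python
import math
import re
from collections import Counter


def word_frequencies(records):
    """Count lower-cased word occurrences across a list of text records."""
    counts = Counter()
    for record in records:
        counts.update(re.findall(r"\w+", record.lower()))
    return counts


def representativeness(source_records, integrated_records):
    """Cosine similarity of word-frequency vectors, in the range [0, 1].

    Assumption: cosine similarity is used here purely as an illustrative
    stand-in for an unsupervised, frequency-based measure.
    """
    a = word_frequencies(source_records)
    b = word_frequencies(integrated_records)
    # Only words present in both datasets contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dataset is treated as not representative
    return dot / (norm_a * norm_b)
```

Under this assumption, a score close to 1 suggests that the integrated dataset preserves the vocabulary distribution of the source, while a drop between successive runs could be used as a trigger for manual inspection.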

2019 - Big Data Integration of Heterogeneous Data Sources: The Re-Search Alps Case Study [Conference proceedings paper]
Guerra, Francesco; Sottovia, Paolo; Paganelli, Matteo; Vincini, Maurizio

The application of big data integration techniques in real scenarios needs to address practical issues related to the scalability of the process and the heterogeneity of data sources. In this paper, we describe the pipeline that has been developed in the context of the Re-search Alps project, a project funded by the EU Commission through the INEA Agency in the CEF Telecom framework, that aims at creating an open dataset describing research centers located in the Alpine area.

2019 - Finding Synonymous Attributes in Evolving Wikipedia Infoboxes [Conference proceedings paper]
Sottovia, Paolo; Paganelli, Matteo; Guerra, Francesco; Velegrakis, Yannis

Wikipedia Infoboxes are semi-structured data structures organized in an attribute-value fashion. Policies establish, for each type of entity represented in Wikipedia, the attribute names that the Infobox should contain in the form of a template. However, these requirements change over time and users often choose not to strictly obey them. As a result, it is hard to treat the history of the Wikipedia pages in an integrated way, making it difficult to analyze the temporal evolution of Wikipedia entities through their Infoboxes and impossible to perform direct comparisons of entities of the same type. To address this challenge, we propose an approach to deal with the misalignment of the attribute names and identify clusters of synonymous Infobox attributes. Elements in the same cluster are considered as a temporal evolution of the same attribute. To identify the clusters we use two different distance metrics. The first is the co-occurrence degree, which is treated as a negative distance, and the second is the co-occurrence of similar values in the attributes, which is treated as positive evidence of synonymy. We formalize the problem as a correlation clustering problem over a weighted graph constructed with attributes as nodes and positive and negative evidence as edges. We solve it with a linear programming model that shows a good approximation. Our experiments over a collection of Infoboxes of the last 13 years show the potential of our approach.

2019 - Parallelizing computations of full disjunctions [Journal article]
Paganelli, Matteo; Beneventano, Domenico; Guerra, Francesco; Sottovia, Paolo

In relational databases, the full disjunction operator is an associative extension of the full outerjoin to an arbitrary number of relations. Its goal is to maximize the information we can extract from a database by connecting all tables through all join paths. The use of full disjunctions has been envisaged in several scenarios, such as data integration and knowledge extraction. One of the main limitations to its adoption in real business scenarios is the long time its computation requires. This paper overcomes this limitation by introducing a novel approach, parafd, based on parallel computing techniques, for implementing the full disjunction operator in an exact and an approximate version. Our proposal has been compared with state-of-the-art algorithms, which have also been re-implemented to run in parallel. The experiments show that its time performance exceeds that of existing approaches. Finally, we have experimented with treating the full disjunction as a collection of documents indexed by a textual search engine. In this way, we provide a simple technique for performing keyword search over relational databases. The results obtained against a benchmark show high precision and recall levels, even compared with the existing proposals.

2019 - TuneR: Fine Tuning of Rule-based Entity Matchers [Conference proceedings paper]
Paganelli, Matteo; Sottovia, Paolo; Guerra, Francesco; Velegrakis, Yannis

A rule-based entity matching task requires the definition of an effective set of rules, which is a time-consuming and error-prone process. The typical approach adopted for its resolution is a trial and error method, where the rules are incrementally added and modified until satisfactory results are obtained. This approach requires significant human intervention, since a typical dataset needs the definition of a large number of rules and possible interconnections that cannot be manually managed. In this paper, we propose TuneR, a software library supporting developers (i.e., coders, scientists, and domain experts) in tuning sets of matching rules. It aims to reduce human intervention by offering a tool for the optimization of rule sets based on user-defined criteria (such as effectiveness, interpretability, etc.). Our goal is to integrate the framework in the Magellan ecosystem, thus completing the functionalities required by the developers for performing Entity Matching tasks.

2019 - Understanding Data in the Blink of an Eye [Conference proceedings paper]
Paganelli, Matteo; Sottovia, Paolo; Maccioni, Antonio; Interlandi, Matteo; Guerra, Francesco

Many data analysis and knowledge mining tasks require a basic understanding of the content of a dataset prior to any data access. In this demo, we showcase how data descriptions (sets of compact, readable and insightful formulas of boolean predicates) can be used to guide users in understanding datasets. Finding the best description for a dataset is, unfortunately, both computationally hard and task-specific. This demo shows not only that we can generate descriptions at interactive speed, but also that diverse user needs, from anomaly detection to data exploration, can be accommodated through a user-driven process exploiting dynamic programming in concert with a set of heuristics.

2017 - The RE-SEARCH ALPS (research laboratories in the alpine area) project [Conference proceedings paper]
Guerra, Francesco; Russo, Margherita; Fontana, Marco; Paganelli, Matteo; Bancilhon, Francois; Frisch, Christian; Petit, Loic; Giorgi, Anna; Zilio, Emanuela

The paper describes the RE-SEARCH ALPS project, which aims to gather, consolidate, harmonize and make available to different targets (public and private bodies working at local, regional and national level) data about laboratories and research and innovation centers which are active in particular in the regions of the seven countries which constitute the Alpine Area (France, Italy, Switzerland, Austria, Germany, Liechtenstein and Slovenia). The project is complemented by a search engine which allows the users to directly query the dataset and to obtain georeferenced data as a result. The data will be properly visualized thanks to a visualizer developed in the project. From a research perspective, the project has to address hot and challenging Big Data issues, such as big data integration (to join data sources), entity recognition and linkage in large amounts of data (to discover the same institution represented in different sources), and data cleaning and reconciliation (to address issues related to different representations of the same real object). The project was submitted to a call for the creation of Open Datasets promoted by the European Innovation and Networks Executive Agency through the Connecting Europe Facility (CEF) funding instrument. The project has been recently approved (AGREEMENT No INEA/CEF/ICT/A2016/1296967): it lasts two years and will start in July 2017.