
SONIA BERGAMASCHI

SENIOR PROFESSOR
Dipartimento di Ingegneria "Enzo Ferrari"




Publications

2024 - A Novel Methodology for Topic Identification in Hadith [Abstract in Conference Proceedings]
Aftar, Sania; Gagliardelli, Luca; El Ganadi, Amina; Ruozzi, Federico; Bergamaschi, Sonia


2024 - Determining the Largest Overlap between Tables [Journal Article]
Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix

Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
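
The need to permute both rows and columns can be made concrete with a small sketch. The brute-force routine below (purely illustrative; it is not Sloth's algorithm, whose point is precisely to avoid this exponential enumeration) tries every injective column mapping between two toy tables and counts matching rows, returning the largest overlap area:

```python
from itertools import combinations, permutations
from collections import Counter

def largest_overlap(table_a, table_b):
    """Brute-force largest common subtable (area = rows x cols) between two
    tables given as dicts {column_name: [values]}. Rows and columns may be
    permuted freely, so we try every injective column mapping and count the
    multiset intersection of the resulting row tuples. Exponential cost:
    toy-sized tables only."""
    cols_a, cols_b = list(table_a), list(table_b)
    best = (0, [])  # (area, column mapping)
    for k in range(1, min(len(cols_a), len(cols_b)) + 1):
        for sub_a in combinations(cols_a, k):
            for sub_b in permutations(cols_b, k):
                rows_a = Counter(zip(*(table_a[c] for c in sub_a)))
                rows_b = Counter(zip(*(table_b[c] for c in sub_b)))
                common = sum((rows_a & rows_b).values())  # order-free row match
                if common * k > best[0]:
                    best = (common * k, list(zip(sub_a, sub_b)))
    return best
```

Even on two tiny tables with shuffled rows and renamed, reordered columns, only the column mapping reveals the overlap.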


2024 - GSM: A Generalized Approach to Supervised Meta-blocking for Scalable Entity Resolution [Journal Article]
Gagliardelli, Luca; Papadakis, George; Simonini, Giovanni; Bergamaschi, Sonia; Palpanas, Themis


2023 - A Big Data Platform for the Management of Local Energy Communities Data [Paper in Conference Proceedings]
Bergamaschi, Sonia; Gagliardelli, Luca


2023 - A big data platform exploiting auditable tokenization to promote good practices inside local energy communities [Journal Article]
Gagliardelli, Luca; Zecchini, Luca; Ferretti, Luca; Beneventano, Domenico; Simonini, Giovanni; Bergamaschi, Sonia; Orsini, Mirko; Magnotta, Luca; Mescoli, Emma; Livaldi, Andrea; Gessa, Nicola; De Sabbata, Piero; D’Agosta, Gianluca; Paolucci, Fabrizio; Moretti, Fabio

The Energy Community Platform (ECP) is a modular system conceived to promote a conscious use of energy by the users inside local energy communities. It is composed of two integrated subsystems: the Energy Community Data Platform (ECDP), a middleware platform designed to support the collection and the analysis of big data about the energy consumption inside local energy communities, and the Energy Community Tokenization Platform (ECTP), which focuses on tokenizing processed source data to enable incentives through smart contracts hosted on a decentralized infrastructure possibly governed by multiple authorities. We illustrate the overall design of our system, conceived considering some real-world projects (dealing with different types of local energy community, different amounts and nature of incoming data, and different types of users), analyzing in detail the key aspects of the two subsystems. In particular, the ECDP acquires data of a different nature in a heterogeneous format from multiple sources and supports a data integration workflow and a data lake workflow, designed for different uses of the data. We motivate our technological choices and present the alternatives taken into account, both in terms of software and of architectural design. On the other hand, the ECTP operates a tokenization process via smart contracts to promote good behaviors of users within the local energy community. The peculiarity of this platform is to allow external parties to audit the correct behavior of the whole tokenization process while protecting the confidentiality of the data and the performance of the platform. The main strengths of the presented system are flexibility and scalability (guaranteed by its modular architecture), which allow its applicability to any type of local energy community.


2023 - BrewER: Entity Resolution On-Demand [Journal Article]
Zecchini, L.; Simonini, G.; Bergamaschi, S.; Naumann, F.

The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in datasets and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet, the traditional approach to ER requires cleaning the entire data before being able to run consistent queries on it; hence, users struggle to tackle common scenarios with limited time or resources (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task). We previously introduced BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.


2023 - Bridging Islamic Knowledge and AI: Inquiring ChatGPT on Possible Categorizations for an Islamic Digital Library (full paper) [Paper in Conference Proceedings]
El Ganadi, Amina; Vigliermo, Riccardo Amerigo; Sala, Luca; Vanzini, Matteo; Ruozzi, Federico; Bergamaschi, Sonia


2023 - Entity Resolution On-Demand for Querying Dirty Datasets [Paper in Conference Proceedings]
Simonini, Giovanni; Zecchini, Luca; Naumann, Felix; Bergamaschi, Sonia

Entity Resolution (ER) is the process of identifying and merging records that refer to the same real-world entity. ER is usually applied as an expensive cleaning step on the entire data before consuming it, yet the relevance of cleaned entities ultimately depends on the user’s specific application, which may only require a small portion of the entities. We introduce BrewER, a framework designed to evaluate SQL SP queries on unclean data while progressively providing results as if they were obtained from cleaned data. BrewER aims at cleaning a single entity at a time, adhering to an ORDER BY predicate; thus, it inherently supports top-k queries and stop-and-resume execution. This approach can save a significant amount of resources for various applications. BrewER has been implemented as an open-source Python library and can be seamlessly employed with existing ER tools and algorithms. We thoroughly demonstrated its efficiency through its evaluation on four real-world datasets.


2023 - Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests [Journal Article]
De Angelis, Andrea; Mazzei, Maurizio; Piai, Federico; Merialdo, Paolo; Firmani, Donatella; Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Chu, Xu; Li, Peng; Wu, Renzhi

We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event for students to engage in solving exciting data management problems. During this period we had the opportunity of introducing participants to the entity resolution task, which is of paramount importance in the data integration community. We aim at sharing the executive decisions, made by the people co-authoring this report, and the lessons learned.


2023 - HKS: Efficient Data Partitioning for Stateful Streaming [Paper in Conference Proceedings]
Aslam, Adeel; Simonini, Giovanni; Gagliardelli, Luca; Mozzillo, Angelo; Bergamaschi, Sonia


2023 - Knowledge extraction, management and long-term preservation of non-Latin cultural heritages - Digital Maktaba project presentation [Paper in Conference Proceedings]
Martoglia, Riccardo; Bergamaschi, Sonia; Ruozzi, Federico; Vanzini, Matteo; Sala, Luca; Vigliermo, Riccardo Amerigo


2023 - Privacy-Preserving Data Integration for Digital Justice [Paper in Conference Proceedings]
Trigiante, L.; Beneventano, D.; Bergamaschi, S.

The digital transformation of the Justice domain and the resulting availability of vast amounts of data describing people and their criminal behaviors offer significant promise to feed multiple research areas and enhance the criminal justice system. Achieving this vision requires the integration of different sources to create an accurate and unified representation that enables detailed and extensive data analysis. However, the collection and processing of sensitive legal-related data about individuals imposes consideration of privacy legislation and confidentiality implications. This paper presents the lessons learned from the design and development of a Privacy-Preserving Data Integration (PPDI) architecture and process to address the challenges and opportunities of integrating personal data belonging to criminal and court sources within the Italian Justice Domain in compliance with GDPR.


2023 - Progetto di Basi di Dati Relazionali [Monograph/Scientific Treatise]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Guerra, Francesco; Vincini, Maurizio

The aim of this volume is to provide the reader with the fundamental notions for designing and implementing relational database applications. Regarding design, it covers the conceptual and logical design phases and presents the Entity-Relationship and Relational data models, which are the basic tools for conceptual and logical design, respectively. The student is also introduced to the theory of normalization of relational databases. Regarding implementation, it presents elements and examples of SQL, the standard language for Relational Database Management Systems (RDBMS). Ample space is devoted to worked exercises on the topics covered.


2023 - [Vision Paper] Privacy-Preserving Data Integration [Paper in Conference Proceedings]
Trigiante, Lisa; Beneventano, Domenico; Bergamaschi, Sonia

The digital transformation of different processes and the resulting availability of vast amounts of data describing people and their behaviors offer significant promise to advance multiple research areas and enhance both the public and private sectors. Exploiting the full potential of this vision requires a unified representation of different autonomous data sources to facilitate detailed data analysis capacity. Collecting and processing sensitive data about individuals leads to consideration of privacy requirements and confidentiality concerns. This vision paper provides a concise overview of the research field concerning Privacy-Preserving Data Integration (PPDI), the associated challenges, opportunities, and unexplored aspects, with the primary aim of designing a novel and comprehensive PPDI framework based on a Trusted Third-Party microservices architecture.


2022 - Big Data Integration & Data-Centric AI for eHealth [Paper in Conference Proceedings]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Simonini, Giovanni; Zecchini, Luca

Big data integration, i.e., the integration of large amounts of data coming from multiple sources, represents one of the main challenges for the adoption of techniques and tools based on Artificial Intelligence in the medical field (eHealth). In this context, it is also of primary importance to guarantee the quality of the data on which such tools and techniques operate (Data-Centric AI), as they now play a central role in the sector. The research activities of the Database Group (DBGroup) of the "Enzo Ferrari" Engineering Department of the University of Modena and Reggio Emilia move in this direction. We therefore present the main research projects of the DBGroup in the eHealth field, which are part of collaborations in several application areas.


2022 - Big Data Integration for Data-Centric AI [Abstract in Conference Proceedings]
Bergamaschi, Sonia; Beneventano, Domenico; Simonini, Giovanni; Gagliardelli, Luca; Aslam, Adeel; De Sabbata, Giulio; Zecchini, Luca

Big data integration represents one of the main challenges for the use of techniques and tools based on Artificial Intelligence (AI) in several crucial areas: eHealth, energy management, enterprise data, etc. In this context, Data-Centric AI plays a primary role in guaranteeing the quality of the data on which these tools and techniques operate. Thus, the activities of the Database Research Group (DBGroup) of the “Enzo Ferrari” Engineering Department of the University of Modena and Reggio Emilia are moving in this direction. Therefore, we present the main research projects of the DBGroup, which are part of collaborations in various application sectors.


2022 - ECDP: A Big Data Platform for the Smart Monitoring of Local Energy Communities [Paper in Conference Proceedings]
Gagliardelli, Luca; Zecchini, Luca; Beneventano, Domenico; Simonini, Giovanni; Bergamaschi, Sonia; Orsini, Mirko; Magnotta, Luca; Mescoli, Emma; Livaldi, Andrea; Gessa, Nicola; De Sabbata, Piero; D’Agosta, Gianluca; Paolucci, Fabrizio; Moretti, Fabio


2022 - Entity Resolution On-Demand [Journal Article]
Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner---a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
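
The core idea, cleaning one entity at a time in the order dictated by an ORDER BY attribute, can be sketched as a priority-queue loop. This is a minimal illustration, not the actual BrewER implementation; it assumes an ascending key whose merged value is the minimum over the entity's records (e.g., ORDER BY price ASC with a MIN aggregation), so the first record popped for an entity already determines the entity's final position:

```python
import heapq

def er_on_demand(records, same_entity, merge, key):
    """Emit cleaned entities progressively, ordered by `key` (ascending),
    cleaning one entity at a time. `same_entity(r, s)` decides whether two
    records match; `merge(group)` consolidates a group of duplicates into a
    single record. Sketch of the on-demand idea only."""
    heap = [(key(r), i) for i, r in enumerate(records)]
    heapq.heapify(heap)
    resolved = set()
    while heap:
        _, i = heapq.heappop(heap)
        if i in resolved:
            continue  # already consumed as a duplicate of an earlier entity
        # clean only the entity seeded by record i
        group = [j for j in range(len(records))
                 if j == i or (j not in resolved
                               and same_entity(records[i], records[j]))]
        resolved.update(group)
        yield merge([records[j] for j in group])
```

Because this is a generator, calling `next()` yields the top result without cleaning the rest of the dataset, which is what makes top-k and stop-and-resume execution essentially free.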


2022 - Generalized Supervised Meta-blocking [Journal Article]
Gagliardelli, Luca; Papadakis, George; Simonini, Giovanni; Bergamaschi, Sonia; Palpanas, Themis

Entity Resolution is a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced by Meta-blocking techniques that leverage the entity co-occurrence patterns inside blocks: first, pairs of candidate entities are weighted in proportion to their matching likelihood, and then, pruning discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used by any pruning algorithm. For higher effectiveness, new weighting schemes are examined as features. Through extensive experiments, we identify the best pruning algorithms, their optimal sets of features, as well as the minimum possible size of the training set.
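
The two-step scheme, weighting candidate pairs from their block co-occurrence patterns and then pruning by classifier score, can be sketched as follows. The two features and the logistic weights below are illustrative stand-ins for the learned classifiers and weighting schemes examined in the paper:

```python
import math

def pair_features(blocks, e1, e2):
    """Feature vector for a candidate pair from its co-occurrence pattern:
    (number of shared blocks, sum of 1/|block| over shared blocks)."""
    shared = [b for b in blocks if e1 in b and e2 in b]
    return [len(shared), sum(1.0 / len(b) for b in shared)]

def score(features, weights, bias):
    # Probabilistic (logistic) classifier; in Generalized Supervised
    # Meta-blocking the weights are learned from a small training set.
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def prune(blocks, candidates, weights, bias, threshold):
    """Keep only the candidate pairs whose matching score clears the
    threshold; any pruning algorithm could consume these scores instead."""
    return [p for p in candidates
            if score(pair_features(blocks, *p), weights, bias) >= threshold]
```

Pairs that co-occur in many small (hence discriminative) blocks get high scores and survive; pairs sharing only large, uninformative blocks are discarded.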


2022 - Novel Perspectives for the Management of Multilingual and Multialphabetic Heritages through Automatic Knowledge Extraction: The DigitalMaktaba Approach [Journal Article]
Bergamaschi, Sonia; De Nardis, Stefania; Martoglia, Riccardo; Ruozzi, Federico; Sala, Luca; Vanzini, Matteo; Vigliermo, Riccardo Amerigo

The linguistic and social impact of multiculturalism can no longer be neglected in any sector, creating the urgent need for systems and procedures to manage and share cultural heritages in both supranational and multi-literate contexts. In order to achieve this goal, text sensing appears to be one of the most crucial research areas. The long-term objective of the DigitalMaktaba project, born from interdisciplinary collaboration between computer scientists, historians, librarians, engineers and linguists, is to establish procedures for the creation, management and cataloguing of archival heritage in non-Latin alphabets. In this paper, we discuss the currently ongoing design of an innovative workflow and tool in the area of text sensing, for the automatic extraction of knowledge and cataloguing of documents written in non-Latin languages (Arabic, Persian and Azerbaijani). The current prototype leverages different OCR, text processing and information extraction techniques in order to provide both a highly accurate extracted text and rich metadata content (including automatically identified cataloguing metadata), overcoming typical limitations of current state-of-the-art approaches. The initial tests provide promising results. The paper includes a discussion of future steps (e.g., AI-based techniques further leveraging the extracted data/metadata and making the system learn from user feedback) and of the many foreseen advantages of this research, both from a technical and a broader cultural-preservation and sharing point of view.


2022 - Progressive Entity Resolution with Node Embeddings [Paper in Conference Proceedings]
Simonini, Giovanni; Gagliardelli, Luca; Rinaldi, Michele; Zecchini, Luca; De Sabbata, Giulio; Aslam, Adeel; Beneventano, Domenico; Bergamaschi, Sonia

Entity Resolution (ER) is the task of finding records that refer to the same real-world entity, which are called matches. ER is a fundamental pre-processing step when dealing with dirty and/or heterogeneous datasets; however, it can be very time-consuming when employing complex machine learning models to detect matches, as state-of-the-art ER methods do. Thus, when time is a critical component and having a partial ER result is better than having no result at all, progressive ER methods are employed to try to maximize the number of detected matches as a function of time. In this paper, we study how to perform progressive ER by exploiting graph embeddings. The basic idea is to represent candidate matches in a graph: each node is a record and each edge is a possible comparison to check—we build that on top of a well-known, established graph-based ER framework. We experimentally show that our method performs better than existing state-of-the-art progressive ER methods on real-world benchmark datasets.
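
The scheduling idea can be sketched in a few lines, assuming node embeddings are already available (the paper learns them on top of an established graph-based ER framework): rank candidate edges by embedding similarity so the likeliest matches are compared first.

```python
def cosine(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def progressive_order(edges, emb):
    """Order candidate comparisons (graph edges) by similarity of their node
    embeddings, most similar first; an illustrative scheduler, not the
    paper's full method."""
    return sorted(edges, key=lambda e: cosine(emb[e[0]], emb[e[1]]),
                  reverse=True)
```

Consuming the resulting list until the time budget runs out yields a partial ER result that maximizes early matches, which is exactly the progressive-ER contract.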


2021 - Preface [Paper in Conference Proceedings]
Mottin, D.; Lissandrini, M.; Roy, S. B.; Velegrakis, Y.; Athanassoulis, M.; Augsten, N.; Hamadou, H. B.; Bergamaschi, S.; Bikakis, N.; Bonifati, A.; Dimou, A.; Di Rocco, L.; Fletcher, G.; Foroni, D.; Freytag, J. -C.; Groth, P.; Guerra, F.; Hartig, O.; Karras, P.; Ke, X.; Kondylakis, H.; Koutrika, G.; Manolescu, I.


2021 - Preserving and conserving culture: First steps towards a knowledge extractor and cataloguer for multilingual and multi-alphabetic heritages [Paper in Conference Proceedings]
Bergamaschi, S.; Martoglia, R.; Ruozzi, F.; Vigliermo, R. A.; De Nardis, S.; Sala, L.; Vanzini, M.

Managing and sharing cultural heritages also in supranational and multi-literate contexts is a very hot research topic. In this paper we discuss the research we are conducting in the DigitalMaktaba project, presenting the first steps for designing an innovative workflow and tool for the automatic extraction of knowledge from documents written in multiple non-Latin languages (Arabic, Persian and Azerbaijani languages). The tool leverages different OCR, text processing techniques and linguistic corpora in order to provide both a highly accurate extracted text and a rich metadata content, overcoming typical limitations of current state-of-the-art systems; this will enable in the near future the development of an automatic cataloguer which we hope will ultimately help in better preserving and conserving culture in such a demanding scenario.


2021 - Reproducible experiments on Three-Dimensional Entity Resolution with JedAI [Journal Article]
Mandilaras, George; Papadakis, George; Gagliardelli, Luca; Simonini, Giovanni; Thanos, Emmanouil; Giannakopoulos, George; Bergamaschi, Sonia; Palpanas, Themis; Koubarakis, Manolis; Lara-Clares, Alicia; Farina, Antonio

In Papadakis et al. [1], we presented the latest release of JedAI, an open-source Entity Resolution (ER) system that allows for building a large variety of end-to-end ER pipelines. Through a thorough experimental evaluation, we compared a schema-agnostic ER pipeline based on blocks with another schema-based ER pipeline based on similarity joins. We applied them to 10 established, real-world datasets and assessed them with respect to effectiveness and time efficiency. Special care was taken to juxtapose their scalability, too, using seven established, synthetic datasets. Moreover, we experimentally compared the effectiveness of the batch schema-agnostic ER pipeline with its progressive counterpart. In this companion paper, we describe how to reproduce the entire experimental study that pertains to JedAI’s serial execution through its intuitive user interface. We also explain how to examine the robustness of the parameter configurations we have selected.


2021 - The Case for Multi-task Active Learning Entity Resolution [Paper in Conference Proceedings]
Simonini, Giovanni; Saccani, Henrique; Gagliardelli, Luca; Zecchini, Luca; Beneventano, Domenico; Bergamaschi, Sonia


2021 - The Italian National Registry for FSHD: an enhanced data integration and an analytics framework towards Smart Health Care and Precision Medicine for a rare disease [Journal Article]
Bettio, C.; Salsi, V.; Orsini, M.; Calanchi, E.; Magnotta, L.; Gagliardelli, L.; Kinoshita, J.; Bergamaschi, S.; Tupler, R.

Background: The Italian Clinical network for FSHD (ICNF) has established the Italian National Registry for FSHD (INRF), collecting data from patients affected by Facioscapulohumeral dystrophy (FSHD) and their relatives. The INRF has gathered data from molecular analysis, clinical evaluation, anamnestic information, and family history from more than 3500 participants. Methods: A data management framework, called Mediator Environment for Multiple Information Sources (MOMIS) FSHD Web Platform, has been developed to provide charts, maps and search tools customized for specific needs. Patients’ samples and their clinical information derive from the Italian Clinical network for FSHD (ICNF), a consortium consisting of fourteen neuromuscular clinics distributed across Italy. The tools used to collect, integrate, and visualize clinical, molecular and natural history information about patients affected by FSHD and their relatives are described. Results: The INRF collected the molecular data regarding FSHD diagnosis conducted on 7197 subjects and identified 3362 individuals carrying a D4Z4 Reduced Allele (DRA): 1634 were unrelated index cases. In 1032 cases the molecular testing has been extended to 3747 relatives, 1728 carrying a DRA. Since 2009 molecular analysis has been accompanied by clinical evaluation based on standardized evaluation protocols. In the period 2009–2020, 3577 clinical forms have been collected, 2059 following the Comprehensive Clinical Evaluation Form (CCEF). The integration of standardized clinical information and molecular data has made it possible to demonstrate the wide phenotypic variability of FSHD. The MOMIS (Mediator Environment for Multiple Information Sources) data integration framework allowed performing genotype–phenotype correlation studies and generated information of medical importance, both for clinical practice and for genetic counseling.
Conclusion: The platform implemented for the FSHD Registry data collection, based on OpenClinica, meets the requirement to integrate patient/disease information, as well as the need to adapt dynamically to security and privacy concerns. Our results indicate that the quality of data collection in a multi-integrated approach is fundamental for clinical and epidemiological research in a rare disease and may have great value in allowing us to redefine diagnostic criteria and disease markers for FSHD. Extending the use of the MOMIS data integration framework to other countries, together with the longitudinal systematic collection of standardized clinical data, will facilitate the understanding of disease natural history and offer valuable inputs towards trial readiness. This approach is of high significance to the FSHD medical community and to rare disease research in general.


2020 - BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources [Journal Article]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Simonini, Giovanni

We present BLAST2, a novel technique to efficiently extract loose schema information, i.e., metadata that can serve as a surrogate of the schema alignment task within the Entity Resolution (ER) process — to identify records that refer to the same real-world entity — when integrating multiple, heterogeneous and voluminous data sources. The loose schema information is exploited for reducing the overall complexity of ER, whose naïve solution would imply O(n^2) comparisons, where n is the number of entity representations involved in the process, and can be extracted from both structured and unstructured data sources. BLAST2 is completely unsupervised, yet able to achieve almost the same precision and recall of supervised state-of-the-art schema alignment techniques when employed for Entity Resolution tasks, as shown in our experimental evaluation performed on two real-world data sets (composed of 7 and 10 data sources, respectively).
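
As an illustration of what loose schema information can buy (this is not BLAST2's actual technique), attributes from two sources can be aligned in a fully unsupervised way by comparing the value sets they contain; such alignments then let a blocking scheme avoid quadratic all-pairs comparisons:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of observed values."""
    return len(a & b) / len(a | b) if a | b else 0.0

def loose_matches(schema1, schema2, threshold=0.5):
    """Pair up attributes from two sources whose observed value sets overlap
    enough: an unsupervised surrogate for schema alignment. Sources are given
    as dicts {attribute_name: set_of_values}; names never need to agree."""
    return [(c1, c2)
            for c1, v1 in schema1.items()
            for c2, v2 in schema2.items()
            if jaccard(v1, v2) >= threshold]
```

Note that the alignment is driven purely by instances, so it applies even when the two sources use completely different attribute names.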


2020 - Entity resolution on camera records without machine learning [Paper in Conference Proceedings]
Zecchini, L.; Simonini, G.; Bergamaschi, S.

This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not reaching a 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans and reaches a 0.99 F-score. In our experience, our approach was no more expensive, in terms of human effort, than the labeling of match/non-match pairs required by ML-based methods.
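
A toy version of the idea, with a hypothetical brand list and model regex standing in for the human-curated annotated lists and regular expressions of the actual solution: records reduce to a (brand, model) signature, and records with equal signatures are treated as the same camera.

```python
import re

# Hypothetical annotated brand list and model pattern (illustrative only).
BRANDS = {"canon", "nikon", "sony"}
MODEL_RE = re.compile(r"\b([a-z]+\s?-?\d{1,4}[a-z]?)\b")

def signature(title):
    """Extract a normalized (brand, model) signature from a listing title;
    spacing and hyphenation variants collapse to the same model string."""
    t = title.lower()
    brand = next((b for b in BRANDS if b in t), None)
    m = MODEL_RE.search(t)
    model = m.group(1).replace(" ", "").replace("-", "") if m else None
    return (brand, model)
```

Grouping records by signature replaces both the blocking and the matching step of an ML pipeline for this domain.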


2020 - RulER: Scaling Up Record-level Matching Rules [Paper in Conference Proceedings]
Gagliardelli, Luca; Simonini, Giovanni; Bergamaschi, Sonia


2020 - Scaling up Record-level Matching Rules [Paper in Conference Proceedings]
Gagliardelli, L.; Simonini, G.; Bergamaschi, S.

Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are widely employed by data scientists and practitioners who work with data lakes, open data, and data in the wild. In this work we present a novel technique to efficiently execute record-level matching rules on parallel and distributed systems, and demonstrate its efficiency on a real-world data set.
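
A record-level matching rule can be sketched as a conjunction of similarity predicates over attributes, here executed with a naive nested-loop join. The attribute names and thresholds are illustrative; the paper's contribution is precisely the efficient parallel and distributed execution of such rules:

```python
def jaccard_tokens(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rule(r, s):
    # A chain of predicates: fuzzy on title AND exact on year (illustrative).
    return (jaccard_tokens(r["title"], s["title"]) >= 0.6
            and r["year"] == s["year"])

def join(left, right, rule):
    """Naive nested-loop execution of a record-level matching rule; a
    baseline that distributed similarity-join execution aims to beat."""
    return [(r, s) for r in left for s in right if rule(r, s)]
```

The nested loop is quadratic, which is exactly why executing such rules at data-lake scale requires the similarity-join machinery the paper introduces.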


2020 - Three-dimensional Entity Resolution with JedAI [Journal Article]
Papadakis, G.; Mandilaras, G.; Gagliardelli, L.; Simonini, G.; Thanos, E.; Giannakopoulos, G.; Bergamaschi, S.; Palpanas, T.; Koubarakis, M.

Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of state-of-the-art ER techniques that have been proposed and examined independently, targeting parts of the ER end-to-end pipeline. This is a unique approach, as no other ER tool brings together so many established techniques. Instead, most ER tools merely convey a few techniques, those primarily developed by their creators. In addition to democratizing ER techniques, JedAI goes beyond the other ER tools by offering a series of unique characteristics: (i) It allows for building and benchmarking millions of ER pipelines. (ii) It is the only ER system that applies seamlessly to any combination of structured and/or semi-structured data. (iii) It constitutes the only ER system that runs seamlessly both on stand-alone computers and clusters of computers — through the parallel implementation of all algorithms in Apache Spark. (iv) It supports two different end-to-end workflows for carrying out batch ER (i.e., budget-agnostic), a schema-agnostic one based on blocks, and a schema-based one relying on similarity joins. (v) It adapts both end-to-end workflows to budget-aware (i.e., progressive) ER. We present in detail all features of JedAI, stressing the core characteristics that enhance its usability, and boost its versatility and effectiveness. We also compare it to the state-of-the-art in the field, qualitatively and quantitatively, demonstrating its state-of-the-art performance over a variety of large-scale datasets from different domains. The central repository of JedAI's code base is here: https://github.com/scify/JedAIToolkit. A video demonstrating JedAI's Web application is available here: https://www.youtube.com/watch?v=OJY1DUrUAe8.


2019 - Action recognition to estimate Activities of Daily Living (ADL) of elderly people [Paper in Conference Proceedings]
Gabrielli, M.; Leo, P.; Renzi, F.; Bergamaschi, S.

This work proposes a method, and preliminary experimental results, to detect and recognize a set of Activities of Daily Living carried out by elderly people in a residential context, by analyzing videos of actions recorded using an RGB camera. The proposed solution is based on the creation of neural network models, in particular Convolutional Neural Networks (CNN), which are trained on data extracted and preprocessed from the 'Moments in Time' dataset, a resource released by the MIT-IBM Watson AI Lab that includes a collection of a million labeled 3-second videos from hundreds of categories. We also describe the performance of the models obtained following two different approaches: Auto Machine Learning, which was also necessary in order to get an idea of the achievable performance, and transfer learning. One of the main drivers of our research activity was to explore and challenge the Auto Machine Learning approach in the context of ADL and evaluate its initial accuracy baseline with respect to transfer learning approaches.


2019 - Computing inter-document similarity with Context Semantic Analysis [Journal Article]
Beneventano, Domenico; Benedetti, Fabio; Bergamaschi, Sonia; Simonini, Giovanni

We propose a novel knowledge-based technique for inter-document similarity computation, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia and Wikidata) to extract from it a contextual graph and a semantic contextual vector able to represent the context of a document. We show how CSA exploits such a Semantic Context Vector to compute inter-document similarity effectively. Moreover, we show how CSA can be effectively applied in the Information Retrieval domain. Experimental results show that our general technique outperforms baselines built on top of traditional methods, and achieves performance similar to that of approaches built on top of specific knowledge bases.
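The similarity computation at the core of CSA can be illustrated with a minimal sketch: each document is represented by a Semantic Context Vector of weighted knowledge-base entities, and inter-document similarity is the cosine of the two vectors. The entity names and weights below are invented for illustration; this is not the paper's actual vector-construction procedure.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts: entity -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical Semantic Context Vectors: weights of KB entities in each document
doc_a = {"dbpedia:Database": 0.8, "dbpedia:Data_integration": 0.6}
doc_b = {"dbpedia:Database": 0.5, "dbpedia:SPARQL": 0.7}
sim = cosine(doc_a, doc_b)  # shares one entity, so 0 < sim < 1
```

Documents sharing more (and more heavily weighted) knowledge-base entities score closer to 1, which is what makes the contextual vector usable for ranking in an Information Retrieval setting.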


2019 - Entity Resolution and Data Fusion: An Integrated Approach [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Simonini, Giovanni
abstract


2019 - Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments [Relazione in Atti di Convegno]
Morrone, Giovanni; Pasa, Luca; Tikhanoff, Vadim; Bergamaschi, Sonia; Fadiga, Luciano; Badino, Leonardo
abstract

In this paper, we address the problem of enhancing the speech of a speaker of interest in a cocktail party scenario when visual information of the speaker of interest is available. Contrary to most previous studies, we do not learn visual features on the typically small audio-visual datasets, but use an already available face landmark detector (trained on a separate image dataset). The landmarks are used by LSTM-based models to generate time-frequency masks which are applied to the acoustic mixed-speech spectrogram. Results show that: (i) landmark motion features are very effective features for this task, (ii) similarly to previous work, reconstruction of the target speaker's spectrogram mediated by masking is significantly more accurate than direct spectrogram reconstruction, and (iii) the best masks depend on both motion landmark features and the input mixed-speech spectrogram. To the best of our knowledge, our proposed models are the first models trained and evaluated on the limited size GRID and TCD-TIMIT datasets, that achieve speaker-independent speech enhancement in a multi-talker setting.


2019 - Scaling entity resolution: A loosely schema-aware approach [Articolo su rivista]
Simonini, Giovanni; Gagliardelli, Luca; Bergamaschi, Sonia; Jagadish, H. V.
abstract

In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects, etc.). Different profiles (i.e., representations) of an entity often contain redundant and/or inconsistent information. Thus, identifying which profiles refer to the same entity is a fundamental task (called Entity Resolution) to unleash the value of big data. The naïve all-pairs comparison solution is impractical on large data, hence blocking methods are employed to partition a profile collection into (possibly overlapping) blocks and limit the comparisons to profiles that appear in the same block together. Meta-blocking is the task of restructuring a block collection, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features, under the assumption that handling the schema variety of big data does not pay off for such a task. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract the loose schema information by adopting an LSH-based step for efficiently handling volume and schema heterogeneity of the data. Furthermore, we introduce a novel meta-blocking algorithm that can be employed to efficiently execute Blast on MapReduce-like systems (such as Apache Spark). Finally, we experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art (meta-)blocking approaches.
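The blocking-plus-meta-blocking paradigm described above can be sketched in a few lines: every token of every attribute value acts as a block key, co-occurring profile pairs are weighted by the number of blocks they share, and pairs below the mean weight are pruned. This is a toy illustration of the general paradigm, not Blast's actual loose-schema or LSH-based weighting scheme; the profile data is invented.

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(profiles):
    """Schema-agnostic blocking: every attribute-value token becomes a block key."""
    blocks = defaultdict(set)
    for pid, attrs in profiles.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    return blocks

def meta_block(blocks):
    """Weight each co-occurring pair by #shared blocks; prune below the mean."""
    weights = defaultdict(int)
    for pids in blocks.values():
        for a, b in combinations(sorted(pids), 2):
            weights[(a, b)] += 1
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= mean}

profiles = {
    1: {"name": "apple iphone 13", "brand": "apple"},   # heterogeneous schemata:
    2: {"title": "iphone 13 apple smartphone"},          # attribute names differ
    3: {"name": "samsung galaxy s21"},
}
pairs = meta_block(token_blocking(profiles))  # only the plausible match survives
```

Note how the candidate pair survives purely on shared tokens, with no agreement on attribute names, which is exactly why the approach tolerates schema variety.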


2019 - Schema-agnostic progressive entity resolution [Articolo su rivista]
Simonini, G.; Papadakis, G.; Palpanas, T.; Bergamaschi, S.
abstract

Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpora, etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce two naïve schema-agnostic methods, showing that straightforward solutions exhibit a poor performance that does not scale well to large volumes of data. Then, we propose four different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform to a significant extent both the naïve and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on the method selection.
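The progressive principle can be sketched minimally: rank candidate pairs by a cheap similarity estimate and emit them best-first until a budget is exhausted, so the likeliest matches are resolved before time runs out. The token-overlap estimate and the data below are illustrative stand-ins, not the paper's actual methods.

```python
import heapq

def progressive_pairs(profiles, budget):
    """Emit candidate pairs best-first by a cheap token-overlap estimate,
    stopping after `budget` comparisons (a stand-in for a time limit)."""
    tokens = {pid: set(text.lower().split()) for pid, text in profiles.items()}
    heap = []
    ids = sorted(profiles)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            overlap = len(tokens[a] & tokens[b])
            if overlap:
                heapq.heappush(heap, (-overlap, a, b))  # max-heap via negation
    out = []
    while heap and len(out) < budget:
        _, a, b = heapq.heappop(heap)
        out.append((a, b))
    return out

profiles = {1: "john smith nyc", 2: "john smith new york", 3: "jane doe"}
best = progressive_pairs(profiles, budget=1)  # the likeliest pair comes first
```

With a budget of one comparison, only the highest-overlap pair is emitted; a larger budget would release progressively less promising pairs, which is the shape of the optimal comparison order being approximated.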


2019 - SparkER: Scaling Entity Resolution in Spark [Relazione in Atti di Convegno]
Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico; Bergamaschi, Sonia
abstract

We present SparkER, an ER tool that can scale practitioners’ favorite ER algorithms. SparkER has been devised to take full advantage of parallel and distributed computation (running on top of Apache Spark). The first SparkER version focused on the blocking step and implements both schema-agnostic and Blast meta-blocking approaches (i.e. the state-of-the-art ones); a GUI was developed to let non-expert users employ SparkER in an unsupervised mode. The new version of SparkER to be shown in this demo extends the tool significantly: Entity Matching and Entity Clustering modules have been added. Moreover, in addition to the completely unsupervised mode of the first version, a supervised mode has been added. The user can be assisted in supervising the entire process and in injecting their knowledge in order to achieve the best result. During the demonstration, attendees will be shown how SparkER can significantly help in devising and debugging ER algorithms.


2018 - BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios [Relazione in Atti di Convegno]
Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia
abstract

Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit able to detect duplicate records on Big Data sources in an efficient manner. BigDedup makes the state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two different ways: (i) through a simple graphical interface that permits the user to process structured and unstructured data in a fast and effective way; (ii) as a library that provides different components that can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.


2018 - Enhancing Loosely Schema-aware Entity Resolution with User Interaction [Relazione in Atti di Convegno]
Simonini, Giovanni; Gagliardelli, Luca; Zhu, Song; Bergamaschi, Sonia
abstract

Entity Resolution (ER) is a fundamental task of data integration: it identifies different representations (i.e., profiles) of the same real-world entity in databases. Comparing all possible profile pairs through an ER algorithm has quadratic complexity. Blocking is commonly employed to avoid that: profiles are grouped into blocks according to some features, and ER is performed only for entities of the same block. Yet, devising blocking criteria and ER algorithms for data with high schema heterogeneity is a difficult and error-prone task calling for automatic methods and debugging tools. In our previous work, we presented Blast, an ER system that can scale practitioners’ favorite Entity Resolution algorithms. In its current version, Blast has been devised to take full advantage of parallel and distributed computation as well (running on top of Apache Spark). It implements the state-of-the-art unsupervised blocking method based on automatically extracted loose schema information. It comes with a GUI, which allows the user: (i) to visualize, understand, and (optionally) manually modify the automatically extracted loose schema information (i.e., injecting the user’s knowledge into the system); (ii) to retrieve resolved entities through a free-text search box, and to visualize the process that led to that result (i.e., the provenance). Experimental results on real-world datasets show that these two functionalities can significantly enhance Entity Resolution results.


2018 - Enhancing big data exploration with faceted browsing [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Zhu, Song; Simonini, Giovanni
abstract

With modern information technologies, data availability is increasing at formidable speed, giving rise to the Big Data challenge (Bergamaschi, 2014). As a matter of fact, Big Data analysis now drives every aspect of modern society, such as manufacturing, retail, financial services, etc. (Labrinidis & Jagadish, 2012). In this scenario, we need to rethink advanced and efficient human-computer interaction to be able to handle huge amounts of data. One of the most valuable means to make sense of Big Data, to most people, is data visualization. Data visualization may guide decision-making and become a powerful tool to convey information in all data analysis tasks. However, to be actually actionable, data visualization tools should allow the right amount of interactivity and be easy to use, understandable, meaningful, and approachable. In this article, we present a new approach to visualize and explore a huge amount of data. In particular, the novelty of our approach is to enhance the faceted browsing search in Apache Solr (a widely used enterprise search platform) by exploiting Bayesian networks, supporting the user in the exploration of the data. We show how the proposed Bayesian suggestion algorithm (Cooper & Herskovits, 1991) can be a key ingredient in a Big Data scenario, where a query can generate more results than the user can handle. Our proposed solution aims to select the best results, which, together with the result path chosen by the user by means of multi-faceted querying and faceted navigation, can be a valuable support for both Big Data exploration and visualization. In the following, we introduce the faceted browsing technique, then we describe how it can be enhanced by exploiting Bayesian networks.


2018 - MOMIS Dashboard: a powerful data analytics tool for Industry 4.0 [Relazione in Atti di Convegno]
Magnotta, Luca; Gagliardelli, Luca; Simonini, Giovanni; Orsini, Mirko; Bergamaschi, Sonia
abstract

In this work we present the MOMIS Dashboard, an interactive data analytics tool to explore and visualize data source content through several kinds of dynamic views (e.g. maps, bar, line, and pie charts). The software tool is very versatile, and supports connections to the main relational DBMSs and Big Data sources. Moreover, it can be connected to MOMIS, a powerful open-source data integration system able to integrate heterogeneous data sources, such as enterprise information systems, as well as sensor data. MOMIS Dashboard provides secure permission management to limit data access on the basis of a user's role, and a Designer to create and share personalized insights on the company's KPIs, facilitating enterprise collaboration. We illustrate the MOMIS Dashboard's efficacy in a real enterprise scenario: a production monitoring platform to analyze real-time and historical data collected through sensors located on production machines, to optimize production and energy consumption and to enable preventive maintenance.


2018 - Preface [Relazione in Atti di Convegno]
Bergamaschi, S.; Noia, T. D.; Maurino, A.; Tanca, L.
abstract


2018 - Schema-agnostic Progressive Entity Resolution [Relazione in Atti di Convegno]
Simonini, Giovanni; Papadakis, George; Palpanas, Themis; Bergamaschi, Sonia
abstract

Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpora, etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce a naïve schema-agnostic method, showing that the straightforward solution exhibits a poor performance that does not scale well to large volumes of data. Then, we propose three different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform to a significant extent both the naïve and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on the method selection.


2018 - Towards Progressive Search-driven Entity Resolution [Relazione in Atti di Convegno]
Pietrangelo, A.; Simonini, G.; Bergamaschi, S.; Koumarelas, I.; Naumann, F.
abstract

Keyword-search systems for databases aim to answer a user query composed of a few terms with a ranked list of records. They are powerful and easy-to-use data exploration tools for a wide range of contexts. For instance, given a product database gathered by scraping e-commerce websites, these systems enable even non-technical users to explore the item set (e.g., to check whether it contains certain products or not, or to discover the price of an item). However, if the database contains dirty records (i.e., incomplete and duplicated records), a pre-processing step to clean the data is required. One fundamental data cleaning step is Entity Resolution, i.e., the task of identifying and fusing together all the records that refer to the same real-world entity. This task is typically executed on the whole data, independently of: (i) the portion of the entities that a user may indicate through keywords, and (ii) the order priority that a user might express through an order by clause. This paper describes a first step to solve the problem of progressive search-driven Entity Resolution: resolving all the entities described by a user through a handful of keywords, progressively (according to an order by clause). We discuss the features of our method, named SearchER, and showcase some examples of keyword queries on two real-world datasets obtained with a demonstrative prototype that we have built.


2017 - BigBench workload executed by using Apache Flink [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Gagliardelli, Luca; Simonini, Giovanni; Zhu, Song
abstract

Many of the challenges that have to be faced in Industry 4.0 involve the management and analysis of huge amounts of data (e.g. sensor data management and machine-fault prediction in industrial manufacturing, web-log analysis in e-commerce). To handle the so-called Big Data management and analysis, a plethora of frameworks has been proposed in the last decade. Many of them focus on the parallel processing paradigm, such as MapReduce, Apache Hive, and Apache Flink. However, in this jungle of frameworks, the performance evaluation of these technologies is not a trivial task, and strictly depends on the application requirements. The scope of this paper is to compare two of the most employed and promising frameworks for managing big data: Apache Flink and Apache Hive, which are general-purpose distributed platforms under the umbrella of the Apache Software Foundation. To evaluate these two frameworks we use the BigBench benchmark, developed for Apache Hive. We re-implemented the most significant queries of the Apache Hive BigBench workload to make them work on Apache Flink, in order to be able to compare the results of the same queries executed on both frameworks. Our results show that Apache Flink, if configured well, is able to outperform Apache Hive.


2017 - Conditional random fields with semantic enhancement for named-entity recognition [Relazione in Atti di Convegno]
Bergamaschi, S.; Cappelli, A.; Circiello, A.; Varone, M.
abstract

We propose a novel Named Entity Recognition (NER) system based on a machine learning technique and a semantic network. The NER system is able to exploit the advantages of semantic information, coming from Expert System proprietary technology, Cogito. NER is a task of Natural Language Processing (NLP) which consists in detecting and classifying, from an unformatted text source, Named Entities (NE), i.e. real-world entities that can be denoted with a rigid designator. To address this problem, the chosen approach is a combination of machine learning and deep semantic processing. The machine learning method used is Conditional Random Fields (CRF). CRF is particularly suitable for the task because it analyzes an input sequence of tokens considering it as a whole, instead of one item at a time. CRF has been trained not only with classical information, available after a simple computation or anyway with little effort, but with the addition of semantic information. Semantic information is obtained with Sensigrafo and Semantic Disambiguator, which are the proprietary semantic network and semantic engine of Expert System, respectively. The results are encouraging, as we can experimentally prove the improvements in the NER task obtained by exploiting semantics, in particular when the training data size decreases.


2017 - Effects of Semantic Analysis on Named-Entity Recognition with Conditional Random Fields [Relazione in Atti di Convegno]
Bergamaschi, S.; Cappelli, A.; Circiello, A.; Varone, M.
abstract

We propose a novel Named Entity Recognition (NER) system based on a machine learning technique and a semantic network. The NER system is able to exploit the advantages of semantic information, coming from Expert System proprietary technology, Cogito. NER is a task of Natural Language Processing (NLP) which consists in detecting and classifying, from an unformatted text source, Named Entities (NE), i.e. real-world entities that can be denoted with a rigid designator. To address this problem, the chosen approach is a combination of machine learning and deep semantic processing. The machine learning method used is Conditional Random Fields (CRF). CRF is particularly suitable for the task because it analyzes an input sequence considering the whole sequence, instead of one item at a time. CRF has been trained not only with classical information, available after a simple computation or anyway with little effort, but with semantic information too. Semantic information is obtained with Sensigrafo and Semantic Disambiguator, which are the proprietary semantic network and semantic engine of Expert System, respectively. The results are encouraging, as we can experimentally prove the improvements in the NER task obtained by exploiting semantics.
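The combination of classical and semantic CRF features can be sketched as a feature-extraction function: surface features computed from the token itself, plus a semantic-type feature looked up in a dictionary that stands in for a semantic network such as Sensigrafo. The lookup interface and the type labels here are hypothetical, invented for illustration.

```python
def token_features(tokens, i, semantic_types):
    """Classical surface features plus a semantic-type feature for token i.
    `semantic_types` is a hypothetical stand-in for a semantic-network lookup."""
    tok = tokens[i]
    feats = {
        "word.lower": tok.lower(),
        "word.istitle": tok.istitle(),
        "word.isdigit": tok.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    sem = semantic_types.get(tok.lower())
    if sem:
        feats["semantic.type"] = sem  # e.g. PERSON, PLACE
    return feats

sentence = ["Maria", "lives", "in", "Modena"]
sem = {"maria": "PERSON", "modena": "PLACE"}
feats = [token_features(sentence, i, sem) for i in range(len(sentence))]
```

Feature dictionaries of this shape are what a CRF trainer consumes per token; the semantic feature is simply one more key, which is why it can be added to the classical features with little effort.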


2017 - Effects of semantic analysis on named-entity recognition with conditional random fields [Relazione in Atti di Convegno]
Bergamaschi, S.; Cappelli, A.; Circiello, A.; Varone, M.
abstract

We propose a novel Named Entity Recognition (NER) system based on a machine learning technique and a semantic network. The NER system is able to exploit the advantages of semantic information, coming from Expert System proprietary technology, Cogito. NER is a task of Natural Language Processing (NLP) which consists in detecting and classifying, from an unformatted text source, Named Entities (NE), i.e. real-world entities that can be denoted with a rigid designator. To address this problem, the chosen approach is a combination of machine learning and deep semantic processing. The machine learning method used is Conditional Random Fields (CRF). CRF is particularly suitable for the task because it analyzes an input sequence considering the whole sequence, instead of one item at a time. CRF has been trained not only with classical information, available after a simple computation or anyway with little effort, but with semantic information too. Semantic information is obtained with Sensigrafo and Semantic Disambiguator, which are the proprietary semantic network and semantic engine of Expert System, respectively. The results are encouraging, as we can experimentally prove the improvements in the NER task obtained by exploiting semantics.


2017 - From Data Integration to Big Data Integration [Capitolo/Saggio]
Bergamaschi, Sonia; Beneventano, Domenico; Mandreoli, Federica; Martoglia, Riccardo; Guerra, Francesco; Orsini, Mirko; Po, Laura; Vincini, Maurizio; Simonini, Giovanni; Zhu, Song; Gagliardelli, Luca; Magnotta, Luca
abstract

The Database Group (DBGroup, www.dbgroup.unimore.it) and Information System Group (ISGroup, www.isgroup.unimore.it) research activities have been mainly devoted to the Data Integration Research Area. The DBGroup designed and developed the MOMIS data integration system, giving rise to a successful innovative enterprise, DataRiver (www.datariver.it), which distributes MOMIS as open source. MOMIS provides integrated access to structured and semistructured data sources and allows a user to pose a single query and to receive a single unified answer. Description Logics and automatic annotation of schemata, plus clustering techniques, constitute the theoretical framework. In the context of data integration, the ISGroup addressed problems related to the management and querying of heterogeneous data sources in large-scale and dynamic scenarios. The reference architectures are Peer Data Management Systems and their evolutions toward dataspaces. In these contexts, the ISGroup proposed and evaluated effective and efficient mechanisms for network creation with limited information loss, and solutions for mapping management, query reformulation and processing, and query routing. The main issues of data integration have been faced: automatic annotation, mapping discovery, global query processing, provenance, multidimensional information integration, and keyword search, within European and national projects. With the incoming new requirements of integrating open linked data, textual and multimedia data in a big data scenario, the research has been devoted to the Big Data Integration Research Area. In particular, the most relevant research results achieved are: a scalable entity resolution method, a scalable join operator, and a tool, LODeX, for automatically extracting metadata from Linked Open Data (LOD) resources and for visual query formulation on LOD resources. Moreover, in collaboration with DataRiver, Data Integration was successfully applied to smart e-health.


2017 - PV-OWL-Pharmacovigilance surveillance through semantic web-based platform for continuous and integrated monitoring of drug-related adverse effects in open data sources and social media [Relazione in Atti di Convegno]
Piccinni, C.; Poluzzi, E.; Orsini, M.; Bergamaschi, S.
abstract

The recent EU regulation on Pharmacovigilance [Regulation (EU) 1235/2010, Directive 2010/84/EU] requires both pharmaceutical companies and public health agencies to maintain updated drug safety information, monitoring all available data sources. Here, we present our project aiming to develop a web platform for continuous monitoring of adverse effects of medicines (pharmacovigilance), by integrating information from public databases, scientific literature and social media. The project will start by scanning all available data sources concerning drug adverse events, both open (e.g., FAERS-FDA Adverse Event Reporting System, medical literature, social media, etc.) and proprietary (e.g., hospital discharge records, drug prescription archives, electronic health records), which require agreements with the respective data owners. Subsequently, pharmacovigilance experts will perform a semi-automatic mapping of the codes identifying drugs and adverse events, to build the thesaurus of the web-based platform. After these preliminary activities, signal generation and prioritization will be the core of the project. This task will result in risk confidence scores for each included data source and a comprehensive global score, indicating the possible association between a specific drug and an adverse event. The software framework MOMIS, an open-source data integration system, will allow semi-automatic virtual integration of heterogeneous and distributed data sources. A web platform, based on MOMIS, able to merge many heterogeneous data sets concerning adverse events will be developed. The platform will be tested by external specialized subjects (clinical researchers, public or private employees in the pharmacovigilance field).
The project will provide a) an innovative way to link, for the first time in Italy, different databases to obtain novel safety indicators; b) a web platform for fast and easy integration of all available data, useful to verify and validate hypotheses generated in signal detection. Finally, the development of the unified safety indicator (global risk score) will result in a compelling, easy-to-understand, visual format for a broad range of professional and non-professional users like patients, regulatory authorities, clinicians, lawyers, and human scientists.


2017 - Sopj: A scalable online provenance join for data integration [Relazione in Atti di Convegno]
Zhu, Song; Fiameni, Giuseppe; Simonini, Giovanni; Bergamaschi, S.
abstract

Data integration is a technique used to combine different sources of data to provide a unified view of them. MOMIS [1] is an open-source data integration framework developed by the DBGroup. The goal of our work is to make MOMIS able to scale out as the input data sources increase, without introducing a noticeable performance penalty. In particular, we present a full outer join method capable of efficiently integrating multiple sources at the same time by using data streams and provenance information. To evaluate the scalability of this innovative approach, we developed a join engine employing a distributed data processing framework. Our solution is able to process input data sources in the form of continuous streams, execute the join operation on-the-fly, and produce outputs as soon as they are generated. In this way, the join can return partial results before the input streams have been completely received or processed, optimizing the entire execution.
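The streaming-join idea can be sketched as a symmetric hash join: each incoming row is buffered by key and immediately joined against whatever the other stream has already produced, so partial results flow out before the inputs are fully consumed; unmatched rows are emitted at the end with their source tagged, a simple form of provenance. The class and its interface below are a hypothetical sketch, not MOMIS's actual join engine.

```python
from collections import defaultdict

class StreamingOuterJoin:
    """Symmetric hash join over two streams: matches are emitted as soon as
    both sides have seen a key; unmatched rows come out at the end, tagged
    with their source stream (simple provenance)."""
    def __init__(self):
        self.seen = {"L": defaultdict(list), "R": defaultdict(list)}

    def push(self, side, key, row):
        """Buffer a row and return any joins it completes right now."""
        other = "R" if side == "L" else "L"
        self.seen[side][key].append(row)
        matches = self.seen[other].get(key, [])
        return [(key, row, o) if side == "L" else (key, o, row) for o in matches]

    def finish(self):
        """After both streams end: emit outer rows tagged with their source."""
        out = []
        for side in ("L", "R"):
            other = "R" if side == "L" else "L"
            for key, rows in self.seen[side].items():
                if not self.seen[other].get(key):
                    out.extend((key, side, r) for r in rows)
        return out

j = StreamingOuterJoin()
early = j.push("L", 1, "a") + j.push("R", 1, "x") + j.push("R", 2, "y")
unmatched = j.finish()
```

The match on key 1 is produced during the second `push`, before the streams end; only the truly unmatched row waits for `finish`, which is what lets partial results flow while input is still arriving.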


2017 - SparkER: an Entity Resolution framework for Apache Spark [Software]
Gagliardelli, Luca; Simonini, Giovanni; Zhu, Song; Bergamaschi, Sonia
abstract

Entity Resolution is a crucial task for many applications, but its naïve solution has low efficiency due to its quadratic complexity. Usually, to reduce this complexity, blocking is employed to cluster similar entities in order to reduce the global number of comparisons. The Meta-Blocking (MB) approach aims to restructure the block collection in order to reduce the number of comparisons, obtaining better results in terms of execution time. However, these techniques alone are not sufficient in the context of Big Data, where the records to be compared are typically in the order of hundreds of millions. Parallel implementations of MB have been proposed in the literature, but all of them are built on Hadoop MapReduce, which is known to have low efficiency on modern cluster architectures. We implement a Meta-Blocking technique for Apache Spark. Unlike Hadoop, Apache Spark uses a different paradigm to manage the tasks: it does not need to save partial results on disk, keeping them in memory, which guarantees a shorter execution time. We reimplemented the state-of-the-art MB techniques, creating a new algorithm in order to exploit the Spark architecture. We tested our algorithm over several established datasets, showing that our Spark implementation outperforms the existing ones based on Hadoop.


2016 - A model for visual building SPARQL queries [Relazione in Atti di Convegno]
Benedetti, F.; Bergamaschi, S.
abstract

LODeX is a Semantic Web tool that, leveraging a summarized representation of a LOD source structure (i.e. Schema Summary), helps users explore and query SPARQL endpoints by hiding the complexity of Semantic Web technologies. By leveraging the Schema Summary of a LOD source, LODeX guides the user in composing visual queries that are automatically translated into correct SPARQL queries through a SPARQL compiler. In this work we inspected how LODeX can deal with the high expressivity of SPARQL. In particular, we propose a formal model that allows defining queries over the Schema Summary (i.e. Basic Query) and we analyze how this model can handle the different join patterns used in SPARQL queries. Finally, we inspect how LODeX can satisfy real-world user needs by analyzing the query logs contained in the LSQ dataset. We show that LODeX can generate 77.6% of the 5 million queries contained in the LSQ dataset.


2016 - BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution [Articolo su rivista]
Simonini, Giovanni; Bergamaschi, Sonia; Jagadish, H. V.
abstract

Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g. semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised ones.
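The core ingredient of an LSH-based scaling step, MinHash, can be sketched as follows: the signature of a value set keeps, for each seeded hash function, the minimum hash over the set, and the fraction of matching signature positions estimates the Jaccard similarity between two value sets, so similar attributes can be grouped cheaply without all-pairs set comparison. The hash count and the data are illustrative; this is not Blast's actual implementation.

```python
import hashlib

def minhash(values, num_hashes=16):
    """MinHash signature: for each seeded hash function, keep the minimum
    hash value over the whole set of strings."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in values))
    return tuple(sig)

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical attribute value sets: which attributes carry similar values?
name_1 = {"apple", "samsung", "nokia", "lg"}
name_2 = {"apple", "samsung", "nokia", "sony"}
price = {"199", "299", "399", "499"}
est_same = jaccard_estimate(minhash(name_1), minhash(name_2))
est_diff = jaccard_estimate(minhash(name_1), minhash(price))
```

Because signatures have fixed length, comparing two attributes costs the same regardless of how many values they contain, which is what makes the step suitable for large, heterogeneous datasets.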


2016 - Big Data Research in Italy: A Perspective [Articolo su rivista]
Bergamaschi, Sonia; Carlini, Emanuele; Ceci, Michelangelo; Furletti, Barbara; Giannotti, Fosca; Malerba, Donato; Mezzanzanica, Mario; Monreale, Anna; Pasi, Gabriella; Pedreschi, Dino; Perego, Raffaele; Ruggieri, Salvatore
abstract

The aim of this article is to synthetically describe the research projects that a selection of Italian universities is undertaking in the context of big data. Far from being exhaustive, this article has the objective of offering a sample of distinct applications that address the issue of managing huge amounts of data in Italy, collected in relation to diverse domains.


2016 - Combining User and Database Perspective for Solving Keyword Queries over Relational Databases [Articolo su rivista]
Bergamaschi, Sonia; Interlandi, Matteo; Guerra, Francesco; TRILLO LADO, Raquel; Velegrakis, Yannis
abstract

Over the last decade, keyword search over relational data has attracted considerable attention. A possible approach to face this issue is to transform keyword queries into one or more SQL queries to be executed by the relational DBMS. Finding these queries is a challenging task, since the information they represent may be modeled across different tables and attributes. This means that it is necessary not only to identify the schema elements where the data of interest is stored, but also to find out how these elements are interconnected. All the approaches that have been proposed so far provide a monolithic solution. In this work, we instead divide the problem into three steps: the first one, driven by the user's point of view, takes into account what the user has in mind when formulating keyword queries; the second one, driven by the database perspective, considers how the data is represented in the database schema; finally, the third step combines these two processes. We present the theory behind our approach, and its implementation into a system called QUEST (QUEry generator for STructured sources), which has been thoroughly tested to show the efficiency and effectiveness of our approach. Furthermore, we report on the outcomes of a number of experiments that we have conducted.


2016 - Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Beneventano, Domenico; Benedetti, Fabio
abstract

We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata) to extract from it a vector able to represent the context of a document. We show how such a Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods, and achieves a performance similar to that of specialized methods.
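The idea of comparing documents through context vectors can be sketched as follows; this is a minimal Python illustration with hypothetical DBpedia-style entity names, whereas the actual CSA technique derives the vectors from an RDF knowledge base rather than from raw mention counts:

```python
import math
from collections import Counter

def context_vector(entity_mentions):
    """Build a toy Semantic Context Vector: entity -> L2-normalized weight."""
    counts = Counter(entity_mentions)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {e: c / norm for e, c in counts.items()}

def cosine(v1, v2):
    # Both vectors are unit-normalized, so the dot product is the cosine.
    return sum(w * v2.get(e, 0.0) for e, w in v1.items())

doc_a = context_vector(["dbr:Database", "dbr:SQL", "dbr:Database"])
doc_b = context_vector(["dbr:Database", "dbr:SPARQL"])
doc_c = context_vector(["dbr:Football", "dbr:Olympics"])
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```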


2016 - Driving Innovation in Youth Policies With Open Data [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Po, Laura
abstract

In December 2007, thirty activists held a meeting in California to define the concept of open public data. For the first time, eight Open Government Data (OGD) principles were settled: OGD should be Complete, Primary (reporting data at a high level of granularity), Timely, Accessible, Machine processable, Non-discriminatory, Non-proprietary, and License-free. Since the inception of the Open Data philosophy there has been a constant increase in the information released, improving the communication channel between public administrations and their citizens. Open data offers government, companies and citizens information to make better decisions. We claim that Public Administrations, which are the main producers and among the consumers of Open Data, might effectively extract important information by integrating their own data with open data sources. This paper reports the activities carried out during a research project on Open Data for Youth Policies. The project was devoted to exploring the youth situation in the municipalities and provinces of the Emilia Romagna region (Italy), in particular examining data on population, education and work. We identified interesting data sources, both from the open data community and from the private repositories of local governments, related to the Youth Policies. The selected sources have been integrated and the result of the integration has been shown by means of a useful navigator tool. In the end, we published new information on the web as Linked Open Data. Since the process applied and the tools used are generic, we trust this paper can serve as an example and a guide for new projects that aim to create new knowledge through Open Data.


2016 - Enhancing entity resolution efficiency with loosely schema-aware techniques - Discussion paper [Relazione in Atti di Convegno]
Simonini, G.; Bergamaschi, S.
abstract

Entity Resolution, the task of identifying records that refer to the same real-world entity, is a fundamental step in data integration. Blocking is a widely employed technique to avoid the comparison of all possible record pairs in a dataset (an inefficient approach). Renouncing the exploitation of schema information in blocking has been proved to limit the chance of missing matches (i.e., it guarantees high recall), at the cost of low precision. Meta-blocking alleviates this issue by restructuring a block collection, removing redundant and superfluous comparisons. Yet, existing meta-blocking techniques exclusively rely on schema-agnostic features. In this paper, we investigate how loose schema information, induced directly from the data, can be exploited in a holistic loosely schema-aware (meta-)blocking approach that outperforms the state-of-the-art meta-blocking in terms of precision, without renouncing high levels of recall. We implemented our idea in a system called Blast, and experimentally evaluated it on real-world datasets.
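To make the meta-blocking step concrete, here is a minimal Python sketch (a classic restructuring scheme, not the Blast implementation): weight each candidate pair by the number of blocks it co-occurs in (CBS weighting) and prune the pairs whose weight falls below the average.

```python
from collections import defaultdict
from itertools import combinations

def meta_blocking(blocks):
    """Weight candidate pairs by common-block count and keep only the pairs
    at or above the average weight (weight-threshold pruning)."""
    weights = defaultdict(int)
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            weights[pair] += 1
    avg = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= avg}

# r1 and r2 co-occur in three blocks, so their comparison survives pruning.
blocks = [{"r1", "r2", "r3"}, {"r1", "r2"}, {"r2", "r4"}, {"r1", "r2", "r5"}]
kept = meta_blocking(blocks)
print(kept)
```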


2016 - Exploiting Semantics for Searching Agricultural Bibliographic Data [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; Martoglia, Riccardo
abstract

Filtering and search mechanisms that make it possible to identify key bibliographic references are fundamental for researchers. In this paper we propose a fully automatic and semantic method for filtering/searching bibliographic data, which allows users to look for information by specifying simple keyword queries or document queries, i.e. by simply submitting existing documents to the system. The limitations of standard techniques, based either on syntactic text search or on manually assigned descriptors, are overcome by considering the semantics intrinsically associated with the document/query terms; to this aim, we exploit different kinds of external knowledge sources (both general and domain-specific dictionaries or thesauri). The proposed techniques have been developed and successfully tested on agricultural bibliographic data, which plays a central role in enabling researchers and policy makers to retrieve related agricultural and scientific information by using the AGROVOC thesaurus.
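The kind of thesaurus-driven search described above can be sketched as a simple query expansion over related or broader concepts; the thesaurus entries below are toy stand-ins, not actual AGROVOC content:

```python
# Toy AGROVOC-style thesaurus: term -> broader/related concepts.
THESAURUS = {
    "wheat":   {"cereals", "triticum"},
    "maize":   {"cereals", "zea mays"},
    "cereals": {"crops"},
}

def expand_query(keywords):
    """Transitively expand keywords with thesaurus concepts so documents
    indexed under related terms are also retrieved."""
    expanded = set(keywords)
    frontier = list(keywords)
    while frontier:
        term = frontier.pop()
        for concept in THESAURUS.get(term, ()):
            if concept not in expanded:
                expanded.add(concept)
                frontier.append(concept)
    return expanded

print(expand_query({"wheat"}))
```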


2016 - Exposing the Underlying Schema of LOD Sources [Relazione in Atti di Convegno]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

The Linked Data Principles defined by Tim Berners-Lee promise that a large portion of Web data will be usable as one big interlinked RDF database. Today, with more than one thousand Linked Open Data (LOD) sources available on the Web, we are witnessing an emerging trend in the publication and consumption of LOD datasets. However, the pervasive use of external resources, together with a deficiency in the definition of the internal structure of a dataset, makes many LOD sources extremely complex to understand. In this paper, we describe a formal method to unveil the implicit structure of a LOD dataset by building a (Clustered) Schema Summary. The Schema Summary contains all the main classes and properties used within the dataset, whether they are taken from external vocabularies or not, and is conceivable as an RDFS ontology. The Clustered Schema Summary, suitable for large LOD datasets, provides a higher-level view of the classes and the properties used, gathering together classes that are the object of multiple instantiations.
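As a minimal illustration of what extracting a schema summary involves, the following Python sketch works on toy triples and only counts instantiated classes and per-class property usage; the formal method in the paper does considerably more (clustering, external vocabulary handling):

```python
from collections import Counter

RDF_TYPE = "rdf:type"

def schema_summary(triples):
    """Derive a toy schema summary from (s, p, o) triples: the classes
    actually instantiated and how often each property is used per class."""
    cls_of = {s: o for s, p, o in triples if p == RDF_TYPE}
    classes = Counter(cls_of.values())
    prop_usage = Counter(
        (cls_of[s], p) for s, p, o in triples
        if p != RDF_TYPE and s in cls_of
    )
    return classes, prop_usage

triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:bob",   RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob",   "foaf:name", "Bob"),
    ("ex:bob",   "foaf:knows", "ex:alice"),
]
classes, props = schema_summary(triples)
print(classes, props)
```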


2016 - Keyword-Based Search Over Databases: A Roadmap for a Reference Architecture Paired with an Evaluation Framework [Articolo su rivista]
Bergamaschi, Sonia; Ferro, Nicola; Guerra, Francesco; Silvello, Gianmaria
abstract

Structured data sources promise to be the next driver of a significant socio-economic impact for both people and companies. Nevertheless, accessing them through formal languages, such as SQL or SPARQL, can become cumbersome and frustrating for end-users. To overcome this issue, keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale. In this paper, we motivate the need for a reference architecture for keyword search in databases to favor the development of scalable and effective components, also borrowing methods from neighbor fields, such as information retrieval and natural language processing. Moreover, we point out the need for a companion evaluation framework, able to assess the efficiency and the effectiveness of such new systems in the light of real and compelling use cases.


2016 - Providing Insight into Data Source Topics [Articolo su rivista]
Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Yannis
abstract

A fundamental service for the exploitation of the modern large data sources that are available online is the ability to identify the topics of the data that they contain. Unfortunately, heterogeneity and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates signatures of sources, which are matched against a reference vocabulary of concepts to generate a description of the topics of the source in terms of this reference vocabulary. The reference vocabulary may be provided ready-made, may be created manually, or may be created by applying our signature-generation algorithm over a well-curated data source with a clear identification of topics. In our particular case, we have used DBpedia for the creation of the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to produce semantic identifiers of the various attributes, which, combined together, form a unique signature of the concepts (i.e. the topics) of the source. The generation of the identifiers is based on the entropy of the values of the attributes; thus, they are independent of the naming heterogeneity of attributes or tables. Although the use of traditional information-theoretical quantities such as entropy and mutual information is not new, they may become untrustworthy due to their sensitivity to overfitting, and they require a number of samples equal to that used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight the role of vocabulary items and property values with very low frequencies, resulting in a more stable solution than the traditional counterparts.
We have materialized our theory in a system called WHATSIT and we experimentally demonstrate its effectiveness.
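The pseudo-additive entropies mentioned above belong to the Tsallis family; the Python sketch below (a toy illustration, not the estimator actually used by WHATSIT) contrasts Shannon entropy with the q = 2 Tsallis form on two attribute value distributions. Because the q = 2 form sums squared probabilities, low-frequency values contribute far less to it than under Shannon entropy.

```python
import math
from collections import Counter

def shannon_entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def tsallis_entropy(values, q=2.0):
    """Pseudo-additive (Tsallis) entropy: (1 - sum p_i^q) / (q - 1).
    For q > 1 it downweights low-frequency values relative to Shannon."""
    n = len(values)
    return (1.0 - sum((c / n) ** q for c in Counter(values).values())) / (q - 1.0)

# A categorical attribute versus a near-key attribute:
city = ["Modena"] * 6 + ["Bologna"] * 3 + ["Parma"]
ids = [str(i) for i in range(10)]
print(shannon_entropy(city), shannon_entropy(ids))
print(tsallis_entropy(city), tsallis_entropy(ids))
```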


2015 - Comparing LDA and LSA Topic Models for Content-Based Movie Recommendation Systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura
abstract

We propose a plot-based recommendation system, which is based upon an evaluation of the similarity between the plot of a video that was watched by a user and a large number of plots stored in a movie database. Our system is independent of the number of user ratings, thus it is able to propose famous and beloved movies as well as old or unheard-of movies/programs that are still strongly related to the content of the video the user has watched. The system implements and compares the two topic models, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), on a movie database of two hundred thousand plots that has been constructed by integrating different movie databases in a local NoSQL (MongoDB) DBMS. The topic models' behaviour has been examined on the basis of standard metrics and user evaluations; performance assessments with 30 users have been conducted to compare our tool with a commercial system.


2015 - Exploiting Semantics for Filtering and Searching Knowledge in a Software Development Context [Articolo su rivista]
Bergamaschi, Sonia; Martoglia, Riccardo; Sorrentino, Serena
abstract

Software development is still considered a bottleneck for SMEs (Small and Medium Enterprises) in the advance of the Information Society. Usually, SMEs store and collect a large number of software textual documents; these documents might be profitably used to facilitate them in using (and re-using) Software Engineering methods for systematically designing their applications, thus reducing software development costs. Specific and semantic textual filtering/search mechanisms, supporting the identification of adequate processes and practices for the enterprise needs, are fundamental in this context. To this aim, we present an automatic document retrieval method based on semantic similarity and Word Sense Disambiguation (WSD) techniques. The proposal leverages the strengths of both classic information retrieval and knowledge-based techniques, exploiting the syntactic and semantic information provided by general and specific domain knowledge sources. For any SME, it is as easily and generally applicable as the search techniques offered by common enterprise Content Management Systems (CMSs). Our method was developed within the FACIT-SME European FP-7 project, whose aim is to facilitate the diffusion of Software Engineering methods and best practices among SMEs. As shown by a detailed experimental evaluation, the achieved effectiveness goes well beyond that of typical retrieval solutions.


2015 - Integrazione di dati clinici con il sistema MOMIS [Capitolo/Saggio]
Benedetti, Fabio; Bergamaschi, Sonia; Orsini, Mirko; Magnotta, Luca
abstract

Over the last decade, the need to access distributed information has become increasingly relevant and, with it, the problem of integrating information coming from heterogeneous sources. In the medical field, research institutes and hospitals have at their disposal an ever-growing number of information sources, which may contain data that are correlated but often redundant, heterogeneous and not always consistent. The need, especially for research organizations, is to access in a simple way all the information distributed over the different information systems, and to build applications that use this information in real time, in order to obtain as quickly as possible results that will benefit patients. This article presents the data integration project for the experimental clinical trials conducted by FIL (Fondazione Italiana Linfomi), carried out by the DBGroup research group and the academic spin-off DataRiver. The project concerned the integration of data coming from 3 different information systems, in order to obtain a unified view of the progress of all the trials and to perform dynamic statistical analyses in real time. The clinical trial monitoring tool ("Trial Monitoring tool"), developed on top of the MOMIS data integration system and the MOMIS Dashboard component, allows searching and monitoring aggregated data and visualizing the trial progress results on maps, charts and dynamic tables.


2015 - LODeX: A tool for Visual Querying Linked Open Data [Relazione in Atti di Convegno]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

Formulating a query on a Linked Open Data (LOD) source is not an easy task: technical knowledge of the query language and awareness of the structure of the dataset are essential to create a query. We present a revised version of LODeX that provides the user with an easy way of building queries in a fast and interactive manner. When users decide to explore a LOD source, they can take advantage of the Schema Summary produced by LODeX (i.e. a synthetic view of the dataset’s structure) and pick graphical elements from it to create a visual query. The tool also supports the user in browsing the results and, eventually, in refining the query. The prototype has been evaluated on hundreds of public SPARQL endpoints (listed in Data Hub) and it is available online at http://dbgroup.unimo.it/lodex2. A survey conducted with 27 users has demonstrated that our tool can effectively support both unskilled and skilled users in exploring and querying LOD datasets.


2015 - MOMIS Goes Multimedia: WINDSURF and the Case of Top-K Queries [Relazione in Atti di Convegno]
Bartolini, Ilaria; Beneventano, Domenico; Bergamaschi, Sonia; Ciaccia, Paolo; Corni, Alberto; Orsini, Mirko; Patella, Marco; Santese, Marco Maria
abstract

In a scenario with “traditional” and “multimedia” data sources, this position paper discusses the following question: “How can a multimedia local source (e.g., Windsurf) supporting ranking queries be integrated into a mediator system without such capabilities (e.g., MOMIS)?” More precisely, “How to support ranking queries coming from a multimedia local source within a mediator system with a “traditional” query processor based on an SQL engine?” We first describe a naïve approach for the execution of range and Top-K global queries where the MOMIS query processing method remains substantially unchanged but, in the case of Top-K queries, does not guarantee to obtain K results. We then discuss two alternative modalities for allowing MOMIS to return the Top-K best results of a global query.
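As a sketch of the global Top-K problem discussed here, assuming each local source returns (score, item) pairs, a mediator can merge complete result lists and rank them globally; this toy Python version is not the MOMIS strategy, which must also cope with sources that return only their own Top-K:

```python
import heapq

def global_top_k(source_results, k):
    """Merge ranked (score, item) results from several local sources and
    return the global Top-K, keeping the best score seen per item."""
    merged = {}
    for results in source_results:
        for score, item in results:
            if item not in merged or score > merged[item]:
                merged[item] = score
    return heapq.nlargest(k, ((s, i) for i, s in merged.items()))

src_a = [(0.9, "img1"), (0.7, "img2"), (0.4, "img3")]
src_b = [(0.8, "img4"), (0.7, "img2"), (0.2, "img5")]
print(global_top_k([src_a, src_b], 3))
```

A naive mediator that simply truncates each source's list before merging can return fewer than K distinct results, which is exactly the pitfall the paper's naïve approach exhibits.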


2015 - Multilingual Word Sense Induction to Improve Web Search Result Clustering [Relazione in Atti di Convegno]
Albano, Lorenzo; Beneventano, Domenico; Bergamaschi, Sonia
abstract

In [13] a novel approach to Web search result clustering based on Word Sense Induction, i.e. the automatic discovery of word senses from raw text, was presented; key to the proposed approach is the idea of, first, automatically inducing senses for the target query and, second, clustering the search results based on their semantic similarity to the word senses induced. In [1] we proposed an innovative Word Sense Induction method based on multilingual data; key to our approach was the idea that a multilingual context representation, where the context of the words is expanded by considering its translations in different languages, may improve the WSI results; the experiments showed a clear performance gain. In this paper we give some preliminary ideas to exploit our multilingual Word Sense Induction method for Web search result clustering.


2015 - Multilingual Word Sense Induction to Improve Web Search Result Clustering [Relazione in Atti di Convegno]
Albano, Lorenzo; Beneventano, Domenico; Bergamaschi, Sonia
abstract

In [12] a novel approach to Web search result clustering based on Word Sense Induction, i.e. the automatic discovery of word senses from raw text, was presented; key to the proposed approach is the idea of, first, automatically inducing senses for the target query and, second, clustering the search results based on their semantic similarity to the word senses induced. In [1] we proposed an innovative Word Sense Induction method based on multilingual data; key to our approach was the idea that a multilingual context representation, where the context of the words is expanded by considering its translations in different languages, may improve the WSI results; the experiments showed a clear performance gain. In this paper we give some preliminary ideas to exploit our multilingual Word Sense Induction method for Web search result clustering.


2015 - Open Data for Improving Youth Policies [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Gagliardelli, Luca; Po, Laura
abstract

The Open Data philosophy is based on the idea that certain data should be made available to all citizens, in an open form, without any copyright restrictions, patents or other mechanisms of control. Various governments have started to publish open data, first of all the USA and the UK in 2009, and in 2015 the Open Data Barometer project (www.opendatabarometer.org) states that, of 77 diverse states across the world, over 55 percent have developed some form of Open Government Data initiative. We claim that Public Administrations, which are the main producers and among the consumers of Open Data, might effectively extract important information by integrating their own data with open data sources. This paper reports the activities carried out during a one-year research project on Open Data for Youth Policies. The project was mainly devoted to exploring the youth situation in the municipalities and provinces of the Emilia Romagna region (Italy), in particular examining data on population, education and work. The project goals were: to identify interesting data sources, both from the open data community and from the private repositories of local governments of the Emilia Romagna region, related to the Youth Policies; to integrate them and to show the result of the integration by means of a useful navigator tool; and, in the end, to publish new information on the web as Linked Open Data. This paper also reports the main issues encountered, which may seriously affect the entire process, from the consumption and integration of open data to their publication.


2015 - Perspective Look at Keyword-based Search Over Relation Data and its Evaluation (Extended Abstract) [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Ferro, Nicola; Guerra, Francesco; Silvello, Gianmaria
abstract

This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring the performance of components mostly in isolation or unrelated to actual user needs, and which are instead able to consider the system as a whole, its constituent components, and their inter-relations, with the ultimate goal of supporting actual user search tasks.


2015 - Semantic Annotation of the CEREALAB database by the AGROVOC Linked Dataset [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; Serena, Sorrentino; Vincini, Maurizio; Benedetti, Fabio
abstract

Nowadays, there has been an increase in open government data initiatives promoting the idea that certain data should be freely published. However, the great majority of these resources are published in an unstructured format and are typically accessed only by closed communities. Starting from these considerations, in a previous work related to a youth precariousness dataset, we proposed an experimental and preliminary methodology for facilitating resource providers in publishing public data into the Linked Open Data (LOD) cloud, and for helping consumers (companies and citizens) to efficiently access and query them. Linked Open Data play a central role in accessing and analyzing the rapidly growing pool of life science data and, as discussed in recent meetings, it is important for data source providers themselves to make their resources available as Linked Open Data. In this paper we extend and apply our methodology to the agricultural domain, i.e. to the CEREALAB database, created to store both genotypic and phenotypic data and specifically designed for plant breeding, in order to publish it into the LOD cloud.


2015 - The on-site analysis of the Cherenkov Telescope Array [Relazione in Atti di Convegno]
Bulgarelli, Andrea; Fioretti, Valentina; Zoli, Andrea; Aboudan, Alessio; Rodríguez Vázquez, Juan José; De Cesare, Giovanni; De Rosa, Adriano; Maier, Gernot; Lyard, Etienne; Bastieri, Denis; Lombardi, Saverio; Tosti, Gino; Bergamaschi, Sonia; Beneventano, Domenico; Lamanna, Giovanni; Jacquemier, Jean; Kosack, Karl; Angelo Antonelli, Lucio; Boisson, Catherine; Borkowski, Jerzy; Buson, Sara; Carosi, Alessandro; Conforti, Vito; Colomé, Pep; De Los Reyes, Raquel; Dumm, Jon; Evans, Phil; Fortson, Lucy; Fuessling, Matthias; Gotz, Diego; Graciani, Ricardo; Gianotti, Fulvio; Grandi, Paola; Hinton, Jim; Humensky, Brian; Inoue, Susumu; Knödlseder, Jürgen; Le Flour, Thierry; Lindemann, Rico; Malaguti, Giuseppe; Markoff, Sera; Marisaldi, Martino; Neyroud, Nadine; Nicastro, Luciano; Ohm, Stefan; Osborne, Julian; Oya, Igor; Rodriguez, Jerome; Rosen, Simon; Ribo, Marc; Tacchini, Alessandro; Schüssle, Fabian; Stolarczyk, Thierry; Torresi, Eleonora; Testa, Vincenzo; Wegner, Peter; Weinstein, Amanda
abstract

The Cherenkov Telescope Array (CTA) observatory will be one of the largest ground-based very-high-energy gamma-ray observatories. The On-Site Analysis will be the first CTA scientific analysis of data acquired from the array of telescopes, at both the northern and southern sites. The On-Site Analysis will have two pipelines: the Level-A pipeline (also known as Real-Time Analysis, RTA) and the Level-B one. The RTA performs data quality monitoring and must be able to issue automated alerts on variable and transient astrophysical sources within 30 seconds from the last acquired Cherenkov event that contributes to the alert, with a sensitivity not worse than the one achieved by the final pipeline by more than a factor of 3. The Level-B Analysis has a better sensitivity (not worse than the final one by more than a factor of 2) and its results should be available within 10 hours from the acquisition of the data: for this reason this analysis could be performed at the end of an observation or the next morning. The latency (in particular for the RTA) and the sensitivity requirements are challenging because of the large data rate, a few GByte/s. The remote connection to the CTA candidate site, with a rather limited network bandwidth, makes the issue of the exported data size extremely critical and prevents any kind of real-time processing of the data outside the site of the telescopes. For these reasons the analysis will be performed on-site, with infrastructures co-located with the telescopes, with limited electrical power availability and with a reduced possibility of human intervention. This means, for example, that the on-site hardware infrastructure should have low power consumption. A substantial effort towards the optimization of a high-throughput computing service is envisioned to provide hardware and software solutions with high throughput and low power consumption at a low cost.
This contribution provides a summary of the design of the On-Site Analysis and reports some prototyping activities.


2015 - Visual Querying LOD sources with LODeX [Relazione in Atti di Convegno]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

The Linked Open Data (LOD) Cloud has more than tripled its sources in just three years (from 295 sources in 2011 to 1014 in 2014). While LOD data are being produced at an increasing rate, LOD tools fall short in producing a high-level representation of datasets and in supporting users in the exploration and querying of a source. To overcome these problems and significantly increase the number of consumers of LOD data, we devised a new method and a tool, called LODeX, that promotes the understanding, navigation and querying of LOD sources, both for experts and for beginners. It also provides a standardized and homogeneous summary of LOD sources and supports users in the creation of visual queries on previously unknown datasets. We have extensively evaluated the portability and usability of the tool. LODeX has been tested on the entire set of datasets available at Data Hub, i.e. 302 sources. In this paper, we showcase the usability evaluation of the different features of the tool (the Schema Summary representation and the visual query building) conducted with 27 users (comprising both Semantic Web experts and beginners).


2014 - A Visual Summary for Linked Open Data sources [Relazione in Atti di Convegno]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

In this paper we propose LODeX, a tool that produces a representative summary of a Linked Open Data (LOD) source starting from scratch, thus supporting users in exploring and understanding the contents of a dataset. The tool takes as input the URL of a SPARQL endpoint and launches a set of predefined SPARQL queries; from the results of the queries it generates a visual summary of the source. The summary reports statistical and structural information about the LOD dataset and can be browsed to focus on particular classes or to explore their properties and their use. LODeX was tested on the 137 public SPARQL endpoints contained in Data Hub (formerly CKAN), one of the main Open Data catalogues. The statistical and structural information of the 107 successful extractions is collected and available in the online version of LODeX (http://dbgroup.unimo.it/lodex).


2014 - A prototype for the real-time analysis of the Cherenkov Telescope Array [Relazione in Atti di Convegno]
Andrea, Bulgarelli; Valentina, Fioretti; Andrea, Zoli; Alessio, Aboudan; Juan José Rodríguez, Vázquez; Gernot, Maier; Etienne, Lyard; Denis, Bastieri; Saverio, Lombardi; Gino, Tosti; Adriano De, Rosa; Bergamaschi, Sonia; Matteo, Interlandi; Beneventano, Domenico; Giovanni, Lamanna; Jean, Jacquemier; Karl, Kosack; Lucio Angelo, Antonelli; Catherine, Boisson; Jerzy, Burkowski; Sara, Buson; Alessandro, Carosi; Vito, Conforti; Jose Luis, Contreras; Giovanni De, Cesare; Raquel de los, Reyes; Jon, Dumm; Phil, Evans; Lucy, Fortson; Matthias, Fuessling; Ricardo, Graciani; Fulvio, Gianotti; Paola, Grandi; Jim, Hinton; Brian, Humensky; Jürgen, Knödlseder; Giuseppe, Malaguti; Martino, Marisaldi; Nadine, Neyroud; Luciano, Nicastro; Stefan, Ohm; Julian, Osborne; Simon, Rosen; Alessandro, Tacchini; Eleonora, Torresi; Vincenzo, Testa; Massimo, Trifoglio; Amanda, Weinstein
abstract

The Cherenkov Telescope Array (CTA) observatory will be one of the biggest ground-based very-high-energy (VHE) γ-ray observatories. CTA will achieve a factor of 10 improvement in sensitivity from some tens of GeV to beyond 100 TeV with respect to existing telescopes. The CTA observatory will be capable of issuing alerts on variable and transient sources to maximize the scientific return. To capture these phenomena during their evolution and for effective communication to the astrophysical community, speed is crucial. This requires a system with a reliable automated trigger that can issue alerts immediately upon detection of γ-ray flares. This will be accomplished by means of a Real-Time Analysis (RTA) pipeline, a key system of the CTA observatory. The latency and sensitivity requirements of the alarm system impose a challenge because of the anticipated large data rate, between 0.5 and 8 GB/s. As a consequence, substantial efforts toward the optimization of a high-throughput computing service are envisioned. For these reasons our working group has started the development of a prototype of the Real-Time Analysis pipeline. The main goals of this prototype are to test: (i) a set of frameworks and design patterns useful for the inter-process communication between software processes running in memory; (ii) the sustainability of the foreseen CTA data rate in terms of data throughput with different hardware (e.g. accelerators) and software configurations; (iii) the reuse of non-real-time algorithms, or how much we need to simplify algorithms to be compliant with CTA requirements; (iv) interface issues between the different CTA systems. In this work we focus on goals (i) and (ii).


2014 - Comparing Topic Models for a Movie Recommendation System [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena
abstract

Recommendation systems have become successful at suggesting content that is likely to be of interest to the user; however, their performance greatly suffers when little information about the user's preferences is given. In this paper we propose an automated movie recommendation system based on the similarity of movies: given a target movie selected by the user, the goal of the system is to provide a list of the movies most similar to the target one, without knowing any user preferences. The topic models Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) have been applied and extensively compared on a movie database of two hundred thousand plots. Experiments are an important part of the paper: we examined the topic models' behaviour based on standard metrics and on user evaluations, and we conducted performance assessments with 30 users to compare our approach with a commercial system. The outcome was that the performance of LSA was superior to that of LDA in supporting the selection of similar plots. Even if our system does not outperform commercial systems, it does not rely on human effort, thus it can be ported to any domain where natural language descriptions exist. Since it is independent of the number of user ratings, it is able to suggest famous movies as well as old or little-known movies that are still strongly related to the content of the video the user has watched.


2014 - Discovering the topics of a data source: A statistical approach? [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni
abstract

In this paper, we present a preliminary approach for automatically discovering the topics of a structured data source with respect to a reference ontology. Our technique relies on a signature, i.e., a weighted graph that summarizes the content of a source. Graph-based approaches have already been used in the literature for similar purposes. In these proposals, the weights are typically assigned using traditional information-theoretical quantities such as entropy and mutual information. Here, we propose a novel data-driven technique based on composite likelihood to estimate the weights and other main features of the graphs, making the resulting approach less sensitive to overfitting. By means of a comparison of signatures, we can easily discover the topic of a target data source with respect to a reference ontology. This task is performed by a matching algorithm that retrieves the elements common to both graphs. To illustrate our approach, we discuss a preliminary evaluation in the form of a running example.


2014 - Keyword Search over Relational Databases: Issues, Approaches and Open Challenges [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Simonini, Giovanni
abstract

In this paper, we overview the main research approaches developed in the area of Keyword Search over Relational Databases. In particular, we model the process for solving keyword queries in three phases: the management of the user's input, the search algorithms, and the results returned to the user. For each phase we analyze the main problems, the solutions adopted by the most important systems developed by researchers, and the open challenges. Finally, we introduce two open issues related to multi-source scenarios and to database sources whose instances are not fully accessible.


2014 - LODeX: A visualization tool for Linked Open Data navigation and querying. [Software]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

We present LODeX, a tool for visualizing, browsing and querying a LOD source starting from the URL of its SPARQL endpoint. LODeX creates a visual summary for a LOD dataset and allows users to perform queries on it. Users can select the classes of interest for discovering which instances are stored in the LOD source without any knowledge of the underlying vocabulary used for describing data. The tool couples the overall view of the LOD source with a preview of the instances, so that the user can easily build and refine his/her query. The tool has been evaluated on hundreds of public SPARQL endpoints (listed in Data Hub). The schema summaries of 40 LOD sources are stored and available for online querying at http://dbgroup.unimo.it/lodex2.


2014 - Online Index Extraction from Linked Open Data Sources [Relazione in Atti di Convegno]
Benedetti, Fabio; Bergamaschi, Sonia; Po, Laura
abstract

The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing their quality and extracting synthetic information from a LOD source are all tasks that require a strong human effort. This paper proposes an approach for the automatic extraction of the most representative information from a LOD source and the creation of a set of indexes that enhance the description of the dataset. These indexes collect statistical information regarding the size and the complexity of the dataset (e.g. the number of instances), but also depict all the instantiated classes and the properties among them, supplying users with a synthetic view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity in the knowledge representation of RDF data. An evaluation of LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed, and the effectiveness of the index extraction process is presented.


2014 - PROVENANCE-AWARE SEMANTIC SEARCH ENGINES BASED ON DATA INTEGRATION SYSTEMS [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

Search engines are common tools for virtually every user of the Internet, and companies such as Google and Yahoo! have become household names. Semantic Search Engines try to augment and improve traditional Web Search Engines by using not just words, but concepts and logical relationships. Given the openness of the Web and the different sources involved, a Web Search Engine must evaluate the quality and trustworthiness of the data; a common approach for such assessments is the analysis of the provenance of information. In this paper a relevant class of Provenance-aware Semantic Search Engines, based on a peer-to-peer, data integration mediator-based architecture, is described. The architectural and functional features extend, with provenance, those of the SEWASIE semantic search engine developed within the IST EU SEWASIE project, coordinated by the authors. The methodology to create a two-level ontology and the query processing engine developed within the SEWASIE project, together with the provenance extension, are fully described.


2014 - Towards declarative imperative data-parallel systems? [Relazione in Atti di Convegno]
Interlandi, M.; Simonini, G.; Bergamaschi, S.
abstract

Motivated by recent developments in the fields of declarative networking and data-parallel computation, we propose a first investigation of a declarative imperative parallel programming model that tries to combine the two worlds. We identify a set of requirements that the model should possess and introduce a conceptual sketch of the system implementing the foreseen model.


2014 - Word Sense Induction with Multilingual Features Representation [Relazione in Atti di Convegno]
Lorenzo, Albano; Beneventano, Domenico; Bergamaschi, Sonia
abstract

The use of word senses in place of surface word forms has been shown to improve performance on many computational tasks, including intelligent web search. In this paper we propose a novel approach to the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). Almost all the WSI approaches proposed in the literature deal with monolingual data, and only very few proposals incorporate bilingual data. The WSI method we propose is innovative in that it uses multilingual data to perform WSI of words in a given language. The experiments show a clear overall improvement in performance: the single-language setting is outperformed by the multi-language settings on almost all the considered target words. The performance gain, in terms of F-Measure, has an average value of 5% and in some cases reaches 40%.


2013 - A mediator-based approach for integrating heterogeneous multimedia sources [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; C., Gennaro; F., Rabitti
abstract

In many applications, the information required by the user cannot be found in just one source, but has to be retrieved from many varying sources. This is true not only of formatted data in database management systems, but also of textual documents and multimedia data, such as images and videos. We propose a mediator system that provides the end-user with a single query interface to an integrated view of multiple heterogeneous data sources. We exploit the capabilities of the MOMIS integration system and the MILOS multimedia data management system. Each multimedia source is managed by an instance of MILOS, in which a collection of multimedia records is made accessible by means of similarity searches employing the query-by-example paradigm. MOMIS provides an integrated virtual view of the underlying multimedia sources, thus offering unified multimedia access services. Two notable features are that MILOS is flexible, since it is not tied to any particular similarity function, and that the MOMIS mediator query processor exploits only the ranks of the local answers.


2013 - A semantic multi-lingual method for publishing linked open data [Relazione in Atti di Convegno]
Sorrentino, S.; Bergamaschi, S.; Fusari, E.
abstract

Nowadays, there has been a growth of open data initiatives promoting the free publication of data produced by public administrations (such as public spending, health care, education, etc.). However, the great majority of these data are published in an unstructured format (such as spreadsheets or CSV) and are typically accessed only by closed communities. To address this problem, we propose a semi-automatic multi-lingual and semantic method for facilitating resource providers in publishing public data into the Linked Open Data (LOD) cloud, and for helping consumers (companies and citizens) in efficiently accessing and querying them. The method has been applied in a real case to a set of data provided in Italian.


2013 - An iPad Order Management System for Fashion Trade [Relazione in Atti di Convegno]
I., Baroni; Bergamaschi, Sonia; Po, Laura
abstract

The fashion industry loves the new tablets. In 2011 we noted a 38% growth of e-commerce in the Italian fashion industry. A large number of brands have understood the value of mobile devices as the key channel for consumer communication. The interest of brands in mobile marketing applications and services has made a big step forward, with an increase of 129% in 2011 (osservatori.net, 2012). This paper presents a mobile version of the Fashion OMS (Order Management System) web application. Fashion Touch is a mobile application that allows clients and the company's sales network to process commercial orders, consult the product catalog and manage customers as the OMS web version does, with the added functionality of an off-line order entry mode. To develop an effective mobile App, we started by analyzing the new web technologies for mobile applications (HTML5, CSS3, Ajax) and their related development frameworks, making a comparison with Apple's native programming language. We selected Titanium, a multi-platform framework for developing native mobile and desktop applications via web technologies, as the best framework for our purpose. We faced issues concerning network synchronization and studied different database solutions depending on the device hardware characteristics and performance. This paper reports every aspect of the App development up to the publication on the App Store.


2013 - Datalog in time and space, synchronously [Relazione in Atti di Convegno]
Interlandi, M.; Tanca, L.; Bergamaschi, S.
abstract

Motivated by recent developments of Datalog-based languages for highly distributed systems [9], in this paper we introduce a version of Datalog specifically tailored for distributed programming [2] in synchronous settings, along with its operational and declarative semantics.


2013 - Keyword Search and Evaluation over Relational Databases: an Outlook to the Future [Relazione in Atti di Convegno]
Bergamaschi, Sonia; N., Ferro; Guerra, Francesco; G., Silvello
abstract

This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring the performance of components mostly in isolation or without relation to actual user needs, and which are instead able to consider the system as a whole, its constituent components, and their inter-relations, with the ultimate goal of supporting actual user search tasks.


2013 - NORMS: an automatic tool to perform schema label normalization [Software]
Bergamaschi, Sonia; Gawinecki, Maciej; Sorrentino, Serena
abstract

Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and structure). Schema matching systems usually exploit lexical and semantic information provided by lexical databases/thesauri to discover intra/inter semantic relationships among schema elements. However, most of them obtain poor performance in real scenarios due to the significant presence of “non-dictionary words” in real-world schemata. Non-dictionary words include compound nouns, abbreviations and acronyms. In this paper, we present NORMS (NORmalizer of Schemata), a tool performing schema label normalization to increase the number of comparable labels extracted from schemata.


2013 - QUEST: A Keyword Search System for Relational Data based on Semantic and Machine Learning Techniques [Articolo su rivista]
Bergamaschi, Sonia; Guerra, Francesco; Interlandi, Matteo; Trillo Lado, R.; Velegrakis, Y.
abstract

We showcase QUEST (QUEry generator for STructured sources), a search engine for relational databases that combines semantic and machine learning techniques for transforming keyword queries into meaningful SQL queries. The search engine relies on two approaches: the forward approach, providing mappings of keywords into database terms (names of tables and attributes, and domains of attributes), and the backward approach, computing the paths joining the data structures identified in the forward step. The results provided by the two approaches are combined within a probabilistic framework based on the Dempster-Shafer Theory. We demonstrate QUEST's capabilities, and we show how, thanks to the flexibility obtained by the probabilistic combination of different techniques, QUEST is able to compute high-quality results even with little training data and/or with hidden data sources such as those found in the Deep Web.


2013 - Semantic Annotation and Publication of Linked Open Data [Relazione in Atti di Convegno]
Sorrentino, Serena; Bergamaschi, Sonia; Elisa, Fusari; Beneventano, Domenico
abstract

Nowadays, there has been a growth of open government data initiatives promoting the idea that particular data produced by public administrations (such as public spending, health care, education, etc.) should be freely published. However, the great majority of these resources are published in an unstructured format (such as spreadsheets or CSV) and are typically accessed only by closed communities. Starting from these considerations, we propose a semi-automatic experimental methodology for facilitating resource providers in publishing public data into the Linked Open Data (LOD) cloud, and for helping consumers (companies and citizens) in efficiently accessing and querying them. We present a preliminary method for publishing, linking and semantically enriching open data by performing automatic semantic annotation of schema elements. The methodology has been applied to a set of data provided by the Research Project on Youth Precariousness of the Modena municipality, Italy.


2013 - Semantic Annotation of the CEREALAB Database by the AGROVOC Linked Dataset [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Sorrentino, Serena
abstract

The objective of the CEREALAB database is to help breeders in choosing molecular markers associated with the most important traits. Phenotypic and genotypic data obtained from the integration of open source databases with the data obtained by the CEREALAB project are made available to the users. The CEREALAB database has been and is currently extensively used within the frame of the CEREALAB project. This paper presents the main achievements and the ongoing research to annotate the CEREALAB database and to publish it in the Linking Open Data network, in order to facilitate breeders and geneticists in searching and exploiting linked agricultural resources. One of the main focuses of this paper is to discuss the use of the AGROVOC Linked Dataset both to annotate the CEREALAB schema and to discover schema-level mappings among the CEREALAB Dataset and other resources of the Linking Open Data network, such as NALT, the National Agricultural Library Thesaurus, and DBpedia.


2013 - Semantic Integration of heterogeneous data sources in the MOMIS Data Transformation System [Articolo su rivista]
Vincini, Maurizio; Bergamaschi, Sonia; Beneventano, Domenico
abstract

In the last twenty years, many data integration systems following a classical wrapper/mediator architecture and providing a Global Virtual Schema (a.k.a. Global Virtual View - GVV) have been proposed by the research community. The main issues faced by these approaches range from system-level heterogeneities to structural, syntactic, and semantic heterogeneities. Despite the research effort, all the approaches proposed require a lot of user intervention for customizing and managing the data integration and reconciliation tasks. In some cases, the effort and the complexity of the task are huge, since they require the development of specific programming code. Unfortunately, due to the specificity to be addressed, application code and solutions are not frequently reusable in other domains. For this reason, the Lowell Report 2005 provided the guidelines for the definition of a public benchmark for the information integration problem. The proposal, called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches), focuses on how data integration systems manage syntactic and semantic heterogeneities, which definitely are the greatest technical challenges in the field. We developed a Data Transformation System (DTS) that supports data transformation functions and produces query translations in order to push query execution down to the sources. Our DTS is based on MOMIS, a mediator-based data integration system that our research group has been developing and supporting since 1999. In this paper, we show how the DTS is able to solve all twelve queries of the THALIA benchmark by using a simple combination of declarative translation functions already available in the standard SQL language. We think that this is a remarkable result, mainly for two reasons: firstly, to the best of our knowledge, no other system has provided a complete answer to the benchmark; secondly, our queries do not require the overhead of any new code.




2013 - Using a HMM based approach for mapping keyword queries into database terms [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; M., Interlandi; S., Rota; R., Trillo; Y., Velegrakis
abstract

Systems translating keyword queries into SQL queries over relational databases are usually referred to in the literature as schema-based approaches. These techniques exploit the information contained in the database schema to build SQL queries that express the intended meaning of the user query. Moreover, they typically perform a preliminary step that associates keywords in the user query with database elements (names of tables and attributes, and domains of attributes). In this paper, we present a probabilistic approach based on a Hidden Markov Model to provide such mappings. In contrast to most existing techniques, our proposal does not require any a priori knowledge of the database extension.


2013 - Visual query specification and interaction with industrial engineering data [Relazione in Atti di Convegno]
Malagoli, A.; Leva, M.; Kimani, S.; Russo, A.; Mecella, M.; Bergamaschi, S.; Catarci, T.
abstract

Nowadays, industrial engineering environments are typically characterized by sensors which stream massive amounts of different types of data. It is often difficult for industrial engineers to query, interact with, and interpret the data. In order to process the many kinds of distributed data streams originating from different types of data sources, a distributed federated data stream management system (FDSMS) is necessary. Although there exist some research efforts aimed at providing visual interfaces for querying temporal and real-time data, there is virtually no existing work that provides a visual query specification and interaction interface that directly corresponds to a distributed federated data stream management system. This paper describes a visual environment that supports users in visually specifying queries and interacting with industrial engineering data. The visual environment comprises a visual query specification and interaction environment, and a corresponding visual query language that runs on top of a distributed FDSMS. © 2013 Springer-Verlag.


2013 - Visually querying and accessing data streams in industrial engineering applications [Relazione in Atti di Convegno]
Leva, M.; Mecella, M.; Russo, A.; Catarci, T.; Bergamaschi, S.; Malagoli, A.; Melander, L.; Risch, T.; Xu, C.
abstract

Managing the complexity of highly specialized products in industrial environments today requires the ability to handle the different data streams produced in all phases of the product lifecycle. Data stream management systems and stream query languages represent a viable solution for processing and accessing large data streams to support analytical tasks, as a basis for improving human collaboration and decision-making. However, the complexity of existing textual stream query languages often prevents unskilled users from effectively exploiting the available tools. In this paper we present a visual query system that allows users to graphically build queries over data streams and traditional relational data. It builds on top of the SCSQ federated data stream management system, providing configurable tools for the graphical visualization of query results. The system is being validated in real industrial scenarios, by using the presented tools to analyze data streams produced by industrial machines.


2012 - A Semantic Method for Searching Knowledge in a Software Development Context [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Martoglia, Riccardo; Sorrentino, Serena
abstract

The FACIT-SME European FP-7 project aims to facilitate the use and sharing of Software Engineering (SE) methods and best practices among software-developing SMEs. In this context, we present an automatic semantic document searching method based on Word Sense Disambiguation which exploits both syntactic and semantic information provided by external dictionaries and is easily applicable to any SME.


2012 - A Supervised Method for Lexical Annotation of Schema Labels based on Wikipedia [Relazione in Atti di Convegno]
Sorrentino, Serena; Bergamaschi, Sonia; Elena, Parmiggiani
abstract

Lexical annotation is the process of explicit assignment of one or more meanings to a term w.r.t. a sense inventory (e.g., a thesaurus or an ontology). We propose an automatic supervised lexical annotation method, called ALATK (Automatic Lexical Annotation - Topic Kernel), based on the Topic Kernel function, for the annotation of schema labels extracted from structured and semi-structured data sources. It exploits Wikipedia as a sense inventory and as a resource of training data.


2012 - A meta-language for MDX queries in eLog Business Solution [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Interlandi, Matteo; Mario, Longo; Po, Laura; Vincini, Maurizio
abstract

The adoption of business intelligence technology in industries is growing rapidly. Business managers are not satisfied with ad hoc and static reports and they ask for more flexible and easy-to-use data analysis tools. Recently, application interfaces that expand the range of operations available to the user, hiding the underlying complexity, have been developed. The paper presents eLog, a business intelligence solution designed and developed in collaboration between the database group of the University of Modena and Reggio Emilia and eBilling, an Italian SME supplier of solutions for the design, production and automation of documentary processes for top Italian companies. eLog enables business managers to define OLAP reports by means of a web interface and to customize analysis indicators adopting a simple meta-language. The framework translates the user's reports into MDX queries and is able to automatically select the data cube suitable for each query. Over 140 medium and large companies have exploited the technological services of eBilling S.p.A. to manage their document flows. In particular, eLog services have been used by the major Italian media and telecommunications companies and their foreign annexes, such as Sky, Mediaset, H3G, Tim Brazil etc. The largest customer can produce up to 30 million mail pieces within 6 months (about 200 GB of data in the relational DBMS). In a period of 18 months, eLog could reach 150 million mail pieces (1 TB of data) to handle.


2012 - A non-intrusive movie recommendation system [Relazione in Atti di Convegno]
Farinella, Tania; Bergamaschi, Sonia; Po, Laura
abstract

Several recommendation systems have been developed to support the user in choosing an interesting movie from multimedia repositories. The widely utilized collaborative-filtering systems focus on the analysis of user profiles or user ratings of the items. However, these systems decrease their performance at the start-up phase and, due to privacy issues, when a user hides most of his personal data. On the other hand, content-based recommendation systems compare movie features to suggest similar multimedia contents; these systems are based on less invasive observations, however they find it difficult to supply tailored suggestions. In this paper, we propose a plot-based recommendation system, which is based upon an evaluation of similarity between the plot of a video that was watched by the user and a large amount of plots stored in a movie database. Since it is independent of the number of user ratings, it is able to propose famous and beloved movies as well as old or little-known movies/programs that are still strongly related to the content of the video the user has watched. We experimented with different methodologies to compare natural language descriptions of movies (plots) and found Latent Semantic Analysis (LSA) to be the superior one in supporting the selection of similar plots. In order to increase the efficiency of LSA, different models have been experimented with, and in the end a recommendation system able to compare about two hundred thousand movie plots in less than a minute has been developed.


2012 - Dimension matching in Peer-to-Peer Data Warehousing [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Marius Octavian, Olaru; Sorrentino, Serena; Vincini, Maurizio
abstract

During the last decades, the Data Warehouse has been one of the main components of a Decision Support System (DSS) inside a company. Given the great diffusion of Data Warehouses nowadays, managers have realized that there is great potential in combining information coming from multiple information sources, like heterogeneous Data Warehouses from companies operating in the same sector. Existing solutions rely mostly on the Extract-Transform-Load (ETL) approach, a costly and complex process. The process of Data Warehouse integration can be greatly simplified by developing a method that is able to semi-automatically discover semantic relationships among attributes of two or more different, heterogeneous Data Warehouse schemas. In this paper, we propose a method for the semi-automatic discovery of mappings between dimension hierarchies of heterogeneous Data Warehouses. Our approach exploits techniques from the Data Integration research area by combining topological properties of dimensions with semantic techniques.


2012 - FACIT-SME - Facilitate IT-providing SMEs by Operation-related Models and Methods [Software]
Bergamaschi, Sonia; Beneventano, Domenico; Martoglia, Riccardo
abstract

The FACIT-SME project addresses SMEs operating in the ICT domain. The goals are (a) to facilitate the use of Software Engineering (SE) methods and to systematize their application integrated with the business processes, (b) to provide efficient and affordable certification of these processes according to internationally accepted standards, and (c) to securely share best practices, tools and experiences with development partners and customers. The project targets (1) to develop a novel Open Reference Model (ORM) for ICT SMEs, serving as a knowledge backbone in terms of procedures, documents, tools and deployment methods; (2) to develop a customisable Open Source Enactment System (OSES) that provides IT support for the project-specific application of the ORM; and (3) to evaluate these developments with 5 ICT SMEs by establishing the ORM, the OSES and preparing the certifications. The approach combines and amends achievements from Model Generated Workplaces, Certification of SE for SMEs, and model-based document management. The consortium is shaped by 4 significant SME associations as well as a European association exclusively focused on the SME community in the ICT sector. Five R&D partners provide the required competences. Five SMEs operating in the ICT domain will evaluate the results in daily-life application. The major impact is expected for ICT SMEs by (a) optimising their processes based on best practice; (b) achieving internationally accepted certification; and (c) provision of structured reference knowledge. They will improve implementation projects and make their solutions more appealing to SMEs. ICT SME communities (organized by associations) will experience significant benefit through the exchange of recent knowledge and best practices. By providing clear assets (ORM and OSES), the associations shape the service offering to their members and strengthen their community. The use of Open Source will further facilitate the spread of the results across European SMEs.


2012 - Integration and Provenance of Cereals Genotypic and Phenotypic Data [Poster]
Beneventano, Domenico; Bergamaschi, Sonia; Abdul Rahman, Dannaoui; Pecchioni, Nicola
abstract

This paper presents the ongoing research on the design and development of a Provenance Management component, PM_MOMIS, for the MOMIS Data Integration System. MOMIS has been developed by the DBGROUP of the University of Modena and Reggio Emilia (www.dbgroup.unimore.it). An open source version of the MOMIS system is delivered and maintained by the academic spin-off DataRiver (www.datariver.it). PM_MOMIS aims to provide the provenance management techniques supported by two of the most relevant data provenance systems, the "Perm" and "Trio" systems, and extends them by including the data fusion and conflict resolution techniques provided by MOMIS. PM_MOMIS functionalities have been studied and partially developed in the domain of genotypic and phenotypic cereal-data management within the CEREALAB project. The CEREALAB Data Integration Application integrates data coming from different databases with MOMIS, with the aim of creating a powerful tool for plant breeders and geneticists. Users of CEREALAB played a major role in the emergence of real needs of provenance management in their domain. We defined the provenance for the "full outerjoin-merge" operator, used in MOMIS to solve conflicts among values; this definition is based on the concept of "PI-CS-provenance" of the "Perm" system. We are using the "Perm" system as the SQL engine of MOMIS, so as to obtain the provenance in our CEREALAB Application. The main drawback of this solution is that conflicting values often represent alternatives; hence our proposal is to consider the output of the "full outerjoin-merge" operator as an uncertain relation and manage it with a system that supports uncertain data and data lineage, the "Trio" system.
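The data-fusion-with-provenance idea described above can be sketched in a few lines. This is an illustrative toy, not PM_MOMIS, Perm, or Trio code: the function name, source labels, and the max-based resolution policy are all invented here. Two source records for the same entity are fused, a conflicting attribute value is resolved by a resolution function, and the contributing source tuples are recorded as the fused tuple's provenance.

```python
def fuse(key, rec_a, rec_b, resolve):
    """Merge two records for the same entity; return (fused record, provenance)."""
    fused = {"id": key}
    for attr in sorted(set(rec_a) | set(rec_b)):
        va, vb = rec_a.get(attr), rec_b.get(attr)
        if va is not None and vb is not None and va != vb:
            fused[attr] = resolve(va, vb)  # conflict: apply the resolution function
        else:
            fused[attr] = va if va is not None else vb  # no conflict: keep known value
    # Provenance: the source tuples that contributed to the fused tuple.
    provenance = [("source_A", key), ("source_B", key)]
    return fused, provenance

# Toy records with one conflicting measurement; resolution policy: keep the maximum.
row, prov = fuse("wheat_42",
                 {"trait": "height", "value": 80},
                 {"trait": "height", "value": 85},
                 resolve=max)
print(row, prov)  # fused tuple plus its lineage
```

A real system would instead track which source actually supplied each resolved value; the flat provenance list above is the simplest possible placeholder for that bookkeeping.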


2012 - Integration and Provenance of Cereals Genotypic and Phenotypic Data [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Abdul Rahman, Dannaoui
abstract

This paper presents the ongoing research on the design and development of a Provenance Management component, PM_MOMIS, for the MOMIS Data Integration System. PM_MOMIS aims to provide the provenance management techniques supported by two of the most relevant data provenance systems, the Perm and Trio systems, and extends them by including the data fusion and conflict resolution techniques provided by MOMIS. PM_MOMIS functionalities have been studied and partially developed in the domain of genotypic and phenotypic cereal-data management within the CEREALAB project. The CEREALAB Data Integration Application integrates data coming from different databases with MOMIS, with the aim of creating a powerful tool for plant breeders and geneticists. Users of CEREALAB played a major role in the emergence of real needs of provenance management in their domain.


2012 - ODBASE 2012 PC co-chairs message [Relazione in Atti di Convegno]
Bergamaschi, S.; Cruz, I.
abstract

We are happy to present the papers of the 11th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE) held in Rome (Italy) on September 11th and 12th, 2012. The ODBASE Conference series provides a forum for research on the use of ontologies and data semantics in novel applications, and continues to draw a highly diverse body of researchers and practitioners. ODBASE is part of On the Move to Meaningful Internet Systems (OnTheMove) that co-locates three conferences: ODBASE, DOA-SVI (International Symposium on Secure Virtual Infrastructures), and CoopIS (International Conference on Cooperative Information Systems). Of particular interest in the 2012 edition of the ODBASE Conference are the research and practical experience papers that bridge across traditional boundaries between disciplines such as databases, networking, mobile systems, artificial intelligence, information retrieval, and computational linguistics. In this edition, we received 52 paper submissions and had a program committee of 82 people, which included researchers and practitioners from diverse research areas. Special arrangements were made during the review process to ensure that each paper was reviewed by members of different research areas. The result of this effort is the selection of high-quality papers: fifteen as regular papers (29%), six as short papers (12%), and three as posters (6%). Their themes included studies and solutions to a number of modern challenges such as search and management of linked data and RDF documents, modeling, management, alignment and storing of ontologies, application of mining techniques, semantics discovery, and data uncertainty management. © 2012 Springer-Verlag.


2012 - On the Move to Meaningful Internet Systems: OTM 2012 [Curatela]
Bergamaschi, Sonia
abstract

The two-volume set LNCS 7565 and 7566 constitutes the refereed proceedings of three confederated international conferences: Cooperative Information Systems (CoopIS 2012), Distributed Objects and Applications - Secure Virtual Infrastructures (DOA-SVI 2012), and Ontologies, DataBases and Applications of SEmantics (ODBASE 2012) held as part of OTM 2012 in September 2012 in Rome, Italy. The 53 revised full papers presented were carefully reviewed and selected from a total of 169 submissions. The 22 full papers included in the first volume constitute the proceedings of CoopIS 2012 and are organized in topical sections on business process design; process verification and analysis; service-oriented architectures and cloud; security, risk, and prediction; discovery and detection; collaboration; and 5 short papers.


2012 - The CEREALAB Database: Ongoing Research and Future Challenges [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Dannaoui, Abdul Rahman; Milc, Justyna Anna; Pecchioni, Nicola; Sorrentino, Serena
abstract

The objective of the CEREALAB database is to help breeders in choosing molecular markers associated with the most important traits. Phenotypic and genotypic data obtained from the integration of open source databases with the data obtained by the CEREALAB project are made available to the users. The first version of the CEREALAB database has been extensively used within the frame of the CEREALAB project. This paper presents the main achievements and the ongoing research related to the CEREALAB database. First, as a result of the extensive use of the CEREALAB database, several extensions and improvements to the web application user interface were introduced. Second, again driven by end-user needs, the notion of provenance was introduced and partially implemented in the context of the CEREALAB database. Third, we describe some preliminary ideas to annotate the CEREALAB database and to publish it in the Linking Open Data network.


2012 - Understanding the Semantics of Keyword Queries on Relational Data Without Accessing the Instance [Capitolo/Saggio]
Bergamaschi, Sonia; Domnori, Elton; Guerra, Francesco; Rota, Silvia; Raquel Trillo, Lado; Yannis, Velegrakis
abstract

This chapter deals with the problem of answering a keyword query over a relational database. To do so, one needs to understand the meaning of the keywords in the query, “guess” its possible semantics, and materialize them as SQL queries that can be executed directly on the relational database. The focus of the chapter is on techniques that do not require any prior access to the instance data, making them suitable for sources behind wrappers or Web interfaces or, in general, for sources that disallow prior access to their data in order to construct an index. The chapter describes two techniques that use semantic information and metadata from the sources, alongside the query itself, in order to achieve that. Apart from understanding the semantics of the keywords themselves, the techniques also exploit the order and the proximity of the keywords in the query to make a more educated guess. The first approach is based on an extension of the Hungarian algorithm for identifying the data structures having the maximum likelihood to contain the user keywords. In the second approach, the problem of associating keywords with data structures of the relational source is modeled by means of a hidden Markov model, and the Viterbi algorithm is exploited for computing the mappings. Both techniques have been implemented in two systems called KEYMANTIC and KEYRY, respectively.


2012 - Working in a dynamic environment: the NeP4B approach as a MAS [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Mandreoli, Federica; Vincini, Maurizio
abstract

Integration of heterogeneous information in the context of the Internet is becoming a key activity to enable a more organized and semantically meaningful access to several kinds of information in the form of data sources, multimedia documents and web services. In NeP4B (Networked Peers for Business), a project funded by the Italian Ministry of University and Research, we developed an approach for providing a uniform representation of data, multimedia and services, thus allowing users to obtain sets of data, multimedia documents and lists of web services as query results. NeP4B is based on a P2P network of semantic peers, connected with each other by means of automatically generated mappings. In this paper we present a new architecture for NeP4B, based on a Multi-Agent System. We claim that such a solution may be more efficient and effective, thanks to the agents’ autonomy and intelligence, in a dynamic environment where sources are frequently added to (or deleted from) the network.


2011 - 2nd International Workshop on Data Engineering meets the Semantic Web [Esposizione]
Guerra, Francesco; Bergamaschi, Sonia
abstract

The goal of DESWeb is to bring together researchers and practitioners from both fields of Data Management and Semantic Web. It aims at investigating the new challenges that Semantic Web technologies have introduced and new ways through which these technologies can improve existing data management solutions. Furthermore, it intends to study what data management systems and technologies can offer in order to improve the scalability and performance of Semantic Web applications.


2011 - A Hidden Markov Model Approach to Keyword-Based Search over Relational Databases [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Rota, Silvia; Yannis, Velegrakis
abstract

We present a novel method for translating keyword queries over relational databases into SQL queries with the same intended semantic meaning. In contrast to the majority of the existing keyword-based techniques, our approach does not require any a-priori knowledge of the data instance. It follows a probabilistic approach based on a Hidden Markov Model for computing the top-K best mappings of the query keywords into the database terms, i.e., tables, attributes and values. The mappings are then used to generate the SQL queries that are executed to produce the answer to the keyword query. The method has been implemented into a system called KEYRY (from KEYword to queRY).
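The decoding step such an approach relies on can be sketched with the textbook Viterbi algorithm (this is not the KEYRY implementation; the states, probabilities, and keywords below are invented toy values): each query keyword is an observation, and the hidden states are the kinds of database terms it can map to.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, best state path) for an observation sequence."""
    # V[t][s]: probability of the best path that ends in state s at step t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # choose the predecessor state maximizing the path probability
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs[t], 0.0), p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, last = max((V[-1][s], s) for s in states)
    return prob, path[last]

# Toy model: each keyword is emitted either by a table name or an attribute name.
states = ["table", "attribute"]
start_p = {"table": 0.6, "attribute": 0.4}
trans_p = {"table": {"table": 0.3, "attribute": 0.7},
           "attribute": {"table": 0.4, "attribute": 0.6}}
emit_p = {"table": {"person": 0.5, "name": 0.1},
          "attribute": {"person": 0.1, "name": 0.5}}

prob, best = viterbi(["person", "name"], states, start_p, trans_p, emit_p)
print(best)  # -> ['table', 'attribute']
```

KEYRY computes the top-K such paths (List Viterbi) rather than only the single best one shown here.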


2011 - A Semantic Approach to ETL Technologies [Articolo su rivista]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; Claudio, Sartori; Vincini, Maurizio
abstract

Data warehouse architectures rely on extraction, transformation and loading (ETL) processes for the creation of an updated, consistent and materialized view of a set of data sources. In this paper, we aim to support these processes by proposing a tool for the semi-automatic definition of inter-attribute semantic mappings and transformation functions. The tool is based on semantic analysis of the schemas for the mapping definitions amongst the data sources and the data warehouse, and on a set of clustering techniques for defining transformation functions homogenizing data coming from multiple sources. Our proposal couples and extends the functionalities of two previously developed systems: the MOMIS integration system and the RELEVANT data analysis system.


2011 - A Web Platform for Collaborative Multimedia Content Authoring Exploiting Keyword Search Engine and Data Cloud [Articolo su rivista]
Bergamaschi, Sonia; Interlandi, Matteo; Vincini, Maurizio
abstract

The composition of multimedia presentations is a time- and resource-consuming task if not afforded in a well-defined manner. This is particularly true when people having different roles and following different high-level directives collaborate in the authoring and assembling of a final product. For this reason we adopt the Select, Assemble, Transform and Present (SATP) approach to coordinate the presentation authoring and a tag cloud-based search engine in order to help users in efficiently retrieving useful assets. In this paper we present MediaPresenter, the framework we developed to support companies in the creation of multimedia communication means, providing an instrument that users can exploit every time new communication channels have to be created.


2011 - A genotypic and phenotypic information source for marker-assisted selection of cereals: the CEREALAB database [Articolo su rivista]
Milc, Justyna Anna; Sala, Antonio; Bergamaschi, Sonia; Pecchioni, Nicola
abstract

The CEREALAB database aims to store genotypic and phenotypic data obtained by the CEREALAB project and to integrate them with already existing data sources in order to create a tool for plant breeders and geneticists. The database can help them in unravelling the genetics of economically important phenotypic traits; in identifying and choosing molecular markers associated to key traits; and in choosing the desired parentals for breeding programs. The database is divided into three sub-schemas corresponding to the species of interest: wheat, barley and rice; each sub-schema is then divided into two sub-ontologies, regarding genotypic and phenotypic data, respectively.


2011 - A web-based platform for multimedia content authoring exploiting keyword search engine and data cloud [Relazione in Atti di Convegno]
Bergamaschi, Sonia; F., Ferrari; Interlandi, Matteo; Vincini, Maurizio
abstract

The composition of multimedia presentations is a time- and resource-consuming task if not afforded in a well-defined manner. This is particularly true when people having different roles and following different high-level directives collaborate in the authoring and assembling of a final product. For this reason we adopt the Select, Assemble, Transform and Present (SATP) approach to coordinate the presentation authoring and a tag cloud-based search engine in order to help users in efficiently retrieving useful assets. In this paper we present MediaPresenter, the framework we developed to support companies in the creation of multimedia communication means, providing an instrument that users can exploit every time new communication channels have to be created.


2011 - Automatic Normalization and Annotation for Discovering Semantic Mappings [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Beneventano, Domenico; Po, Laura; Sorrentino, Serena
abstract

Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and in structure). Starting from the “hidden meaning” associated to schema labels (i.e. class/attribute names) it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) helps in associating a “meaning” to schema labels. However, the accuracy of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns and word abbreviations. In this work, we address this problem by proposing a method to perform schema labels normalization which increases the number of comparable labels. Unlike other solutions, the method semi-automatically expands abbreviations and annotates compound terms with minimal manual effort. We empirically prove that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching accuracy.


2011 - Data Integration [Capitolo/Saggio]
Bergamaschi, Sonia; Beneventano, Domenico; Guerra, Francesco; Orsini, Mirko
abstract

Given the many data integration approaches, a complete and exhaustive comparison of all the research activities is not possible. In this chapter, we will present an overview of the most relevant research activities and ideas in the field investigated in the last 20 years. We will also introduce the MOMIS system, a framework to perform information extraction and integration from both structured and semistructured data sources, that is one of the most interesting results of our research activity. An open source version of the MOMIS system was delivered by the academic startup DataRiver (www.datariver.it).


2011 - Information Systems: Editorial [Articolo su rivista]
Carlo, Batini; Beneventano, Domenico; Bergamaschi, Sonia; Tiziana, Catarci
abstract

Research efforts on structured data, multimedia, and services have involved non-overlapping communities. However, from a user perspective, the three kinds of information should behave and be accessed similarly. Instead, a user has to deal with different tools in order to gain a complete knowledge about a domain. There is no integrated view comprising data, multimedia and services retrieved by the specific tools that is automatically computed. A unified approach for dealing with different kinds of information may allow searches across different domains and different starting points/results in the searching processes. Multiple and challenging research issues have to be addressed to achieve this goal, including: mediating among different models for representing information, developing new techniques for extracting and mapping relevant information from heterogeneous kinds of data, devising innovative paradigms for formulating and processing queries ranging over both (multimedia) data and services, and investigating new models for visualizing the results and allowing the user to easily manipulate them. This special issue "Semantic Integration of Data, Multimedia, and Services" presents advances in data, multimedia, and services interoperability.


2011 - KEYRY: A Keyword-Based Search Engine over Relational Databases Based on a Hidden Markov Model [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Rota, Silvia; Yannis, Velegrakis
abstract

We propose the demonstration of KEYRY, a tool for translating keyword queries over structured data sources into queries in the native language of the data source. KEYRY does not assume any prior knowledge of the source contents. This allows it to be used in situations where traditional keyword search techniques over structured data that require such knowledge cannot be applied, i.e., sources on the hidden web or those behind wrappers in integration systems. In KEYRY the search process is modeled as a Hidden Markov Model and the List Viterbi algorithm is applied to compute the top-k queries that better represent the intended meaning of a user keyword query. We demonstrate the tool’s capabilities, and we show how the tool is able to improve its behavior over time by exploiting implicit user feedback provided through the selection among the top-k solutions generated.


2011 - Keyword search over relational databases: a metadata approach [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Domnori, Elton; Guerra, Francesco; Raquel Trillo, Lado; Yannis, Velegrakis
abstract

Keyword queries offer a convenient alternative to traditional SQL in querying relational databases with large, often unknown, schemas and instances. The challenge in answering such queries is to discover their intended semantics, construct the SQL queries that describe them, and use them to retrieve the respective tuples. Existing approaches typically rely on indices built a-priori on the database content. This seriously limits their applicability if a-priori access to the database content is not possible. Examples include on-line databases accessed through web interfaces, or the sources in information integration systems that operate behind wrappers with specific query capabilities. Furthermore, existing literature has not studied to its full extent the inter-dependencies across the ways the different keywords are mapped into the database values and schema elements. In this work, we describe a novel technique for translating keyword queries into SQL based on the Munkres (a.k.a. Hungarian) algorithm. Our approach not only tackles the above two limitations, but it offers significant improvements in the identification of the semantically meaningful SQL queries that describe the intended keyword query semantics. We provide details of the technique implementation and an extensive experimental evaluation.
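The assignment problem underlying this technique can be illustrated in a few lines. The sketch below is not the paper's code: it brute-forces the best one-to-one keyword-to-term mapping over permutations, which is what the Munkres (Hungarian) algorithm computes in O(n^3) time instead; the keywords, schema terms, and scores are all invented here.

```python
from itertools import permutations

def best_assignment(keywords, terms, score):
    """Return (best total score, keyword->term mapping) maximizing the summed score."""
    best, best_map = float("-inf"), None
    # Try every injective mapping of keywords onto distinct terms.
    for perm in permutations(range(len(terms)), len(keywords)):
        total = sum(score[i][j] for i, j in enumerate(perm))
        if total > best:
            best = total
            best_map = {keywords[i]: terms[j] for i, j in enumerate(perm)}
    return best, best_map

# Invented scores: how well each keyword matches each schema element.
keywords = ["author", "title"]
terms = ["person.name", "book.title", "book.year"]
score = [[0.9, 0.2, 0.1],   # "author"
         [0.1, 0.8, 0.3]]   # "title"

total, mapping = best_assignment(keywords, terms, score)
print(mapping)  # -> {'author': 'person.name', 'title': 'book.title'}
```

Brute force is factorial in the number of keywords, which is exactly why an O(n^3) assignment algorithm matters at realistic schema sizes.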


2011 - Keyword-based Search in Data Integration Systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Domnori, Elton; Guerra, Francesco; Raquel Trillo, Lado; Yannis, Velegrakis
abstract

In this paper we describe Keymantic, a framework for translating keyword queries into SQL queries by assuming that the only available information is the source metadata, i.e., schema and some external auxiliary information. Such a framework finds application when only intensional knowledge about the data source is available, as in Data Integration Systems.


2011 - MediaBank: Keyword Search and Tag Cloud Functionalities for aMultimedia Content Authoring Web Platform [Articolo su rivista]
Bergamaschi, Sonia; Interlandi, Matteo; Vincini, Maurizio
abstract

The composition of multimedia presentations is a time- and resource-consuming task if not afforded in a well-defined manner. This is particularly true when people having different roles and following different high-level directives collaborate in the authoring and assembling of a final product. For this reason we adopt the Select, Assemble, Transform and Present (SATP) approach to coordinate the presentation authoring and a tag cloud-based search engine in order to help users in efficiently retrieving useful assets. In the first part of this paper we present MediaPresenter, the framework we developed to support companies in the creation of multimedia communication means, providing an instrument that users can exploit every time new communication channels have to be created. In the second part we describe how we adopt keyword search techniques coupled with Tag Cloud in order to summarize the results over the stored data.


2011 - MediaPresenter, a web platform for multimedia content management [Relazione in Atti di Convegno]
Bergamaschi, Sonia; F., Ferrari; Interlandi, Matteo; Vincini, Maurizio
abstract

The composition of multimedia presentations is a time- and resource-consuming task if not afforded in a well-defined manner. This is particularly true for medium/big companies, where people having different roles and following different high-level directives collaborate in the authoring and assembling of a final product. In this paper we present MediaPresenter, the framework we developed to support companies in the creation of multimedia communication means, providing an instrument that users can exploit every time new communication channels have to be created.


2011 - NORMS: an automatic tool to perform schema label normalization [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Gawinecki, Maciej; Sorrentino, Serena
abstract

Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and structure). Schema matching systems usually exploit lexical and semantic information provided by lexical databases/thesauri to discover intra/inter semantic relationships among schema elements. However, most of them obtain poor performance on real scenarios due to the significant presence of “non-dictionary words” in real-world schemata. Non-dictionary words include compound nouns, abbreviations and acronyms. In this paper, we present NORMS (NORmalizer of Schemata), a tool performing schema label normalization to increase the number of comparable labels extracted from schemata.
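A minimal sketch of the kind of normalization such a tool performs (illustrative only, with an invented abbreviation table; not NORMS's actual code): split CamelCase and underscore-separated compound labels, then expand known abbreviations, so that labels from different schemata become comparable.

```python
import re

# Toy abbreviation dictionary (an assumption; real tools use curated resources).
ABBREVIATIONS = {"qty": "quantity", "addr": "address", "emp": "employee"}

def normalize_label(label):
    """Expand a schema label into comparable lowercase dictionary words."""
    # Split CamelCase ("ShipAddr" -> "Ship Addr") and underscores.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label).replace("_", " ")
    # Lowercase each token and expand it if it is a known abbreviation.
    words = [ABBREVIATIONS.get(w.lower(), w.lower()) for w in spaced.split()]
    return " ".join(words)

print(normalize_label("ShipAddr"))   # -> "ship address"
print(normalize_label("order_qty"))  # -> "order quantity"
```

After normalization, "ShipAddr" and "shipping_address"-style labels from two schemata share dictionary words and can be compared by a lexical matcher.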


2011 - Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dimensions [Articolo su rivista]
Bergamaschi, Sonia; Marius Octavian, Olaru; Sorrentino, Serena; Vincini, Maurizio
abstract

Data Warehousing is the main Business Intelligence instrument for the analysis of large amounts of data. It permits the extraction of relevant information for decision making processes inside organizations. Given the great diffusion of Data Warehouses, there is an increasing need to integrate information coming from independent Data Warehouses or from independently developed data marts in the same Data Warehouse. In this paper, we provide a method for the semi-automatic discovery of common topological properties of dimensions that can be used to automatically map elements of different dimensions in heterogeneous Data Warehouses. The method uses techniques from the Data Integration research area and combines topological properties of dimensions in a multidimensional model.
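One concrete check a method like this can build on is whether a candidate correspondence between two dimension levels behaves as a roll-up, i.e. is many-to-one from the finer to the coarser level. The sketch below illustrates that property only (the data and function name are invented); it is not the paper's algorithm.

```python
def is_rollup(pairs):
    """pairs: (fine_member, coarse_member) tuples observed in the data.
    Returns True iff each fine member maps to exactly one coarse member."""
    seen = {}
    for fine, coarse in pairs:
        # setdefault records the first mapping; a different later value is a violation
        if seen.setdefault(fine, coarse) != coarse:
            return False  # same city mapped to two regions: not a roll-up
    return True

cities = [("Modena", "Emilia-Romagna"), ("Bologna", "Emilia-Romagna"),
          ("Trento", "Trentino")]
print(is_rollup(cities))                           # -> True
print(is_rollup(cities + [("Modena", "Lazio")]))   # -> False
```

A many-to-one mapping is necessary but not sufficient evidence of a valid dimension correspondence; topological checks like this would be combined with schema-level matching.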


2011 - Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dimensions [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Marius Octavian, Olaru; Sorrentino, Serena; Vincini, Maurizio
abstract

Data Warehousing is the main Business Intelligence instrument for the analysis of large amounts of data. It permits the extraction of relevant information for decision making processes inside organizations. Given the great diffusion of Data Warehouses, there is an increasing need to integrate information coming from independent Data Warehouses or from independently developed data marts in the same Data Warehouse. In this paper, we provide a method for the semi-automatic discovery of common topological properties of dimensions that can be used to automatically map elements of different dimensions in heterogeneous Data Warehouses. The method uses techniques from the Data Integration research area and combines topological properties of dimensions in a multidimensional model.


2011 - The List Viterbi Training Algorithm and Its Application to Keyword Search over Databases [Relazione in Atti di Convegno]
Rota, Silvia; Bergamaschi, Sonia; Guerra, Francesco
abstract

Hidden Markov Models (HMMs) are today employed in a variety of applications, ranging from speech recognition to bioinformatics. In this paper, we present the List Viterbi training algorithm, a version of the Expectation-Maximization (EM) algorithm based on the List Viterbi algorithm instead of the commonly used forward-backward algorithm. We developed the batch and online versions of the algorithm, and we also describe an interesting application in the context of keyword search over databases, where we exploit an HMM for matching keywords into database terms. In our experiments we tested the online version of the training algorithm in a semi-supervised setting that allows us to take into account the feedback provided by the users.


2011 - The Open Source release of the MOMIS Data Integration System [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Beneventano, Domenico; Corni, Alberto; Entela, Kazazi; Orsini, Mirko; Po, Laura; Sorrentino, Serena
abstract

MOMIS (Mediator EnvirOnment for Multiple Information Sources) is an Open Source Data Integration system able to aggregate data coming from heterogeneous data sources (structured and semistructured) in a semi-automatic way. DataRiver is a spin-off of the University of Modena and Reggio Emilia that has re-engineered the MOMIS system and released its Open Source version both for commercial and academic use. The MOMIS system has been extended with a set of features to minimize the integration process costs, exploiting the semantics of the data sources and optimizing each integration phase. The Open Source MOMIS system has been successfully applied in several industrial sectors: Medical, Agro-food, Tourism, Textile, Mechanical, Logistics. This paper describes the features of the Open Source MOMIS system and how it is able to address real data integration challenges.


2011 - Understanding linked open data through keyword searching: the KEYRY approach [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Rota, Silvia; Yannis, Velegrakis
abstract

We introduce KEYRY, a tool for translating keyword queries over structured data sources into queries formulated in their native query language. Since it is not based on analysis of the data source contents, KEYRY finds application in scenarios where sources hold complex and huge schemas, apt to frequent changes, such as sources belonging to the linked open data cloud. KEYRY is based on a probabilistic approach that provides the top-k results that better approximate the intended meaning of the user query.


2011 - Uomini e Computer: La nuova alleanza [Altro]
Brunetti, Rossella; Bergamaschi, Sonia; Bandieri, Paola
abstract

Most of us have some familiarity with computers, with the Internet and the Web, and with image and sound processing tools. Computer science, however (that is, the set of processes and technologies that make possible the creation, collection, processing, storage and transmission of information by automatic means, the youngest of the exact sciences), still remains fairly unknown to most people. The computer, the protagonist par excellence of the cultural revolution brought about by the development of information technologies, has changed its size and capabilities over roughly fifty years, becoming an indispensable tool for broadening our technological and cultural horizons and for participating in the "global culture" created by computer networks. There is much debate over whether the enormous amount of information that can be manipulated and accessed through the computer actually leads to a cultural "progress" of humankind or, instead, to a loss of control over one's own identity and true traits, in an irreducible confusion between the doctored, distorted image of the "virtual" world and that of the real world which generates it. An exploration of the most advanced technological and cultural applications of computer science offers the opportunity to recover a "new alliance" between humans and machines, between the unreproducible individuality of the human mind and the multiplication of the self created by the Web, between the creative act of the artist and the exact laws that govern the behavior of computers. The project aims to bring students of primary and secondary schools and the citizens of Modena closer to the most advanced developments of computer science and technology.
The main topics discussed are the technological advances in robotics, computational models for artificial intelligence, the revolution in information management introduced by the Web, and the impact of computer science on the worlds of entertainment and art. The following activities were carried out: 3 workshops for primary schools, 3 meetings for secondary schools, 3 evening lectures, and the production of a DVD at the conclusion of the experience.


2011 - Using semantic techniques to access web data [Articolo su rivista]
Raquel, Trillo; Po, Laura; Sergio, Ilarri; Bergamaschi, Sonia; Eduardo, Mena
abstract

Nowadays, people frequently use different keyword-based web search engines to find the information they need on the web. However, many words are polysemous and, when these words are used to query a search engine, its output usually includes links to web pages referring to their different meanings. Besides, results with different meanings are mixed up, which makes the task of finding the relevant information difficult for the users, especially if the user-intended meanings behind the input keywords are not among the most popular on the web. In this paper, we propose a set of semantic techniques to group the results provided by a traditional search engine into categories defined by the different meanings of the input keywords. Unlike other proposals, our method considers the knowledge provided by ontologies available on the web in order to dynamically define the possible categories. Thus, it is independent of the sources providing the results that must be grouped. Our experimental results show the benefits of the proposal.


2010 - 1st International Workshop on Data Engineering meets the Semantic Web (DESWeb 2010) [Esposizione]
Guerra, Francesco; Yannis, Velegrakis; Bergamaschi, Sonia
abstract

Modern web applications like wikis, social networking sites and mashups are radically changing the nature of the modern Web from a publishing-only environment into a vibrant place for information exchange. The successful exploitation of this information largely depends on the ability to successfully communicate the data semantics, which is exactly the vision of the Semantic Web. In this context, new challenges emerge for semantic-aware data management systems. The contribution of the data management community to the Semantic Web effort is fundamental. RDF has already been adopted as the representation model and exchange format for the semantics of the data on the Web. Although, until recently, RDF had not received considerable attention, the recent publication in RDF format of large ontologies with millions of entities from sites like Yahoo! and Wikipedia, the huge amounts of microformats in RDF from life science organizations, and the gigantic RDF bibliographic annotations from publishers have made clear the need for advanced management techniques for RDF data. On the other hand, traditional data management techniques have a lot to gain by incorporating semantic information into their frameworks. Existing data integration, exchange and query solutions are typically based on the actual data values stored in the repositories, and not on the semantics of these values. Incorporating semantics in the data management process improves query accuracy and permits more efficient and effective sharing and distribution services. Integration of new content, on-the-fly generation of mappings, queries on loosely structured data, keyword searching on structured data repositories, and entity identification are some of the areas that can benefit from the presence of semantic knowledge alongside the data. The goal of DESWeb is to bring together researchers and practitioners from both fields of Data Management and Semantic Web.
It aims at investigating the new challenges that Semantic Web technologies have introduced and new ways through which these technologies can improve existing data management solutions. Furthermore, it intends to study what data management systems and technologies can offer in order to improve the scalability and performance of Semantic Web applications.


2010 - A COMPLETE LCA DATA INTEGRATION SOLUTION BUILT UPON MOMIS SYSTEM [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Sgaravato, Luca; Vincini, Maurizio
abstract

Life Cycle Thinking is spreading day by day outside scientific circles to assume a key role in the modern production system. Rewarded by consumers or ruled by governments, an increasing number of firms are focusing on the assessment of their industrial processes. ENEA supports the adoption of such practice in small companies by supplying them with simplified LCA tools; extending their database with valuable and up-to-date data published by the European Commission is of primary importance in order to provide effective assistance. This paper presents and demonstrates how the MOMIS (and RELEVANT) systems may be coupled and extended to provide time- and effort-effective support in developing and deploying such an integration solution. The paper describes all the stages involved in the Extract, Transform and Load process, with strong emphasis on the benefits the integration designer can achieve by means of the semi-automatic definition of inter-attribute mappings and transformation functions [1,2] on a large number of records.


2010 - Agents and Peer-to-Peer Computing, 6th International Workshop, AP2PC 2007, Honolulu, Hawaii, USA, May 14-18, 2007, Revised and Selected Papers [Curatela]
S. R. H., Joseph; Z., Despotovic; G., Moro; Bergamaschi, Sonia
abstract

Peer-to-peer (P2P) computing has attracted significant media attention, initially spurred by the popularity of file-sharing systems such as Napster, Gnutella, and Morpheus. More recently, systems like BitTorrent and eDonkey have continued to sustain that attention. New techniques such as distributed hash tables (DHTs), semantic routing, and Plaxton meshes are being combined with traditional concepts such as hypercubes, trust metrics, and caching techniques to pool together the untapped computing power at the “edges” of the Internet. These new techniques and possibilities have generated a lot of interest in many industrial organizations, and resulted in the creation of a P2P working group on standardization in this area (http://www.irtf.org/charter?gtype=rg&group=p2prg). In P2P computing, peers and services forego central coordination and dynamically organize themselves to support knowledge sharing and collaboration, in both cooperative and non-cooperative environments. The success of P2P systems strongly depends on a number of factors. First, the ability to ensure equitable distribution of content and services. Economic and business models which rely on incentive mechanisms to supply contributions to the system are being developed, along with methods for controlling the “free riding” issue. Second, the ability to enforce provision of trusted services. Reputation-based P2P trust management models are becoming a focus of the research community as a viable solution. The trust models must balance both constraints imposed by the environment (e.g., scalability) and the unique properties of trust as a social and psychological phenomenon. Recently, we are also witnessing a move of the P2P paradigm to embrace mobile computing in an attempt to achieve even higher ubiquitousness.
The possibility of services related to physical location and the relation with agents in physical proximity could introduce new opportunities and also new technical challenges. Although researchers working on distributed computing, multi-agent systems, databases, and networks have been using similar concepts for a long time, it is only fairly recently that papers motivated by the current P2P paradigm have started appearing in high-quality conferences and workshops. Research in agent systems in particular appears to be most relevant because, since their inception, multi-agent systems have always been thought of as collections of peers. The multi-agent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, dynamic growth, and other advantages of P2P also leads to significant potential problems. Most prominent among these problems are coordination: the ability of an agent to make decisions on its own actions in the context of activities of other agents, and scalability: the value of P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so forth. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen potential application domains. These two problems are common to many large-scale applications. Without coordination, agents may be wasting their efforts, squandering resources, and failing to achieve their objectives in situations requiring collective effort. This workshop series brings together researchers working on agent systems and P2P computing with the intention of strengthening this connection.
Researchers from other related areas such as distributed systems, networks, and database systems are also welcome (and, in our opinion, have a lot to contribute). We sought high-quality and original contributions on the general theme of “Agents and P2P Computing.”


2010 - Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher [Relazione in Atti di Convegno]
Po, Laura; Bergamaschi, Sonia
abstract

This paper proposes lexical annotation as an effective method to solve the ambiguity problems that affect ontology matchers. Lexical annotation associates with each ontology element a set of meanings belonging to a semantic resource. Performing lexical annotation on the ontologies involved in the matching process makes it possible to detect false positive mappings and to enrich matching results by adding new mappings (i.e., lexical relationships between elements on the basis of the semantic relationships holding among meanings). The paper explains how to apply lexical annotation to the results obtained by a matcher. In particular, the paper shows an application to the SCARLET matcher. We adopt an experimental approach on two test cases, where SCARLET was previously tested, to investigate the potential of lexical annotation. Experiments yielded promising results, showing that lexical annotation improves the precision of the matcher.


2010 - Guest editors' introduction: Information overload [Articolo su rivista]
Bergamaschi, Sonia; Guerra, Francesco; Barry, Leiba
abstract

Search the Internet for the phrase “information overload definition,” and Google will return some 7,310,000 results (at the time of this writing). Bing gets 9,760,000 results for the same query. How is it possible for us to process that much data, to select the most interesting information sources, to summarize and combine different facets highlighted in the results, and to answer the questions we set out to ask? Information overload is present in everything we do on the Internet. Despite the number of occurrences of the term on the Internet, peer-reviewed literature offers only a few accurate definitions of information overload. Among them, we prefer the one that defines it as the situation that “occurs for an individual when the information processing demands on time (Information Load, IL) to perform interactions and internal calculations exceed the supply or capacity of time available (Information Processing Capacity, IPC) for such processing.”1 In other words, when the information available exceeds the user’s ability to process it. This formal definition provides a measure that we can express algebraically as IL > IPC, offering a way for classifying and comparing the different situations in which the phenomenon occurs. But measuring IL and IPC is a complex task because they strictly depend on a set of factors involving both the individual and the information (such as the individual’s skill), as well as the motivations and goals behind the information request. Clay Shirky, who teaches at New York University, takes a different view, focusing on how we sift through the information that’s available to us.
We’ve long had access to “more reading material than you could finish in a lifetime,” he says, and “there is no such thing as information overload, there’s only filter failure.”2 But however we look at it, whether it’s too much production or failure in filtering, it’s a general and common problem, and information overload management requires the study and adoption of special, user- and context-dependent solutions. Due to the amount of information available that comes with no guarantee of importance, trust, or accuracy, the Internet’s growth has inevitably amplified preexisting information overload issues. Newspapers, TV networks, and press agencies form an interesting example of overload producers: they collectively make available hundreds of thousands of partially overlapping news articles each day. This large quantity gives rise to information overload in a “spatial” dimension (news articles about the same subject are published in different newspapers) and in a “temporal” dimension (news articles about the same topic are published and updated many times in a short time period). The effects of information overload include difficulty in making decisions due to time spent searching and processing information,3 inability to select among multiple information sources providing information about the same topic,4 and psychological issues concerning excessive interruptions generated by too many information sources.5 To put it colloquially, this excess of information stresses Internet users out.
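The algebraic condition IL > IPC quoted in the abstract can be made concrete with a minimal sketch; the time budgets below are invented purely for demonstration:

```python
def is_overloaded(il_minutes: float, ipc_minutes: float) -> bool:
    """Return True when the Information Load (IL), the processing time
    the information demands, exceeds the Information Processing
    Capacity (IPC), the processing time actually available."""
    return il_minutes > ipc_minutes

# A reader whose daily feed demands 90 minutes but who has only 30
# available is overloaded; one with 20 minutes of demand is not.
print(is_overloaded(90, 30))  # True
print(is_overloaded(20, 30))  # False
```

As the abstract notes, the hard part in practice is not this comparison but estimating IL and IPC, since both depend on the individual and on the information itself.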


2010 - IEEE Internet Computing Special Issue on Information Overload [Curatela]
Bergamaschi, Sonia; Guerra, Francesco; Barry, Leiba
abstract

Search the Internet for the phrase “information overload definition,” and Google will return some 7,310,000 results (at the time of this writing). Bing gets 9,760,000 results for the same query. How is it possible for us to process that much data, to select the most interesting information sources, to summarize and combine different facets highlighted in the results, and to answer the questions we set out to ask? Information overload is present in everything we do on the Internet. Despite the number of occurrences of the term on the Internet, peer-reviewed literature offers only a few accurate definitions of information overload. Among them, we prefer the one that defines it as the situation that “occurs for an individual when the information processing demands on time (Information Load, IL) to perform interactions and internal calculations exceed the supply or capacity of time available (Information Processing Capacity, IPC) for such processing.” In other words, when the information available exceeds the user’s ability to process it. This formal definition provides a measure that we can express algebraically as IL > IPC, offering a way for classifying and comparing the different situations in which the phenomenon occurs. But measuring IL and IPC is a complex task because they strictly depend on a set of factors involving both the individual and the information (such as the individual’s skill), as well as the motivations and goals behind the information request. Clay Shirky, who teaches at New York University, takes a different view, focusing on how we sift through the information that’s available to us.
We’ve long had access to “more reading material than you could finish in a lifetime,” he says, and “there is no such thing as information overload, there’s only filter failure.” But however we look at it, whether it’s too much production or failure in filtering, it’s a general and common problem, and information overload management requires the study and adoption of special, user- and context-dependent solutions. Due to the amount of information available that comes with no guarantee of importance, trust, or accuracy, the Internet’s growth has inevitably amplified preexisting information overload issues. Newspapers, TV networks, and press agencies form an interesting example of overload producers: they collectively make available hundreds of thousands of partially overlapping news articles each day. This large quantity gives rise to information overload in a “spatial” dimension (news articles about the same subject are published in different newspapers) and in a “temporal” dimension (news articles about the same topic are published and updated many times in a short time period). The effects of information overload include difficulty in making decisions due to time spent searching and processing information, inability to select among multiple information sources providing information about the same topic, and psychological issues concerning excessive interruptions generated by too many information sources. To put it colloquially, this excess of information stresses Internet users out.


2010 - Keymantic: Semantic Keyword-based Searching in Data Integration Systems [Software]
Bergamaschi, Sonia; Domnori, Elton; Guerra, Francesco; Orsini, Mirko; Raquel Trillo, Lado; Yannis, Velegrakis
abstract

Keymantic is a system for keyword-based searching in relational databases that does not require a-priori knowledge of the instances held in a database. It finds numerous applications in situations where traditional keyword-based searching techniques are inapplicable due to the unavailability of the database contents for the construction of the required indexes.


2010 - Keymantic: Semantic Keyword-based Searching in Data Integration Systems [Articolo su rivista]
Bergamaschi, Sonia; Domnori, Elton; Guerra, Francesco; Orsini, Mirko; R., Trillo Lado; Y., Velegrakis
abstract

We propose the demonstration of Keymantic, a system for keyword-based searching in relational databases that does not require a-priori knowledge of instances held in a database. It finds numerous applications in situations where traditional keyword-based searching techniques are inapplicable due to the unavailability of the database contents for the construction of the required indexes.


2010 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface [Relazione in Atti di Convegno]
Joseph, S. R. H.; Despotovic, Z.; Moro, G.; Bergamaschi, S.
abstract


2010 - MOMIS: Getting through the THALIA benchmark [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Orsini, Mirko; Vincini, Maurizio
abstract

During the last decade many data integration systems characterized by a classical wrapper/mediator architecture based on a Global Virtual Schema (Global Virtual View - GVV) have been proposed. The data sources store data, while the GVV provides a reconciled, integrated, and virtual view of the underlying sources. Each proposed system contributes to the advancement of the state of the art by focusing on different aspects to provide an answer to one or more challenges of the data integration problem, ranging from system-level heterogeneities to structural and syntactic heterogeneities, up to the semantic level. The approaches are still partly manual, requiring a great amount of customization for data reconciliation and the writing of specific, non-reusable programming code. The specialization of mediator systems makes comparisons among the various systems difficult. Therefore, the last Lowell Report [1] has provided the guidelines for the definition of a public benchmark for data integration problems. The proposal is called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) [2], and it provides researchers with a collection of downloadable data sources representing university course catalogues, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system. In this paper we show how the MOMIS mediator system we developed [3,4] can deal with all twelve queries of the THALIA benchmark by simply extending and combining the declarative translation functions available in MOMIS, without any overhead of new code. This is a remarkable result; in fact, as far as we know, no other system has provided a complete answer to the benchmark.


2010 - Preface [Relazione in Atti di Convegno]
Bergamaschi, S.; Lodi, S.; Martoglia, R.; Sartori, C.
abstract


2010 - SEBD 2010 Proceedings of the 18th Italian Symposium on Advanced Database Systems [Curatela]
Bergamaschi, Sonia; S., Lodi; Martoglia, Riccardo; C., Sartori
abstract

Preface
This volume collects the papers selected for presentation at the Eighteenth Italian Symposium on Advanced Database Systems (SEBD 2010), held in Rimini, Italy, from the 20th to the 23rd of June 2010. SEBD is the major annual event of the Italian database research community. The symposium is conceived as a gathering forum for the discussion and exchange of ideas and experiences among researchers and experts from the academy and industry, about all aspects of database systems and their applications. SEBD is back in Rimini after sixteen years, and it is interesting to observe how the landscape of the Italian database research community has changed. In 1994 twenty-one papers were accepted; now the number has more than doubled, meaning that the community has been steadily growing. Most of the topics considered in 1994 are still around, even if the language, the formalisms and the reference applications have changed. The Web was e-mail, FTP, Usenet, a small amount of HTML pages here and there, and little more; now it is the pervasive engine of information dissemination and search. The Web is so powerful that a series of brand new ideas and applications have arisen from it, due to a mix of possibility and necessity. Social systems across the Web, mobility, and heterogeneity were not conceivable in the early 1990s. Semantic web, data mining and warehousing, streaming techniques, and large scale integration are necessary to deal with the growing amount of data and information. The SEBD 2010 program reflects the current interests of the Italian database researchers and covers most of the topics considered by the international research community. Sixty papers were submitted to SEBD 2010, of which twenty-two were research papers, two were software demonstrations, and thirty-four were extended abstracts, i.e., papers containing descriptions of on-going projects or presenting results already published.
Fifty-one papers were accepted for presentation, of which seventeen were research papers, two were software demonstrations, and thirty-two were extended abstracts. Besides paper presentations, the program includes a tutorial by Divesh Srivastava (AT&T Labs-Research) and two invited talks, the first by Hector Garcia-Molina (Stanford University, CA) and the second by Amr El Abbadi (University of California, CA). We would like to thank all the authors who submitted papers and all symposium participants. We are grateful to the members of the Program Committee and the external referees for their thorough work in reviewing submissions with expertise and patience, and to the members of the SEBD Steering Committee for their support in the organization of SEBD 2010. Special thanks are due to the members of the Organizing Committee and to the University of Bologna, Polo di Rimini, which made this event possible. Finally, we gratefully thank all cooperating institutions.
Rimini, June 2010. Sonia Bergamaschi, Stefano Lodi, Riccardo Martoglia, Claudio Sartori


2010 - Schema Label Normalization for Improving Schema Matching [Articolo su rivista]
Sorrentino, Serena; Bergamaschi, Sonia; Gawinecki, Maciej; Po, Laura
abstract

Schema matching is the problem of finding relationships among concepts across data sources that are heterogeneous in format and in structure. Starting from the “hidden meaning” associated with schema labels (i.e., class/attribute names) it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e., annotation w.r.t. a thesaurus/lexical resource) helps in associating a “meaning” with schema labels. However, the performance of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns, abbreviations, and acronyms. We address this problem by proposing a method to perform schema label normalization, which increases the number of comparable labels. The method semi-automatically expands abbreviations/acronyms and annotates compound nouns, with minimal manual effort. We empirically prove that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching results.
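The abbreviation-expansion step described above can be sketched with a toy lookup-based normalizer; the dictionary and labels below are hypothetical, and the paper's actual method is semi-automatic and considerably richer:

```python
# Hypothetical abbreviation dictionary; in practice such a table would be
# built semi-automatically, not hard-coded as it is here.
ABBREVIATIONS = {"emp": "employee", "no": "number", "qty": "quantity"}

def normalize_label(label: str) -> str:
    """Split a schema label on underscores and expand known abbreviations,
    so that labels from different schemata become comparable dictionary
    words suitable for lexical annotation."""
    tokens = label.lower().split("_")
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

print(normalize_label("EMP_NO"))   # employee number
print(normalize_label("emp_qty"))  # employee quantity
```

Once two labels such as EMP_NO and EmployeeNumber normalize to the same words, a thesaurus-based annotator can assign them the same meaning, which is the effect the method exploits.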


2010 - Uncertainty in data integration systems: automatic generation of probabilistic relationships [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena; Corni, Alberto
abstract

This paper proposes a method for the automatic discovery of probabilistic relationships in the environment of data integration systems. Dynamic data integration systems extend the architecture of current data integration systems by modeling uncertainty at their core. Our method is based on probabilistic word sense disambiguation (PWSD), which makes it possible to automatically lexically annotate (i.e., to perform annotation w.r.t. a thesaurus/lexical resource) the schemata of a given set of data sources to be integrated. From the annotated schemata and the relationships defined in the thesaurus, we derive the probabilistic lexical relationships among schema elements. Lexical relationships, as well as structural relationships, are collected in the Probabilistic Common Thesaurus (PCT).


2009 - A Mediator Based Approach to Ontology Generation and Querying of Molecular and Phenotypic Cereals Data [Articolo su rivista]
Sala, Antonio; Bergamaschi, Sonia
abstract

We describe the development of the CEREALAB ontology, an ontology of molecular and phenotypic cereals data that allows identifying the correlation between the phenotype of a plant and its molecular data. It is realised by integrating public web databases with the database developed by the research group of the CEREALAB laboratory. Integration is obtained semi-automatically by using the Mediator envirOnment for Multiple Information Sources (MOMIS) system, a data integration system developed by the Database Group of the University of Modena and Reggio Emilia, and allows querying the integrated data sources regardless of the specific languages of the source databases.


2009 - ALA: Dealing with Uncertainty in Lexical Annotation [Software]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena; Corni, Alberto
abstract

We present ALA, a tool for the automatic lexical annotation (i.e., annotation w.r.t. a thesaurus/lexical resource) of structured and semi-structured data sources and the discovery of probabilistic lexical relationships in a data integration environment. ALA performs automatic lexical annotation through the use of probabilistic annotations, i.e., an annotation is associated with a probability value. By performing probabilistic lexical annotation, we discover probabilistic inter-source lexical relationships among schema elements. ALA extends the lexical annotation module of the MOMIS data integration system. However, it may be applied more generally in the context of schema mapping discovery, ontology merging and data integration systems, and it is particularly suitable for performing “on-the-fly” data integration or probabilistic ontology matching.


2009 - An ETL tool based on semantic analysis of schemata and instances [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; C., Sartori; Vincini, Maurizio
abstract

In this paper we propose a system supporting the semi-automatic definition of inter-attribute mappings and transformation functions, used as an ETL tool in a data warehouse project. The tool supports both schema-level analysis, exploited for the mapping definitions amongst the data sources and the data warehouse, and instance-level operations, exploited for defining transformation functions that integrate data coming from multiple sources in a common representation. Our proposal couples and extends the functionalities of two previously developed systems: the MOMIS integration system and the RELEVANT data analysis system.


2009 - DataRiver [Spin Off]
Bergamaschi, Sonia; Orsini, Mirko; Beneventano, Domenico; Sala, Antonio; Corni, Alberto; Po, Laura; Sorrentino, Serena; Quix, Srl
abstract


2009 - Dealing with Uncertainty in Lexical Annotation [Articolo su rivista]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena; Corni, Alberto
abstract

We present ALA, a tool for the automatic lexical annotation (i.e., annotation w.r.t. a thesaurus/lexical resource) of structured and semi-structured data sources and the discovery of probabilistic lexical relationships in a data integration environment. ALA performs automatic lexical annotation through the use of probabilistic annotations, i.e., an annotation is associated with a probability value. By performing probabilistic lexical annotation, we discover probabilistic inter-source lexical relationships among schema elements. ALA extends the lexical annotation module of the MOMIS data integration system. However, it may be applied more generally in the context of schema mapping discovery, ontology merging and data integration systems, and it is particularly suitable for performing “on-the-fly” data integration or probabilistic ontology matching.


2009 - Extending WordNet with compound nouns for semi-automatic annotation in data integration systems [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Sorrentino, Serena
abstract

The focus of data integration systems is on producing a comprehensive global schema successfully integrating data from heterogeneous data sources (heterogeneous in format and in structure). Starting from the “meanings” associated with schema elements (i.e., class/attribute labels) and exploiting the structural knowledge of sources, it is possible to discover relationships among the elements of different schemata. Lexical annotation is the explicit inclusion of the “meaning” of a data source element according to a lexical resource. The accuracy of semi-automatic lexical annotator tools is poor on real-world schemata due to the abundance of non-dictionary compound nouns. It follows that a large set of relationships among different schemata is discovered, including a great number of false positive relationships. In this paper we propose a new method for the annotation of non-dictionary compound nouns, which draws its inspiration from work in the natural language disambiguation area. The method extends the lexical annotation module of the MOMIS data integration system.


2009 - Improving Extraction and Transformation in ETL by Semantic Analysis [Relazione in Atti di Convegno]
Guerra, Francesco; Bergamaschi, Sonia; Orsini, Mirko; Claudio, Sartori; Vincini, Maurizio
abstract

Extraction, Transformation and Loading (ETL) processes are crucial for data warehouse consistency and are typically based on constraints and requirements expressed in natural language in the form of comments and documentation. This task is poorly supported by automatic software applications, making these activities a huge amount of work for data warehouse administrators. In a traditional business scenario, this does not represent a big issue, since the sources populating a data warehouse are fixed and directly known by the data administrator. Nowadays, actual business needs require enterprise information systems to have great flexibility concerning the allowed business analysis and the treated data. Temporary alliances of enterprises, market analysis processes, and the availability of data on the Internet push enterprises to quickly integrate unexpected data sources for their activities. Therefore, the reference scenario for data warehouse systems changes radically, since the data sources populating the data warehouse may not be directly known and managed by the designers, thus creating new requirements for ETL tools related to improving the automation of the extraction and transformation process, the need to manage heterogeneous attribute values, and the ability to manage different kinds of data sources, ranging from DBMSs to flat files, XML documents and spreadsheets. In this paper we propose a semantic-driven tool that couples and extends the functionalities of two systems: the MOMIS integration system and the RELEVANT data analysis system. The tool aims at supporting the semi-automatic definition of ETL inter-attribute mappings and transformations in a data warehouse project. By means of a semantic analysis, two tasks are performed: 1) identification of the parts of the schemata of the data sources which are related to the data warehouse; 2) supporting the definition of transformation rules for populating the data warehouse.
We experimented with the approach in a real scenario: preliminary qualitative results show that our tool may really support the data warehouse administrator’s work, by considerably reducing the data warehouse design time.


2009 - Keymantic: A Keyword-Based Search Engine Using Structural Knowledge [Relazione in Atti di Convegno]
Guerra, Francesco; Bergamaschi, Sonia; Orsini, Mirko; Sala, Antonio; Sartori, C.
abstract

Traditional techniques for query formulation require knowledge of the database contents, i.e. which data are stored in the data source and how they are represented. In this paper, we discuss the development of a keyword-based search engine for structured data sources. The idea is to couple the ease of use and flexibility of keyword-based search with metadata extracted from data schemata and extensional knowledge, which together constitute a semantic network of knowledge. By translating keywords into SQL statements, we develop a search engine that is effective, semantic-based, and applicable also when instances are not continuously available, such as in integrated data sources or in data sources extracted from the deep web.


2009 - Keymantic: A keyword-based search engine using structural knowledge [Relazione in Atti di Convegno]
Guerra, F.; Bergamaschi, S.; Orsini, M.; Sala, A.; Sartori, C.
abstract

Traditional techniques for query formulation require knowledge of the database contents, i.e. which data are stored in the data source and how they are represented. In this paper, we discuss the development of a keyword-based search engine for structured data sources. The idea is to couple the ease of use and flexibility of keyword-based search with metadata extracted from data schemata and extensional knowledge, which together constitute a semantic network of knowledge. By translating keywords into SQL statements, we develop a search engine that is effective, semantic-based, and applicable also when instances are not continuously available, such as in integrated data sources or in data sources extracted from the deep web.
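The core idea of translating keywords into SQL over schema metadata can be sketched as follows. This is a deliberately minimal illustration, not Keymantic's actual algorithm or data structures: the schema index, table names and matching strategy below are hypothetical stand-ins for the semantic network described in the abstract.

```python
# Toy sketch of keyword-to-SQL translation over schema metadata.
# SCHEMA_INDEX is a made-up stand-in for the semantic network built from
# schema and extensional knowledge: it maps a keyword to the (table,
# column) where matching values would live.
SCHEMA_INDEX = {
    "professor": ("person", "role"),
    "database":  ("course", "topic"),
}

def keywords_to_sql(keywords):
    """Translate a keyword query into a naive conjunctive SQL statement."""
    conditions, tables = [], set()
    for kw in keywords:
        if kw not in SCHEMA_INDEX:
            continue  # unmatched keywords are simply dropped in this sketch
        table, column = SCHEMA_INDEX[kw]
        tables.add(table)
        conditions.append(f"{table}.{column} = '{kw}'")
    if not tables:
        return None
    return (
        "SELECT * FROM " + ", ".join(sorted(tables))
        + " WHERE " + " AND ".join(conditions)
    )

print(keywords_to_sql(["professor", "database"]))
```

A real engine would additionally rank alternative keyword-to-schema assignments and join the involved tables through foreign keys; here every match simply contributes one selection condition.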


2009 - Lexical Knowledge Extraction: an Effective Approach to Schema and Ontology Matching [Relazione in Atti di Convegno]
Po, Laura; Sorrentino, Serena; Bergamaschi, Sonia; Beneventano, Domenico
abstract

This paper's aim is to examine what role Lexical Knowledge Extraction plays in data integration as well as in ontology engineering. Data integration is the problem of combining data residing at distributed heterogeneous sources and providing the user with a unified view of these data; a common and important scenario in data integration is that of structured or semi-structured data sources described by a schema. Ontology engineering is a subfield of knowledge engineering that studies methodologies for building and maintaining ontologies. It offers a direction towards solving the interoperability problems brought about by semantic obstacles, such as those related to the definitions of business terms and software classes. In these contexts, where users are confronted with heterogeneous information, the support of matching techniques is crucial. Matching techniques aim at finding correspondences between semantically related entities of different schemata/ontologies. Several matching techniques have been proposed in the literature, based on different approaches often derived from other fields, such as text similarity, graph comparison and machine learning. This paper proposes a matching technique based on Lexical Knowledge Extraction: first, an automatic lexical annotation of schemata/ontologies is performed; then, lexical relationships are extracted based on such annotations. A lexical annotation is a piece of information added to a document (book, online record, video, or other data) that refers to a semantic resource such as WordNet. Each annotation has the property of owning one or more lexical descriptions. Lexical annotation is performed by the Probabilistic Word Sense Disambiguation (PWSD) method, which combines several disambiguation algorithms. Our hypothesis is that performing lexical annotation of the elements (e.g. classes and properties/attributes) of schemata/ontologies enables the system to automatically extract the lexical knowledge that is implicit in a schema/ontology, and then to derive lexical relationships between the elements of a single schema/ontology or among elements of different schemata/ontologies. The effectiveness of the method presented in this paper has been proven within the MOMIS data integration system.


2009 - Schema Normalization for Improving Schema Matching [Relazione in Atti di Convegno]
Sorrentino, Serena; Bergamaschi, Sonia; Gawinecki, Maciej; Po, Laura
abstract

Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and in structure). Starting from the "hidden meaning" associated with schema labels (i.e. class/attribute names), it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) helps in associating a "meaning" with schema labels. However, the accuracy of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns and word abbreviations. In this work, we address this problem by proposing a schema label normalization method which increases the number of comparable labels. Unlike other solutions, the method semi-automatically expands abbreviations and annotates compound terms with minimal manual effort. We empirically prove that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching accuracy.
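The normalization step described above, splitting compound labels and expanding abbreviations so they become dictionary words, can be sketched as follows. The abbreviation table and example labels are hypothetical illustrations, not the resources actually used in the paper.

```python
import re

# Illustrative sketch of schema-label normalization: split compound labels
# (camelCase or snake_case) into tokens, then expand abbreviations so that
# every token is a dictionary word that a lexical annotator can look up.
# The abbreviation table is a made-up example.
ABBREVIATIONS = {"qty": "quantity", "addr": "address", "emp": "employee"}

def normalize_label(label):
    """Turn a raw schema label like 'empAddr' into a list of full words."""
    # 1. Split camelCase / snake_case compounds into word tokens.
    tokens = re.findall(r"[A-Za-z][a-z]*", label.replace("_", " "))
    # 2. Expand each token via the abbreviation table (identity otherwise).
    return [ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens]

print(normalize_label("empAddr"))
print(normalize_label("ship_qty"))
```

After this step, two labels such as `empAddr` and `employee_address` normalize to the same word list, which is what makes them comparable during matching.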


2009 - Semantic Access to Data from the Web [Relazione in Atti di Convegno]
Raquel, Trillo; Po, Laura; Sergio, Ilarri; Bergamaschi, Sonia; Eduardo, Mena
abstract

There is a great amount of information available on the web, so users typically rely on keyword-based web search engines to find the information they need. However, many words are polysemous, and therefore the output of a search engine will include links to web pages referring to different meanings of the keywords. Moreover, results with different meanings are mixed up, which makes the task of finding the relevant information difficult for the user, especially if the meanings behind the input keywords are not among the most popular on the web. In this paper, we propose a semantics-based approach to group the results returned to the user into clusters defined by the different meanings of the input keywords. Differently from other proposals, our method considers the knowledge provided by a pool of ontologies available on the Web in order to dynamically define the different categories (or clusters). Thus, it is independent of the sources providing the results that must be grouped.


2009 - Semantic Analysis for an Advanced ETL framework [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; C., Sartori; Vincini, Maurizio
abstract

In this paper we propose a system supporting the semi-automatic definition of inter-attribute mappings and transformation functions, used as an ETL tool in a data warehouse project. The tool supports both schema-level analysis, exploited for the mapping definitions between the data sources and the data warehouse, and instance-level operations, exploited for defining transformation functions that integrate, in a common representation, data coming from multiple sources. Our proposal couples and extends the functionalities of two previously developed systems: the MOMIS integration system and the RELEVANT data analysis system.


2009 - Semi-automatic compound nouns annotation for data integration systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Sorrentino, Serena
abstract

Lexical annotation is the explicit inclusion of the "meaning" of a data source element according to a lexical resource. The accuracy of semi-automatic lexical annotator tools is poor on real-world schemata due to the abundance of non-dictionary compound nouns. It follows that a large set of relationships among different schemata is discovered, including a great number of false positives. In this paper we propose a new method for the annotation of non-dictionary compound nouns, which draws its inspiration from work in the natural language disambiguation area. The method extends the lexical annotation module of the MOMIS data integration system.


2009 - Toward a Unified View of Data and Services [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Andrea, Maurino
abstract

The research on data integration and service discovery has involved from the beginning different (not always overlapping) communities. Therefore, data and services are described with different models, and different techniques to retrieve data and services have been developed. Nevertheless, from a user perspective, the border between data and services is often not so definite, since data and services provide complementary visions of the available resources. In NeP4B (Networked Peers for Business), a project funded by the Italian Ministry of University and Research, we developed a semantic approach for providing a uniform representation of data and services, thus allowing users to obtain sets of data and lists of web services as query results. The NeP4B idea relies on the creation of a Peer Virtual View (PVV) representing sets of data sources and web services, i.e. an ontological representation of data sources which is mapped to an ontological representation of web services. The PVV is exploited for solving user queries: 1) data results are selected by adopting a GAV approach; 2) services are retrieved by an information retrieval approach applied to service descriptions and by exploiting the mappings on the PVV. In the tutorial, we introduce: 1) the state of the art of semantic-based data integration and web service discovery systems; 2) the NeP4B architecture.


2008 - 2nd International Workshop on Semantic Web Architectures For Enterprises [Esposizione]
Bergamaschi, Sonia; Guerra, Francesco; Yannis, Velegrakis
abstract

The Semantic Web vision aims at building a "web of data", where applications may share their data on the Internet and relate them to real-world objects for interoperability and exchange. Similar ideas have been applied to web services, where different modeling architectures have been proposed for adding semantics to web service descriptions, making services on the web widely available. The potential impact envisaged by these approaches on real business applications is also important in several areas. Semantic-based business integration: business integration allows enterprises to share their data and services with other enterprises for business purposes. Making data and services available satisfies both "structural" requirements of enterprises (e.g. the possibility of sharing data about products or about available services) and "dynamic" requirements (e.g. business-to-business partnerships to execute an order); information systems implementing Semantic Web architectures can enable and strongly support this process. Semantic interoperability: metadata and ontologies support the dynamic and flexible exchange of data and services across the information systems of different organizations; adding semantics to representations of data and services allows accurate data querying and service discovery. Semantic-based lifecycle management: metadata, ontologies and rules are becoming an effective way of modeling corporate processes and business domains, effectively supporting the maintenance and evolution of business processes, corporate data, and knowledge. Knowledge management: ontologies and automated reasoning tools provide innovative support to the elicitation, representation and sharing of corporate knowledge. SWAE (Semantic Web Architectures for Enterprises) aims at evaluating how, and how much, the Semantic Web vision has met its promises with respect to business and market needs.
Papers and demonstrations of interest for the workshop show and highlight the interactions between Semantic Web technologies and business applications. The workshop aims at collecting models, tools, use cases and practical experience in which Semantic Web techniques have been developed and applied to support relevant business processes. It aims at assessing their degree of success, the challenges that have been addressed, the solutions that have been provided and the new tools that have been implemented. Special attention will be paid to proposals of "complete architectures", i.e. applications that can effectively support the maintenance and evolution of business processes as a whole, and applications that are able to combine representations of data and services in order to realize a common business knowledge management system.


2008 - A Mediator System for Data and Multimedia Sources [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Claudio, Gennaro; Guerra, Francesco; Matteo, Mordacchini; Sala, Antonio
abstract

Managing data and multimedia sources with a unique tool is a challenging issue. In this paper, the capabilities of the MOMIS integration system and the MILOS multimedia content management system are coupled, thus providing a methodology and a tool for building and querying an integrated virtual view of data and multimedia sources.


2008 - Agents and Peer-to-Peer Computing, 5th International Workshop, AP2PC 2006, Hakodate, Japan, May 9, 2006, Revised and Invited Papers [Curatela]
S., Joseph; Z., Despotovic; G., Moro; Bergamaschi, Sonia
abstract

Peer-to-peer (P2P) computing has attracted significant media attention, initially spurred by the popularity of file-sharing systems such as Napster, Gnutella, and Morpheus. More recently, systems like BitTorrent and eDonkey have continued to sustain that attention. New techniques such as distributed hash tables (DHTs), semantic routing, and Plaxton meshes are being combined with traditional concepts such as hypercubes, trust metrics, and caching techniques to pool together the untapped computing power at the "edges" of the Internet. These new techniques and possibilities have generated a lot of interest in many industrial organizations, and have resulted in the creation of a P2P working group on standardization in this area (http://www.irtf.org/charter?gtype=rg&group=p2prg). In P2P computing, peers and services forego central coordination and dynamically organize themselves to support knowledge sharing and collaboration, in both cooperative and non-cooperative environments. The success of P2P systems strongly depends on a number of factors. First, the ability to ensure equitable distribution of content and services: economic and business models which rely on incentive mechanisms to supply contributions to the system are being developed, along with methods for controlling the "free riding" issue. Second, the ability to enforce the provision of trusted services: reputation-based P2P trust management models are becoming a focus of the research community as a viable solution. The trust models must balance both the constraints imposed by the environment (e.g., scalability) and the unique properties of trust as a social and psychological phenomenon. Recently, we are also witnessing a move of the P2P paradigm to embrace mobile computing in an attempt to achieve even higher ubiquity.
The possibility of services related to physical location, and the relation with agents in physical proximity, could introduce new opportunities and also new technical challenges. Although researchers working on distributed computing, multi-agent systems, databases, and networks have been using similar concepts for a long time, it is only fairly recently that papers motivated by the current P2P paradigm have started appearing in high-quality conferences and workshops. Research in agent systems in particular appears to be most relevant because, since their inception, multi-agent systems have always been thought of as collections of peers. The multi-agent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, dynamic growth, and other advantages of P2P also leads to significant potential problems. Most prominent among these problems are coordination, the ability of an agent to make decisions on its own actions in the context of the activities of other agents, and scalability: the value of P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so forth. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen potential application domains. These two problems are common to many large-scale applications. Without coordination, agents may waste their efforts, squander resources, and fail to achieve their objectives in situations requiring collective effort. This workshop series brings together researchers working on agent systems and P2P computing with the intention of strengthening this connection.
Researchers from other related areas such as distributed systems, networks, and database systems are also welcome (and, in our opinion, have a lot to contribute). We seek high-quality and original contributions on the general theme of "Agents and P2P Computing."


2008 - Automatic annotation for mapping discovery in data integration systems (Extended abstract) [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena
abstract

In this article we present CWSD (Combined Word Sense Disambiguation), a method and a software tool enabling automatic lexical annotation of local (structured and semi-structured) data sources in a data integration system. CWSD is based on the exploitation of WordNet Domains and of the lexical and structural knowledge of the data sources. The method extends the semi-automatic lexical annotation module of the MOMIS data integration system. The distinguishing feature of the method is its independence from, or low dependence on, human intervention. CWSD is a valid method for two important tasks: (1) the source lexical annotation process, i.e. the operation of associating an element of a lexical reference database (WordNet) with all source elements; (2) the discovery of mappings among concepts of distributed data sources/ontologies.
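The intuition of combining structural context with a lexical resource can be illustrated with a toy sketch: choose, for each schema term, the candidate sense whose domain best agrees with the senses of the other terms in the same schema. The sense inventory below is a made-up stand-in for WordNet Domains, and the voting rule is an illustrative simplification, not the actual CWSD algorithm.

```python
# Hypothetical sense inventory: term -> [(sense id, domain), ...].
# A stand-in for WordNet Domains, for illustration only.
SENSES = {
    "bank":   [("bank#1", "finance"), ("bank#2", "geography")],
    "branch": [("branch#1", "finance"), ("branch#2", "botany")],
    "loan":   [("loan#1", "finance")],
}

def annotate(schema_terms):
    """Choose one sense per term, preferring the schema's dominant domain."""
    # Count how often each domain could apply across the whole schema:
    # terms in the same schema tend to share a domain.
    domain_votes = {}
    for term in schema_terms:
        for _, domain in SENSES.get(term, []):
            domain_votes[domain] = domain_votes.get(domain, 0) + 1
    # Annotate each term with its candidate sense from the best-voted domain.
    annotation = {}
    for term in schema_terms:
        candidates = SENSES.get(term, [])
        if candidates:
            annotation[term] = max(
                candidates, key=lambda s: domain_votes.get(s[1], 0)
            )[0]
    return annotation

print(annotate(["bank", "branch", "loan"]))
```

In a banking schema the "finance" domain dominates, so the ambiguous labels `bank` and `branch` resolve to their financial senses rather than the river-bank or tree-branch readings.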


2008 - DEXA 2008: Second international workshop on Semantic Web Architectures for Enterprises - SWAE'08 [Relazione in Atti di Convegno]
Bergamaschi, S.; Guerra, F.; Velegrakis, Y.
abstract

The aim of the second edition of the workshop on Semantic Web Architectures for Enterprises (SWAE) is to evaluate how and how much the Semantic Web vision has met its promises with respect to business and market needs. On the basis of our research experience within the basic research Italian project NeP4B (http://www.dbgroup.unimo.it/nep4b/it/index.htm), the European projects SEWASIE (www.sewasie.org), STASIS (http://www.dbgroup.unimo.it/stasis/), OKKAM (www.okkam.org) and Papyrus (www.ict-papyrus.eu), we focus on the permeation of the Semantic Web technologies in industrial and real applications.


2008 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface [Relazione in Atti di Convegno]
Joseph, S.; Despotovic, Z.; Moro, G.; Bergamaschi, S.
abstract


2008 - Open Source come modello di business per le PMI: analisi critica e casi di studio [Capitolo/Saggio]
Bergamaschi, Sonia; Nigro, Francesco; Po, Laura; Vincini, Maurizio
abstract

Open Source software is attracting attention at all levels, in both the economic and the productive world, because it proposes a new model of technological and economic development that is strongly innovative and breaks with the past. This chapter analyzes the reasons behind the success of this model and presents some case studies in which Open Source proves advantageous, highlighting the most interesting aspects both for users and for software producers.


2008 - Sixth International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2008) [Esposizione]
Bergamaschi, Sonia; Boon Chong, Seet; Noria, Foukia; Nigel, Stanger; Gianluca, Moro
abstract

The aim of this sixth workshop is to explore the promise of P2P to offer exciting new possibilities in distributed information processing and database technologies. Nowadays, network technologies allow the deployment of systems composed of a large number of devices and computers whose complexity may lead to high management costs. Besides, the investments required to guarantee robustness and reliability may become unsustainable. Examples of this kind of system are grid networks and enterprise or governmental server clusters spread all over the world and used for business, social or scientific purposes. Other application scenarios are emerging where the system cannot be configured by specific adjustments on its single components, for instance in sensor networks with thousands or millions of micro-devices. Viable solutions require the system to be able to self-configure, self-manage and self-recover; more generally speaking, it must be able to self-organize without human interaction, adapting its working and optimization strategies to resource usage and to overall efficiency. The P2P paradigm lends itself to constructing large-scale complex, adaptive, autonomous and heterogeneous database and information systems, endowed with clearly specified and differential capabilities to negotiate, bargain, coordinate and self-organize the information exchanges in large-scale networks. Peer-to-peer systems have concretely shown how to aggregate huge computation and information resources from small autonomous and heterogeneous computers. Although no centralized coordination exists, these systems are able to organize themselves and can offer basic services for information discovery. The literature on peer-to-peer systems, which has grown rapidly in the last years, has highlighted the potential of this new paradigm, offering more efficient and reliable solutions for the self-organization of large distributed systems.
The realization of these promises lies fundamentally in the availability of enhanced services such as structured ways of representing, classifying, querying and registering shared information, verification and certification of information, content distribution schemes and quality of content, security features, information discovery and accessibility, interoperation and composition of active information services, and finally market-based mechanisms to allow cooperative and non-cooperative information exchanges. The exploitation of the knowledge extracted from the peers' network is definitely a further potential of such systems. For example, the possibility of performing distributed data mining on the large amount of data that peer-to-peer systems are able to put together, also exploiting their huge parallel computing potential, may lead to extracting knowledge potentially useful for scientific, social and commercial purposes, depending on the network domain. Moreover, in-network data mining algorithms would supply the capability of generating and transmitting high-level models instead of raw data, significantly reducing network traffic; the network could then forecast events and return only relevant information to user requests. The use of semantics for the descriptions of peers and services could introduce new approaches for querying, sharing, distributing and organizing knowledge. Such an approach generates several challenges related to the association of services/contents to ontologies, the interoperability/integration of ontologies, the exploitation of the emergent semantics required for understanding different contents, and the automation of such processes. For example, in mobile computing the possibility of data and services related to physical location, and the relation with peers and sensors in physical proximity, could introduce new opportunities and also new technical challenges.
Such dynamic environments, which are inherently characterized by mobility and heterogeneity of resources like devices, participants, services, information and


2007 - A new type of metadata for querying data integration systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; C., Sartori
abstract

Research on data integration has provided languages and systems able to guarantee an integrated intensional representation of a given set of data sources. A significant limitation common to most proposals is that only intensional knowledge is considered, with little or no consideration for extensional knowledge. In this paper we propose a technique to enrich the intension of an attribute with a new sort of metadata: the "relevant values", extracted from the attribute values. Relevant values enrich schemata with domain knowledge; moreover, they can be exploited by a user in the interactive process of creating/refining a query. The technique, fully implemented in a prototype, is automatic, independent of the attribute domain, and based on data mining clustering techniques and on the semantics emerging from data values. It is parameterized with various metrics for similarity measures and is a viable tool for dealing with frequently changing sources.
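The idea of extracting "relevant values" by clustering an attribute's values can be sketched with a greedy single-pass grouping. The string-similarity metric (`difflib` ratio) and the threshold are illustrative choices, standing in for the various similarity metrics the technique is parameterized with; they are not the metrics evaluated in the paper.

```python
from difflib import SequenceMatcher

def relevant_values(values, threshold=0.7):
    """Greedy clustering sketch: keep one representative per cluster.

    A value joins an existing cluster if it is similar enough to that
    cluster's representative; otherwise it starts a new cluster. The
    representatives are the "relevant values" summarizing the attribute.
    """
    reps = []
    for v in values:
        if all(SequenceMatcher(None, v, r).ratio() < threshold for r in reps):
            reps.append(v)  # v starts a new cluster
    return reps

attr = ["laptop 15-inch", "laptop 13-inch", "desktop tower", "desk lamp"]
print(relevant_values(attr))
```

Near-duplicate values such as `laptop 15-inch` and `laptop 13-inch` collapse into one representative, so a user refining a query sees a short, readable summary of the attribute's domain instead of every raw value.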


2007 - Automatic annotation in data integration systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura; Sorrentino, Serena
abstract

We propose a CWSD (Combined Word Sense Disambiguation) algorithm for the automatic annotation of structured and semi-structured data sources. Rather than being targeted to textual data sources like most of the traditional WSD algorithms found in the literature, our algorithm can exploit information coming from the structure of the sources together with the lexical knowledge associated with the terms (elements of the schemata).


2007 - Automatic annotation of local data sources for data integration systems [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Po, Laura; Sala, Antonio; Sorrentino, Serena
abstract

In this article we present CWSD (Combined Word Sense Disambiguation), a method and a software tool enabling automatic annotation of local structured and semi-structured data sources with lexical information in a data integration system. CWSD is based on the exploitation of WordNet Domains and structural knowledge, and on the extension of the lexical annotation module of the MOMIS data integration system. The distinguishing feature of the algorithm is its low dependence on human intervention. Our approach is a valid method for two important tasks: (1) the source annotation process, i.e. the operation of associating an element of a lexical reference database (WordNet) with all source elements; (2) the discovery of mappings among concepts of distributed data sources/ontologies.


2007 - CEREALAB DATABASE: Data Integration with the MOMIS System [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Sala, Antonio
abstract

Biological information is frequently spread over the Web, and retrieving knowledge in this domain often requires navigating through several websites. Data sources are usually heterogeneous and present different structures and interfaces. Mediator systems can be used to integrate such databases in order to obtain an integrated view of multiple information sources and to query them. The MOMIS system (Mediator envirOnment for Multiple Information Sources) is a framework developed by the Database Group of the University of Modena and Reggio Emilia (www.dbgroup.unimo.it) to perform intelligent information integration from both structured and unstructured data sources. The result of the integration process is a Global Virtual View (GVV) of the underlying sources, which is a conceptualization of the underlying domain and may therefore be thought of as an ontology describing the involved sources. Queries can be posed over the GVV regardless of the structure of the local sources, in a way that is transparent to the user. The MOMIS system has been employed for the realization of the CEREALAB database. CEREALAB is a technology-transfer research project for applying Marker Assisted Selection (MAS) techniques to cereal breeding in Italian seed companies.


2007 - Creating and Querying an Integrated Ontology for Molecular and Phenotypic Cereals Data [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Sala, Antonio
abstract

In this paper we describe the development of an ontology of molecular and phenotypic cereals data, realized by integrating existing public web databases with the database developed by the research group of the CEREALAB project (www.cerealab.org). This integration is obtained using the MOMIS system (Mediator envirOnment for Multiple Information Sources), a mediator-based data integration system developed by the Database Group of the University of Modena and Reggio Emilia (www.dbgroup.unimo.it). MOMIS performs information extraction and integration from both structured and semi-structured data sources in a semi-automatic way, by exploiting the knowledge in a Common Thesaurus (defined by the framework) and the descriptions of source schemas, with a combination of clustering and Description Logics techniques. The result of the integration process is a Global Virtual View (GVV) of the underlying data sources, for which mapping rules and integrity constraints are specified to handle heterogeneity. Each GVV element is annotated w.r.t. the WordNet lexical database (wordnet.princeton.edu). The GVV can be queried transparently with respect to the integrated data sources, using an easy-to-use graphical interface, regardless of the specific languages of the source databases.


2007 - Databases, Information Systems, and Peer-to-Peer Computing, International Workshops, DBISP2P 2005/2006, Trondheim, Norway, August 28-29, 2005, Seoul, Korea, September 11, 2006, Revised Selected Papers [Curatela]
Gianluca, Moro; Bergamaschi, Sonia; Sam, Joseph; Jean Henry, Morin; Aris M., Ouksel
abstract

Proceedings of the 2005 and 2006 editions of the workshops on Databases, Information Systems, and Peer-to-Peer Computing.


2007 - Development of an On-Line Database of Molecular and Phenotypic Data for Marker Assisted Selection of Cereals. [Abstract in Atti di Convegno]
Milc, Justyna Anna; Albertazzi, Giorgia; Caffagni, Alessandra; Sala, Antonio; Francia, Enrico; Barbieri, Mirko; Bergamaschi, Sonia; Pecchioni, Nicola
abstract

-


2007 - Extracting Relevant Attribute Values for Improved Search [Articolo su rivista]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; C., Sartori
abstract

A new kind of metadata offers a synthesized view of an attribute's values for a user to exploit when creating or refining a search query in data-integration systems. The extraction technique that obtains these values is automatic and independent of an attribute domain but parameterized with various metrics for similarity measures. The authors describe a fully implemented prototype and some experimental results to show the effectiveness of "relevant values" when searching a knowledge base.


2007 - Fifth International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2007) [Esposizione]
Bergamaschi, Sonia; Zoran, Despotovic; Sam, Joseph; Gianluca, Moro
abstract

The aim of the workshop is to explore the promise of P2P to offer exciting new possibilities in distributed information processing and database technologies. The realization of these promises lies fundamentally in the availability of enhanced services such as structured ways for classifying and registering shared information, verification and certification of information, content distribution schemes and quality of content, security features, information discovery and accessibility, interoperation and composition of active information services, and finally market-based mechanisms to allow cooperative and non-cooperative information exchanges. The P2P paradigm lends itself to constructing large-scale complex, adaptive, autonomous and heterogeneous database and information systems, endowed with clearly specified and differential capabilities to negotiate, bargain, coordinate and self-organize the information exchanges in large-scale networks. This vision will have a radical impact on the structure of complex organizations (business, scientific or otherwise), on the emergence and formation of social communities, and on how information is organized and processed. Recently, the P2P paradigm has been embracing mobile computing and ad-hoc networks in an attempt to achieve even higher ubiquity. The possibility of data and services related to physical location, and the relation with peers and sensors in physical proximity, could introduce new opportunities and also new technical challenges. Such dynamic environments, which are inherently characterized by high mobility and heterogeneity of resources such as devices, participants, services, information and data representation, pose several issues on how to search for and localize resources and how to efficiently route traffic, up to higher-level problems related to semantic interoperability and information relevance.
The use of ontologies for the description of peers and services could introduce new approaches for querying, sharing, distributing and organizing knowledge. Nevertheless, several challenges arise, related to the association of services/contents with ontologies, the interoperability/integration of ontologies required for understanding different contents, and the automation of such processes. The workshop builds on the success of the four preceding editions held since VLDB 2003, whose proceedings have always been published by Springer in the Lecture Notes in Computer Science series. It concentrates on exploring the synergies between current database research and P2P computing; indeed, it is our belief that database research has much to contribute to the P2P grand challenge through its wealth of techniques for sophisticated semantics-based data models, new indexing algorithms and efficient data placement, query processing techniques and transaction processing. Database technologies in the new information age will form the crucial components of the first generation of complex adaptive P2P information systems, which will be characterized by their ability to continuously self-organize, adapt to new circumstances, promote emergence as an inherent property, optimize locally but not necessarily globally, and deal with approximation and incompleteness. This workshop also concentrates on the impact of complex adaptive information systems on current database technologies and their relation to emerging industrial technologies. The workshop's co-location with VLDB (http://www.vldb2007.org/), the major international database and information systems conference, is important in order to actually bring together key researchers from all over the world working on databases and P2P computing, with the intention of strengthening this connection.
Researchers from other related areas such as distributed systems, networks, multi-agent systems and complex systems are also invited; indeed, we believe that especially in the P2P paradigm, as an interdisciplinary theme, different approaches and points of view may generate c


2007 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface [Curatela]
Bergamaschi, S.; Joseph, S.; Morin, J. -H.; Moro, G.; Ouksel, A. M.
abstract


2007 - MELIS: An Incremental Method For The Lexical Annotation Of Domain Ontologies [Relazione in Atti di Convegno]
Bergamaschi, Sonia; P., Bouquet; D., Giacomuzzi; Guerra, Francesco; Po, Laura; Vincini, Maurizio
abstract

In this paper, we present MELIS (Meaning Elicitation and Lexical Integration System), a method and a software tool for enabling an incremental process of automatic annotation of local schemas (e.g. relational database schemas, directory trees) with lexical information. The distinguishing and original feature of MELIS is its incrementality: the higher the number of schemas which are processed, the more background/domain knowledge is accumulated in the system (a portion of the domain ontology is learned at every step), and the better the performance of the system in annotating new schemas. MELIS has been tested as a component of the MOMIS-Ontology Builder, a framework able to create a domain ontology representing a set of selected data sources, described with a standard W3C language wherein concepts and attributes are annotated according to the lexical reference database. We describe the MELIS component within the MOMIS-Ontology Builder framework and provide some experimental results of MELIS as a standalone tool and as a component integrated in MOMIS.


2007 - MELIS: a tool for the incremental annotation of domain ontologies [Software]
Bergamaschi, Sonia; Paolo, Bouquet; Daniel, Giacomuzzi; Guerra, Francesco; Po, Laura; Vincini, Maurizio
abstract

MELIS is a software tool for enabling an incremental process of automatic annotation of local schemas (e.g. relational database schemas, directory trees) with lexical information. The distinguishing and original feature of MELIS is its incrementality: the higher the number of schemas which are processed, the more background/domain knowledge is accumulated in the system (a portion of the domain ontology is learned at every step), and the better the performance of the system in annotating new schemas.


2007 - Melis: an incremental method for the lexical annotation of domain ontologies [Articolo su rivista]
Bergamaschi, Sonia; P., Bouquet; D., Giacomuzzi; Guerra, Francesco; Po, Laura; Vincini, Maurizio
abstract

In this paper, we present MELIS (Meaning Elicitation and Lexical Integration System), a method and a software tool for enabling an incremental process of automatic annotation of local schemas (e.g. relational database schemas, directory trees) with lexical information. The distinguishing and original feature of MELIS is its incremental process: the higher the number of schemas which are processed, the more background/domain knowledge is accumulated in the system (a portion of the domain ontology is learned at every step), and the better the performance of the system in annotating new schemas. MELIS has been tested as a component of the MOMIS-Ontology Builder, a framework able to create a domain ontology representing a set of selected data sources, described with a standard W3C language wherein concepts and attributes are annotated according to the lexical reference database. We describe the MELIS component within the MOMIS-Ontology Builder framework and provide some experimental results of MELIS as a standalone tool and as a component integrated in MOMIS.


2007 - Progetto di Basi di Dati Relazionali [Monografia/Trattato scientifico]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The aim of this volume is to provide the reader with the fundamental notions for designing and implementing relational database applications. On the design side, it covers the conceptual and logical design phases and presents the Entity-Relationship and Relational data models, which constitute the basic tools for conceptual and logical design, respectively. The student is also introduced to the theory of normalization of relational databases. On the implementation side, elements and examples of SQL, the standard language for RDBMSs (Relational Database Management Systems), are presented. Ample space is devoted to worked exercises on the topics covered. The volume grows out of the authors' long teaching experience in the Databases and Information Systems courses for undergraduate and graduate students of the Faculty of Engineering of Modena, the Faculty of Engineering of Reggio Emilia, and the Faculty of Economics "Marco Biagi" of the University of Modena and Reggio Emilia. The current volume considerably extends the previous editions, enriching the sections on logical design and SQL. The exercise section is completely new; furthermore, additional exercises are available on this web page. Like the previous editions, it is more a collection of lecture notes than a proper book, in the sense that it treats the concepts rigorously but essentially. Moreover, it does not cover all the topics of a Databases course, whose other fundamental component is database technology. In the authors' opinion, this component is treated excellently in another Databases textbook, written by our colleagues and friends Paolo Ciaccia and Dario Maio of the University of Bologna.
Despite its essential character, the volume is rich in worked exercises and can therefore be an excellent tool for working groups that, within software houses, deal with the design of relational database applications.


2007 - Query Translation on heterogeneous sources in MOMIS Data Transformation Systems [Relazione in Atti di Convegno]
Beneventano, Domenico; Vincini, Maurizio; Orsini, Mirko; Bergamaschi, Sonia; Nana, C.
abstract



2007 - Querying a super-peer in a schema-based super-peer network [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

We propose a novel approach for defining and querying a super-peer within a schema-based super-peer network organized into a two-level architecture: the lower level, called the peer level, contains mediator nodes; the upper level, called the super-peer level, integrates mediator peers with similar content. We focus on a single super-peer and propose a method to define and answer a query, fully implemented in the SEWASIE project prototype. The problem we face is relevant because a super-peer is a two-level data integration system, thus going beyond the traditional setting in data integration. We have two different levels of Global-as-View mappings: the first mapping is at the super-peer level and maps the Global Virtual Views (GVVs) of several peers into the GVV of the super-peer; the second mapping is within a peer and maps the data sources into the GVV of the peer. Moreover, we propose an approach where the integration designer, supported by a graphical interface, can implicitly define mappings by using resolution functions to solve data conflicts, and the Full Disjunction operator, which has been recognized as providing a natural semantics for data-merging queries.
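The idea of merging peer records while settling data conflicts through resolution functions can be sketched in a few lines; the records, attribute names and the `max` resolution choice below are invented for illustration and are not taken from the SEWASIE prototype:

```python
# Hedged sketch: merge records describing the same entity coming from two
# peers, keeping all attributes (outer-join style, in the spirit of Full
# Disjunction) and applying a per-attribute resolution function on conflicts.

def resolve(records, resolution):
    """Merge dict records; `resolution` maps attribute -> function that
    picks one value out of a list of conflicting values."""
    merged = {}
    attrs = set().union(*(r.keys() for r in records))
    for attr in attrs:
        values = [r[attr] for r in records if r.get(attr) is not None]
        if not values:
            merged[attr] = None                    # attribute absent everywhere
        elif len(set(values)) == 1:
            merged[attr] = values[0]               # no conflict
        else:
            # conflict: apply the designer-chosen resolution function,
            # defaulting to "first value wins"
            merged[attr] = resolution.get(attr, lambda vs: vs[0])(values)
    return merged

# Illustrative peer records (hypothetical data)
peer_a = {"name": "ACME", "revenue": 1200, "city": "Modena"}
peer_b = {"name": "ACME", "revenue": 1500, "employees": 40}

merged = resolve([peer_a, peer_b], {"revenue": max})
# conflicting 'revenue' resolved by max; non-shared attributes are preserved
```

In a real integration system the resolution functions would be declared per attribute by the integration designer (e.g. most recent, most trusted source), but the merge skeleton stays the same.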


2007 - RELEvant VAlues geNeraTor [Software]
Bergamaschi, Sonia; Claudio, Sartori; Guerra, Francesco; Orsini, Mirko
abstract

A new kind of metadata offers a synthesized view of an attribute's values for a user to exploit when creating or refining a search query in data-integration systems. The extraction technique that obtains these values is automatic and independent of an attribute domain but parameterized with various metrics for similarity measures.


2007 - Relevant News: a semantic news feed aggregator [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; Sartori, C; Vincini, Maurizio
abstract

In this paper we present RELEVANTNews, a web feed reader that automatically groups news related to the same topic published in different newspapers on different days. The tool is based on RELEVANT, a previously developed tool which computes the "relevant values", i.e. a subset of the values of a string attribute. By clustering the titles of the news feeds selected by the user, it is possible to identify sets of related news on the basis of syntactic and lexical similarity. RELEVANTNews may be used in its default configuration or in a personalized way: the user may tune some parameters in order to improve the grouping results. We tested the tool with more than 700 news items published in 30 newspapers over four days, and some preliminary results are discussed.


2007 - Relevant values: new metadata to provide insight on attribute values at schema level [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Orsini, Mirko; C., Sartori
abstract

Research on data integration has provided languages and systems able to guarantee an integrated intensional representation of a given set of data sources. A significant limitation common to most proposals is that only intensional knowledge is considered, with little or no consideration for extensional knowledge. In this paper we propose a technique to enrich the intension of an attribute with a new sort of metadata: the "relevant values", extracted from the attribute values. Relevant values enrich schemata with domain knowledge; moreover, they can be exploited by a user in the interactive process of creating/refining a query. The technique, fully implemented in a prototype, is automatic, independent of the attribute domain, and based on data-mining clustering techniques and the semantics emerging from data values. It is parameterized with various metrics for similarity measures and is a viable tool for dealing with frequently changing sources, as in the Semantic Web context.
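As a rough illustration of the idea, not the actual RELEVANT algorithm or its metrics, relevant-value extraction can be sketched as greedy clustering of an attribute's values by token overlap, keeping one representative per cluster; the Jaccard measure, the threshold and the sample values here are all assumptions:

```python
# Hedged sketch: group similar string values of an attribute and keep one
# representative per group as its "relevant values". Greedy clustering with
# a token-level Jaccard similarity; both are illustrative choices.

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def relevant_values(values, threshold=0.3):
    clusters = []  # each cluster is a list of similar values
    for v in values:
        for cluster in clusters:
            if jaccard(v, cluster[0]) >= threshold:
                cluster.append(v)   # close enough to this cluster's seed
                break
        else:
            clusters.append([v])    # no cluster matched: start a new one
    # one representative per cluster (here: the shortest value)
    return [min(c, key=len) for c in clusters]

vals = ["assistant professor", "associate professor",
        "full professor", "research fellow"]
print(relevant_values(vals))
```

A user refining a query would then see the handful of representatives instead of the full value list; a production version would use proper clustering and tuned similarity metrics, as the abstract indicates.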


2007 - Semantic search engines based on data integration systems [Capitolo/Saggio]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

As the use of the World Wide Web has become increasingly widespread, the business of commercial search engines has become a vital and lucrative part of the Web. Search engines are commonplace tools for virtually every user of the Internet, and companies such as Google and Yahoo! have become household names. Semantic search engines try to augment and improve traditional web search engines by using not just words, but concepts and logical relationships. In this chapter a relevant class of semantic search engines, based on a peer-to-peer, mediator-based data integration architecture, is described. The architectural and functional features are presented with respect to two projects involving the authors, SEWASIE and WISDOM. The methodology to create a two-level ontology and query processing in the SEWASIE project are fully described.


2007 - Sixth International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2007) [Esposizione]
Bergamaschi, Sonia; Zoran, Despotovic; Sam, Joseph; Gianluca, Moro
abstract

Peer-to-peer (P2P) computing has attracted enormous media attention, initially spurred by the popularity of file-sharing systems such as Napster, Gnutella, and Morpheus. More recently, systems like BitTorrent and eDonkey have continued to sustain that attention. New techniques such as distributed hash tables (DHTs), semantic routing, and Plaxton meshes are being combined with traditional concepts such as hypercubes, trust metrics and caching techniques to pool together the untapped computing resources at the "edges" of the Internet. These new techniques and possibilities have generated a lot of interest in many industrial organizations and have resulted in the creation of a P2P working group on standardization in this area (http://www.irtf.org/charter?gtype=rg&group=p2prg). In P2P computing, peers and services forego central coordination and dynamically organise themselves to support knowledge sharing and collaboration, in both cooperative and non-cooperative environments. The success of P2P systems strongly depends on a number of factors. First, the ability to ensure an equitable distribution of content and services: economic and business models which rely on incentive mechanisms to supply contributions to the system are being developed, along with methods for controlling the "free riding" issue. Second, the ability to enforce the provision of trusted services: reputation-based P2P trust management models are becoming a focus of the research community as a viable solution. The trust models must balance both the constraints imposed by the environment (e.g. scalability) and the unique properties of trust as a social and psychological phenomenon. Recently, we are also witnessing a move of the P2P paradigm to embrace mobile computing in an attempt to achieve even higher ubiquity.
The possibility of services related to physical location, and the relation with agents in physical proximity, could introduce new opportunities and also new technical challenges. Although researchers working on distributed computing, multi-agent systems, databases and networks have been using similar concepts for a long time, it is only fairly recently that papers motivated by the current P2P paradigm have started appearing in high-quality conferences and workshops. Research in agent systems in particular appears to be most relevant because, since their inception, multi-agent systems have always been thought of as collections of peers. The multi-agent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, dynamic growth and other advantages of P2P also leads to significant potential problems. Most prominent among these problems are coordination, the ability of an agent to make decisions on its own actions in the context of the activities of other agents, and scalability: the value of P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so forth. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen potential application domains. These two problems are common to many large-scale applications. Without coordination, agents may waste their efforts, squander resources and fail to achieve their objectives in situations requiring collective effort. This workshop will bring together researchers working on agent systems and P2P computing with the intention of strengthening this connection.
The increasing interest in this research area is evident in that the four previous editions of AP2PC have been among the most popular AAMAS workshops in terms of participation. Research in agents and peer-to-peer is by its nature interdisciplinary and offers a challenge


2007 - The SEWASIE MAS for Semantic Search [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The capillary diffusion of the Internet has made accessible an overwhelming amount of data, allowing users to benefit from vast information. However, this information is not really directly available: Internet data are heterogeneous and spread over different places, with several duplications and inconsistencies. The integration of such heterogeneous, inconsistent data, with data reconciliation and data fusion techniques, may therefore represent a key activity enabling a more organized and semantically meaningful access to data sources. Some issues remain to be solved, concerning in particular the discovery and explicit specification of the relationships between abstract data concepts, and the need for data reliability in dynamic, constantly changing networks. Ontologies provide a key mechanism for solving these challenges, but the web's dynamic nature leaves open the question of how to manage them. Many solutions based on ontology creation by a mediator system have been proposed: a unified virtual view (the ontology) of the underlying data sources is obtained, giving users transparent access to the integrated data sources. The centralized architecture of a mediator system presents several limitations, emphasized in the hidden web: firstly, web data sources hold information according to their particular view of the matter, i.e. each of them uses a specific ontology to represent its data. Also, data sources are usually isolated, i.e. they do not share any topological information concerning the content or structure of other sources. Our proposal is to develop a network of ontology-based mediator systems, where mediators are not isolated from each other and include tools for sharing and mapping their ontologies. In this paper, we describe the use of a multi-agent architecture to achieve and manage the mediator network.
The functional architecture is composed of single peers (implemented as mediator agents) independently carrying out their own integration activities. Such agents may then exchange data and knowledge with other peers by means of specialized agents (called brokering agents) which provide a coherent access plan to the peer network. In this way, two layers are defined in the architecture: at the local level, peers maintain an integrated view of local sources; at the network level, agents maintain mappings among the different peers. The result is the definition of a new type of mediator system network intended to operate in web economies, which we realized within SEWASIE (SEmantic Webs and AgentS in Integrated Economies), an RDT project supported by the 5th Framework IST program of the European Community, successfully ended in September 2005.


2007 - The SEWASIE Network of Mediator Agents for Semantic Search [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

Integration of heterogeneous information in the context of the Internet becomes a key activity to enable more organized and semantically meaningful access to data sources. As the Internet can be viewed as a data-sharing network where sites are data sources, the challenge is twofold. First, sources present information according to their particular view of the matter, i.e. each of them assumes a specific ontology. Second, data sources are usually isolated, i.e. they do not share any topological information concerning the content or the structure of other sources. The classical approach to solving these issues is provided by mediator systems, which aim at creating a unified virtual view of the underlying data sources in order to hide the heterogeneity of the data and give users transparent access to the integrated information. In this paper we propose to use a multi-agent architecture to build and manage a mediator network. While a single peer (i.e. a mediator agent) independently carries out data integration activities, it exchanges knowledge with other peers by means of specialized agents (i.e. brokers) which provide a coherent plan to access information in the peer network. This defines two layers in the system: at the local level, peers maintain an integrated view of local sources, while at the network level agents maintain mappings among the different peers. The result is the definition of a new networked mediator system intended to operate in web economies, which we realized in the SEWASIE (SEmantic Webs and AgentS in Integrated Economies) project. SEWASIE is an RDT project supported by the 5th Framework IST program of the European Community, successfully ended in September 2005.


2007 - W10 - SWAE '07: 1st International Workshop on Semantic Web Architectures for Enterprises [Esposizione]
Bergamaschi, Sonia; Paolo, Bouquet; Guerra, Francesco
abstract

SWAE aims at evaluating how, and how much, the Semantic Web vision has met its promises with respect to business and market needs. Even though the Semantic Web is a relatively new branch of scientific and technological research, its relevance has already been envisaged for some crucial business processes:
Semantic-based business data integration: data integration satisfies both the "structural" requirements of enterprises (e.g. the possibility of consulting their data in a unified manner) and "dynamic" requirements (e.g. business-to-business partnerships to execute an order). Information systems implementing Semantic Web architectures can strongly support this process, or simply enable it.
Semantic interoperability: metadata and ontologies support the dynamic and flexible exchange of data and services across the information systems of different organizations. The development of applications for the automatic classification of services and goods on the basis of standard hierarchies, and the translation of such classifications into the different standards used by companies, is a clear example of the potential for semantic interoperability methods and tools.
Knowledge management: ontologies and automated reasoning tools seem to provide innovative support for the elicitation, representation and sharing of corporate knowledge, in particular for the shift from a document-centric to an entity-centric KM approach.
Enterprise and process modeling: ontologies and rules are becoming an effective way to model corporate processes and business domains (for example, in cost reduction).
The goal of the workshop is to evaluate and assess how deep the permeation of Semantic Web models, languages, technologies and applications has been in effective enterprise business applications.
It will also identify how Semantic-Web-based systems, methods and theories sustain business applications such as decision processes, workflow management processes, accountability, and production chain management. Particular attention will be dedicated to metrics and criteria that evaluate the cost-effectiveness of system design processes, knowledge encoding and management, system maintenance, etc.


2007 - dBase CEREALAB [Software]
Pecchioni, Nicola; Milc, Justyna Anna; Sala, Antonio; Bergamaschi, Sonia
abstract

The CEREALAB database, an information system for breeders, is a source of molecular and phenotypic data, realized by integrating two already existing web databases, Gramene and GrainGenes, together with the source storing the information obtained by the research groups of the CEREALAB project. The new data derives from systematic genotyping work using already known markers and some brand new protocols developed by the discovery work package of the project. This integration is obtained using the MOMIS system (Mediator Environment for Multiple Information Sources). The result is a queryable virtual view that integrates the three sources and allows the selection of cultivars of barley, wheat and rice based on molecular data and phenotypic traits, regardless of the specific languages of the three source databases. The phenotypic characters to be included in the database have been chosen among those of major interest for breeders and divided into six categories, including Abiotic Stress, Biotic Stress, Growth and Development, Quality and Yield. As far as molecular data is concerned, the major categories for the query are: Trait, QTL, Gene and Marker.


2006 - An incremental method for meaning elicitation of a domain ontology [Relazione in Atti di Convegno]
Bergamaschi, Sonia; P., Bouquet; D., Giacomuzzi; Guerra, Francesco; Po, Laura; Vincini, Maurizio
abstract

The Internet has opened access to an overwhelming amount of data, requiring the development of new applications to automatically recognize, process and manage information available in web sites or web-based applications. The standard Semantic Web architecture exploits ontologies to give a shared (and known) meaning to each web source element. In this context, we developed MELIS (Meaning Elicitation and Lexical Integration System). MELIS couples the lexical annotation module of the MOMIS system with some components from CTXMATCH2.0, a tool for eliciting meaning from several types of schemas and matching them. MELIS uses the MOMIS WNEditor and CTXMATCH2.0 to support two main tasks in the MOMIS ontology generation methodology: the source annotation process, i.e. the operation of associating an element of a lexical database to each source element, and the extraction of lexical relationships among elements of different data sources.


2006 - An intelligent data integration approach for collaborative project management in virtual enterprises [Articolo su rivista]
Bergamaschi, Sonia; Gelati, Gionata; Guerra, Francesco; Vincini, Maurizio
abstract

The increasing globalization and flexibility required of companies has generated new issues in the last decade, related to managing large-scale projects and to the cooperation of enterprises within geographically distributed networks. ICT support systems are required to help enterprises share information, guarantee data consistency and establish synchronized and collaborative processes. In this paper we present a collaborative project management system that integrates data coming from aerospace industries with a main goal: to facilitate the assembly, integration and verification activities of a multi-enterprise project. The main achievement of the system from a data management perspective is to avoid inconsistencies generated by updates at the source level and to minimize data replication. The developed system is composed of a collaborative project management component supported by a web interface, a multi-agent data integration system which supports information sharing and querying, and web services that ensure the interoperability of the software components. The system was developed by the University of Modena and Reggio Emilia and Gruppo Formula S.p.A., and tested by Alenia Spazio S.p.A. within the EU WINK Project (Web-linked Integration of Network-based Knowledge, IST-2000-28221).


2006 - Fifth International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2006) [Esposizione]
Bergamaschi, Sonia; Zoran, Despotovic; Sam, Joseph; Gianluca, Moro
abstract

Peer-to-peer (P2P) computing has attracted enormous media attention, initially spurred by the popularity of file sharing systems such as Napster, Gnutella, and Morpheus. More recently systems like BitTorrent and eDonkey have continued to sustain that attention. New techniques such as distributed hash-tables (DHTs), semantic routing, and Plaxton Meshes are being combined with traditional concepts such as Hypercubes, Trust Metrics and caching techniques to pool together the untapped computing resources at the "edges" of the internet. These new techniques and possibilities have generated a lot of interest in many industrial organizations, and has resulted in the creation of a P2P working group on standardization in this area. (http://www.irtf.org/charter?gtype=rg&group=p2prg).In P2P computing peers and services forego central coordination and dynamically organise themselves to support knowledge sharing and collaboration, in both cooperative and non-cooperative environments. The success of P2P systems strongly depends on a number of factors. Firstly, the ability to ensure equitable distribution of content and services. Economic and business models which rely on incentive mechanisms to supply contributions to the system are being developed, along with methods for controlling the "free riding" issue. Second, the ability to enforce provision of trusted services. Reputation based P2P trust management models are becoming a focus of the research community as a viable solution. The trust models must balance both constraints imposed by the environment (e.g. scalability) and the unique properties of trust as a social and psychological phenomenon. Recently, we are also witnessing a move of the P2P paradigm to embrace mobile computing in an attempt to achieve even higher ubiquitousness. 
The possibility of services related to physical location and the relation with agents in physical proximity could introduce new opportunities and also new technical challenges.Although researchers working on distributed computing, MultiAgent Systems, databases and networks have been using similar concepts for a long time, it is only fairly recently that papers motivated by the current P2P paradigm have started appearing in high quality conferences and workshops. Research in agent systems in particular appears to be most relevant because, since their inception, MultiAgent Systems have always been thought of as collections of peers.The MultiAgent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, dynamic growth and other advantages of P2P, also leads to significant potential problems. Most prominent among these problems are coordination: the ability of an agent to make decisions on its own actions in the context of activities of other agents, and scalability: the value of the P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so forth. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen potential application domains. These two problems are common to many large-scale applications. Without coordination, agents may be wasting their efforts, squander resources and fail to achieve their objectives in situations requiring collective effort.This workshop will bring together researchers working on agent systems and P2P computing with the intention of strengthening this connection. 
The increasing interest in this research area is evident in that the four previous editions of AP2PC have been among the most popular AAMAS workshops in terms of participation. Research in Agents and Peer-to-Peer is by its nature interdisciplinary and offers a challenge


2006 - Fourth International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2006) [Esposizione]
Bergamaschi, Sonia; Sam, Joseph; Jean Henry, Morin; Gianluca, Moro
abstract

The aim of this fourth workshop is to explore the promise of P2P to offer exciting new possibilities in distributed information processing and database technologies. The realization of this promise lies fundamentally in the availability of enhanced services such as structured ways for classifying and registering shared information, verification and certification of information, content distribution schemes and quality of content, security features, information discovery and accessibility, interoperation and composition of active information services, and finally market-based mechanisms to allow cooperative and non-cooperative information exchanges. The P2P paradigm lends itself to constructing large-scale, complex, adaptive, autonomous and heterogeneous database and information systems, endowed with clearly specified and differential capabilities to negotiate, bargain, coordinate and self-organize the information exchanges in large-scale networks. This vision will have a radical impact on the structure of complex organizations (business, scientific or otherwise), on the emergence and the formation of social communities, and on how information is organized and processed. Recently, the P2P paradigm has been embracing mobile computing and ad-hoc networks in an attempt to achieve even higher ubiquity. The possibility of data and services related to physical location and the relation with peers and sensors in physical proximity could introduce new opportunities as well as new technical challenges. Such dynamic environments, which are inherently characterized by high mobility and heterogeneity of resources like devices, participants, services, information and data representation, pose several issues, from how to search for and localize resources and how to efficiently route traffic, up to higher-level problems related to semantic interoperability and information relevance.
The use of ontologies for the description of peers and services could introduce new approaches for querying, sharing, distributing and organizing knowledge. Nevertheless, several challenges arise, related to the association of services/contents with ontologies, the interoperability/integration of the ontologies required for understanding different contents, and the automation of such processes. A sample application scenario is the offer of new services for business trades on the basis of client requirements, where both offers and requirements are established by means of (different) ontologies. On the basis of the physical location, the client ontology contacts other ontologies, executing automatic integration/interoperation/reconciliation processes whenever information is expressed according to different ontologies. Analogous issues and similar scenarios can be depicted for static and wireless connectivity, and for static and mobile architectures. The proposed workshop will build on the success of the three preceding editions at VLDB 2003, 2004 and 2005. It will concentrate on exploring the synergies between current database research and P2P computing. It is our belief that database research has much to contribute to the P2P grand challenge through its wealth of techniques for sophisticated semantics-based data models, new indexing algorithms and efficient data placement, query processing techniques and transaction processing. Database technologies in the new information age will form the crucial components of the first generation of complex adaptive P2P information systems, which will be characterized by their ability to continuously self-organize, adapt to new circumstances, promote emergence as an inherent property, optimize locally but not necessarily globally, and deal with approximation and incompleteness.
This workshop will also concentrate on the impact of complex adaptive information systems on current database technologies and their relation to emerging industrial technologies such as IBM's autonomic computing initiative. The workshop will be co-located with VLDB, the major international database and information systems conference.


2006 - Instances Navigation for Querying Integrated Data from Web-Sites [Capitolo/Saggio]
Beneventano, Domenico; Bergamaschi, Sonia; Bruschi, Stefania; Guerra, Francesco; Orsini, Mirko; Vincini, Maurizio
abstract

Research on data integration has provided a set of rich and well understood schema mediation languages and systems that provide a meta-data representation of the modeled real world, while, in general, they do not deal with data instances. Such meta-data are necessary for querying the classes resulting from an integration process: the end user typically does not know the contents of such classes, and simply defines his queries on the basis of the names of classes and attributes. In this paper we introduce an approach that enriches the description of selected attributes by specifying as meta-data a list of the “relevant values” for such attributes. Furthermore, relevant values may be hierarchically collected in a taxonomy. In this way, the user may exploit the new meta-data in the interactive process of creating/refining a query. The same meta-data are also exploited by the system in the query rewriting/unfolding process in order to filter the results shown to the user. We conducted an evaluation of the strategy in an e-business context within the EU-IST SEWASIE project. The evaluation proved the practicability of the approach for attributes with large sets of values.


2006 - Instances navigation for querying integrated data from web-sites [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Bruschi, Stefania; Guerra, Francesco; Orsini, Mirko; Vincini, Maurizio
abstract

Research on data integration has provided a set of rich and well understood schema mediation languages and systems that provide a meta-data representation of the modeled real world, while, in general, they do not deal with data instances. Such meta-data are necessary for querying the classes resulting from an integration process: the end user typically does not know the contents of such classes, and simply defines his queries on the basis of the names of classes and attributes. In this paper we introduce an approach that enriches the description of selected attributes by specifying as meta-data a list of the “relevant values” for such attributes. Furthermore, relevant values may be hierarchically collected in a taxonomy. In this way, the user may exploit the new meta-data in the interactive process of creating/refining a query. The same meta-data are also exploited by the system in the query rewriting/unfolding process in order to filter the results shown to the user. We conducted an evaluation of the strategy in an e-business context within the EU-IST SEWASIE project. The evaluation proved the practicability of the approach for attributes with large sets of values.


2006 - Semantic search engines based on data integration systems [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

As the use of the World Wide Web has become increasingly widespread, the business of commercial search engines has become a vital and lucrative part of the Web. Search engines are commonplace tools for virtually every user of the Internet, and companies such as Google and Yahoo! have become household names. Semantic Search Engines try to augment and improve traditional Web Search Engines by using not just words, but concepts and logical relationships. In this chapter a relevant class of Semantic Search Engines, based on a peer-to-peer, data integration mediator-based architecture, is described. The architectural and functional features are presented with respect to two projects, SEWASIE and WISDOM, involving the authors. The methodology to create a two-level ontology and query processing in the SEWASIE project are fully described.


2006 - Virtual Integration of Existing Web Databases for the Genotypic Selection of Cereal Cultivars [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Sala, Antonio
abstract

The paper presents the development of a virtual database for the genotypic selection of cereal cultivars starting from phenotypic traits. The database is realized by integrating two existing web databases, Gramene and Graingenes, and a pre-existing data source developed by the Agrarian Faculty of the University of Modena and Reggio Emilia. The integration process gives rise to a virtual integrated view of the underlying sources. This integration is obtained using the MOMIS system (Mediator envirOnment for Multiple Information Sources), a framework developed by the Database Group of the University of Modena and Reggio Emilia. MOMIS performs information extraction and integration from both structured and semistructured data sources. Information integration is performed in a semi-automatic way, by exploiting the knowledge in a Common Thesaurus (defined by the framework) and the descriptions of source schemas with a combination of clustering and Description Logics techniques. MOMIS allows querying information in a transparent mode for the user regardless of the specific languages of the sources. The result obtained by applying MOMIS to the Gramene and Graingenes web databases is a queryable virtual view that integrates the two sources and allows performing genotypic selection of cultivars of barley, wheat and rice based on phenotypic traits, regardless of the specific languages of the web databases. The project is conducted in collaboration with the Agrarian Faculty of the University of Modena and Reggio Emilia and funded by the Regional Government of Emilia Romagna.


2005 - Agents and Peer-to-Peer Computing, Third International Workshop, AP2PC 2004, New York, NY, USA, July 19, 2004, Revised and Invited Papers [Curatela]
Gianluca, Moro; Bergamaschi, Sonia; Karl, Aberer
abstract

Proceedings of the workshop Agents and Peer-to-Peer Computing held in 2004.


2005 - Building a tourism information provider with the MOMIS system [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The tourism industry is a good candidate for taking up Semantic Web technology. In fact, there are many portals and websites belonging to the tourism domain that promote tourist products (places to visit, food to eat, museums, etc.) and tourist services (hotels, events, etc.), published by several operators (tourist promoter associations, public agencies, etc.). This article presents how the MOMIS system may be used for building a tourism information provider by exploiting the tourism information that is available on Internet websites. MOMIS (Mediator envirOnment for Multiple Information Sources) is a mediator framework that performs information extraction and integration from heterogeneous distributed data sources and includes query management facilities to transparently support queries posed to the integrated data sources.


2005 - Lecture Notes in Artificial Intelligence: Preface [Relazione in Atti di Convegno]
Moro, G.; Bergamaschi, S.; Aberer, K.
abstract


2005 - SEWASIE - SEmantic Webs and AgentS in Integrated Economies. [Software]
Bergamaschi, Sonia; Beneventano, Domenico; Vincini, Maurizio; Guerra, Francesco
abstract

SEWASIE (SEmantic Webs and AgentS in Integrated Economies) aimed to design and implement an advanced search engine enabling intelligent access to heterogeneous data sources on the web via semantic enrichment, providing the basis of structured, secure web-based communication. SEWASIE implemented such a search engine, providing users with a search client that has an easy-to-use query interface, can extract the required information from the Internet, and can show it in a useful and user-friendly format. From an architectural point of view, the prototype provides a search engine client, indexing servers, and ontologies.


2005 - Speaking Words of WISDOM: Web Intelligent Search based on DOMain ontologies [Relazione in Atti di Convegno]
Bergamaschi, Sonia; P., Bouquet; P., Ciaccia; P., Merialdo
abstract

In this paper we present the architecture of a system for searching and querying information sources available on the web, developed as part of a project called WISDOM. A key feature of our proposal is a distributed architecture based on (i) the peer-to-peer paradigm and (ii) the adoption of domain ontologies. At the lower level, we support a strong, ontology-based integration of the information content of a bunch of source peers, which form a so-called semantic peer. At the upper level, we provide a loose, mapping-based integration of a set of semantic peers. We then show how queries can be efficiently managed and distributed in such a two-layer scenario.


2005 - The SEWASIE multi-agent system [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Fillottrani, Pr; Gelati, Gionata
abstract

Data integration, in the context of the web, faces new problems, due in particular to the heterogeneity of sources, to the fragmentation of the information and to the absence of a unique way to structure and view information. In such areas, the traditional paradigms on which database foundations are based (i.e. client/server architecture, few sources containing large information) have to be overcome by new architectures. In this paper we propose a layered P2P architecture for mediator systems. Peers are information nodes which are coordinated by a multi-agent system in order to allow distributed query processing.


2005 - Third International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2005) [Esposizione]
Bergamaschi, Sonia; Gianluca, Moro; Aris M., Ouksel
abstract

The aim of this third workshop is to explore the promise of P2P to offer exciting new possibilities in distributed information processing and database technologies. The realization of this promise lies fundamentally in the availability of enhanced services such as structured ways for classifying and registering shared information, verification and certification of information, content distribution schemes and quality of content, security features, information discovery and accessibility, interoperation and composition of active information services, and finally market-based mechanisms to allow cooperative and non-cooperative information exchanges. The P2P paradigm lends itself to constructing large-scale, complex, adaptive, autonomous and heterogeneous database and information systems, endowed with clearly specified and differential capabilities to negotiate, bargain, coordinate and self-organize the information exchanges in large-scale networks. This vision will have a radical impact on the structure of complex organizations (business, scientific or otherwise), on the emergence and the formation of social communities, and on how information is organized and processed. The P2P information paradigm naturally encompasses static and wireless connectivity, and static and mobile architectures. Wireless connectivity, combined with increasingly small and powerful mobile devices and sensors, poses new challenges as well as opportunities to the database community. Information becomes ubiquitous, highly distributed and accessible anywhere and at any time over highly dynamic, unstable networks with very severe constraints on information management and processing capabilities. What techniques and data models may be appropriate for this environment, and yet guarantee or approach the performance, versatility and capability that users and developers have come to enjoy in traditional static, centralized and distributed database environments?
Is there a need to define new notions of consistency, durability, and completeness, for example? The proposed workshop will build on the success of the two preceding editions at VLDB 2003 and 2004. It will concentrate on exploring the synergies between current database research and P2P computing. It is our belief that database research has much to contribute to the P2P grand challenge through its wealth of techniques for sophisticated semantics-based data models, new indexing algorithms and efficient data placement, query processing techniques and transaction processing. Database technologies in the new information age will form the crucial components of the first generation of complex adaptive P2P information systems, which will be characterized by their ability to continuously self-organize, adapt to new circumstances, promote emergence as an inherent property, optimize locally but not necessarily globally, and deal with approximation and incompleteness. This workshop will also concentrate on the impact of complex adaptive information systems on current database technologies and their relation to emerging industrial technologies such as IBM's autonomic computing initiative. The workshop will be co-located with VLDB, the major international database and information systems conference, and will bring together key researchers from all over the world working on databases and P2P computing with the intention of strengthening this connection. Researchers from other related areas such as distributed systems, networks, multi-agent systems and complex systems will also be invited.


2004 - A peer-to-peer information system for the semantic web [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

Data integration, in the context of the web, faces new problems, due in particular to the heterogeneity of sources, to the fragmentation of the information and to the absence of a unique way to structure and view information. In such areas, the traditional paradigms on which database foundations are based (i.e. client/server architecture, few sources containing large information) have to be overcome by new architectures. The peer-to-peer (P2P) architecture seems to be the best way to support these new kinds of data sources, offering an alternative to the traditional client/server architecture. In this paper we present the SEWASIE system, which aims at providing access to heterogeneous web information sources. An enhancement of the system architecture in the direction of a P2P architecture, where connections among SEWASIE peers rely on the exchange of XML metadata, is described.


2004 - Extending a Lexicon Ontology for Intelligent Information Integration [Relazione in Atti di Convegno]
Benassi, Roberta; Bergamaschi, Sonia; Fergnani, Alain; Miselli, Daniele
abstract

One of the current research areas in the Semantic Web is the semantic annotation of information sources. On-line lexical ontologies can be exploited as a-priori common knowledge to provide easily understandable, machine-readable metadata. Nevertheless, the absence of terms related to specific domains causes a loss of semantics. In this paper we present WNEditor, a tool that aims at guiding the annotation designer during the creation of a domain lexicon ontology, extending the pre-existing WordNet ontology. New terms, meanings and relations between terms are virtually added and managed by preserving WordNet's internal organization.


2004 - MOMIS: an Ontology-based Information Integration System(software) [Software]
Bergamaschi, Sonia; Beneventano, Domenico; Guerra, Francesco; Orsini, Mirko; Vincini, Maurizio
abstract

The Mediator Environment for Multiple Information Sources (MOMIS), developed by the database research group at the University of Modena and Reggio Emilia, aims to construct synthesized, integrated descriptions of information coming from multiple heterogeneous sources. Our goal is to provide users with a global virtual view (GVV) of information sources, independent of their location or their data's heterogeneity. An open source version of the MOMIS system was released in April 2010 by the spin-off DATARIVER (www.datariver.it). Such a view conceptualizes the underlying domain; you can think of it as an ontology describing the sources involved. The Semantic Web exploits semantic markups to provide Web pages with machine-readable definitions. It thus relies on the a priori existence of ontologies that represent the domains associated with the given information sources. This approach relies on the selected reference ontology's accuracy, but we find that most ontologies in common use are generic and that the annotation phase (in which semantic annotations connect Web page parts to ontology items) causes a loss of semantics. By involving the sources themselves, our approach builds an ontology that more precisely represents the domain. Moreover, the GVV is annotated according to a lexical ontology, which provides an easily understandable meaning to content.


2004 - SOAP-ENABLED WEB SERVICES FOR KNOWLEDGE MANAGEMENT [Articolo su rivista]
I., Benetti; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The widespread diffusion of the World Wide Web among medium/small companies yields a huge amount of information to make business available online. Nevertheless, the heterogeneity of that information forces even trading partners involved in the same business process to face daily interoperability issues. The challenge is the integration of distributed business processes, which, in turn, means integration of heterogeneous data coming from distributed sources. This paper presents the new web services-based architecture of the MOMIS (Mediator envirOnment for Multiple Information Sources) framework, which enhances the semantic integration features of MOMIS, leveraging new technologies such as XML web services and the SOAP protocol. The new architecture decouples the different MOMIS modules, publishing them as XML web services. Since the SOAP protocol used to access XML web services requires the same network security settings as a normal internet browser, companies are enabled to share knowledge without softening their protection strategies.


2004 - Synthesizing an Integrated Ontology with MOMIS [Relazione in Atti di Convegno]
Benassi, Roberta; Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The Mediator EnvirOnment for Multiple Information Sources (MOMIS) aims at constructing synthesized, integrated descriptions of the information coming from multiple heterogeneous sources, in order to provide the user with a global virtual view of the sources independent from their location and the level of heterogeneity of their data. Such a global virtual view is a conceptualization of the underlying domain and may thus be thought of as an ontology describing the involved sources. In this article we explore the framework's main elements and discuss how the output of the integration process can be exploited to create a conceptualization of the underlying domain.


2004 - TUCUXI: the Intelligent Hunter Agent for Concept Understanding and Lexical Chaining [Relazione in Atti di Convegno]
Benassi, Roberta; Bergamaschi, Sonia; Vincini, Maurizio
abstract

In this paper we present Tucuxi, an intelligent hunter agent that replaces traditional keyword-based queries on the Web with a user-provided domain ontology, where the meanings to be searched are not ambiguous.


2004 - The MOMIS methodology for integrating heterogeneous data sources [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

The Mediator EnvirOnment for Multiple Information Sources (MOMIS) aims at constructing synthesized, integrated descriptions of the information coming from multiple heterogeneous sources, in order to provide the user with a global virtual view of the sources independent from their location and the level of heterogeneity of their data. Such a global virtual view is a conceptualization of the underlying domain and may thus be thought of as an ontology describing the involved sources. In this article we explore the framework's main elements and discuss how the output of the integration process can be exploited to create a conceptualization of the underlying domain.


2004 - Third International Workshop on Agents and Peer-to-Peer Computing(AP2PC 2004) [Esposizione]
Gianluca, Moro; Bergamaschi, Sonia; Karl, Aberer; Munindar P., Singh
abstract

Peer-to-peer (P2P) computing is attracting enormous media attention, spurred by the popularity of file sharing systems such as Napster, Gnutella, and Morpheus. The peers are autonomous, or as some call them, first-class citizens. P2P networks are emerging as a new distributed computing paradigm for their potential to harness the computing power of the hosts composing the network and make their under-utilized resources available to others. This possibility has generated a lot of interest in many industrial organizations, which have already launched important projects. In P2P systems, peers and web services, in the role of resources, are shared and combined to enable new capabilities greater than the sum of the parts. This means that services can be developed and treated as pools of methods that can be composed dynamically. The decentralized nature of P2P computing also makes it ideal for economic environments that foster knowledge sharing and collaboration as well as cooperative and non-cooperative behaviors in sharing resources. Business models are being developed which rely on incentive mechanisms to supply contributions to the system and methods for controlling free riding. Clearly, the growth and the management of P2P networks must be regulated to ensure adequate compensation of content and/or service providers. At the same time, there is also a need to ensure an equitable distribution of content and services. Although researchers working on distributed computing, MultiAgent Systems, databases and networks have been using similar concepts for a long time, it is only recently that papers motivated by the current P2P paradigm have started appearing in high quality conferences and workshops.
Research in agent systems in particular appears to be most relevant because, since their inception, MultiAgent Systems have always been thought of as networks of peers. The MultiAgent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, and the ease and speed of growth that give P2P its advantages also leads to significant potential problems. Most prominent among these problems are coordination: the ability of an agent to make decisions on its own actions in the context of the activities of other agents, and scalability: the value of P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so on. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen the application domains. These two problems are common to many large-scale applications. Without coordination, agents may waste their efforts, squander resources, and fail to achieve their objectives in situations requiring collective effort. This workshop will bring together researchers working on agent systems and P2P computing with the intention of strengthening this connection. Researchers from other related areas such as distributed systems, networks and database systems will also be welcome (and, in our opinion, have a lot to contribute).


2004 - Web Semantic Search with TUCUXI [Relazione in Atti di Convegno]
R., Benassi; Bergamaschi, Sonia; Vincini, Maurizio
abstract

S. Margherita di Pula (Cagliari), Italy, 21-23 June.


2003 - A Experiencing AUML for the WINK Multi-Agent System [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Gelati, Gionata; Guerra, Francesco; Vincini, Maurizio
abstract

In the last few years, efforts have been made towards bridging the gap between agent technology and de facto standard technologies, aiming at introducing multi-agent systems in industrial applications. This paper presents an experience gained by using one such proposal, Agent UML. Agent UML is a graphical modelling language based on UML. The practical usage of this notation has led us to suggest some refinements of the Agent UML features.


2003 - A Peer-to-Peer Agent-Based Semantic Search Engine [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Fergnani, Alain; Guerra, Francesco; Vincini, Maurizio; D., Montanari
abstract

Several architectures, protocols, languages, and candidate standards have been proposed to let the "semantic web" idea take off. In particular, searching for information requires the cooperation of the information providers and seekers. Past experience and history show that a successful architecture must support ease of adoption and deployment by a wide and heterogeneous population, a flexible policy to establish an acceptable cost-benefit ratio for using the system, and the growth of a cooperative distributed infrastructure with no central control. In this paper an agent-based peer-to-peer architecture is defined to support search through a flexible integration of semantic information. Two levels of integration are foreseen: strong integration of sources related to the same domain into a single information node by means of a mediator-based system; weak integration of information nodes on the basis of semantic relationships existing among concepts of different nodes. The EU IST SEWASIE project is described as an instantiation of this architecture. SEWASIE aims at implementing an advanced search engine, which will provide SMEs with intelligent access to heterogeneous information on the Internet.


2003 - Building an Integrated Ontology within the SEWASIE Project: The Ontology Builder Tool [Abstract in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; D., Miselli; A., Fergnani; Vincini, Maurizio
abstract

See http://www.sewasie.org/


2003 - Building an Ontology with MOMIS [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco
abstract

Nowadays the Web is a huge collection of data and its expansion rate is very high. Web users need new ways to exploit all this available information and its possibilities. A new vision of the Web arises: the Semantic Web, where resources are annotated with machine-processable metadata providing them with background knowledge and meaning. A fundamental component of the Semantic Web is the ontology; this "explicit specification of a conceptualization" allows information providers to give an understandable meaning to their documents. MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework for information extraction and integration of heterogeneous information sources. The system implements a semi-automatic methodology for data integration that follows the Global as View (GAV) approach. The result of the integration process is a global schema, which provides a reconciled, integrated and virtual view of the underlying sources, the GVV (Global Virtual View). The GVV is composed of a set of (global) classes that represent the information contained in the sources. In this paper, we focus on the application of MOMIS to a particular kind of source (i.e. web documents), and show how the result of the integration process can be exploited to create a conceptualization of the underlying domain, i.e. a domain ontology for the integrated sources. The GVV is then semi-automatically annotated according to a lexical ontology. With reference to the Semantic Web area, where the annotation process generally consists of providing a web page with semantic markups according to an ontology, we first mark up the local metadata descriptions and then the MOMIS system generates an annotated conceptualization of the sources. Moreover, our approach "builds" the domain ontology as the synthesis of the integration process, while the usual approach in the Semantic Web is based on the "a priori" existence of an ontology.


2003 - Building an integrated Ontology within SEWASIE system [Relazione in Atti di Convegno]
Beneventano, D.; Bergamaschi, S.; Guerra, F.; Vincini, M.
abstract

The SEWASIE (SEmantic Webs and AgentS in Integrated Economies) project (IST-2001-34825) is a European research project that aims at designing and implementing an advanced search engine enabling intelligent access to heterogeneous data sources on the web. In this paper we focus on the Ontology Builder component of the SEWASIE system, a framework for information extraction and integration of heterogeneous structured and semi-structured information sources, built upon the MOMIS (Mediator envirOnment for Multiple Information Sources) system. The result of the integration process is a Global Virtual View (in short, GVV), a set of (global) classes that represent the information contained in the sources being used. In particular, we present the application of our integration approach to a specific type of source (i.e. web documents), and show the extension of a built-up GVV by the addition of another source.


2003 - Building an integrated Ontology within the SEWASIE system [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework for information extraction and integration of heterogeneous structured and semi-structured information sources. The result of the integration process is a Global Virtual View (in short, GVV), a set of (global) classes that represent the information contained in the sources being used. In this paper, we present the application of our integration approach to a specific type of source (i.e. web documents), and show how the result of the integration process can be exploited to create a conceptualization of the domain underlying the sources, i.e. an ontology. Two new achievements of the MOMIS system are presented: the semi-automatic annotation of the GVV and the extension of a built-up ontology by the addition of another source.


2003 - Description logics for semantic query optimization in object-oriented database systems [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori
abstract

Semantic query optimization uses semantic knowledge (i.e., integrity constraints) to transform a query into an equivalent one that may be answered more efficiently. This article proposes a general method for semantic query optimization in the framework of Object-Oriented Database Systems. The method is effective for a large class of queries, including conjunctive recursive queries expressed with regular path expressions, and is based on three ingredients. The first is a Description Logic, ODLRE, providing a type system capable of expressing class descriptions, queries, views and integrity constraint rules, together with inference techniques such as incoherence detection and subsumption computation. The second is a semantic expansion function for queries, which incorporates in one query the restrictions logically implied by the query and the schema (classes + rules). The third is an optimal rewriting method that rewrites a query into an equivalent one with respect to the schema classes, by determining more specialized classes to be accessed and by reducing the number of factors. We implemented the method in a tool providing an ODMG-compliant interface that allows full interaction with OQL queries, hiding the underlying Description Logic representation and techniques from the user.


2003 - European Research and Development of Intelligent Information Agents: The AgentLink Perspective [Capitolo/Saggio]
Klusch, Matthias; Bergamaschi, Sonia; Petta, Paolo
abstract

The vast amount of heterogeneous information sources available in the Internet demands advanced solutions for acquiring, mediating, and maintaining relevant information for the common user. The impacts of data, system, and semantic heterogeneity on the information overload of the user are manifold, and especially due to potentially significant differences in data modeling, data structures, content representations using ontologies and vocabularies, and in the query languages and operations to retrieve, extract, and analyse information in the appropriate context. The impacts of the increasing globalisation on the information overload encompass the tedious tasks of the user to determine and keep track of relevant information sources, to efficiently deal with different levels of abstraction of information modeling at sources, and to combine partially relevant information from potentially billions of sources. A special type of intelligent software agents, so-called information agents, is supposed to cope with these difficulties associated with the information overload of the user. This implies their ability to semantically broker information by providing pro-active resource discovery, resolving the information impedance of information consumers and providers in the Internet, and offering value-added information services and products to the user or other agents. In the subsequent sections we briefly introduce the reader to the notion of this type of agent, as well as to one of the prominent European forums for research on and development of these agents, the AgentLink special interest group on intelligent information agents. This book includes presentations of advanced systems of information agents and solution approaches to different problems in the domain that have been developed jointly by members of this special interest group in respective working groups.


2003 - Intelligent Information Agents - The AgentLink Perspective [Curatela]
Klusch, Matthias; Bergamaschi, Sonia; Edwards, Peter; Petta, Paolo
abstract

State of the art of the research on intelligent information agents


2003 - MIKS: an agent framework supporting information access and integration [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; J., Gelati; Guerra, Francesco; Vincini, Maurizio
abstract

Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In the past, companies have equipped themselves with data storing systems, building up informative systems containing data that are related to one another, but which are often redundant, not homogeneous and not always semantically consistent. Moreover, to meet the requirements of global, Internet-based information systems, it is important that the tools developed for supporting these activities are as semi-automatic and scalable as possible. To face the issues related to scalability in the large scale, in this paper we propose the exploitation of mobile agents in the information integration area and, in particular, their integration in the MOMIS infrastructure. MOMIS (Mediator EnvirOnment for Multiple Information Sources) is a system that has been conceived as a pool of tools to provide integrated access to heterogeneous information stored in traditional databases (for example, relational or object-oriented databases) or in file systems, as well as in semi-structured data sources (XML files). This proposal has been implemented within the MIKS (Mediator agent for Integration of Knowledge Sources) system and is completely described in this paper.


2003 - Managing knowledge through electronic commerce applications: a framework for integrating information coming from heterogeneous web sources [Articolo su rivista]
I., Benetti; Bergamaschi, Sonia; Scarso, Enrico
abstract

The paper aims at investigating the interplay existing between Electronic Commerce (EC) technologies, knowledge and Knowledge Management (KM), an issue that has recently attracted attention in the academic and professional literature. To this end, a careful examination of the logic and working mechanisms of MOMIS is conducted, a semi-automatic framework for integrating information coming from heterogeneous sources that is under development at the Department of Information Engineering of the University of Modena and Reggio Emilia. In particular, the use of MOMIS to create virtual catalogues (i.e. EC instruments that dynamically retrieve information from multiple heterogeneous sources) is discussed in depth from a knowledge-based point of view. The analysis seems to confirm that EC and KM are not unrelated managerial issues, but rather that they can (and must) be beneficially integrated. More specifically, it seems possible to state that a better and more accurate understanding of knowledge management processes is essential to design and realise more effective EC applications.


2003 - Peer to Peer Paradigm for a Semantic Search Engine [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco
abstract

This paper provides, firstly, a general description of the SEWASIE research project and, secondly, a proposal for an architectural evolution of the SEWASIE system in the direction of the peer-to-peer paradigm. The SEWASIE project aims to design and implement an advanced search engine enabling intelligent access to heterogeneous data sources on the web using community-specific multilingual ontologies. After a presentation of the main features of the system, a preliminary proposal of architectural evolutions towards the peer-to-peer paradigm is discussed.


2003 - Semantic Web Search Engines: the SEWASIE approach [Poster]
Beneventano, Domenico; Bergamaschi, Sonia; D., Montanari; L., Ottaviani
abstract

SEWASIE is a research project funded by the European Commission that aims to design and implement an advanced search engine enabling intelligent access to heterogeneous data sources on the web via semantic enrichment to provide the basis of structured secure web-based communication.


2003 - Stato e prospettive di sviluppo delle tecnologie informatiche per l'economia digitale [Capitolo/Saggio]
A., ALBERIGI QUARANTA; I., Benetti; Bergamaschi, Sonia; Scarso, Enrico
abstract

The chapter describes the state and the development prospects of information technologies for the digital economy.


2003 - Synthesizing, an integrated ontology [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

To exploit the Internet’s expanding data collection, current Semantic Web approaches employ annotation techniques to link individual information resources with machine-comprehensible metadata. Before we can realize the potential this new vision presents, however, several issues must be solved. One of these is the need for data reliability in dynamic, constantly changing networks. Another issue is how to explicitly specify relationships between abstract data concepts. Ontologies provide a key mechanism for solving these challenges, but the Web’s dynamic nature leaves open the question of how to manage them. The Mediator Environment for Multiple Information Sources (Momis), developed by the database research group at the University of Modena and Reggio Emilia, aims to construct synthesized, integrated descriptions of information coming from multiple heterogeneous sources. Our goal is to provide users with a global virtual view (GVV) of information sources, independent of their location or their data’s heterogeneity. Such a view conceptualizes the underlying domain; you can think of it as an ontology describing the sources involved. The Semantic Web exploits semantic markups to provide Web pages with machine-readable definitions. It thus relies on the a priori existence of ontologies that represent the domains associated with the given information sources. This approach relies on the selected reference ontology’s accuracy, but we find that most ontologies in common use are generic and that the annotation phase (in which semantic annotations connect Web page parts to ontology items) causes a loss of semantics. By involving the sources themselves, our approach builds an ontology that more precisely represents the domain. Moreover, the GVV is annotated according to a lexical ontology, which provides an easily understandable meaning to content. 
In this article, we use Web documents as a representative information source to describe the Momis methodology’s general application. We explore the framework’s main elements and discuss how the output of the integration process can be exploited to create a conceptualization of the underlying domain. In particular, our method provides a way to extend previously created conceptualizations, rather than starting from scratch, by inserting a new source.


2003 - WINK: A web-based system for collaborative project management in virtual enterprises [Relazione in Atti di Convegno]
Bergamaschi, S.; Gelati, G.; Guerra, F.; Vincini, M.
abstract

The increasing globalization and flexibility required of companies have generated, in the last decade, new issues related to the management of large-scale projects within geographically distributed networks and to the cooperation of enterprises. ICT support systems are required to allow enterprises to share information, guarantee data consistency and establish synchronized and collaborative processes. In this paper we present a collaborative project management system that integrates data coming from aerospace industries, with two main goals: avoiding inconsistencies generated by updates at the sources’ level and minimizing data replication. The proposed system is composed of a collaborative project management component supported by a web interface, a multi-agent data integration component which supports information sharing and querying, and SOAP-enabled web services which ensure the overall interoperability of the software components. The system was developed by the University of Modena and Reggio Emilia, Gruppo Formula S.p.A. and Alenia Spazio S.p.A. within the EU WINK Project (Web-linked Integration of Network based Knowledge - IST-2000-28221).


2002 - A data integration framework for e-commerce product classification [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

A marketplace is the place in which the demand and supply of buyers and vendors participating in a business process may meet. Therefore, electronic marketplaces are virtual communities in which buyers may meet the proposals of several suppliers and make the best choice. In the electronic commerce world, the comparison between different products is hindered by the lack of a single standard (or rather, by the proliferation of standards) for describing and classifying them. Therefore, B2B and B2C marketplaces need to reclassify products and goods according to different standardization models. This paper aims to face this problem by suggesting the use of a semi-automatic methodology, supported by a tool (SI-Designer), to define the mapping among different e-commerce product classification standards. This methodology was developed for the MOMIS system within the Intelligent Integration of Information research area. We describe our extension to the methodology that makes it applicable to product classification standards in general, by selecting a fragment of the ECCMA/UNSPSC and eCl@ss standards.


2002 - A semantic approach to access heterogeneous data sources: the SEWASIE Project [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Vincini, Maurizio
abstract

SEWASIE is implementing an advanced search engine that provides intelligent access to heterogeneous data sources on the web via semantic enrichment. This can be thought of as the basis of structured, secure web-based communication. SEWASIE provides users with a search client that has an easy-to-use query interface, and which can extract the required information from the Internet and show it in a useful and user-friendly format. From an architectural point of view, the prototype will provide a search engine client, indexing servers and ontologies. There are many benefits to be had from such a system. There will be a reduction of transaction costs through efficient search and communication facilities. Within the business context, the system will support integrated searching and negotiating, which will promote the take-up of key technologies for SMEs and give them a competitive edge.


2002 - An Agent framework for Supporting the MIKS Integration Process [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; M., Felice; D., Gazzotti; Gelati, Gionata; Guerra, Francesco; Vincini, Maurizio
abstract

Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In the past, companies have equipped themselves with data storing systems, building up informative systems containing data that are related to one another, but which are often redundant, not homogeneous and not always semantically consistent. Moreover, to meet the requirements of global, Internet-based information systems, it is important that the tools developed for supporting these activities are as semi-automatic and scalable as possible. To face the issues related to scalability in the large scale, in this paper we propose the exploitation of mobile agents in the information integration area and, in particular, the roles they play in enhancing the features of the MOMIS infrastructure. MOMIS (Mediator envirOnment for Multiple Information Sources) is a system that has been conceived as a pool of tools to provide integrated access to heterogeneous information stored in traditional databases (for example, relational or object-oriented databases) or in file systems, as well as in semi-structured data sources (XML files). In this paper we describe the new agent-based framework for the integration process as implemented in the MIKS (Mediator agent for Integration of Knowledge Sources) system.


2002 - An information integration framework for E-commerce [Articolo su rivista]
I., Benetti; Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

The Web has transformed electronic information systems from single, isolated nodes into a worldwide network of information exchange and business transactions. In this context, companies have equipped themselves with high-capacity storage systems that contain data in several formats. The problems faced by these companies often emerge because the storage systems lack structural and application homogeneity in addition to a common ontology. The semantic differences generated by a lack of a consistent ontology can lead to conflicts that range from simple name contradictions (when companies use different names to indicate the same data concept) to structural incompatibilities (when companies use different models to represent the same information types). One of the main challenges for e-commerce infrastructure designers is information sharing and retrieving data from different sources to obtain an integrated view that can overcome any contradictions or redundancies. Virtual catalogs can help overcome this challenge because they act as instruments to retrieve information dynamically from multiple catalogs and present unified product data to customers. Instead of having to interact with multiple heterogeneous catalogs, customers can instead interact with a virtual catalog in a straightforward, uniform manner. This article presents a virtual catalog project called Momis (mediator environment for multiple information sources). Momis is a mediator-based system for information extraction and integration that works with structured and semistructured data sources. Momis includes a component called the SI-Designer for semiautomatically integrating the schemas of heterogeneous data sources, such as relational, object, XML, or semistructured sources. Starting from local source descriptions, the Global Schema Builder generates an integrated view of all data sources and expresses those views using XML.
Momis lets you use the infrastructure with other open integration information systems by simply interchanging XML data files. Momis creates the XML global schema in different stages, first by creating a common thesaurus of intra- and interschema relationships. Momis extracts the intraschema relationships by using inference techniques, then shares these relationships in the common thesaurus. After this initial phase, Momis enriches the common thesaurus with interschema relationships obtained using the lexical WordNet system (www.cogsci.princeton.edu/wn), which identifies the affinities between interschema concepts on the basis of their lexical meaning. Momis also enriches the common thesaurus using the Artemis system, which evaluates structural affinities among interschema concepts.


2002 - MOMIS: Exploiting agents to support information integration [Articolo su rivista]
Cabri, Giacomo; Guerra, Francesco; Vincini, Maurizio; Bergamaschi, Sonia; Leonardi, Letizia; Zambonelli, Franco
abstract

Information overloading introduced by the large amount of data that is spread over the Internet must be faced in an appropriate way. The dynamism and the uncertainty of the Internet, along with the heterogeneity of the sources of information are the two main challenges for today's technologies related to information management. In the area of information integration, this paper proposes an approach based on mobile software agents integrated in the MOMIS (Mediator envirOnment for Multiple Information Sources) infrastructure, which enables semi-automatic information integration to deal with the integration and query of multiple, heterogeneous information sources (relational, object, XML and semi-structured sources). The exploitation of mobile agents in MOMIS can significantly increase the flexibility of the system. In fact, their characteristics of autonomy and adaptability well suit the distributed and open environments, such as the Internet. The aim of this paper is to show the advantages of the introduction in the MOMIS infrastructure of intelligent and mobile software agents for the autonomous management and coordination of integration and query processing over heterogeneous data sources.


2002 - Product Classification Integration for E-Commerce [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

A marketplace is the place where the demand and supply of buyers and vendors participating in a business process may meet. Therefore, electronic marketplaces are virtual communities in which buyers may meet the proposals of several suppliers and make the best choice. In the electronic commerce world, the comparison between different products is hindered by the lack of a single standard (or rather, by the proliferation of standards) for describing and classifying them. Therefore, B2B and B2C marketplaces need to reclassify products and goods according to different standardization models. This paper aims to face this problem by suggesting the use of a semi-automatic methodology to define a mapping among different e-commerce product classification standards. This methodology is an extension of the MOMIS system, a mediator system developed within the Intelligent Integration of Information research area.


2002 - SI-Web: a Web based interface for the MOMIS project [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; D., Bianco; Guerra, Francesco; Vincini, Maurizio
abstract

The MOMIS project (Mediator envirOnment for Multiple Information Sources), developed in the past years, allows the integration of data from structured and semi-structured data sources. SI-Designer (Source Integrator Designer) is a designer support tool implemented within the MOMIS project for the semi-automatic integration of heterogeneous source schemata. It is a Java application where all the modules involved are available as CORBA objects and interact using established IDL interfaces. The goal of this demonstration is to present a new tool, SI-Web (Source Integrator on Web), which offers the same features as SI-Designer but has the great advantage of being usable on the Internet through a web browser.


2002 - Semantic Integration and Query Optimization of Heterogeneous Data Sources [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Beneventano, Domenico; Castano, S; DE ANTONELLIS, V; Ferrara, A; Guerra, Francesco; Mandreoli, Federica; ORNETTI G., C; Vincini, Maurizio
abstract

In modern Internet/Intranet-based architectures, an increasing number of applications requires an integrated and uniform access to a multitude of heterogeneous and distributed data sources. In this paper, we describe the ARTEMIS/MOMIS system for the semantic integration and query optimization of heterogeneous structured and semistructured data sources.


2002 - The WINK Project for Virtual Enterprise Networking and Integration [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Gazzotti, Davide; Gelati, Gionata; Guerra, Francesco; Vincini, Maurizio
abstract

To stay competitive (or sometimes simply to stay) on the market, companies and manufacturers more and more often have to join forces to survive and possibly flourish. Among other solutions, the last decade has experienced the growth and spreading of an original business model called the Virtual Enterprise. To manage a Virtual Enterprise, modern information systems have to tackle technological issues such as networking, integration and cooperation. The WINK project, born from the partnership between the University of Modena and Reggio Emilia and Gruppo Formula, addresses these problems. The ultimate goal is to design, implement and finally test on a pilot case (provided by Alenia) the WINK system, a combination of two existing and promising software systems (the WHALES and MIKS systems), to meet the Virtual Enterprise requirements for data integration, cooperation and management planning.


2001 - Exploiting extensional knowledge for query reformulation and object fusion in a data integration system [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

Query processing in global information systems integrating multiple heterogeneous sources is a challenging issue in relation to the effective extraction of information available on-line. In this paper we propose intelligent, tool-supported techniques for querying global information systems integrating both structured and semistructured data sources. The techniques have been developed in the environment of a data integration, wrapper/mediator based system, MOMIS, and try to achieve two main goals: optimized query reformulation w.r.t. local sources and object fusion, i.e. grouping together information (from the same or different sources) about the same real-world entity. The developed techniques rely on the availability of integration knowledge, i.e. local source schemata, a virtual mediated schema and its mapping descriptions, that is semantic mappings w.r.t. the underlying sources both at the intensional and extensional level. Mapping descriptions, obtained as a result of the semi-automatic integration process of multiple heterogeneous sources developed for the MOMIS system, include, unlike previous data integration proposals, extensional intra/interschema knowledge. Extensional knowledge is exploited to detect extensionally overlapping classes and to discover implicit join criteria among classes, which enables the goals of optimized query reformulation and object fusion to be achieved. The techniques have been implemented in the MOMIS system but can be applied, in general, to data integration systems including extensional intra/interschema knowledge in mapping descriptions.


2001 - Extensional Knowledge for semantic query optimization in a mediator based system [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Mandreoli, Federica
abstract

Query processing in global information systems integrating multiple heterogeneous sources is a challenging issue in relation to the effective extraction of information available on-line. In this paper we propose intelligent, tool-supported techniques for querying global information systems integrating both structured and semistructured data sources. The techniques have been developed in the environment of a data integration, wrapper/mediator based system, MOMIS, and try to achieve the goal of optimized query reformulation w.r.t. local sources. The developed techniques rely on the availability of integration knowledge whose semantics is expressed in terms of description logics. Integration knowledge includes local source schemata, a virtual mediated schema and its mapping descriptions, that is semantic mappings w.r.t. the underlying sources both at the intensional and extensional level. Mapping descriptions, obtained as a result of the semi-automatic integration process of multiple heterogeneous sources developed for the MOMIS system, include, unlike previous data integration proposals, extensional intra/interschema knowledge. Extensional knowledge is exploited to perform semantic query optimization in a mediator based system, as it allows us to devise an optimized query reformulation method. The techniques are under development in the MOMIS system but can be applied, in general, to data integration systems including extensional intra/interschema knowledge in mapping descriptions.


2001 - SI-Designer: a tool for intelligent integration of information [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; I., Benetti; Corni, Alberto; Guerra, Francesco; G., Malvezzi
abstract

SI-Designer (Source Integrator Designer) is a designer support tool for the semi-automatic integration of heterogeneous source schemata (relational, object and semi-structured sources); it has been implemented within the MOMIS project and it carries out integration following a semantic approach which uses intelligent Description Logics-based techniques, clustering techniques and an extended ODMG-ODL language, ODL-I3, to represent schemata and extracted, integrated information. Starting from the sources’ ODL-I3 descriptions (local schemata), SI-Designer supports the designer in the creation of an integrated view of all the sources (global schema), which is expressed in the same ODL-I3 language. We propose SI-Designer as a tool to build virtual catalogs in the E-Commerce environment.


2001 - SI-Designer: an Integration Framework for E-Commerce [Relazione in Atti di Convegno]
I., Benetti; Beneventano, Domenico; Bergamaschi, Sonia; Guerra, Francesco; Vincini, Maurizio
abstract

Electronic commerce lets people purchase goods and exchange information on business transactions on-line. Therefore, one of the main challenges for the designers of e-commerce infrastructures is information sharing: retrieving data located in different sources to obtain an integrated view that overcomes any contradiction or redundancy. Virtual Catalogs synthesize this approach, as they are conceived as instruments to dynamically retrieve information from multiple catalogs and present product data in a unified manner, without directly storing product data from the catalogs. In this paper we propose SI-Designer, a support tool for the integration of data from structured and semi-structured data sources, developed within the MOMIS (Mediator environment for Multiple Information Sources) project.


2001 - Semantic Integration of Heterogeneous Information Sources [Articolo su rivista]
Bergamaschi, Sonia; Castano, S.; Vincini, Maurizio; Beneventano, Domenico
abstract

Developing intelligent tools for the integration of information extracted from multiple heterogeneous sources is a challenging issue for effectively exploiting the numerous sources available on-line in global information systems. In this paper, we propose intelligent, tool-supported techniques for information extraction and integration from both structured and semistructured data sources. An object-oriented language with an underlying Description Logic, called ODLI3, derived from the standard ODMG, is introduced for information extraction. ODLI3 descriptions of the source schemas are exploited first to set up a Common Thesaurus for the sources. Information integration is then performed in a semiautomatic way by exploiting the knowledge in the Common Thesaurus and the ODLI3 descriptions of source schemas with a combination of clustering techniques and Description Logics. This integration process gives rise to a virtual integrated view of the underlying sources, for which mapping rules and integrity constraints are specified to handle heterogeneity. The integration techniques described in the paper are provided in the framework of the MOMIS system, based on a conventional wrapper/mediator architecture.


2001 - Supporting information integration with autonomous agents [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Cabri, Giacomo; Guerra, Francesco; Leonardi, Letizia; Vincini, Maurizio; Zambonelli, Franco
abstract

The large amount of information spread over the Internet is an important resource for everyone, but it also introduces issues that must be faced. The dynamism and uncertainty of the Internet, along with the heterogeneity of its information sources, are the two main challenges for today's technologies. This paper proposes an approach based on mobile agents integrated in an information integration infrastructure. Mobile agents can significantly improve the design and development of Internet applications thanks to their autonomy and adaptability to open and distributed environments such as the Internet. MOMIS (Mediator envirOnment for Multiple Information Sources) is an infrastructure for semi-automatic information integration that deals with the integration and querying of multiple, heterogeneous information sources (relational, object, XML and semi-structured sources). The aim of this paper is to show the advantages of introducing into the MOMIS infrastructure intelligent and mobile software agents for the autonomous management and coordination of the integration and query processes over heterogeneous sources.


2001 - The MOMIS approach to information integration [Relazione in Atti di Convegno]
Beneventano, D.; Bergamaschi, S.; Guerra, F.; Vincini, M.
abstract

The web explosion, at both Internet and intranet level, has transformed electronic information systems from single isolated nodes into entry points to a worldwide network of information exchange and business transactions. Business and commerce have seized the opportunity offered by the new technologies to define e-commerce activities. One of the main challenges for the designers of e-commerce infrastructures is therefore information sharing: retrieving data located in different sources to obtain an integrated view that overcomes contradictions and redundancies. Virtual Catalogs epitomize this approach, as they are conceived as instruments to dynamically retrieve information from multiple catalogs and present product data in a unified manner, without directly storing product data from the catalogs. Customers, instead of having to interact with multiple heterogeneous catalogs, can interact in a uniform way with a virtual catalog. In this paper we propose a designer support tool, called SI-Designer, for information integration, developed within the MOMIS project. The MOMIS (Mediator environment for Multiple Information Sources) project aims to integrate data from structured and semi-structured data sources.



2000 - Bologna, European City of Culture [Esposizione]
Bergamaschi, Sonia
abstract

MOMIS Poster for Bologna, European City of Culture


2000 - Creazione di una vista globale d'impresa con il sistema MOMIS basato su Description Logics [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; A., Corni; Vincini, Maurizio
abstract

-


2000 - Creazione di una vista globale d'impresa con il sistema MOMIS basato su Description Logics [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Castano; A., Corni; R., Guidetti; G., Malvezzi; M., Melchiori; Vincini, Maurizio
abstract

Developing intelligent tools for the integration of information coming from heterogeneous sources within an enterprise is a topic of great research interest. In this article we propose techniques based on intelligent tools for the extraction and integration of information from structured and semistructured sources, as provided by the MOMIS system. To describe the sources we present and use the object-oriented language ODLI3, derived from the ODMG standard. The sources described in ODLI3 are processed to build a thesaurus of the information shared among them. The integration of the sources is then performed semi-automatically, processing the source descriptions with techniques based on Description Logics and clustering, and generating a global schema that provides a virtual integrated view of the sources.


2000 - Fondamenti di Informatica [Monografia/Trattato scientifico]
Beneventano, Domenico; Bergamaschi, Sonia; Claudio, Sartori
abstract

A textbook on the fundamentals of programming electronic computers, with the particular goal of developing a rigorous method for solving different classes of problems. Particular emphasis is placed on the fundamental constructs and on the possibility of building solutions based on software reuse.


2000 - Information integration - the MOMIS project demonstration [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Castano; Corni, Alberto; G., Guidetti; M., Malvezzi; M., Melchiori; Vincini, Maurizio
abstract

The goal of this demonstration is to present the main features of a Mediator component, the Global Schema Builder of an I3 system called MOMIS (Mediator envirOnment for Multiple Information Sources). MOMIS has been conceived to provide integrated access to heterogeneous information stored in traditional databases (e.g., relational, object-oriented) or file systems, as well as in semistructured sources. The demonstration is based on the integration of two simple sources of different kinds, one structured and one semi-structured.



2000 - MOMIS: un sistema di Description Logics per l'integrazione del sistema informativo d'impresa [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Corni, Alberto; Vincini, Maurizio
abstract

Taormina


2000 - Ontology based access to digital libraries [Esposizione]
Bergamaschi, Sonia; Fausto, Rabitti
abstract

Luxembourg. (Invited talk)


2000 - SI-DESIGNER: un tool di ausilio all'integrazione intelligente di sorgenti di informazione [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; Corni, Alberto; R., Guidetti; G., Malvezzi
abstract

SI-Designer (Source Integrator Designer) is a tool that supports the designer in the semi-automatic integration of schemas of heterogeneous sources (relational, object and semistructured). Developed within the MOMIS project, SI-Designer performs the integration following a semantic approach that uses intelligent techniques based on the OLCD Description Logics, clustering techniques, and an object-oriented language, ODLI3, derived from the ODMG standard, to represent the extracted and integrated information. Starting from the ODLI3 descriptions of the sources (the local schemas), SI-Designer assists the designer in creating an integrated view of all the sources (the global schema), which is also expressed in ODLI3.


2000 - Tecnologie database ed Integrazione di Dati nel Commercio Elettronico [Esposizione]
Bergamaschi, Sonia
abstract

Vicenza. (Invited talk)


1999 - Distributed Database Support for Data-Intensive Workflow Application [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Castano; C., Sartori; Tiberio, Paolo; Vincini, Maurizio
abstract

Venice, Italy


1999 - Integration of information from multiple sources of textual data. [Capitolo/Saggio]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

The chapter presents two ongoing projects towards an intelligent integration of information, which adopt a structural and a semantic approach respectively: TSIMMIS (The Stanford IBM Manager of Multiple Information Sources) and MOMIS (Mediator environment for Multiple Information Sources). Both projects focus on mediator-based information systems. The chapter describes the architecture of a wrapper and how to generate a mediator agent in TSIMMIS. Wrapper agents in TSIMMIS extract information from a textual source and convert local data into a common data model; the mediator is an integration and refinement tool for the data provided by the wrapper agents. In the second project, MOMIS, a conceptual schema for each source is provided by adopting a common standard model and language. The MOMIS approach uses a description logic, or concept language, for knowledge representation to obtain a semiautomatic generation of a common thesaurus. Clustering techniques are used to build the unified schema, i.e., the unified view of the data to be used for query processing in distributed, heterogeneous and autonomous databases by a mediator.


1999 - Intelligent Techniques for the Extraction and Integration of Heterogeneous Information [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Castano; Vincini, Maurizio; Beneventano, Domenico
abstract

Developing intelligent tools for the integration of information extracted from multiple heterogeneous sources is a challenging issue in effectively exploiting the numerous sources available on-line in global information systems. In this paper, we propose intelligent, tool-supported techniques for information extraction and integration which take into account both structured and semistructured data sources. An object-oriented language called odli3, derived from the standard ODMG, with an underlying Description Logics, is introduced for information extraction. Odli3 descriptions of the information sources are exploited first to set a shared vocabulary for the sources. Information integration is performed in a semi-automatic way, by exploiting odli3 descriptions of source schemas with a combination of Description Logics and clustering techniques. The techniques described in the paper have been implemented in the MOMIS system, based on a conventional mediator architecture.


1999 - ODL-Designer UNISQL: Un'Interfaccia per la Specifica Dichiarativa di Vincoli di Integrità in OODBMS [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Beneventano, Domenico; F., Sgarbi; Vincini, Maurizio
abstract

The specification and handling of integrity constraints is a fundamental research topic in the database field; indeed, constraints are often the most burdensome part of developing real DBMS-based applications. The main goal of the ODL-Designer UNISQL software component presented in this work is to allow the database designer to express integrity constraints through a declarative language, thus overcoming the approach of current OODBMSs, which allow them to be expressed only through procedures (methods and triggers). ODL-Designer UNISQL acquires declarative constraints and automatically generates, transparently to the designer, the "procedures" that implement those constraints. The language supported by ODL-Designer UNISQL is the ODL-ODMG standard, suitably extended to express integrity constraints, while the commercial OODBMS used is UNISQL.


1999 - Semantic Integration of Semistructured and Structured Data Sources [Articolo su rivista]
Bergamaschi, Sonia; S., Castano; Vincini, Maurizio
abstract

Providing an integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In this context, two fundamental problems arise. First, how to determine if the sources contain semantically related information, that is, information related to the same or similar real-world concept(s). Second, how to handle semantic heterogeneity to support integration and uniform query interfaces. Complicating factors with respect to conventional view integration techniques are related to the fact that the sources to be integrated already exist and that semantic heterogeneity occurs on a large scale, involving terminology, structure, and context of the involved sources, with respect to geographical, organizational, and functional aspects related to information use. Moreover, to meet the requirements of global, Internet-based information systems, it is important that the tools developed for supporting these activities are as semi-automatic and scalable as possible. The goal of this paper is to describe the MOMIS [4, 5] (Mediator envirOnment for Multiple Information Sources) approach to the integration and query of multiple, heterogeneous information sources, containing structured and semistructured data. MOMIS has been conceived as a joint collaboration between the Universities of Milano and Modena in the framework of the INTERDATA national research project, aiming at providing methods and tools for data management in Internet-based information systems. Like other integration projects [1, 10, 14], MOMIS follows a “semantic approach” to information integration based on the conceptual schema, or metadata, of the information sources, and on the following architectural elements: i) a common object-oriented data model, defined according to the ODLI3 language, to describe source schemas for integration purposes.
The data model and ODLI3 have been defined in MOMIS as subset of the ODMG-93 ones, following the proposal for a standard mediator language developed by the I3/POB working group [7]. In addition, ODLI3 introduces new constructors to support the semantic integration process [4, 5]; ii) one or more wrappers, to translate schema descriptions into the common ODLI3 representation; iii) a mediator and a query-processing component, based on two pre-existing tools, namely ARTEMIS [8] and ODB-Tools [3] (available on Internet at http://sparc20.dsi.unimo.it/), to provide an I3 architecture for integration and query optimization. In this paper, we focus on capturing and reasoning about semantic aspects of schema descriptions of heterogeneous information sources for supporting integration and query optimization. Both semistructured and structured data sources are taken into account [5]. A Common Thesaurus is constructed, which has the role of a shared ontology for the information sources. The Common Thesaurus is built by analyzing ODLI3 descriptions of the sources, by exploiting the Description Logics OLCD (Object Language with Complements allowing Descriptive cycles) [2, 6], derived from KL-ONE family [17]. The knowledge in the Common Thesaurus is then exploited for the identification of semantically related information in ODLI3 descriptions of different sources and for their integration at the global level. Mapping rules and integrity constraints are defined at the global level to express the relationships holding between the integrated description and the sources descriptions. ODB-Tools, supporting OLCD and description logic inference techniques, allows the analysis of sources descriptions for generating a consistent Common Thesaurus and provides support for semantic optimization of queries at the global level, based on defined mapping rules and integrity constraints.
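As a toy illustration of the term-grouping step only (MOMIS itself combines ARTEMIS affinity computation with OLCD reasoning over the Common Thesaurus; the terms, threshold and greedy strategy below are invented for the example), semantically related source terms can be sketched as clusters built from plain name affinity:

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the affinity measure: plain string similarity.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(terms, threshold=0.6):
    """Greedily group terms whose name affinity exceeds the threshold."""
    clusters = []
    for term in terms:
        for group in clusters:
            if any(similarity(term, t) >= threshold for t in group):
                group.append(term)
                break
        else:
            clusters.append([term])  # no affine group found: start a new one
    return clusters

# Invented source terms, as might be extracted from two local schemas.
terms = ["Person", "Employee", "Employer", "Article", "Articles"]
groups = cluster(terms)
```

Each resulting group is a candidate for a single global class; in MOMIS the grouping is driven by the thesaurus relationships rather than by raw string similarity.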


1998 - An Intelligent Approach to Information Integration [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Castano; S., DE CAPITANI DE VIMERCATI; S., Montanari; Vincini, Maurizio
abstract

Formal Ontology in Information Systems. Frontiers in Artificial Intelligence and Applications, vol. 46, pp. 253-268.


1998 - Chrono: a conceptual design framework for temporal entities [Relazione in Atti di Convegno]
Bergamaschi, Sonia; C., Sartori
abstract

Database applications are frequently faced with the necessity of representing time-varying information and, particularly in the management of information systems, a few kinds of behavior in time characterize a wide class of applications. A great amount of work in the area of temporal databases, aiming at the definition of standard representations and manipulations of time, mainly in the relational database environment, has been presented in recent years. Nevertheless, the conceptual design of databases with temporal aspects has not yet received sufficient attention. The purpose of this paper is twofold: to propose a simple temporal treatment of information at the initial conceptual phase of database design, and to show how the chosen temporal treatment can be exploited for time integrity enforcement by using standard DBMS tools, such as referential integrity and triggers. Furthermore, we present a design tool implementing our data model and constraint generation technique, obtained by extending a commercial design tool.
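The trigger-based enforcement idea can be sketched as follows. This is a minimal illustration using SQLite from Python, with an invented table and a single interval constraint; it is not the output of the Chrono tool itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_salary (     -- illustrative temporal relation
    emp_id     INTEGER,
    salary     INTEGER,
    valid_from TEXT,               -- start of validity interval
    valid_to   TEXT                -- end of validity interval
);
-- Time integrity via a trigger: reject empty or inverted validity intervals.
CREATE TRIGGER check_interval BEFORE INSERT ON employee_salary
WHEN NEW.valid_from >= NEW.valid_to
BEGIN
    SELECT RAISE(ABORT, 'empty or inverted validity interval');
END;
""")

# A well-formed interval is accepted...
conn.execute("INSERT INTO employee_salary VALUES (1, 1000, '2000-01-01', '2000-12-31')")
# ...while an inverted one is rejected by the trigger.
try:
    conn.execute("INSERT INTO employee_salary VALUES (1, 1200, '2001-06-01', '2001-01-01')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

A full treatment would also forbid overlapping intervals for the same key, which requires a trigger that queries the table itself.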


1998 - Consistency checking in complex object database schemata with integrity constraints [Articolo su rivista]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Integrity constraints are rules that should guarantee the integrity of a database. Provided an adequate mechanism to express them is available, the following question arises: Is there any way to populate a database which satisfies the constraints supplied by a database designer? That is, does the database schema, including constraints, admit at least a nonempty model? This work answers the above question in a complex object database environment, providing a theoretical framework that includes the following ingredients: 1) two alternative formalisms, able to express a relevant set of state integrity constraints with a declarative style; 2) two specialized reasoners, based on the tableaux calculus, able to check the consistency of complex object database schemata expressed with the two formalisms. The proposed formalisms share a common kernel, which supports complex objects and object identifiers, and which allows the expression of acyclic descriptions of classes, nested relations and views, built up by means of the recursive use of record, quantified set, and object type constructors and by the intersection, union, and complement operators. Furthermore, the kernel formalism allows the declarative formulation of typing constraints and integrity rules. In order to improve the expressiveness and maintain the decidability of the reasoning activities, we extend the kernel formalism in two alternative directions. The first formalism, OLCP, introduces the capability of expressing path relations. Because cyclic schemas are extremely useful, we introduce a second formalism, OLCD, with the capability of expressing cyclic descriptions but disallowing the expression of path relations. In fact, we show that the reasoning activity in OLCDP (i.e., OLCP with cycles) is undecidable.


1998 - Exploiting Schema Knowledge for the Integration of Heterogeneous Sources [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., DE CAPITANI DE VIMERCATI; S., Montanari; Vincini, Maurizio
abstract

Ancona, Italy


1997 - A semantics-driven query optimizer for OODBs [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Vincini, Maurizio; Beneventano, Domenico
abstract

ODB-QOptimizer is an ODMG 93 compliant tool for schema validation and semantic query optimization. The approach is based on two fundamental ingredients. The first is the OCDL description logics (DLs), proposed as a common formalism to express class descriptions, a relevant set of integrity constraint rules (IC rules), and queries. The second is DL inference techniques, exploited to evaluate the logical implications expressed by IC rules and thus to produce the semantic expansion of a given query.


1997 - An Approach for the Extraction of Information from Heterogeneous Sources of Textual Data [Relazione in Atti di Convegno]
Bergamaschi, Sonia; C., Sartori
abstract

CEUR Workshop Proceedings. Athens


1997 - Extraction of information from highly heterogeneous sources of textual data [Relazione in Atti di Convegno]
Bergamaschi, S.
abstract

Extracting information from multiple, highly heterogeneous sources of textual data and integrating it in order to provide true information is a challenging research topic in the database area. In order to illustrate problems and solutions, one of the most interesting projects facing this problem, TSIMMIS, is presented. Furthermore, a Description Logics approach, able to provide interesting solutions both for data integration and data querying, is introduced.


1997 - Incoherence and Subsumption for recursive views and queries in Object-Oriented Data Models [Articolo su rivista]
Bergamaschi, Sonia; Beneventano, Domenico
abstract

Elsevier Science B.V. (North-Holland)


1997 - ODB-QOptimizer: a tool for semantic query optimization in OODB [Software]
Beneventano, Domenico; Bergamaschi, Sonia; Sartori, Claudio; Vincini, Maurizio
abstract

ODB-QOptimizer is an ODMG 93 compliant tool for schema validation and semantic query optimization. The approach is based on two fundamental ingredients. The first is the OCDL description logics (DLs), proposed as a common formalism to express class descriptions, a relevant set of integrity constraint rules (IC rules), and queries. The second is DL inference techniques, exploited to evaluate the logical implications expressed by IC rules and thus to produce the semantic expansion of a given query.


1997 - ODB-QOptimizer: a tool for semantic query optimization in OODB [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori; Vincini, Maurizio
abstract

Birmingham, UK


1997 - ODB-Tools: a description logics based tool for schema validation and semantic query optimization in Object Oriented Databases [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori; Vincini, Maurizio
abstract

LNAI 1321. Rome


1997 - Object Wrapper: an Object-Oriented Interface for Relational Databases [Relazione in Atti di Convegno]
Bergamaschi, Sonia; A., Garuti; C., Sartori; A., Venuta
abstract

Most commercial applications have to cope with a large number of stored object instances and with data shared among many users and applications. For object-oriented as well as conventional application development, RDBMS technology is currently used in most cases. We describe a software module called Object Wrapper for storing and retrieving objects in an RDBMS. Having these capabilities in a separate component helps to isolate data management system dependencies and hence contributes to portable applications.
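A minimal sketch of the idea, using a Python dataclass and SQLite (the class names and mapping scheme are illustrative assumptions, not the module's actual interface): the wrapper hides all SQL behind store/retrieve operations, so the application code never touches the RDBMS directly.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

@dataclass
class Person:          # an example application object
    id: int
    name: str

class ObjectWrapper:
    """Stores and retrieves dataclass instances in an RDBMS table."""
    def __init__(self, conn, cls):
        self.conn, self.cls = conn, cls
        # One column per field; table named after the class (sketch only,
        # not injection-safe for untrusted class names).
        cols = ", ".join(f.name for f in fields(cls))
        conn.execute(f"CREATE TABLE IF NOT EXISTS {cls.__name__} ({cols})")

    def store(self, obj):
        marks = ", ".join("?" for _ in fields(self.cls))
        self.conn.execute(
            f"INSERT INTO {self.cls.__name__} VALUES ({marks})", astuple(obj))

    def retrieve(self, obj_id):
        row = self.conn.execute(
            f"SELECT * FROM {self.cls.__name__} WHERE id = ?", (obj_id,)).fetchone()
        return self.cls(*row) if row else None

conn = sqlite3.connect(":memory:")
people = ObjectWrapper(conn, Person)
people.store(Person(1, "Ada"))
restored = people.retrieve(1)
```

Swapping the RDBMS then only requires changing the wrapper, which is the portability argument made in the abstract.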


1996 - ODB- Reasoner: un ambiente per la verifica di schemi e l’ottimizzazione di interrogazioni in OODB [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; A., Garuti; Vincini, Maurizio; C., Sartori
abstract

S. Miniato. Proceedings edited by Fausto Rabitti et al.


1996 - Scoperta di regole per l’ottimizzazione semantica delle interrogazioni [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori
abstract

Semantic query optimization in the relational environment, with reference to integrity constraints that represent restrictions and simple rules on attributes. The Explora system is used for the automatic derivation of the rules to be employed in the semantic query optimization process.


1996 - Semantic Query Optimization by Subsumption in OODB [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori
abstract

The purpose of semantic query optimization is to use semantic knowledge (e.g. integrity constraints) for transforming a query into a form that may be answered more efficiently than the original version. This paper proposes a general method for semantic query optimization in the framework of Object Oriented Database Systems. The method is applicable to the class of conjunctive queries and is based on two ingredients: a formalism able to express both class descriptions and integrity constraints rules as types; subsumption computation between types to evaluate the logical implications expressed by integrity constraints rules.
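The expansion step can be sketched abstractly: treat each IC rule as an implication over query conjuncts, and add every conclusion whose premise is already entailed by the query, iterating to a fixpoint. The predicate names below are invented for illustration; the paper's actual method computes subsumption between types in a description logic, which is far richer than the set containment used here.

```python
def expand(query, constraints):
    """query: set of predicates; constraints: list of (premise_set, conclusion).

    Adds every conclusion whose premise is contained in the (growing)
    set of query conjuncts, until no more conclusions apply.
    """
    expanded = set(query)
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for premise, conclusion in constraints:
            if premise <= expanded and conclusion not in expanded:
                expanded.add(conclusion)
                changed = True
    return expanded

# Hypothetical integrity constraints over employee records.
constraints = [
    ({"dept=R&D"}, "salary>50k"),        # every R&D employee earns > 50k
    ({"salary>50k"}, "level>=manager"),  # high earners are at least managers
]
result = expand({"dept=R&D"}, constraints)
```

The added conjuncts may enable cheaper access paths (e.g. an index on salary), which is what makes the expanded query answerable more efficiently.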


1995 - A semantics-driven query optimizer for OODBs [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori; J. P., Ballerini; Vincini, Maurizio
abstract

Semantic query optimization uses problem-specific knowledge (e.g. integrity constraints) for transforming a query into an equivalent one (i.e., with the same answer set) that may be answered more efficiently. The optimizer, applicable to the class of conjunctive queries, is based on two fundamental ingredients. The first is the ODL description logics, proposed as a common formalism to express class descriptions, a relevant set of integrity constraint rules (IC rules), and queries as ODL types. The second is DL (Description Logics) inference techniques, exploited to evaluate the logical implications expressed by IC rules and thus to produce the semantic expansion of a given query. The optimizer tentatively applies all the possible transformations and delays the choice of the beneficial ones till the end. Some preliminary ideas on filtering activities on the semantically expanded query are reported. A prototype semantic query optimizer (ODB-QOptimizer) for object-oriented database systems (OODBs) is described.


1995 - Consistency checking in Complex Objects Database schemata with integrity constraints [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Integrity constraints are rules which should guarantee the integrity of a database. Provided that an adequate mechanism to express them is available, the following question arises: is there any way to populate a database which satisfies the constraints supplied by a designer? I.e., does the database schema, including constraints, admit at least one model in which all classes are non-empty? This work gives an answer to the above question in an OODB environment, providing a Data Definition Language (DDL) able to express the semantics of a relevant set of state constraints and a specialized reasoner able to check the consistency of a schema with such constraints. The choice of the set of constraints expressed in the DDL is motivated by decidability issues.


1995 - DL techniques for intensional query answering in OODBs [Relazione in Atti di Convegno]
Bergamaschi, Sonia; C., Sartori; Vincini, Maurizio
abstract

Int. Workshop on Reasoning about Structured Objects: Knowledge Representation meets Databases


1995 - Lezioni di Fondamenti di Informatica [Monografia/Trattato scientifico]
Bergamaschi, Sonia; Claudio, Sartori; maria Rita, Scalas
abstract

Lectures on the foundations of computer science - second edition.


1995 - ODBQOptimizer: un ottimizzatore semantico per interrogazioni in OODB [Relazione in Atti di Convegno]
J. P., Ballerini; Beneventano, Domenico; Bergamaschi, Sonia; Vincini, Maurizio
abstract

Proceedings edited by Antonio Albano et al.


1995 - Terminological logics for schema design and query processing in OODBs [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

The paper introduces ideas which make feasible and effective the application of Terminological Logic (TL) techniques for schema design and query optimization in Object Oriented Databases (OODBs).


1994 - ACQUISITION AND VALIDATION OF COMPLEX OBJECT DATABASE SCHEMATA SUPPORTING MULTIPLE INHERITANCE [Articolo su rivista]
Bergamaschi, Sonia; Nebel, B.
abstract

We present an intelligent tool for the acquisition of object-oriented schemata supporting multiple inheritance, which preserves taxonomy coherence and performs taxonomic inferences. Its theoretical framework is based on terminological logics, which have been developed in the area of artificial intelligence. The framework includes a rigorous formalization of complex objects, which is able to express cyclic references on the schema and instance level; a subsumption algorithm, which computes all implied specialization relationships between types; and an algorithm to detect incoherent types, i.e., necessarily empty types. Using results from formal analyses of knowledge representation languages, we show that subsumption and incoherence detection are computationally intractable from a theoretical point of view. However, the problems appear to be feasible in almost all practical cases.
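The flavor of structural subsumption can be conveyed with a toy check on record types. The type language here is invented for illustration (a type is a dict of attribute names to atomic type names, plus a subtyping order on atoms); the paper's terminological logic also handles cycles and far richer constructors:

```python
# Atomic subtype order (invented): int <= number, float <= number.
ATOM_LE = {("int", "number"), ("float", "number")}

def atom_le(a, b):
    return a == b or (a, b) in ATOM_LE

def subsumes(general, specific):
    """general subsumes specific iff every attribute constraint of the
    general type is met, possibly by a more specific atom, in the specific one."""
    return all(attr in specific and atom_le(specific[attr], typ)
               for attr, typ in general.items())

person   = {"name": "string"}
employee = {"name": "string", "salary": "int"}
paid     = {"salary": "number"}
```

Here `subsumes(person, employee)` holds (Employee is an implied specialization of Person), while the converse fails because Person lacks the `salary` attribute; an incoherent (necessarily empty) type would be detected when no instance can satisfy all its constraints at once.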


1994 - Constraints in Complex Object Database Models [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Database design almost invariably includes the specification of a set of rules (the integrity constraints) which should guarantee its consistency. Constraints are expressed in various fashions, depending on the data model, e.g. subsets of first-order logic, inclusion dependencies and predicates on row values, or methods in OO environments. Provided that an adequate formalism to express them is available, the following question arises: is there any way to populate a database which satisfies the constraints supplied by a designer? Means of answering this question should be embedded in automatic design tools, whose use is recommended or often required in the difficult task of designing complex database schemas. The contribution of this research is to propose a computational solution to the problem of schema consistency in Complex Object Data Models.


1994 - Description Logics as a core of a tutoring system [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

A tutoring system based on description logics (DFKI report no. D-94-10).


1994 - Reasoning with constraints in Database Models [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Database design almost invariably includes a specification of a set of rules (the integrity constraints) which should guarantee its consistency. Provided that an adequate mechanism to express them is available, the following question arises: is there any way to populate a database which satisfies the constraints supplied by a designer? I.e., does the database schema, including constraints, admit at least one model in which all classes are non-empty? This work gives an answer to the above question in an OODB environment, providing a Data Definition Language (DDL) able to express the semantics of a relevant set of state constraints and a specialized reasoner able to check the consistency of a schema with such constraints.


1994 - THE E/S KNOWLEDGE REPRESENTATION SYSTEM [Articolo su rivista]
Bergamaschi, Sonia; Lodi, Silvano Alberto; Sartori, C.
abstract

This paper introduces the E/S knowledge representation model and describes a system based on that model. The model takes ideas from KL-ONE and ER, and its main strength is the direct representation of n-ary relationships. The system is classification-based, and therefore organizes its knowledge in hierarchies of structured intensional objects and offers a set of services to reason about intensional objects, to store extensional objects and to make inferences on the stored knowledge.


1994 - The Entity/Situation Knowledge Representation System [Articolo su rivista]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Elsevier Science B.V. (North-Holland)


1994 - Using subsumption for semantic query optimization [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

The purpose of semantic query optimization is to use semantic knowledge (e.g., integrity constraints) for transforming a query into an equivalent one that may be answered more efficiently than the original version. This paper proposes a general method for semantic query optimization in the framework of OODBs (Object Oriented Database Systems). The method is applicable to the class of conjunctive queries and is based on two ingredients: a description logic able to express both class descriptions and integrity constraint rules (IC rules) as types; subsumption computation between types to evaluate the logical implications expressed by IC rules.
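As a rough illustration of the general idea (not the paper's actual type- and subsumption-based method), here is a hypothetical Python sketch: a conjunctive query is expanded with predicates implied by integrity-constraint rules, which may let an optimizer pick a more selective access path. All predicate names and rules are invented for illustration.

```python
# Hypothetical sketch: semantic query optimization by forward-chaining
# integrity-constraint (IC) rules over a conjunctive query's predicates.

def optimize(query_preds, ic_rules):
    """Add every predicate implied by an IC rule whose body is
    already satisfied by the query (a simple semantic expansion)."""
    preds = set(query_preds)
    changed = True
    while changed:
        changed = False
        for body, head in ic_rules:
            if body <= preds and head not in preds:
                preds.add(head)   # the implied predicate may enable an index
                changed = True
    return preds

# Invented IC rule: every manager has a salary above 50k.
rules = [(frozenset({"manager(x)"}), "salary_gt_50k(x)")]
query = {"manager(x)", "dept(x, 'R&D')"}
print(sorted(optimize(query, rules)))
```

The sketch only adds implied predicates; a real optimizer would also drop redundant ones and use the implications to detect unsatisfiable (always-empty) queries.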


1993 - Prototypes in the LOGIDATA+ Project [Capitolo/Saggio]
A., Artale; J. P., Ballerini; Bergamaschi, Sonia; F., Cacace; S., Ceri; F., Cesarini; A., Formica; H., Lam; S., Greco; G., Marrella; M., Missikoff; L., Palopoli; L., Pichetti; D., Saccà; S., Salza; C., Sartori; G., Soda; L., Tanca; M., Toiati
abstract

The paper introduces the prototypes developed in the LOGIDATA+ project


1993 - Taxonomic Reasoning in LOGIDATA+ [Capitolo/Saggio]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori; A., Artale; F., Cesarini; G., Soda
abstract

This chapter introduces the subsumption computation techniques for a LOGIDATA+ schema.


1993 - Taxonomic Reasoning with Cycles in LOGIDATA+ [Capitolo/Saggio]
Beneventano, Domenico; Bergamaschi, Sonia; Sartori, C.
abstract

This chapter shows the subsumption computation techniques for a LOGIDATA+ schema allowing cyclic definitions for classes. The formal framework LOGIDATA_CYC*, which extends LOGIDATA* to perform taxonomic reasoning in the presence of cyclic class definitions is introduced. It includes the notions of possible instances of a schema; legal instance of a schema, defined as the greatest fixed-point of possible instances; subsumption relation. On the basis of this framework, the definitions of coherent type and consistent class are introduced and the necessary algorithms to detect incoherence and compute subsumption in a LOGIDATA+ schema are given. Some examples of subsumption computation show its feasibility for schema design and validation.
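To make the greatest-fixed-point reading of cyclic definitions concrete, a small hypothetical sketch (not taken from the chapter): over a finite universe, the legal extension of a cyclically defined class can be computed by starting from all objects and repeatedly discarding those that violate the definition. The class and successor relation below are invented for illustration.

```python
# Hypothetical sketch: greatest-fixed-point semantics for a cyclic class
# definition over a finite universe. The class is defined cyclically:
# an object belongs to it iff its successor does.

def greatest_fixpoint(universe, succ):
    """Start from the whole universe and shrink until stable."""
    ext = set(universe)
    changed = True
    while changed:
        changed = False
        for obj in list(ext):
            if succ.get(obj) not in ext:   # definition violated
                ext.discard(obj)
                changed = True
    return ext

# a -> b -> a forms a cycle; c -> d -> e, where e is outside the universe,
# so d and then c drop out, while the cycle {a, b} survives.
succ = {"a": "b", "b": "a", "c": "d", "d": "e"}
print(sorted(greatest_fixpoint({"a", "b", "c", "d"}, succ)))
```

Under a least-fixed-point reading the cycle {a, b} would be empty instead, which is why the choice of semantics matters for cyclic class definitions.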


1993 - Using Subsumption in Semantic Query Optimization [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

The purpose of semantic query optimization is to use semantic knowledge (e.g. integrity constraints) for transforming a query into an equivalent one that may be answered more efficiently than the original version. This paper proposes a general method for semantic query optimization in the framework of OODBs (Object Oriented Database Systems). The method is applicable to the class of conjunctive queries and is based on two ingredients: a description logic able to express both class descriptions and integrity constraints rules (IC rules) as types; subsumption computation between types to evaluate the logical implications expressed by IC rules.


1993 - Uso della Subsumption per l'Ottimizzazione Semantica delle Queries [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

This work analyzes the possibility of performing Semantic Query Optimization using the subsumption relation. It includes a formalization of complex object data models, enriched with the notion of subsumption, which identifies all the specialization relationships between object classes on the basis of their descriptions.


1992 - On taxonomic reasoning in conceptual design [Articolo su rivista]
Bergamaschi, Sonia; Sartori, C.
abstract

This paper introduces for the first time the coupling of description logics and conceptual database design.


1992 - Representation Extensions of DLs [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Representational extensions of description logics.


1992 - Subsumption for Complex Object Data Models [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia
abstract

We adopt a formalism, similar to the terminological logic languages developed in AI knowledge representation systems, to express the semantics of complex object data models. Two main extensions are proposed with respect to previously proposed models: the conjunction operator, which permits the expression of multiple inheritance between types (classes) as a semantic property, and the introduction in the schema of derived classes, similar to views. These extensions, together with the adoption of suitable semantics for dealing with cyclic descriptions, allow for the automatic placement of classes in a specialization hierarchy. By mapping schemata to nondeterministic finite automata we face and solve interesting problems such as the detection of emptiness of a class extension and the computation of a specialization ordering for the greatest, least and descriptive semantics. As queries can be expressed as derived classes, these results also apply to intensional query answering and query validation.
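A toy structural-subsumption check may help fix the intuition; this is an assumption-laden simplification, not the automata-based algorithm the paper describes. A class description here maps each attribute to a set of admissible values, and one description subsumes another if every constraint it imposes is implied by the other's. The example classes are invented.

```python
# Hypothetical sketch of structural subsumption between class descriptions.
# A description maps attributes to sets of admissible values; a class
# subsumes another if each attribute it constrains is also constrained
# by the other, with a subset of its admissible values. (Simplified:
# the paper maps schemas to finite automata to handle cycles and
# derived classes.)

def subsumes(general, specific):
    return all(
        attr in specific and specific[attr] <= values
        for attr, values in general.items()
    )

vehicle = {"wheels": {2, 3, 4}}
car = {"wheels": {4}, "fuel": {"petrol", "diesel"}}
print(subsumes(vehicle, car))   # car descriptions satisfy vehicle's constraints
print(subsumes(car, vehicle))   # but not vice versa
```

Checks like this are what let a system place a new class automatically in the specialization hierarchy: compare it against existing descriptions and insert it below its subsumers.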


1991 - E-S: a three-sorted terminological language [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

A three-sorted terminological language: Entity-Situation.


1991 - Research Interests and Accomplishments for the Terminological Users Workshop [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

New requirements for coupling databases and terminological logics.


1991 - Subsumption for Database Schema Design [Relazione in Atti di Convegno]
Bergamaschi, Sonia; C., Sartori
abstract

Subsumption is a useful technique for consistency checking in databases.


1991 - Taxonomic Reasoning in Complex Object Data Models [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori
abstract

We adopt a formalism, similar to the terminological logic languages developed in AI knowledge representation systems, to express the semantics of complex object data models. Two main extensions are proposed with respect to previously proposed models: the conjunction operator, which permits the expression of multiple inheritance between types (classes) as a semantic property, and the introduction in the schema of derived classes, similar to views. We then introduce the notion of subsumption between classes.


1991 - Taxonomic Reasoning in LOGIDATA+ [Relazione in Atti di Convegno]
Beneventano, Domenico; Bergamaschi, Sonia; C., Sartori
abstract

This paper introduces the subsumption computation techniques for a LOGIDATA+ schema.


1989 - An expert system for the selection of a composite material [Relazione in Atti di Convegno]
Bergamaschi, Sonia; G., Bombarda; L., Piancastelli; C., Sartori
abstract

An expert system for composite material selection starting from user specifications was developed. Since a database management system (DBMS) was necessary to manage the amount of information for material and application characterization, a logical interface between the Expert 2 system and the relational database was developed. This configuration allows complete separation between the database problem of material characteristic management and the rule-oriented material selection problem handled by the expert system.


1989 - Entita'-Situazione: un modello di rappresentazione della conoscenza [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Trieste


1989 - Ingegneria della Conoscenza [Articolo su rivista]
Bergamaschi, Sonia; C., Sartori
abstract

The need to develop knowledge representation systems that combine expressiveness and efficiency in the management of large amounts of information is by now widely felt. This article discusses the evolution from DBMSs to systems capable of adding expressiveness and greater intelligence.


1989 - Un editor intelligente per la costruzione di una base di conoscenza secondo il modello entita' situazione [Relazione in Atti di Convegno]
Bergamaschi, Sonia; S., Lodi; C., Sartori
abstract

Proceedings: P. Mello (ed.). Bologna.


1988 - Basi di conoscenza e basi di dati: memorizzazione degli aspetti estensionali [Relazione in Atti di Convegno]
Bergamaschi, Sonia; P., Ciaccia; C., Sartori
abstract

This paper describes the different representations of extensional knowledge in databases and in knowledge bases.


1988 - Entity-Situation: a Model for the Knowledge Representation Module of a KBMS [Relazione in Atti di Convegno]
Bergamaschi, Sonia; Bonfatti, Flavio; C., Sartori
abstract

ADKS, an advanced data and knowledge management system whose main objective is to couple expressiveness and efficiency in the management of large knowledge bases, is described. The architecture of the system and a new semantic model, which is the basis of its knowledge representation module, are presented.


1988 - On taxonomic reasoning in E/R environment [Relazione in Atti di Convegno]
Bergamaschi, Sonia; L., Cavedoni; C., Sartori; P., Tiberio
abstract

In C. Batini (ed.), Entity-Relationship Approach, Elsevier Science Publishers B.V.


1988 - Relational data base design for the intensional aspects of a knowledge base [Articolo su rivista]
Bergamaschi, Sonia; F., Bonfatti; L., Cavazza; P., Tiberio
abstract

Pergamon Press


1987 - Conceptual Models: Support for a Database Projects [Articolo su rivista]
Bergamaschi, Sonia; Bonfatti, Flavio; Sartori, Claudio; Tiberio, Paolo
abstract

The paper describes some basic concepts adopted by the most noteworthy conceptual models proposed within the framework of drawing up projects and prototypes of information systems based on database technology. Some systems provided with high-level programming languages are presented. Integrating concepts derived from the field of artificial intelligence, they constitute a valid tool for designing and developing database systems.


1986 - Choice of the optimal number of blocks for data access by an index [Articolo su rivista]
Bergamaschi, Sonia; M. R., Scalas
abstract

Pergamon Press