Sukhobok, N. Nikolov, and D. Roman. Tabular Data Anomaly Patterns. To appear in the proceedings of The 3rd International Conference on Big Data Innovations and Applications (Innovate-Data 2017), 21-23 August 2017, Prague, Czech Republic, IEEE.
- Abstract: One essential and challenging task in data science is data cleaning — the process of identifying and eliminating data anomalies. Different data types, data domains, data acquisition methods, and final purposes of data cleaning have resulted in different approaches in defining data anomalies in the literature. This paper proposes and describes a set of basic data anomalies in the form of anomaly patterns commonly encountered in tabular data, independently of the data domain, data acquisition technique, or the purpose of data cleaning. This set of anomalies can serve as a valuable basis for developing and enhancing software products that provide general-purpose data cleaning facilities and can provide a basis for comparing different tools aimed to support tabular data cleaning capabilities. Furthermore, this paper introduces a set of corresponding data operations suitable for addressing the identified anomaly patterns and introduces Grafterizer — a software framework that implements those data operations.
Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach by S. Pozzati, D. Sanvito, C. Castelli, D. Roman, Territorio Italia 2 (2016), DOI: 10.14609/Ti_2_16_2e.
- Abstract: The analysis and better understanding of the distribution of wealth of individuals in cities can be a precious tool, especially in support of the estimation of real estate values. These analyses can also be used to facilitate decision making in various sectors, such as public administration or the real estate market. In this paper, by making use of publicly available data and of data owned by Cerved (a credit scoring company in Italy), we observe the territorial distribution of the properties of managers and shareholders – categories of people usually linked to high economic well-being – and, based on that, we identify the areas of the cities where the value of real estate properties is presumably higher. More specifically, we introduce the Manager and Shareholder Concentration (MSHC) score and validate its accuracy and effectiveness in three Italian cities (Turin, Rome and Milan).
DataGraft: One-Stop-Shop for Open Data Management by D. Roman, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. Berre, X. Ye, M. Dimitrov, A. Simov, M. Zarev, R. Moynihan, B. Roberts, I. Berlocher, S. Kim, T. Lee, A. Smith, and T. Heath. Semantic Web journal, 2016.
- Abstract: This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better and easier to use tools for data workers and developers (e.g., open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g., interactive creation, execution, sharing, and reuse) and reliable data hosting services. This paper provides an overview of DataGraft focusing on the rationale, key features and components, and evaluation.
DataGraft: Simplifying Open Data Publishing by D. Roman, M. Dimitrov, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. J. Berre, X. Ye, A. Simov and Y. Petkov. ESWC Demo paper. 2016.
- Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.
Tabular Data Cleaning and Linked Data Generation with Grafterizer by D. Sukhobok, N. Nikolov, A. Pultier, X. Ye, A. J. Berre, R. Moynihan, B. Roberts, B. Elvesæter, N. Mahasivam and D. Roman. ESWC Demo paper. 2016.
- Abstract: Over the past several years the amount of published open data has increased significantly. The majority of this is tabular data, which requires powerful and flexible approaches for data cleaning and preparation in order to convert it into Linked Data. This paper introduces Grafterizer – a software framework developed to support data workers and data developers in the process of converting raw tabular data into linked data. Its main components include Grafter, a powerful software library and DSL for data cleaning and RDF-ization, and Grafterizer, a user interface for interactive specification of data transformations along with a back-end for management and execution of data transformations. The proposed demonstration will focus on Grafterizer’s powerful features for data cleaning and RDF-ization in a scenario using data about the risk of failure of transport infrastructure components due to natural hazards.
Towards a Reference Architecture for Trusted Data Marketplaces by Dumitru Roman and Stefano Gatti. 2nd International Conference on Open and Big Data, 2016.
- Abstract: Data sharing presents extensive opportunities and challenges in domains such as the public sector, health care and financial services. This paper introduces the concept of “trusted data marketplaces” as a mechanism for enabling trusted sharing of data. It takes credit scoring—an essential mechanism of the entire world-economic environment, determining access for companies and individuals to credit and the terms under which credit is provisioned—as an example for the realization of the trusted data marketplaces concept. This paper looks at credit scoring from a data perspective, analyzing current shortcomings in the use and sharing of data for credit scoring, and outlining a conceptual framework in terms of a trusted data marketplace to overcome the identified shortcomings. The contribution of this paper is two-fold: (1) identify and discuss the core data issues that hinder innovation in credit scoring; (2) propose a conceptual architecture for trusted data marketplaces for credit scoring in order to serve as a reference architecture for the implementation of future credit scoring systems. The architecture is generic and can be adopted in other domains where data sharing is of high relevance.
The proDataMarket SoE and CAPAS business cases were published and presented at the RuleML 2015 Industry Track:
Norwegian State of Estate: A Reporting Service for the State-Owned Properties in Norway by Ling Shi, Bjørg E. Pettersen, Ivar Østhassel, Nikolay Nikolov, Arash Khorramhonarnama, Arne J. Berre, and Dumitru Roman
- Abstract: Statsbygg is the public sector administration company responsible for reporting the state-owned property data in Norway. Traditionally the reporting process has been resource-demanding and error-prone. The State of Estate (SoE) business case presented in this paper is creating a new reporting service by sharing, integrating and utilizing cross-sectorial property data, aiming to increase the transparency and accessibility of property data from public sectors, enabling downstream innovation. This paper explains the ambitions of the SoE business case, highlights the technical challenges related to data integration, data quality, data sharing and analysis, and discusses the current solution and the potential use of rules technologies.
CAPAS: A Service for Improving the Assignments of Common Agriculture Policy Funds to Farmers and Land Owners by Mariano Navarro, Ramón Baiget, Jesús Estrada and Dumitru Roman
- Abstract: The Tragsa Group is part of the group of companies administered by the Spanish state-owned holding company Sociedad Estatal de Participaciones Industriales (SEPI). Its 37 years of experience have placed this business group at the forefront of different sectors ranging from agricultural, forestry, livestock, and rural development services, to conservation and protection of the environment in Spain. Tragsa is currently developing a business case around the implementation of a Common Agriculture Policy Assignment Service (CAPAS) – an extension of a currently active and widely used service (more than 20 million visits per year). The extension of the service in this business case is based on leveraging new cross-sectorial data sources, and targets a substantial reduction of incorrect agricultural funds assignments to farmers and land owners. This paper provides an overview of the business case and of the technical challenges related to the implementation of CAPAS (in areas such as data integration), and discusses the current solution and the potential use of rule technologies.
SINTEF is Scandinavia’s largest independent research organization. SINTEF is multidisciplinary, with international top-level expertise in a wide range of technological and scientific disciplines, including areas such as ICT, medicine, and the social sciences. SINTEF’s company vision is “technology for a better society”, and it is an important aspect of SINTEF’s societal role to contribute to the creation of more jobs. SINTEF acts as an incubator, commercialising technologies through the establishment of new companies. SINTEF is represented in proDataMarket by Information and Communication Technology (SINTEF ICT) through the department for Networked Systems and Services (NSS).
Role in the project: SINTEF is the project leader of proDataMarket and also serves as a technology provider. SINTEF’s technical work focuses on the proDataMarket platform infrastructure for data management, in particular data publishing and access, helping organizations with cost-effective solutions for (linked open) data management. Our goal is to promote standardisation through mechanisms for defining the structure and semantics of data, and to improve interoperability and transparency among data publishers and consumers by leveraging the linked data format. Technically, we are constructing a software framework that consists of a frontend and a set of platform services that support reusable data cleaning and reconfiguration based on pluggable static, dynamic or streaming input in various formats (e.g., relational databases, CSV files, WMS/WFS services). Outputs will be published on the proDataMarket platform and made available to end users and other publishers through a secured set of platform services such as SPARQL query endpoints and RESTful APIs. The framework is meant to provide automation that significantly reduces the manual effort involved in the laborious process of data retrieval and aggregation.
In proDataMarket, SINTEF reuses and extends its data reconfiguration solutions from the DaPaaS project. In particular, we plan to further develop the Grafterizer tool for data cleaning and linked data mapping of tabular inputs.
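As a rough illustration of the kind of pipeline described above (cleaning tabular input, then mapping it to linked data), the following Python sketch trims a small CSV sample and emits N-Triples lines. It is not Grafterizer's or Grafter's API: the namespace `http://example.org/`, the sample data, and the function names `clean_rows` and `to_ntriples` are all hypothetical and chosen only for this example.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for illustration only, not a proDataMarket URI.
BASE = "http://example.org/"

# Hypothetical sample input with untidy whitespace and an empty row.
SAMPLE = """id, name ,area_m2
b1,  Town Hall ,1200
 , ,
b2,Library,
"""


def clean_rows(csv_text):
    """Yield rows as dicts with trimmed headers and values, skipping empty rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        cleaned = {
            k.strip(): (v or "").strip()
            for k, v in row.items()
            if k is not None  # ignore surplus cells that have no header
        }
        if any(cleaned.values()):
            yield cleaned


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def to_ntriples(rows, id_column):
    """Map each non-empty cell to one triple; the id column names the subject."""
    triples = []
    for row in rows:
        if not row.get(id_column):
            continue  # a row without an identifier cannot be a subject
        subject = f"<{BASE}resource/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    for line in to_ntriples(clean_rows(SAMPLE), "id"):
        print(line)
```

The sketch treats every value as a plain string literal; a real mapping would typically assign datatypes and reuse existing vocabularies rather than minting a property per column.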