New paper: Tabular Data Anomaly Patterns

D. Sukhobok, N. Nikolov, and D. Roman. Tabular Data Anomaly Patterns. To appear in the proceedings of the 3rd International Conference on Big Data Innovations and Applications (Innovate-Data 2017), 21-23 August 2017, Prague, Czech Republic, IEEE.

  • Abstract: One essential and challenging task in data science is data cleaning — the process of identifying and eliminating data anomalies. Different data types, data domains, data acquisition methods, and final purposes of data cleaning have resulted in different approaches in defining data anomalies in the literature. This paper proposes and describes a set of basic data anomalies in the form of anomaly patterns commonly encountered in tabular data, independently of the data domain, data acquisition technique, or the purpose of data cleaning. This set of anomalies can serve as a valuable basis for developing and enhancing software products that provide general-purpose data cleaning facilities and can provide a basis for comparing different tools aimed to support tabular data cleaning capabilities. Furthermore, this paper introduces a set of corresponding data operations suitable for addressing the identified anomaly patterns and introduces Grafterizer — a software framework that implements those data operations.
  • Download paper
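
As a rough illustration of what a check for an anomaly pattern in tabular data can look like, the sketch below flags two kinds of anomalies in a small table: empty cells and exact duplicate rows. Both checks, the sample data, and the function names are assumptions chosen for illustration. They are not taken from the paper's pattern catalogue, and they do not use Grafterizer.

```python
import csv
import io

# Hypothetical sample table; the values are made up for illustration.
SAMPLE_CSV = """id,city,area_m2
1,Oslo,120
2,,85
3,Prague,
1,Oslo,120
"""


def find_missing_cells(rows):
    """Return (row_index, column_name) pairs for empty or whitespace-only cells."""
    return [
        (i, column)
        for i, row in enumerate(rows)
        for column, value in row.items()
        if value is None or value.strip() == ""
    ]


def find_duplicate_rows(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row.items())
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = find_missing_cells(rows)
duplicates = find_duplicate_rows(rows)
print("missing cells:", missing)
print("duplicate rows:", duplicates)
```

Row indices are zero-based and exclude the header line. A cleaning step would then decide what to do with each finding, for example filling, dropping, or merging the affected rows.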

New proDataMarket paper: Combining Sentinel-2 and LiDAR data for objective and automated identification of agricultural parcel features

Combining Sentinel-2 and LiDAR data for objective and automated identification of agricultural parcel features by Jesús Estrada, Héctor Sanchez, Lorena Hernanz, María José Checa and Dumitru Roman

This new proDataMarket paper explains how a comprehensive strategy combining remote sensing and field data can support more effective agriculture management. Satellite data are suitable for monitoring large areas over time, while LiDAR provides specific and accurate data on height and relief. Both types of data can be used for calibration and validation purposes, avoiding field visits and saving useful resources. In this paper we propose a process for objective and automated identification of agricultural parcel features based on processing and combining Sentinel-2 data (to sense different types of irrigation patterns) and LiDAR data (to detect landscape elements). The proposed process was validated in several use cases in Spain, yielding high accuracy rates in the identification of parcel features. An important application example of the work reported in this paper is the European Union (EU) Common Agricultural Policy (CAP) funds assignment service, which would benefit significantly from a more objective and automated process for identifying agricultural parcel features, potentially allowing the EU to save significant amounts of money every year.

Although some issues regarding the generation and improvement of agricultural property datasets were already explained in our previous blog entry (Data workflow in CAPAS), this paper highlights the current results of generation and usage of this new information.

Figure: Irrigation patterns map obtained using the Sentinel-2 process

The main result of this analysis is that external, and usually underused, data sources offer a powerful and accurate tool for generating new contrast and validation data for the information used by the Spanish CAP Payment Agency, helping it provide a better service to landowners and farmers. In conclusion, the Sentinel-2 series and LiDAR data can help detect areas that are not eligible for grant assignment, support cross-checks, and guide the selection of field samples.

The paper is available here.

New Paper: Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach

Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach by S. Pozzati, D. Sanvito, C. Castelli, D. Roman, Territorio Italia 2 (2016), DOI: 10.14609/Ti_2_16_2e.

  • Abstract: The analysis and better understanding of the distribution of wealth of individuals in cities can be a precious tool, especially in support of the estimation of real estate values. These analyses can also be used to facilitate decision making in various sectors, such as public administration or the real estate market. In this paper, by making use of publicly available data and of data owned by Cerved (a credit scoring company in Italy), we can observe the territorial distribution of the properties of managers and shareholders – categories of people usually linked to high economic well-being – and, based on that, we identify the areas of the cities where the value of real estate properties is presumably higher. More specifically, we introduce the Manager and Shareholder Concentration (MSHC) score and validate its accuracy and effectiveness in three Italian cities (Turin, Rome and Milan).
  • Download paper

The proDataMarket Ontology: Enabling Semantic Interoperability of Real Property Data

Real property data (often referred to as real estate, realty, or immovable property data) represent a valuable asset that has the potential to enable innovative services when integrated with related contextual data (e.g., business data). Such services can range from providing evaluation of real estate to reporting on up-to-date information about state-owned properties. Real property data integration is a difficult task primarily due to the heterogeneity and complexity of the real property data, and the lack of generally agreed upon semantic descriptions of the concepts in this domain. The proDataMarket ontology is developed in the project as a key enabler for integration of real property data.

The design and development of the proDataMarket ontology followed techniques and design choices supported by existing methodologies, mainly the one proposed by Noy and McGuinness [1]. Requirements were extracted from a set of relevant business cases, and competency questions [2] were defined for each business case, along with core concepts and relationships. A conceptual model was then developed based on these requirements and on international standards, including ISO 19152:2012 and the European Union’s INSPIRE data specifications. For example, the LADM conceptual model from ISO 19152:2012 serves as the reference model for the proDataMarket cadastral domain conceptual model. We then implemented the conceptual model using the RDFS and OWL linked data standards. RDFS is used to model concepts, properties, and simple relationships such as rdfs:subClassOf. OWL builds on RDFS and provides a richer language for web ontology modelling; it is used to model constraints and other advanced relationships, such as the cardinality constraint needed to express the relationship between properties and buildings.
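
To make the division of labour between RDFS and OWL more concrete, here is a small sketch that uses only the Python standard library. It stores triples as plain tuples, follows rdfs:subClassOf links transitively, and checks a maximum-cardinality rule on instance data. The ex: class and property names, the sample data, and the "at most one" rule are hypothetical and are not part of the proDataMarket ontology. Treating a cardinality restriction as a closed-world validation check is also a simplification of OWL semantics; a real setup would rely on an RDF library and a reasoner or validator.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical schema and instance triples: (subject, predicate, object).
TRIPLES = [
    ("ex:Building", SUBCLASS_OF, "ex:RealProperty"),
    ("ex:RealProperty", SUBCLASS_OF, "ex:SpatialUnit"),
    ("ex:building1", RDF_TYPE, "ex:Building"),
    ("ex:building1", "ex:locatedOn", "ex:parcel7"),
    ("ex:building2", RDF_TYPE, "ex:Building"),
    ("ex:building2", "ex:locatedOn", "ex:parcel7"),
    ("ex:building2", "ex:locatedOn", "ex:parcel9"),
]


def superclasses(cls, triples):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def max_cardinality_violations(prop, limit, triples):
    """Return subjects that have more than `limit` distinct values for prop."""
    values = {}
    for s, p, o in triples:
        if p == prop:
            values.setdefault(s, set()).add(o)
    return sorted(s for s, objs in values.items() if len(objs) > limit)


print(sorted(superclasses("ex:Building", TRIPLES)))
print(max_cardinality_violations("ex:locatedOn", 1, TRIPLES))
```

The first function covers the kind of simple relationship that RDFS expresses; the second mimics the kind of constraint for which OWL is needed.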

The proDataMarket ontology can be accessed at http://vocabs.datagraft.net/proDataMarket/. The ontology has been divided into several sub-ontologies (see Table below), reflecting the cross-domain nature of the requirements. This modular approach also helped to handle the complexity of the model and made it easier to maintain. In the current version, there are 11 sub-ontologies with 43 native classes and 43 native properties.

Table: Composition of the proDataMarket ontology

| Domain/module | Namespace prefix | URL | Classes | Properties | Business cases |
|---|---|---|---|---|---|
| Common | prodm-com | http://vocabs.datagraft.net/proDataMarket/0.1/Common# | 4 | 4 | ALL |
| Cadaster | prodm-cad | http://vocabs.datagraft.net/proDataMarket/0.1/Cadastre# | 6 | 16 | SoE, RVAS, NNAS, SIM |
| State of Estate Report | prodm-soe | http://vocabs.datagraft.net/proDataMarket/0.1/SoE# | 4 | 2 | SoE, RVAS |
| Business Entity | - | Reuses existing vocabularies; no new classes or properties | 0 | 0 | SoE, RVAS |
| Building Accessibility | - | Reuses existing vocabularies; no new classes or properties | 0 | 0 | SoE |
| Natural Hazard | prodm-nh | http://vocabs.datagraft.net/proDataMarket/0.1/NaturalHazard# | 1 | 0 | RVAS |
| Land Parcel Identification System (LPIS) | prodm-lpis | http://vocabs.datagraft.net/proDataMarket/0.1/LPIS# | 1 | 7 | CAPAS |
| Sentinel data | prodm-sen | http://vocabs.datagraft.net/proDataMarket/0.1/Sentinel# | 1 | 1 | CAPAS |
| Landscape Elements (LiDAR data) | prodm-lid | http://vocabs.datagraft.net/proDataMarket/0.1/Lidar# | 3 | 0 | CAPAS |
| Assessment | prodm-asm | http://vocabs.datagraft.net/proDataMarket/0.1/Assessment# | 3 | 3 | CAPAS |
| CensusTract | prodm-ct | http://vocabs.datagraft.net/proDataMarket/0.1/CensusTract# | 1 | 0 | CST, CCRS |
| Urban Infrastructure | prodm-ui | http://vocabs.datagraft.net/proDataMarket/0.1/UrbanInfrastructure# | 17 | 10 | SIM |
| Protected Sites | prodm-ps | http://vocabs.datagraft.net/proDataMarket/0.1/ProtectedSite# | 2 | 0 | CAPAS |
| Total | | | 43 | 43 | |
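
The namespace prefixes in the table are typically used to abbreviate full term URLs. The sketch below copies the prefixes, namespace URLs, and class and property counts from the table, expands a prefixed name into a full URL, and re-checks the totals. The local name `ExampleTerm` and the helper `expand` are hypothetical and serve only as illustration; the ontology's actual term names are not listed here.

```python
BASE = "http://vocabs.datagraft.net/proDataMarket/0.1/"

# (prefix, namespace URL, native classes, native properties), from the table.
MODULES = [
    ("prodm-com", BASE + "Common#", 4, 4),
    ("prodm-cad", BASE + "Cadastre#", 6, 16),
    ("prodm-soe", BASE + "SoE#", 4, 2),
    ("prodm-nh", BASE + "NaturalHazard#", 1, 0),
    ("prodm-lpis", BASE + "LPIS#", 1, 7),
    ("prodm-sen", BASE + "Sentinel#", 1, 1),
    ("prodm-lid", BASE + "Lidar#", 3, 0),
    ("prodm-asm", BASE + "Assessment#", 3, 3),
    ("prodm-ct", BASE + "CensusTract#", 1, 0),
    ("prodm-ui", BASE + "UrbanInfrastructure#", 17, 10),
    ("prodm-ps", BASE + "ProtectedSite#", 2, 0),
]

NAMESPACES = {prefix: url for prefix, url, _, _ in MODULES}


def expand(prefixed_name):
    """Expand 'prefix:local' into a full URL using the namespaces above."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local


total_classes = sum(classes for _, _, classes, _ in MODULES)
total_properties = sum(props for _, _, _, props in MODULES)

print(len(MODULES), total_classes, total_properties)
print(expand("prodm-cad:ExampleTerm"))  # ExampleTerm is a made-up local name
```

The two modules that only reuse existing vocabularies have no namespace of their own, which is why the list has 11 entries.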

More than 30 datasets have been published through the DataGraft platform [3] [4] using the proDataMarket ontology as a central reference model. All seven business cases use the proDataMarket ontology in data publishing.

More details on the proDataMarket vocabulary can be found in the paper “The proDataMarket Ontology for Publishing and Integrating Cross-domain Real Property Data”, which was accepted for publication in the scientific journal Territorio Italia: Land Administration, Cadastre and Real Estate [5].

References

  • [1] Noy, Natalya F., and Deborah L. McGuinness. “Ontology development 101: A guide to creating your first ontology.” (2001).
  • [2] Grüninger, Michael, and Mark S. Fox. “Methodology for the Design and Evaluation of Ontologies.” (1995).
  • [3] Roman, D., et al. DataGraft: One-Stop-Shop for Open Data Management. Semantic Web, preprint, pp. 1-19, 2017. DOI: 10.3233/SW-170263.
  • [4] Roman, D., et al. DataGraft: Simplifying Open Data Publishing. ESWC (Satellite Events) 2016: 101-106.
  • [5] L. Shi, N. Nikolov, D. Sukhobok, T. Tarasova and D. Roman. “The proDataMarket Ontology for Publishing and Integrating Cross-domain Real Property Data”. To appear in the journal “Territorio Italia: Land Administration, Cadastre and Real Estate”, n. 2/2017.

New DataGraft-related papers

DataGraft: One-Stop-Shop for Open Data Management by D. Roman, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. Berre, X. Ye, M. Dimitrov, A. Simov, M. Zarev, R. Moynihan, B. Roberts, I. Berlocher, S. Kim, T. Lee, A. Smith, and T. Heath. Semantic Web journal, 2016.

  • Abstract: This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better and easier to use tools for data workers and developers (e.g., open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g., interactive creation, execution, sharing, and reuse) and reliable data hosting services. This paper provides an overview of DataGraft focusing on the rationale, key features and components, and evaluation.
  • Download paper

DataGraft: Simplifying Open Data Publishing by D. Roman, M. Dimitrov, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A.J. Berre, X. Ye, A. Simov and Y. Petkov. ESWC Demo paper. 2016.

  • Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.
  • Download paper

Tabular Data Cleaning and Linked Data Generation with Grafterizer by D. Sukhobok, N. Nikolov, A. Pultier, X. Ye, A.J. Berre, R. Moynihan, B. Roberts, B. Elvesæter, N. Mahasivam and D. Roman. ESWC Demo paper. 2016.

  • Abstract: Over the past several years the amount of published open data has increased significantly. The majority of this is tabular data, which requires powerful and flexible approaches for data cleaning and preparation in order to convert it into Linked Data. This paper introduces Grafterizer – a software framework developed to support data workers and data developers in the process of converting raw tabular data into linked data. Its main components include Grafter, a powerful software library and DSL for data cleaning and RDF-ization, and Grafterizer, a user interface for interactive specification of data transformations along with a back-end for management and execution of data transformations. The proposed demonstration will focus on Grafterizer’s powerful features for data cleaning and RDF-ization in a scenario using data about the risk of failure of transport infrastructure components due to natural hazards.
  • Download paper
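
To give a feel for what RDF-ization of tabular data means, the sketch below turns each row of a small CSV table into N-Triples statements, with one subject per row and one predicate per column. It is written in plain Python and does not use Grafter or Grafterizer. The example.org namespace, the column names, the sample rows, and the function names are all assumptions made for illustration, and the identifier column is assumed to contain values that are safe to embed in a URI.

```python
import csv
import io

# Hypothetical namespace and input data; neither comes from the paper.
BASE_URI = "http://example.org/"
SAMPLE_CSV = """id,name,risk_level
b1,North Bridge,high
t2,Harbour Tunnel,low
"""


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row, id_column="id"):
    """Map one table row to a list of N-Triples lines."""
    subject = f"<{BASE_URI}component/{row[id_column]}>"
    lines = []
    for column, value in row.items():
        if column == id_column or not value:
            continue  # the id becomes the subject; empty cells yield no triple
        predicate = f"<{BASE_URI}property/{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


triples = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    triples.extend(row_to_ntriples(row))

print("\n".join(triples))
```

Every value is emitted as a plain string literal here; a fuller mapping would also assign datatypes and link cells to existing vocabulary terms.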