New Demo Papers at ISWC 2017

Sukhobok, D., H. Sanchez, J. Estrada, D. Roman. Linked Data for Common Agriculture Policy: Enabling Semantic Querying over Sentinel-2 and LiDAR Data. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: The amount of open and free satellite earth observation data combined with available data from other sectors (e.g. biodiversity, landscape elements, cadaster data) has the potential to enhance decision-making processes in various domains. An example of such a domain is agriculture, where the ability to objectively and automatically identify different types of agricultural features (e.g., irrigation patterns and landscape elements) can lead to more effective agriculture management. In this paper we show the possibility to publish and integrate multi-sectoral data from several sources into an existing data-intensive service targeting better and fairer Common Agriculture Policy (CAP) funds assignments to farmers and land owners. We show an end-to-end approach for integrating multi-sectoral data and publishing the result as Linked Data with the help of the DataGraft platform. To demonstrate the use of the resulted dataset, we developed a visualization system prototype showing various information about agricultural parcel features.
  • Download paper

Sukhobok, D., Nikolov, N., Lech, T. C., Moberg, A.-H., Frantsvag, R., Bergaas, H. R., Roman, D. Interacting with Subterranean Infrastructure Linked Data using Augmented Reality. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: Subterranean infrastructure damages caused by excavation works of all kinds are costly and potentially dangerous for workers. Such damages are often caused by poor subterranean data or inappropriate use of the existing data. We aim to provide solutions and services that will hinder obstacles related to the use of subterranean infrastructure data to ensure less damage and less time spent on finding and integrating data about subterranean infrastructure. The result of the work reported in this paper is an augmented reality application that can provide users the ability to see what subterranean infrastructure is located at a given physical location. In this paper we demonstrate a method to create such an application using Linked Data technologies.
  • Download paper

Sukhobok, D. Djordjevic, D. Sanvito and D. Roman. Publishing Socio-Economic Territory Indices as Linked Data and their Visualization for Real Estate Valuation. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: The correct estimation of the real estate value facilitates decision making in various sectors, such as public administration or the real estate market. In this paper we demonstrate a method to manage territory scores and property valuation estimations as Linked Data with the help of the proDataMarket technical framework. The demo illustrates how the proDataMarket technical framework can be used to generate, maintain and serve territory and property valuation estimation data with the help of semantic technologies.
  • Download paper

Shi, L., Pettersen, B. E., Sukhobok, D., Nikolov, N., and Roman, D. Linked Data for the Norwegian State of Estate Reporting Service. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: The Norwegian State of Estate (SoE) report includes information about all Norwegian state-owned properties and buildings in the public sector and aims to assist government decision makers to allocate resources more effectively. A Linked Data based approach is presented here to increase the transparency in the government administration, improve the report generating process and also the report quality. Cross-domain government data originated from the business entity register, the cadastral system, the building accessibility register and the old SoE report are acquired, prepared, cleaned, transformed to Linked Data format and published. The source datasets are then integrated, augmented and interlinked before the results are published as a SPARQL endpoint, used for data visualization and report generation.
  • Download paper

Roman, D., Paniagua, J., Tarasova, T., Georgiev, G., Sukhobok, D., Nikolov, N., and Lech, T. C. proDataMarket: A Data Marketplace for Monetizing Linked Data. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: Linked data has emerged as an interesting technology for publishing structured data on the Web but also as a powerful mechanism for integrating disparate data sources. Various tools and approaches have been developed in the semantic Web community to produce and consume linked data, however little attention has been paid to monetization of linked data. In this paper we introduce a data marketplace – proDataMarket – that enables data providers to generate, advertise, and sell linked data, and data consumers to purchase linked data on the marketplace. The marketplace was originally designed with a focus on geospatial linked data (targeting property-related data providers and consumers) but its capabilities are generic and can be used for data in various domains. This demo will highlight the capabilities offered to the providers and consumers of the data made available on the marketplace.
  • Download paper

Nikolov, N., Sukhobok, D., Dragnev, S., Dalgard, S., Elvesæter, B., von Zernichow, B. M., and Roman, D. DataGraft beta v2: New Features and Capabilities. International Semantic Web Conference. Demo paper. 2017. To appear.

  • Abstract: In this demonstrator, we will introduce the latest features and capabilities added to DataGraft – a Data-as-a-Service platform for data preparation and knowledge graph generation. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the recent features added to DataGraft by exemplifying data publication of statistical data – going from the raw data published at a public portal to published and accessible Linked Data with the help of the tools and features of the platform.
  • Download paper

New Papers at ODBASE 2017

Shi, D. Sukhobok, N. Nikolov and D. Roman. Norwegian State of Estate Report as Linked Open Data. To appear in the proceedings of ODBASE 2017 – The 16th International Conference on Ontologies, DataBases, and Applications of Semantics, Springer, 24-25 October 2017, Rhodes, Greece.

  • Abstract: This paper presents the Norwegian State of Estate (SoE) dataset containing data about real estates owned by the central government in Norway. The dataset is produced by integrating cross-domain government datasets including data from sources such as the Norwegian business entity register, cadastral system, building accessibility register and the previous SoE report. The dataset is made available as Linked Data. The Linked Data generation process includes data acquisition, cleaning, transformation, annotation, publishing, augmentation and interlinking the annotated data as well as quality assessment of the interlinked datasets. The dataset is published under the Norwegian License for Open Government Data (NLOD) and serves as a reference point for applications using data on central government real estates, such as generation of the SoE report, searching properties suitable for asylum reception centres, risk assessment for state-owned buildings or a public building application for visitors.
  • Download paper

M. von Zernichow and D. Roman. Usability of Visual Data Profiling in Data Cleaning and Transformation. To appear in the proceedings of ODBASE 2017 – The 16th International Conference on Ontologies, DataBases, and Applications of Semantics, Springer, 24-25 October 2017, Rhodes, Greece.

  • Download paper

Roman, D. Sukhobok, N. Nikolov, B. Elvesæter and A. Pultier. The InfraRisk Ontology: Enabling Semantic Interoperability for Critical Infrastructures at Risk from Natural Hazards. To appear in the proceedings of ODBASE 2017 – The 16th International Conference on Ontologies, DataBases, and Applications of Semantics, Springer, 24-25 October 2017, Rhodes, Greece.

  • Abstract: Earthquakes, landslides, and other natural hazard events have severe negative socio-economic impacts. Among other consequences, those events can cause damage to infrastructure networks such as roads and railways. Novel methodologies and tools are needed to analyse the potential impacts of extreme natural hazard events and aid in the decision-making process regarding the protection of existing critical road and rail infrastructure as well as the development of new infrastructure. Enabling uniform, integrated, and reliable access to data on historical failures of critical transport infrastructure can help infrastructure managers and scientists from various related areas to better understand, prevent, and mitigate the impact of natural hazards on critical infrastructures. This paper describes the construction of the InfraRisk ontology for representing relevant information about natural hazard events and their impact on infrastructure components. Furthermore, we present a software prototype that visualizes data published using the proposed ontology.
  • Download paper

New Paper: Data Preparation as a Service Based on Apache Spark

Mahasivam N., Nikolov N., Sukhobok D., Roman D. (2017) Data Preparation as a Service Based on Apache Spark. In: De Paoli F., Schulte S., Broch Johnsen E. (eds) Service-Oriented and Cloud Computing. ESOCC 2017. Lecture Notes in Computer Science, vol 10465. Springer, Cham

  • Abstract: Data preparation is the process of collecting, cleaning and consolidating raw datasets into cleaned data of certain quality. It is an important aspect in almost every data analysis process, and yet it remains tedious and time-consuming. The complexity of the process is further increased by the recent tendency to derive knowledge from very large datasets. Existing data preparation tools provide limited capabilities to effectively process such large volumes of data. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges in data preparation on a large scale, while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and integration with an existing framework for data preparation – Grafterizer. Our solution is based on Apache Spark, and exposes application programming interfaces (APIs) to integrate with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.
  • Download paper

New paper: Enabling the Use of Sentinel-2 and LiDAR Data for Common Agriculture Policy Funds Assignment

Estrada J, Sánchez H, Hernanz L, Checa MJ, Roman D. Enabling the Use of Sentinel-2 and LiDAR Data for Common Agriculture Policy Funds Assignment. ISPRS International Journal of Geo-Information. 2017; 6(8):255.

  • Abstract: A comprehensive strategy combining remote sensing and field data can be helpful for more effective agriculture management. Satellite data are suitable for monitoring large areas over time, while LiDAR provides specific and accurate data on height and relief. Both types of data can be used for calibration and validation purposes, avoiding field visits and saving useful resources. In this paper, we propose a process for objective and automated identification of agricultural parcel features based on processing and combining Sentinel-2 data (to sense different types of irrigation patterns) and LiDAR data (to detect landscape elements). The proposed process was validated in several use cases in Spain, yielding high accuracy rates in the identification of irrigated areas and landscape elements. An important application example of the work reported in this paper is the European Union (EU) Common Agriculture Policy (CAP) funds assignment service, which would significantly benefit from a more objective and automated process for the identification of irrigated areas and landscape elements, thereby enabling the possibility for the EU to save significant amounts of money yearly.
  • Download paper

New release of DataGraft!

We are delighted to announce the second beta release of the DataGraft platform!

What is DataGraft?

DataGraft serves as the core of the proDataMarket producer portal. DataGraft is an online platform that provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists).

The DataGraft platform mainly consists of three components – the DataGraft portal, Grafterizer and a cloud-enabled semantic graph database-as-a-service (as shown in the picture below), which is based on a dedicated instance of the Ontotext GraphDB Cloud platform mentioned in a previous blog post.

Main components of DataGraft platform

What’s new?

DataGraft has undergone major changes since the previous version:

  • New asset types for the catalogue and better sharing between users of the platform
    • SPARQL endpoints
    • queries
    • file pages
  • Improved Grafterizer capabilities
    • conditional RDF mappings
    • support for various types and formats of tabular inputs
  • Versioning of assets
    • browsing
    • recording of provenance when copying assets
  • Visual browsing of SPARQL endpoints (using RDF Surveyor)
  • New Dashboard
    • more control of user assets
    • instant search and various filters
  • Improved security (authentication) using OAuth2
  • REST API improvements using Swagger
  • Updated version of the semantic graph database, which now supports geospatial queries and serialisation to GeoJSON
  • Various bug fixes and performance improvements
  • Updated user documentation
  • Quota management console allowing users to track their use of resources on the platform
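
To illustrate how the SPARQL endpoint assets listed above might be consumed, here is a minimal sketch of a client that talks to an endpoint over the standard SPARQL 1.1 Protocol. The endpoint URL and the query are hypothetical placeholders, not actual DataGraft addresses or data, and the helper names are made up for this example. Only the protocol conventions (the `query` parameter and the `application/sparql-results+json` result format) come from the W3C specifications.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "https://example.org/sparql"

# Generic query that works against any RDF dataset.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into plain dictionaries."""
    doc = json.loads(payload)
    return [
        {var: binding[var]["value"] for var in binding}
        for binding in doc["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> list[dict[str, str]]:
    """Send the query and return the result rows (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return rows_from_results(response.read().decode("utf-8"))
```

Calling `run_query(ENDPOINT, QUERY)` against a real endpoint would return one dictionary per result row. Keeping request building and result parsing in separate functions makes both easy to check without network access.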

DataGraft beta 2 is available for testing at http://datagraft.io, and more details can be found in the platform documentation. All platform code except the GraphDB Cloud component (used as a service) is open source and is available on GitHub.

New paper: Tabular Data Anomaly Patterns

Sukhobok, N. Nikolov, and D. Roman. Tabular Data Anomaly Patterns. To appear in the proceedings of The 3rd International Conference on Big Data Innovations and Applications (Innovate-Data 2017), 21-23 August 2017, Prague, Czech Republic, IEEE.

  • Abstract: One essential and challenging task in data science is data cleaning — the process of identifying and eliminating data anomalies. Different data types, data domains, data acquisition methods, and final purposes of data cleaning have resulted in different approaches in defining data anomalies in the literature. This paper proposes and describes a set of basic data anomalies in the form of anomaly patterns commonly encountered in tabular data, independently of the data domain, data acquisition technique, or the purpose of data cleaning. This set of anomalies can serve as a valuable basis for developing and enhancing software products that provide general-purpose data cleaning facilities and can provide a basis for comparing different tools aimed to support tabular data cleaning capabilities. Furthermore, this paper introduces a set of corresponding data operations suitable for addressing the identified anomaly patterns and introduces Grafterizer — a software framework that implements those data operations.
  • Download paper

New Paper: Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach

Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach by S. Pozzati, D. Sanvito, C. Castelli, D. Roman, Territorio Italia 2 (2016), DOI: 10.14609/Ti_2_16_2e.

  • Abstract: The analysis and better understanding of the distribution of wealth of individuals in cities can be a precious tool, especially in support of the estimation of real estate values. These analyses can also be used to facilitate decision making in various sectors, such as public administration or the real estate market. In this paper, by making use of publicly available data and of data owned by Cerved (a credit scoring company in Italy), we can observe the territorial distribution of the properties of managers and shareholders – categories of people usually linked to high economic well-being – and, based on that, we identify the areas of the cities where the value of real estate properties is presumably higher. More specifically, we introduce the Manager and Shareholder Concentration (MSHC) score and validate its accuracy and effectiveness in three Italian cities (Turin, Rome and Milan).
  • Download paper

New DataGraft-related papers

DataGraft: One-Stop-Shop for Open Data Management by D. Roman, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. Berre, X. Ye, M. Dimitrov, A. Simov, M. Zarev, R. Moynihan, B. Roberts, I. Berlocher, S. Kim, T. Lee, A. Smith, and T. Heath. Semantic Web journal, 2016.

  • Abstract: This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better and easier to use tools for data workers and developers (e.g., open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g., interactive creation, execution, sharing, and reuse) and reliable data hosting services. This paper provides an overview of DataGraft focusing on the rationale, key features and components, and evaluation.
  • Download paper

DataGraft: Simplifying Open Data Publishing by D. Roman, M. Dimitrov, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. J. Berre, X. Ye, A. Simov and Y. Petkov. ESWC Demo paper. 2016.

  • Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.
  • Download paper

Tabular Data Cleaning and Linked Data Generation with Grafterizer by D. Sukhobok, N. Nikolov, A. Pultier, X. Ye, A. J. Berre, R. Moynihan, B. Roberts, B. Elvesæter, N. Mahasivam and D. Roman. ESWC Demo paper. 2016.

  • Abstract: Over the past several years the amount of published open data has increased significantly. The majority of this is tabular data, which requires powerful and flexible approaches for data cleaning and preparation in order to convert it into Linked Data. This paper introduces Grafterizer – a software framework developed to support data workers and data developers in the process of converting raw tabular data into linked data. Its main components include Grafter, a powerful software library and DSL for data cleaning and RDF-ization, and Grafterizer, a user interface for interactive specification of data transformations along with a back-end for management and execution of data transformations. The proposed demonstration will focus on Grafterizer’s powerful features for data cleaning and RDF-ization in a scenario using data about the risk of failure of transport infrastructure components due to natural hazards.
  • Download paper
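
The conversion of raw tabular data into Linked Data described in the abstract above can be pictured with a small sketch. This is not Grafter's DSL or Grafterizer's actual implementation. It is a plain-Python illustration of the general idea, in which the base URI, the column names, the sample rows and the function names are all hypothetical.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for the generated resources and properties.
BASE = "http://example.org/"

# Hypothetical sample input: one row per infrastructure component.
SAMPLE_CSV = "id,name,risk\n1,Bridge A,high\n2,Tunnel B,low\n"


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(text: str, key: str) -> list[str]:
    """Map each CSV row to one N-Triples statement per non-key column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}component/{quote(row[key].strip(), safe='')}>"
        for column, value in row.items():
            if column == key:
                continue
            cleaned = (value or "").strip()  # simple cleaning step before mapping
            if not cleaned:
                continue  # skip empty cells rather than emit empty literals
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cleaned)}" .')
    return triples


for line in rows_to_ntriples(SAMPLE_CSV, key="id"):
    print(line)
```

Each row becomes a resource identified by its key column, and every other non-empty cell becomes a plain string literal attached to that resource. A real transformation would typically also assign datatypes and reuse existing vocabularies.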

 

New paper: Towards a Reference Architecture for Trusted Data Marketplaces

Towards a Reference Architecture for Trusted Data Marketplaces by Dumitru Roman and Stefano Gatti. 2nd International Conference on Open and Big Data, 2016.

  • Abstract: Data sharing presents extensive opportunities and challenges in domains such as the public sector, health care and financial services. This paper introduces the concept of “trusted data marketplaces” as a mechanism for enabling trusted sharing of data. It takes credit scoring—an essential mechanism of the entire world-economic environment, determining access for companies and individuals to credit and the terms under which credit is provisioned—as an example for the realization of the trusted data marketplaces concept. This paper looks at credit scoring from a data perspective, analyzing current shortcomings in the use and sharing of data for credit scoring, and outlining a conceptual framework in terms of a trusted data marketplace to overcome the identified shortcomings. The contribution of this paper is two-fold: (1) identify and discuss the core data issues that hinder innovation in credit scoring; (2) propose a conceptual architecture for trusted data marketplaces for credit scoring in order to serve as a reference architecture for the implementation of future credit scoring systems. The architecture is generic and can be adopted in other domains where data sharing is of high relevance.
  • Download paper

Recent proDataMarket presentations