D. Sukhobok, N. Nikolov, and D. Roman. Tabular Data Anomaly Patterns. To appear in the proceedings of The 3rd International Conference on Big Data Innovations and Applications (Innovate-Data 2017), 21-23 August 2017, Prague, Czech Republic, IEEE.
Abstract: One essential and challenging task in data science is data cleaning — the process of identifying and eliminating data anomalies. Different data types, data domains, data acquisition methods, and final purposes of data cleaning have resulted in different approaches in defining data anomalies in the literature. This paper proposes and describes a set of basic data anomalies in the form of anomaly patterns commonly encountered in tabular data, independently of the data domain, data acquisition technique, or the purpose of data cleaning. This set of anomalies can serve as a valuable basis for developing and enhancing software products that provide general-purpose data cleaning facilities and can provide a basis for comparing different tools aimed to support tabular data cleaning capabilities. Furthermore, this paper introduces a set of corresponding data operations suitable for addressing the identified anomaly patterns and introduces Grafterizer — a software framework that implements those data operations.
We at Ontotext are excited to announce GraphDB Cloud – an easy way to get started with a semantic database such as our signature GraphDB product. The automated tasks in GraphDB Cloud save organizations the time and effort of installing and managing hardware and software, as well as the cost of buying them. Compared to a do-it-yourself database, DBaaS lets developers cut down the time they spend working with their databases, so they can devote it to creating and innovating instead of administering.
GraphDB Cloud is one part of the Cognitive Cloud solutions for low-cost and on-demand smart data management.
Typical users fit the following profiles:
Small cognitive-technology-oriented teams in big organizations that need low upfront and ongoing costs for a database.
Start-up companies without a database infrastructure that require a reliable technology that scales up along with their business.
Corporate solution architects working to solve the challenges their enterprise faces when handling huge amounts of data and information.
Thanks to the collaboration between Cerved, SINTEF and “Territorio Italia”, a paper presenting a new score developed by Cerved has been published. “Territorio Italia” is an open access, peer-reviewed scientific journal focused on territorial and geographic topics; it is edited by Agenzia delle Entrate, the Italian revenue authority.
The paper has been announced in the previous blog post. In this post we highlight the main results, the Manager and Shareholders Concentration score and its application to the cities of Turin, Milan and Rome.
Manager and Shareholders Concentration (MSHC) score
The paper introduces the “Manager and Shareholders Concentration (MSHC) score” – an index created with the aim of identifying the wealthiest areas within a certain municipality. This is of particular interest for the real estate market, especially when there are several wealthy areas within the same city. The paper thus introduces the index and demonstrates how it can correctly identify the areas with high real estate values within a city, even when they are located far from the city centre.
The approach proposed in the paper aims to directly observe the distribution of the properties of the wealthiest citizens, who usually choose to move to and live in the most prestigious areas. While this phenomenon can be observed in many cities around the world, in Italy it is particularly evident in Turin: although the city is endowed with a fascinating centre, many of its most important buildings are located on the hills far from the centre. The crucial question is how to determine which sample of citizens to select and qualify as managers or, more generally, wealthy people. To do this, we used Cerved’s proprietary database – a database containing public data on all Italian companies – to extract information about individuals recognized as shareholders and/or managers. In the context of this work, a shareholder is anyone who owns more than 25% of a company’s share capital, while a manager is anyone who holds a key position within a company, carries out management duties, and is legally liable for the company’s debts. The basic idea behind the MSHC score is to count the properties of managers and shareholders in each geographic area and compare that count with the total number of residents in the same area. The result can be visualized immediately on thematic maps; for example, plotting the score on a map of Turin shows that the two most relevant areas are the centre and the hill on the eastern side of the city.
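The basic idea can be sketched as a simple ratio per area. The post does not give the paper's exact formula or any normalisation, so the function name, the input layout and the toy figures below are illustrative assumptions, not Cerved's implementation.

```python
from collections import Counter


def mshc_scores(property_areas, residents_by_area):
    """Return manager/shareholder properties per resident for each area.

    property_areas: iterable of area ids, one entry per property owned by a
        manager or shareholder (hypothetical input layout).
    residents_by_area: dict mapping area id -> number of residents.
    """
    counts = Counter(property_areas)
    scores = {}
    for area, residents in residents_by_area.items():
        if residents <= 0:
            continue  # skip areas without residents to avoid dividing by zero
        scores[area] = counts.get(area, 0) / residents
    return scores


# Toy data, invented purely for illustration
properties = ["centre", "centre", "hill", "hill", "hill", "north"]
residents = {"centre": 1000, "hill": 500, "north": 2000}
scores = mshc_scores(properties, residents)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the areas by score gives the ordering that a thematic map would show as darker or lighter colours.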
The territorial distribution of the MSHC score can be easily observed through a heat map. On the maps, darker colours correspond to higher scores, while lighter colours are associated with lower scores. Heat maps also allow the territorial distribution of real estate values to be easily compared, in order to verify whether there is a correlation between prices and scores. For the city of Turin, it was possible to analyse the correlation between the MSHC score and the asking prices for real estate provided by Osservatorio Immobiliare della Città di Torino – OICT (Turin Real Estate Market Observatory), in comparison with their territorial distribution. For the cities of Rome and Milan, the comparison between the MSHC score and real estate values was made using the values published by Osservatorio del Mercato Immobiliare (OMI) of Agenzia delle Entrate, an important reference for the real estate market at the national level.
The score shows high values in the city centre, the hill, and the micro-areas on the western side of the city, while it correctly identifies the south and north areas of the city as less prestigious. This result confirms that the score can also be considered a valuable tool for predicting values on the real estate market.
Figure 1 Territorial distribution of the MSHC score in the city of Turin. The MSHC score is displayed on the map, associating a darker colour with higher scores and brighter colours with lower scores
The second city chosen to analyse the MSHC score is Rome, a very complex city due to the vastness of its municipal area, which is not comparable to that of any other Italian metropolis, as well as to the particular characteristics of some specific areas, namely the proximity to the city-state of the Vatican, the large number of historical and cultural points of interest, and access to the sea.
The size of the Italian capital does not allow the distribution to be observed in detail, but it may be noted that there are several high-scoring areas: some correspond to established high-value neighbourhoods, while others can be defined as emerging neighbourhoods owing to the presence of underground lines and public transit.
Figure 2 Territorial distribution of the MSHC score in the city of Rome. The MSHC score is visualised on the map by associating a darker colour with higher scores, and brighter colours with lower scores
The third city used to analyse the MSHC score was Milan – a city that has experienced major changes in recent years. Milan has seen the development of new neighbourhoods and skyscrapers, a universal exposition (EXPO), and a new underground line (with another under development) after years of inactivity. The highest MSHC score is found in the centre of the city, while in the suburbs not many neighbourhoods are identified as particularly wealthy.
Figure 3 Territorial distribution of the MSHC score in the city of Milan. The MSHC score is visualised on the map by associating a darker colour with higher scores, and brighter colours with lower scores
The MSHC score illustrated in the paper provides an interesting index that may be used to better understand where the richest segments of the population live, and consequently to identify the areas of the city with the highest real estate values. Although this score alone is not enough to support the valuation of real estate properties, together with other indicators under development at Cerved (for real estate valuation) it represents an excellent starting point. For a more in-depth analysis, and to see how strongly the score correlates with housing prices, please have a look at the entire paper and the complete results.
 Stefano Pozzati, Diego Sanvito, Claudio Castelli, Dumitru Roman. Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach. Territorio Italia 2 (2016), DOI: 10.14609/Ti_2_16_2e
Combining Sentinel-2 and LiDAR data for objective and automated identification of agricultural parcel features by Jesús Estrada, Héctor Sanchez, Lorena Hernanz, María José Checa and Dumitru Roman
This new proDataMarket paper explains how a comprehensive strategy combining remote sensing and field data can be helpful for more effective agriculture management. Satellite data are suitable for monitoring large areas over time, while LiDAR provides specific and accurate data on height and relief. Both types of data can be used for calibration and validation purposes, avoiding field visits and saving useful resources. In this paper we propose a process for objective and automated identification of agricultural parcel features based on processing and combining Sentinel2 data (to sense different types of irrigation patterns) and LiDAR data (to detect landscape elements). The proposed process was validated in several use cases in Spain, yielding high accuracy rates in the identification of parcel features. An important application example of the work reported in this paper is the European Union (EU) Common Agriculture Policy (CAP) funds assignment service, which would significantly benefit from a more objective and automated process for identification of agricultural parcel features, thereby enabling the possibility for the EU to save significant amounts of money yearly.
Although some issues regarding the generation and improvement of agricultural property datasets were already explained in our previous blog entry (Data workflow in CAPAS), this paper highlights the current results of generation and usage of this new information.
Irrigation patterns map, obtained using Sentinel-2 Process
The main result of this analysis is that external, and usually underused, data sources offer a powerful and accurate tool for generating new contrast and validation data for the information used by the Spanish CAP Payment Agency, in order to provide a better service to landowners and farmers. In conclusion, the Sentinel-2 series and LiDAR can help to detect areas that are not eligible for grant assignment, support cross-checks, and serve as a tool for choosing field samples.
Understanding territorial distribution of Properties of Managers and Shareholders: a Data-driven Approach by S. Pozzati, D. Sanvito, C. Castelli, D. Roman, Territorio Italia 2 (2016), DOI: 10.14609/Ti_2_16_2e.
Abstract: The analysis and better understanding of the distribution of wealth of individuals in cities can be a precious tool, especially in support of the estimation of real estate values. These analyses can also be used to facilitate decision making in various sectors, such as public administration or the real estate market. In this paper, by making use of publicly available data and of data owned by Cerved, (a credit scoring company in Italy), we can observe the territorial distribution of the properties of managers and shareholders – categories of people usually linked to high economic well-being – and, based on that, we identify the areas of the cities where the value of real estate properties is presumably higher. More specifically, we introduce the Manager and Shareholder Concentration (MSHC) score and validate its accuracy and effectiveness in three Italian cities (Turin, Rome and Milan).
Real property data (often referred to as real estate, realty, or immovable property data) represent a valuable asset that has the potential to enable innovative services when integrated with related contextual data (e.g., business data). Such services can range from providing evaluation of real estate to reporting on up-to-date information about state-owned properties. Real property data integration is a difficult task primarily due to the heterogeneity and complexity of the real property data, and the lack of generally agreed upon semantic descriptions of the concepts in this domain. The proDataMarket ontology is developed in the project as a key enabler for integration of real property data.
The proDataMarket ontology design and development process followed techniques and design choices supported by existing methodologies, mainly the one proposed by Noy and McGuinness. Requirements were extracted from a set of relevant business cases, and competency questions (in the sense of Grüninger and Fox) were defined for each business case, as were core concepts and relationships. A conceptual model was then developed based on these requirements and on international standards, including ISO 19152:2012 and the European Union’s INSPIRE data specifications. For example, the LADM conceptual model from ISO 19152:2012 is used as the reference model for the proDataMarket cadastral domain conceptual model. Afterwards, we implemented the conceptual model using the RDFS and OWL standards. RDFS is used to model concepts, properties and simple relationships such as rdfs:subClassOf. OWL builds upon RDFS and provides a richer language for web ontology modelling; it is used to model constraints and other advanced relationships, such as the cardinality constraint needed to express the relationship between properties and buildings.
The proDataMarket ontology can be accessed at http://vocabs.datagraft.net/proDataMarket/. The ontology has been divided into several sub-ontologies (see Table below), reflecting the cross-domain nature of the requirements. This modular approach also helped to handle the complexity of the model and made it easier to maintain. In the current version, there are 11 sub-ontologies with 43 native classes and 43 native properties.
More than 30 datasets have been published through the DataGraft platform using the proDataMarket ontology as a central reference model. All seven business cases use the proDataMarket ontology in data publishing.
 Noy, Natalya F., and Deborah L. McGuinness. “Ontology development 101: A guide to creating your first ontology.” (2001).
 Grüninger, Michael, and Mark S. Fox. “Methodology for the Design and Evaluation of Ontologies.” (1995).
 Roman, D., et al. DataGraft: One-Stop-Shop for Open Data Management. 2017. Semantic Web, vol. Preprint, no. Preprint, pp. 1-19, 2017. DOI: 10.3233/SW-170263.
 Roman, D., et al. DataGraft: Simplifying Open Data Publishing. ESWC (Satellite Events) 2016: 101-106.
 L. Shi, N. Nikolov, D. Sukhobok, T. Tarasova and D. Roman. “The proDataMarket Ontology for Publishing and Integrating Cross-domain Real Property Data”. To appear in the journal “Territorio Italia Land Administration, Cadastre and Real Estate”. n.2/2017.
During a project meeting in Sofia on September 21, 2016, Cerved teamed up with TRAGSA to brainstorm ideas of re-using the TRAGSA methods for processing satellite imagery to analyse green areas in urbanized cities.
Fundamentals of Tragsa Processing
A common feature of vegetation spectra is the high contrast between the red band and the near-infrared (NIR) region. The optical instrument carried by the Sentinel 2 satellites samples 13 spectral bands, including a high-resolution red band (band 4), red-edge bands (5, 6 and 7) and NIR bands (8 and 8A). Refer to this blog post for more details about processing Sentinel 2 data.
Using the TRAGSA methodology, it is possible to isolate and enhance the vegetation in order to locate green areas within urban areas. Green areas are an important input to Cerved’s innovative real estate evaluation model (which is being developed within one of Cerved’s business cases in the project, as introduced in this blog post). Cerved uses open data to generate the green-area indicators defined for the model: green area coverage and distance to the wood. The operations that Cerved performs to compute these indicators are similar to those that TRAGSA performs on satellite data, such as clustering green areas into larger areas and isolating trees and groups of trees. This motivated us to experiment with satellite data and TRAGSA’s methodology, to see whether we could use a more complete, structured and up-to-date source of green-area information as input to our real estate evaluation model.
For the experiment we chose Turin, a highly urbanized Italian city that nonetheless pays particular attention to green areas.
We followed these steps:
extraction of the city boundaries of Turin in GeoJSON format, by SPAZIODATI
selection of good-quality imagery for Turin from the Sentinel data repository, by TRAGSA
processing of the S2 imagery to obtain a layer that indicates the presence or absence of a green area in each pixel (1/0), by TRAGSA
display of the green areas of the tiles (see the screenshot below) in the prototype Amerigo visualisation service, under development by SPAZIODATI
data processing and aggregation of the tiles into census cell areas, in order to develop green-area indicators for each census cell, by CERVED
integration and testing of the score dedicated to green areas within the business model CCRS (Cerved Cadastral Report Service), by CERVED
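The aggregation step can be illustrated with a minimal sketch that turns per-pixel green flags into a coverage fraction per census cell. It assumes each pixel has already been assigned to a census cell (for example by a spatial join done elsewhere); the function name, input layout and sample values are hypothetical and do not describe Cerved's actual pipeline.

```python
from collections import defaultdict


def green_coverage(pixels):
    """Compute the fraction of green pixels per census cell.

    pixels: iterable of (census_cell_id, is_green) pairs, where is_green is
        1 for a green pixel and 0 otherwise (assumed input layout).
    """
    total = defaultdict(int)
    green = defaultdict(int)
    for cell_id, is_green in pixels:
        total[cell_id] += 1
        green[cell_id] += is_green
    return {cell: green[cell] / total[cell] for cell in total}


# Toy data, invented purely for illustration
sample = [("A", 1), ("A", 0), ("A", 1), ("A", 1), ("B", 0), ("B", 0)]
coverage = green_coverage(sample)
```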
The result of this experiment was surprising: the detail and accuracy of this new score in identifying green areas (not only public green areas) are far greater than those of the other scores, which were developed on public and open green-area datasets.
DataGraft: One-Stop-Shop for Open Data Management by D. Roman, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A. Berre, X. Ye, M. Dimitrov, A. Simov, M. Zarev, R. Moynihan, B. Roberts, I. Berlocher, S. Kim, T. Lee, A. Smith, and T. Heath. Semantic Web journal, 2016.
Abstract: This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better and easier to use tools for data workers and developers (e.g., open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g., interactive creation, execution, sharing, and reuse) and reliable data hosting services. This paper provides an overview of DataGraft focusing on the rationale, key features and components, and evaluation.
DataGraft: Simplifying Open Data Publishing by D. Roman, M. Dimitrov, N. Nikolov, A. Pultier, D. Sukhobok, B. Elvesæter, A.J. Berre, X. Ye, A. Simov and Y. Petkov. ESWC Demo paper. 2016.
Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.
Tabular Data Cleaning and Linked Data Generation with Grafterizer by D. Sukhobok, N. Nikolov, A. Pultier, X. Ye, A.J. Berre, R. Moynihan, B. Roberts, B. Elvesæter, N. Mahasivam and D. Roman. ESWC Demo paper. 2016.
Abstract: Over the past several years the amount of published open data has increased significantly. The majority of this is tabular data, that requires powerful and flexible approaches for data cleaning and preparation in order to convert it into Linked Data. This paper introduces Grafterizer – a software framework developed to support data workers and data developers in the process of converting raw tabular data into linked data. Its main components include Grafter, a powerful software library and DSL for data cleaning and RDF-ization, and Grafterizer, a user interface for interactive specification of data transformations along with a back-end for management and execution of data transformations. The proposed demonstration will focus on Grafterizer’s powerful features for data cleaning and RDF-ization in a scenario using data about the risk of failure of transport infrastructure components due to natural hazards.
TRAGSA, as a business case provider in the project, is developing the CAPAS service which aims at publishing and integrating multi-sectorial data from several sources into an existing data-intensive service, targeting better Common Agriculture Policy (CAP) funds assignments to farmers and land owners. The goal is to leverage the data integration facilities offered by proDataMarket, to better define the funds assignments features in parcels and subplots.
CAPAS is working on an improvement of the efficiency and competitiveness of the existing Spanish CAP (Common Agriculture Policy) service by integrating more datasets, underused at the beginning of the proDataMarket project. To use them as a powerful tool, it was necessary to create and develop new data processing algorithms. Therefore, CAPAS is not only an end-user application. Indeed, it involves data collection, data modelling and data processing techniques.
The CAPAS Business Case is oriented towards the replacement of human-generated (subjective) data with more objective data that can be collected and integrated from different cross-sectorial sources in an automated way.
At least two external datasets (LIDAR and Copernicus SENTINEL2) are being used to improve the Spanish agricultural cadastre database. The economic value generated by this process and its relation to CAP funds assignment will be evaluated during the next year, in the final phase of the project.
Managing LIDAR data
LIDAR files are a collection of points stored as x, y, z which represent longitude, latitude, and elevation, respectively. This data is hard to process for non-specialists. To use them as a powerful tool to define objectively the parameters of agricultural use of parcels and the presence of landscape elements, a new data processing and treatment algorithm has been created.
This algorithm classifies and groups the cloud of points in order to simplify the huge amount of data. The clouds of points are topologically processed to obtain connected areas as polygons or to maintain them as single points. As a result, LIDAR datasets are transformed into new raster and vector files, which are more widely used data types and easier to work with. The overlaps and intersections of the new datasets produced (as landscape elements) will define the CAP parameters for a specific subplot or parcel.
Managing Satellite data
The Sentinels are a fleet of satellites designed specifically to deliver the wealth of data and imagery that are fundamental to the European Commission’s Copernicus program. The use of satellite images in CAPAS has already been explained in this blog entry.
Description of the source datasets and result dataset
The main source datasets of Business Case CAPAS and main processes used to obtain output datasets are explained below:
LIDAR files are available in two formats: .las and .laz. The LAS file format is a public file format commonly used to exchange 3-dimensional point cloud data between data users; LAS is simply an abbreviation of LASER. Because LAS files are large, the LAZ format provides a compressed version of LAS.
Although developed primarily for exchange of LIDAR point cloud data, LAS format supports the exchange of any 3-dimensional x,y,z tuples. This format maintains information specific to the LIDAR nature of the data while not being overly complex.
In the context of the proDataMarket project, the LAS files used in the CAPAS business case will just be a collection of points (latitude, longitude, elevation).
The information to be used in the CAPAS business case is the image data (JPEG2000) provided by Copernicus at the Sentinels Scientific Data Hub (https://scihub.copernicus.eu/). A description of the JPEG2000 format is beyond the scope of this blog entry, but some general ideas will be described.
The following data workflow, as shown in the diagram below, illustrates the evolution of the different datasets, their transformations and their integration to generate the final result datasets.
The Grouping process gathers the LIDAR points using the following rules:
Errors, noise and overlaps are not taken into account (Classifications 1, 4, 7 and 12). As a consequence, more than 50% of points are removed from the process.
Soil, water and buildings have their own groups.
Classification 19 is considered as short trees.
Classification 20 is considered as medium trees.
Classifications 21 and 22 are grouped as tall trees.
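The grouping rules above can be sketched as a lookup on the classification code of each point. The discarded codes and the tree codes come from the rules; the codes used here for soil, water and buildings are hypothetical, since the post does not state them, and ignoring any other code is likewise an assumption of this sketch.

```python
from collections import defaultdict

# Codes taken from the rules above
DISCARDED = {1, 4, 7, 12}  # errors, noise and overlaps
TREE_GROUPS = {
    19: "short trees",
    20: "medium trees",
    21: "tall trees",
    22: "tall trees",
}
# Hypothetical codes: the post does not say which codes mark these groups
OTHER_GROUPS = {2: "soil", 6: "buildings", 9: "water"}


def group_points(points):
    """Group (x, y, z, classification) points by the rules above."""
    groups = defaultdict(list)
    for x, y, z, code in points:
        if code in DISCARDED:
            continue
        label = TREE_GROUPS.get(code) or OTHER_GROUPS.get(code)
        if label is None:
            continue  # assumption: codes not covered by any rule are ignored
        groups[label].append((x, y, z))
    return dict(groups)


# Toy points, invented purely for illustration
sample = [
    (0.0, 0.0, 1.0, 1),
    (1.0, 0.0, 2.5, 19),
    (1.0, 1.0, 9.0, 21),
    (2.0, 1.0, 12.0, 22),
    (3.0, 3.0, 0.2, 2),
]
grouped = group_points(sample)
```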
The result of this process is still a LAS file. The following image shows how LIDAR points (green points) have been processed and classified (Green points as trees, red points as soil, orange and yellow as bushes).
The next steps, such as Rasterization or Vectorization, involve topological rules in order to group the points to generate squares (raster) that would be processed to obtain the final vector shapefile.
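As a rough illustration of rasterization, the sketch below bins grouped points into square cells and labels each cell with its most frequent group. This is a simplified stand-in, not the topological rules used in CAPAS: the cell size, the majority rule and all names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def rasterize(points, cell_size):
    """Assign each square cell the most frequent group among its points.

    points: iterable of (x, y, group) tuples (assumed input layout).
    cell_size: side length of a raster square, in the same units as x and y.
    Returns a dict mapping (column, row) -> group label.
    """
    votes = defaultdict(Counter)
    for x, y, group in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        votes[cell][group] += 1
    return {cell: counter.most_common(1)[0][0] for cell, counter in votes.items()}


# Toy points, invented purely for illustration
points = [
    (0.5, 0.5, "trees"),
    (1.2, 0.3, "trees"),
    (1.9, 1.9, "soil"),
    (2.5, 0.1, "soil"),
]
raster = rasterize(points, cell_size=2.0)
```

Adjacent cells with the same label could then be merged into polygons in the vectorization step.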
The following image shows how LIDAR points have been grouped to create topologically connected surfaces. In the image below, yellow areas are Soil, orange are Bushes, green are Trees. Grey areas and blue surfaces (not present in this image) are Buildings and Water, respectively.
Once the tree class has been defined in raster format from the LiDAR data, it is refined using Sentinel data, which is more up to date. The NDVI product identifies the pixels with an NDVI value over 0.5, and the RGB product is used to check which of those pixels represent vegetation areas.
Finally, the auxiliary tree layer refined with Sentinel data is processed to obtain different configurations:
The final result of the process is a vector ESRI shapefile, where the copses layer is a polygon feature type and the isolated trees layer is a point feature type. All of them have a direct correspondence with the landscape elements.
The overlaps between the detected landscape elements, the currently protected sites of the Natura 2000 network and the Land Parcel Identification System allow an accurate ecological value report to be produced for Spanish crop areas.
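The NDVI part of this refinement can be sketched as follows. NDVI is computed with its standard definition, (NIR - Red) / (NIR + Red), and a LiDAR tree pixel is kept only when its NDVI exceeds 0.5. The list-of-lists band layout and the function names are assumptions made for the example, and the visual check against the RGB product is not modelled.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat pixels with no signal as non-vegetated
    return (nir - red) / total


def refine_tree_mask(tree_mask, nir_band, red_band, threshold=0.5):
    """Keep a LiDAR tree pixel only if its NDVI is above the threshold."""
    return [
        [bool(t) and ndvi(n, r) > threshold for t, n, r in zip(t_row, n_row, r_row)]
        for t_row, n_row, r_row in zip(tree_mask, nir_band, red_band)
    ]


# Toy 2x2 rasters, invented purely for illustration
tree_mask = [[1, 1], [0, 1]]
nir_band = [[0.8, 0.3], [0.9, 0.6]]
red_band = [[0.1, 0.25], [0.1, 0.1]]
refined = refine_tree_mask(tree_mask, nir_band, red_band)
```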
The LiDAR algorithm provides more detailed information, because the landscape value helps to identify which subplots have more value within each parcel. This brings the following benefits:
Farmers will gain an economic benefit through fund assignments for maintaining these tree formations, and
the ecosystem and its species will be preserved.
This ecological value report is based on the following queries:
Query 1: Surface of Sites of Community Importance (LIC) / subplot area. Score between 0 and 1.
Query 2: Surface of Special Protected Areas for Birds (ZEPA) / subplot area. Score between 0 and 1.
Query 3: Protected Sites Value = Sum of query 1 + query 2. Score between 0 and 2.
Query 4: Number of isolated trees / subplot area. Score between 0 and 1.
Query 5: Surface of copses area / subplot area. Score between 0 and 1.
Query 6: Landscape Elements Value = Sum of query 4 + query 5. Score between 0 and 2.
Query 7: Ecological Value = Sum of query 3 + Query 6.
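The queries can be combined in a few lines. The sketch below assumes that the Landscape Elements Value (query 6) sums the two landscape-element ratios, queries 4 and 5, and that each ratio is clamped to the stated 0 to 1 range; the function names, units and sample figures are hypothetical.

```python
def clamp01(value):
    """Limit a ratio to the 0..1 range stated for the individual queries."""
    return max(0.0, min(1.0, value))


def ecological_value(subplot_area, lic_area, zepa_area, isolated_trees, copses_area):
    """Compute queries 1 to 7 for one subplot (areas in the same unit)."""
    q1 = clamp01(lic_area / subplot_area)
    q2 = clamp01(zepa_area / subplot_area)
    q3 = q1 + q2  # Protected Sites Value
    q4 = clamp01(isolated_trees / subplot_area)
    q5 = clamp01(copses_area / subplot_area)
    q6 = q4 + q5  # Landscape Elements Value (assumed: query 4 + query 5)
    q7 = q3 + q6  # Ecological Value
    return {"q1": q1, "q2": q2, "q3": q3, "q4": q4, "q5": q5, "q6": q6, "q7": q7}


# Toy subplot, invented purely for illustration (areas in square metres)
report = ecological_value(
    subplot_area=10000.0,
    lic_area=5000.0,
    zepa_area=2500.0,
    isolated_trees=20,
    copses_area=1000.0,
)
```

Under these assumptions the final value ranges from 0 to 4.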
Sentinel Products generation
In the first place, Sentinel 2 (S2) imagery has to be downloaded from the ESA server. In the automatic download process developed, selection parameters were incorporated in order to download only the imagery that satisfies our quality criteria. Two kinds of products are generated from S2 imagery.
Simple products: Those which have been generated with one-date imagery. By an automatic process, TRAGSA is generating RGB products for supporting photo interpretation. Another simple product generated is the Normalized Difference Vegetation Index (NDVI) which is widely used for vegetation monitoring.
Complex products: Those which are generated with imagery from different dates. The following four thematic layers are going to be created.
Permanent grassland: This layer will be useful to distinguish photosynthetically active vegetation from non-active (unproductive or bare soil) areas. It will therefore help to monitor the maintenance of existing permanent grassland, which is an agricultural practice beneficial for the climate and the environment (REGULATION (EU) No 1307/2013).
Herbaceous and woody crops: By using decision algorithms, different crops can be identified. The results will be displayed in two different layers, one for herbaceous crops and the other for woody crops.
Change detection layer: This layer will highlight areas where changes have happened. The layer will be focused on forests and grassland areas in order to detect dramatic changes, such as those caused by logging or forest fires, as well as to detect more subtle changes associated with AIS (Alien Invasive Species), diseases and reforestation.
So far, only one of the twin S2 satellites (Sentinel 2A) has been launched. When the second satellite (Sentinel 2B) is in orbit, the revisit time at the equator will be 5 days, which results in 2-3 days at mid latitudes. This short revisit time will allow the SigPAC database to be updated more quickly than at present, when updates are based on low-precision data (LANDSAT and SPOT5 satellites) or orthophoto flights generated by each Autonomous Community.
As stated previously, Common Agriculture Policy funds Assignments Service (CAPAS) is a set of tools that improves the existing Common Agriculture Policy service (CAP), in order to innovatively manage and upgrade the CAP database provided by Spanish Administration to farmers and land owners. It is important to note that this CAP database is one of the main pillars of the CAP funds calculation systems. As mentioned earlier, the improvement process is based on the leverage of new cross-sectorial data sources from different fields and geographical areas, and the result datasets will be also available at the proDataMarket marketplace.
To use these new datasets as a powerful tool for objectively defining the parameters of agricultural use of parcels, the presence of landscape elements or the temporal evolution of crops, the data processing and treatment algorithms explained above have so far been partially developed.
In summary, the usage of LIDAR files modifies some parcel and subplot features, and SENTINEL images will improve the definition of parcel and subplot land use and its temporal evolution.
The new datasets produced by CAPAS using those external sources will be RDFized and incorporated into the proDataMarket platform. Spanish rural property data, improved using new and underexploited datasets, will therefore be accessible through the proDataMarket platform, which provides users with advanced visualization and querying features.
 JPEG 2000 (JP2) is an image compression standard and coding system. It was created by the Joint Photographic Experts Group committee in 2000
The SIM application (Subterranean Infrastructure Map App and Service) is developed to ease construction and digging projects by visualizing underground infrastructure with augmented reality.
Augmented reality (AR) is a live direct or indirect view of a physical, real-world environment whose elements are augmented (or supplemented) by computer-generated sensory input such as sound, video, graphics or GPS data. The application EVRY is developing uses augmented reality technology to present cadastral data distributed by the proDataMarket platform. With a connection to proDataMarket, SIM downloads the subterranean infrastructure data that exists at the user's location. This data is then used to visualize the underground grid of pipes and cables, as well as to give information about each pipe or cable. If there are a lot of pipes in a given area, there could be too much information to augment at one time. The user can then filter out pipe groups (such as water, sewage, electricity) to get a more relevant view.
Relevant information could be a pipe's depth, the pipe's owner, and the age and material of the pipe. An issue with data like this is that it is often private. Data are also often owned by different actors, and a challenge is to give them an incentive to share their data.
One of the major technical challenges the development team has been facing is the lack of accuracy on mobile devices. The GPS receivers and built-in compasses on mobile devices are not accurate enough to give an exactly correct representation of the pipe grid. It is possible, however, to increase the GPS accuracy by using an external GPS receiver. But even when the GPS position is correct, a small error in the heading will still create unwanted results. In addition to positioning, another challenge is the data quality in a given area. To create a good augmented reality experience, the framework needs to know the height above mean sea level, and this information is not always present in the data set.
To address these challenges, SIM has a calibration functionality that can “move” the pipe grid according to a given heading. It also calls the “Google Elevation Service” to get the pipe grid's height, so that it does not rely on elevation data in the data set. If the augmented experience is still not sufficient, SIM also includes a 2D map so the user can get an overview of the pipe grid.
If the user for some reason does not want to use the device camera (e.g. poor lighting conditions, a broken lens etc.) or does not want to relocate to see the pipe grid, a Google Street View module is also implemented. This is a regular Google Street View, with the pipe grid integrated so the user can stay at one location and see the pipe grid at another location.
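One way to picture such a heading calibration is as a rotation of the pipe grid around the user's position. The sketch below is an illustration only, not SIM's implementation: it assumes pipe coordinates are already expressed as local (east, north) offsets in metres, and the function name and sample values are hypothetical.

```python
import math


def rotate_grid(points, heading_offset_deg, origin=(0.0, 0.0)):
    """Rotate (east, north) points clockwise around origin.

    heading_offset_deg follows compass convention: a positive value turns
    the grid clockwise, so a point due north moves towards the east.
    """
    theta = math.radians(heading_offset_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    ox, oy = origin
    rotated = []
    for east, north in points:
        dx, dy = east - ox, north - oy
        rotated.append((ox + dx * cos_t + dy * sin_t, oy - dx * sin_t + dy * cos_t))
    return rotated


# A pipe vertex 10 m due north of the user, corrected by 90 and by 5 degrees
quarter_turn = rotate_grid([(0.0, 10.0)], 90.0)
small_fix = rotate_grid([(0.0, 10.0)], 5.0)
```

With these toy values, a 5 degree correction moves a vertex 10 m away by a little under one metre, which shows why even a small heading error is visible in the augmented view.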