A Sai Venkata Lakshmi
Advanced Data Processing
Research Institute (ADRIN),
J Sai Baba
Group Director, SDAA, ADRIN
Head, Hydrogeology division,
NRSC, Secunderabad, India
TV Subba Rao
Head, Applications Division,
A number of applications in sustainable development, disaster management and urban planning necessitate integration of data from different sources. It is often difficult to make necessary and timely decisions because information created by multiple sources uses different vocabularies, taxonomies and content standards.
HETEROGENEITIES IN SPATIAL DATA
Large volumes of geospatial data have been created under different national projects, including the National Natural Resource Management System (NNRMS), the National (Natural) Resource Information System (NRIS) and the Rajiv Gandhi National Drinking Water Mission (RGNDWM). Project purpose, study area, themes, scales and formats differ, along with vocabularies, taxonomies and representations. While solutions for handling syntactic differences across spatial data sources have emerged, schematic and semantic heterogeneities have received less attention. Driven by practical challenges, the need for collaboration among data creators and the need to integrate data from multiple sources, the focus of interoperability research is moving to models and technologies for seamless, automatic, on-demand spatial data integration and modelling.
Data formats and vocabularies
Geospatial data sets exist in different formats, such as shapefiles, coverages and geodatabases, and must be linked with other data. Vocabularies differ across organisations and are sometimes not quantitative, making integration and automatic processing difficult. For example, volcanic rocks may also be called igneous extrusive rocks.
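As a minimal sketch of how such vocabulary differences might be reconciled, the snippet below maps variant terms to one preferred term before comparing class labels. The synonym table and the function names (normalise_term, same_concept) are illustrative assumptions, not part of the system described in this article.

```python
# Hypothetical synonym table: each variant term maps to one preferred term.
SYNONYMS = {
    "volcanic rock": "igneous extrusive rock",
    "volcanic rocks": "igneous extrusive rock",
    "igneous extrusive rocks": "igneous extrusive rock",
}


def normalise_term(term: str) -> str:
    """Return the preferred term for a label, ignoring case and extra spaces."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """Two labels denote the same concept if they share a preferred term."""
    return normalise_term(a) == normalise_term(b)
```

With such a table, same_concept("Volcanic rocks", "Igneous extrusive rocks") holds even though the two labels share no words.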
Syntax, attribute names and values
A class can be represented differently in multiple projects. For example, a single landform “Structural Hill” in geomorphology has the following representations in different projects:
A single class may mean different things to different people. “Dense forest” and “large structural hill” are ambiguous natural-language terms (Table 1). This information must be specified explicitly if the data is to be used correctly and automatically for on-demand integration and modelling.
Different classification schemes may be adopted based on the project purpose. Some standardisation has happened over the years, such as the NRIS and, more recently, the NNRMS standards. However, depending on the area of study and the scale, deviations are seen: lower-level classes are ignored, additional classes are included and terms are used ambiguously.
An architecture that handles interoperability and integration at multiple levels while giving good performance is suggested. Common formats like XML and GML improve interoperability but are inefficient for retrieval and processing of large datasets.
A hybrid approach is used in this paper where data itself resides in its original form and metadata with ontologies are used to achieve interoperability.
A knowledge base of domain and task-specific information is created in a machine-processable form. Data-specific and project-specific information is stored in data ontologies. By dynamically linking the data ontologies with the knowledge base, user-specific information is generated. By providing the results through Web-enabled services such as WMS and WFS, interoperability, integration and performance are achieved.
Web Ontology Language (OWL)
Ontologies are used to capture knowledge about some domain of interest. An ontology describes the concepts in the domain and the relationships between them. OWL is the most recent development in standard ontology languages from the World Wide Web Consortium (W3C). One of its three sublanguages, OWL-Lite, OWL-DL or OWL-Full, can be chosen based on the degree of expressiveness and automated reasoning required.
REPRESENTING KNOWLEDGE USING ONTOLOGIES
Using technologies like geo-ontologies, one can explore how domain expertise can be represented as knowledge that can be understood by automated tools. Ontologies aid in storing knowledge about the domain (domain ontologies) and project-specific information about the data (data ontologies) in a form understandable by both humans and machines.
The scope and purpose of creating the ontology define the choice of representing the entities, such as classes, subclasses and attributes. For example, a dense forest can be stored as a forest with value ‘High’ under ‘Density’ or as a subclass of forest.
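The two modelling choices can be sketched side by side. The classes below are a plain Python illustration, assumed for this example only; they are not OWL and not the authors' ontology. They show that a query such as "is this a dense forest?" has to be answered differently depending on the representation chosen.

```python
from dataclasses import dataclass


# Option 1: density is stored as an attribute value of a single Forest class.
@dataclass
class Forest:
    name: str
    density: str  # e.g. "High", "Medium" or "Low"


# Option 2: dense forest is modelled as a subclass of a generic forest class.
class ForestClass:
    def __init__(self, name: str):
        self.name = name


class DenseForest(ForestClass):
    pass


def is_dense(obj) -> bool:
    """Answer the same question under either representation."""
    if isinstance(obj, DenseForest):
        return True
    return isinstance(obj, Forest) and obj.density == "High"
```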
Experts’ knowledge of the domain, common vocabulary and terminology, and other implicit information must be stored as domain ontologies. Multiple ontologies can co-exist and can be used to store the opinions of different experts. These can be aligned or merged as needed, or each user can work with the one they prefer. In this way, ontologies help in collaboration.
Figure 1: Part of the classification schemes of two projects
Task specific ontologies
Tasks can be categorised into common tasks, tasks that are clearly defined and tasks that are not needed often. These may require combining information from multiple domains and data themes. The purpose and mode in which this ontology will be used will help in deciding the representation of the information and the level of detail. An example of a task could be finding suitable sites for wells that provide drinking water, which requires information from multiple layers like lithology, geomorphology, geological structures and hydrology.
Subjective information like scope and purpose of the project, study area, standards and assumptions made can be stored explicitly as data ontologies. Data ontologies also include project specific details like data format, projection, source, classification schema and class representations. This information aids in reuse of data by a larger group of people, minimises errors during reuse and makes it relevant for longer periods of time.
Ontologies have been created for some national projects like NATP and RGNDWM. While newer projects follow better standards, there are implicit project-specific assumptions which could mislead automated systems if not stored explicitly. For example, a Class A found under a higher-level Class B could be a subclass of B or could be discernible within B at a particular scale.
Figure 2: Example workflow for seamless query processing
USING ONTOLOGIES FOR DIFFERENT APPLICATIONS
Data consistency and validation
Semantic rules can be defined for a project and then checked against the actual data. For example, a stream must not cross a reservoir or pond, a state cannot be contained in a district, and a settlement cannot be found in water.
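A minimal sketch of such rule checking is given below. It assumes that the spatial relations between features (crosses, within) have already been computed by a GIS and are available as simple triples; the rule table and the find_violations function are hypothetical names used only for illustration.

```python
# Forbidden (subject type, relation, object type) combinations,
# taken from the example rules in the text.
FORBIDDEN = {
    ("stream", "crosses", "reservoir"),
    ("stream", "crosses", "pond"),
    ("state", "within", "district"),
    ("settlement", "within", "water"),
}


def find_violations(feature_types, relations):
    """Return the relations that break a semantic rule.

    feature_types: dict mapping feature id -> feature type
    relations: list of (subject id, relation, object id) triples,
               assumed to be precomputed from the geometry
    """
    violations = []
    for subj, rel, obj in relations:
        if (feature_types[subj], rel, feature_types[obj]) in FORBIDDEN:
            violations.append((subj, rel, obj))
    return violations
```

Keeping the rules as data rather than code means that a project can add or relax rules without changing the checker.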
Seamless Query Processing
Seamless query processing does not require any modification of the original data. The steps are data discovery, query expansion, query rewriting and execution. An example workflow in our system is given in Fig. 2. The user asks for “Piedmont Polygons”, which is translated to “Select polygons with Class = Piedmont”. After the location and data formats of the discovered data are obtained from the metadata of each dataset, the query is expanded and rewritten for each dataset using the domain and data ontologies. Each rewritten query is then sent to the corresponding processing software or Web service. The results are compiled, merged and sent to the user.
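The expansion and rewriting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the subclass names, dataset names, attribute fields and class codes are all hypothetical, and the output is a generic attribute filter rather than the query syntax of any particular GIS or Web service.

```python
# Domain ontology (hypothetical): concept -> more specific subclasses.
DOMAIN_ONTOLOGY = {
    "Piedmont": ["Piedmont Slope", "Piedmont Alluvial Plain"],
}

# Data ontologies (hypothetical): how each dataset names its class field
# and which label it uses for each domain concept.
DATA_ONTOLOGIES = {
    "project_a": {"field": "CLASS",
                  "labels": {"Piedmont": "PD", "Piedmont Slope": "PDS"}},
    "project_b": {"field": "LANDFORM",
                  "labels": {"Piedmont": "Piedmont Zone"}},
}


def expand(concept):
    """Query expansion: the concept itself plus its known subclasses."""
    return [concept] + DOMAIN_ONTOLOGY.get(concept, [])


def rewrite(concept, data_ontology):
    """Query rewriting: express the expanded query in one dataset's terms."""
    labels = [data_ontology["labels"][c]
              for c in expand(concept) if c in data_ontology["labels"]]
    if not labels:
        return None  # this dataset holds nothing relevant
    quoted = ", ".join(f"'{label}'" for label in labels)
    return f"{data_ontology['field']} IN ({quoted})"


def plan(concept):
    """Build one rewritten filter per dataset that can answer the query."""
    filters = {}
    for name, onto in DATA_ONTOLOGIES.items():
        clause = rewrite(concept, onto)
        if clause is not None:
            filters[name] = clause
    return filters
```

Here plan("Piedmont") yields one filter per dataset, each phrased in that dataset's own field name and labels, so the original data is never modified.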
Remapping or reclassification
Each project is carried out to meet a particular objective, and these objectives vary across projects. Standards are sometimes followed, but only to a certain extent. Deviations happen either because the project necessitates them or because of the pressure of deadlines. Another issue is the ambiguity of the terms used, for example highly, moderately or less dissected hills. These definitions depend on the project and the geographic area.
Aligning classification schemes
Data from the RGNDWM project has been taken up as an example, and an attempt has been made to assess the feasibility of reusing it for the NRCensus project. This involves two steps: aligning the schemas of the two projects and transforming the underlying data accordingly. Comparing the classification schemas, it is immediately apparent that the purpose and scope of the two projects are very different. While the purpose of RGNDWM is to locate suitable sites for drinking water wells, NRCensus depicts changes and modifications of the country’s natural resources such as land, water, soils and forests.
Relations among the classes were identified, and two main issues were found. The first arises when moving from more generalised to less generalised classes, which requires additional information obtained by analysing other layers. For example, a structural hill must be classified as a highly, moderately or less dissected hill; this information was derived from the drainage and slope layers. The second issue is that of multiple parents, which was handled by computing a similarity measure between the candidate parents, analysing the corresponding area on the map, and so on.
Ontology-driven information systems enable faster and better decision making by facilitating on-the-fly integration and interoperability of data across multiple disciplines. Ontologies aid in the correct usage of data, make data usable by a larger community and keep the data relevant for longer periods of time. Total automation of all analysis and modelling tasks may require more technological advances. However, if current technology is properly understood and applied in the right way, dynamic on-the-fly integrated analysis and modelling is possible to a large extent.
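The first issue, refining a general class using other layers, can be sketched as below. The article states only that the dissection level was derived from the drainage and slope layers; the decision rule, the threshold values and the attribute names used here are assumptions made purely for illustration and would in practice depend on the project and the geographic area.

```python
def dissection_class(drainage_density, mean_slope_deg):
    """Assign a dissection level to a structural hill.

    The thresholds are illustrative assumptions, not values from any standard.
    drainage_density: assumed to be in km of stream per square km
    mean_slope_deg: assumed mean slope of the polygon in degrees
    """
    if drainage_density >= 3.0 and mean_slope_deg >= 25.0:
        return "Highly dissected hill"
    if drainage_density >= 1.5 or mean_slope_deg >= 10.0:
        return "Moderately dissected hill"
    return "Less dissected hill"


def remap(polygons):
    """Return copies of the polygons with structural hills reclassified.

    Each polygon is a dict with 'class', 'drainage_density' and
    'mean_slope_deg' keys (hypothetical attribute names).
    """
    remapped = []
    for poly in polygons:
        new_poly = dict(poly)  # leave the source data untouched
        if poly["class"] == "Structural hill":
            new_poly["class"] = dissection_class(
                poly["drainage_density"], poly["mean_slope_deg"])
        remapped.append(new_poly)
    return remapped
```

Polygons of other classes pass through unchanged, and the source records are copied rather than edited in place.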