Nihar R. Sahoo
Tata Infotech Limited, Noida
Shishir K. Mahapatra
Exploration Business Analyst, Tata Petrodyne Limited, New Delhi
With the advent of GIS technology and its tremendous capability for handling complex spatial data, there are ample opportunities for explicitly reasoned evaluation in decision making. However, extracting and comprehending the knowledge implied by this huge amount of spatial data poses a great challenge.
Hard real-world problems, owing to their large size, formidable structure, complex dependencies and definite objectives, often escape classical optimization techniques. The traditional approach of reducing the size of such a problem, for example by removing less significant parameters, observations or constraints, ignores the analysis of relevant constraints and uncertainties in the dataset. Knowledge discovery requires a thorough analysis to extract implicit knowledge, spatial relations, patterns and the nugget effect in spatial datasets. Further, data inadequacy and the procedure for assigning credible weights to input data need to be analyzed.
This paper describes the process of knowledge discovery in exploring the pattern or nature of data from remote sensing and GIS, and the integration of data layers in targeting natural resources. The potential application of logistic regression analysis in resource targeting is described. The capability of this tool in handling varied data types, its data-driven approach to factor weighting and its ability to study interactions among the evidences are explained.
Knowledge Discovery for Resources Assessment Studies
The collation of data about the spatial distribution of properties of the earth's surface has long been an important activity in resources assessment. Describing spatial variations quantitatively has become complicated for reasons such as sheer data volume, the inherent imprecision of geographic data, the general problem of quantifying certain spatial processes and events, missing observations, the nugget effect and the lack of appropriate analysis tools (Cliff and Ord, 1981). The functionality available in a traditional GIS is not exhaustive enough to handle such real-world problems. Further, the analysis tools in a GIS or remote sensing image processing system have not received much attention in commercial applications. Extracting even a simple spatial relation requires complex spatial analysis tools (Fotheringham and Rogerson, 1994).
Statistics has been widely used for about a decade in handling spatial data, pattern recognition, integration and analysis. However, statistical analysis usually assumes that spatially distributed data are statistically independent. Such assumptions are often unrealistic because of spatial effects. Geostatistical models with spatially lagged forms of the independent variables can alleviate the problem to a large extent. The process of knowledge discovery from spatial data requires better interaction with the user and faster discovery of hidden knowledge (Ripley, 1981).
Process of Knowledge Discovery
A few data layers in a GIS, or a set of bands of remote sensing data, do not by themselves convey information. Processing tools are needed to extract information from these datasets. Information such as ground truth and the user's intuition helps in selecting a proper tool for this extraction.
The general process of knowledge discovery is the extraction of spatial association, discrimination and deviation/evolution rules describing temporal changes of a prominent cluster. The figure below shows the general process of knowledge discovery.
Background knowledge of the user is stored in a knowledge base. The database carries information on spatial components and their attributes. Data is fetched from the database through a database interface. Object and attribute extraction tools find which parts are relevant to the task of pattern recognition. Rules and patterns are extracted using sophisticated mathematical tools, information theory and statistics. The interestingness or significance of these patterns is assessed at the final evaluation step to screen out obvious or redundant knowledge. The user has control of the system at every step, and expert judgement is fed into the system at each step. The general methods in the process of knowledge discovery are as follows:
Generalization based Knowledge Discovery
Data and objects in a database often contain detailed information at a primitive concept level. It is often desirable to summarize a large dataset and present it at a higher concept level. Generalization-based mining abstracts data from several evidences from one concept level to a higher one and performs knowledge extraction on the generalized data (Mitchell, 1982). It assumes the existence of background knowledge in the form of concept hierarchies, which are either data-driven or assigned explicitly by expert knowledge. The data can be expressed as a generalized relation or data cube, on which many other operations can be performed to transform the generalized data into different forms of knowledge. Multivariate statistical techniques such as principal components analysis, discriminant analysis, characteristic analysis, correlation analysis, factor analysis and cluster analysis are used for generalization-based knowledge discovery (Shaw and Wheeler, 1994).
A spatial structure consists of points, lines, polygons or pixels. To build indices for these, multi-dimensional trees are used. Spatial operations such as union, intersection and overlay, and spatial orientations such as close_to, far_off, left_of and right_of, are among the operations generally used in discovering spatial characteristics in a GIS.
A spatial association rule has the form A → B (p%), meaning that p% of A are associated with B, where A and B are sets of predicates and p% is the confidence of the rule. During spatial analysis the user may obtain a large number of rules. To limit the number of discovered rules, the concepts of minimum support and minimum confidence are used.
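The support and confidence of a rule can be illustrated with a minimal sketch. It assumes each spatial object is represented as a set of predicate names; the function names, predicate names and data are hypothetical and only serve the illustration.

```python
def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    objects is a list of sets of predicate names (an assumed encoding).
    """
    n_total = len(objects)
    # Objects that satisfy every predicate in A.
    n_a = sum(1 for preds in objects if antecedent <= preds)
    # Objects that satisfy every predicate in both A and B.
    n_ab = sum(1 for preds in objects if (antecedent | consequent) <= preds)
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


def passes_thresholds(objects, antecedent, consequent, min_support, min_confidence):
    """Keep a rule only if it meets both minimum support and minimum confidence."""
    support, confidence = rule_stats(objects, antecedent, consequent)
    return support >= min_support and confidence >= min_confidence


# Hypothetical objects described by hypothetical predicates.
objects = [
    {"is_gold_occurrence", "close_to_lineament"},
    {"is_gold_occurrence", "close_to_lineament"},
    {"is_gold_occurrence"},
    {"close_to_lineament"},
]
support, confidence = rule_stats(
    objects, {"is_gold_occurrence"}, {"close_to_lineament"}
)
```

Here two of the three gold occurrences are close to a lineament, so the rule holds with a confidence of about 67% and a support of 50%; raising either threshold prunes weaker rules.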
Approximation and Aggregation
Approximation and aggregation describe the characteristics of a cluster in terms of the features close to it. The aggregate proximity gives the maximum and minimum distances of points in the cluster to a feature, the average distance and the percentage of points lying within a specified threshold distance.
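The aggregate proximity measures listed above can be sketched as follows. The sketch assumes the feature is a single point and uses Euclidean distance; for a line or polygon feature the distance function would have to be replaced. The function name and data are hypothetical.

```python
import math


def aggregate_proximity(cluster_points, feature_point, threshold):
    """Summarize the distances from the points of a cluster to one point feature."""
    distances = [math.dist(p, feature_point) for p in cluster_points]
    within = sum(1 for d in distances if d < threshold)
    return {
        "min": min(distances),
        "max": max(distances),
        "mean": sum(distances) / len(distances),
        # Percentage of cluster points closer than the threshold.
        "pct_within": 100.0 * within / len(distances),
    }


# Hypothetical cluster and feature coordinates in map units.
cluster = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
summary = aggregate_proximity(cluster, (0.0, 0.0), threshold=6.0)
```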
Spatial Data Integration
Assessing resources and computing favourability measures for the occurrence of a specific resource type require the integration of various layers of information or knowledge as evidences, using statistical models. Multivariate statistical methods used in resources appraisal include principal component analysis, canonical correlation analysis, logistic regression analysis, weighted and targeted multivariate criteria, factor analysis, modified canonical correlation analysis and cluster analysis (Pan et al., 1992).
Quantitative approaches to data integration can be grouped into two categories: data-driven (objective) and knowledge-driven (subjective). Integration of data for computing probabilities or favourability measures uses weights of evidence modelling (based on Bayes' theorem), indicator probability theory and evidential belief function theory. The major problem with these methods is testing and assessing the conditional independence of the predictive maps given the hypothesis. Extended weights of evidence modelling, a more recently proposed method, handles categorical explanatory variables for quantitative data integration and analysis in resources appraisal. Knowledge-driven approaches are usually forward-chaining expert systems in which the favourability measure is propagated through the inference network by Bayesian updating, fuzzy logic or belief functions to compute posterior favourability values given the evidence(s).
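As an illustration of the data-driven category, the sketch below computes weights of evidence for one binary predictor map from pixel counts. It follows a commonly used formulation (positive weight, negative weight and their contrast) and is not taken from this paper; the function name and the counts are hypothetical.

```python
import math


def weights_of_evidence(n_total, n_deposit, n_pattern, n_both):
    """Return (W+, W-, contrast) for a binary pattern B and deposits D.

    n_total   : number of pixels in the study area
    n_deposit : pixels containing a known occurrence (D)
    n_pattern : pixels where the binary pattern is present (B)
    n_both    : pixels where both B and D are present
    """
    p_b_given_d = n_both / n_deposit
    p_b_given_not_d = (n_pattern - n_both) / (n_total - n_deposit)
    # W+ applies where the pattern is present, W- where it is absent.
    w_plus = math.log(p_b_given_d / p_b_given_not_d)
    w_minus = math.log((1.0 - p_b_given_d) / (1.0 - p_b_given_not_d))
    return w_plus, w_minus, w_plus - w_minus


# Hypothetical counts: 8 of 10 occurrences fall inside a 200-pixel pattern.
w_plus, w_minus, contrast = weights_of_evidence(1000, 10, 200, 8)
```

A positive W+ together with a negative W- indicates that the pattern is spatially associated with the occurrences. Combining several such maps by adding weights is valid only under the conditional independence assumption discussed above.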
Multi-Criteria and Multi-objective Analysis
The problem of targeting natural resources requires predictive modelling, using various procedures and tools to develop decision rules. A decision rule typically contains procedures for combining criteria into a single composite index, together with a statement of how alternatives are to be compared using this index (for example, by applying a threshold to a single criterion); these are structured in the context of a specific objective. An objective is thus a perspective that guides the structuring of decision rules. To meet a specific objective, several criteria are integrated and evaluated; this is called multi-criteria evaluation (Quinlan, 1986).
Two common classes of GIS-based multi-criteria evaluation are concordance and discordance analysis. The former handles all evidences with assigned weights, while the latter analyzes the degree to which one evidence outranks another on a specified criterion. Uncertainties in spatial data are difficult to handle; when they are present, the decision rule needs to modify the choice function or heuristic to accommodate the propagation of uncertainties through the rule and to replace the hard decision (Lee et al., 1987). The uncertainties are handled by assigning probabilities to the evidences. The relationships between the evidences and the belief or hypothesis are evaluated by a forward-chaining expert system, in which favourability is propagated through an inference net using Bayesian updating, fuzzy logic or the Dempster-Shafer function. Evidences propagate in one direction through a hierarchical network to an ultimate hypothesis.
A decision is a choice between alternatives, and the basis for a decision is known as a criterion. Criteria may be of two types: factors and constraints. Factors may be continuous, binary or coded variables, whereas constraints are generally Boolean. Through multi-criteria evaluation, these criteria, which represent suitability, are integrated into a single suitability map for a single objective. The method handles the tradeoff between factors during integration. Factor weights, called tradeoff weights, are assigned by expert knowledge or are data-driven.
For Boolean criteria, the solution usually lies in the union (logical OR) or intersection (logical AND) of conditions. For continuous factors, a weighted linear or log-linear combination is the usual practice. The ordered weighted average technique considers the factors, their weights and a rank assigned to the factor weights. As criteria are measured on different scales, they are standardized before integration. Establishing factor weights is the most complicated aspect; for Boolean maps, a pair-wise comparison matrix is generally used. The Analytic Hierarchy Process (Saaty, 1980) provides a series of pair-wise comparisons of the relative importance of factors to the suitability of pixels for the activity being evaluated.
Situations with conflicting multiple objectives require the integration of information from a set of suitability maps, one for each objective. The relative weights assigned to each objective and the amount of area assigned to each are analyzed. This is basically a compromise programming analysis (Pereira and Duckstein, 1993), which attempts to allocate target potential areas for each objective given the assigned weights. The technique uses a min-max rule, in which the minimum of the maximum weighted deviations is sought for the composite layer, and it provides a non-compensatory solution. Bayes theory and Dempster-Shafer theory are also used to handle multi-objective evaluations.
Bayesian updating, fuzzy theory and belief functions handle spatial uncertainties well. However, error propagates in an inference net because of subjective weighting and the scaling of datasets, which again involves subjective judgement. Interactions among evidences cannot be studied using these methods. Logistic regression analysis is a suitable technique for addressing these inadequacies: it handles varied data types, its factor weighting is data-driven, and the main effects and interactions of the evidences can be studied. The case study given below describes the application of logistic regression analysis in targeting potential resources.
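The weighting and combination steps described above can be sketched as follows. The sketch assumes that factor weights are approximated from the pair-wise comparison matrix by the normalized geometric mean of each row (a common approximation to the principal eigenvector used in the Analytic Hierarchy Process), that factors are standardized by min-max scaling, and that a Boolean constraint masks excluded pixels. All names and values are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate priority weights from a reciprocal pair-wise comparison matrix."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def standardize(values):
    """Rescale a factor to the range 0..1 so that factors become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_linear_combination(factors, weights, constraint):
    """Combine standardized factors per pixel and mask with a Boolean constraint."""
    suitability = []
    for i, allowed in enumerate(constraint):
        score = sum(w * f[i] for w, f in zip(weights, factors))
        suitability.append(score * allowed)
    return suitability


# Hypothetical comparison matrix: factor 1 is twice as important as factor 2
# and four times as important as factor 3.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical factor values for four pixels, plus a Boolean constraint.
factors = [
    standardize([10.0, 20.0, 30.0, 40.0]),
    standardize([5.0, 5.0, 15.0, 25.0]),
    [1.0, 0.0, 1.0, 0.0],
]
constraint = [1, 1, 0, 1]
suitability = weighted_linear_combination(factors, weights, constraint)
```

For this perfectly consistent matrix the weights are 4/7, 2/7 and 1/7. The third pixel scores zero regardless of its factor values because the constraint excludes it, which is the difference between a constraint and a low-valued factor.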
The logistic regression model has the form (Hosmer and Lemeshow, 1989):
$$\operatorname{logit}(\rho) = \log\frac{\rho}{1-\rho} = \beta_0 + \beta' X$$
where $\rho$ is the probability of the response,
$X$ is the vector of explanatory variables,
$\beta_0$ is the intercept parameter and
$\beta'$ is the vector of slope coefficients.
As the equation is nonlinear, Newton-Raphson or iteratively reweighted least squares is used to solve for $\beta'$. Statistics such as Wald's statistic, the likelihood ratio test statistic, the Akaike Information Criterion (AIC) and the Schwarz Criterion (SC) are used to assess model fit.
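The Newton-Raphson fit and two of the fit statistics can be sketched for the simplest case of one explanatory variable and an intercept. This is a minimal illustration, not the procedure used in the case study: the function names and data are hypothetical, and the 2x2 Newton system is solved in closed form to avoid any library dependency.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(x, y, tol=1e-10, max_iter=50):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson and return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix entries, with weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


def log_likelihood(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def aic(loglik, n_params):
    """Akaike Information Criterion: -2 ln L + 2 k."""
    return -2.0 * loglik + 2.0 * n_params


def schwarz_criterion(loglik, n_params, n_obs):
    """Schwarz Criterion: -2 ln L + k ln n."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical data: a binary response that becomes more likely as x grows.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
loglik = log_likelihood(x, y, b0, b1)
fitted = [sigmoid(b0 + b1 * xi) for xi in x]
```

Lower AIC or SC values indicate a better trade-off between fit and the number of parameters. The data must not be perfectly separable, otherwise the maximum likelihood estimates do not exist and the iteration does not settle.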
Data Integration and Analysis
Concentrations of As and Sb, among others, were found to be strong pathfinders of gold occurrences. The lineament proximity map created using weights of evidence modelling (Sahoo et al., 2000) was used with the As and Sb concentration maps to predict the gold occurrences. The concentration maps of As and Sb were continuous, whereas the lineament proximity map was Boolean: pixels lying within the buffer zone of lineaments were assigned a value of 1, and those outside were assigned 0. The response map, proximity to known gold occurrences, was also Boolean: pixels lying within the 0.5 km buffer zone of the deposit occurrences were assigned 1 and all others 0.
Modelling with logistic regression is a step-wise model-fitting procedure. The following steps were carried out (Sahoo and Pandalai, 1999).
Step 1: Univariate statistics were computed for each independent explanatory map to establish its statistical significance. Wald's statistic was used to remove insignificant variables. All three factors (As concentration, Sb concentration and lineament proximity) were found significant at the α = 0.05 level and were retained in the fitted model.
Step 2: Multivariate analysis was carried out with all the main effects found significant at the α = 0.05 level. Using AIC, SC and the likelihood ratio test statistic, lineament proximity was found not to be a significant factor in targeting gold occurrences.
Step 3: Multivariate analysis was carried out using the significant main effects and all interaction terms. The G statistic was used to evaluate the significance of the main effects and their interactions.
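The likelihood ratio (G) statistic used in these steps compares a model with and without a term. A minimal sketch follows, assuming the two log-likelihoods are already available from fitted models and that the tested term adds one parameter; the log-likelihood values shown are hypothetical, not results from the case study.

```python
# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CHI2_CRITICAL_1DF_005 = 3.841


def likelihood_ratio_g(loglik_reduced, loglik_full):
    """G = -2 * (ln L_reduced - ln L_full)."""
    return -2.0 * (loglik_reduced - loglik_full)


def term_is_significant(loglik_reduced, loglik_full,
                        critical=CHI2_CRITICAL_1DF_005):
    """Retain the added term if G exceeds the chi-square critical value."""
    return likelihood_ratio_g(loglik_reduced, loglik_full) > critical


# Hypothetical log-likelihoods of a model without and with one extra term.
g = likelihood_ratio_g(-120.0, -115.0)
keep_term = term_is_significant(-120.0, -115.0)
```

A term that adds several parameters would need the critical value for the corresponding degrees of freedom.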
The final model for predicted probability for gold occurrences using lineament proximity map, As and Sb concentration maps, is computed as
$$Y = \frac{1}{1 + \exp\left(-\left(-6.29 + 3.49\,\mathrm{As} + 4.28\,\mathrm{Sb} + 3.1\,\mathrm{As}\cdot\mathrm{LIN} - 2.62\,\mathrm{As}\cdot\mathrm{Sb}\cdot\mathrm{LIN}\right)\right)}$$
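The fitted model can be applied pixel by pixel. The sketch below evaluates the equation for given As and Sb values and a Boolean lineament value. The coefficients are those of the final model; the function name is hypothetical, and the input values are arbitrary because the scaling of the As and Sb maps is not stated here.

```python
import math


def gold_probability(as_value, sb_value, lin):
    """Predicted probability of gold occurrence for one pixel.

    as_value, sb_value : As and Sb map values (scaling assumed to match the fit)
    lin                : 1 inside the lineament buffer zone, 0 outside
    """
    z = (-6.29
         + 3.49 * as_value
         + 4.28 * sb_value
         + 3.1 * as_value * lin
         - 2.62 * as_value * sb_value * lin)
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary illustrative inputs, not data from the study area.
p_background = gold_probability(0.0, 0.0, 0)
p_outside_buffer = gold_probability(1.0, 1.0, 0)
p_inside_buffer = gold_probability(1.0, 1.0, 1)
```

Lineament proximity enters only through the interaction terms, so it changes the prediction only where As is non-zero.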
As this method does not involve subjective weighting of the explanatory variables, less error propagates through each step of the computation. However, the user can weigh the different model-fit statistics according to expert knowledge when selecting a model. The interaction terms are also studied in this model.
Conclusion
GIS-based multi-criteria, multi-objective evaluation using fuzzy theory, Bayes theory and belief functions is a powerful tool for targeting potential areas for exploration. The interaction of the user and expert knowledge contributes substantially to discovering knowledge about the data. However, the scaling of varied data layers and the subjective factor weighting and rank weighting lead to propagation of error in an inference net. The capability of logistic regression to handle varied data types, to weight factors in a data-driven way and to model factor interactions provides a less error-prone model.
References
- Cliff, A. D. and Ord, J. K., 1981: Spatial Processes: Models and Applications, Pion, London.
- Fotheringham, S. and Rogerson, P., 1994: Spatial Analysis and GIS, Taylor and Francis.
- Hosmer, D. W. Jr. and Lemeshow, S., 1989: Applied Logistic Regression, John Wiley and Sons, Inc., New York.
- Lee, N. S., Grize, Y. L. and Dehnad, K., 1987: Quantitative Models for Reasoning Under Uncertainty in Knowledge-Based Expert Systems, International Journal of Intelligent Systems, 2, 15-38.
- Mitchell, T. M., 1982: Generalization as Search, Artificial Intelligence, 18, 203-226.
- Pan, G., Harris, D. P., and Heiner, T., 1992: Fundamental Issues in Quantitative Estimation of Mineral Resources, Natural Resources Research, 1, 4, 281-292.
- Pereira, J. M. C. and Duckstein, L., 1993: A Multi-criteria Decision Making Approach to GIS Based Land Suitability Evaluation, International Journal of Geographical Information Systems, 7, 5, 407-424.
- Quinlan, J. R., 1986: Induction of Decision Trees, Machine Learning, 1, 81-106.
- Ripley, B. D., 1981: Spatial Statistics, Wiley, New York.
- Sahoo, N. R. and Pandalai, H. S., 1999: Integration of Sparse geologic information in gold targeting using logistic regression analysis in the Hutti-Maski schist belt, Raichur, Karnataka, India – A Case study, Natural Resources Research, 8, 3, 233-250.
- Sahoo, N. R., Jothimani, P. and Tripathy, G. K., 2000: Multi-Criteria analysis in GIS environment – A case study on gold exploration, GIS@development, 4, 5, 38-40.
- Sahoo, N. R. and Pandalai, H. S., 2000: Secondary geochemical dispersion in the Precambrian auriferous Hutti-Maski schist belt, Raichur district, Karnataka, India. Part I: Anomalies of As, Sb, Hg and Bi in soil and groundwater, Journal of Geochemical Exploration, 71, 3, 269-289.
- Saaty, T. L., 1980: The Analytic Hierarchy Process, McGraw-Hill, New York.
- Schmucker, K. J., 1982: Fuzzy Sets, Natural Language Computations and Risk Analysis, Computer Science Press.
- Shaw, G. and Wheeler, D., 1994: Statistical Techniques in Geographic Analysis, David Fulton, London.
- Wettschereck, D., Aha, D. W. and Mohri, T., 1997: A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, AI Review, 10, 1-37.