GIS and Cluster detection: An Application to State Level Socio-Economic Data

Soumya Mazumdar
PhD Candidate, University of Iowa
Email: [email protected]

Prof Mayajit Mazumdar
Professor, Department of Civil Engineering,
Indian Institute of Technology, Kharagpur
West-Bengal, India – 721302
Email: [email protected]

Tel: 03222-28-3432 (office), 03222-28-3433 (home)

 

Introduction
Cluster detection is a tool employed by GIS scientists who specialize in spatial analysis. To date, most applications of cluster analysis have been confined to epidemiology, though of late applications have also been found in crime data analysis using GIS. In the epidemiological context, a cluster is an 'unusual aggregation of health events, real or perceived'. The 'unusual aggregation' need not, however, consist of health event data. Any data that show geographic (spatial) variability can be subjected to cluster analysis.

Methodology and Data
The first law of geography states that 'Everything is related to everything else, but near things are more related than distant things.' Statistically, this property is defined as 'positive spatial autocorrelation', and most geographic phenomena tend to show it. For instance, if the state of Kerala is found to have a high literacy rate, we would expect the literacy rates of its neighboring states to be more similar to Kerala's than that of a faraway state such as Assam. A 'cluster' would then be a collection of neighboring states with high literacy rates.

The question that now arises is whether such a cluster of high-literacy states can be identified simply by drawing a choropleth map of the literacy rates. The answer is no, because an apparent cluster can arise at random. If the null hypothesis is taken to be the absence of clustering, then any cluster that is found must be tested to determine the probability that it arose by chance alone. A simple choropleth map does not provide such a test. What is needed is a tool that takes the data (say, literacy rates), locates a cluster, and carries out the tests needed to check whether that cluster could have arisen at random.

The tool used in this study is the spatial scan statistic, which is implemented in the software 'SaTScan'. SaTScan was originally developed by Martin Kulldorff of the National Cancer Institute to study the clustering of breast cancer data, but of late it has been used on a variety of data. The software can be downloaded free of cost from the National Cancer Institute webpage. SaTScan needs three input files: a population file, a case file, and a coordinate file. The case file may hold individual or aggregated data.
In our study, the case file contains the number of literates in each state (Kerala, for example), so the data are aggregated at the state level. The population file carries the population of each state, and the coordinate file contains the coordinates of the state centroids. SaTScan operates by passing a circular window over the study region. At any position, the centroids of some states lie within the circle. SaTScan tests whether the number of literates inside the circle exceeds the number expected for that region. The circle is enlarged until a user-defined threshold is reached, and it is moved from one centroid to the next until all the centroids in the study region have been covered. The significance of the resulting test statistic is then assessed by generating a large number of random replications of the dataset under the null hypothesis. This gives the probability (the p-value) that the cluster arose by chance, on the basis of which we can accept or reject our results. The results can also be exported to a GIS, where the clusters can be shown as circles and their sizes inspected visually.
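The scanning procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SaTScan's actual implementation. It assumes a Poisson log-likelihood ratio as the test statistic and a maximum window of 50% of the total population as the user-defined threshold. The six regions, their coordinates, populations and case counts are made up, and every function name is hypothetical.

```python
import math
import random

def llr(c, e, total):
    """Poisson log-likelihood ratio for a zone with c observed and e expected cases.
    Only excesses (c > e) are scored, since we look for high-rate clusters."""
    if e <= 0 or c <= e:
        return 0.0
    rest = total - c
    outside = rest * math.log(rest / (total - e)) if rest > 0 else 0.0
    return c * math.log(c / e) + outside

def candidate_zones(coords, pops, max_frac=0.5):
    """Grow a circle around each centroid, adding the nearest centroids one by one
    until the zone would exceed max_frac of the total population (assumed threshold)."""
    total_pop = sum(pops)
    n = len(coords)
    zones = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(coords[i], coords[j]))
        zone, pop_in = [], 0
        for j in order:
            if pop_in + pops[j] > max_frac * total_pop:
                break
            zone.append(j)
            pop_in += pops[j]
            zones.append(list(zone))
    return zones

def most_likely_cluster(cases, pops, zones):
    """Return (statistic, zone) for the zone with the highest likelihood ratio."""
    total_c, total_p = sum(cases), sum(pops)
    best_stat, best_zone = 0.0, []
    for zone in zones:
        c = sum(cases[j] for j in zone)
        e = total_c * sum(pops[j] for j in zone) / total_p
        stat = llr(c, e, total_c)
        if stat > best_stat:
            best_stat, best_zone = stat, zone
    return best_stat, best_zone

def scan(coords, pops, cases, n_sim=199, seed=0):
    """Find the most likely cluster and estimate its p-value by Monte Carlo."""
    rng = random.Random(seed)
    zones = candidate_zones(coords, pops)
    stat, zone = most_likely_cluster(cases, pops, zones)
    n, total_c = len(pops), sum(cases)
    exceed = 0
    for _ in range(n_sim):
        # Null hypothesis: each case falls in a region with probability
        # proportional to that region's population.
        sim = [0] * n
        for j in rng.choices(range(n), weights=pops, k=total_c):
            sim[j] += 1
        if most_likely_cluster(sim, pops, zones)[0] >= stat:
            exceed += 1
    p_value = (exceed + 1) / (n_sim + 1)
    return zone, stat, p_value

# Toy data (made up): two groups of three regions; regions 0 and 1 have excess cases.
coords = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
pops = [100, 100, 100, 100, 100, 100]
cases = [30, 28, 12, 10, 11, 9]

zone, stat, p_value = scan(coords, pops, cases)
print("most likely cluster:", sorted(zone), "LLR:", round(stat, 2), "p:", p_value)
```

The p-value is the rank of the observed statistic among the simulated ones, so with 199 replications it cannot fall below 1/200. Real state-level counts run to millions, for which drawing cases one at a time as done here would be too slow. A vectorized multinomial sampler would be the natural replacement.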

It must be noted that in the present context the spatial scan statistic is used as a 'purely spatial' cluster detection tool. Clusters can also be detected in the temporal and spatio-temporal domains. Nor is SaTScan the only spatial cluster detection tool available; because it is easy to obtain, however, it has been used in a large number of studies. Such studies are ideally carried out with point-level georeferenced data, but such data are unavailable in India and aggregated data have to suffice. Some commercially available data, however, may have a finer resolution than state-level data.

In this study, clustering was studied for two phenomena: state literacy levels and state GDPs. The data were obtained from the Census of India website and from the World Bank poverty data bank. The state centroids were located visually by the authors using a GIS, and the results were then displayed using a GIS.
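Where state boundary polygons are available, centroids for the coordinate file could also be computed rather than placed by eye. The sketch below is not the procedure used in this study. It applies the standard area-weighted centroid formula to a simple polygon, and it assumes planar (projected) coordinates and a single ring of vertices with no holes. The function name and the example square are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as a list of (x, y) vertices.
    Assumes planar coordinates; the ring need not repeat its first vertex."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)

# Hypothetical example: a 2 x 2 square with its corner at the origin.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(polygon_centroid(square))
```

For an irregularly shaped state the computed centroid can fall outside the boundary, in which case a visually chosen interior point may still be preferable.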

Results, conclusion and discussion
Clusters of high literacy and of high GDP were obtained, along with some sub-clusters. The clusters for the two datasets do not necessarily coincide. The results were then displayed in a GIS. Such results are of importance to national-level policy makers and planners.

To summarize, cluster detection is a tool originally used by epidemiologists with knowledge of GIS, but it is increasingly being applied to other kinds of data. In this study, a cluster detection tool was applied to state-level socio-economic data and some clusters were identified.