
An unbiased sampling framework for ground data collection in digital image classification


Ranjith Premalal De Silva
Dayawansa N.D.K., Department of Agricultural Engineering,
University of Peradeniya, Peradeniya, SRI LANKA.
J.C. Taylor
Cranfield University, Silsoe, Bedford MK45 4DT England.

Ground data collection is an integral part of the classification process where relationships are sought to relate spectral signatures of a satellite image with the corresponding ground features. In addition to the task of training spectral signatures, ground data collection through a field survey provides the base data for the determination of accuracy of supervised classification, assessment of the classification generated by unsupervised classification algorithms and identification of empirical relationships between surface properties and satellite observations. Hence, the collection of an unbiased data set for ground information is imperative for a successful image classification and subsequent analysis. This study focuses on such a methodology and its application for land use/ cover assessment in the Upper Mahaweli Catchment (UMC), Sri Lanka.

Image Pre-Processing
The majority of the UMC was covered by a single quadrant (path 21, row 64, top left quadrant of I2164A1) of the IRS LISS II image. However, four quadrants were required to produce a mosaic covering the entire UMC. The four image quadrants were geometrically corrected individually using first order polynomial transformations. Before the mosaic for the catchment was generated, each band of the smaller mosaic covering the western strip of the UMC was radiometrically corrected with linear transformation functions to adjust its relative brightness values.

The Optimum Index Factor (OIF), based on the variance of and the correlation among different bands, was used to identify the best three-band combination. Interestingly, the commonly used standard False Colour Composite (FCC) from IRS bands 2, 3 and 4 ranks second according to the OIF, while bands 1, 3 and 4 were found to be the best combination. Histogram equalization, a non-linear contrast stretch, was applied to the image to improve both its visual and digital contrast.
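The band ranking described above can be sketched as follows. This is a minimal illustration, not the software used in the study: it assumes the commonly cited form of the OIF (sum of the band standard deviations divided by the sum of the absolute pairwise correlation coefficients), and the function names and sample DN values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def _std(xs):
    # Population standard deviation of one band's DN values.
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def _corr(xs, ys):
    # Pearson correlation between two bands (undefined for constant bands).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def oif(bands, combo):
    # Assumed OIF form: sum of standard deviations over sum of |correlations|.
    num = sum(_std(bands[b]) for b in combo)
    den = sum(abs(_corr(bands[a], bands[b])) for a, b in combinations(combo, 2))
    return num / den


def rank_combinations(bands):
    # Score every three-band combination and sort from best to worst.
    scores = {c: oif(bands, c) for c in combinations(sorted(bands), 3)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical DN samples for four bands (not data from the study).
demo_bands = {
    1: [10, 20, 30, 40, 50],
    2: [12, 19, 33, 41, 48],
    3: [50, 10, 40, 20, 30],
    4: [5, 45, 15, 35, 25],
}
ranking = rank_combinations(demo_bands)
```

A high OIF favours combinations whose bands have a large spread and are weakly correlated, which is why a composite other than the standard FCC can rank first.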

Statistical Sampling Methodology

A methodology based on area-frame sampling was adopted for this study (Taylor & Eva, 1992). Unaligned systematic random samples were chosen with fixed-size ground segments of 1 sq. km. A 10 km grid corresponding to 1:10000 map sheets was overlaid upon the image, and the location of a ground segment was chosen randomly within each 10 km x 10 km grid square. The total number of ground segments was 38, which gave a sampling fraction of 1.22% of the 3122 sq. km study area. Stratification of the catchment was not required as there was no strong basis for defining regional strata in the UMC. The location of the ground segments is shown in Fig. 01.
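The selection of segments and the resulting sampling fraction can be illustrated with a short sketch. It assumes that one 1 km x 1 km cell is drawn independently at random inside every 10 km x 10 km grid square and that coordinates are kilometres from a hypothetical grid origin; clipping of squares to the catchment boundary is not modelled.

```python
import random


def select_segments(n_cols, n_rows, grid_km=10, segment_km=1, seed=None):
    # Draw one segment per grid square; returns lower-left corners in km.
    rng = random.Random(seed)
    cells_per_side = grid_km // segment_km
    segments = []
    for gy in range(n_rows):
        for gx in range(n_cols):
            dx = rng.randrange(cells_per_side)  # random column offset in the square
            dy = rng.randrange(cells_per_side)  # random row offset in the square
            segments.append((gx * grid_km + dx * segment_km,
                             gy * grid_km + dy * segment_km))
    return segments


def sampling_fraction(n_segments, segment_area_km2, study_area_km2):
    # Sampled area as a percentage of the study area.
    return 100.0 * n_segments * segment_area_km2 / study_area_km2


demo_segments = select_segments(n_cols=3, n_rows=2, seed=1)
fraction = sampling_fraction(38, 1.0, 3122.0)  # figures quoted in the text
```

With the figures quoted above (38 segments of 1 sq. km in 3122 sq. km), the function reproduces the reported sampling fraction of about 1.22%.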

Fig. 01 Location of Ground Segments (10 x 10 km grid sampling)

Field Data Formats and

Cubic convolution resampling produced a 20 m resolution, giving an array of 50 rows and 50 columns of data for each segment. A 400 m border was allowed around each segment so that ground features could be recognized easily and compared with the corresponding image segment. Further, a 5 km x 5 km area surrounding the ground segment was also printed out to obtain a synoptic view of the area and its ground features. It also helped in field navigation.

In addition to the image segment, a transparent overlay of the available map sheet was also produced at 1:10000 scale. The enlarged overlays of land use maps at 1:50000 and 1:63360 provided the information required to locate the ground segments and to calculate relative distances between ground features whenever necessary. A proforma was prepared to record field data corresponding to the features on the image segment. Another transparency overlay was used to draw the field parcel boundaries and record the spatial extent of land use on the image segment.

A Trimble Geoexplorer GPS fitted with an external antenna provided the location information for navigation in the field. The land use and contour maps at 1:10000, 1:50000 and 1:63360 scales were also used for route planning for the field visits.

Area Estimation by Direct Expansion
Digitizing of the ground data segments was repeated in an attempt to minimize errors. The digitized segments also provide the direct expansion estimates, which quantify the land use status of an area through a statistically valid sampling procedure. In this study, these estimates were not intended to provide crop area statistics in terms of coverage. Instead, they provided the a priori weightings used later in the digital classification procedure (Swain & Davis, 1978).
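A direct expansion estimate can be sketched as below. The sketch assumes the usual unstratified form, in which the class area is the region area multiplied by the mean class proportion across segments, with a finite population correction in the standard error; the function name and the proportions are hypothetical, not values from the study.

```python
from math import sqrt


def direct_expansion(proportions, region_area_km2, segment_area_km2=1.0):
    # proportions: fraction of each sampled segment occupied by one class.
    n = len(proportions)
    mean_p = sum(proportions) / n
    var_p = sum((p - mean_p) ** 2 for p in proportions) / (n - 1)  # sample variance
    total_segments = region_area_km2 / segment_area_km2            # N, all possible segments
    fpc = 1.0 - n / total_segments                                 # finite population correction
    area = region_area_km2 * mean_p
    std_err = region_area_km2 * sqrt(fpc * var_p / n)
    return area, std_err, mean_p


# Hypothetical class proportions in four segments of a 100 sq. km region.
area, std_err, prior = direct_expansion([0.2, 0.4, 0.6, 0.8], 100.0)
```

The mean proportion returned as `prior` is the kind of quantity that can serve as an a priori weighting for a class in a later classification step.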

Unsupervised Classification
Initially, sequential clustering was employed to identify 50 classes and their corresponding spectral signatures from the multispectral space of the image. The mean and standard deviation of the DNs belonging to each class were calculated individually. The calculated class statistics were then subjected to Hierarchical Cluster Analysis using the Ward method. The basis for using this algorithm was to produce a lower number of classes through a process of agglomeration from a large population of classes, minimizing the variance within each spectral class while maximizing the variance among different classes (Hung, 1994; Richards, 1986). The algorithm tries to find the optimal spectral class combinations for the user-specified number of classes and results in a tree structure that is shown as a fusion dendrogram.

In this exercise, the final number of spectral classes was set to 10 so that the result could be compared directly with the scheme developed for the supervised classification. The spectral signatures produced for the 50 classes were regrouped according to the combinations given in the fusion dendrogram in order to obtain the final 10 signatures, which were assumed to correspond to the defined classification scheme.
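The regrouping of many spectral classes into a smaller set can be sketched with a naive Ward agglomeration. This is an illustration only, not the package used in the study: every initial class is given equal weight (pixel counts are ignored), the function name is hypothetical, and the class means are made up.

```python
def ward_merge(means, n_final):
    # Each cluster holds (member indices, centroid, size).
    clusters = [([i], list(m), 1) for i, m in enumerate(means)]
    while len(clusters) > n_final:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                _, ca, na = clusters[a]
                _, cb, nb = clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Ward criterion: increase in within-cluster sum of squares.
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ia, ca, na = clusters[a]
        ib, cb, nb = clusters[b]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[a] = (ia + ib, centroid, na + nb)
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(members) for members, _, _ in clusters]


# Hypothetical two-band class means, merged down to three groups.
groups = ward_merge([[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]], 3)
```

Recording the merge order and cost at each pass would give the information needed to draw the fusion dendrogram.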

Mosaicing of Ground Data Segments
A mosaiced image was created by extracting all the ground segments using a C++ program from the image covering UMC. The collection of imagettes into a single image (QUILT) was useful in identifying and evaluating training data for the maximum likelihood classifier and also in assessing the accuracy of classification.
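The idea of the QUILT can be illustrated with a short sketch. The original C++ program is not reproduced in the article, so the layout below (square windows tiled left to right, with zero padding for unused positions) and all names are assumptions.

```python
def extract_segment(image, row0, col0, size):
    # Cut a size x size window whose top-left pixel is (row0, col0).
    return [row[col0:col0 + size] for row in image[row0:row0 + size]]


def build_quilt(image, corners, size, per_row):
    # Tile the extracted segments into one image, per_row segments per strip.
    tiles = [extract_segment(image, r, c, size) for r, c in corners]
    blank = [[0] * size for _ in range(size)]
    while len(tiles) % per_row:
        tiles.append(blank)  # pad the last strip with empty tiles
    quilt = []
    for start in range(0, len(tiles), per_row):
        strip = tiles[start:start + per_row]
        for i in range(size):
            quilt.append([v for tile in strip for v in tile[i]])
    return quilt


# Hypothetical 4 x 4 single-band image with three 2 x 2 segments.
demo_image = [[r * 4 + c for c in range(4)] for r in range(4)]
demo_quilt = build_quilt(demo_image, [(0, 0), (2, 2), (0, 2)], size=2, per_row=2)
```

Because every segment keeps a fixed position in the quilt, a classified quilt can be compared pixel by pixel with the digitized ground data.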


Recognition of Spectral Signatures for Supervised Classification

The aim of providing training data to the classifier is to obtain spectral information which could be used to determine the decision boundaries for the classification of the whole mosaiced image. The total number of training sites chosen was 159 while the number for each class was proportional to the area of that class in the total ground survey data. The training pixels were randomly selected from the quilted image. However, an effort was made to choose training sites from almost all the ground survey segments. The mean DN values of each band for forest and non-forest land uses are presented in Figure 02(a) and (b).

Figure 2(a) Distribution of DN in different bands for Forest Cover Types in UMCA

Figure 2(b) Distribution of DN in Different Bands for Non-forest Land Uses in UMCA

Spectral Homogeneity and Class Separability

Spectral uniformity was verified for all the signatures by overlaying them upon the QUILT of unsupervised classification. This process was very useful to avoid spectral confusion to a great extent by avoiding the selection of mixed pixels of different classes. All the individual signatures were combined together to produce class signatures (10 classes) for the classification.

Two evaluations were made on the signature files. Scatterplots were viewed in two dimensional space for each band combination, and the Jeffries-Matusita distance was calculated for all the signatures. Histograms and statistics were also computed for each signature. Spectral confusion was evident between paddy and urban, tea and open forest, and plantation and open forest. Polygons for these signatures were separated and clusters were formed using statistical clustering. Thus the predominant cases of spectral confusion were eliminated and the signature files were purified.

Signatures for paddy and urban classes were less divergent. Further, open forest and tea areas were spectrally confused. Frequency distributions of these classes were bi-modal. The polygons containing the training data of these classes were identified and adjusted by running sequential clustering to derive spectrally divergent, uni-modal class signatures.
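The separability test mentioned above can be sketched from class means and standard deviations. This simplified version assumes Gaussian classes with uncorrelated bands (a diagonal covariance matrix), which is a simplification of the full multivariate form, and it uses the convention in which the Jeffries-Matusita distance ranges from 0 to 2. The function names and values are hypothetical.

```python
from math import exp, log, sqrt


def bhattacharyya_diag(mean1, std1, mean2, std2):
    # Bhattacharyya distance between two Gaussian classes, bands uncorrelated.
    b = 0.0
    for m1, s1, m2, s2 in zip(mean1, std1, mean2, std2):
        v1, v2 = s1 ** 2, s2 ** 2
        v = (v1 + v2) / 2.0  # average variance of the two classes
        b += (m1 - m2) ** 2 / (8.0 * v) + 0.5 * log(v / sqrt(v1 * v2))
    return b


def jm_distance(mean1, std1, mean2, std2):
    # 0 means identical signatures; values near 2 mean well separated classes.
    return 2.0 * (1.0 - exp(-bhattacharyya_diag(mean1, std1, mean2, std2)))


# Hypothetical one-band signatures four DN apart, both with unit spread.
jm_demo = jm_distance([30.0], [1.0], [34.0], [1.0])
```

Pairs with a low distance, such as the confused classes listed above, are the candidates for splitting or re-clustering of their training polygons.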

Digital Classification Algorithm and Results
In this study, a priori probabilities derived from the direct expansion estimates were introduced to the Maximum Likelihood classifier in order to produce an area-weighted classification for the UMC. Figure 04 shows the area-weighted classification of the UMC compared with the NDVI image with 4 classes. The classified image has a noisy appearance due to the presence of many isolated pixels or small groups of pixels whose classification differs from that of most of their neighbours. A comparison of the area estimates from digital classification with the direct expansion estimates is presented in Table 01.
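The effect of the a priori probabilities on the decision rule can be sketched as follows. The sketch assumes Gaussian classes with uncorrelated bands, which simplifies the full covariance form of the maximum likelihood classifier; class names, signatures and priors are hypothetical.

```python
from math import log


def ml_classify(pixel, signatures, priors):
    # signatures: class name -> (band means, band standard deviations).
    best_class, best_score = None, float("-inf")
    for name, (means, stds) in signatures.items():
        # Log prior plus Gaussian log likelihood (constant terms dropped).
        score = log(priors[name])
        for x, m, s in zip(pixel, means, stds):
            score -= log(s) + 0.5 * ((x - m) / s) ** 2
        if score > best_score:
            best_class, best_score = name, score
    return best_class


# Hypothetical one-band signatures; the pixel lies midway between the means.
demo_signatures = {"forest": ([30.0], [5.0]), "tea": ([40.0], [5.0])}
with_forest_prior = ml_classify([35.0], demo_signatures, {"forest": 0.8, "tea": 0.2})
with_tea_prior = ml_classify([35.0], demo_signatures, {"forest": 0.2, "tea": 0.8})
```

For a pixel that is spectrally ambiguous, the prior decides the outcome, which is how area weighting shifts the class totals towards the ground survey proportions.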

Table 01. Comparison of Digital Classification with Direct Expansion Estimates

Land use | Digital classification estimates (km2) a | Direct expansion estimates (km2) b | Percentage difference between estimates (%) *
Dense Forest | | |
Plant. Forest | | |
Open Forest | | |

*percentage difference between two estimates =

Assessment of Classification Accuracy
The Kappa statistic calculated from the results of the land use classification with 10 classes was 0.4534 with a variance of 0.00056, which on commonly used interpretation scales indicates moderate rather than strong agreement. The confusion matrix shows an overall map accuracy of 60.73 percent, while the highest producer's and user's accuracies are 98 percent and 85 percent respectively. The land use classes of dense natural forest and water contributed positively to the overall mapping accuracy. However, the spectral confusion among the urban, grassland, paddy and other crop classes reduced the accuracy.
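The accuracy measures quoted above can be derived from a confusion matrix as in the sketch below. It assumes that rows hold the classified labels and columns hold the ground reference labels; the function name and the two-class matrix are hypothetical, not the study's results.

```python
def accuracy_summary(matrix):
    # matrix[i][j]: pixels classified as class i whose reference class is j.
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    diag = sum(matrix[i][i] for i in range(k))
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    overall = diag / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    users = [matrix[i][i] / row_tot[i] for i in range(k)]      # per classified class
    producers = [matrix[j][j] / col_tot[j] for j in range(k)]  # per reference class
    return overall, kappa, users, producers


overall, kappa, users, producers = accuracy_summary([[40, 10], [10, 40]])
```

Kappa is lower than the overall accuracy because it discounts the agreement that would be expected by chance alone.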

There were obvious discrepancies between the direct expansion estimates and the supervised classification results suggesting possible spectral confusions although the class signatures were properly tested by JM distance and viewing scatterplots. These deviations were prominent, especially in the less frequent land use categories such as water and open forest. This indicates the need to achieve a higher sampling fraction for ground data collection.

For the supervised classification, a selected sample of radiometric values identified as being representative of each spectral class of land use was used as training data to extrapolate the classification to the entire UMC. The supervised classification technique operates upon the assumption that images are formed by spectrally uniform and separable classes. In reality, the radiometric classes recorded by the sensor are not homogeneous, nor are they in all cases unambiguously separable. Information classes are typically not discrete, and the recorded digital value of a spectral class is an average of the reflectance of multiple objects contained within the class. An attempt was made to keep the variability within each class lower than the variability between classes.

In summary, a properly organized field data collection program supported by a complete set of field documentation is a prerequisite for successful supervised image classification. In this study, the unbiased sampling technique provided training data for the classification and determined its a priori probabilities. It also provided direct expansion estimates against which the maximum likelihood classification results could be compared. Finally, it formed the basis of the decision rules for the accuracy assessment. Hence, it can be concluded that a proper sampling methodology for ground data collection contributes substantially to the success of classification.