**Mahesh Pal**

Lecturer, Department of Civil Engineering

National Institute of Technology

Kurukshetra 136119

Haryana (India)

Email: [email protected]

Fax No. 01744 38050

**Paul M Mather**

Prof., School of geography

University of Nottingham

University Park

Nottingham, NG7 2RD, UK

Email: [email protected]

** 1. Introduction **

Much research effort in the past ten years has been devoted to analysis of the performance of artificial neural networks in image classification (Benediktsson et al., 1990; Heermann and Khazenie, 1992). The preferred algorithm is feed-forward multi-layer perceptron using back-propagation, due to its ability to handle any kind of numerical data, and to its freedom from distributional assumptions. Although neural networks may generally be used to classify data at least as accurately as statistical classification approaches a number of studies have reported that users of neural classifiers have problems in setting the choice of various parameters during training (Wilkinson, 1997). The choice of architecture of the network, the sample size for training, learning algorithms, and number of iterations required for training are some of these problems. A new classification system based on statistical learning theory (Vapnik, 1995), called the support vector machine has recently been applied to the problem of remote sensing data classification (Huang et al., 2002; Zhu and Blumberg, 2002; Gualtieri and Cromp, 1998). This technique is said to be independent of the dimensionality of feature space as the main idea behind this classification technique is to separate the classes with a surface that maximise the margin between them, using boundary pixels to create the decision surface. The data points that are closest to the hyperplane are termed “support vectors”. The number of support vectors is thus small as they are points close to the class boundaries (Vapnik, 1995). One major advantage of support vector classifiers is the use of quadratic programming, which provides global minima only. The absence of local minima is a significant difference from the neural network classifiers.

**2. Classification Methods**

Three classification algorithms used for this study are the maximum-likelihood, multi-layer backpropagation neural network and support vector classifier. A brief summary of these classifiers is given below.

**2.1 Maximum-Likelihood Classifier**

The Maximum likelihood Classifier (MLC) is based on the assumption that the members of each class are normally distributed in feature space. MLC is a pixel-based method and can be defined as follows: a pixel with an associated observed feature vector X is assigned to class c if

X Î if > for all j¹ k, j, k = 1,……, N

For multivariate Gaussian distributions is given by:

where and are the sample mean vector and covariance matrix of class k, and is the discriminating function.

Implementation of the MLC involves the estimation of class mean vectors and covariance matrices using training patterns chosen from known examples of each particular class. **2.2 Artificial Neural Network Classifier**

A feed-forward artificial neural network (ANN) is used in this study. This is the most widely used neural network model, and its design consists of one input layer, at least one hidden layer, and one output layer. Each layer is made up of non-linear processing units called neurons, and the connections between neurons in successive layers carry associated weights. Connections are directed and allowed only in the forward direction, e.g. from input to hidden, or from hidden layer to a subsequent hidden or output layer. Non-linear processing is performed by applying an activation function to the summed inputs to a unit. Back-propagation is a gradient-descent algorithm that minimises the error between the output of the training input/output pairs and the actual network outputs (Bishop, 1995). Therefore, a set of input/output pairs is repeatedly presented to the network and the error is propagated from the output back to the input layer. The weights on the backwards path through the network are updated according to an update rule and a learning rate. ANNs are not solely specified by the characteristics of their processing units and the selected training or learning rule. The network topology, i.e. the number of hidden layers, the number of units, and their interconnections, also has an influence on classifier performance. In this study we use the network architecture and number of patterns used for training suggested by Kavzoglu (2001).

**2.3 Support vector classifier**

In the two-class case, a support vector classifier attempts to locate a hyperplane that maximises the distance from the members of each class to the optimal hyperplane. The principle of a support vector classifier is briefly described next.

Assume that the training data with k number of samples is represented by {Xi,Yi} i = 1, X Î Rn, k, where is an n-dimensional vector and Y Î {-1, +1} is the class label. These training patterns are said to be linearly separable if a vector w (which determining the orientation of a discriminating plane) and a scalar b (determine offset of the discriminating plane from origin) can be defined so that inequalities (1) and (2) are satisfied.

The aim is to find a hyperplane which divides the data so that that all the points with the same label lie on the same side of the hyperplane. This amounts to finding w and b so that

If a hyperplane exists that satisfies (3), the two classes is said to be linearly separable. In this case, it is always possible to rescale w and b so that

That is, the distance from the closest point to the hyperplane is 1/||w||. Then (3) can be written as

The hyperplane for which the distance to the closest point is maximal is called the optimal separating hyperplane (OSH) (Vapnik, 1995). As the distance to the closest point is 1/||w||, the OSH can be found by minimising ||w||2 under constraint (4). The minimisation procedure uses Lagrange multipliers and Quadratic Programming (QP) optimisation methods. If li, i = 1,….,k are the non-negative Lagrange multipliers associated with constraint (4), the optimisation problem becomes one of maximising (Osuna et.al. 1997):

under constraints li = 0, i = 1, …..,k.

If is an optimal solution of the maximisation problem (5) then the optimal separating hyperplane can be expressed as:

The support vectors are the points for which > 0 when the equality in (4) holds.

If the data are not linearly separable, a slack variable , i =1,……,k can be introduced with = 0 (Cortes and Vapnik 1995) such that (4) can be written as

and the solution to find a generalised OSH, also called a soft margin hyperplane, can be obtained using the conditions

The first term in (8) is same as in as in the linearly separable case, and controls the learning capacity, while the second term controls the number of misclassified points. The parameter C is chosen by the user. Larger values of C imply the assignment of a higher penalty to errors.

Where it is not possible to have a hyperplane defined by linear equations on the training data, the techniques described above for linearly separable data can be extended to allow for non-linear decision surfaces. A technique introduced by Boser et al. (1992) maps input data into a high dimensional feature space through some nonlinear mapping. The transformation to a higher dimensional space spreads the data out in a way that facilitates the finding of linear hyperplanes. After replacing x by its mapping in the feature space , equation (5) can be written as:

To reduce computational demands in feature space, it is convenient to introduce the concept of the kernel function K (Cristianini and Shawe-Taylor, 2000; Cortes and Vapnik 1995) such that:

Then, to solve equation (11) only the kernel function is computed in place of computing , which could be computationally expensive. A number of kernel functions are used for support vector classifier. Details of some kernel functions and their parameters used with SVM classifiers are discussed by Vapnik (1995). SVM was initially designed for binary (two-class) problems. When dealing with several classes, an appropriate multi-class method is needed. A number of methods are suggested in literature to create multi-class classifiers using two-class methods (Hsu and Lin, 2002). In this study, a “one against one” approach (Knerr et al., 1990) (In this method, all possible two-class classifiers are evaluated from the training set of n classes, each classifier being trained on only two out of n classes. There would be a total of n (n-1)/2 classifiers. Applying each classifier to the vectors of the test data gives one vote to the winning class. The pixel is given the label of the class with most votes. To generate multi-class SVMs and a radial basis kernel function (defined as ) was used.

**3. Data**

The first of the two study areas used in the work reported here are located near the town of Littleport in eastern England. The second is a wetland area of the La Mancha region of Spain. For the Littleport area, ETM+ data acquired on 19th June 2000 is used. The classification problem involves the identification of seven land cover types (wheat, potato, sugar beet, onion, peas, lettuce and beans) for the ETM+ data set. For the La Mancha study area, hyperspectral data acquired on 29th June 2000 by the DAIS 7915 airborne imaging spectrometer were available. Eight different land cover types (wheat, water body, dry salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built up area) were specified. The DAIS data show moderate to severe striping problems in the optical infrared region between bands 41 and 72. Initially, the first 72 bands in the wavelength range 0.4 µm to 2.5 µm were selected. All of these bands were examined visually to determine the severity of striping. Seven bands displaying very severe striping problems (bands 41 – 42 and 68 – 72) were removed from the data set. The striping in the remaining bands was removed by automatically enhancing the Fourier transform of each image (Cannon et al., 1983; Srinivasan et al., 1988). The input image is first divided into overlapping 128-by-128-pixel blocks. The Fourier transform of each block is calculated and the log-magnitudes of each FFT block are then averaged. The averaging process removes all frequency domain quantities except those which are present in each block; i.e., some sort of periodic interference. The average power spectrum is then used as a filter to adjust the FFT of the entire image. When an inverse Fourier transform is performed, the result is an image with periodic noise eliminated or significantly reduced.

** 4. Result and discussions **

Random sampling was used to collect the training and test pixels for both ETM+ and DAIS data set. Total selected pixels were divided into two parts, one for training and one for testing the classifiers, so as to remove any possible bias resulting from the use of same set of pixels for both the testing and training phases. A total of 2700 training pixels and 2037 test pixels for ETM+ data and a total of 1600 training (200 pixels/class) and 3800 test pixels were used for DAIS data. Kappa values and overall classification accuracies are calculated for each of the classifiers used in this study with ETM+ data, while overall accuracy is calculated for the DAIS hyperspectral data. The Z statistic is also used to test the significance of apparent differences between the three classification algorithms, using ETM+ data. For this study, a standard back-propagation neural classifier was used. All user-defined parameters are set as recommended by Kavzoglu (2001).

Like neural network classifiers the performance of support vector classifier depends on a number of user-defined parameters which may influence the final classification accuracy. For this study a radial basis kernel with g (kernel specific parameter) value as two and C = 5000 is used for both data sets. The values of these parameters were chosen after a number of trials and the same parameters are used with DAIS data set. This study also suggests that, in comparison with the NN classifier, it is easier to fix the values of the user defined parameters for SVM.

As mentioned earlier, SVM involves in solving a quadratic programming problem with linear equality and inequality constraints which has only a global optimum. In comparison the presence of local minima is a significant problem in training the neural network classifiers.

Results obtained using ETM+ data suggests that support vector classifier perform well in comparison with neural and statistical classifier (Tables 1 and 2).

** Table 1. **Classification accuracies achieved with different classifiers.

Classifier used | Accuracy (%) | Kappa value |

Maximum likelihood | 82.9 | 0.80 |

Neural network | 85.1 | 0.83 |

Support vector | 87.9 | 0.87 |

** Table 2. **Calculated Z values for comparison between different classification systems with ETM+ data. Shaded values indicate significant improvements in the performance of first classifier at the 95% confidence level (Z critical value = 1.96).

Classifiers | Z value |

SVM vs. Neural network | 2.46 |

SVM vs. maximum likelihood | 5.45 |

Further, the training time taken by support vector classifier is 0.30 minutes in comparison of 58 minutes by the NN classifier on a dual processor sun machine. Results suggest that support vector classifier performance is statistically significant in comparison with NN and ML classifiers. To study the behaviour of support vector classifier with DAIS Hyperspectral data a total of sixty five features (bands) was used, a total of 65 features (spectral bands) were available, as seven features with severe striping were discarded, as explained above. The initial number of features used was five, and the experiment was repeated with 10, 15, …, 65 features, giving a total of 13 experiments. Figure 1 suggests that, in comparison to the other classifiers, the performance of the support vector classifier is quite good with small training data set irrespective of the number of features used.

** Figure 1. **Classification accuracies obtained with DAIS hyperspectral data using different classification algorithms. The training data set size is 200 pixels/class.

Results obtained from analysis of the hyperspectral data suggest that classification accuracy using SVM increases almost continuously as a function of the number of features, with the size of the training data set held constant, whereas the overall classification accuracies produced by the ML, DT and NN classifiers decline slightly once the number of bands exceeds 50 or so. Thus, suggesting that the ‘Hughes phenomenon’ (Hughes, 1968) of decreasing classifier performance as the dimensionality of the feature space increases beyond a threshold is not supported by the experiments using the support vector classifiers.

**5. Conclusions**

The objective of this study was to assess the utility of support vector classifiers for land cover classification using multi- and hyper-spectral data sets in comparison with most frequently used ML and NN classifiers. The results presented above suggest several conclusions. First the support vector classifier outperforms ML and NN classifiers in term of classification accuracy with both data sets. Several user-defined parameters affects the performance of the support vector classifier, but this study suggests that it is easier to find appropriate values for these parameters than it is for parameters defining the NN classifier. The level of classification accuracy achieved with the support vector classifier is better than both ML and NN classifiers when used with small number of training data.

**Acknowledgement**

The DAIS data were collected and processed by DLR and were kindly made available by Prof. J. Gumuzzio of the Autonomous University of Madrid. Computing facilities were provided by the School of Geography, University of Nottingham.

**References**

- Benediktsson, J. A., Swain, P. H., and Erase, O. K., 1990, Neural network approaches versus statistical methods in classification of multisource remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 28,540-551.
- Bishop, C. M. (1995) Neural Networks for Pattern Recognition. Oxford: Clarendon Press
- Boser, B., Guyon, I., and Vapnik, V. N., 1992, A training algorithm for optimal margin classifiers. Proceedings of 5th Annual Workshop on Computer Learning Theory, Pittsburgh, PA: ACM, 144-152.
- Cannon, M., Lehar, A., and Preston, F., 1983, Background pattern removal by power spectral filtering. Applied Optics, 22(6), 777-779.
- Chang, C., and Lin, C., 2001, LIBSVM: A Library for Support Vector Machines. Computer Science and Information Engineering, National Taiwan University, Taiwan, https://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Cortes, C., and Vapnik, V. N., 1995, Support vector networks. Machine Learning, 20, 273- 297.
- Cristianini, N., and Shawe-Taylor, J., 2000, An Introduction to Support Vector Machines. London, Cambridge University Press.
- Gualtieri, J. A. and Cromp, R. F., 1998, Support vector machines for hyperspectral remote sensing classification. Proceedings of the 27th AIPR Workshop: Advances in Computer Assisted Recognition, Washington, DC, October 27, 221-232.
- Heerman, P. D., and Khazenie, N., 1992, Classification of multispectral remote sensing data using a back propagation neural network. IEEE Transactions on Geoscience and Remote Sensing, 30, 81-88.
- Hsu, C. W., and Lin, C. -J., 2002, A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13, 415-425.
- Huang, C., Davis, L. S. and Townshend, J. R. G., 2002, An assessment of support vector machines for land cover classification. International Journal of Remote Sensing, 23, 725-749.
- Hughes, G. F., 1968, On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, IT-14, 55-63.
- Kavzoglu, T., 2001, An Investigation of the Design and Use of Feed-forward Artificial Neural Networks in the Classification of Remotely Sensed Images. PhD thesis. School of Geography, The University of Nottingham, Nottingham, UK.
- Knerr, S., Personnaz, L., and Dreyfus, G., 1990, Single-layer learning revisited: A stepwise procedure for building and training neural network. Neurocomputing: Algorithms, Architectures and Applications, NATO ASI, Berlin: Springer-Verlag.
- Osuna, E. E., Freund, R., and Girosi, F., 1997, Support vector machines: training and applications. A. I. Memo No. 1602, CBCL paper No. 144, Artificial Intelligence laboratory, Massachusetts Institute of Technology, ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1602.pdf
- Srinivasan, R., Cannon, M., and White, J., 1988, Landsat data destriping using power spectral filtering. Optical Engineering, 27, 939-943.
- Vapnik, V. N., 1995, The Nature of Statistical Learning Theory. New York: Springer-Verlag.
- Wilkinson, G. G. (1997) Open questions in neurocomputing for earth observation. In Neuro-Computation in Remote Sensing Data Analysis, edited by I. Kanellopoulos, G. G. Wilkinson, F. Roli and J. Austin. London: Springer, 3-13.
- Zhu, G. and Blumberg, D. G., 2002, Classification using ASTER data and SVM algorithms; The case study of Beer Sheva, Israel. Remote Sensing of Environment, 80, 233-240.