Mrinal Kanti Ghose
Scientist, Regional Remote Sensing Service Centre
ISRO, Dept. of Space, Kharagpur
GIS (Geographic Information System) databases are ever-evolving entities, from their humble beginnings as paper maps, through the digital conversion process, to the data maintenance phase. GIS technology must be built on geographic data that is specific and reliable and that represents as closely as possible the spatial world we live in; if this is neglected, the usefulness of the technology is short-lived. To maximise the quality of GIS databases there should exist a well-designed Quality Assurance (QA) plan that is strategically integrated through the entire life cycle of the GIS project.
Until quite recently, little attention was paid to the problems caused by error, inaccuracy, and imprecision in spatial data sets. This situation has changed substantially in recent years. It is now generally recognised that error, inaccuracy, and imprecision can "make or break" many types of GIS project. The key point is that even though error can disrupt GIS analyses, there are ways to keep error to a minimum through careful planning and methods for estimating its effects on GIS solutions. Awareness of the problem of error has also had the useful benefit of making GIS practitioners more sensitive to potential limitations of GIS to reach impossibly accurate and precise solutions.
The main purpose of this paper is to alert GIS analysts and potential GIS users to some methods that are especially suited to assessing the quality of GIS databases and digital maps/coverages. An attempt is also made to present a set of guidelines intended to establish the minimum acceptable level of quality that should be adhered to by all projects and users throughout the life cycle of a GIS.
Today, Geographical Information Systems have reached a level of operationalisation and are transitioning from an era of promotion to one of opportunities for commercial development of application services. GIS databases are ever-evolving entities. From their humble beginnings as paper maps, through the digital conversion process, to the data maintenance phase, GIS data never really stops changing. It is now generally recognised that errors, inaccuracies, and imprecision left unchecked can make the results of a GIS analysis almost worthless. Unfortunately, every time a new data set is imported, the GIS also inherits its errors. These may combine and mix with the errors already in the database in unpredictable ways. The key point is that even though error can disrupt GIS analyses, there are ways to keep error to a minimum through careful planning and methods for estimating its effects on GIS solutions. Awareness of the problem of error has also had the useful benefit of making GIS practitioners more sensitive to potential limitations of GIS to reach impossibly accurate and precise solutions.
The key to developing and implementing a successful GIS project is a well-designed Quality Assurance (QA) plan that is integrated with both the data conversion and maintenance phases of the GIS project. The fundamentals of Quality Assurance never change; completeness, validity, logical consistency, physical consistency, referential integrity and positional accuracy are the cornerstones of the QA plan. To maximise the quality of GIS databases there should exist a well-designed Quality Assurance plan that is strategically integrated with all facets of the GIS project.
Objective of the paper
In this paper an attempt has been made to make a systematic study of the various quality parameters of a GIS and their measurement in a real-life environment. This paper also presents a set of guidelines intended to establish the minimum acceptable level of accuracy assessment that should be adhered to by all projects and users. The main purpose of this paper is to present an overview of some methods that are especially suited to assessing the quality of GIS databases and digital maps/coverages. The issues involved in the development and implementation of an integrated GIS Quality Assurance Plan are also discussed.
Quality is commonly used to indicate the superiority of a manufactured good or to indicate a high degree of craftsmanship or artistry. Quality is a desirable goal achieved through management and control of the production process (statistical quality control). Many of the same issues apply to the quality of GIS databases, since a database is the result of a production process, and the reliability of the process imparts value and utility to the database [ 4 ].
Spatial Data Quality
Data quality is the degree of excellence in a database. It can simply be defined as the fitness for use of a specific data set. It depends on the scale, accuracy, and extent of the data set, as well as on the quality of the other data sets to be used with it. The conventional view is that geographical data is "spatial", but a better definition of geographical data should include the three dimensions of Space, Time and Theme (where-when-what). These three dimensions are the basis for all geographical observation ( 1 ). Data quality also contains several components such as accuracy, precision, consistency and completeness. The result is a matrix as defined below.
Table 1: Matrix showing geographical dimensions & Quality
The three components of space, theme, and time are covered by the first three Primary Parameters. The last two indicate: on the one hand if the data set is complete in terms of the queries that one wants to answer with the help of this data set and on the other hand if the representation of the data is consistent within itself. If all possible accuracy values have to be evaluated the costs of information on accuracy would be too high and thus not affordable [ 1 ].
The following sections take a closer look at each of the five Primary Parameters pertaining to GIS quality and at their associated sub-parameters.
Accuracy is the degree to which information on a map or in a digital database matches Actual/ True or Accepted values. The discrepancy between the encoded and actual value of a particular attribute for a given entity is defined as an "error". Accuracy is an issue pertaining to the quality of data and the number of errors contained in a data set or map. In discussing a GIS database, it is possible to consider horizontal and vertical accuracy with respect to geographic position, as well as attribute, conceptual, and logical accuracy. The level of accuracy required for particular applications varies greatly. Highly accurate data can be very difficult and costly to produce and compile. Accuracy is always a relative measure, since it is always measured relative to the specification. To judge fitness-for-use, one must judge the data relative to the specification, and also consider the limitations of the specification itself [ 1 ].
Table 2: Example of E-A-V model.

| Name | Width (ft) | Cover | Speed (kph) |
|---|---|---|---|
| Belmont Rd. | 36 | asphalt | 60 |
| Latrobe St. | 22 | concrete | 50 |
| etc… | | | |
The definition of accuracy is based on the entity-attribute-value (E-A-V) model (Table 2):
Entities = real-world phenomena
Attribute = relevant property
Values = Quantitative/qualitative measurements
Spatial accuracy is the accuracy of the spatial component of the database. The metrics used depend on the dimensionality of the entities under consideration. For points, accuracy is defined in terms of the distance between the encoded location and "actual" location. Error can be defined in various dimensions: x, y, z, horizontal, vertical, total. Metrics of error are extensions of classical statistical measures such as mean error, RMSE or root mean squared error, inference tests, confidence limits, etc.
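The point metrics described above can be illustrated with a short calculation. The following is a minimal sketch, assuming paired lists of encoded and "actual" (x, y) coordinates in the same units; the function names and sample coordinates are hypothetical and serve only to show how mean error and RMSE are obtained from per-point distances.

```python
import math

def positional_errors(encoded, actual):
    """Horizontal distance between each encoded point and its "actual" location."""
    return [math.hypot(ex - ax, ey - ay)
            for (ex, ey), (ax, ay) in zip(encoded, actual)]

def mean_error(errors):
    """Arithmetic mean of the error distances."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Root mean squared error of the error distances."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical check points (map units), not taken from any real survey
encoded = [(100.0, 200.0), (150.0, 250.0), (300.0, 400.0)]
actual = [(103.0, 204.0), (150.0, 250.0), (300.0, 395.0)]

errors = positional_errors(encoded, actual)
print("errors:", errors)
print("mean error:", mean_error(errors))
print("RMSE:", rmse(errors))
```

The same functions can be applied to a single axis (x, y or z) by passing one-dimensional differences instead of distances.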
For lines and areas, the situation is more complex, because error is a mixture of positional error (error in locating well-defined points along the line) and generalization error (error in the points selected to represent the line) ( 3 ). The epsilon band is usually used to define a zone of uncertainty around the encoded line, within which the "actual" line exists with some probability. However, there is little agreement on the shape of the band, either planimetrically or in cross-section. The positional error of an arbitrary object defined within a GIS data layer is described by one of the Primary Parameters, Positional Accuracy.
Temporal accuracy is the agreement between the encoded and "actual" temporal coordinates for an entity. Temporal coordinates are often only implicit in geographical data, e.g., a time stamp indicating that the entity was valid at some time, and often this is applied to the entire database. More realistically, temporal coordinates are the temporal limits within which the entity is valid. Temporal accuracy is not the same as "currentness" (or up-to-dateness), which is actually an assessment of how well the database specification meets the needs of a particular application. Temporal accuracy becomes relevant when the GIS data set has a temporal dimension, so that the spatial information takes the form x, y, z, t. For the error model it is then necessary to examine this additional coordinate for dependencies on the other three, so that any existing correlation is taken into account.
Thematic GIS information is generated by collecting properties of spatial data and assigning them to stored objects or areas. This may lead to errors in two ways: first, through misclassification, and second, through several different data classes occurring within the same spatial object. In some cases one topic must be favoured to make the presentation meaningful at all (for example, the detection of water reservoirs (oases) in a desert area).
Table 3: Example of classification error matrix

| | Water | Soil | Veg | Total |
|---|---|---|---|---|
| Water | 25 | 2 | 3 | 30 |
| Soil | 0 | 38 | 2 | 40 |
| Veg | 1 | 4 | 25 | 30 |
| Total | 26 | 44 | 30 | 100 |
Thematic accuracy ( 6 ) is the accuracy of the attribute values encoded in a database. The metrics used depend on the measurement scale of the data. Quantitative data (e.g., precipitation) can be treated like a z-coordinate (elevation) and assessed using metrics normally used for vertical error (such as the RMSE). Qualitative data (e.g., land use/land cover) is normally assessed using a cross-tabulation of encoded and "actual" classes at a sample of locations. This produces a classification error matrix (Table 3).
The element in row i, column j of the matrix is the number of sample locations assigned to class i but actually belonging to class j. The sum of the main diagonal divided by the number of samples is a simple measure of overall accuracy. An error of omission means a sample that has been omitted from its actual class. An error of commission means a sample that has been included in the wrong class. Every error of omission is also an error of commission.
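These measures can be worked through for the values of Table 3. The sketch below assumes that rows hold the encoded (assigned) class and columns the "actual" class, as stated in the text; the function names are illustrative only.

```python
# Classification error matrix from Table 3 (rows: assigned class, columns: actual class)
classes = ["Water", "Soil", "Veg"]
matrix = [
    [25, 2, 3],
    [0, 38, 2],
    [1, 4, 25],
]

def overall_accuracy(m):
    """Sum of the main diagonal divided by the number of samples."""
    total = sum(sum(row) for row in m)
    diagonal = sum(m[i][i] for i in range(len(m)))
    return diagonal / total

def commission_errors(m):
    """Per class: samples assigned to the class (row) that actually belong elsewhere."""
    return [sum(row) - row[i] for i, row in enumerate(m)]

def omission_errors(m):
    """Per class: samples actually in the class (column) that were assigned elsewhere."""
    return [sum(row[j] for row in m) - m[j][j] for j in range(len(m))]

print("overall accuracy:", overall_accuracy(matrix))
for name, com, om in zip(classes, commission_errors(matrix), omission_errors(matrix)):
    print(name, "commission:", com, "omission:", om)
```

The two error totals are equal, which reflects the statement that every error of omission is also an error of commission.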
For all raster representations the unit for the evaluation of thematic accuracy is the pixel itself, and for vector-based representations it is the boundary of an object, i.e. the polygon (or, to be more exact, its points). Another possibility for presenting thematic accuracy to the user is to attach an accuracy value for the attribute to each object or even to each pixel.
Resolution (or precision) refers to the amount of detail that can be discerned in space, time or theme. Resolution is always finite because no measurement system is infinitely precise, and because databases are intentionally generalized to reduce detail [ 2c ]. Resolution is an aspect of the database specification that determines how useful a given database may be for a particular application. Resolution is linked with accuracy, since the level of resolution affects the database specification against which accuracy is assessed. Two databases with the same overall accuracy levels but different levels of resolution do not have the same quality; the database with the lower resolution has less demanding accuracy requirements. For example, thematic accuracy will tend to be higher for general land use/land cover classes like "urban" than for specific classes like "residential". Resolution is distinct from the spatial sampling rate, although the two are often confused with each other. Sampling rate refers to the distance between samples, while resolution refers to the size of the sample units.
Spatial resolution of raster data refers to the linear dimension of a cell, whereas for vector data it is the minimum mapping unit size. Temporal resolution is the length of the sampling interval, and it affects the minimum duration of an event that is discernible. For example, the shorter the shutter speed of a camera, the higher the temporal resolution (other factors being equal). Thematic resolution refers to the precision of the measurements or categories for a particular theme. For categorical data, resolution is the fineness of category definitions (e.g., "urban" vs. "residential" and "commercial"). For quantitative data, thematic resolution is analogous to spatial resolution in the z-dimension (i.e., the degree to which small differences in the quantitative attribute can be discerned).
Consistency refers to the absence of apparent contradictions and is a measure of the internal validity of a database. Spatial consistency includes topological consistency, or conformance to topological rules, e.g., all one-dimensional objects must intersect at a zero-dimensional object [ 2b ]. Temporal consistency is related to temporal topology, e.g., the constraint that only one event can occur at a given location at a given time [ 3 ]. Thematic consistency refers to a lack of contradictions in redundant thematic attributes. For example, attribute values for population, area, and population density must agree for all entities. Attribute redundancy is one way in which consistency can be assessed. The absence of inconsistencies does not necessarily imply that the data are accurate. Logical consistency covers topological aspects on the one hand and the validity ranges of values occurring in the data set on the other, and it applies to spatial, thematic, and temporal parameters. As a measure of topological consistency one can investigate, for example, the correctness of polygons.
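The population, area and density example of thematic consistency lends itself to a simple automated test. The following is a minimal sketch, assuming each entity is held as a record with hypothetical field names (`population`, `area_km2`, `density`), a non-zero area, and a relative tolerance chosen by the analyst; none of these are prescribed by the paper.

```python
import math

def density_inconsistencies(records, rel_tol=0.01):
    """Names of entities whose stored density disagrees with population / area."""
    flagged = []
    for rec in records:
        # Recompute the redundant attribute from the other two (area assumed > 0)
        expected = rec["population"] / rec["area_km2"]
        if not math.isclose(rec["density"], expected, rel_tol=rel_tol):
            flagged.append(rec["name"])
    return flagged

# Hypothetical records for illustration only
records = [
    {"name": "District A", "population": 12000, "area_km2": 40.0, "density": 300.0},
    {"name": "District B", "population": 5000, "area_km2": 25.0, "density": 250.0},
]

print(density_inconsistencies(records))
```

A record that passes this test is internally consistent but, as noted above, not necessarily accurate.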
Completeness refers to a lack of errors of omission in a database. It is assessed relative to the database specification, which defines the desired degree of generalization and abstraction (selective omission). There are two kinds of completeness [ 2a ]. "Data completeness" is a measurable error of omission observed between the database and the specification. Even highly generalized databases can be "data complete" if they contain all of the objects described in the specification. A database is "model complete" if its specification is appropriate for a given application. Completeness informs the user about the spatial, thematic, and temporal coverage capabilities of the data according to the predefined purposes. The two Measures Omission and Commission are considered to be sufficient to describe how well a data set fulfills the demands of the user.
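Omission and commission can be counted by comparing the objects required by the specification with those present in the database. The sketch below assumes that both are available as lists of object identifiers; the identifiers and the rate definitions (omissions relative to the specification, commissions relative to the database) are assumptions made for illustration.

```python
def completeness_report(specified_ids, database_ids):
    """Compare objects required by the specification with objects in the database."""
    specified, present = set(specified_ids), set(database_ids)
    omitted = specified - present      # required but missing from the database
    committed = present - specified    # present but not called for by the specification
    return {
        "omission": sorted(omitted),
        "commission": sorted(committed),
        "omission_rate": len(omitted) / len(specified),
        "commission_rate": len(committed) / len(present),
    }

# Hypothetical road identifiers
report = completeness_report(["r1", "r2", "r3", "r4"], ["r1", "r2", "r4", "r9"])
print(report)
```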
Quality Assurance for GIS Life cycle
The following section discusses the general stages of GIS database creation, from its start as an existing map product to its final stage as a seamless, continually maintained database. At each stage the integration of the QA plan within the process is discussed.
Data Preparation Phase
- Map Preparation: The first step in creating quality GIS databases from paper maps is map preparation, sometimes referred to as map scrub. It is the most cost-effective phase in which to detect and correct errors.
- Edge-match Review: Edge features (those that cross as well as those that are near) must be reviewed with respect to logical and physical consistency requirements as well as noted for positional accuracy and duplication. The temporal factor must also be taken into account.
- Control Review: Establishing coordinate control for the database is the most important step in the data conversion process. Whether using benchmarks, corner tic marks or other surveyed locations, these must be visible and identifiable on the map source. Each control point should be reviewed to make sure it has a known real world location.
- When a GIS layer is compiled from multiple map sources, there are bound to be conflicts between the original map data. An example of multi-source conflict resolution is an electrical layer compiled from two maps, an overhead map and an underground map, where a review for duplicated features, conflicting positional locations and conflicting feature attributes is essential to reduce error.
Data Entry Phase
- Digitising: Digitising is the process of capturing spatial features (points, lines, and areas) by converting them into X, Y coordinates. A single coordinate represents a point, a string of coordinates represents a line, and one or more lines outline an area. Digitising may be carried out either manually or automatically.
- Data Conversion: Data conversion usually introduces two kinds of error into the database, viz. random and systematic error. Random error will always be a part of any form of data, whether analog or digital, but it can be reduced by tight controls and automated procedures for data entry. Systematic error usually stems from a procedural problem; once the procedure is corrected, the systematic error usually clears up. The key to correcting both random and systematic error is a tightly integrated plan that checks both automatically and visually at various stages in the conversion cycle. A short feedback loop between the quality assurance and conversion teams speeds the correction of these problems. Registering paper maps or images to known coordinate locations introduces registration error, and hence each feature digitised into the database will carry an error equivalent to the RMS (Root Mean Square) error of the registration. Standards must be set and adhered to during the data conversion process to minimise the RMS error as much as possible. High RMS errors, in some cases, point to a systematic error such as poor scanner or digitiser calibration.
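One way to see whether a high RMS error is dominated by a systematic shift or by random scatter is to split the control-point residuals into their mean and their spread. The following is a minimal sketch under that assumption, for residuals along a single axis; the function name and sample residuals are hypothetical, and the decomposition is a standard statistical identity rather than a procedure taken from the paper.

```python
import math
from statistics import mean, pstdev

def decompose_residuals(residuals):
    """Split 1-D control-point residuals into a mean shift, a spread and the RMS error."""
    bias = mean(residuals)          # systematic component (average shift)
    spread = pstdev(residuals)      # random component (population standard deviation)
    rms = math.sqrt(mean(r * r for r in residuals))
    return bias, spread, rms

# Hypothetical x-residuals at five control points (map units)
residuals_x = [2.1, 1.9, 2.3, 1.7, 2.0]
bias, spread, rms = decompose_residuals(residuals_x)

print("bias:", bias, "spread:", spread, "RMS:", rms)
# RMS^2 equals bias^2 + spread^2, so a bias much larger than the spread
# suggests a systematic cause (for example calibration) rather than random error.
```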
Data Editing Phase
Modern computer systems have data editing capabilities that allow errors to be detected, and where possible corrected, during data entry. After digitisation has been completed, a GIS requires the user to perform an operation that builds topology. This operation should permit the user to identify the types of entity errors in the coverage. Some of them will be pointed out directly; others must be interpreted by looking at database statistics concerning the numbers and types of entities, or by inspecting the graphics displayed on the screen for errors the GIS is not designed to detect. As a complementary procedure after digitising, the following should be checked:
- All entities that should have been entered are present, in the right place, and of the correct shape and size;
- No extra entities have been digitised;
- All entities that are supposed to be connected to each other are connected;
- All polygons have only a single label point to identify them;
- All entities are within the outside boundary identified with registration marks.
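Some of the checks listed above can be automated once the digitised entities are available as coordinate lists. The sketch below is illustrative only: it assumes polygons are stored as vertex rings keyed by a hypothetical identifier, that each label point has already been assigned to the polygon it falls in, and that arcs are given as pairs of end points. It does not reproduce the topology-building operation of any particular GIS.

```python
from collections import Counter

def unclosed_polygons(polygons):
    """IDs of polygons whose ring does not return to its starting vertex."""
    return [pid for pid, ring in polygons.items()
            if len(ring) < 4 or ring[0] != ring[-1]]

def label_count_errors(polygons, label_polygon_ids):
    """IDs of polygons that do not have exactly one label point."""
    counts = Counter(label_polygon_ids)
    return [pid for pid in polygons if counts[pid] != 1]

def duplicate_arcs(arcs):
    """Arcs entered more than once, ignoring the direction of digitising."""
    counts = Counter(tuple(sorted(arc)) for arc in arcs)
    return [arc for arc, n in counts.items() if n > 1]

# Hypothetical coverage: P2 is not closed, P1 has two labels, P2 has none
polygons = {
    "P1": [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)],
    "P2": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
label_polygon_ids = ["P1", "P1"]
arcs = [((0, 0), (4, 0)), ((4, 0), (0, 0)), ((4, 0), (4, 4))]

print(unclosed_polygons(polygons))
print(label_count_errors(polygons, label_polygon_ids))
print(duplicate_arcs(arcs))
```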
Data Validation Phase
Validity is a measure of the attribute accuracy of the database. Each attribute must have a defined domain and range. Database validation is the process of determining whether database values are reasonably accurate, complete, and logically consistent with respect to the intended use of the data. Validation will often consist of several steps, including logical checks, accuracy assessments, and error analysis. Spatial and thematic accuracy are usually measured against a known standard, whereas error analysis involves the evaluation of data with regard to measurement uncertainty, and includes source errors, use errors, and process errors.
- Logical cartographic consistency
  - Closed polygons
  - One label for each polygon
  - No duplicate arcs
  - No overshoot arcs (dangles)
  - Similar features use similar symbols
- Logical attribute consistency
  - Values within logical range (look for illegal values)
    - Dates (e.g. month less than or equal to 12)
    - Time of day less than 24:00 hours
    - Nominal data illegally re-sampled into ratio data
    - Precipitation values equal to or greater than zero
  - Linkage of features with attribute fields
- Logical query and statistical tests of the spatial and attribute data (look for unlikely values)
  - Points placed in distant locations on the map
  - Elevations with reasonable values
- Ground truth or comparison to known standards
  - Sample ground areas and compare to database
  - Evaluate spatial accuracy
  - Evaluate attribute accuracy
- Completeness of data ("model type" completeness, relative to user needs): in a map of roads, are all the roads important to the user included?
- Sensitivity analysis
  - Change the data, and see if those changes affect the results for your application.
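The range checks under logical attribute consistency are straightforward to automate. The following is a minimal sketch, assuming each record carries hypothetical fields `month`, `time` (an hours, minutes pair) and `precip_mm`; the field names and messages are illustrative and would follow the actual database design in practice.

```python
def validate_record(rec):
    """Return a list of logical-range problems found in one attribute record."""
    problems = []
    if not 1 <= rec["month"] <= 12:
        problems.append("month out of range")
    hours, minutes = rec["time"]
    if not (0 <= hours < 24 and 0 <= minutes < 60):
        problems.append("time of day out of range")
    if rec["precip_mm"] < 0:
        problems.append("negative precipitation")
    return problems

# Hypothetical records: the first is legal, the second violates all three rules
good = {"month": 7, "time": (14, 30), "precip_mm": 12.5}
bad = {"month": 13, "time": (24, 0), "precip_mm": -1.0}

print(validate_record(good))
print(validate_record(bad))
```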
Quality Assurance Plans
Quality Assurance plans can broadly be classified into two categories, viz. Visual QA and Automated QA, which are discussed below.
Visual QA : Visual QA is meant to detect not only random error such as a misspelled piece of text, but also systematic error such as an overall shift in the data caused by an unusually high RMS value. Existence and absence of data as well as positional accuracy can only be checked with a visual inspection. The hard copy plotting of data is the best method for checking for missing features, misplaced features and registration to the original source. On-screen views are an excellent way to verify that edits to the database were made correctly. Visual inspection should occur during initial data capture, at feature attribution, and then at final data delivery. At initial data capture the data should be inspected for missing or misplaced features, as well as alignment problems that could point to a systematic error. In either case each error type needs to be evaluated along with the process that created the data in order to determine the appropriate root cause and solution.
Automated QA : Visual inspection of GIS data is reinforced by automated QA methods. GIS databases can be automatically checked for adherence to database design, attribute accuracy, logical consistency and referential integrity. Automated QA must occur in conjunction with visual inspection. The goal of the automated quality assurance is to quickly inspect very large amounts of data and report inconsistencies in the database that may not appear in the visual inspection process. Both random and systematic errors are detected using automated QA procedures. Once again the feedback loop has to be short in order to correct any flawed data conversion processes.
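As an illustration of an automated check, referential integrity between spatial features and their attribute rows can be tested by comparing keys. This is a minimal sketch, assuming features are identified by an ID and each attribute row stores that ID under a hypothetical `feature_id` key; it is not tied to any particular GIS or database product.

```python
def referential_integrity(feature_ids, attribute_rows, key="feature_id"):
    """Report attribute rows without a feature and features without attributes."""
    features = set(feature_ids)
    referenced = {row[key] for row in attribute_rows}
    return {
        "orphan_rows": sorted(referenced - features),
        "features_without_attributes": sorted(features - referenced),
    }

# Hypothetical data: row 4 points at no feature; features 2 and 3 have no row
feature_ids = [1, 2, 3]
attribute_rows = [
    {"feature_id": 1, "cover": "asphalt"},
    {"feature_id": 4, "cover": "concrete"},
]

result = referential_integrity(feature_ids, attribute_rows)
print(result)
```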
Defining acceptance criteria is probably one of the most troublesome parts of a GIS project, owing to the non-availability of standards for acceptable error or of any rejection criteria. Since GIS coverages are application specific, these criteria can best be defined on the basis of the existing data model and database design as well as user needs and application requirements. Project schedule, budget and human resources all play a role in determining data acceptance. Further, accepting data can be confusing without strict acceptance rules. A GIS data set may have 'm' features of 'n' attributes each. A single feature having a single incorrect attribute may be counted in different ways, such as:
- 1 error, if it does not affect the other (m - 1) features in any way.
- m errors, if it affects all other features.
Each attribute should be reviewed to determine if it is a critical attribute and then weighted accordingly. Additionally, the cartographic aspect of data acceptance should be considered. A feature's position, rotation and scaling must also be taken into account when calculating the percentage of error, not only its existence or absence.
Once the acceptable percentage of error and the weighting scheme have been chosen, methods of error detection should be established. The methods of error detection for data acceptance are the same as those employed during the data conversion phase. Check plots should be compared to the original sources and automated database checking tools should be applied to the delivered data. Very large databases may require random sampling for data acceptance.
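The weighting scheme and random sampling mentioned above can be sketched as follows. The example assumes that each inspected feature is recorded as a mapping from attribute name to a correct/incorrect flag and that critical attributes carry larger weights; the attribute names, weights and sample size are hypothetical, and the acceptable percentage itself remains a project decision.

```python
import random

def weighted_error_percent(features, weights):
    """Weighted percentage of incorrect attribute values among inspected features."""
    total = sum(weights[attr] for feat in features for attr in feat)
    wrong = sum(weights[attr] for feat in features
                for attr, correct in feat.items() if not correct)
    return 100.0 * wrong / total

def acceptance_sample(feature_ids, sample_size, seed=0):
    """Draw a reproducible random sample of features for acceptance checking."""
    rng = random.Random(seed)
    return rng.sample(list(feature_ids), sample_size)

# Hypothetical weights: "owner" is treated as a critical attribute
weights = {"owner": 3.0, "material": 1.0}
features = [
    {"owner": True, "material": True},
    {"owner": False, "material": True},
    {"owner": True, "material": False},
    {"owner": True, "material": True},
]

print("weighted error %:", weighted_error_percent(features, weights))
print("sample:", acceptance_sample(range(100), 10))
```

The resulting percentage would then be compared with the acceptance threshold agreed for the project.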
Maintenance involves additions, deletions and updates to the database in a tightly controlled environment, in order to retain the database's integrity. It provides the user with only one point of entry into the database, thus improving the consistency and security of the database. Maintenance applications are usually supported by a database management system consisting of permanent and local (temporary) storage systems. Data is checked out from permanent storage into local storage for update and then posted back to permanent storage to complete the update. Pre-posting QA checks are required to ensure database integrity. The database schema is maintained so that table structures and spatial data topologies are not destroyed. Automated validation of attribute values, as well as visual check plots for additions or deletions of large amounts of data, are also useful. Periodic validation of large multi-user databases can identify some very important and potentially costly errors. Errors or last-minute changes in business rules, bugs in the maintenance application, and inconsistent editing methods can all be detected during periodic validation.
The main purpose of this paper has been to present an overview of some methods that are especially suited to assessing the quality of GIS databases and digital maps/coverages. The various quality parameters of a GIS and their measurement in a real-life environment have been presented, and a set of guidelines intended to establish the minimum acceptable level of accuracy, to be adhered to by all projects and users, has been highlighted. The issues involved in the development and implementation of an integrated GIS Quality Assurance Plan have also been discussed.
- Berry B (1964): Approaches to regional analysis: A synthesis. Annals of the Association of American Geographers, 54, 2-11. (1)
- Blakemore M (1983): Generalisation and error in spatial data bases. Cartographica, 21, 131-139. (2)
- Goodchild MF (1991): Keynote address, Proc. Symposium on Spatial Database Accuracy, 1-16. (3)
- Van Genderen JL & Lock BF (1977): Testing land-use map accuracy. Photogrammetric Engineering & Remote Sensing, 43, 1135-37. (4)
Books/ Standards
- CEN (1995): Geographic Information - Data description - Quality (Draft). Brussels: CEN Central Secretariat. (1)
- Guptill SC & Morrison JL, eds. (1995): Elements of spatial data quality. Oxford: Elsevier. (6) (2)
  - Brassel K et al.: Completeness, 81-108. (a)
  - Kainz W: Logical consistency, 109-137. (b)
  - Veregin H et al.: An evaluation matrix for geographical data quality, 167-188. (c)
- Langran G (1992): Time in geographic information systems. London: Taylor & Francis. (3)
- Redman TC (1992): Data Quality. New York: Bantam. (4)
The author is indebted to Sri. S Adiga, Director, NNRMS/RRSSC, Bangalore, for his kind approval. Thanks are due to Dr. A Jeyram, Head, RRSSC, Kharagpur, for his views and suggestions for preparation of the paper.