Systematising raster data for effective image processing

Abstract
This study aims to find a trouble-free and, to some extent, automated way of systematising raster data so that it can be served through WMS while minimising extra effort, the time required to update data, the level of expertise needed and system supervision time. In the absence of a proper system, most of the methods we attempted earlier were, in some way, complicated, unstable, time consuming and difficult to handle for end users with limited GIS/Remote Sensing expertise. Oil and gas industry servicing companies like All-Can deal with raster data from and for numerous clients on an almost daily basis, so it is indispensable to find an easy and efficient way to restrict clients to their own data using WMS while making the entire dataset available to All-Can’s internal users. The main objectives behind this work were to deal effectively with multiple projections, to avoid re-projection on the fly (for efficiency reasons in a web based environment) using a method easily performed by draftspersons, and to arrange access-controlled raster data chronologically for different clients.

MapServer on an Ubuntu server was chosen as the host for our WMS server. A small C# application was written to supply some basic information about each raster file that needs to be processed: name, location, input projection, client’s name, job number and date of acquisition. Based on requirements, it also writes shell script files to the Ubuntu server containing all the commands and arguments the GDAL utilities need to process the rasters automatically, to organise them in a read-only, fixed directory structure (which avoids having multiple copies of the same image) and to create TILEINDEX shapefiles for the client the image was ordered for. The cron job scheduler runs these scripts. Within our directory structure we separate color images from monochrome images; therefore, these shell scripts maintain two TILEINDEX shapefiles per client, i.e. one for mono and one for color. These client based TILEINDEX files are used to restrict clients to their own data. The automatically created scripts also merge the client based TILEINDEX shapefiles, by color/mono classification, into two TILEINDEX shapefiles for our internal use that display all the images. These shapefiles are later imported into PostgreSQL/PostGIS as tables, which are then updated with all the metadata collected through the C# application so that different applications operating within the organisation can run spatial queries against them.

With more than 2000 images being served, and given the size on disk of that raster data, the overall performance of the designed system was found to be competent and stable, and the system proved easily manageable even by non-GIS professionals.

Even though we are successfully running the suggested system using the plain file system approach, the recent release of PostGIS 2.0, with raster as a new PostGIS type for storing and analysing raster data, made us consider trying the images-in-the-database approach. This would give the benefits that come with a relational database management system, such as data protection, concurrency control, integrity and management, besides out of the box raster processing and analysis functions. Also, in an environment where most users are running Windows with very little or no knowledge of GIS and/or Linux, it is recommended, for simplicity and to avoid possible processing mistakes and long image processing times, to have a Windows based front end application that supplies some basic information about the input image and leaves all processing and publishing tasks to the automatically created shell scripts and the Linux cron job scheduler. Some continued investigation is recommended to explore and test other options within MapServer and the Geospatial Data Abstraction Library (GDAL) and so take full advantage of the available features to make the existing system even more efficient.

Introduction
The main motive behind this study was the design and implementation of a scalable, reliable and high performance raster data management system: one that requires limited or no knowledge of core GIS and database principles in order to handle images, and that can serve those rasters using the WMS protocol to the whole organisation and to different clients with restricted access. Other objectives included dealing with the long processing times of raster data, minimising system supervision time, organising the rasters using a specific directory structure and, importantly, using a cost efficient approach.

There has been an ongoing debate in the GIS industry over the last couple of years as to whether rasters need to be stored in a database in order to be better managed, or whether a solution based on a plain file system is good enough. With current technology, both approaches have advantages and disadvantages.

Built in raster processing functions, concurrency control, strong security features to protect data, different indexing mechanisms, reliability and ease of management are among the benefits a user gets by storing images in a database. Relatively low performance in a web server environment is one of the disadvantages, although performance can be improved by using a specialised raster DBMS such as PARADISE or RASDAMAN [Imran 2009], or by an intermediate approach that constructs tiles and multi-resolution pyramids using the BLOB (Binary Large Object) type available on most object-relational DBMS, together with adequate indexing and compression mechanisms for efficient data retrieval [Vinhas et al. 2003]. The main concerns raised about the plain file system approach are the lack of proper data management, due to the possibility of images being scattered all around the organisation, broken links to deleted or modified files and, sometimes, the trouble of updating old data with new in a multi user environment. No matter how one approaches the storage, the partitioning of rasters into tiles and the maintenance of resolution pyramids are found in most implementations of raster storage [Vinhas et al. 2003].

The partitioning of a raster in tiles allows the indexing of each raster part independently, resulting in efficiency gains when just a raster partition is visualised. Resolution pyramids are multi-level auxiliary structures that store down-sampled (and also tiled) versions of the original raster data. The bottom level contains the original resolution, while higher levels contain the subsequent lower resolution versions. This feature is especially useful for visualization purposes, when applications can retrieve the raster at a level according to a desired zoom scale [Vinhas et al. 2003].
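
As a rough illustration of the two ideas above, the following Python sketch picks a pyramid level for a requested resolution and lists the tiles that a pixel window touches. It assumes that each pyramid level doubles the pixel size of the level below it and that tiles are square; the function names and parameters are hypothetical and are not taken from [Vinhas et al. 2003].

```python
import math


def pyramid_level(base_resolution, requested_resolution, num_levels):
    """Return the pyramid level to read for a requested ground resolution.

    Level 0 is the original raster. Each higher level is assumed to
    double the pixel size of the level below it.
    """
    if base_resolution <= 0 or requested_resolution <= 0:
        raise ValueError("resolutions must be positive")
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    ratio = requested_resolution / base_resolution
    # Coarsest level whose pixel size does not exceed the requested one.
    level = int(math.floor(math.log2(ratio))) if ratio >= 1 else 0
    return min(level, num_levels - 1)


def tiles_for_window(xmin, ymin, xmax, ymax, tile_size):
    """Return (col, row) indices of the square tiles a pixel window touches.

    The window covers pixels xmin..xmax-1 and ymin..ymax-1.
    """
    if xmax <= xmin or ymax <= ymin or tile_size <= 0:
        raise ValueError("empty window or invalid tile size")
    cols = range(xmin // tile_size, (xmax - 1) // tile_size + 1)
    rows = range(ymin // tile_size, (ymax - 1) // tile_size + 1)
    return [(col, row) for row in rows for col in cols]
```

Only the tiles returned by tiles_for_window need to be read for a given view, which is where the efficiency gain of tiling comes from.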

MapServer is an Open Source platform for building spatially-enabled interactive Web mapping applications. It was initially developed by the University of Minnesota (UMN) and NASA and is presently a project of the Open Source Geospatial Foundation (OSGeo). It can display and query hundreds of vector, raster and database formats, and it integrates easily with common commercial and open source applications through a configuration file called a Mapfile. Besides having staff with previous experience of MapServer, there were other motives behind its selection as our mapping server. These included its use of the Geospatial Data Abstraction Library (GDAL) and the OGR Simple Feature Library to deal with diverse raster and vector datasets without any conversion, the ability to create maps without particular tools installed or assistance from mapping analysts [T. Mitchell, 2005], and support for popular scripting languages. To design a cost effective solution we installed MapServer on an Ubuntu server. Ubuntu is one of the most popular GNU/Linux based, rock-solid server operating systems.

No low cost raster management system was available that required minimal time, effort and technical knowledge to manage, needed the lowest possible system supervision time, kept to a systematic directory structure for storing rasters and kept that structure inaccessible to end users, so we decided to create one better suited to our needs. In our system, we use the files-in-a-file-system approach to visualize raster data through the WMS protocol. Our directory structure organises the raster images by client name, year of image acquisition and type of image, i.e. color or monochrome.

To add a raster to the system, the user supplies metadata such as client name, date of image acquisition, image resolution, image projection, vendor and job number (a unique number for our internal use to distinguish between different projects). Besides populating the image information table in the PostgreSQL/PostGIS database, a C# application creates shell scripts on the Ubuntu server as input to the raster data management system: two if image uploading is requested and one if image deletion is required. These scripts contain all the image processing and data organisation commands required for that specific raster. Then cron (a time based job scheduler, similar to a scheduled task in a Windows environment) runs some customised Python scripts at a specified time to run the shell scripts. With the automated shell script approach, users do not have to wait through long image processing times, and the TILEINDEX (a polygon shapefile representing the raster images available for display) is maintained automatically for each client and image type (color or monochrome) without any extra effort, in-depth technical knowledge or system supervision time. These polygon TILEINDEX files are then imported into the PostGIS database and further modified by populating newly created columns, from the image information table, with the metadata of every image available for display, for restricted client access. In simple terms, we use a file system approach for our images, Apache as the web server, MapServer as the mapping server, PostgreSQL/PostGIS for storing the TILEINDEX shapefiles only and Ubuntu as the server operating system, bundled with all its security features.

At the time of designing the system, our intention was to systematically store the raster data for visualization purposes within a plain file system. The release of PostGIS 2.0 changed the landscape by supporting the storage of rasters within the database. In our view, when considering the budget, storing rasters in the database is the best option, especially if someone wants to use raster processing functions. For efficiency reasons we decided to avoid the projection on the fly option for our rasters and to re-project all the input images to the 10TM projection. Since the GDAL utilities used in this study create new files for processed data, we ended up with two copies of the raster data: one for archival and one to be published. This doubled the hard disk storage space required but, based on our experience and considering the benefits of the designed system, we can safely say it was not an expensive endeavour.

Methodology
Managing and processing imagery is always complicated and time consuming, particularly when one needs to deal with multiple projections and metadata. Without proper systems in place it is difficult to access and fully utilise imagery, especially when metadata must be tracked and clients given controlled access. Delivering data effectively and efficiently, without delay after receiving it and at minimum extra cost, is always challenging.

GDAL is a translator library for raster geospatial data formats that is released under an X/MIT style Open Source license by the Open Source Geospatial Foundation. As a library, it presents a single abstract data model to the calling application for all supported formats. It also comes with a variety of useful command-line utilities for data translation and processing. The only hitch when using these command-line utilities is that they need specific parameters based on user requirements, and thus require some knowledge of the utilities and time to perform those tasks. Since the system we were planning to design was to rely on users within the CAD department, this study was primarily concerned with an easy-to-use system: one that requires limited or no knowledge of core GIS and database principles in order to manage images.

Based on our previous experience we found Ubuntu server to be scalable, reliable and high performing, and we knew that its combination with MapServer was ideal, particularly with cost as a constraint. We also knew that, along with many other qualities, MapServer can serve as both a WMS client and a WMS server.

Using TILEINDEX shapefiles lets us give restricted access to different clients if we create multiple TILEINDEX shapefiles, in other words separate TILEINDEX files for different clients. We used the gdaltindex program to build these shapefiles, each with polygon geometry outlining the rasters, a record for each input raster file and an attribute containing the filename. The raster data we receive comes in different projection systems, depending on the area of interest, and needs to be re-projected to NAD83(CSRS) / Alberta 10-TM (Resource), the projection we use most for display purposes in our client server environment, so that we can avoid the re-projection on the fly option for efficiency reasons. We decided to use gdalwarp, GDAL’s image re-projection and warping utility, to perform this re-projection.
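
The commands that such a generated shell script might contain can be sketched in Python as follows. This is a minimal sketch and not the scripts used in the study: the file names and the source projection are hypothetical, and EPSG:3403 is our assumption for the EPSG code of NAD83(CSRS) / Alberta 10-TM (Resource), which should be verified before use. Only the basic gdalwarp options (-s_srs, -t_srs) and the basic gdaltindex form (index file followed by raster files) are used.

```python
import shlex

# Assumed EPSG code for NAD83(CSRS) / Alberta 10-TM (Resource); verify before use.
TARGET_SRS = "EPSG:3403"


def gdalwarp_command(src, dst, source_srs, target_srs=TARGET_SRS):
    """Build a gdalwarp call that re-projects src into dst."""
    return ["gdalwarp", "-s_srs", source_srs, "-t_srs", target_srs, src, dst]


def gdaltindex_command(index_shp, rasters):
    """Build a gdaltindex call that adds one polygon record per raster."""
    return ["gdaltindex", index_shp, *rasters]


def to_shell_line(argv):
    """Quote each argument so the line is safe to write into a shell script."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical input: a GeoTIFF delivered in NAD83 / UTM zone 11N.
    print(to_shell_line(gdalwarp_command("in.tif", "out.tif", "EPSG:26911")))
    print(to_shell_line(gdaltindex_command("client_c.shp", ["out.tif"])))
```

Building the argument list first and quoting it afterwards keeps file names with spaces from breaking the generated script.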

Since we could provide the output path of the re-projected image as a parameter to this re-projection and warping utility, we used this output path to organise the rasters in our approved directory structure, which is based on image type (color/mono), client name and the year the image was ordered. By adopting a strict directory structure we eliminate the possibility of duplicating images throughout the organisation, broken links to deleted or updated files and the headache of updating old data with new in a multi user environment. For efficiency reasons we also tile the rasters using gdal_translate, another GDAL utility that also converts raster data between different formats, and use GDAL’s rgb2pct.py utility to convert 24bit RGB images to 8bit paletted images. To deal with the long processing times of raster data (which vary with raster size) and to minimize system supervision time, we decided to run these utilities automatically using Linux shell scripts with Ubuntu’s crontab scheduling utility. To perform all these steps we installed Apache, MapServer, a Samba file server (to network the Ubuntu and Windows computers), PostgreSQL/PostGIS and Python on Ubuntu; the installation and setup of these applications are beyond the scope of this study.
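
A minimal sketch of the palette conversion and tiling calls is given below, again in Python and with hypothetical file names. It assumes GeoTIFF output and uses the GTiff creation option TILED=YES for tiling and the -n option of rgb2pct.py for the palette size; the exact options used in the study are not stated, so these are illustrative choices.

```python
def rgb2pct_command(src, dst, colors=256):
    """Build an rgb2pct.py call converting 24bit RGB to an 8bit paletted image."""
    if not 2 <= colors <= 256:
        raise ValueError("an 8bit palette holds between 2 and 256 colors")
    return ["rgb2pct.py", "-n", str(colors), src, dst]


def tiled_translate_command(src, dst):
    """Build a gdal_translate call that writes an internally tiled GeoTIFF."""
    return ["gdal_translate", "-of", "GTiff", "-co", "TILED=YES", src, dst]


if __name__ == "__main__":
    # Hypothetical intermediate and final file names.
    print(" ".join(rgb2pct_command("warped.tif", "paletted.tif")))
    print(" ".join(tiled_translate_command("paletted.tif", "published.tif")))
```

Each utility writes a new file, which is why intermediate files accumulate and have to be cleaned up between processing sessions.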

Based on functionality, this raster management system can be divided into three main components: scripts (Ubuntu shell and Python), MapServer’s Mapfile and image files. The scripts, the core component, are responsible for running GDAL utilities to process the input images, organizing them, deleting temporary/intermediate files and finally creating a TILEINDEX file to be called from the MapServer Mapfile. There are eight scripts, which perform the following tasks:

  1. remove all temporary files created while processing the raster data,
  2. convert all text files to Linux/UNIX format and change their mode to make them executable,
  3. check whether deletion of an already published raster was requested (instead of adding),
  4. if necessary, reproject the input images to NAD83(CSRS) / Alberta 10-TM (Resource) projection and convert them from 24bit RGB to 8bit paletted images,
  5. create a separate TILEINDEX shapefile for each client,
  6. create a TILEINDEX file for All-Can’s internal use showing all monochrome images,
  7. create a TILEINDEX file for All-Can’s internal use showing all color images, and
  8. export the created TILEINDEX files, both for all clients and for All-Can’s internal use, to PostgreSQL/PostGIS.

The Mapfile is the configuration file for MapServer. It is used to locate symbols, fonts, template files and spatial data, and it specifies the size of the resulting map, its geographical extent and the rendered output image format. For further details about the Mapfile, see https://www.mapserver.org/mapfile/. At the beginning of this project, we decided to use a specific directory structure to store the images. Raster is the main directory and contains two directories, named color and mono, to separate color images from monochrome. Both the color and mono directories contain a separate directory for each year; for example, the directory named 2011 contains all the images ordered during the 2011 calendar year. Each of these year directories contains a directory for every client that we ordered images for.
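
The directory layout described above (main directory, then color or mono, then year, then client) can be expressed as a small path-building helper. This is only a sketch: the root path and client name in the example are hypothetical.

```python
from pathlib import PurePosixPath


def raster_directory(root, image_type, year, client):
    """Return the directory for an image: root/color|mono/year/client."""
    if image_type not in ("color", "mono"):
        raise ValueError("image_type must be 'color' or 'mono'")
    return PurePosixPath(root) / image_type / str(year) / client


if __name__ == "__main__":
    # Hypothetical root directory and client name.
    print(raster_directory("/data/raster", "color", 2011, "client_a"))
```

Because the destination is derived from the metadata alone, two users publishing the same image arrive at the same location rather than creating scattered copies.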

Specific members of our CAD department are responsible for ordering the image and saving it on the data server after providing their credentials; once the image is saved on disk, the user has read-only access to it. For simplicity, and considering today’s environment where most users are running Windows, we decided to use C# to create a simple interface that requires minimal information about the input image (including job number, client name, image resolution, input projection, image acquisition date and image received date). The application creates two shell script files on the Ubuntu server when an image is to be published, and one script when deletion of an already published raster is requested. As discussed earlier, the job number is a unique number in a specific format for our internal use to distinguish between different projects. The first two digits of each job number represent the year in which the job was started and are hence used in this project, in combination with the client name, to organize the rasters chronologically. The application also adds a new record with the same information to a table named img_metadata in PostgreSQL/PostGIS. The scripts serve as input to the system and contain all the commands and instructions. After completing this two to three minute task, the user carries on working with the image received from the data vendor, without making any other effort to publish it through the raster data management system.
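
Deriving the year directory from a job number could look like the following sketch. The full job number format is not given in this article, so the example values are hypothetical, and the sketch assumes that all jobs started in the year 2000 or later.

```python
def job_year(job_number, century=2000):
    """Return the year encoded in the first two digits of a job number.

    Assumes the job was started in the given century (2000 by default).
    """
    prefix = job_number[:2]
    if len(prefix) != 2 or not prefix.isdecimal():
        raise ValueError("job number must start with two digits")
    return century + int(prefix)


if __name__ == "__main__":
    print(job_year("11-0457"))  # hypothetical job number
```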

We run the cron job twice a day (at noon and midnight) to update the rasters, although it could be run as often as we want. The cron job simply runs a shell script that contains the paths to eight different script files (Python and shell) and runs them in succession, as shown in Figure 1. This diagram also briefly explains the important commands used during the different steps. CAD technicians use the C# graphical user interface to create the scripts run at steps 3, 4 and 5; the scripts are written to the Ubuntu server through a Samba share and serve as input to the system.
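
Running a fixed list of scripts in succession can be sketched in Python as below. This is an illustration rather than the driver used in the study, which is a shell script; stopping at the first failing step is our assumption, and the commands in the example are placeholders. A crontab schedule such as "0 0,12 * * *" would trigger a driver like this at midnight and noon.

```python
import subprocess
import sys


def run_in_succession(commands):
    """Run each command in order and return how many completed successfully.

    Assumption: a failing step stops the run so later steps never work
    on incomplete output.
    """
    completed = 0
    for argv in commands:
        result = subprocess.run(argv)
        if result.returncode != 0:
            break
        completed += 1
    return completed


if __name__ == "__main__":
    # Placeholder steps; the real system runs eight Python and shell scripts.
    steps = [[sys.executable, "-c", "print('step %d')" % n] for n in (1, 2, 3)]
    print(run_in_succession(steps))
```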

There are three folders on the Ubuntu server, named del_img_req, process_img and tile_shp, that store the shell scripts created by CAD users (a separate file for each request) to delete images, process images and create TILEINDEX shapefiles. As discussed earlier, we pre-process images using GDAL’s gdalwarp, rgb2pct and gdal_translate utilities, where necessary, to avoid projection on the fly and for other efficiency reasons. These utilities create intermediate files that need to be deleted before another image processing session starts.

In the first step we delete all temporary or intermediate image files and make sure that certain directories exist at specific locations on the Ubuntu server so that new temporary/intermediate files can be created. To run successfully on the Ubuntu server, all the script files created by the C# application must be in UNIX format. In the second step we therefore use the fromdos utility to convert the DOS formatted plain text files to UNIX format, and change their mode to make them executable, so that they can be run at steps 3, 4 and 5.

In the third step, a Python script loops through the del_img_req folder to see whether there is any request to delete an image; there is a separate file for every request. If it finds a script with the details of an image to be deleted, it removes that image from the Ubuntu server so that, after step 5, users can no longer see it. In the fourth step a Python script looks in the process_img folder on the Ubuntu server for shell scripts to process. This folder contains a separate shell script file for every raster to be published, with instructions about its processing and the destination path of the processed image. After this step completes successfully, the processed images are in place in a directory structure on the Ubuntu server based on color/mono, the year the image was ordered and the client name.

In the fifth step, a Python script loops through the tile_shp folder to see which clients rasters were ordered for. This folder contains separate shell script files, also created by the C# application, for distinct clients. Every file in this folder contains instructions for running the gdaltindex utility to create the TILEINDEX shapefiles for both monochrome and color images for that client and for putting these shapefiles, with the suffixes _m and _c, into a folder named client_tileindex. Using the shptree utility, the script also creates a quadtree-based spatial index for each newly created TILEINDEX shapefile.

In the sixth step, the aces_tile_m Python script merges the TILEINDEX files of all clients that represent monochrome images. Step seven is similar to step six, except that it merges the TILEINDEX files of all clients for color images; it also creates a quadtree-based spatial index. Finally, the process_postgis shell script exports the TILEINDEX shapefiles ending with aces_m and aces_c to PostgreSQL/PostGIS as two spatial tables for our internal use, so that other applications can run spatial queries against them. Figure 2 classifies the raster data management system by operating system and into automated and manual parts.
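
The second step (converting DOS formatted scripts to UNIX format and making them executable) is done with the fromdos utility in our system. The Python sketch below shows an equivalent of that step for illustration only; the function names are hypothetical and it is not the code used in the study.

```python
import os
import stat


def to_unix_executable(path):
    """Rewrite a script with LF line endings and set its execute bits."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(data.replace(b"\r\n", b"\n"))
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def prepare_scripts(folder):
    """Convert every regular file in a request folder; return the paths handled."""
    prepared = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            to_unix_executable(path)
            prepared.append(path)
    return prepared
```

Calling prepare_scripts on each of the del_img_req, process_img and tile_shp folders would leave every request file ready to run at steps 3, 4 and 5.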


Figure 1 Scripts, their names and description

After successful completion of the eight steps we get:

  1. processed images in NAD83(CSRS) / Alberta 10-TM (Resource) projection, organized in a directory structure based on color/mono, client name and year of acquisition, for MapServer to publish using the WMS protocol,
  2. separate color and monochrome TILEINDEX shapefiles for every client, plus two representing all color images and all monochrome images for All-Can’s internal use, and
  3. two PostgreSQL/PostGIS spatial tables showing the boundaries of raster coverage. These tables also hold all the metadata as the result of a join, using the image file name as a unique key, with the img_metadata table that was modified at the time of publishing or deleting the raster. We use these PostgreSQL/PostGIS spatial tables for several other purposes.
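
The join between a TILEINDEX table and img_metadata, keyed on the image file name, can be illustrated with the sketch below. To keep it self-contained it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL/PostGIS and omits the polygon geometry; the table name tileindex_color, the column names and the sample rows are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the two tables (geometry omitted)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tileindex_color (location TEXT PRIMARY KEY);
        CREATE TABLE img_metadata (
            file_name TEXT PRIMARY KEY, client TEXT, job_number TEXT);
        INSERT INTO tileindex_color VALUES ('a.tif'), ('b.tif');
        INSERT INTO img_metadata VALUES
            ('a.tif', 'client_a', '11-0001'),
            ('b.tif', 'client_b', '12-0002');
        """
    )
    return conn


def images_for_client(conn, client):
    """Return (file name, job number) for one client's published images."""
    rows = conn.execute(
        "SELECT t.location, m.job_number "
        "FROM tileindex_color AS t "
        "JOIN img_metadata AS m ON m.file_name = t.location "
        "WHERE m.client = ? ORDER BY t.location",
        (client,),
    )
    return rows.fetchall()
```

In PostGIS the same join would also carry the footprint geometry, which is what lets other applications query raster coverage spatially.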

The last step is to set up MapServer’s Mapfile for our WMS service by defining the parameters and metadata items that a WMS configuration requires. This step is performed whenever an image is ordered for a new client or when we get a special request from one of our clients. A Mapfile has already been set up for every existing client.
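
As an illustration of what such a Mapfile entry involves, the following Python sketch renders a raster LAYER block that points at a client's TILEINDEX shapefile. The layer name, shapefile path and title are hypothetical, only a few Mapfile keywords are shown, and TILEITEM is set to "location", the attribute name gdaltindex writes by default; a working WMS configuration needs further MAP and WEB level settings that are omitted here.

```python
def tileindex_layer(name, tileindex_path, title, tileitem="location"):
    """Render a minimal Mapfile LAYER block for a raster TILEINDEX."""
    lines = [
        "LAYER",
        '  NAME "%s"' % name,
        "  TYPE RASTER",
        "  STATUS ON",
        '  TILEINDEX "%s"' % tileindex_path,
        '  TILEITEM "%s"' % tileitem,
        "  METADATA",
        '    "wms_title" "%s"' % title,
        "  END",
        "END",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical client layer using the _c (color) TILEINDEX shapefile.
    print(tileindex_layer("client_a_color",
                          "client_tileindex/client_a_c.shp",
                          "Client A color imagery"))
```

Because each client's Mapfile references only that client's TILEINDEX shapefiles, the layer can never resolve to another client's images.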


Figure 2 Classification of Raster Data Management

Conclusion
A cost effective and high performance raster data management system for raster visualization was presented that is reliable and capable of handling a growing amount of data. It relies heavily on the CAD department, requires almost no prior knowledge of core GIS and database principles, and serves the data using the WMS protocol to the whole organization and to multiple clients with restricted access, using a plain file system approach. Using a semi-automated method, it also deals with the long processing times of raster data, minimizes system supervision time and organizes the rasters using a specific directory structure.

Ubuntu, one of the most popular GNU/Linux based, rock-solid server operating systems, running GDAL command-line utilities in combination with MapServer, was found to be ideal for the suggested approach. We convert our images from 24 bit to 8 bit, reproject them to 10TM and tile them; the tiled images are then organized in a directory structure with the type of image (color or monochrome) at the top of the hierarchy, followed by client name and finally by the year the image was ordered. We perform all these image processing and organizing tasks using shell scripts. These scripts, created automatically through the C# graphical user interface, serve as input to our system and contain instructions for all image processing and organizing tasks. The rasters are then published using the WMS protocol through TILEINDEX files. Each client has its own TILEINDEX files representing its own data, and there are two merged files, one for color and one for mono images, with all the metadata for our internal use. These polygon TILEINDEX files are also imported into the PostGIS database for use in our other spatial applications.

Since implementing the system, our CAD department has published more than two thousand images on the basis of a half hour training session. Separate shapefiles are created for different clients and are automatically merged for our internal use. Publishing rasters through WMS proved to be a very good idea, as we are using the images not only in MapGuide but also in other applications like AutoCAD, QGIS and Global Mapper. The whole system was found to be both reliable and scalable, and based on our experience we can safely say it is a time-saving system that requires very little system supervision time. Mapfile creation on the Ubuntu server was a one-time process that does require some knowledge of MapServer and of a text editor on Ubuntu; after setting it up for all the active clients no further modification was needed, and any changes or updates merely require the addition of a couple of lines of text.

A couple of years ago, when we designed this system, our intention was to find an efficient and simple way to systematically store raster data for visualization, so we decided on the plain file system approach. With the approach presented here it is not possible to do any kind of raster analysis, and it doubled the hard disk storage space required; but, as we said, based on our experience and considering the benefits of the designed system we can safely say this was not an expensive endeavour. The release of PostGIS 2.0, with raster as a new PostGIS type for storing and analyzing raster data, changed the landscape. In our view, when considering the budget, storing rasters using an open source program like PostGIS in a PostgreSQL object-relational database is the best option. In this study we are not utilizing all the capabilities of either MapServer or GDAL (Geospatial Data Abstraction Library), and therefore continued investigation is suggested to take full advantage of other available features and make the existing system even more efficient.

References

  • GDAL – Geospatial Data Abstraction Library, Sept. 2012. https://www.gdal.org.
  • Imran, M. (2009). Extending an open source spatial database with geospatial image support: An image mining perspective.
  • Kyle Rankin, Benjamin Hill (2009). The Official Ubuntu Server Book. Prentice Hall PTR: Upper Saddle River, NJ, USA
  • MapServer, Sept. 2012. https://mapserver.org/.
  • R. Sears, C. van Ingen and J. Gray, To blob or not to blob: Large object storage in a database or a filesystem. Technical Report MSR-TR-2006-45, Microsoft Research (April 2006).