GIS is a virtual world, a world represented by points, lines, polygons and graphs. Processing these datasets has been a challenge since the day GIS was established as a field. Processing huge volumes of data has been a long-standing problem not only in the traditional Information Technology (IT) sector but also in the geospatial domain. However, recent developments in both hardware and software infrastructure have enabled the processing of huge data sets. This has given a big push and a new direction to industries that were marred by slow data-processing capabilities, and the GIS industry as a whole is not lagging behind in utilizing this opportunity. According to a McKinsey report, developments in Big Data will set off a new wave of innovation that will be felt across the IT sector, and innovation in Big Data and GIS will bring many new players into the market.
The GIS industry has its own nomenclature for Big Data: huge spatial data sets are called Spatial Big Data (SBD), the term used by Dr. Shashi Shekhar of the University of Minnesota. Big Data is conventionally defined by the three V's: volume, velocity and variety. The spatial domain faces the same problem of growth in size, variety and update frequency, which is exceeding the capacity of commonly used spatial computing techniques, architectures, methodologies and software solutions.
The traditional spatial data genres are raster, vector and graph; SBD is becoming prevalent in all three. Raster SBD includes satellite imagery, climate simulations, and multiple coordinated drone imaging systems. Vector SBD includes Uber taxi data, geo-located Twitter data, GPS traces, etc. Graph SBD includes electric grid data, road network data, supply chain network data and drone network data. SBD comes with its own challenges: the lack of specialized systems, techniques and algorithms to support every type of SBD. Well-developed Big Data tools and concepts like the Map-Reduce technique, Hadoop, Hive, HBase and Spark do not support spatial or spatio-temporal data directly. Most SBD is processed either as non-spatial data or through some wrapper function, which does not bring down processing time.
The limitations of existing systems have motivated many researchers to come up with extensions, products and architectures that help handle spatial data efficiently. These include SpatialHadoop, Esri GIS Tools for Hadoop, Hadoop-GIS, and parallel DB systems like Secondo and CouchDB.
Spatial Big Data Processing
The 21st century started with exponential development in science and technology, which resulted in an explosion of data. This data had volume, velocity, variability, value, valence and virtualization aspects attached to it. Until recently, most of it was never analyzed and was often simply discarded; there was no operational method to fully utilize it. This situation came to be known as the "Big Data problem". The problem reached not only the doors of IT companies but also administration, the economy, sustainability, health care, research and development, and every industry that runs on the machine world, as described by C.L. Philip Chen in his 2014 article. It also gave various organizations the opportunity to solve the Big Data problem in their own way: Google solved it using GFS, the social media giant Facebook used its own Big Data strategy, and so did Walmart. Most of the initial Big Data problems were not in the geospatial domain; however, the problem did touch NASA's Center for Climate Simulation and the Large Synoptic Survey Telescope (LSST).
The Google paper on data processing on large clusters, published in 2004, would go on to change how data processing is seen. It defined a new processing method, the Map-Reduce framework: "MapReduce is a programming model and an associated implementation for processing and generating large data sets". The paper defined what the map and the reduce function stand for: "a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key". Programs written in this model are automatically parallelized and can run on a large cluster of commodity desktop machines without high-end configuration. The catch is that a real-world task must first be expressed in Map-Reduce form, otherwise the model is hard to apply; fortunately, most real-world tasks can be converted into Map-Reduce form.
If the Map-Reduce framework is to be utilized, a geospatial Big Data problem must be converted into two sets of functions: one doing the mapping and the other doing the reducing. Once this is done, the system takes care of run-time data partitioning, scheduling of programs across the cluster, fault tolerance and inter-machine communication.
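To make the map/reduce split concrete, here is a minimal single-process sketch (not any particular framework's API) of the model applied to spatial point data: the map function assigns each GPS point to a hypothetical one-degree grid cell and emits a key/value pair, and the reduce function merges all values sharing the same cell key.

```python
from collections import defaultdict

def map_point(point):
    """Map phase: emit (cell, 1) for a (lon, lat) point."""
    lon, lat = point
    cell = (int(lon), int(lat))  # key: 1-degree grid cell (illustrative)
    return cell, 1

def reduce_counts(cell, values):
    """Reduce phase: merge all intermediate values for the same cell."""
    return cell, sum(values)

def map_reduce(points):
    # The framework normally handles partitioning, shuffling and scheduling
    # across the cluster; here both phases run in one process for clarity.
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p)
        groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

points = [(77.2, 28.6), (77.9, 28.1), (2.35, 48.85)]
print(map_reduce(points))  # {(77, 28): 2, (2, 48): 1}
```

In a real deployment the framework would shard the input, run many map tasks in parallel, and route each intermediate key to a reduce task; only the two small functions above are problem-specific.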
The above figure is the execution overview (source: Google's Map-Reduce framework paper). This is a data-centric framework with no spatial orientation in it. What Hadoop provides is a distributed file system and a framework for transforming data: Hadoop, developed at Yahoo, became the software implementation of the Map-Reduce framework, and the project later became free and open source. Opening up Hadoop transformed the whole data sector, which had been clueless in the face of the Big Data problem. At this point it is necessary to ask some basic questions: what would the execution architecture of Map-Reduce look like with respect to spatial data? Is Hadoop capable of implementing the Map-Reduce framework for spatial data?
To answer this we need to explore the concepts behind Hadoop-GIS, SpatialHadoop and Esri GIS Tools for Hadoop.
The two major problems with spatial partitioning are, first, that it can lead to load-balancing problems and slow the system down, and second, that it can produce inaccurate answers. One answer to these problems is Hadoop-GIS, a scalable, high-performance spatial data warehousing system running on Hadoop. It supports spatial partitioning and customized spatial querying using RESQUE, which runs on the Map-Reduce framework, and uses global partition indexing to achieve efficient query results. Hadoop-GIS integrates with Hive, Pig and Scope, and compares well with parallel SDBMSs. It supports fundamental spatial queries such as point, containment and join queries, as well as complex queries such as spatial cross-matching (large-scale spatial join) and nearest-neighbor queries. Structured feature queries are also supported through Hive and are fully integrated with spatial queries.
Hadoop-GIS was developed with the intention of providing a highly scalable, cost-effective and efficient spatial query processing system. To realize this, querying was divided into small tasks and processes: the data was spatially partitioned into tiles or blocks and processed in parallel, queries were run on these tiles, and the results were then joined while maintaining semantics. The algorithm is given by Ablimit Aji in a 2013 paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3814183/); we will not go into much detail here. This methodology may also be used by NASA and DigitalGlobe to handle big data querying of geospatial raster data.
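The tile-based strategy above can be sketched in a few lines. This is a simplified, single-machine illustration (not the actual RESQUE engine or its data structures): points are partitioned into fixed-size tiles, each tile is queried independently (in Hadoop-GIS the tiles would be processed in parallel across the cluster), and only tiles overlapping the query window are scanned. The tile size is an arbitrary assumption.

```python
TILE_SIZE = 10.0  # assumed tile width/height, purely illustrative

def tile_of(x, y):
    """Map a coordinate to the tile that contains it."""
    return (int(x // TILE_SIZE), int(y // TILE_SIZE))

def partition(points):
    """Spatially partition points into tiles."""
    tiles = {}
    for p in points:
        tiles.setdefault(tile_of(*p), []).append(p)
    return tiles

def range_query(tiles, xmin, ymin, xmax, ymax):
    """Answer a window query by scanning only the overlapping tiles."""
    results = []
    for tx in range(int(xmin // TILE_SIZE), int(xmax // TILE_SIZE) + 1):
        for ty in range(int(ymin // TILE_SIZE), int(ymax // TILE_SIZE) + 1):
            for (x, y) in tiles.get((tx, ty), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    results.append((x, y))
    return results

pts = [(1, 1), (5, 5), (15, 15), (25, 3)]
tiles = partition(pts)
print(range_query(tiles, 0, 0, 12, 12))  # [(1, 1), (5, 5)]
```

The final per-tile filter is what preserves query semantics after partitioning; in a distributed run, merging the per-tile result lists corresponds to the join step described above.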
DigitalGlobe, which provides high-resolution imagery, has recently been facing a new challenge: holding 70 petabytes of data, more than 22 times the data Walmart holds. This huge amount of data carries the keys to unlocking the power of imagery analytics, and DigitalGlobe addressed the problem with an SBD solution implemented over the cloud using Hadoop. Monsanto, an agro-industry giant, also used a Hadoop-HBase system to pipeline raster data and produce geospatial analytics for a farmer recommendation system aimed at optimizing the agro industry; this project also involved integrating publicly available data related to climate, market, yield, demand and supply, and the Hadoop-based implementation was presented by Monsanto at a Big Data conference in 2013. These were the initial success stories. However, Hadoop itself is ill-equipped to handle spatial data, as it treats spatial and non-spatial data in the same way, and Hadoop-GIS is like a black box: not much can be changed, and it supports only a uniform grid index, which is applicable only where the data distribution is uniform. To overcome the limitations of Hadoop-GIS, a spatially aware extension was designed: SpatialHadoop.
SpatialHadoop is an extension of Hadoop that supports spatial data natively; it is spatially aware. SpatialHadoop adapts traditional spatial indexing structures, Grid, R-tree and R+-tree, to form a two-level spatial index. It builds spatial awareness into the Hadoop base code, making Hadoop aware of spatial constructs and spatial data inside its core functionality, which is key to its power and efficiency. SpatialHadoop supports a set of spatial index structures, including R-tree-like indexing built into the Hadoop Distributed File System (HDFS), and exposes functionality that users can directly modify.
SpatialHadoop has four components:
- Language: Pigeon, a high-level SQL-like language.
- Storage: applies two-level indexing, global and local. The global index partitions data across nodes; the local index works inside each node. Grid, R-tree and R+-tree indexes are supported.
- MapReduce: two new components, SpatialFileSplitter and SpatialRecordReader.
- Operations layer: provides the spatial operations.
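The two-level indexing idea in the storage component can be illustrated with a hypothetical simplification (class and function names are mine, not SpatialHadoop's): a global index maps each partition's bounding box to a block, and each block carries a local index (here, points kept sorted by x) that is consulted only when the block's bounding box intersects the query window.

```python
import bisect

class Block:
    """One storage partition with a local index (points sorted by x)."""
    def __init__(self, mbr, points):
        self.mbr = mbr                   # (xmin, ymin, xmax, ymax)
        self.points = sorted(points)     # local index
        self.xs = [p[0] for p in self.points]

    def query(self, xmin, ymin, xmax, ymax):
        # Binary-search the x range, then filter on y.
        lo = bisect.bisect_left(self.xs, xmin)
        hi = bisect.bisect_right(self.xs, xmax)
        return [(x, y) for (x, y) in self.points[lo:hi] if ymin <= y <= ymax]

def intersects(a, b):
    """Do two (xmin, ymin, xmax, ymax) rectangles overlap?"""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def global_query(blocks, window):
    # Global index: prune blocks whose bounding box misses the window.
    out = []
    for blk in blocks:
        if intersects(blk.mbr, window):
            out.extend(blk.query(*window))
    return out

blocks = [Block((0, 0, 9, 9), [(1, 2), (8, 8)]),
          Block((10, 0, 19, 9), [(12, 3)])]
print(global_query(blocks, (0, 0, 5, 5)))  # [(1, 2)]
```

The pruning step is what the SpatialFileSplitter enables at the MapReduce layer: map tasks are launched only for partitions that can contribute to the answer.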
The above figure shows the SpatialHadoop architecture at a high level. Developers interact at the operations level and can perform operations like spatial join and range query; casual users can run spatial queries and see the results via visualization; and the system administrator is responsible for tuning configuration files and system parameters. The MapReduce layer interacts with the storage layer, where data is stored using spatial indexing. For more on this, refer to
(http://spatialhadoop.cs.umn.edu/), the official website from which SpatialHadoop can be downloaded. SpatialHadoop also comes with a preconfigured VM and tutorials.
Experiments have been conducted by the SpatialHadoop team at the University of Minnesota and are still ongoing. As the experiments are showing promising results, it may one day develop into a full-fledged SpatialHadoop environment.
Esri GIS Tools for Hadoop
These are open source tools which run on the ArcGIS platform, allowing integration of Hadoop with spatial data analytics software, i.e. ArcGIS Desktop. The basic problem between Hadoop and ArcGIS was data interoperability: moving data from HDFS to the ArcGIS platform and from the ArcGIS platform back to HDFS. The tools, developed in 2013, had two components that served as interoperability tools; on top of these, a few tools such as aggregation tools were developed to create bins. These tools were limited in their approach, since no other spatial data analysis or processing could be done with them. Hive querying was also introduced, but at the tool level, not the architectural level. The tools were meant to show the capabilities, and hence cannot be credited as Spatial Big Data processing tools. Because ArcGIS Desktop was involved, many modules had to run, which caused delays in data processing. However, this was introduced at a time when the GIS industry was still trying to bring out commercial off-the-shelf (COTS) products for Spatial Big Data.
The toolkit provides a way of importing big data into the ArcGIS environment. Results from spatial querying and analytics in Hadoop can be moved to ArcGIS for further processing and visualization; those geoprocessed datasets can then be saved to ArcGIS or exported back into the Hadoop system, creating a looping workflow between the ArcGIS platform and the big data environment. There were also examples showing the aggregation of billions of taxi-data points into raster images, which worked well both as analytics and as aesthetics. The setup involved a Hadoop environment running on a VM interacting with ArcGIS Desktop on Windows via a Citrix environment. This basic architecture is still followed, so Hadoop implementations are modular without disturbing either environment, i.e. ArcGIS software stays on Windows and Hadoop on Linux, and the two systems interact over the network through a common data transfer format like JSON. Only in 2017 was a mature architecture that could handle SBD at the point level running successfully; however, there seems to be no evidence of an SBD strategy for raster data in the ArcGIS environment like those of Monsanto and NASA, which are purely custom Spatial Hadoop implementations. For more information on SBD in the ArcGIS environment, follow the blog (http://thunderheadxpler.blogspot.com/).
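The JSON hand-off between the two environments can be sketched as follows. This is a hypothetical illustration, not Esri's actual schema or tooling: aggregated results computed on the Hadoop side are serialized as GeoJSON, a common interchange format that tools on the ArcGIS side can ingest for visualization. The field names are illustrative.

```python
import json

def to_geojson(records):
    """Serialize (lon, lat, count) aggregates as a GeoJSON FeatureCollection."""
    features = []
    for lon, lat, count in records:
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"trip_count": count},  # illustrative attribute name
        })
    return json.dumps({"type": "FeatureCollection", "features": features})

# e.g. per-cell taxi pickup counts produced by a Hadoop aggregation job
aggregated = [(-73.98, 40.75, 1200), (-73.97, 40.76, 950)]
geojson = to_geojson(aggregated)
print(len(json.loads(geojson)["features"]))  # 2
```

Writing the file to a shared location (or back into HDFS) is what closes the looping workflow described above; each side only needs to agree on the interchange format, not on each other's internals.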
SBD: a problem or a solution in itself?
The world is a combination of systems that no longer work in silos; it has developed into a complex, interconnected mega-system. Information is scattered all around the globe in the form of imagery, points, roads and networks, all interrelated and affecting each other. It is very hard to understand the linkages that exist between these information systems, both temporally and spatially. Spatial Big Data could help us understand these systems and their interconnections like never before, enabling us to solve problems at the local, regional and global levels. DigitalGlobe has recently come up with its own geospatial Big Data solution that works on its imagery inventory, and companies like Planet Labs are also coming up with their own versions of SBD analytics. To conclude, SBD is still at a nascent stage, but the GIS industry as a whole is preparing itself for the next big bang in SBD. On a cautionary note, the industry will have to wait a little longer for a mature Spatial Big Data platform. It is premature to say which domain will drive SBD, but I believe its future will be shaped by how well SBD is used for geospatial analytics and how well Analytics as a Service (AaaS) grows in the coming years.