The traditional storage mechanisms that today’s organizations use cannot handle the massive amounts of data they are flooded with. So what do they do? They move their data to USB drives to make room for new data and then shelve the USB drives with the idea that one day the data will be restored to traditional storage, but it never is — Mansour Raad, BigData Expert at Esri, tells us how geospatial Big Data can solve this problem…
What is so special about geospatial Big Data?
Practically everything has an explicit or implicit geospatial location — explicit when a latitude and a longitude are provided, implicit when an address is specified, like 380 New York Street, or (my favorite example) implied in a mobile Tweet such as, “I’m eating a Chicago style pizza in downtown Manhattan.” Even if the location service on the phone is disabled, a natural-language processing analyzer can determine that Chicago is not the correct location of the Tweet — downtown Manhattan is the location, and that can be converted into explicit latitudinal and longitudinal values.
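A toy illustration of that disambiguation, assuming a hypothetical two-entry gazetteer and a crude rule for attributive place names (real systems use trained named-entity-recognition models and full gazetteers):

```python
# Toy geoparser: resolve the implied location of a tweet against a tiny,
# hypothetical gazetteer. Coordinates below are illustrative.
GAZETTEER = {
    "downtown manhattan": (40.7075, -74.0113),
    "chicago": (41.8781, -87.6298),
}

# "Chicago style" describes the pizza, not the place, so a simple rule
# drops place names used attributively before another noun.
ATTRIBUTIVE_HINTS = ("style", "deep dish")

def geoparse(text):
    """Return (place, (lat, lon)) for the most plausible location mention."""
    lowered = text.lower()
    candidates = []
    for place, coords in GAZETTEER.items():
        idx = lowered.find(place)
        if idx == -1:
            continue
        tail = lowered[idx + len(place):].lstrip()
        # Skip mentions immediately followed by an attributive hint.
        if any(tail.startswith(h) for h in ATTRIBUTIVE_HINTS):
            continue
        candidates.append((place, coords))
    return candidates[0] if candidates else None

place, (lat, lon) = geoparse("I'm eating a Chicago style pizza in downtown Manhattan.")
# place == "downtown manhattan"
```

The point is only the shape of the pipeline: find candidate place names, reject those used in a non-locative sense, and map the survivor to explicit coordinates.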
Everything that happens, happens somewhere. It’s the where that provides context. In the words of Richard Saul Wurman, “Big Data is junk if you can’t understand it, but a map is a pattern made understandable.”
And that is most evident today when something happens like the November 2015 mass shooting in Paris. A map is presented on TV highlighting the time and location of the events, helping us make sense of the sometimes crazy world we live in.
How can industries take advantage of Big Data?
What’s getting better is our ability to model and forecast our cultural, physical, and biological futures. The difference is real-time data. You have sensors pulling in information about traffic, noise, air pollution, water quality — even conversations on Twitter. GIS has always been about data, but now GIS is getting filled with streams of real-time data. We’re able to integrate that data from different sources and analyze it against historical patterns to make predictions.
Instead of simply telling you what current traffic conditions are, we can predict what the next hour of traffic will be like and tell you the best way to get where you’re going. That’s powerful for you as an individual, of course, but if you’re a logistics professional, you can dramatically change your business by rerouting your trucks.
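A minimal sketch of such a forecast, assuming a made-up hourly speed profile and a simple rule that part of the current deviation persists into the next hour (this is not Esri’s actual model — real systems fuse many sensor streams):

```python
def forecast_next_hour(history_profile, current_speed, hour, alpha=0.5):
    """Blend the historical average speed for the coming hour with the
    deviation observed right now (exponential-smoothing style).

    history_profile: average speed (mph) for each hour 0..23 on a segment.
    current_speed:   speed observed this hour.
    hour:            current hour of day.
    alpha:           how strongly the current deviation is assumed to persist.
    """
    deviation = current_speed - history_profile[hour]
    return history_profile[(hour + 1) % 24] + alpha * deviation

# Hypothetical segment: free-flowing at night, congested around 8-9 am.
profile = [55] * 7 + [30, 25, 35] + [50] * 14
# It's 8 am and traffic is moving at 20 mph, 5 mph below the norm:
print(forecast_next_hour(profile, current_speed=20, hour=8))  # 32.5
```

A logistics application would run this per road segment and feed the predicted speeds into a routing engine, rather than routing on current conditions alone.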
Big Data can easily become unmanageable and useless without the proper tools to analyze it fast. How can this be managed?
In most organizations, data goes to “die” on USB drives. What I mean is that the traditional storage mechanism that today’s organizations use cannot handle the massive amounts of data they are flooded with. So what do they do? They move their data to USB drives to make room for new data and then shelve the USB drive with the idea that one day the data will be restored to their traditional storage, but it never is. The operative word here is “traditional,” and that is very important.
For me, understanding Big Data usage is about more than the traditional volume, velocity, and variety (the 3 Vs); it’s the realization that sometimes the 3 Vs just don’t apply. For example, imagine (God forbid) that another Fukushima Daiichi event happens — the sparse data from sensors in a remote village is not coming in fast, is not big, and is very structured. However, the window of opportunity to respond to such an event is so small that I need new ways to determine whether that village needs to be evacuated, and I need certainty and confidence in my decision making. This is where geospatial analytics is most relevant, in the form of Bayesian kriging with regression executed very quickly by non-traditional means.
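Empirical Bayesian kriging as implemented in ArcGIS is considerably more involved, but the core idea — a best linear unbiased prediction driven by a covariance model of spatial correlation — can be sketched as simple kriging in a few lines of NumPy (the sensor locations and readings below are invented):

```python
import numpy as np

def gaussian_cov(d, sill=1.0, rng=1.0):
    """Gaussian covariance model: correlation decays smoothly with distance."""
    return sill * np.exp(-(d / rng) ** 2)

def simple_krige(xy, z, xy0, mean, sill=1.0, rng=1.0, nugget=1e-10):
    """Predict the field value at xy0 from observations z at locations xy.

    Simple kriging is the best linear unbiased predictor given a known
    mean and covariance model; the weights come from one linear solve.
    """
    xy, z, xy0 = np.asarray(xy, float), np.asarray(z, float), np.asarray(xy0, float)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    K = gaussian_cov(d, sill, rng) + nugget * np.eye(len(xy))
    k0 = gaussian_cov(np.linalg.norm(xy - xy0, axis=-1), sill, rng)
    w = np.linalg.solve(K, k0)      # kriging weights
    return mean + w @ (z - mean)

# Hypothetical radiation sensors (km coordinates) and dose readings:
sensors = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5)]
readings = [3.0, 7.0, 5.0]
estimate = simple_krige(sensors, readings, (1.0, 0.5), mean=5.0, rng=1.0)
```

With only a handful of sensors the linear solve is instant; the operational challenge in the scenario above is the pipeline around it — getting sparse readings in and a confident evacuate/don’t-evacuate decision out within a tiny time window.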
Here’s another example of non-traditional data management: An organization needed to meet the requirements of a service-level agreement (SLA) for storing online all data for a specific amount of time so that any geospatial data collected during that time can be immediately visualized on a map. A traditional data storage vendor could have helped that organization meet the SLA requirements, but at an exorbitant cost! The organization decided to take a bold step and try something new, Big Data (i.e., Hadoop), to store and process that information. Again, it’s not about volume, velocity, and variety — it is simply the cost of doing business. Hadoop provides a way to deal with Big Data issues. Lately, I’ve been combining it with other tools such as Cassandra, Elasticsearch, and Apache Spark to surpass the traditional means.
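The Hadoop processing model behind that choice can be sketched in plain Python. In production the map and reduce phases run distributed over HDFS, and the shuffle is done by the framework; here everything runs in-process just to show the shape of the computation (counting observations per 1-degree grid cell):

```python
from collections import defaultdict

def map_phase(records):
    """Emit (cell, 1) for each (lat, lon) record; cells are 1-degree
    bins obtained by truncating the coordinates toward zero."""
    for lat, lon in records:
        yield (int(lat), int(lon)), 1

def shuffle(pairs):
    """Group emitted values by key (the framework's job in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each cell."""
    return {cell: sum(vals) for cell, vals in groups.items()}

points = [(40.7, -74.0), (40.3, -74.9), (41.8, -87.6)]
counts = reduce_phase(shuffle(map_phase(points)))
# {(40, -74): 2, (41, -87): 1}
```

Because each phase is embarrassingly parallel over its inputs, the same code shape scales from one laptop to a cluster — which is exactly why the cost of meeting that SLA dropped.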
What software and tools are on the Big Data roadmap?
The ArcGIS platform continues to evolve in a number of areas. Here are some highlights:
First, we are working on Big Data for GeoAnalytics, an extension for ArcGIS for Server that leverages a new class of technologies, including distributed computing and storage frameworks, to analyze and visualize very large datasets. Examples include analyzing and visualizing large volumes of real-time streaming data (e.g., data from moving vehicles/GPS sensors, connected devices, and social media events), running batch analytics on high-volume spatiotemporal data, and performing raster analytics on very large collections of imagery.
In the coming 6 to 12 months, we will be introducing capabilities across the ArcGIS platform that will make it easy for our users to take advantage of this new approach and scale it to their needs. The combination of ArcGIS GeoEvent Extension for Server and GeoAnalytics functionality will support high-velocity, real-time data ingestion; high-volume storage; and real-time and batch analytics on the same data. The combination of imagery and GeoAnalytics functionality will support data dissemination, on-the-fly analysis, and batch analysis for large collections of imagery gathered by drone, aerial, and satellite sensors.
GeoEvent Extension is being enhanced to support high-velocity data ingestion and streaming, handling hundreds of thousands of events per second on similar infrastructure. To support spatiotemporal archiving, analysis, and visualization, we will also introduce a bundled spatiotemporal Big Data store in ArcGIS for Server based on distributed storage technology. This new data store will scale in capacity and throughput, leveraging additional infrastructure. In the area of large data visualization, we are working on a number of initiatives including smart mapping and visualization of Big Data with built-in aggregation and binning.
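The bookkeeping behind a high-velocity ingest rate can be sketched with a sliding window. This is a hypothetical single-process stand-in, not GeoEvent’s implementation — a distributed deployment spreads this work across nodes:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events in the trailing `window` seconds of a stream,
    e.g. to monitor events-per-second ingest rate."""

    def __init__(self, window=1.0):
        self.window = window
        self.times = deque()

    def record(self, t):
        """Register an event at timestamp t (seconds) and evict events
        that have aged out of the window."""
        self.times.append(t)
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()

    def rate(self):
        """Number of events currently inside the window."""
        return len(self.times)

counter = SlidingWindowCounter(window=1.0)
for t in [0.1, 0.2, 0.5, 0.9, 1.05, 1.4]:
    counter.record(t)
# events at 0.1 and 0.2 have aged out of the 1-second window by t=1.4
print(counter.rate())  # 4
```

The deque makes both the append and the eviction O(1) per event, which is what keeps this viable at hundreds of thousands of events per second.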
It is important to our users to maintain a familiar user experience, so we are making the installation of these Big Data extensions very simple, and the tools will still be invoked through familiar geoprocessing forms in ArcGIS Pro and ArcGIS for Server. In addition, the tasks can be scripted with ArcPy for batch analytics. In the case of GeoEvent Extension, the sources, sinks, and intermediate geoprocessing actions can be visually designed and distributed for execution.
In conclusion, we can now store massive geospatial data on less expensive hardware and scale horizontally to handle more volume. We are in the process of converting more of our geoprocessing tasks from serial processing on a single node to parallel processing across multiple nodes. Finally, we are taking advantage of advanced in-browser capabilities, such as 3D and local GPU processing, to render that massive data in a fluid and expressive way.