Grid systems are vital for analyzing large spatial data sets and partitioning areas of the Earth into identifiable grid cells. Keeping this in view, Uber developed the H3 Geospatial Indexing System, a grid system used for effectively visualizing and exploring spatial data and optimizing ride pricing and dispatch.
Let’s learn the nuances of H3 from Isaac Brodsky, co-founder of Unfolded. With a background in computer science and software engineering, he previously worked at Uber’s Marketplace, which is the heart of their operations: matching riders and drivers, dynamic and surge pricing, guidance to driver-partners, and so on. In short, all the things that affect what people call marketplace dynamics, or supply and demand.
Isaac ran some large Elasticsearch clusters and provided geospatial data to understand those markets. Dynamic pricing needs spatial information, and that’s how he came into contact with the H3 project. Initially, he led the project from the data side, and then from the publishing and open source side.
What kind of grid system is H3?
The academic term for H3 is a discrete global grid system. It’s a discrete system, breaking up the world into discrete cells. Every position in the world has a cell identifier associated with it. It’s global. Whether you’re at the North Pole, the South Pole, the anti-meridian, in the middle of the Pacific Ocean, or somewhere in Texas, it will cover you.
H3 is also a system of grids. It’s not a single grid at one resolution (cell size), but a system of grids at all of the resolutions. All 16 resolutions of the grid relate to each other; the resolution 10 grid, for example, is created by subdividing the resolution 9 grid in the H3 system, and so forth.
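As a rough illustration of how the 16 resolutions build on each other, the sketch below counts the cells at each resolution. It assumes the layout published by the H3 project: 122 cells at resolution 0 (110 hexagons and 12 pentagons), with each hexagon subdividing into seven children and each pentagon into one pentagon and five hexagons. The function name `cell_count` is only for illustration and is not part of any H3 binding.

```python
def cell_count(resolution):
    """Count the cells at a given H3 resolution (0 through 15)."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions run from 0 to 15")
    pentagons = 12   # assumed constant at every resolution
    hexagons = 110   # resolution 0: 122 base cells minus 12 pentagons
    for _ in range(resolution):
        # Each hexagon has 7 hexagon children; each pentagon has
        # 1 pentagon child and 5 hexagon children.
        hexagons = 7 * hexagons + 5 * pentagons
    return hexagons + pentagons


for res in (0, 1, 9, 10):
    print(res, cell_count(res))
```

Each step down the hierarchy multiplies the number of cells by roughly seven, which is why moving from resolution 9 to resolution 10 gives a much finer grid.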
Why did you choose hexagons for H3?
It throws people off when we say that H3 is a hierarchical grid, plus it’s hexagons.
In H3, we subdivide each cell into seven smaller cells. If you try to draw that out, you’ll realize it doesn’t fit right. We had to make some compromises to get that to work. So why hexagons, when they force us to make these trade-offs and come with some odd edge cases?
We needed a shape that we can tile over the entire Earth. There are some pentagons in the H3 system, but we put them in out-of-the-way places where we hopefully won’t have to deal with them. We also need to model things moving around in the actual world. That’s the key use case Uber had for it. Hexagons have this neat property: all the neighbors are the same distance apart.
For squares, four neighbors share an edge with the square. But the other four neighbors share only a corner point with it, and those neighbors are at a different distance, center to center, from the original square. With hexagons, all the neighbors are the same distance apart. This makes it much more convenient to run a variety of different algorithms on top of the grid.
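The neighbor-distance property can be checked with a little geometry. This is a minimal sketch on a flat plane, not on the H3 grid itself: squares are laid out on integer offsets, and hexagons use axial coordinates with pointy-top orientation. Both function names are made up for this example.

```python
import math


def square_neighbor_distances(side=1.0):
    """Center-to-center distances from a square cell to its 8 neighbors."""
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0)]
    return [math.hypot(dx * side, dy * side) for dx, dy in offsets]


def hexagon_neighbor_distances(size=1.0):
    """Center-to-center distances from a hexagon to its 6 neighbors.

    `size` is the center-to-corner distance of a pointy-top hexagon;
    neighbors are given as axial (q, r) offsets.
    """
    axial_offsets = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
    distances = []
    for q, r in axial_offsets:
        x = size * math.sqrt(3) * (q + r / 2)
        y = size * 1.5 * r
        distances.append(math.hypot(x, y))
    return distances


# Squares yield two distinct distances (edge and corner neighbors);
# hexagons yield a single distance.
print(sorted({round(d, 6) for d in square_neighbor_distances()}))
print(sorted({round(d, 6) for d in hexagon_neighbor_distances()}))
```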
How are you managing the common problems associated with hexagons?
We can’t solve those entirely. But we can help the users avoid the problematic cases.
When you convert a point, a polygon, or any other kind of shape to the H3 grid, you give H3 the coordinates plus a resolution, and it performs an indexing operation. This lets you work with the data inside the grid system. The indexing operation is exact at that resolution.
If I tell the system, “I want to index this point at resolution 8,” it’s going to give me precisely the resolution 8 cell that contains that point inside of H3, without being affected by any distortion. The difficulty happens when you want to move between different resolutions of the H3 grid.
When H3 subdivides cells, it slightly offsets them from one resolution to another. This causes a zigzag or jagged pattern around the edge of cells. When you move from one resolution to another, for instance when you truncate the precision of the resolution 8 index we got down to resolution 7, a bit of the cell area will be in error if you try to render that cell out. According to my calculations, about 7% to 8% of the area is in error.
Your analysis could be fine with this. Perhaps you got your data from GPS, and you’re not certain where exactly this point was, in which case you just continue from there.
For some use cases where you want to preserve those exact boundaries in cells, you have to be more careful to ensure that you don’t render the data at a different resolution than what you originally indexed it at. For example, you could truncate the precision of the index to save space when you’re storing data in a database. When you’re rendering it out, use the original resolution.
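To make the truncation step concrete, here is a minimal sketch of what moving an index to a coarser resolution involves at the bit level. It assumes the 64-bit index layout documented by the H3 project: four resolution bits starting at bit 52, followed lower down by fifteen 3-bit digits, with every digit finer than the cell’s resolution set to 7. The helper names are hypothetical; in practice you would call the parent function provided by your H3 binding.

```python
RES_SHIFT = 52         # assumed: resolution stored in bits 52-55
RES_MASK = 0xF
MAX_RES = 15
DIGIT_BITS = 3
UNUSED_DIGIT = 0b111   # assumed: digits below the cell's resolution are all ones


def resolution_of(index):
    """Read the resolution from a hexadecimal H3 index string."""
    return (int(index, 16) >> RES_SHIFT) & RES_MASK


def truncate_to_parent(index, parent_res):
    """Return the index of the containing cell at a coarser resolution."""
    h = int(index, 16)
    res = (h >> RES_SHIFT) & RES_MASK
    if not 0 <= parent_res <= res:
        raise ValueError("parent resolution must be between 0 and the cell's own")
    # Overwrite the resolution field.
    h = (h & ~(RES_MASK << RES_SHIFT)) | (parent_res << RES_SHIFT)
    # Mark every digit finer than the parent resolution as unused.
    for digit in range(parent_res + 1, MAX_RES + 1):
        h |= UNUSED_DIGIT << ((MAX_RES - digit) * DIGIT_BITS)
    return format(h, "x")


cell = "8928308280fffff"            # sample resolution 9 index string
print(resolution_of(cell))          # 9
print(truncate_to_parent(cell, 8))  # 8828308281fffff
```

The truncated identifier is exact as an identifier, but the area it covers does not line up exactly with the finer cells it was derived from, which is where the rendering error comes from.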
Can you share some examples of using H3?
Say you want to calculate the density of buildings or the number of buildings in an area or different areas. Using H3, we start by inputting the data, which are the polygons and any information associated with those buildings. H3 has functions that we use to convert this data from point or polygon form into the H3 grid.
These functions have bindings for several programming languages and different databases. It’s a bit of a do-it-yourself endeavor at the moment. Maybe you know Python, R or Java, or one of the other programming languages. If you’re using Postgres, Elasticsearch, or similar databases, you’ll be able to do it within those environments.
From there, select a function in H3 called polyfill. It takes an input polygon and finds the H3 indexes which cover it. Polyfill has an odd history in terms of how it was built. It makes some assumptions that you’d want to have exclusive coverage. In our case, it doesn’t sound like we want to have that, so we might index that at a little higher resolution than we normally would.
The result is a list of indexes for every single building. You can now use your usual off-the-shelf data processing tools.
This is one of the great things about H3. It takes these data processing tools that we would use for other kinds of data and applies them to geospatial data. It scales really well. If you wanted to use Pandas, Spark, or Python for this, you could do that. If you want to use an off-the-shelf database, you can use that because they’re able to treat these H3 identifiers as strings or integers and work with them efficiently.
When you have that, you have your data source that is built using the H3 system, and you can use it for analytics. You have mapped them from the H3 identifier to a metric value for that cell. You can do database lookups, and you can say, “I’m in this area, I want to know how many buildings there are around me.” It’s just a matter of checking where I am, finding the H3 identifier, and then looking that up in a database that is set up to map from the H3 identifier to that metric value.
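The aggregation and lookup steps described above can be sketched with nothing more than the Python standard library. The building IDs and cell identifiers below are placeholder strings standing in for the output of the indexing step, and `buildings_at` is a hypothetical helper, not an H3 function.

```python
from collections import Counter

# Hypothetical output of the indexing step: (building id, H3 cell id) pairs.
# A building that spans several cells appears once per cell.
building_cells = [
    ("bldg-1", "8928308280fffff"),
    ("bldg-2", "8928308280fffff"),
    ("bldg-3", "8928308280bffff"),
    ("bldg-3", "89283082807ffff"),
]

# Map each cell identifier to a metric value: the number of buildings in it.
# set() guards against the same pair being listed twice.
buildings_per_cell = Counter(cell for _, cell in set(building_cells))


def buildings_at(cell_id):
    """Look up the building count for the cell a user is standing in."""
    return buildings_per_cell.get(cell_id, 0)


print(buildings_at("8928308280fffff"))
```

Because the identifiers are plain strings (or integers), the same mapping could live in a database table, a Pandas DataFrame, or a Spark job without any geospatial extension.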
Next, you can also do visualization and send your list to a visualization tool, such as Kepler.gl or deck.gl, that can understand H3 identifiers. You can create a heat map of where these buildings are in Denmark.
Note that this is happening without converting the data between different formats. You’re able to do all these different things, analytics and modeling, lookup, search, spatial indexing functions, and visualization, all with the data being kept in the same format.
Is it possible to attach multiple values to the identifier?
H3 doesn’t have an opinion on attaching multiple values to the identifier. It doesn’t really know what you put next to it inside your database. You could create a database in a Big Data system where one of those columns is an H3 identifier. The other columns can be whatever you want them to be, within the constraints of that database. You can attach categorical or numerical metrics to it, or if you wish, attach original geometries (which might be a little bit expensive, but it’s certainly possible, especially for points).
H3 doesn’t have a constraint on it. What it’s handling is that single column that’s the spatial identifier, which you’re probably using as a primary key for that table. What you do with the rest of that table is up to you. You choose what you want to associate with that identifier.
How does the speed of H3 compare to that of an average radius lookup, a point-in-polygon test, or a “what is near me” query?
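As a small sketch of that table layout, the example below uses Python’s built-in `sqlite3` module with the H3 identifier as the primary key. The table name, column names, and values are all invented for illustration; any database that can store strings or integers would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cell_metrics (
        h3_cell        TEXT PRIMARY KEY,  -- the spatial identifier
        building_count INTEGER,           -- numerical metric
        land_use       TEXT,              -- categorical metric
        mean_height_m  REAL
    )
    """
)

# Hypothetical rows; the identifiers are placeholder strings.
rows = [
    ("8928308280fffff", 2, "residential", 7.5),
    ("8928308280bffff", 1, "commercial", 12.0),
]
conn.executemany("INSERT INTO cell_metrics VALUES (?, ?, ?, ?)", rows)


def metrics_for(cell_id):
    """Fetch every value attached to one cell, or None if the cell is absent."""
    cursor = conn.execute(
        "SELECT building_count, land_use, mean_height_m "
        "FROM cell_metrics WHERE h3_cell = ?",
        (cell_id,),
    )
    return cursor.fetchone()


print(metrics_for("8928308280fffff"))
```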
The answer to this really cuts into the use of H3 as a spatial index. If you’re only concerned about the very immediate surrounding, i.e., what is the cell you’re in, it compares favorably.
Let’s say you use a database that maps from keys to values, in this case from a hexagon key to a value. For that lookup operation, it is also generally going to be roughly constant time. If you wanted to do a neighborhood lookup, you might have to scan many different points and compute their distances from where you are now to where those points are.
Maybe you want to find things that are within a certain radius of you. H3 is very interested in the radius question. This, as we mentioned before, comes down to the equidistant neighbors, a very important property of the H3 grid that lets it find rings of cells around an origin. These rings approximate a circle as best as they can.
You can generate, let’s say, all the cells that are within a grid distance of two from you, and then do the lookups on those. This increases the complexity slightly because you need to look up these different cells that are near you. But you’re able to take advantage of the spatial index property to find nearby cells, and the hexagon property ensures that those nearby cells approximate a circle.
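The “cells within a grid distance of two” idea can be sketched on a generic flat hexagon grid using axial coordinates. This is not H3’s own implementation: it ignores pentagons and uses `(q, r)` tuples instead of H3 identifiers, and the names `hex_disk`, `grid_distance`, and `count_nearby` are made up for this example. H3 bindings provide their own functions for this.

```python
def grid_distance(a, b):
    """Number of hexagon steps between two axial coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def hex_disk(origin, k):
    """All axial coordinates within grid distance k of the origin."""
    q0, r0 = origin
    cells = []
    for dq in range(-k, k + 1):
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1):
            cells.append((q0 + dq, r0 + dr))
    return cells


def count_nearby(metric_by_cell, origin, k):
    """Sum a per-cell metric over the disk with one key lookup per cell."""
    return sum(metric_by_cell.get(cell, 0) for cell in hex_disk(origin, k))


# Hypothetical per-cell building counts keyed by axial coordinate.
buildings = {(0, 0): 3, (1, -1): 2, (2, 0): 4, (5, 5): 10}
print(len(hex_disk((0, 0), 2)))
print(count_nearby(buildings, (0, 0), 2))
```

The number of lookups depends only on the ring size (1 + 3k(k + 1) cells on a pentagon-free grid), not on how many points are stored in the database.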
Is it possible to transform image data, such as raster data, into the grid?
It is possible. I just saw an example where someone placed elevation data into the H3 grid. As you can imagine, like when we talked about polygons earlier, similar processes can be used for an extensive range of data.
There are different ways to get a raster into the H3 grid. One is by sampling from the raster. We take an H3 grid, and we find a bunch of points we want to sample from the raster. Then we use that to move that raster data into the grid.
Another way is by taking all the pixels in that raster and finding the associated H3 cell and moving the data in that way.
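Both approaches can be sketched in a few lines. In this sketch `to_cell` and `read_pixel` are parameters you would supply yourself: the first stands for the point-indexing function of an H3 binding, the second for whatever reads a value from your raster. The `toy_to_cell` function is only a stand-in that snaps coordinates to whole degrees so the example runs without any dependencies; it does not produce H3 cells.

```python
import math
from collections import defaultdict


def sample_raster(cell_centers, read_pixel):
    """Sampling approach: read one raster value at each cell's center.

    cell_centers maps a cell id to a (lat, lng) pair.
    """
    return {cell: read_pixel(lat, lng)
            for cell, (lat, lng) in cell_centers.items()}


def bin_pixels(pixels, to_cell):
    """Pixel approach: assign every pixel to a cell and average per cell.

    pixels is an iterable of (lat, lng, value) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lng, value in pixels:
        cell = to_cell(lat, lng)
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


def toy_to_cell(lat, lng):
    """Stand-in for an H3 indexing call; NOT an H3 cell."""
    return (math.floor(lat), math.floor(lng))


# Hypothetical elevation pixels: (lat, lng, metres).
pixels = [(55.1, 12.2, 10.0), (55.4, 12.7, 20.0), (56.3, 12.1, 5.0)]
print(bin_pixels(pixels, toy_to_cell))
```

Averaging is just one choice here; depending on the data, a sum, maximum, or majority value per cell may be more appropriate.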
What is the advantage of using H3 with spatial analytics and visualization?
There are a few advantages that H3 gives us. There’s a tremendous advantage of having a geospatial analytics system that’s grid integrated throughout the entire system. We talked earlier about the example of how you may need to index, process, and visualize the data. A lot of the power and benefit of having data in a discrete global grid system like this comes from an integrated experience.
We’re hoping to build an experience at Unfolded, where data on the back end, on the front end, and in between, is all in a consistent geospatial format. This lets us work with all the different Big Data tools, and even small data tools, on the data.
Maybe you have some zip code data if you’re in the US, data in another grid system, or even a raster. You still want to be able to work with these different formats. Getting data in and out of H3 is flexible, as we have seen. One of the key things that we’re able to do at Unfolded, because we have this H3 grid system, is to unify data spatially in a way that’s very difficult with other technologies. We’re able to project data between geometries.
As a good example, in the US, we have customer data in zip code format, and we have demographics data in a format we’d get from the US Census Bureau. They use different geometries which the H3 grid joins together. We can also bring in raster, GPS tracks and all these different data and unify them into a simple analysis that the user is doing. It makes it a lot easier to access geospatial data.
Are you converting everything into hexagons, and then treating them as independent geospatial layers?
Yes, and the ability to convert data back out of hexagons is just as critical. This allows you to transform from, say, US Census geographies to zip codes. We’re able to use a hexagon grid to transform remarkably efficiently from one polygon geometry to another. Users can do their analysis in a geometry that they’re comfortable with and that they find useful.
Can you think of a few examples when H3 is not the right choice?
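One simple way to picture this kind of transformation is to spread each source value over the cells that cover its polygon, then add the cells back up for each target polygon. The sketch below does exactly that under a strong assumption: that the value is distributed evenly across a source polygon’s cells. It is an illustration only, not a description of how Unfolded implements it, and all names and identifiers are hypothetical.

```python
from collections import defaultdict


def project_through_cells(source_values, source_cells, target_cells):
    """Move values from one set of polygons to another via shared grid cells.

    source_values: {source id: value}
    source_cells:  {source id: [cell ids covering that polygon]}
    target_cells:  {target id: [cell ids covering that polygon]}
    """
    # Step 1: spread each source value evenly over its covering cells.
    cell_values = defaultdict(float)
    for source_id, value in source_values.items():
        cells = source_cells[source_id]
        for cell in cells:
            cell_values[cell] += value / len(cells)
    # Step 2: sum the per-cell values inside each target polygon.
    return {
        target_id: sum(cell_values.get(cell, 0.0) for cell in cells)
        for target_id, cells in target_cells.items()
    }


# Hypothetical customers per zip code and the cells covering each polygon.
customers = {"zip-A": 100, "zip-B": 60}
zip_cells = {"zip-A": ["c1", "c2", "c3", "c4"], "zip-B": ["c5", "c6"]}
tract_cells = {"tract-X": ["c1", "c2", "c3", "c5"], "tract-Y": ["c4", "c6"]}

print(project_through_cells(customers, zip_cells, tract_cells))
```

The finer the resolution used for the covering cells, the closer the cell boundaries follow the original polygons, at the cost of more cells to process.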
All grid systems have some trade-offs. We make decisions when we implement a grid system in terms of the projection that we use, the cell shape, and the cell subdivision.
The one that jumps to my mind is the hexagon hierarchy issue that we touched on earlier. If you have a use case that requires very exact containment, perhaps with parcel data or political data where approximate containment isn’t acceptable, and you still need to do this truncation, you might not want to use H3, or it might be more complicated to use it.
H3 tries to be roughly equal-ish area, but it’s not an equal area system. If you have use cases that require this kind of property, it might be something where H3 is not a good fit for you.
Besides Uber, which is your favorite use case of H3?
That’s a difficult question. There are so many ways to use it that I’ve seen, like wildfire analysis or urban planning. It’s an open-source project, and we don’t necessarily always get feedback on how people are using it. If you have an interesting use case, if you’re using H3 in different fields, I’d love to know what you’re using it for, and how you’re using it. It’s beneficial to understand what people are doing with it when we’re thinking about maintaining the project.
The use case that appealed to me the most is in the gaming and gamification space. I’ve heard of a couple of ideas around this. One of them is a data provider that’s gamifying data collection for some of their users. There’s just something timeless about game development. Anything that’s building games on top of H3 appeals to me. Think of the successes of Pokémon Go and other augmented reality games. It’s just a fun use case, if nothing else, to see H3 pop up as part of a game system.
Geospatial integrated into games as well as in the hands of everyday people! That’s just fantastic! So, what would you use H3 for?