What is Geospatial data and basics about getting started with geospatial data analysis

Pallawi
17 min read · Aug 27, 2021

What is Geospatial data?

Geospatial data is data about objects, events, or phenomena that have a location on the earth's surface. It is data that has a geographic aspect to it.

Geospatial data combines location information (usually coordinates on the earth), attribute information (the characteristics of the object, event, or phenomena concerned), and often also temporal information (the time or life span at which the location and attributes exist).

The location may be static in the short-term (e.g., the location of a road, an earthquake event, children living in poverty), or dynamic (e.g., a moving vehicle or pedestrian, the spread of an infectious disease).

What are the different types of geospatial data?

There are two primary types of geospatial data, Vector data and Raster data.

Vector data — The three basic types of vector data are points, lines, and polygons (areas). Each point, line and polygon has a spatial reference frame such as latitude and longitude. Each vector feature has attribute data that describes it. Point geometries are made up of a single vertex (X, Y and optionally Z). Polyline geometries are made up of two or more vertices forming a connected line. Polygon geometries are made up of at least four vertices forming an enclosed area. The first and last vertices are always in the same place. Choosing which geometry type to use depends on scale, convenience and what you want to do with the data in the GIS.
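
To make the three geometry types concrete, here is a minimal sketch in plain Python (no GIS library) that stores each geometry as coordinate tuples. The coordinates, the attribute values and the helper name is_closed_ring are made up for illustration; a real project would normally rely on a library such as Shapely or GeoPandas.

```python
# Vector geometries as plain (x, y) tuples, with x = longitude, y = latitude.
# All coordinates and attributes below are invented for illustration.

point = (77.59, 12.97)                                        # a single vertex
polyline = [(77.59, 12.97), (77.60, 12.98), (77.62, 12.98)]   # two or more vertices
polygon = [                                                   # closed ring: first == last
    (77.58, 12.96),
    (77.62, 12.96),
    (77.62, 12.99),
    (77.58, 12.96),
]

def is_closed_ring(ring):
    """Return True if the ring has at least four vertices and closes on itself."""
    return len(ring) >= 4 and ring[0] == ring[-1]

# Each vector feature pairs a geometry with attribute data that describes it.
feature = {
    "geometry_type": "Polygon",
    "coordinates": polygon,
    "attributes": {"name": "Sample park", "category": "green space"},
}

print(is_closed_ring(polygon))   # True
print(is_closed_ring(polyline))  # False
```

The polygon is the smallest valid case: three distinct corners plus a repeat of the first vertex to close the ring.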

Vector data is usually smaller in size than raster data, but when the geospatial analysis requires the data in image form, it has to be saved in raster format.

Representation of points, polylines and Polygon- Vector

Raster data: pixelated (or gridded) data, stored as rows and columns of cells, where each pixel is associated with a specific geographical location. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. Examples of rasters are digital aerial photographs, imagery from satellites and digital pictures. This is my favourite website for understanding, with an example, how raster data is geospatial data.

Raster datasets are potentially very large. Resolution increases as the size of the cell decreases; however, normally cost also increases in both disk space and processing speeds.
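
The sketch below illustrates, under simplifying assumptions, how the spatial information of a raster ties each pixel to a location, and why a smaller cell size costs more storage. The origin, the cell size and the function names are invented for the example, and the raster is assumed to be north-up with square cells.

```python
# A tiny 3 x 4 raster: rows and columns of cell values (e.g. elevation in metres).
raster = [
    [10, 12, 13, 15],
    [11, 13, 14, 16],
    [12, 14, 15, 18],
]

# Spatial information that ties the grid to the ground (assumed values):
ORIGIN_X = 77.50      # longitude of the top-left corner of the raster
ORIGIN_Y = 13.00      # latitude of the top-left corner of the raster
CELL_SIZE = 0.25      # cell width and height in degrees

def cell_centre(row, col):
    """Return the (longitude, latitude) of the centre of cell (row, col)."""
    lon = ORIGIN_X + (col + 0.5) * CELL_SIZE
    lat = ORIGIN_Y - (row + 0.5) * CELL_SIZE   # rows count downwards from the top
    return lon, lat

def cells_needed(extent_degrees, cell_size):
    """Number of cells needed to cover a square extent at a given cell size."""
    per_side = round(extent_degrees / cell_size)
    return per_side * per_side

print(cell_centre(0, 0))          # (77.625, 12.875)
print(cells_needed(1.0, 0.25))    # 16
print(cells_needed(1.0, 0.125))   # 64
```

Halving the cell size quadruples the number of cells for the same area, which is where the extra disk space and processing time come from.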

What format is the industry using to create and share the geospatial dataset?

Shapefile:

A shapefile is a digital vector storage nontopological format for storing the geometric location and attribute information of geographic features. Topology expresses the spatial relationships between connecting or adjacent vector features (points, polylines and polygons) in a GIS.

Topology of London Underground Network.

Imagine we travel to London. We plan to visit St. Paul’s Cathedral first and, in the afternoon, Covent Garden Market for some souvenirs on a sightseeing tour. Looking at the Underground map of London, we have to find connecting trains to get from Covent Garden to St. Paul’s. This requires topological information (data) about where it is possible to change trains. On a map of the Underground, the topological relationships are illustrated by circles that show connectivity.

The shapefile is a compact format that is based on tabular thinking. At a minimum, three files make up a shapefile. For example, a water GIS data set must have the files water.shp, water.shx, and water.dbf. The .shp file is the main file and stores the feature geometries. The .shx file is the index file, used to find information in the main file (kind of like a reference for looking up data). The .dbf file is a dBASE table that contains the attribute information (a table with more info about each feature). Each component file of a shapefile is limited to 2 GB. The format lacks a time or timestamp data type. The shapefile format is about 30 years old.
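
As a small illustration of the three-file rule, the sketch below derives the mandatory companion files from the path of a .shp file and reports which ones are missing on disk. The function names are made up for this example; they are not part of any GIS library.

```python
from pathlib import Path

# The three mandatory parts of a shapefile, as described above.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")

def required_parts(shp_path):
    """Return the file names that must exist for the given .shp file."""
    base = Path(shp_path)
    return [base.with_suffix(ext).name for ext in REQUIRED_EXTENSIONS]

def missing_parts(shp_path):
    """Return the mandatory parts that do not exist on disk."""
    base = Path(shp_path)
    return [
        base.with_suffix(ext).name
        for ext in REQUIRED_EXTENSIONS
        if not base.with_suffix(ext).exists()
    ]

print(required_parts("water.shp"))  # ['water.shp', 'water.shx', 'water.dbf']
```

Running missing_parts before sharing a dataset is a cheap way to catch the common mistake of sending only the .shp file.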

GeoJSONs:

GeoJSON is an open standard geospatial data interchange format that represents simple geographic features and their nonspatial attributes. By nonspatial attributes I mean nonspatial information about a geographic feature. For example, the attributes of a river might include its name, length, and sediment load at a gauging station.

The core idea is to provide a specification for encoding geospatial data while remaining decodable by any JSON decoder. Because GeoJSON is a subset of the immensely popular JSON, its parsing support is on a different level than that of Shapefile. In addition to support from most GIS software, any web developer will be able to write a custom GeoJSON parser, opening new possibilities for integrating the data.
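
The following sketch shows that point in practice: a GeoJSON feature is built as an ordinary dictionary and then round-tripped through Python's standard json module, with no GIS library involved. The river name, coordinates and attribute values are invented for illustration.

```python
import json

# A GeoJSON Feature: a geometry plus nonspatial attributes ("properties").
# The river name, coordinates and values are made up for illustration.
river = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        # GeoJSON positions are [longitude, latitude]
        "coordinates": [[77.10, 28.70], [77.20, 28.60], [77.30, 28.50]],
    },
    "properties": {"name": "Sample River", "length_km": 42.0},
}
collection = {"type": "FeatureCollection", "features": [river]}

# Serialise to text, then decode it again with the ordinary JSON decoder.
text = json.dumps(collection)
decoded = json.loads(text)

first = decoded["features"][0]
print(first["geometry"]["type"])      # LineString
print(first["properties"]["name"])    # Sample River
```

The text produced here can be pasted into a viewer such as geojson.io to see the line on a map.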

The geometry types supported by GeoJSON are Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon. It is worth reading more about these geometry types. GeoJSON is about 10 years old. If the size of a GeoJSON file exceeds some limit (~10–11 GB in my experience), the features might get written incompletely, making the file corrupt.

My favourite easy to use web application to view geojson is — geojson.io.

GeoPackage:

GeoPackage was developed by the Open Geospatial Consortium (OGC) as an open-standard alternative to Shapefile; the first version of the standard was published in 2014. It is built on SQLite, a lightweight SQL implementation designed for stand-alone databases. Similar to GeoJSON, this makes GeoPackage highly compatible by design, and accessible by non-GIS software as well.

Internally, GeoPackage stores geometries in a binary format based on Well-known binary (WKB). Shapefile, in contrast, uses its own binary geometry encoding. Interestingly, GeoPackage also provides support for storing raster data.

Everything is contained in a single file, so file management is easier with GeoPackage than with Esri Shapefile. GeoPackages are generally ~1.1–1.3x smaller in file size than shapefiles and almost 2x smaller than GeoJSON files. Since the vector layers in a GeoPackage are typically R-tree indexed (spatial indexing), loading the file in QGIS or querying the file database is fast. There is no practical limit on the file size, and it can handle a large number of features in smaller file sizes. A single GeoPackage file can have multiple vector layers, with each layer having a different geometry type.
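
Because a GeoPackage is an SQLite database, its list of layers can be read with plain SQL. The sketch below is only a toy: it builds an in-memory database with a simplified gpkg_contents table (the real table has more columns) rather than opening a real file, and the layer names are invented. With a real GeoPackage you would connect to the file path instead of ":memory:".

```python
import sqlite3

# Toy stand-in for a GeoPackage: an in-memory SQLite database with a
# simplified gpkg_contents table. This is NOT a valid GeoPackage; it only
# illustrates that the container is an ordinary SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gpkg_contents ("
    "table_name TEXT PRIMARY KEY, data_type TEXT, identifier TEXT)"
)
conn.executemany(
    "INSERT INTO gpkg_contents VALUES (?, ?, ?)",
    [
        ("roads", "features", "Road centre lines"),        # a line layer
        ("buildings", "features", "Building footprints"),  # a polygon layer
    ],
)

# List the vector layers stored in the database, using nothing but SQL.
layers = [
    row[0]
    for row in conn.execute(
        "SELECT table_name FROM gpkg_contents "
        "WHERE data_type = 'features' ORDER BY table_name"
    )
]
print(layers)  # ['buildings', 'roads']
conn.close()
```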

Examples of geospatial data:

Vectors and attributes, Point clouds, raster and satellite imagery, Census data, Cell phone data, Computer-aided design (CAD) images of buildings, Social media data.

All of the above examples can be shaped into either vector or raster geospatial data.

Tools and python packages that can help you get started with the Geospatial data:

The geographic information system (GIS) analysis tool that we can use to create, visualise and modify geospatial data is QGIS. QGIS has ample tutorials to get started. There are many other tools, but I have mentioned this one as it is widely used in the map-making industry. You can also try ArcGIS by Esri.

The Python packages that can help you get started with exploring geospatial vector and raster data are Shapely, NetworkX, GeoPandas, RSGISLib, GDAL/OGR, ArcPy, PyProj, ipyleaflet, pymap3d, geoplot (geospatial data visualization) and Fiona. You may also want to read this collection of tools: Vector data analysis; Machine learning for spatial data science.

Map Coordinate Systems

The field of study that measures the shape and size of the Earth is geodesy. Geodesists use coordinate reference systems such as WGS84, NAD27 and NAD83. In each coordinate system, geodesists use mathematics to give each position on Earth a unique coordinate.

A geographic coordinate system defined by two-dimensional coordinates on the Earth’s surface has X-coordinates, which are the longitudes, and Y-coordinates, which are the latitudes. Triplets (X, Y, Z) additionally contain a Z-value, which generally refers to the elevation at that point location.

Lines of longitude have X-coordinates between -180 and +180 degrees.

The Greenwich Meridian (or prime meridian) is a zero line of longitude from which we measure east and west. Positive longitudes are east of the prime meridian, and negative ones are west. In fact, the zero line passes through the Royal Observatory in Greenwich, England. In a geographical coordinate system, the prime meridian is the line that has 0° longitude.

The Greenwich Meridian separates east from west in the same way that the Equator separates north from south.

Lines of latitudes have Y-values that are between -90 and +90 degrees.

The equator is where we measure north and south. Everything north of the equator has positive latitude values. Whereas, everything south of the equator has negative latitude values.
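
The ranges and sign conventions above can be captured in a few lines. This is a minimal sketch; the function name and the sample coordinates are chosen only for illustration.

```python
def describe_position(lon, lat):
    """Validate a (longitude, latitude) pair and name its hemispheres."""
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude {lon} is outside [-180, 180]")
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude {lat} is outside [-90, 90]")
    # Positive longitudes are east of the prime meridian, negative ones west.
    east_west = "east" if lon > 0 else "west" if lon < 0 else "on the prime meridian"
    # Positive latitudes are north of the equator, negative ones south.
    north_south = "north" if lat > 0 else "south" if lat < 0 else "on the equator"
    return east_west, north_south

print(describe_position(77.59, 12.97))    # ('east', 'north')
print(describe_position(-43.17, -22.91))  # ('west', 'south')
```

A check like this also catches the very common mistake of swapping latitude and longitude, at least whenever the swapped latitude falls outside the -90 to +90 range.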

There are different types of heights to be aware of when referring to a vertical datum.

The Earth’s interior differs in density everywhere, which means that gravity varies from place to place. This is why we measure gravity and model its equipotential surface mathematically: it is the surface on which water would settle under gravity alone. This surface, the geoid, closely approximates mean sea level and gives a true zero surface for measuring elevations. The direction of gravity is perpendicular to the geoid everywhere. The geoid is not a smooth, regular shape: because of the varying densities in the Earth in different places, it has lumps or undulations that differ from place to place.

The semi-major axis of the ellipsoid is the equatorial radius. The semi-minor axis runs from the centre to the poles.

The shape of the Earth is an ellipsoid, sometimes referred to as a spheroid. Ellipsoid is the regular geometric shape that most nearly approximates the shape of the Earth. Since the Earth is flattened at the poles and bulges at the Equator, geodesy represents the figure of the Earth as an oblate spheroid. Scientists have developed several ellipsoidal models of the Earth over the years. The most well-known of these serve as the basis for the WGS84 coordinate reference system. Reference ellipsoids are primarily used as a surface to specify point coordinates such as latitudes (north/south), longitudes (east/west), and elevations (height).
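
To put numbers on the flattening, the sketch below derives the polar radius of the WGS84 ellipsoid from its two defining constants (the equatorial radius and the inverse flattening). The variable names are my own; the two constants are the published WGS84 values.

```python
# WGS84 defining parameters (published constants of the reference ellipsoid).
SEMI_MAJOR_AXIS_M = 6378137.0         # equatorial radius a, in metres
INVERSE_FLATTENING = 298.257223563    # 1 / f

flattening = 1.0 / INVERSE_FLATTENING
# Polar radius (semi-minor axis) b follows from b = a * (1 - f).
semi_minor_axis_m = SEMI_MAJOR_AXIS_M * (1.0 - flattening)

# How much shorter the polar radius is than the equatorial radius.
bulge_km = (SEMI_MAJOR_AXIS_M - semi_minor_axis_m) / 1000.0

print(round(semi_minor_axis_m, 3))  # 6356752.314
print(round(bulge_km, 1))           # 21.4
```

The polar radius comes out about 21 km shorter than the equatorial radius, which is the "flattened at the poles" effect described above.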

Ellipsoid gives a smooth surface without bumps or irregularities.

Orthometric height is the height between the Earth’s surface and the geoid at a specific point. For example, when we take the height at the peak of a mountain, it is an orthometric height, measured as the distance between the surface and the geoid.

What do we mean by layers in a map?

Layers are mechanisms used to display geographic datasets on maps. They contain groups of points, lines, or area (polygon) features and define how a geographic dataset is symbolized on a map.

In layman’s terms, a layer refers to the visualization of something on a map. For example, a layer of points might be used to show the start points of all Uber trips in a city, or a layer of arcs can be used to show all trips made by an airline in a day.

The map layer forms the fundamental unit when doing analysis on maps. Not only does it make data expression clearer and more intuitive, it also makes the overlay of geographic data possible. Visualizing and seeing the distribution of data in each region makes it easier to mine for deeper and more specific information and to make better decisions.

Why do we need Map Layers?

1. Adds context to the maps

Layers make maps more contextual and help you focus on specific aspects like assets, roads, road surface markings and points of interest. Map layers make it easier for you to work on a specific set of objects in your map. For example, using map layers you could pinpoint and evaluate only a handful of points or buildings in an area for your use case.

2. Helps you detect change faster

Maps are, after all, a representation of what is happening on the ground. With a map layer, you can see how various measures and metrics change over time. For example, geographic scientists use layers to study changes in land cover or to monitor the quality of traffic signs and their maintenance requirements over time.

Types of Map Layers

If you’ve ever looked up an address on Google Maps or a real estate software program, you’ve used a form of a geographic information system or GIS. A geographic information system is an application for collecting, organizing, storing, analyzing and distributing spatial data about geographical locations and features.

If you’re a real estate professional, using a GIS system can streamline many common tasks related to geographic data. For instance, a realtor showing a client the kind of school and shopping opportunities in a neighbourhood can instantly call up a map with the relevant information by using GIS technology.

A map layer is a GIS database containing groups of point, line, or area (polygon) features representing a particular class or type of real-world entities such as customers, streets, or postal codes. A layer contains both the visual representation of each feature and a link from the feature to its database attributes.

Combine a point layer of landmarks, a line layer of streets, and an area layer of ZIP Codes to create the GIS map shown on the right

Point

The point layer draws points for a given event or object based on its location coordinates: longitude and latitude. Point layers are especially useful for displaying data with a wide geographic distribution, and they help you position your point events on a map precisely and quickly. The size of the points in your layer can be fixed, or you can specify a measure or expression to set the sizes of the different points.

Here Maps — Pedestrian Roadway Casualties in Florida

Points are a good choice for pinpointing locations or points of interest on a map. For example, point maps can be used to detect and track events such as road accidents, to show the locations of polling stations in a city, or to mark Covid-affected areas and their levels of severity.

Lines

A line layer enables you to display lines between points on your map. You can use either LineString or a MultiLineString to represent this layer on the map.

Here maps — Chicago Bike Map

When to use:

Line layers can be used to represent routes between cities, to show traffic conditions in a city over a whole day and later help manage traffic, to classify roads dedicated to particular vehicles such as heavy vehicles, ambulances or bicycles, or to filter routes from your home to all vaccination centres within a range of 5 km.

Polygons

The polygon layer is used to represent the boundaries of lakes, cities, countries or regions dedicated to some cause. Polygon maps bring fine granularity to maps and hence are very helpful for area-wise or city-wise analysis. Usually, polygon maps are useful for looking into a particular neighbourhood or area. In polygon maps, each area is shaded according to a prearranged key, with each shade representing a range of values.

You can also enjoy this map. This is climate-projections by Here Maps.

Hexabins

Area level polygons are much more practical but still broad. They don’t have uniform shapes or sizes and are subject to changes very frequently. Even the zones or clusters drawn by the operations teams’ local knowledge require updating and have arbitrary edges.

Sometimes making sense of your spatial data and deriving precise insights requires these analyses to become more granular and uniform.

The grid system brings that fine granularity to the table. It works brilliantly in bucketing all your lat-longs into “cells”. These cells can also be clustered to represent a particular neighbourhood or area and can be aggregated at different levels.

Hence, this system becomes critical to crunch large spatial data sets to match the supply & demand fragmented across the city.
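
Here is a minimal sketch of that bucketing idea using a plain square grid. The cell size, the pickup coordinates and the function name cell_id are assumptions made for the example; production systems typically use an established grid system rather than a hand-rolled one.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.01  # assumed cell size in degrees (roughly 1 km in latitude)

def cell_id(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a lat-long to the (row, col) index of the square cell containing it."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

# Invented pickup locations (lat, lon).
pickups = [
    (12.9712, 77.5946),
    (12.9718, 77.5949),
    (12.9781, 77.5912),
    (12.9352, 77.6245),
]

# Count how many points fall in each cell.
demand_per_cell = Counter(cell_id(lat, lon) for lat, lon in pickups)
for cell, count in demand_per_cell.most_common():
    print(cell, count)
```

Aggregating at a coarser level only requires a larger cell size, for example cell_id(lat, lon, cell_size=0.1).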

Image source- Triangular grids, Square grids and Hexagonal grids

Square grids-

The most common application of square grids occurs in raster datasets and geohashes. For the scope of this piece, we will focus on geohashes.

Geohash is a hierarchical data structure that transforms a 2D spatial point (lat & long) into a short string of letters and digits. It divides the world into a grid of 32 cells with 4 rows and 8 columns.

You can keep splitting each cell into a further grid of 32 cells. Thus, the longer the geohash string, the higher the accuracy! You can also easily identify whether geohashes are close together by checking for a common prefix: the longer the common prefix, the closer they are. The reverse does not always hold, because two nearby points on opposite sides of a cell boundary can have very different geohashes.

For example, the coordinate pair (57.64911, 10.40744), near the tip of the peninsula of Jutland, Denmark, produces the hash u4pruydqqvj.
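
The encoding can be sketched in a few lines of plain Python. This follows the standard geohash scheme (alternately halve the longitude and latitude ranges, emit one bit per halving, and turn every 5 bits into one base-32 character); the function names are my own, and the sketch is meant for understanding rather than as a replacement for a tested library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=11):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1   # upper half: emit 1 and keep the upper half
            rng[0] = mid
        else:
            bits = bits << 1         # lower half: emit 0 and keep the lower half
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:           # every 5 bits select one of 32 characters
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def common_prefix_length(a, b):
    """Number of leading characters two geohashes share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

print(geohash_encode(57.64911, 10.40744))               # u4pruydqqvj
print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

A shorter hash is always a prefix of the longer one for the same point, which is what makes prefix comparison a cheap proximity test.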

Triangular grids-

Triangular grids are not very commonly used. Besides their unfamiliarity, one of the reasons is that they have a large perimeter relative to their area, which makes it harder to piece them together on the map.

Another reason is that each triangle is connected to only three adjacent triangles, which limits the number of options to move and to make connections.

Source Uber — Connectivity to a neighbouring grid cell from the centre of a cell and different types of distances from a cell centre point to another.

This diagram shows the distance of the centre of triangles, squares & hexagons to its neighbours.

A triangle has three kinds of distance (through the edge, vertex and across the centre of the edge), a square has two (across the edge & the diagonal) and a hexagon only has one — another reason why triangles are not really favoured.

This property of hexagons makes it very easy to perform analysis and is preferred when your analysis includes aspects of connectivity or movement.
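
The difference between squares and hexagons can be checked numerically. The sketch below computes the centre-to-centre distances from one cell to all of its neighbours; the hexagon layout (pointy-top cells addressed by axial coordinates) and the function names are choices made for this example.

```python
import math

def square_neighbour_distances():
    """Centre-to-centre distances from a unit square cell to its 8 neighbours."""
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    return sorted({round(math.hypot(dx, dy), 6) for dx, dy in offsets})

def hexagon_neighbour_distances(size=1.0):
    """Centre-to-centre distances from a hexagon to its 6 neighbours.

    Uses axial coordinates (q, r) for pointy-top hexagons with circumradius `size`.
    """
    axial_offsets = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
    distances = set()
    for q, r in axial_offsets:
        x = size * math.sqrt(3) * (q + r / 2)   # axial -> Cartesian x
        y = size * 1.5 * r                      # axial -> Cartesian y
        distances.add(round(math.hypot(x, y), 6))
    return sorted(distances)

print(square_neighbour_distances())   # [1.0, 1.414214]
print(hexagon_neighbour_distances())  # [1.732051]
```

The square grid yields two distinct distances (edge and diagonal neighbours), while every hexagon neighbour is the same distance away.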

Hexagonal grids-

Hexagons are more symmetric than geohashes. They are very close to circles in terms of shape to provide a more accurate sampling.

A hexagonal arrangement is the densest way to pack circles, and a hexagonal tessellation reduces edge effects. The more similar a polygon is to a circle, the closer the points near the border area are to the centre. Thus, points inside a hexagon are, on average, closer to its centre than points inside an equal-area square or triangle.
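
One simple way to see this is to compare how far the farthest point (a vertex) lies from the centre for regular shapes of equal area. This sketch uses the standard area formulas for regular polygons; the variable names are my own, and the circle is included only as the ideal reference.

```python
import math

AREA = 1.0  # compare shapes of equal area

# Distance from the centre to the farthest point (a vertex) of each regular shape.
# Equilateral triangle: area = (sqrt(3) / 4) * a^2, circumradius = a / sqrt(3)
triangle_side = math.sqrt(4 * AREA / math.sqrt(3))
triangle_reach = triangle_side / math.sqrt(3)

# Square: area = a^2, circumradius = a * sqrt(2) / 2
square_reach = math.sqrt(AREA) * math.sqrt(2) / 2

# Regular hexagon: area = (3 * sqrt(3) / 2) * a^2, circumradius = a
hexagon_reach = math.sqrt(2 * AREA / (3 * math.sqrt(3)))

# Circle, for reference: area = pi * r^2
circle_reach = math.sqrt(AREA / math.pi)

for name, reach in [("triangle", triangle_reach), ("square", square_reach),
                    ("hexagon", hexagon_reach), ("circle", circle_reach)]:
    print(f"{name:8s} {reach:.3f}")
```

For the same area, the hexagon's corners sit closer to the centre than those of the square or the triangle, and closest to the circle's radius.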

Zoom in to see the features of the Hex map

Heatmap

A heatmap is a visualization used to depict the intensity of data at geographical points.

Heat maps represent the density of data within different areas of a map, presented as colour-coded circles. Data is aggregated based on its location and given a radius of influence, which you can adjust. As the density of data increases in that area, the heat map will display a colour indicating higher intensity. The heat map will also have a definable maximum threshold for the highest colour to represent the gradient or the darkest shade if you use a single colour. Customizing these colours depends upon the GIS tool or code package you are using.
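
The aggregation behind a heat map can be sketched as follows. This is a simplified illustration, not how any particular GIS tool implements it: the linear fall-off, the grid size, the sample points and the function name heat_grid are all assumptions.

```python
import math

def heat_grid(points, rows, cols, radius):
    """Return a rows x cols grid of intensities on a grid of unit cells.

    Each point adds weight to every cell centre within `radius`, falling off
    linearly from 1 at the point to 0 at the edge of the radius.
    """
    grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cx, cy = c + 0.5, r + 0.5          # centre of cell (r, c)
            for px, py in points:
                d = math.hypot(px - cx, py - cy)
                if d < radius:
                    grid[r][c] += 1.0 - d / radius
    return grid

# Invented points in grid units: three clustered together, one isolated.
points = [(1.2, 1.1), (1.4, 1.3), (0.9, 1.2), (4.5, 3.5)]
grid = heat_grid(points, rows=4, cols=5, radius=1.5)

peak = max(max(row) for row in grid)
for row in grid:
    # Scale to 0-9 against the peak so the densest cell shows the "hottest" digit.
    print(" ".join(str(int(9 * v / peak)) for v in row))
```

Increasing the radius smooths the picture, while the peak value plays the role of the maximum threshold that maps to the hottest colour.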

A shaded cloud with colour gradients to show either the relative density of the points on the map, or the change in an associated metric value between locations.

Flow map

A flow map is a type of thematic map that uses linear symbols to represent movement. It may thus be considered a hybrid of a map and a flow diagram.

The movement being mapped may be that of anything, including people, highway traffic, trade goods, water, ideas, or even telecommunications data.

Origin-destination flow map of all commercial passenger airline routes as of 2014. Brighter yellow represents the higher density of air routes.

Flow maps come in several types. Origin-destination map- In this type, the primary intent is to show the existence of a connection between two places, often accompanied by a representation of the volume of flow and/or direction. The route is generally not important to the audience, so connecting lines are often straight or slightly curved. A common example of this form is the airline route map.

Distribution map- This type is exemplified by a balanced focus on origin-destination nodes, the routes of travel between them (usually highly generalized), and the volume of flow. The most common example, dating back to Minard, is a map showing shipping between a set of node regions or port cities, along common sea lanes. In a distribution map, paths leave the origin with a width proportional to the total of several destinations, then divide as routes “distribute” toward each destination.
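
The data preparation behind such maps is mostly counting. The sketch below aggregates individual trips into origin-destination pairs and derives a line width proportional to the volume of flow; the trip records and the maximum width are invented for illustration.

```python
from collections import Counter

# Invented trip records: (origin, destination).
trips = [
    ("Delhi", "Mumbai"), ("Delhi", "Mumbai"), ("Delhi", "Mumbai"),
    ("Delhi", "Bengaluru"), ("Mumbai", "Bengaluru"), ("Mumbai", "Bengaluru"),
]

# Volume of flow per origin-destination pair.
flows = Counter(trips)

MAX_WIDTH_PX = 12.0  # assumed width of the thickest line on the map
largest = max(flows.values())

# Line width proportional to the volume of flow on each connection.
line_widths = {pair: MAX_WIDTH_PX * count / largest for pair, count in flows.items()}

for (origin, destination), width in line_widths.items():
    print(f"{origin} -> {destination}: {flows[(origin, destination)]} trips, {width:.1f}px")
```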

Continuous/Mass flow map

Not all flow occurs along linear networks; two- and three-dimensional masses can also flow, especially water (e.g., ocean currents) and air (wind). Their movement can be modelled as a vector field, in which the magnitude and direction of movement can be measured at any point in space. A map that visualizes this, often called a mass flow map or continuous flow map, focuses on the direction and speed of flow, while other aspects such as origin/destination and route of travel are largely meaningless.
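
In a vector field, each sample is usually stored as an eastward and a northward component, from which speed and direction are derived. The following sketch shows that conversion; the component names u and v are a common convention, and the sample values and function name are invented.

```python
import math

def speed_and_direction(u, v):
    """Convert eastward (u) and northward (v) components to speed and bearing.

    The bearing is the direction the flow moves TOWARDS, in degrees clockwise
    from north (0 = north, 90 = east).
    """
    speed = math.hypot(u, v)
    bearing = math.degrees(math.atan2(u, v)) % 360
    return speed, bearing

# Invented samples of a flow field: (x, y) position -> (u, v) in metres/second.
field = {
    (0, 0): (3.0, 4.0),
    (1, 0): (0.0, -2.0),
    (0, 1): (-1.0, 0.0),
}

for position, (u, v) in field.items():
    speed, bearing = speed_and_direction(u, v)
    print(position, f"speed={speed:.1f} m/s", f"bearing={bearing:.0f} deg")
```

A continuous flow map then draws an arrow or streamline at each position, with its length or colour driven by the speed and its orientation by the bearing.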

Here are a few must explore companies who are taking Geospatial data Science to next level every day:

Here Maps, Google Maps, Features and services provided by Mapbox, tomtom, Carto, aiclearing, Uber, appgeo, pupil, argis, aspectum, pix4d, planetwatchers, Foursquare, WRLD, DeepMap, Fatmap, Navmii, Livingmap, keeptruckin, geotab, atlatec, wingtra, satelligence, and many more.

Conclusion:

It has been a journey full of clearing and revising the concepts about which any person working in the geospatial data field talks almost 12 hours a day.

A strong foundation is extremely important to construct a building. Today Geospatial companies need Geospatial data scientists and data scientists must train themselves in geospatial science to become a strong pillar of any such organisation.

Thankfully we live in a world full of tutorials and blogs that can help us learn fast and implement fast. I am also thankful for every single blog mentioned in the reference section for sharing their learnings.

This blog is a summation of reads, learning and my understanding from multiple sources. I have given credits to all the authors, organizations and blogging websites for images and content.

If you feel that this blog was helpful to you, please give a clap and share with someone who might need it.

Reference:

https://github.com/geospace-code/pymap3d

https://feed.terramonitor.com/shapefile-vs-geopackage-vs-geojson/#:~:text=Internally%2C%20GeoPackage%20uses%20Well%2Dknown,was%20on%20par%20with%20Shapefile.

https://blog.locale.ai/geospatial-clustering-types-and-use-cases/

https://blog.locale.ai/a-guide-to-kickstart-into-the-geospatial-world-2/

Pallawi

Computer Vision contributor. Lead Data Scientist @https://www.here.com/ Love Data Science.