Training

Building Dynamic 3D Synthetic Environments (sponsored)

25th November 2020 - 11:00 GMT | by Industry Spotlight

This article is brought to you by Presagis.

Generating massive numbers of labelled training examples using 3D models in a geospatially accurate setting is a challenge. But what if you add feature attributes such as material IDs and textures that would allow the collection of signatures for targets using different physics-based sensor modalities such as radar, infrared and night vision? And what if you also add the ability to build and maintain a 3D database from the torrent of geospatial big data sources, including satellite imagery, IoT sensors, GIS data, Lidar and other sensor types?

Impossible, right? Wrong.

VELOCITY is a 3D database production solution designed to solve all of these problems and challenges through automated workflows and AI-ready approaches for processing disparate data types. 

Built by Presagis, VELOCITY is based on over 30 years of experience in creating 3D databases to support simulation and training exercises for military and civilian applications. Their approach removes the human-in-the-loop to create simulation-ready 3D databases in hours as opposed to months. VELOCITY uses standards-based geo-processing to create richly attributed and highly accurate 3D synthetic terrains designed for training and simulation databases. For example, users can include 3D models anywhere in the scene, generate labeled training examples, and leverage multi-sensor views of features including radar, infrared, night vision and other modalities. Users can also include weather, entities, patterns of life and other simulations in their databases while taking full advantage of feature datasets that include material identification, textures and other physics-based attributes.

So how does VELOCITY do this?

In order to train AI networks, synthetic data can be generated using satellite imagery. Here’s how Presagis accomplished it:

1. They crop the area of interest and extract a footprint vector of the data a user wants to generate. 

2. Once extracted, they place the data in the VELOCITY automated pipeline and use this to generate various permutations of the data required (different textures, colors, materials, etc.). This allows users to create richer, more realistic environments.

3. Using these various permutations, they then generate geo-specific 3D models of buildings, landmarks, and hundreds of other features. 

4. Once the 3D models have been generated, they use the Presagis 3D rendering software (Vega Prime) to accurately place the models on top of the satellite imagery. 

5. Snapshots of the 3D model and imagery are taken from many different angles using many different sensors, which then allows them to generate hundreds of thousands of labelled images, ready to use for AI training.  
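
To give a feel for how steps 2 and 5 multiply into large labelled datasets, here is a minimal sketch of enumerating permutations and snapshot jobs. It is not the VELOCITY or Vega Prime API: every attribute value, field name and function name below is a hypothetical placeholder chosen for illustration.

```python
from itertools import product

# Hypothetical attribute values; the article does not list the real options.
TEXTURES = ["brick", "concrete", "glass"]
ROOF_COLORS = ["grey", "red"]
MATERIALS = ["masonry", "metal"]
SENSORS = ["eo", "infrared", "night_vision", "radar"]
VIEW_ANGLES_DEG = [0, 45, 90, 135, 180, 225, 270, 315]


def model_variants():
    """Every texture/colour/material combination for one footprint (step 2)."""
    return [
        {"texture": t, "roof_color": c, "material": m}
        for t, c, m in product(TEXTURES, ROOF_COLORS, MATERIALS)
    ]


def snapshot_jobs(footprint_id, label):
    """One labelled render job per variant, sensor and view angle (step 5)."""
    jobs = []
    for variant, sensor, angle in product(model_variants(), SENSORS, VIEW_ANGLES_DEG):
        jobs.append({
            "footprint_id": footprint_id,
            "label": label,  # the label travels with every rendered image
            "variant": variant,
            "sensor": sensor,
            "view_angle_deg": angle,
        })
    return jobs


jobs = snapshot_jobs("bldg-001", "building")
print(len(jobs))  # 12 variants x 4 sensors x 8 angles = 384
```

Even with these small placeholder lists, a single footprint yields 384 labelled render jobs, which illustrates how a scene with many features can plausibly reach hundreds of thousands of images.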

Simple as the steps above may sound, there is a large amount of heavy lifting happening behind the scenes. And reader be warned: things are about to get a lot more technical…

The Science behind VELOCITY

In this decade we have seen the power of Artificial Intelligence (AI) as it has proven its effectiveness in solving hard and long-standing open problems in computer vision. Effective solutions to problems like image classification, object detection and semantic segmentation have been achieved using a specific type of AI technique called Deep Neural Networks. In remote sensing research, the analysis and extraction of geospatial features from remote sensor imagery has always been an area of interest for the creation of 3D synthetic environments.

Creation of synthetic environments requires a large amount of GIS data in the form of vectors such as building footprints, road networks, vegetation scatter, hydrography, etc. Publicly available GIS information (e.g. OpenStreetMap) often contains too little information and is not correlated with the Electro-Optical (EO) or Infrared (IR) imagery of the area. Manually labeled data of much better quality can be acquired, albeit at a higher cost.

Current advances in computer vision allow object detection and semantic segmentation with relatively high accuracy using deep neural networks. These are ideal for extracting features (buildings, roads, trees, water, etc.) from remote sensor imagery. A simple conversion of the extracted features into vectors is enough to feed a 3D synthetic environment reconstruction. The extracted features can be used directly to create a geo-typical synthetic environment using VELOCITY.

Building Footprint Extraction

Among all the geospatial features, building footprints are considered the most important, as buildings are the defining feature of a 3D urban synthetic environment. With the availability of very high-resolution satellite imagery, the remote sensing community is pursuing automatic techniques for extracting building footprints for cities with varied building types.

A generic process of AI-based building footprint extraction starts with the remote sensor imagery being fed to a Convolutional Neural Network (CNN)-based semantic segmentation model. It produces a binary segmentation mask where each pixel of the input image is classified as either a building or a non-building pixel. The next step, done as post-processing, is to extract the building boundaries from the binary segmentation mask and refine the extracted footprint vectors if required. The most important component of this pipeline is the CNN-based neural network architecture, as the quality of the building footprint vectors depends heavily on the prediction accuracy of the model.
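
As a rough illustration of the first post-processing step, the sketch below thresholds a per-pixel building probability map into a binary mask and groups connected building pixels into individual candidate buildings. The 0.5 threshold, the 4-connectivity rule and the function names are assumptions made for this example, not details of the Presagis pipeline.

```python
from collections import deque


def threshold_mask(prob_map, threshold=0.5):
    """Turn per-pixel building probabilities into a binary mask (1 = building)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def label_components(mask):
    """Group 4-connected building pixels; returns one pixel list per blob."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited building pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


# Toy probability map with two separate high-confidence blobs.
prob_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.1, 0.8],
    [0.0, 0.1, 0.6, 0.9],
]
mask = threshold_mask(prob_map)
buildings = label_components(mask)
print(len(buildings))  # 2
```

Each connected blob would then be traced into a boundary polygon; in practice a library routine for contour tracing would normally replace this hand-written flood fill.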

Convolutional Neural Network Architectures

In computer vision, the task of masking out pixels belonging to different classes of objects such as background or buildings is referred to as semantic segmentation. Recently, there has been a lot of research to find the best CNN architecture for semantic segmentation. Among all the architectures, U-Net is the most widely used, thanks to its simple, easy-to-train design and its wide adoption by the remote sensing community.

Figure 1 Left: A typical encoder-decoder style segmentation architecture with dense blocks. Right: Presagis’ proposed Feature recalibrated dense blocks with 4 convolutional layers.

The U-Net architecture is also known as an encoder-decoder style architecture. The first part of the neural network is called the encoder, as it extracts features from the input image, and the later part of the network is called the decoder, as it maps the downsampled features back to the spatial regions (pixels) of the input image. There are skip connections between the encoder and decoder blocks at every level of the network. The skip connections speed up training because they facilitate faster gradient back-propagation. A lot of research has been done to find the best block architecture for U-Net style segmentation networks.
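
One way to see what the skip connections do is to trace feature-map shapes through the network. The sketch below assumes common U-Net conventions that are not stated in the article (channels double and resolution halves at each encoder level, and each decoder level upsamples by 2 and concatenates the matching encoder output); the function name and defaults are hypothetical.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through an encoder-decoder with skips.

    Assumes height and width are divisible by 2 ** depth so that upsampled
    maps line up exactly with the saved encoder maps.
    """
    encoder = []
    ch, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((ch, h, w))        # saved for the skip connection
        ch, h, w = ch * 2, h // 2, w // 2
    bottleneck = (ch, h, w)

    decoder = []
    for skip_ch, skip_h, skip_w in reversed(encoder):
        h, w = h * 2, w * 2               # upsample back towards input size
        assert (h, w) == (skip_h, skip_w), "skip and upsampled maps must align"
        concat_ch = ch // 2 + skip_ch     # upsampled features + skip features
        ch = skip_ch                      # convolutions reduce to skip width
        decoder.append({"concat": (concat_ch, h, w), "out": (ch, h, w)})
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes(256, 256)
print(bottleneck)          # (1024, 16, 16)
print(decoder[-1]["out"])  # (64, 256, 256)
```

The last decoder level returns to the input resolution, which is what allows the network to emit one prediction per input pixel.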

Figure 2 Shows results of building footprint extraction using ICT-Net on Hawaii imagery. (a) Color imagery. (b) Single channel prediction mask of same size as input imagery obtained as output of the network. (c) Confidence of network prediction shown as heat map. Red to Blue signifies high to low confidence for the building class. (d) Extracted and refined building footprints shown as green polygons on top of the imagery.

In the original U-Net design, each block at each level consisted of two convolution layers. In one of its studies, Presagis found that dense blocks with feature recalibration using squeeze-and-excitation (SE) blocks work best for building segmentation. They used a fully convolutional Dense U-Net style architecture with 103 convolutional layers. Presagis introduced feature recalibration for each dense block because not every feature extracted by a CNN layer is equally important. To test this intuition, they benchmarked their model on the INRIA Aerial Image Labeling challenge and obtained results significantly better than the state of the art on this dataset. The research was conducted in collaboration with Concordia University’s Immersive and Creative Technologies lab; more details about this work can be found here.
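
The idea of feature recalibration can be illustrated with a tiny squeeze-and-excitation step in plain Python. This is a sketch of the generic SE operation (global average pooling, a small bottleneck, sigmoid gates, channel-wise scaling), not the Presagis network: the weights are hand-picked for illustration, biases are omitted, and how SE is wired into each dense block is not specified in the article.

```python
import math


def squeeze_excite(feature_maps, w1, w2):
    """Recalibrate channels; feature_maps is a list of C 2-D maps (lists of rows).

    w1 has shape (C // r, C) and w2 has shape (C, C // r). The weights are
    illustrative placeholders, not trained values.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]
    # Excitation: bottleneck layer + ReLU, then a layer + sigmoid in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every pixel of a channel by that channel's gate.
    recalibrated = [
        [[value * scale for value in row] for row in fmap]
        for fmap, scale in zip(feature_maps, scales)
    ]
    return recalibrated, scales


features = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[4.0, 4.0], [4.0, 4.0]],  # channel 1, mean 4.0
]
w1 = [[0.5, 0.5]]              # 2 channels reduced to 1 hidden unit (r = 2)
w2 = [[-1.0], [1.0]]           # suppress channel 0, emphasise channel 1
out, scales = squeeze_excite(features, w1, w2)
print(scales[0] < 0.5 < scales[1])  # True
```

With these placeholder weights the gate for channel 0 falls below 0.5 and the gate for channel 1 rises above it, so the second channel contributes more to later layers.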

Segmentation mask to 3D synthetic environment

The output of the segmentation neural network model is a mask of the same size as the input image with each pixel labeled as building/non-building. The next step for extraction of building footprints from the binary mask would be to extract the building contours (boundaries) in the form of vectors. The extracted vectors often contain jittery building boundaries and need to be refined. To obtain simplified but very high quality building footprint vectors, Presagis applied a well thought-out set of post processing steps.

The refined vectors are then fed into the VELOCITY framework for the creation of a synthetic database. VELOCITY has the capacity to manipulate, attribute and process the GIS source data (which can consist of vectors, imagery and raster data). An attribute such as height is inferred using algorithms that are based on the area of the footprint. If no additional information is given in the pipeline, the roof type and color are randomly assigned to the buildings from a library of building templates, resulting in a variety of 3D models in the database. This creates a geo-typical representation of the real-world environment once VELOCITY takes the processed data and starts publishing the new content database (i.e. the 3D scene).
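
The attribution step can be pictured with a small sketch. The height rule (one storey per 200 square metres, capped at ten storeys), the template lists and all function names below are hypothetical; the article only says that height is inferred from footprint area and that roof type and color are drawn at random from a template library when nothing else is supplied.

```python
import random

ROOF_TYPES = ["flat", "gable", "hip"]      # hypothetical template library
ROOF_COLORS = ["grey", "red", "brown"]


def footprint_area(ring):
    """Shoelace formula for a simple polygon given as (x, y) vertices in metres."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def infer_height(area_m2, storey_height_m=3.0):
    """Hypothetical rule: one extra storey per 200 m2, capped at 10 storeys."""
    storeys = min(10, 1 + int(area_m2 // 200))
    return storeys * storey_height_m


def attribute_building(ring, rng, roof_type=None, roof_color=None):
    """Fill in missing attributes; supplied values always win over random ones."""
    area = footprint_area(ring)
    return {
        "area_m2": area,
        "height_m": infer_height(area),
        "roof_type": roof_type or rng.choice(ROOF_TYPES),
        "roof_color": roof_color or rng.choice(ROOF_COLORS),
    }


rng = random.Random(42)  # seeded so repeated runs give the same database
building = attribute_building([(0, 0), (30, 0), (30, 20), (0, 20)], rng)
print(building["area_m2"], building["height_m"])  # 600.0 12.0
```

Seeding the random generator is a design choice worth noting: it keeps the randomly attributed scene reproducible between publishing runs.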

To learn more about VELOCITY, visit Presagis' website here
