Synthetic Data: A bridge over the data moat

What sorts of models can I train with synthetic data?

The models that can be trained with synthetic data are limited by the richness of the data we can produce. In the case of images created with 3D rendering software, we’re able to replicate much of the richness of a real image, thanks to our understanding of light and physics. Producing a synthetic dataset of electronic medical records is much harder. We don’t have as complete an understanding of the process that generates them.

Within computer vision, though, it’s possible to train models to perform many common tasks based entirely on synthetic data. Object detection, segmentation, optical flow, pose estimation, and depth estimation are all achievable with today’s tools. In audio processing, automatic speech recognition, audio denoising, and speaker isolation tasks can also make use of generated data. Finally, reinforcement learning has benefited greatly from the ability to test policies in simulated environments, making it possible to train models for self-driving cars and robots that sit on factory floors.

With most of these tasks, though, synthetic data is useful for training models with limited scope. Training a model to estimate hand position or recognize a single toy is achievable, but detecting hundreds of objects will likely require advanced data generation and modeling knowledge.

Where synthetic data really shines is in its ability to produce data and annotations for nearly limitless objects and tasks. Every product on a store shelf can be scanned and rendered as a 3D object. Positions and orientations of even the most complicated objects can be tracked programmatically.
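To make the idea of programmatic annotation concrete, here is a minimal sketch of how a 2D bounding box label could be derived from an object's known 3D position. It assumes a simple pinhole camera and an axis-aligned box given in camera coordinates. The function names, the focal length, and the image center are all illustrative assumptions, not the API of any particular rendering tool.

```python
from itertools import product


def project_point(point, focal_px, cx, cy):
    """Project a 3D point in camera coordinates (z pointing forward) to pixels."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def bounding_box_2d(center, size, focal_px, cx, cy):
    """Return (xmin, ymin, xmax, ymax) in pixels for an axis-aligned 3D box."""
    half = [s / 2 for s in size]
    # Enumerate the 8 corners of the box around its center.
    corners = [
        tuple(c + sign * h for c, sign, h in zip(center, signs, half))
        for signs in product((-1, 1), repeat=3)
    ]
    pixels = [project_point(p, focal_px, cx, cy) for p in corners]
    xs = [u for u, _ in pixels]
    ys = [v for _, v in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical scene: a 1 m cube centered 2 m in front of a 640x480 camera.
box = bounding_box_2d(
    center=(0.0, 0.0, 2.0),
    size=(1.0, 1.0, 1.0),
    focal_px=500.0,
    cx=320.0,
    cy=240.0,
)
print(box)
```

Because the renderer already knows where every object is, a label like this comes for free with each image, with no human annotator involved. A real pipeline would also have to handle object rotation, occlusion, and clipping at the image border, which this sketch ignores.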

Does it really work?

I was pretty skeptical at first, but over the past year, I’ve seen more and more models trained on datasets that are mostly or entirely synthetic.

As early as 2016, a dataset known as SceneNet was released, containing millions of images of rendered 3D interiors. Images are accompanied by depth and segmentation maps, among other annotations. With additional augmentation, researchers were able to train models on purely synthetic data that achieved near state-of-the-art performance on depth prediction tasks (pdf link).

NVIDIA’s work on domain randomization finds that more random scenes actually improve performance over more realistic ones. The theory is that randomness forces the model to focus more closely on salient features (pdf link).

In 2017, a technique known as domain randomization was introduced. Rather than producing photorealistic synthetic images, researchers generated extremely random, non-realistic scenes for training. Incredibly, this improved performance, with researchers theorizing that the random nature of the scenes forced networks to focus on only the most intrinsic characteristics of the objects they were trying to detect. The models trained on entirely synthetic data were transferred to real robots, which were able to locate physical objects on a table to within 1.5 cm (pdf link). NVIDIA researchers took this technique even further in 2018, achieving near state-of-the-art performance on object detection in complex street scenes used to validate models for self-driving cars (pdf link).
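As a rough illustration of what domain randomization looks like in practice, the sketch below samples random scene parameters that a renderer could then turn into training images. The parameter names and ranges are my own assumptions chosen for illustration. They are not taken from the papers mentioned above, and the rendering step itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    object_position: tuple   # (x, y) on the table, in meters (assumed range)
    object_yaw_deg: float    # rotation of the object about the vertical axis
    light_intensity: float   # arbitrary units (assumed range)
    camera_jitter_deg: float # small random tilt applied to the camera
    texture_rgb: tuple       # random flat color instead of a realistic texture
    num_distractors: int     # random clutter objects added to the scene


def sample_scene(rng):
    """Draw one randomized scene description from hypothetical ranges."""
    return SceneParams(
        object_position=(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)),
        object_yaw_deg=rng.uniform(0.0, 360.0),
        light_intensity=rng.uniform(0.2, 2.0),
        camera_jitter_deg=rng.uniform(-5.0, 5.0),
        texture_rgb=tuple(rng.randint(0, 255) for _ in range(3)),
        num_distractors=rng.randint(0, 10),
    )


def generate_dataset(n, seed=0):
    """Return n scene descriptions; the seed makes the set reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


scenes = generate_dataset(1000, seed=42)
print(scenes[0])
```

Each sampled object position doubles as the ground-truth label for its image. The intent is that the real world ends up looking like just one more variation among the many the model has already seen.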

Commercially, companies are beginning to catch on. AI.Reverie uses a customized rendering engine to create elaborate synthetic datasets, containing everything from construction sites to herds of wild animals. Unity has released an open source ImageSynthesis library that makes it trivial to output various types of annotation data. Laan Labs was able to use this tool to train a 3D pose estimation model for cars entirely on synthetic data.
