Training Data for Computer Vision Algorithms: Your Options for Collecting or Creating Annotated Datasets

If you have been following computer vision news lately, you are already aware of how fast this space is moving. It’s exciting!

The less-exciting part? Training data. (Well, we find it exciting, but we know through our conversations with hundreds of data scientists, engineers, and product leaders that the responsibility of acquiring and annotating training data is a real thorn in their sides.)

Yet, it’s arguably the backbone of machine learning. You gotta have training data, it’s gotta be high-quality, and it’s best that you get it quickly and efficiently.

There’s a (large) handful of training data sources, solutions, and strategies out there—how do you choose? How do you know which dataset, tool, or vendor is the right one for your project?

That’s what this post is all about.

Broadly, you can sort training data resources into two buckets: pre-existing, publicly available datasets, and solutions for creating your own.

Pre-Existing, Publicly Available Datasets

Open datasets or open source datasets are off-the-shelf, already-annotated datasets that are available on the web for free or for purchase. As mentioned in our natural language processing post, one of the many great things about the data science community is its members’ commitment to sharing knowledge and resources with the field at large—we owe the bevy of open computer-vision training datasets available today to that commitment.

In general, a pre-existing dataset is a good option in two scenarios: 1) you’re just beginning to test out algorithms, or 2) the model you’re building only needs to perform a general, relatively simple task. While public datasets lack specificity, their annotations are usually accurate, so they’re typically reliable resources. Here are some good datasets and dataset repositories:

Solutions for Creating Your Own Training Data

It’s best to create your own training datasets when you need custom, highly specific annotations, as the companies building computer vision models for autonomous vehicles do. If your model is intended to perform anything more sophisticated or specialized than generic computer vision functions (e.g., basic object recognition or image tagging), you’re likely to need proprietary training data.

There are three distinct approaches to generating your own training data (a blend of these methods is also common):

  1. In-house annotation teams
  2. Traditional crowdsourcing or outsourcing options
  3. An end-to-end “Training Data as a Service” solution

In-House Annotation Teams

Doing your own annotations is a popular choice when the accuracy bar is exceptionally high, and when you can afford to allocate employees’ time to managing and labeling data. DIY labeling is often much slower than other solutions, so wiggle room with deadlines is also required.

As noted, the advantage of handling data annotations in-house is that quality is often very high. Of the three options, it also gives you the most control over the process.

Traditional Crowdsourcing/Outsourcing

Crowdsourcing or outsourcing is a frequent next step for DIY-ers when velocity becomes an issue, or when tying up high-value employees’ time in labeling, and keeping them from other important work, is no longer worth the cost. This approach lets companies offload the tedious labeling to larger pools of annotators, so speed and scale improve. Quality often suffers, though: many times these “crowds” are made up of largely unknown users, and targeting them by demographics, skill, or domain knowledge can be difficult or impossible.
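
One way to put a number on that quality trade-off is to have your own team label a small “gold” sample and check how often the outsourced labels agree with it. The snippet below is a minimal Python sketch of that idea, not a description of any particular vendor’s process; the function name, image IDs, and labels are all hypothetical.

```python
def label_accuracy(gold, crowd):
    """Return the fraction of gold-labeled items whose crowd label matches.

    Both arguments map an item ID to a label. Items that appear in only
    one of the two mappings are ignored.
    """
    shared = [item for item in gold if item in crowd]
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(1 for item in shared if gold[item] == crowd[item])
    return matches / len(shared)


# Hypothetical sample: labels from an in-house team vs. an outsourced crowd.
gold = {
    "img_001": "car",
    "img_002": "pedestrian",
    "img_003": "cyclist",
    "img_004": "car",
}
crowd = {
    "img_001": "car",
    "img_002": "pedestrian",
    "img_003": "pedestrian",
    "img_004": "car",
    "img_005": "truck",  # not in the gold sample, so it is ignored
}

print(f"Agreement with gold sample: {label_accuracy(gold, crowd):.0%}")
```

A check like this only covers exact label matches; tasks such as bounding boxes or segmentation would need a different comparison.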

Here are some popular annotation tools for in-house or crowdsourced annotation:

Also, check out these papers for helpful tips and tricks:

Training Data as a Service

A comprehensive training data solution (like what you’d get if you work with Mighty AI) offers a complete offloading of the entire annotation process. From determining specs to creating workflows to handling task design and more, our fully managed approach requires the least amount of effort from the customer (by far).

The total cost of ownership of training data as a service often works out to be the most favorable, too, when you factor in time and headache savings. Your employees are freed up to do high-value work instead of annotating or managing the annotation process. And, at least in the case of Mighty AI, accuracy is as high as (or higher than) that of both crowdsourcing/outsourcing and internal labeling, thanks to our proprietary machine learning, full stack of annotation software, and global community of specialized annotators. You get the quality of in-house with the scale and speed of crowdsourcing, minus the time and effort on your end.
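
If you want to run the total-cost comparison for your own project, the arithmetic is simple: add what you pay externally to the cost of the internal hours each option consumes. Here is a rough Python sketch of that calculation. The class, field names, and every figure in it are placeholders chosen only to exercise the code; they are not real prices and are not evidence for any option, so substitute your own quotes and time estimates.

```python
from dataclasses import dataclass


@dataclass
class AnnotationOption:
    name: str
    external_fees: float   # money paid to a tool, crowd platform, or service
    employee_hours: float  # internal hours spent labeling or managing the work
    hourly_cost: float     # loaded hourly cost of those employees

    def total_cost(self) -> float:
        # Total cost of ownership = external spend + cost of internal time.
        return self.external_fees + self.employee_hours * self.hourly_cost


# Placeholder figures only; replace with your own quotes and time estimates.
options = [
    AnnotationOption("in-house", external_fees=0, employee_hours=400, hourly_cost=75),
    AnnotationOption("crowdsourcing", external_fees=8_000, employee_hours=120, hourly_cost=75),
    AnnotationOption("managed service", external_fees=15_000, employee_hours=20, hourly_cost=75),
]

for option in options:
    print(f"{option.name}: {option.total_cost():,.0f}")
```

This leaves out costs that are harder to quantify, such as rework caused by low-quality labels or delays to a launch, so treat the result as a starting point.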

While training data is perhaps the least sexy part of computer vision projects, it’s undeniably crucial. Use the info above to guide you in your decision-making process for annotation methods, tools, and solutions—now and in the future, as your training data needs change.

image credit: on Pexels via CC0 1.0

*Note: We originally published an earlier version of this post in October 2016. Since many eyes have made their way here, we wanted to update with the latest, greatest information. Thank you for stopping by!