
Training Data for Computer Vision Algorithms: Your Options for Collecting or Creating Annotated Datasets

If you have been following computer vision news lately, you are already aware of how fast this space is moving. From self-driving vehicles navigating adverse conditions, to production facilities automatically detecting defects in products, to the advancements in cashier-less stores—computer vision is at the core of many applications that are here to change our lives.

And the one thing all these applications have in common? Training data. Acquiring and annotating data to train and validate your computer vision model is an essential piece of the process, but it is also time-consuming and expensive. How do you know which dataset, tool, or partner is the right one for your project?

That’s what this post is all about.

Broadly, you can sort training data resources into two buckets: publicly available datasets and solutions for creating your own.

Publicly Available Datasets

Open datasets or open source datasets are off-the-shelf, already-annotated datasets that are available on the web for free or for purchase. As mentioned in our natural language processing post, one of the many great things about the data science community is its members’ commitment to sharing knowledge and resources with the field at large—we owe the bevy of open computer vision training datasets available today to that commitment.

In general, a pre-existing dataset is a good option in two scenarios: 1) you’re just beginning the process of testing out algorithms, or 2) the model you’re building only needs to perform a general, relatively simple task. While public datasets lack specificity, the accuracy is usually good, so they’re typically reliable resources. Here are some good datasets depending on the use case you are looking for:

Road and Street Scenes:

  - KITTI – stereo, optical flow, and object detection data captured from a vehicle driving around Karlsruhe, Germany
  - Cityscapes – urban street scenes from 50 cities with pixel-level semantic segmentation labels
  - BDD100K – a large, diverse driving dataset from UC Berkeley with bounding boxes, lane markings, and segmentation
  - Mapillary Vistas – street-level imagery with fine-grained semantic segmentation annotations

Common Objects:

  - COCO (Common Objects in Context) – object detection, segmentation, and captioning annotations for everyday objects and scenes
  - PASCAL VOC – a classic detection and segmentation benchmark covering 20 object classes
  - ImageNet – millions of images labeled across thousands of categories, the basis of the ILSVRC classification challenge
  - Open Images – Google's large-scale dataset with image-level labels and bounding boxes

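To make working with these datasets concrete, here's a minimal sketch of reading the COCO-style JSON annotation format that many public datasets and labeling tools share. The tiny inline dictionary stands in for a real annotation file (e.g. COCO's instances_val2017.json); the image name and boxes are illustrative.

```python
# A minimal COCO-style annotation structure. In practice you would
# load this from disk with json.load(open("instances_val2017.json")).
coco = {
    "images": [{"id": 1, "file_name": "street.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 3, "name": "car"}],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, origin at top-left
        {"id": 10, "image_id": 1, "category_id": 3, "bbox": [48, 240, 130, 90]},
    ],
}

def boxes_for_image(coco_dict, image_id):
    """Return (category_name, bbox) pairs for one image."""
    names = {c["id"]: c["name"] for c in coco_dict["categories"]}
    return [
        (names[a["category_id"]], a["bbox"])
        for a in coco_dict["annotations"]
        if a["image_id"] == image_id
    ]

print(boxes_for_image(coco, 1))  # [('car', [48, 240, 130, 90])]
```
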
Solutions for Creating Your Own Training Data

If you need custom and highly specific annotations, you’ll need to create your own training datasets. There are three distinct approaches to generating your own training data (a blend of these methods is also common):

  1. In-house annotation teams
  2. Traditional MTurk-style crowdsourcing
  3. A fully managed data labeling partner

In-House Annotation Teams

Doing your own annotations is a popular choice when you can afford to allocate employees’ time to managing and labeling data. A team of internal annotators can specialize in your specific use case and you’ll have full control over the process. However, DIY labeling is often much slower than other solutions and you’ll miss out on the expertise and know-how that a partner can offer.

Traditional Crowdsourcing/Outsourcing

Crowdsourcing or outsourcing is a frequent next step for DIY-ers when velocity becomes an issue, or when tying up high-value employees’ time in labeling—keeping them from tackling other important work—is no longer worth it. This approach lets companies offload tedious labeling to large pools of annotators, so speed and scale improve, but quality often suffers: these “crowds” are largely made up of unknown users, and targeting them by demographics, skill, or domain knowledge can be difficult or impossible.
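
One common tactic for recovering quality from an unvetted crowd is majority voting: collect several workers’ answers per item and keep a label only when enough annotators agree. A minimal sketch—the image names, labels, and agreement threshold below are illustrative:

```python
from collections import Counter

# Hypothetical raw labels: three crowd workers classify each image.
worker_labels = {
    "img_001.jpg": ["car", "car", "truck"],
    "img_002.jpg": ["pedestrian", "pedestrian", "pedestrian"],
}

def aggregate(labels_by_image, min_agreement=2):
    """Keep the majority label when enough annotators agree;
    otherwise flag the image for expert review."""
    result = {}
    for image, labels in labels_by_image.items():
        label, count = Counter(labels).most_common(1)[0]
        result[image] = label if count >= min_agreement else "NEEDS_REVIEW"
    return result

print(aggregate(worker_labels))
# {'img_001.jpg': 'car', 'img_002.jpg': 'pedestrian'}
```

Raising `min_agreement` trades throughput for precision; flagged items can be routed to a smaller pool of trusted reviewers.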

Here are some popular annotation tools for in-house or crowdsourced labeling:

  - LabelImg – a lightweight open-source tool for drawing bounding boxes
  - CVAT (Computer Vision Annotation Tool) – Intel’s open-source, web-based tool for boxes, polygons, and video annotation
  - VGG Image Annotator (VIA) – a simple, browser-based annotator from the Oxford VGG group
  - LabelMe – MIT’s long-running web tool for polygon annotation


Managed Labeling Partners

A fully managed data labeling service (like what you’d get working with Mighty AI) offers a complete, tailored approach to generating your ground-truth dataset, ensuring both flexibility and high quality when preparing data to train and validate your models. Companies providing managed labeling services generally develop proprietary annotation tooling, recruit and manage their own crowd, and deploy computer vision and machine learning technology to improve the speed and quality of manual annotations. With Mighty AI, a team of experts consults closely with you to translate project requirements into bespoke annotation workflows, offloading the burden of task design, user instructions, and gold-standard annotation to our experienced team. We also help anticipate edge cases and improve outcomes, so you can be confident your dataset is optimized for your model requirements.
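
One way a labeling pipeline—managed or in-house—can check submitted boxes against gold-standard ones is intersection-over-union (IoU). A minimal sketch, with hypothetical boxes and an illustrative 0.5 acceptance threshold:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, width, height] boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    # Overlap rectangle: clamp to zero when the boxes are disjoint.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax1 + aw, bx1 + bw), min(ay1 + ah, by1 + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

gold = [50, 50, 100, 100]       # expert-drawn reference box
submitted = [60, 55, 100, 100]  # annotator's box
print(iou(gold, submitted) >= 0.5)  # True -> accept the annotation
```

Sprinkling such gold-standard items through a task queue gives a running per-annotator accuracy score without reviewing every label.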

The total cost of working with a trusted partner often works out to be the most favorable, too, once you factor in the time and money saved by not paying for inaccurate annotations: these services generally guarantee quality, and your employees are freed up to do high-value work instead of annotating or managing the annotation process. With Mighty AI, you get the same quality as your in-house team at the scale and speed of a crowd, minus the time and effort on your end.

While training data is perhaps the least sexy part of computer vision projects, it’s undeniably crucial. Use the info above to guide you in your decision-making process for annotation methods, tools, and solutions—now and in the future, as your training data needs change.

image credit: Pexels via CC0 1.0