Facebook has created and labeled a new open-source video dataset, which the social media giant hopes will do a better job of exposing bias when the performance of an AI system is tested.
Dubbed “Casual Conversations,” the dataset comprises 45,186 videos of just over 3,000 participants having non-scripted chats, and is evenly distributed across genders, age groups and skin tones.
Facebook asked paid actors to submit the videos and to provide age and gender labels themselves, to remove as much external error as possible from the way the dataset is annotated. Facebook’s own team then identified skin tones based on the well-established Fitzpatrick scale, which distinguishes six skin types.
The annotators also labeled the level of lighting in each video, to help measure how AI models treat people with different skin tones in low ambient lighting.
“Casual Conversations” is now available for researchers to use to test computer vision and audio AI systems – not to develop their algorithms, but to evaluate how a trained system performs across different categories of people.
Testing is an integral part of designing an AI system: researchers typically measure their model against a labeled dataset after the algorithm has been trained, to check how accurate its predictions are.
One issue with this approach is that when the test dataset isn’t diverse enough, the model’s accuracy is only validated for a specific subgroup – which means the algorithm may not work as well when faced with other types of data.
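To illustrate why, here is a minimal sketch in Python (the subgroup tags and toy data are hypothetical, not drawn from the dataset) of how an aggregate accuracy score can hide a complete failure on one subgroup:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical evaluation data: model predictions, ground-truth labels,
# and a subgroup tag (e.g. a skin-tone bucket) for each test example.
y_true   = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred   = np.array([1, 0, 1, 1, 0, 0, 1, 1])
subgroup = np.array(["A", "A", "A", "A", "A", "B", "B", "B"])

# Aggregate accuracy looks tolerable...
print("overall:", accuracy_score(y_true, y_pred))  # 0.625

# ...but a per-subgroup breakdown shows the model fails entirely on
# group B, which a test set dominated by group A would never reveal.
for g in np.unique(subgroup):
    mask = subgroup == g
    print(g, accuracy_score(y_true[mask], y_pred[mask]))  # A: 1.0, B: 0.0
```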
Those potential shortcomings are particularly striking in the case of an algorithm making predictions about people. Recent studies, for example, have shown that two of the common datasets used for facial analysis models, IJB-A and Adience, were overwhelmingly composed of lighter-skinned subjects (79.6% and 86.2% respectively).
This is partly why recent years have been rife with examples of algorithms making biased decisions against certain groups of people. For instance, an MIT study that looked at the gender classification products offered by IBM, Microsoft and Face++ found that all the classifiers performed better on male faces than on female faces, and that better results were also obtained for lighter-skinned individuals.
While some of the classifiers made almost no mistakes when identifying lighter-skinned male faces, the researchers found, the error rate for darker-skinned female faces climbed to almost 35%.
It is critical, therefore, to verify not only that an algorithm is accurate, but also that it works equally well across different categories of people. “Casual Conversations”, in this context, could help researchers evaluate their AI systems across a diverse set of ages, genders, skin tones and lighting conditions, and identify the groups for which their models underperform.
“Our new Casual Conversations dataset should be used as a supplementary tool for measuring the fairness of computer vision and audio models, in addition to accuracy tests, for communities represented in the dataset,” said Facebook’s AI team.
In addition to distributing the dataset evenly across the four categories, the team also ensured that the intersections between categories were uniformly represented. This means that even if an AI system performs equally well across all age groups, it is still possible to spot whether the model underperforms for, say, older women with darker skin in a low-light setting.
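In practice, such an intersectional breakdown amounts to grouping per-video evaluation results by every combination of labels. A rough sketch of what this could look like, assuming a pandas DataFrame with hypothetical column names rather than Facebook’s actual tooling:

```python
import pandas as pd

# Hypothetical per-video results: one row per test video, pairing the
# model's correctness flag with the dataset's self-reported labels.
results = pd.DataFrame({
    "correct":   [1, 1, 0, 1, 0, 0, 1, 1],
    "age_group": ["18-30", "18-30", "60+", "60+", "60+", "60+", "18-30", "18-30"],
    "gender":    ["female", "male", "female", "male", "female", "female", "male", "female"],
    "skin_tone": ["I-III", "IV-VI", "IV-VI", "I-III", "IV-VI", "IV-VI", "I-III", "I-III"],
    "lighting":  ["bright", "bright", "dim", "bright", "dim", "dim", "bright", "dim"],
})

# Accuracy per intersection of all four labels; the dataset's uniform
# coverage is what keeps each of these cells populated enough to trust.
breakdown = (results
             .groupby(["age_group", "gender", "skin_tone", "lighting"])["correct"]
             .mean())
print(breakdown.sort_values())  # weakest intersections listed first
```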
Facebook used the new dataset to test the performance of the five algorithms that won the company’s Deepfake Detection Challenge last year, which were developed to detect doctored media circulating online.
The researchers found that all of the winning algorithms struggled to identify fake videos of people with darker skin tones, and that the model with the most balanced predictions across all subgroups was actually the third-place winner.
Although the dataset is already available for the open-source community to use, Facebook acknowledged that “Casual Conversations” comes with limitations. Only “male”, “female” and “other” were offered as gender labels, for example, which fails to represent people who identify as nonbinary.
“Over the next year or so, we’ll explore pathways to expand this data set to be even more inclusive, with representations that include a wider range of gender identities, ages, geographical locations, activities, and other characteristics,” said the company.
Facebook itself has experience with less-than-perfect algorithms, such as when its ad delivery algorithm resulted in women being shown fewer ads from campaigns that were intended to be gender-neutral, for example STEM career ads.
The company said that Casual Conversations will now be available for all of its internal teams, and is “encouraging” staff to use the dataset for evaluation, while the AI team works on expanding the tool to represent more diverse groups of people.