The race for data to develop artificial intelligence is heating up, with tech giants such as OpenAI, Google, and Meta going to great lengths to obtain more for their A.I. models. As digital data becomes increasingly crucial to the success of A.I., companies are consuming publicly available online data faster than new data is being produced.
One prediction suggests that high-quality digital data could be exhausted by 2026, prompting tech companies to explore new avenues for data collection. OpenAI, for example, created a program in 2021 that transcribed the audio of YouTube videos into text to feed its A.I. models, even though doing so potentially violated YouTube’s terms of service.
Google and Meta have also navigated legal gray areas in their quest for more data: Google has used YouTube data for A.I. development, and Meta has considered buying a major publisher such as Simon & Schuster to gain access to copyrighted works for its A.I. models.
In response to the looming data shortage, companies are now exploring “synthetic” data, in which A.I. models generate new text that is then used to train better A.I. systems. Relying on synthetic data carries risks, however: A.I. models make errors, and those errors can be compounded when models are trained on their own flawed output.
As the A.I. industry’s demand for data continues to grow, the ethical and legal implications of how that data is collected and used are becoming increasingly important topics of discussion.