
There's a huge problem that's holding back the training of neural networks

In the fast-moving world of artificial intelligence, leading technology companies face an unexpected challenge that could slow the pace of innovation: the growing difficulty of finding quality data to train their models. This data shortage is affecting the development of advanced technologies such as GPT-5, while companies of the caliber of Microsoft and OpenAI search for innovative solutions to overcome the obstacle.

AI training challenges: a hunger for data is slowing progress

In an era marked by an unprecedented increase in computing power and advances in machine learning techniques, OpenAI and its peers face a paradox: an abundance of online data does not automatically translate into a usable resource for AI training. The need for accurate, relevant, and up-to-date data is more critical than ever, especially when it comes to training increasingly complex models like the planned GPT-5.

The transition from GPT-4 to GPT-5 illustrates this exponential growth in data demand: while the former required “only” 12 trillion tokens, estimates for its successor run to 60-100 trillion. The gap between the availability of high-quality data and the need for it emerges as a significant obstacle, with an estimated shortfall of between 10 and 20 trillion tokens.
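To put the figures above in perspective, a quick back-of-the-envelope calculation (using only the token estimates cited in this article) shows how sharply demand is growing and how large the shortfall is relative to the total need:

```python
# Token figures cited in the article, in trillions of tokens.
gpt4_tokens = 12                  # reported training-data size for GPT-4
gpt5_low, gpt5_high = 60, 100     # estimated need for GPT-5
gap_low, gap_high = 10, 20        # estimated shortfall of quality data

# Growth in demand from GPT-4 to GPT-5
growth_low = gpt5_low / gpt4_tokens    # 5.0x
growth_high = gpt5_high / gpt4_tokens  # ~8.3x
print(f"GPT-5 demand: {growth_low:.1f}x to {growth_high:.1f}x GPT-4's")

# The shortfall as a share of the estimated total need
share_low = gap_low / gpt5_low     # ~17% of the low estimate
share_high = gap_high / gpt5_high  # 20% of the high estimate
print(f"Shortfall: {share_low:.0%} to {share_high:.0%} of what GPT-5 needs")
```

In other words, even at the optimistic end of these estimates, roughly a fifth of the data GPT-5 would need simply may not exist in usable form.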


This deficit of quality data translates into a real bottleneck for the advancement of AI. The often outdated or low-quality data that populates the web is a serious limit on the effectiveness of machine learning. In addition, the restrictions that large platforms impose on data access only exacerbate the problem, further limiting the resources available for training language models.

In response to this challenge, the strategies adopted range from technical innovations to strategic partnerships. OpenAI, for example, aims to exploit audio and video data through its Whisper speech recognition tool in order to expand the pool of available data. In parallel, the company is exploring the possibility of generating high-quality synthetic data to fill the existing gap.

Gianluca Cobucci

Passionate about code, languages, and human-machine interfaces. Everything about technological evolution interests me. I try to share my passion with the utmost clarity, relying on reliable sources rather than the first one that comes along.

