
Apple claims to follow a responsible approach in training its AI models

According to Apple, its AI models were trained on open-source code from GitHub, encompassing languages like Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go.

Social Samosa

Apple has released a technical paper outlining the models it has created to drive Apple Intelligence, the suite of generative AI features coming to iOS, macOS, and iPadOS in the coming months.

In the paper, Apple counters claims that it took an ethically dubious approach to training its models, emphasizing that it did not use private user data and instead relied on a mix of publicly available and licensed data for Apple Intelligence.

In July, Proof News revealed that Apple had used a dataset named The Pile, which includes subtitles from hundreds of thousands of YouTube videos, to develop a series of models for on-device processing. Many YouTube creators whose subtitles were included in The Pile were unaware of this use and had not consented to it. Apple subsequently stated that it did not plan to use those models for any AI features in its products.

The technical paper, which provides insights into the Apple Foundation Models (AFM) introduced at WWDC 2024 in June, highlights that the training data for these models was gathered in a manner deemed “responsible” — according to Apple's standards, at least.

The training data for the AFM models includes both publicly available web data and licensed content from unnamed publishers. The New York Times reported that in late 2023, Apple contacted several publishers, such as NBC, Condé Nast, and IAC, proposing multi-year agreements valued at over $50 million to use their news archives for training. Additionally, Apple's AFM models were trained on open-source code from GitHub, encompassing languages like Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go.

To enhance the mathematics capabilities of the AFM models, Apple incorporated math questions and answers from various sources, including webpages, math forums, blogs, tutorials, and seminars, according to the paper. The company also utilized "high-quality, publicly available" datasets (which are not specified in the paper) with licenses permitting their use for training models. These datasets were filtered to exclude sensitive information.

Apple also used additional data, such as human feedback and synthetic data, to refine the AFM models and address potential issues like generating toxic responses.

While some companies argue that scraping public web data for training models is protected by the fair use doctrine, this practice remains highly debated and is the subject of increasing legal challenges. Apple, however, notes in the paper that it provides webmasters the option to block its crawler from accessing their data.
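The opt-out Apple describes works through the standard robots.txt mechanism. As a sketch of what that looks like in practice (the user agent name `Applebot-Extended`, which Apple has documented for excluding content from AI training, is the assumed identifier here):

```text
# robots.txt — block Apple's AI-training crawler site-wide,
# while leaving the regular Applebot free to index pages for search features.
User-agent: Applebot-Extended
Disallow: /
```

A site owner who wants Apple's search features but not model training would add only the directive above; disallowing the plain `Applebot` user agent instead would remove the site from Apple's search indexing as well.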

Tags: Apple Foundation Models, Apple Intelligence, Apple AI, generative AI