Google has unveiled its latest large language model (LLM), PaLM 2, which is trained on a staggering 3.6 trillion tokens, according to internal documentation seen by CNBC. That is nearly five times the 780 billion tokens used to train the previous version, PaLM, in 2022. The larger training corpus allows PaLM 2 to excel at coding, math, and creative writing tasks, pushing the boundaries of what language models can achieve.
Tokens, the words and word fragments a model reads and writes, play a crucial role in training LLMs: the model learns by repeatedly predicting the next token in a sequence. PaLM 2's vast training data gives it a deeper grasp of language patterns and context.
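To make the prediction idea concrete, here is a minimal sketch in plain Python of a bigram-style next-token predictor. The toy corpus, whitespace tokenization, and function name are our own illustration, not Google's tokenizer or training code; production models use sub-word tokenizers and neural networks rather than frequency counts.

```python
from collections import Counter, defaultdict

# Toy corpus; real LLMs train on trillions of tokens, not one sentence.
corpus = "the model predicts the next token the model learns patterns"

# Whitespace tokenization for illustration only; production systems use
# sub-word tokenizers (e.g. byte-pair encoding), not simple splits.
tokens = corpus.split()

# Count which token follows each token: a bigram model, the simplest
# possible "predict the next token" learner.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the continuation seen most often after `token` in training."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))    # -> "model" (seen twice, vs. "next" once)
print(predict_next("model"))  # -> "predicts" (ties broken by first occurrence)
```

Scaled up from counting word pairs to a neural network trained on trillions of tokens, this same objective, predicting what comes next, is what gives an LLM its fluency.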
Google has been cautious about disclosing the details of its training data, citing the competitive nature of the industry. Similarly, OpenAI, the creator of ChatGPT, has kept the specifics of its latest LLM, GPT-4, under wraps. However, there is a growing demand from the research community for greater transparency as the AI arms race intensifies.
Although PaLM 2 is smaller than its predecessor, Google says it is more efficient and handles more sophisticated tasks. The model has 340 billion parameters, a rough measure of its complexity, compared with the original PaLM's 540 billion.
Google stated that PaLM 2 uses a technique called "compute-optimal scaling," which balances model size against the amount of training data to improve overall performance, yielding faster inference, fewer parameters to serve, and lower serving costs.
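As a rough illustration of what compute-optimal scaling means, the snippet below compares tokens per parameter for the two models using the figures reported in this article. The roughly 20-tokens-per-parameter heuristic comes from the public "Chinchilla" scaling-law work by Hoffmann et al. (2022), not from Google's PaLM 2 documentation, so treat it as context rather than a statement of Google's actual recipe.

```python
# Scaling-law context (assumption: the Chinchilla heuristic from
# Hoffmann et al., 2022): for a fixed compute budget C ≈ 6 * N * D,
# where N = parameters and D = training tokens, loss is roughly
# minimized when D grows in step with N, around D ≈ 20 * N.

def tokens_per_parameter(tokens: float, params: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / params

# Figures reported in this article.
palm = tokens_per_parameter(780e9, 540e9)    # original PaLM (2022)
palm2 = tokens_per_parameter(3.6e12, 340e9)  # PaLM 2, per the leaked docs

print(f"PaLM:   {palm:.1f} tokens per parameter")   # ~1.4
print(f"PaLM 2: {palm2:.1f} tokens per parameter")  # ~10.6
```

By this yardstick, PaLM 2 moves sharply toward the data-heavy regime: a smaller model fed far more tokens, which is consistent with Google's claims of faster inference and lower serving costs.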
PaLM 2 supports 100 languages and is already being used in 25 features and products, including Google's experimental chatbot Bard. It comes in four sizes, named from smallest to largest: Gecko, Otter, Bison, and Unicorn.
By comparison, Meta's LLM, LLaMA, was trained on 1.4 trillion tokens, while OpenAI's GPT-3 was trained on 300 billion tokens. OpenAI released GPT-4 in March, claiming it exhibits "human-level performance" on various professional tests.
Amid the rapid proliferation of AI applications, controversies surrounding the technology have escalated. In February, El Mahdi El Mhamdi, a senior scientist at Google Research, resigned over the company's lack of transparency. OpenAI CEO Sam Altman recently testified before the Senate Judiciary subcommittee on privacy and technology, agreeing with lawmakers that a new framework is needed to address AI's impact and responsibilities.
As language models continue to evolve and expand their capabilities, the industry faces the challenge of striking a balance between technological advancements, transparency, and ethical considerations.