Chatbot users often praise how conversational and intuitive chatbots feel. But have you ever wondered how a chatbot knows what you’re referring back to? A recent study sheds light on how transformer models, the driving force behind modern chatbots, decide what information to pay attention to. The research, led by Samet Oymak, assistant professor of electrical and computer engineering at the University of Michigan, describes a mathematical framework for how transformers learn to focus on relevant details.
In 2017, transformer architectures revolutionized the field of natural language processing. Their ability to consume and process massive amounts of text, including entire books, put them at the forefront of AI development. Transformers break text down into smaller units known as tokens and process them in parallel while still keeping track of the context around each word. The GPT-4 language model has spent years digesting text from the internet, resulting in a remarkably conversational chatbot capable of passing the bar exam.
The key to transformers’ success is their attention mechanism, which lets them determine which information is most relevant. Oymak’s research team found that transformers do this in a surprisingly old-school way: the attention mechanism behaves like a support vector machine (SVM), a method invented over 30 years ago. An SVM draws a boundary that separates data into categories, for example positive and negative sentiment in customer reviews. It turns out that transformers adopt a similar strategy when deciding what information to prioritize and what to ignore.
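To make the SVM idea concrete, here is a minimal sketch of a linear SVM that separates positive from negative reviews. It is an illustration only, not the method analyzed in the study: the two features (counts of positive and negative words), the six toy reviews, and the helper names `train_linear_svm` and `predict` are all assumptions made for this example.

```python
# Toy features (hypothetical): (count of positive words, count of negative words).
reviews = [(3, 0), (2, 1), (4, 1), (0, 3), (1, 2), (1, 4)]
labels = [1, 1, 1, -1, -1, -1]  # +1 = positive review, -1 = negative review


def train_linear_svm(points, targets, epochs=500, lr=0.01, reg=0.01):
    """Soft-margin linear SVM fitted by subgradient descent on the hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, targets):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            inside_margin = y * score < 1  # misclassified or too close to the boundary
            # The regulariser always shrinks w; the hinge term only acts inside the margin.
            w = [
                wi - lr * (reg * wi - (y * xi if inside_margin else 0.0))
                for wi, xi in zip(w, x)
            ]
            if inside_margin:
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


w, b = train_linear_svm(reviews, labels)
print(predict(w, b, (5, 0)), predict(w, b, (0, 5)))
```

The learned boundary is determined mainly by the reviews closest to it, here `(2, 1)` and `(1, 2)`; the clearly positive and clearly negative reviews stop influencing training once they sit safely outside the margin.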
Although interacting with a chatbot such as ChatGPT feels human-like, the underlying process is driven by mathematical calculation. Each token of text is converted into a numerical vector. When prompted, the attention mechanism assigns a weight to each vector, and thereby to each word and word combination, determining which information should shape the chatbot’s response. The chatbot itself operates as a word-prediction algorithm: it predicts the first word of its response, then the next, and so on until the response is complete.
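The weighting step can be sketched in a few lines. The example below assumes the standard scaled dot-product form of attention with a softmax over the scores; the two-dimensional token vectors and the query are made-up values chosen for readability, whereas real models use learned vectors with hundreds or thousands of dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [
        sum(wt * value[i] for wt, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Hypothetical token vectors; keys and values are the same here for simplicity.
tokens = ["the", "cat", "sat"]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
values = keys
query = [0.0, 2.0]

weights, output = attend(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

With these numbers the query lines up best with the vector for "cat", so that token receives the largest weight and contributes most to the output.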
When given a follow-up prompt, the chatbot appears to recall the conversation and continues seamlessly. In fact, ChatGPT rereads the entire conversation from the start, assigns new weights to each token, and formulates its response based on this fresh evaluation. This rereading gives the impression of memory, and it is why ChatGPT can keep track of earlier exchanges and summarize long inputs such as a hundred lines from Romeo and Juliet.
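A rough sketch of what rereading the conversation means in practice: on every turn, the full transcript so far is assembled into a single prompt and handed to the model again. Everything here is hypothetical, including `toy_model`, which stands in for a real language model and simply reports how much text it was given; the point is only that the model keeps no state between calls.

```python
def toy_model(prompt):
    """Hypothetical stand-in for a language model: it sees only the prompt."""
    n_words = len(prompt.split())
    return f"(reply after reading {n_words} words)"


def chat_turn(history, user_message):
    """Append the new message, rebuild the whole transcript, and query the model."""
    history = history + [("user", user_message)]
    prompt = "\n".join(f"{role}: {text}" for role, text in history)
    reply = toy_model(prompt)
    return history + [("assistant", reply)], prompt


history = []
history, first_prompt = chat_turn(history, "Summarize Romeo and Juliet.")
history, second_prompt = chat_turn(history, "Who was Juliet's cousin?")

print(second_prompt)
```

The second prompt contains the first question and the first reply as well as the new question, which is why the model can refer back to them even though it kept nothing from the earlier call.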
Although the operation of transformer neural networks was partially understood before this study, transformers do not use an explicit threshold to decide what to pay attention to and what to ignore. This is where the SVM-like mechanism comes into play: it lets transformers identify and retrieve valuable information from a vast sea of text. Oymak emphasizes the significance of this finding as black-box models like transformers become increasingly prevalent across applications.
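One way to picture selection without a threshold, offered as an intuition rather than as the study's own derivation, is that softmax weights become sharper as the underlying scores grow in magnitude. The relevance scores and the `scale` factors below are made-up values for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 1.5]  # hypothetical relevance scores for three tokens

# Multiply the scores by a growing factor and watch the weights concentrate.
sharpened = {scale: softmax([scale * s for s in scores]) for scale in (1, 5, 25)}
for scale, weights in sharpened.items():
    print(scale, [round(wt, 3) for wt in weights])
```

No cutoff is ever applied, yet at large scales nearly all of the weight lands on the highest-scoring token, so the soft weighting ends up acting much like a hard choice between relevant and irrelevant tokens.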
Oymak’s team intends to use this knowledge to improve the efficiency and interpretability of large language models. By understanding how transformers learn to pay attention, they hope to help AI researchers develop more effective models for perception, image processing, audio processing, and other areas where attention is crucial. A second paper that delves deeper into the topic, “Transformers as Support Vector Machines,” will be presented at the Mathematics of Modern Machine Learning workshop at NeurIPS 2023.
By showing how transformers learn to focus their attention on relevant information, the study by Oymak and his team improves our understanding of transformer models and paves the way for more efficient and interpretable AI systems.