You've probably heard about the groundbreaking paper "Attention Is All You Need" that introduced the transformer architecture. While attention mechanisms are undeniably important, there's a deeper magic at play that often gets overshadowed: embeddings.
Here's the truth: attention and transformers are sophisticated mechanisms for focusing on the relevant parts of a sequence and adjusting each word's representation based on its context. They're like a spotlight that illuminates which words matter most in understanding "The bank was steep" versus "I went to the bank." But that spotlight would be useless without something to illuminate.
The real magic lies in the embeddings themselves—the fact that meaning can be encoded as patterns in high-dimensional vectors. This is the foundational breakthrough that makes everything else possible. Before attention can decide what to focus on, before transformers can process context, the model must first transform language into a mathematical space where semantic relationships have geometric structure. Without embeddings that capture meaning, attention mechanisms would just be shuffling around meaningless numbers.
Think of it this way: attention is the director deciding which actors should be in the spotlight, but embeddings are the very stage on which the performance happens—a multidimensional space where "king" and "queen" naturally sit near each other, where analogies become vector arithmetic, and where meaning itself becomes measurable.
Now that we've broken down our text into tokens, you might wonder: how does the model actually "understand" these tokens? This is where one of the most elegant concepts in AI comes into play—embeddings.
After tokenization assigns each token a simple numerical ID, the LLM performs a crucial transformation: it converts these token IDs into dense vectors called embeddings. These embeddings are arrays of real numbers—typically hundreds or thousands of dimensions—that capture the semantic meaning and relationships of each token.
Think of it this way: a token ID is just a label, like a student ID number. It tells you which token you're dealing with, but nothing about what it means. An embedding, on the other hand, is like a complete profile of that student—their interests, skills, personality traits, and relationships with others, all encoded as numbers.
The conversion happens through an embedding matrix—a large lookup table that's part of the model's learned parameters. When a token enters the network, the model uses its ID to retrieve the corresponding row from this matrix, producing a vector that represents that token. (Source: "Explained: Tokens and Embeddings in LLMs," The Research Nest on Medium)
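To make this lookup concrete, here's a minimal sketch in Python using NumPy. The five-token vocabulary, the 8-dimensional vectors, and the random values are all made up for illustration; a real model has a vocabulary of tens of thousands of tokens, far more dimensions, and matrix values that are learned rather than random.

```python
import numpy as np

# Toy vocabulary and embedding matrix: one row per token ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), 8))

def embed(token: str) -> np.ndarray:
    """Look up a token's embedding: its ID simply selects a row of the matrix."""
    token_id = vocab[token]            # tokenization already produced this ID
    return embedding_matrix[token_id]  # the "lookup table" step

print(embed("cat"))  # an 8-dimensional vector of real numbers
```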
Here's a simplified example: the token "cat" might become a vector like [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2], while "kitty" might be [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]. Notice how similar they are? That's the magic of embeddings—they place semantically related words close together in this high-dimensional space.
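That closeness can be measured with cosine similarity, which compares the directions of two vectors. The six-dimensional vectors below are hypothetical stand-ins (not taken from any real model), shortened so the comparison is easy to follow:

```python
import numpy as np

# Hypothetical, shortened embedding vectors for illustration only.
cat   = np.array([1.5, -0.4, 7.2, 19.6, 3.1, 20.2])
kitty = np.array([1.5, -0.4, 7.2, 19.5, 3.2, 20.8])
dog   = np.array([8.1,  2.9, -1.3, 4.0, 15.7, 2.4])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 means the vectors point in nearly the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitty))  # nearly 1.0: near-synonyms sit close together
print(cosine_similarity(cat, dog))    # noticeably lower: related, but less similar
```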
Embeddings are a foundational component of LLMs, enabling them to understand context, nuance, and the subtle meanings of words and phrases. They encode not just the identity of a token but its relationships with other tokens, allowing LLMs to achieve deep language understanding for tasks like sentiment analysis, text summarization, and question answering. (Source: "The Building Blocks of LLMs: Vectors, Tokens and Embeddings," The New Stack)
What makes embeddings truly remarkable is that they're learned automatically during training. As the LLM processes vast amounts of text, it discovers patterns in how words are used together and adjusts these vectors to capture semantic relationships. Words that appear in similar contexts naturally gravitate toward each other in this mathematical space (a toy version of this training process is sketched below). (Source: "Embeddings 101: The Foundation of LLM Power and Innovation")
This learned representation is far superior to simple token IDs because:
Similarity is captured: Words with related meanings cluster together
Relationships are preserved: Analogies and associations emerge naturally
Computation becomes possible: Neural networks can perform mathematical operations on these vectors to understand and generate language
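To see what "learned automatically" means in practice, here's a toy, skip-gram-style training loop written with PyTorch. It is only a sketch of the idea, not the code any production LLM uses: the embedding rows start out random and are nudged by gradient descent so that each word's vector helps predict its neighbors, which is what pulls words used in similar contexts toward each other.

```python
import torch
import torch.nn as nn

# A deliberately tiny corpus, just to show the mechanism.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) training pairs from a window of one word on each side.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

emb = nn.Embedding(len(vocab), 16)   # the embedding matrix being learned
out = nn.Linear(16, len(vocab))      # predicts which token appears nearby
opt = torch.optim.Adam(list(emb.parameters()) + list(out.parameters()), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

for step in range(200):
    opt.zero_grad()
    logits = out(emb(centers))        # use each center word's vector...
    loss = loss_fn(logits, contexts)  # ...to predict its neighbor
    loss.backward()                   # gradients adjust the embedding rows
    opt.step()

# After training, words that appear in similar contexts ("cat"/"dog",
# "mat"/"rug") tend to end up with more similar vectors than unrelated pairs.
```

Modern LLMs learn their embedding matrices in essentially the same way, except the prediction task is next-token prediction over billions of documents and the vectors live inside a much larger network.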
Perhaps the most stunning proof that meaning can be expressed mathematically comes from a famous discovery in word embeddings: king - man + woman ≈ queen.
In word embedding spaces, you can actually perform arithmetic on word vectors to solve analogies. For example, if you take the vector for "king," subtract the vector for "man," and add the vector for "woman," the result points to a location in the vector space that's closest to "queen."
This isn't a trick programmed into the system: these models aren't trained to produce the relationship; it emerges naturally from analyzing how words co-occur in text. The model has discovered that the relationship between "king" and "man" is similar to the relationship between "queen" and "woman," and that relationship can be expressed as a direction in vector space. (Source: "King - man + woman = queen: the hidden algebraic structure of words," School of Informatics)
Other examples include:
Paris - France + Italy ≈ Rome (capturing capital-country relationships)
Walking - Walk + Swim ≈ Swimming (capturing patterns of verb inflection)
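You can reproduce these analogies with pretrained word vectors. The sketch below assumes the gensim library is installed and that its downloader can fetch the small GloVe model named here (the library and model name are assumptions about your setup); with these vectors, "queen" and "rome" typically show up at or near the top of the results.

```python
import numpy as np
import gensim.downloader as api

# Downloads a small set of pretrained GloVe vectors on first use (assumed
# available via gensim's downloader). GloVe tokens here are lowercase.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# paris - france + italy ≈ rome
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))

# The "parallelogram" view: the man→king offset points in roughly the same
# direction as the woman→queen offset.
offset_royal  = vectors["king"] - vectors["man"]
offset_female = vectors["queen"] - vectors["woman"]
cos = offset_royal @ offset_female / (
    np.linalg.norm(offset_royal) * np.linalg.norm(offset_female))
print(cos)  # typically clearly positive: similar relationship, similar direction
```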
What's happening here is that word embeddings reflect co-occurrence statistics from the training data, and these statistics capture semantic relationships. The vectors form geometric patterns—parallelograms—where similar relationships occupy similar directions in space. (Source: "King - man + woman = queen: the hidden algebraic structure of words," School of Informatics)
This reveals a profound truth: meaning isn't just poetic or philosophical—it has a mathematical structure. The way we use words creates patterns, and those patterns can be captured as geometric relationships in high-dimensional space. By converting tokens into vectors, an LLM turns language into mathematical objects it can use to represent nuance, grammar, and semantic relationships. (Source: "What Are LLM Embeddings? A Simple Explanation for Beginners")
In the next section, we'll explore how these embeddings flow through the transformer architecture, where the model's attention mechanism uses these mathematical representations to understand context and generate coherent responses.
The breakthrough in understanding that embeddings could capture semantic meaning came in 2013, when a team of researchers led by Tomáš Mikolov at Google developed Word2Vec. This technique represented words as high-dimensional vectors that captured relationships between words based on how they appeared in surrounding contexts.
The term "word embeddings" was originally coined by Bengio and colleagues in 2003, but it was Mikolov's Word2Vec toolkit that really brought word embeddings to the forefront and demonstrated their remarkable properties. An overview of word embeddings and their connection to distributional semantic models What made this discovery extraordinary wasn't just that words could be represented as numbers—it was that these numerical representations spontaneously captured meaning and relationships in ways that could be manipulated mathematically.
Researchers discovered that Word2Vec made explicit what had been implicit: encoding meaning as vector offsets in an embedding space wasn't just a happy accident, but a fundamental property of how statistical patterns in language could be captured in geometric space.
Here's where it gets truly fascinating: neuroscience research suggests that the human brain uses remarkably similar principles to represent meaning.
Distributed Representations in the Brain
In neuroscience, semantic memories—our factual knowledge about the world—are represented through distributed patterns of neural activity across populations of neurons. Rather than single neurons encoding specific concepts, information is spread across multiple neurons, with the pattern of activity indicating what's being represented.
A central principle in modern neuroscience is that neurons act in concert to produce cognition and behavior. The brain relies on distributed circuits that continuously encode information through population codes, where groups of neurons work together to represent complex concepts. (Source: "Decoding the brain: From neural representations to mechanistic models," ScienceDirect)
This is strikingly similar to how LLM embeddings work! Just as a word like "king" is represented by a pattern of hundreds of numbers in an embedding vector, concepts in your brain are represented by patterns of activity across hundreds or thousands of neurons.
Recent Breakthroughs: AI and Brain Embeddings Align
A groundbreaking 2024 study published in Nature Communications recorded neural activity in the human inferior frontal gyrus while participants listened to a podcast. Researchers derived "brain embeddings"—continuous vector representations for each word based on neural firing patterns—and compared them to embeddings from large language models. The result? Brain embeddings and AI embeddings showed common geometric patterns. (Source: "Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns," Nature Communications)
Like word embeddings in AI, the brain encodes information in distributed activity patterns. Neuroscientists face a similar challenge to AI researchers: how to "read off meaning" from these distributed patterns. Both systems—biological and artificial—appear to use similar mathematical strategies for encoding semantic relationships. (Source: "Decoding Word Embeddings with Brain-Based Semantic Features," Computational Linguistics)
What This Means
The parallels between biological and artificial neural representations suggest something profound: semantic knowledge in the brain is represented in a distributed manner, with patterns of neural activity encoding the relationships between concepts, much like how embeddings in LLMs capture word relationships in vector space.
Research on human hippocampal neurons reveals sparse distributed coding, where individual neurons participate in encoding only a few memories, and each memory is coded by a small fraction of neurons—a pattern that loosely parallels the sparse activations seen in artificial neural networks. (Source: "Sparse and distributed coding of episodic memory in neurons of the human hippocampus," PMC)
This doesn't mean the brain is literally computing vector arithmetic or that AI truly "understands" like we do. But it does suggest that both biological and artificial intelligence have converged on similar computational strategies for representing meaning: distributing information across many units and encoding relationships as geometric patterns in high-dimensional space.
The discovery that meaning can be mathematical—whether in silicon or in neurons—represents one of the most profound insights in both AI and neuroscience. It suggests that the abstract notion of "meaning" has concrete, measurable structure that can be captured, analyzed, and even manipulated through mathematical operations.