How Transformers Work: Tokenization, Embeddings, and Positional Encoding Explained (Part 1)
Learn How Transformers Operate Intuitively: The Basics of Tokenization, Embeddings, and Positional Encoding
In recent years, transformers have revolutionized the field of Natural Language Processing (NLP), enabling breakthroughs in machine translation, text generation, and conversational AI. But how do these powerful models actually work? In this series, we’ll break down the inner workings of transformers in an intuitive and structured manner. This first blog will focus on the foundational steps of tokenization, embeddings, and positional encoding, which are crucial for transforming raw text into a format that transformers can understand. By the end of this post, you’ll have a clear grasp of how text is processed and represented numerically before feeding it into a transformer model.
What is a Transformer?
A Transformer is a deep learning model architecture designed to process and transform sequential data, such as text, into meaningful outputs. It is the foundation for many state-of-the-art natural language processing (NLP) models, such as GPT, BERT, and T5. Transformers are particularly effective because they rely on a mechanism called self-attention, which allows the model to weigh the importance of different words in a sequence when making predictions or translations.
To intuitively understand how a Transformer works, let's consider an example from a machine translation task, where we train a transformer to translate an English sentence into French. Below are the input and target sentences that will be used to train the transformer:
English Input (Source Sequence):
"My name is Vikas. I love cricket, finance and AI."
French Translation (Target Sequence):
"Mon nom est Vikas. J'aime le cricket, la finance et l'IA."
Why Were Transformers Introduced?
Transformers were designed to address the following key limitations of RNNs and encoder-decoder models with attention:
Sequential Processing in RNNs:
RNNs and encoder-decoder models with attention process inputs sequentially, which limits their ability to leverage parallelism during training. Transformers eliminate this constraint by processing entire sequences simultaneously using self-attention.
Long-Term Dependencies:
Even with attention mechanisms, RNNs can struggle with very long sequences due to their sequential nature and reliance on hidden states. Transformers use self-attention to directly compute dependencies between all tokens in a sequence, regardless of distance.
Scalability:
Training RNN-based models with attention can be computationally intensive for long sequences. Transformers scale more efficiently due to their ability to parallelize computations over the entire sequence.
Let’s start with the first step: Tokenization
Tokenization is a fundamental step in Natural Language Processing (NLP) pipelines, where a text corpus is broken into smaller units called tokens. These tokens can be words, subwords, or characters, making the text easier for models to process.
Transformers often use Byte Pair Encoding (BPE), but for simplicity, we'll demonstrate word-based tokenization using the following corpus:
Corpus:
"My name is Vikas. I love cricket, finance, and AI."
Word Tokenization Process
Identifying Word Boundaries → Words are separated by spaces, punctuation, or delimiters.
Handling Punctuation → Symbols like periods (.) and commas (,) can be treated as separate tokens when needed for the task.
Tokenized Output
Applying word-based tokenization to the corpus results in:
$$\text{["My", "name", "is", "Vikas", ".", "I", "love", "cricket", ",", "finance", "and", "AI", "."]}$$
This simple yet effective tokenization method helps convert raw text into a structured format, making it easier for NLP models to analyze and process.
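As a minimal illustration, the same word-based tokenization can be reproduced in Python with a simple regular expression. The regex here is an illustrative choice for this toy corpus, not what a real transformer tokenizer (such as BPE) would do:

```python
import re

corpus = "My name is Vikas. I love cricket, finance and AI."

# Split the corpus into word tokens and punctuation tokens (periods, commas).
tokens = re.findall(r"\w+|[.,]", corpus)

print(tokens)
# ['My', 'name', 'is', 'Vikas', '.', 'I', 'love', 'cricket', ',',
#  'finance', 'and', 'AI', '.']
```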
Creating a Vocabulary
To prepare the text for input into a transformer model, we extract the unique tokens from the tokenized output. This step creates a vocabulary, which maps each token to a unique identifier (e.g., an integer). Here is the set of unique tokens from the corpus:
["My", "name", "is", "Vikas", ".", "I", "love", "cricket", ",", "finance", "and", "AI"]
Creating Embeddings From Tokens
The next step is to convert the unique tokens into a format that machine learning models can understand.
Machine learning models operate on numerical data. The process of converting high-dimensional data such as text into a numerical format (called embeddings) that can be represented in a lower-dimensional space is called vectorization.
What is an Embedding Vector?
Embedding vectors are high-dimensional representations of tokens: each token is represented by a vector in a high-dimensional space. These vectors capture rich and meaningful information about each token. To achieve this, the information for each token is represented in terms of various features or properties:
Rows in the embedding matrix correspond to tokens.
Columns in the embedding matrix correspond to latent features or properties.
Values in the matrix indicate how much a particular token (row) relates to a specific property (column).
However, these columns (properties) are not explicitly named or predefined. Instead, they are learned during training. The initial values (scores) in the embedding matrix are random and are adjusted as the model learns from the data.
What is the Embedding Dimension?
The embedding dimension is the number of ways you can describe the meaning of a token. Each column (dimension) captures one aspect of the token’s semantics.
Example Analogy: Describing a Person
Imagine describing people in a table:
| Person | Age | Height | Weight | Eye Color | Hair Type | … |
| --- | --- | --- | --- | --- | --- | --- |
| Alice | 25 | 5'7" | 140 | Brown | Curly | … |
| Bob | 30 | 6'0" | 180 | Blue | Straight | … |
Each column represents a way to describe a person (age, height, etc.). The number of columns is the embedding dimension. More columns provide more descriptive power.
Similarly, for tokens, each row of the embedding matrix describes that token in as many ways as there are columns.
Each column captures a distinct semantic property, such as:
Does this word convey a sense of time?
Is this word formal or informal?
Does this word relate to food, emotions, or actions?
Deciding the Dimension Size
The embedding dimension size determines how much information can be encoded for each token:
Complexity of Meaning:
Small dimension (e.g., 50): Captures basic meanings. Useful for simple tasks like binary classification or topic detection.
Large dimension (e.g., 512 or more): Captures rich, nuanced meanings. Necessary for complex tasks like machine translation, summarization, or question answering.
Analogous to Compression: Think of embedding as compressing the meaning of a token into a fixed-size vector:
Low-dimensional embeddings: Coarse representations that lose some details. Analogous to a black-and-white photo.
High-dimensional embeddings: Detailed representations that preserve subtle nuances. Analogous to a high-resolution color photo.
There are many techniques for creating embeddings; for our use case, let us consider Word2Vec. How Word2Vec works is beyond the scope of this blog.
For the sake of simplicity, let us assume a dimension size of 3. If we train Word2Vec on all the unique tokens in our corpus, our embedding matrix might end up looking like this:
| Word | Vector (Dimension 3) |
| --- | --- |
| My | [0.45, 0.12, -0.33] |
| name | [0.67, 0.22, -0.15] |
| is | [-0.12, 0.55, 0.88] |
| Vikas | [0.80, -0.10, 0.25] |
| . | [0.05, 0.03, -0.07] |
| I | [-0.25, 0.40, 0.70] |
| love | [0.60, 0.75, -0.20] |
| cricket | [0.90, 0.15, -0.10] |
| , | [0.02, -0.01, 0.04] |
| finance | [0.85, -0.20, 0.30] |
| and | [0.10, 0.50, 0.40] |
| AI | [0.95, 0.12, 0.08] |
The final embedding matrix for processing is as follows:
$$\text{Embedding Matrix} = \begin{bmatrix} 0.45 & 0.12 & -0.33 \\ 0.67 & 0.22 & -0.15 \\ -0.12 & 0.55 & 0.88 \\ 0.80 & -0.10 & 0.25 \\ 0.05 & 0.03 & -0.07 \\ -0.25 & 0.40 & 0.70 \\ 0.60 & 0.75 & -0.20 \\ 0.90 & 0.15 & -0.10 \\ 0.02 & -0.01 & 0.04 \\ 0.85 & -0.20 & 0.30 \\ 0.10 & 0.50 & 0.40 \\ 0.95 & 0.12 & 0.08 \end{bmatrix}$$
So the embedding matrix is of the order \(N \times D\), where \(N\) is the number of unique tokens and \(D\) is the dimension size.
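For completeness, here is a hedged sketch of how such a matrix could be produced with gensim's Word2Vec, continuing with the `tokens` and `vocab` from the earlier sketches. The hyperparameters are illustrative, and a single 13-token sentence is far too little data to learn meaningful vectors, so the resulting numbers will differ from the toy values in the table above.

```python
import numpy as np
from gensim.models import Word2Vec

# A real model would be trained on a large corpus; here we reuse our single sentence.
sentences = [tokens]

# vector_size=3 matches the toy embedding dimension D = 3 used above.
model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, seed=42)

# Stack the per-token vectors into an N x D embedding matrix, ordered by vocabulary ID.
embedding_matrix = np.stack([model.wv[token] for token in vocab])

print(embedding_matrix.shape)  # (12, 3)
```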
Transformers overcome the limitations of sequential processing and long-term dependencies by processing tokens in parallel and using the self-attention mechanism.
This, however, comes with a drawback: the model has no inherent information about the order of tokens in the input sequence. Positional encodings therefore have to be added to the embedding vectors to give the transformer the positional context of each token.
Positional Encoding
There are many types of positional encodings, but for this example let us consider fixed sinusoidal positional encoding.
For a sequence position \(pos\) and dimension \(i\) in the embedding vector:
$$PE(pos,2i) = \sin\left(\frac{pos}{10,000^{\frac{2i}{d}}} \right)$$
$$PE(pos,2i+1) = \cos\left(\frac{pos}{10,000^{\frac{2i}{d}}} \right)$$
where,
\(pos\): The position of the token in the sequence (e.g., 0, 1, 2, …).
\(i\): The dimension index of the embedding vector (e.g., 0, 1, 2, …).
\(d\): The total dimension size of the embeddings (e.g., 512). In our case, \(d=3\).
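As a quick sanity check with \(d = 3\), the entry for the token at position 1 and dimension index \(2i = 0\) works out to

$$PE(1, 0) = \sin\left(\frac{1}{10,000^{0}}\right) = \sin(1) \approx 0.8415,$$

which is exactly the first value in the second row of the positional encoding matrix shown below.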
The positional encodings are added by the operation
$$\text{Embedding Matrix} + \text{Positional Encodings Matrix}$$
Therefore, the positional encoding matrix must also be of dimension \(N \times D\).
For our embedding matrix, the positional encoding matrix is computed as
$$\text{Positional Encoding Matrix} = \begin{bmatrix} 0.0000 & 1.0000 & 0.0000 \\ 0.8415 & 0.9999 & 0.0000 \\ 0.9093 & 0.9999 & 0.0000 \\ 0.1411 & 0.9999 & 0.0000 \\ -0.7568 & 0.9999 & 0.0000 \\ -0.9589 & 0.9999 & 0.0000 \\ -0.2794 & 0.9999 & 0.0000 \\ 0.6569 & 0.9998 & 0.0000 \\ 0.9894 & 0.9998 & 0.0000 \\ 0.4121 & 0.9998 & 0.0000 \\ -0.5440 & 0.9998 & 0.0000 \\ -0.9999 & 0.9997 & 0.0000 \end{bmatrix}$$
On adding the two together, we obtain the following input matrix.
$$\text{Input Matrix} = \begin{bmatrix} 0.4500 & 1.1200 & -0.3300 \\ 1.5115 & 1.2199 & -0.1500 \\ 0.7893 & 1.5499 & 0.8800 \\ 0.9411 & 0.8999 & 0.2500 \\ -0.7068 & 1.0299 & -0.0700 \\ -1.2089 & 1.3999 & 0.7000 \\ 0.3206 & 1.7499 & -0.2000 \\ 1.5569 & 1.1498 & -0.1000 \\ 1.0094 & 0.9898 & 0.0400 \\ 1.2621 & 0.7998 & 0.3000 \\ -0.4440 & 1.4998 & 0.4000 \\ -0.0499 & 1.1197 & 0.0800 \end{bmatrix}$$
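Below is a small NumPy sketch of how these two matrices could be computed and combined. It applies the rate \(1/10{,}000^{2i/d}\) to each column index \(i\), using sine for even columns and cosine for odd columns, which reproduces the positional encoding matrix above up to rounding; note that some implementations instead share one rate per sine/cosine pair, so exact values can vary. The `embedding_matrix` is the one from the earlier Word2Vec sketch (or any \(N \times D\) matrix such as the hand-written one above).

```python
import numpy as np

def positional_encoding(n_positions: int, d: int) -> np.ndarray:
    """Fixed sinusoidal positional encodings: sine for even dimensions, cosine for odd."""
    positions = np.arange(n_positions)[:, np.newaxis]        # shape (N, 1)
    rates = 1.0 / np.power(10_000, 2 * np.arange(d) / d)     # shape (D,)
    angles = positions * rates                               # shape (N, D)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimension indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimension indices
    return pe

pe_matrix = positional_encoding(n_positions=12, d=3)

# Element-wise addition yields the input matrix fed into the transformer.
input_matrix = embedding_matrix + pe_matrix
print(input_matrix.shape)  # (12, 3)
```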
Conclusion: Laying the Foundation for Transformers
In this blog, we explored the first crucial steps in understanding how transformers work—tokenization, embeddings, and positional encoding. These processes transform raw text into a numerical format that a transformer can interpret, setting the stage for the self-attention mechanism to process and learn relationships within sequences.
By using word embeddings, transformers can represent words in a meaningful, high-dimensional space, capturing semantic similarities and contextual relationships. However, since transformers process tokens in parallel, they lack an inherent sense of order—this is why positional encodings are crucial in providing context about token positions within a sequence.
But how do transformers actually process these embeddings to understand and generate language? The key lies in Self-Attention, the core innovation behind transformers, which enables them to dynamically focus on different words in a sequence.
👉 In the next part of this series, we’ll dive deep into Self-Attention—breaking down how transformers determine relationships between tokens, compute attention scores, and generate context-rich representations. Stay tuned! 🚀