Transformer Encoder Explained: A Thorough Look at How It Works (Part 4)
Understanding Transformers: Easy Insights into Feed-Forward Networks, Residual Links, and Layer Normalization
Photo by Markus Spiske on Unsplash
The Transformer encoder is a fundamental component of modern deep learning architectures, driving breakthroughs in natural language processing and sequence modeling. It processes input data through several key mechanisms:
Computing attention scores to determine token relationships.
Passing data through a feed-forward network for deeper feature extraction.
Adding residual connections to preserve information and improve gradient flow.
Applying layer normalization for stable and efficient training.
If you haven’t already, check out Part 2 of this series, where we break down how attention scores are computed—it’s essential for building a strong foundation!
In this blog (Part 4 of our Transformer series), we’ll develop an intuitive understanding of feed-forward networks, residual connections, and layer normalization—key elements that refine and stabilize the Transformer’s learning process.
To make the most of this post, we highly recommend reading our previous blog on attention scores, as it sets the stage for the concepts we’ll explore here.
Feed-Forward Network
The final attention output is further processed in a feed-forward network, where it undergoes two linear transformations with a non-linear activation between them.
A feedforward network (FFN) is a type of artificial neural network where information flows in one direction—from the input layer to the output layer—without looping back. It’s one of the simplest and most fundamental architectures in neural networks.
Transformers are composed of linear layers (like the attention projections) and matrix operations. Without non-linear transformations, the model would effectively collapse into a single linear transformation, no matter how many layers are stacked. The feed-forward network introduces non-linear activation functions (like ReLU or GELU), which allow the model to learn complex, non-linear relationships in the data. Attention captures relationships between tokens, while the feed-forward network adds grammatical and semantic refinement, ensuring each token’s representation aligns with the task and target-language requirements.
$$\text{FFN}(X) = \text{ReLU}(X W_1 + b_1)\, W_2 + b_2$$
Where,
\(W_1 (D \times H)\) and \(W_2 (H \times D)\) are learned weight matrices.
\(b_1 (H)\) and \(b_2 (D)\) are learned bias vectors.
\(H\) (hidden neurons) is typically greater than \(D\).
Here, \(D\) is the embedding dimension (features per token), and \(H\) is the number of hidden neurons in the feed-forward layer.
The output shape remains \(N \times D\), ensuring token representations retain their dimensions. This transformation enhances feature extraction, enriching contextual understanding before passing to the next layer.
The weight matrices (\(W_1\) and \(W_2\)) and biases (\(b_1\) and \(b_2\)) used in our use case are given below.
$$\begin{aligned} \textbf{First Weight Matrix } (W_1) &= \begin{bmatrix} 0.3745 & 0.9507 & 0.7320 & 0.5987 \\ 0.1560 & 0.1560 & 0.0581 & 0.8662 \\ 0.6011 & 0.7081 & 0.0206 & 0.9699 \end{bmatrix} \\[8pt] \textbf{First Bias Vector } (b_1) &= \begin{bmatrix} 0.8324, 0.2123, 0.1818, 0.1834 \end{bmatrix} \end{aligned}$$
$$\begin{aligned} \textbf{Second Weight Matrix } (W_2) &= \begin{bmatrix} 0.3042 & 0.5248 & 0.4320 \\ 0.2912 & 0.6119 & 0.1395 \\ 0.2921 & 0.3664 & 0.4561 \\ 0.7852 & 0.1997 & 0.5142 \end{bmatrix} \\[8pt] \textbf{Second Bias Vector } (b_2) &= \begin{bmatrix} 0.5924, 0.0465, 0.6075 \end{bmatrix} \end{aligned}$$
Output Obtained from the Feed-Forward Network (FFN)
The final FFN output obtained is as follows:
$$\text{FFN Output} = \begin{bmatrix} 2.4327 & 1.7288 & 2.1912 \\ 5.7504 & 4.2261 & 4.7223 \\ 6.6566 & 4.9091 & 5.4139 \\ 3.7186 & 2.6970 & 3.1723 \\ 1.5761 & 1.0830 & 1.5374 \\ 1.6820 & 1.1630 & 1.6183 \\ 3.0062 & 2.1608 & 2.6289 \\ 5.9794 & 4.3994 & 4.8973 \\ 3.7345 & 2.7092 & 3.1845 \\ 4.6607 & 3.4069 & 3.8912 \\ 2.1759 & 1.5355 & 1.9953 \\ 2.1123 & 1.4874 & 1.9467 \end{bmatrix}$$
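To make this concrete, here is a minimal NumPy sketch of the FFN applied to a single token, using the \(W_1\), \(b_1\), \(W_2\), and \(b_2\) values listed above. The input vector is illustrative: it stands in for the first token’s attention output (the FFN input, which is not reproduced in this post but is consistent with the FFN and residual outputs shown here); with it, the sketch reproduces the first row of the FFN output above.

```python
import numpy as np

# Weights and biases from the example above (D = 3, H = 4)
W1 = np.array([[0.3745, 0.9507, 0.7320, 0.5987],
               [0.1560, 0.1560, 0.0581, 0.8662],
               [0.6011, 0.7081, 0.0206, 0.9699]])
b1 = np.array([0.8324, 0.2123, 0.1818, 0.1834])
W2 = np.array([[0.3042, 0.5248, 0.4320],
               [0.2912, 0.6119, 0.1395],
               [0.2921, 0.3664, 0.4561],
               [0.7852, 0.1997, 0.5142]])
b2 = np.array([0.5924, 0.0465, 0.6075])

def ffn(x):
    """FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2, applied per token."""
    hidden = np.maximum(0, x @ W1 + b1)  # expand D -> H, then ReLU
    return hidden @ W2 + b2              # compress H -> D

# Illustrative FFN input: the first token's attention output
# (not listed in this post; inferred from the FFN and residual outputs)
x = np.array([0.3113, 0.6005, 0.4488])
print(ffn(x))  # ~[2.4327, 1.7288, 2.1912], matching row 1 of the FFN output
```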
How Does the Feed-Forward Network (FFN) Work?
In a transformer, the Feed-Forward Network (FFN) refines token representations after attention processing. It applies two linear transformations with a nonlinear activation in between, allowing the model to capture more complex relationships.
Step-by-Step Process:
For each token’s representation:
Input: The FFN receives the contextualized token vector from the attention mechanism.
First Linear Transformation (Expansion): Expands the vector to a higher-dimensional space to capture richer feature interactions.
$$\text{Hidden} = \text{ReLU}(X W_1 + b_1)$$
Nonlinear Activation: Uses ReLU (or another activation function) to introduce nonlinearity, enabling the model to learn complex patterns.
Second Linear Transformation (Compression): Reduces the expanded representation back to its original dimension for consistency.
$$\text{Output} = \text{Hidden} W_2 + b_2$$
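In practice, this expand, activate, compress block is usually written as two linear layers. Below is a minimal PyTorch-style sketch; the sizes `d_model` and `d_ff` are common defaults (the original Transformer used 512 and 2048), not the tiny \(D = 3\), \(H = 4\) values of our worked example.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to d_ff, apply ReLU, compress back to d_model."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation (expansion)
            nn.ReLU(),                  # non-linear activation
            nn.Linear(d_ff, d_model),   # second linear transformation (compression)
        )

    def forward(self, x):
        # x has shape (N, d_model); the same weights are applied to every token
        return self.net(x)
```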
Why Is the Feed-Forward Network Necessary?
While self-attention determines which tokens are important, its output for each token is essentially a weighted (linear) combination of value vectors: it captures relationships between tokens but not complex within-token feature interactions. The FFN addresses this limitation by:
Modeling Nonlinear Relationships: Introduces nonlinearity to enhance feature extraction.
Dimensional Expansion & Compression: Temporarily increases the feature space, allowing deeper transformations before returning to the original size.
Feature Refinement: Emphasizes task-relevant features while suppressing noise.
Task-Specific Representation: Fine-tunes token embeddings for downstream tasks like classification, translation, or question-answering.
Combining Local & Global Information: Processes each token independently, ensuring that global context from attention is applied effectively at the token level.
By adding depth, expressiveness, and feature refinement, the FFN ensures that token representations are optimized for the model’s final output.
Adding Residual Connections
Residual connections play a crucial role in transformers by preserving information, balancing contextual and refined features, and stabilizing training. They are added after the self-attention mechanism and after the feed-forward network (FFN).
Why Are Residual Connections Needed?
1. Preserving Original Information
Transformers apply complex transformations that can distort or lose critical input information. Residual connections help retain the original representation by adding the input back to the transformed output.
After Self-Attention:
The attention mechanism captures token-to-token relationships and dependencies (e.g., subject-verb-object structures).
Without residual connections, these relationships might be weakened as transformations continue.
After the Feed-Forward Network (FFN):
The FFN enhances grammatical, syntactic, and semantic properties at the token level.
However, it does not explicitly preserve the broader sequence-wide context captured by attention.
Residual connections ensure that both global dependencies (from attention) and local refinements (from FFN) are retained, leading to a well-balanced token representation.
2. Preventing the Vanishing Gradient Problem
In deep networks, gradients become smaller as they backpropagate, making it difficult to update earlier layers effectively.
Residual connections provide a direct path for gradients, ensuring better weight updates and preventing training slowdowns.
This technique stabilizes deep architectures, enabling transformers to learn efficiently.
How the Residual Connection Works
The residual connection simply adds the input vector \(X\) to the transformed output \(F(X)\):
$$\text{Output}=X+F(X)$$
This element-wise addition allows the model to retain important input features while integrating new transformations, resulting in a more robust representation.
By combining local refinements (FFN) with global dependencies (attention) and ensuring stable gradient flow, residual connections significantly enhance the efficiency and effectiveness of transformer models.
For our use case, the residually connected output obtained is given below:
$$\text{Residual Output} = \begin{bmatrix} 2.7440 & 2.3293 & 2.6400 \\ 6.7727 & 6.3734 & 6.3224 \\ 7.8748 & 7.4775 & 7.3278 \\ 4.3060 & 3.8967 & 4.0671 \\ 1.7016 & 1.2856 & 1.6900 \\ 1.8308 & 1.4146 & 1.8074 \\ 3.4410 & 3.0284 & 3.2764 \\ 7.0526 & 6.6521 & 6.5761 \\ 4.3258 & 3.9159 & 4.0846 \\ 5.4515 & 5.0446 & 5.1123 \\ 2.4321 & 2.0163 & 2.3550 \\ 2.3544 & 1.9389 & 2.2846 \end{bmatrix}$$
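As a quick sanity check, the residual step is just an element-wise sum: each row of the attention output (the FFN input) is added to the corresponding row of the FFN output. A small NumPy sketch using the same illustrative first-token values as before:

```python
import numpy as np

# First token: attention output (FFN input) and FFN output
attention_out = np.array([0.3113, 0.6005, 0.4488])  # illustrative, inferred as before
ffn_out = np.array([2.4327, 1.7288, 2.1912])        # row 1 of the FFN output

residual = attention_out + ffn_out                  # Output = X + F(X)
print(residual)  # ~[2.7440, 2.3293, 2.6400], matching row 1 of the residual output
```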
The residual output obtained is then normalized using layer normalization, let’s understand how layer normalization works and how it is useful in transformers.
Layer Normalization
Layer normalization standardizes the output of a layer by ensuring that the features in the token’s representation have a mean of 0 and a standard deviation of 1 (before the learnable scale and shift are applied). It does this independently for each token, across all dimensions of its vector representation. For a token’s vector representation \(x = [x_1, x_2, \dots, x_D]\) (of dimension \(D\)), layer normalization normalizes the values within that token (row-wise) to stabilize the learning process. The formula is:
$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$$
Where:
\(x\): A token’s representation, i.e., one row of the input matrix (e.g., the residual output of shape \(N \times D\)).
\(\mu\): Mean of each token's embedding (row-wise mean):
$$\mu = \frac{1}{D} \sum_{i=1}^{D} x_i$$
\(\sigma\): Standard deviation of each token's embedding (row-wise standard deviation):
$$\sigma = \sqrt{\frac{1}{D} \sum_{i=1}^{D} (x_i - \mu)^2 + \epsilon}$$
\(\epsilon\): Small constant for numerical stability, e.g., \(10^{-6}\).
\(\gamma\) : Learnable scaling parameter (shape \(D\)).
\(\beta\) : Learnable shifting parameter (shape \(D\)).
Why is Normalization Needed?
When training deep neural networks (like transformers), the intermediate values (like attention outputs or FFN outputs) can vary a lot between layers or during training. This variation causes two big problems:
Training Instability: If the values are too big (or too small), they can cause issues like exploding or vanishing gradients. This makes it hard for the network to learn effectively.
Slower Learning: Networks need to adjust weights across layers to deal with the inconsistent scale of values. This slows down convergence (how fast the network learns).

Layer normalization fixes these problems by normalizing the values to a consistent scale for every token, regardless of the layer or input.
How is Layer Normalization Helpful?
Stabilizes Training: It keeps the values in a predictable range, which means the model learns more smoothly and consistently. Gradients (the signals used to update weights) don’t explode or vanish, so learning stays on track.
Token-Level Focus: In transformers, every token is processed independently. Layer normalization ensures that each token’s embedding is normalized independently, so it doesn’t depend on other tokens in the batch.
Faster Convergence: By keeping values well-scaled, layer normalization allows the model to converge (learn) faster during training.
Prepares for Next Layer: Layer normalization ensures the output is in a consistent scale, making it easier for the next layer to process it effectively.
For our example, the output obtained after applying layer normalization is given below.
$$\text{LayerNorm Output} = \begin{bmatrix} 1.1330 & 0.7191 & 0.6758 \\ 1.2055 & 0.7709 & -0.4830 \\ 1.1982 & 0.7851 & -0.6501 \\ 1.1851 & 0.7335 & 0.1756 \\ 1.0913 & 0.7164 & 0.9469 \\ 1.0965 & 0.7165 & 0.9165 \\ 1.1591 & 0.7239 & 0.4599 \\ 1.2041 & 0.7747 & -0.5314 \\ 1.1856 & 0.7338 & 0.1684 \\ 1.2039 & 0.7506 & -0.1775 \\ 1.1208 & 0.7178 & 0.7628 \\ 1.1175 & 0.7175 & 0.7850 \end{bmatrix}$$
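For reference, here is a minimal NumPy sketch of row-wise layer normalization. Note that the table above also folds in learnable \(\gamma\) and \(\beta\) values that are not listed in this post; the sketch below uses \(\gamma = 1\) and \(\beta = 0\), so it produces the purely standardized values (zero mean, unit variance per row) rather than reproducing the table exactly.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each row (token) of x to zero mean and unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=-1, keepdims=True)                    # per-token mean
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)   # per-token std
    return gamma * (x - mu) / sigma + beta

# First row of the residual output from above
row = np.array([[2.7440, 2.3293, 2.6400]])
print(layer_norm(row))  # ~[[0.981, -1.372, 0.391]] with gamma = 1, beta = 0
```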
Conclusion: What’s Next?
With Feed-Forward Networks, Residual Connections, and Layer Normalization, we have now completed the operations within one encoder layer. The next step depends on the model’s architecture:
If this is the last encoder layer, the LayerNorm output is passed to the decoder for further processing.
Otherwise, it serves as input to the next encoder layer, where the process is repeated and token representations are refined even further.
In this blog, we focused on understanding the encoder’s internal mechanisms. But what happens next? The decoder takes over!
In the next part of this series, we’ll dive into the transformer decoder, breaking down its role intuitively—from handling encoder outputs to generating meaningful sequences.