What is Backpropagation?
- ASHWIN CHAUHAN
- Nov 16, 2024
- 5 min read

Purpose of Backpropagation
Backpropagation is the algorithm that efficiently computes the gradient of the cost function for a neural network.
The negative of this gradient tells us how each weight and bias should be adjusted to decrease the cost as quickly as possible.
Rather than picturing the gradient only as a direction in a high-dimensional space, we can read each component's magnitude as a measure of how sensitive the cost function is to that particular weight or bias.
Interpreting the Gradient Components
Each component of the gradient vector corresponds to a weight or bias in the network.
A larger magnitude means that the cost function is more sensitive to changes in that parameter.
For example, if one weight has a gradient component of 3.2 and another has 0.1, the cost function is 32 times more sensitive to the first weight than the second.
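As a minimal numeric sketch, using the two gradient values from the example above:

```python
import numpy as np

# Gradient components for two hypothetical weights (values from the example above).
grad = np.array([3.2, 0.1])

# The ratio of magnitudes gives the relative sensitivity of the cost function.
ratio = abs(grad[0]) / abs(grad[1])
assert abs(ratio - 32.0) < 1e-6  # the cost is ~32x more sensitive to the first weight
```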
Effects of a Single Training Example
Analyzing One Training Example
Let's consider a single image of the digit '2' from the training data.
Suppose the network is not yet well trained, so its output activations look essentially random.
Our goal is to adjust the weights and biases based on this example to improve the network's performance.
Desired Adjustments to Output Layer
Increasing Correct Neuron Activation:
We want the activation of the neuron corresponding to the digit '2' to increase.
This means nudging its activation value upwards.
Decreasing Incorrect Neuron Activations:
Simultaneously, we want to decrease the activations of all other output neurons (digits 0-9 excluding 2).
This means nudging their activation values downwards.
Proportional Adjustments:
The size of each nudge should be proportional to how far the current activation is from the desired activation.
For example, if the neuron for '8' is already close to its target (which is zero), it requires a smaller adjustment than the neuron for '2'.
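These proportional nudges can be sketched in a few lines of Python; the activation values here are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Hypothetical output activations for digits 0-9 on an image of a '2'.
activations = np.array([0.1, 0.4, 0.2, 0.3, 0.0, 0.5, 0.1, 0.6, 0.05, 0.2])
targets = np.zeros(10)
targets[2] = 1.0  # only the '2' neuron should be fully active

# Each neuron's desired nudge is proportional to its distance from the target.
nudges = targets - activations
# The '2' neuron gets a large upward nudge (~+0.8), while the nearly-correct
# '8' neuron (activation 0.05) needs only a small downward nudge (~-0.05).
```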
Focusing on a Single Output Neuron
Let's zoom in on the neuron for the digit '2' and understand how we can increase its activation.
Three Ways to Increase Activation
Increase the Bias:
The bias adds a constant value before applying the activation function.
Increasing the bias can shift the neuron's activation upwards.
Increase the Weights:
The neuron's activation is a weighted sum of the activations from the previous layer.
Increasing the weights strengthens the influence of the previous neurons on this neuron.
Adjust Previous Layer Activations:
While we cannot directly change these activations (since they are outputs from the previous layer), understanding their influence is crucial.
Adjustments to the weights and biases in previous layers can indirectly affect these activations.
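The three levers can be sketched with a sigmoid neuron and made-up weights and activations; any monotonic activation function would behave the same way:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical previous-layer activations, weights, and bias for one neuron.
a_prev = np.array([0.9, 0.1, 0.5])
w = np.array([0.2, -0.4, 0.3])
b = -0.1

a = sigmoid(w @ a_prev + b)

# Each lever nudges this neuron's activation upward:
a_bias_up    = sigmoid(w @ a_prev + (b + 0.5))   # 1. increase the bias
a_weights_up = sigmoid((w + 0.1) @ a_prev + b)   # 2. increase the weights
a_inputs_up  = sigmoid(w @ (a_prev + 0.1) + b)   # 3. raise previous activations
assert a_bias_up > a and a_weights_up > a and a_inputs_up > a
```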
Influence of Weights and Previous Activations
Weights Connected to Active Neurons:
Weights connected to highly active neurons (brighter neurons) in the previous layer have a more significant impact.
Adjusting these weights leads to more substantial changes in the output neuron's activation.
Weights Connected to Less Active Neurons:
Weights connected to less active neurons (dimmer neurons) have a smaller effect.
Adjusting these weights contributes less to changing the output neuron's activation.
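This falls straight out of the calculus: for a pre-activation z = w · a_prev + b, the derivative of z with respect to each weight is simply the activation flowing through it. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical previous-layer activations: one bright neuron, one dim neuron.
a_prev = np.array([0.95, 0.05])

# For z = w . a_prev + b, the cost gradient for each weight scales with the
# activation feeding through it: dC/dw_i = a_prev[i] * delta, where delta = dC/dz.
delta = 1.0
grad_w = a_prev * delta
assert grad_w[0] > grad_w[1]  # the bright neuron's weight dominates
```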
Gradient Descent Considerations
When performing gradient descent, we focus on adjustments that give us the most significant decrease in cost per unit of change.
This means prioritizing adjustments to weights and biases that have the highest impact.
Hebbian Theory Analogy
Hebbian Theory: A theory in neuroscience suggesting that "neurons that fire together wire together."
In our context, the most significant increases to weights occur between neurons that are both active.
This strengthens the connections between neurons that are relevant to recognizing the digit '2'.
Propagating Adjustments Backwards
Adjustments to Previous Layers
The output neuron's desired changes influence the neurons in the previous layer.
Each neuron in the previous layer receives a combination of signals from all the output neurons it connects to.
Combining Desires of All Output Neurons
Each output neuron has its own "desires" for how the previous layer should adjust to minimize the cost.
These desires are combined by:
Adding together the adjustments, weighted by how much each output neuron's activation needs to change.
Considering the strength of the weights connecting the output neurons to the previous layer.
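Under the hood, this combination is a matrix-vector product: transpose the weight matrix and apply it to the vector of output-layer "desires". The numbers below are hypothetical:

```python
import numpy as np

# Hypothetical weights from 3 previous-layer neurons to 2 output neurons.
W = np.array([[0.5, -0.2, 0.8],
              [0.1,  0.4, -0.3]])    # W[j, i]: previous neuron i -> output j
delta_out = np.array([0.8, -0.3])    # desired change for each output neuron

# Each previous-layer neuron's desired change sums the output neurons'
# desires, weighted by the connection strengths: W^T @ delta_out.
desired_prev = W.T @ delta_out
# e.g. previous neuron 0: 0.5 * 0.8 + 0.1 * (-0.3) = 0.37
```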
Recursive Application
This process is repeated for each layer, moving backward through the network.
At each layer, we calculate:
The desired adjustments to the activations (though we can't change activations directly).
The adjustments to the weights and biases that will achieve these desired changes.
Backpropagation Mechanism
The "backward" aspect of backpropagation comes from this recursive process.
Errors are propagated from the output layer back to the input layer, adjusting weights and biases along the way.
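The whole recursive process fits in a short function. This is a sketch, not a production implementation: it assumes sigmoid activations and a quadratic cost, with the network stored as lists of weight matrices and bias vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradient of the quadratic cost for one training example (x, y)."""
    # Forward pass: store every layer's activations.
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Backward pass: start from the output error, then move layer by layer.
    grads_W = [np.zeros_like(W) for W in weights]
    grads_b = [np.zeros_like(b) for b in biases]
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = np.outer(delta, activations[l])  # dC/dW scales with a_prev
        grads_b[l] = delta                            # dC/db is the error itself
        if l > 0:  # propagate the error to the previous layer
            a_prev = activations[l]
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
    return grads_W, grads_b

# Tiny demo network: 3 inputs -> 4 hidden -> 2 outputs, random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
grads_W, grads_b = backprop(np.array([0.1, 0.5, 0.9]),
                            np.array([1.0, 0.0]), weights, biases)
```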
Scaling Up to the Entire Training Dataset
Limitations of a Single Example
Adjusting the network based on a single training example can lead to overfitting.
The network might become biased toward that example, reducing its ability to generalize.
Incorporating All Training Examples
To prevent this, we consider how each training example wants to adjust the weights and biases.
We record these desired adjustments for all examples.
Averaging Adjustments
By averaging the desired adjustments from all training examples, we obtain a more general direction for adjusting weights and biases.
This averaged set of adjustments approximates the negative gradient of the cost function.
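The averaging step itself is a one-liner; the per-example gradient values below are made up for illustration:

```python
import numpy as np

# Hypothetical per-example gradients for two parameters, over four examples.
per_example_grads = np.array([
    [0.5, -0.2],
    [0.3,  0.1],
    [0.7, -0.4],
    [0.1,  0.3],
])

# Averaging over the training examples approximates the true gradient,
# whose negative is the direction in which to adjust the parameters.
avg_grad = per_example_grads.mean(axis=0)  # ~[0.4, -0.05]
```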
Computational Efficiency: Mini-Batches and Stochastic Gradient Descent
Challenges with Full Gradient Computation
Calculating the gradient using all training examples at each step is computationally intensive, especially with large datasets.
Mini-Batch Gradient Descent
To improve efficiency, the dataset is divided into smaller subsets called mini-batches (e.g., batches of 100 examples).
Gradient descent is performed on these mini-batches.
Benefits of Mini-Batches
Computational Speed-Up: Reduces the amount of computation required per iteration.
Approximation of True Gradient: Each mini-batch provides an estimate of the full gradient.
Improved Convergence: The randomness introduced can help the network avoid local minima.
Stochastic Gradient Descent (SGD)
When mini-batches consist of a single example, the method is known as stochastic gradient descent.
In practice, mini-batches of intermediate size are used to balance efficiency and gradient accuracy.
The analogy is that the network's trajectory resembles a "drunk man stumbling aimlessly down a hill" but moving quickly, as opposed to a "carefully calculating man" moving slowly.
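One epoch of mini-batch gradient descent can be sketched as a shuffle followed by fixed-size slices; the dataset here is random placeholder data:

```python
import numpy as np

def minibatches(X, batch_size, rng):
    """Yield shuffled mini-batches of the training data."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        yield X[indices[start:start + batch_size]]

# Hypothetical dataset: 1,000 examples with 10 features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))

batches = list(minibatches(X, 100, rng))
# Each of the 10 batches would receive one gradient-descent step, e.g.
# for batch in batches: take_descent_step(network, batch)
```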
Summarizing Backpropagation
Backpropagation calculates how each weight and bias should be adjusted based on a single training example.
It determines not just the direction (increase or decrease) but also the relative magnitude of each adjustment.
By aggregating adjustments from all training examples (or mini-batches), the network updates its parameters to minimize the cost function.
Repeatedly applying this process leads the network to converge toward a local minimum of the cost function, improving its performance on the training data.
Implementation Considerations
Aligning Code with Concepts
Every line of code in a backpropagation implementation corresponds to the intuitive steps we've discussed.
Understanding these concepts helps demystify the code and makes debugging and optimization more manageable.
Mathematical Underpinnings
While an intuitive understanding is valuable, delving into the mathematical details (calculus and linear algebra) provides a deeper comprehension.
The next steps involve studying the derivatives and partial derivatives that formalize the adjustments calculated during backpropagation.
Importance of Training Data
Data Requirements
Neural networks require large amounts of labeled data to learn effectively.
The MNIST dataset is a prime example, providing tens of thousands of labeled images (60,000 training examples) for digit recognition.
Challenges in Data Collection
In many real-world applications, collecting and labeling sufficient data is a significant challenge.
Strategies to address this include:
Data Augmentation: Generating additional training data by transforming existing data (e.g., rotating or scaling images).
Transfer Learning: Using a pre-trained network on a similar task and fine-tuning it for the specific application.
Unsupervised Learning: Leveraging unlabeled data to learn underlying structures.
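As a tiny illustration of data augmentation (the "image" here is random placeholder data; real augmentation pipelines use richer transforms such as rotation and scaling):

```python
import numpy as np

# Placeholder 28x28 grayscale image, standing in for an MNIST digit.
rng = np.random.default_rng(1)
image = rng.random((28, 28))

# Small pixel shifts create additional labeled examples from one original.
augmented = [image,
             np.roll(image, 2, axis=1),   # shift 2 pixels right
             np.roll(image, 2, axis=0)]   # shift 2 pixels down
```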