
What is Backpropagation?

  • Writer: ASHWIN CHAUHAN
  • Nov 16, 2024
  • 5 min read

Purpose of Backpropagation

  • Backpropagation is an algorithm designed to compute the negative gradient of the cost function efficiently for neural networks.

  • It helps determine how each weight and bias in the network should be adjusted to minimize the cost function.

  • Instead of thinking about the gradient as a direction in a high-dimensional space, we can interpret each component's magnitude as an indicator of the cost function's sensitivity to that specific weight or bias.

Interpreting the Gradient Components

  • Each component of the gradient vector corresponds to a weight or bias in the network.

  • A larger magnitude means that the cost function is more sensitive to changes in that parameter.

  • For example, if one weight has a gradient component of 3.2 and another has 0.1, the cost function is 32 times more sensitive to the first weight than the second.
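This sensitivity reading can be checked numerically with finite differences. A minimal sketch using a made-up linear cost whose gradient components match the 3.2 and 0.1 values above:

```python
# Hypothetical cost whose gradient is (3.2, 0.1) at every point.
def cost(w1, w2):
    return 3.2 * w1 + 0.1 * w2

eps = 1e-6
# Finite-difference estimate of each gradient component at (w1, w2) = (1, 1):
# nudge one parameter, see how much the cost moves.
dC_dw1 = (cost(1 + eps, 1) - cost(1, 1)) / eps
dC_dw2 = (cost(1, 1 + eps) - cost(1, 1)) / eps

# The cost is 3.2 / 0.1 = 32 times more sensitive to w1 than to w2.
```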


Effects of a Single Training Example

Analyzing One Training Example

  • Let's consider a single image of the digit '2' from the training data.

  • Suppose the network is not well-trained yet, so its output activations are random or incorrect.

  • Our goal is to adjust the weights and biases based on this example to improve the network's performance.

Desired Adjustments to Output Layer

  1. Increasing Correct Neuron Activation:

    • We want the activation of the neuron corresponding to the digit '2' to increase.

    • This means nudging its activation value upwards.

  2. Decreasing Incorrect Neuron Activations:

    • Simultaneously, we want to decrease the activations of all other output neurons (digits 0-9 excluding 2).

    • This means nudging their activation values downwards.

  3. Proportional Adjustments:

    • The size of each nudge should be proportional to how far the current activation is from the desired activation.

    • For example, if the neuron for '8' is already close to its target (which is zero), it requires a smaller adjustment than the neuron for '2'.
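These three adjustments can be collected into a single vector of "nudges". A minimal sketch with made-up output activations (the target is one-hot for the digit '2'):

```python
import numpy as np

# Hypothetical output activations for digits 0-9 from an untrained network.
output = np.array([0.10, 0.30, 0.20, 0.70, 0.10, 0.15, 0.40, 0.60, 0.05, 0.30])

# One-hot target: the neuron for '2' should read 1.0, all others 0.0.
target = np.zeros(10)
target[2] = 1.0

# Each nudge is proportional to the gap between target and current activation:
# positive for neuron '2' (push up), negative for the rest (push down).
# Neuron '8' (activation 0.05) is already near its target, so its nudge is small.
nudges = target - output
```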

Focusing on a Single Output Neuron

  • Let's zoom in on the neuron for the digit '2' and understand how we can increase its activation.


Three Ways to Increase Activation

  1. Increase the Bias:

    • The bias is a constant added to the weighted sum before the activation function is applied.

    • Increasing the bias can shift the neuron's activation upwards.

  2. Increase the Weights:

    • The neuron's activation is a weighted sum of the activations from the previous layer.

    • Increasing the weights strengthens the influence of the previous neurons on this neuron.

  3. Adjust Previous Layer Activations:

    • While we cannot directly change these activations (since they are outputs from the previous layer), understanding their influence is crucial.

    • Adjustments to the weights and biases in previous layers can indirectly affect these activations.
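For a sigmoid neuron, these three routes are simply the three partial derivatives of the activation a = σ(w·a_prev + b). A minimal sketch with made-up weights and activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# Hypothetical previous-layer activations, weights, and bias for one neuron.
a_prev = np.array([0.9, 0.1, 0.5])
w = np.array([0.2, -0.4, 0.3])
b = 0.1

z = w @ a_prev + b          # weighted sum plus bias
a = sigmoid(z)              # the neuron's activation

# The three routes, as partial derivatives of the activation:
da_db = sigmoid_prime(z)             # 1. nudge the bias
da_dw = a_prev * sigmoid_prime(z)    # 2. nudge the weights
da_da_prev = w * sigmoid_prime(z)    # 3. nudge the previous activations
```

Note that `da_dw` scales with `a_prev`: nudging a weight attached to a bright previous neuron moves the activation more.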


Influence of Weights and Previous Activations

  • Weights Connected to Active Neurons:

    • Weights connected to highly active neurons (brighter neurons) in the previous layer have a more significant impact.

    • Adjusting these weights leads to more substantial changes in the output neuron's activation.

  • Weights Connected to Less Active Neurons:

    • Weights connected to less active neurons (dimmer neurons) have a smaller effect.

    • Adjusting these weights contributes less to changing the output neuron's activation.
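In formula form, the cost gradient for a weight is the previous neuron's activation times the output neuron's error signal, so a brighter input neuron means a proportionally larger gradient. A tiny sketch with made-up numbers (`delta` here is a hypothetical error signal):

```python
# Hypothetical error signal (dC/dz) at the output neuron.
delta = 0.5

a_bright = 0.9   # a highly active previous-layer neuron
a_dim = 0.1      # a barely active previous-layer neuron

# dC/dw = a_prev * delta: the bright neuron's weight gets a 9x larger gradient,
# so gradient descent adjusts it far more aggressively.
grad_bright = a_bright * delta
grad_dim = a_dim * delta
```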


Gradient Descent Considerations

  • When performing gradient descent, we focus on adjustments that give us the most significant decrease in cost per unit of change.

  • This means prioritizing adjustments to weights and biases that have the highest impact.


Hebbian Theory Analogy

  • Hebbian Theory: A theory in neuroscience suggesting that "neurons that fire together wire together."

  • In our context, the largest weight increases occur on connections between neurons that are already active and neurons we want to become more active.

  • This strengthens the connections between neurons that are relevant to recognizing the digit '2'.


Propagating Adjustments Backwards


Adjustments to Previous Layers

  • The output neuron's desired changes influence the neurons in the previous layer.

  • Each neuron in the previous layer receives a combination of signals from all the output neurons it connects to.

Combining Desires of All Output Neurons

  • Each output neuron has its own "desires" for how the previous layer should adjust to minimize the cost.

  • These desires are combined by:

    • Adding together the adjustments, weighted by how much each output neuron's activation needs to change.

    • Considering the strength of the weights connecting the output neurons to the previous layer.
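This combination step is the core of the backward pass: each previous-layer neuron's error is the weighted sum of all the output errors it feeds into, scaled by that neuron's own sensitivity. A minimal sketch with random placeholder numbers, assuming sigmoid activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical weights from a 3-neuron hidden layer to a 10-neuron output layer.
W = rng.normal(size=(10, 3))
z_prev = rng.normal(size=3)      # pre-activations of the hidden layer
delta_out = rng.normal(size=10)  # error signals at the output layer

# Each hidden neuron sums the "desires" of every output neuron it connects to,
# weighted by the connecting weights, then scales by its own sigmoid slope.
s = sigmoid(z_prev)
delta_prev = (W.T @ delta_out) * s * (1 - s)
```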


Recursive Application

  • This process is repeated for each layer, moving backward through the network.

  • At each layer, we calculate:

    • The desired adjustments to the activations (though we can't change activations directly).

    • The adjustments to the weights and biases that will achieve these desired changes.

Backpropagation Mechanism

  • The "backward" aspect of backpropagation comes from this recursive process.

  • Errors are propagated from the output layer back to the input layer, adjusting weights and biases along the way.
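Putting the pieces together, one full backward pass over a toy 4-3-2 sigmoid network might look like the sketch below (random parameters and a squared-error cost; an illustration, not the author's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# A tiny 4-3-2 network with random weights and zero biases.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = rng.normal(size=4)
y = np.array([1.0, 0.0])  # one-hot target

# Forward pass, keeping pre-activations for the backward pass.
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass for the squared-error cost C = 0.5 * ||a2 - y||^2.
delta2 = (a2 - y) * a2 * (1 - a2)          # error at the output layer
grad_W2 = np.outer(delta2, a1)             # dC/dW2: output error x hidden activations
grad_b2 = delta2

delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # error propagated one layer back
grad_W1 = np.outer(delta1, x)              # dC/dW1: hidden error x input
grad_b1 = delta1
```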


Scaling Up to the Entire Training Dataset

Limitations of a Single Example

  • Adjusting the network based on a single training example can lead to overfitting.

  • The network might become biased toward that example, reducing its ability to generalize.


Incorporating All Training Examples

  • To prevent this, we consider how each training example wants to adjust the weights and biases.

  • We record these desired adjustments for all examples.


Averaging Adjustments

  • By averaging the desired adjustments from all training examples, we obtain a more general direction for adjusting weights and biases.

  • This averaged set of adjustments approximates the negative gradient of the cost function.
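Numerically, this averaging step is just a mean over per-example gradient vectors. A minimal sketch with made-up gradients for a three-parameter model:

```python
import numpy as np

# Hypothetical per-example gradients for a 3-parameter model, 5 examples.
per_example_grads = np.array([
    [0.2, -0.1, 0.4],
    [0.1, -0.3, 0.5],
    [0.3,  0.0, 0.2],
    [0.0, -0.2, 0.6],
    [0.4, -0.4, 0.3],
])

# Averaging the desired adjustments across examples gives the overall
# direction in which to nudge each parameter.
avg_grad = per_example_grads.mean(axis=0)
```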


Computational Efficiency: Mini-Batches and Stochastic Gradient Descent

Challenges with Full Gradient Computation

  • Calculating the gradient using all training examples at each step is computationally intensive, especially with large datasets.

Mini-Batch Gradient Descent

  • To improve efficiency, the dataset is divided into smaller subsets called mini-batches (e.g., batches of 100 examples).

  • Gradient descent is performed on these mini-batches.

Benefits of Mini-Batches

  • Computational Speed-Up: Reduces the amount of computation required per iteration.

  • Approximation of True Gradient: Each mini-batch provides an estimate of the full gradient.

  • Improved Convergence: The randomness introduced can help the network avoid local minima.

Stochastic Gradient Descent (SGD)

  • Strictly speaking, when each mini-batch contains just a single example, the method is classical stochastic gradient descent (SGD), though the term is often used for mini-batch descent as well.

  • In practice, mini-batches of intermediate size are used to balance computational efficiency against the accuracy of the gradient estimate.

  • The analogy is that the network's trajectory resembles a "drunk man stumbling aimlessly down a hill" but moving quickly, as opposed to a "carefully calculating man" moving slowly.
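The whole mini-batch loop can be sketched on a toy one-parameter problem (fitting y = 3x by least squares; all the numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: learn w in y = w*x with a squared-error cost.
x = rng.normal(size=1000)
y = 3.0 * x  # true weight is 3.0

w, lr, batch_size = 0.0, 0.1, 100
for epoch in range(20):
    order = rng.permutation(len(x))              # shuffle, then slice into batches
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)   # mini-batch estimate of dC/dw
        w -= lr * grad                           # stumble downhill
```

Each step uses a noisy estimate of the true gradient, yet the parameter still converges to the minimum of the cost.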


Summarizing Backpropagation

  • Backpropagation calculates how each weight and bias should be adjusted based on a single training example.

  • It determines not just the direction (increase or decrease) but also the relative magnitude of each adjustment.

  • By aggregating adjustments from all training examples (or mini-batches), the network updates its parameters to minimize the cost function.

  • Repeatedly applying this process leads the network to converge toward a local minimum of the cost function, improving its performance on the training data.


Implementation Considerations

Aligning Code with Concepts

  • Every line of code in a backpropagation implementation corresponds to the intuitive steps we've discussed.

  • Understanding these concepts helps demystify the code and makes debugging and optimization more manageable.

Mathematical Underpinnings

  • While an intuitive understanding is valuable, delving into the mathematical details (calculus and linear algebra) provides a deeper comprehension.

  • The next steps involve studying the derivatives and partial derivatives that formalize the adjustments calculated during backpropagation.


Importance of Training Data

Data Requirements

  • Neural networks require large amounts of labeled data to learn effectively.

  • The MNIST dataset is a prime example, providing thousands of labeled images for digit recognition.

Challenges in Data Collection

  • In many real-world applications, collecting and labeling sufficient data is a significant challenge.

  • Strategies to address this include:

    • Data Augmentation: Generating additional training data by transforming existing data (e.g., rotating or scaling images).

    • Transfer Learning: Using a pre-trained network on a similar task and fine-tuning it for the specific application.

    • Unsupervised Learning: Leveraging unlabeled data to learn underlying structures.
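As an illustration of the first strategy, here is a minimal augmentation sketch using small pixel shifts (a stand-in array plays the role of a digit image; real pipelines add rotations, scalings, and more):

```python
import numpy as np

# A 4x4 stand-in for a grayscale digit image.
image = np.arange(16, dtype=float).reshape(4, 4)

# Two augmented copies: shift one pixel right, and one pixel down.
# (np.roll wraps pixels around the edge; a real pipeline would pad instead.)
shift_right = np.roll(image, shift=1, axis=1)
shift_down = np.roll(image, shift=1, axis=0)

# One labeled example becomes three; all keep the same label.
augmented = [image, shift_right, shift_down]
```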
