Gradient Descent - Lesson 2
- ASHWIN CHAUHAN
- Nov 16, 2024
- 5 min read

Neural Network Structure Recap
A neural network is a computational model inspired by the human brain's neural architecture. It consists of layers of interconnected nodes, or "neurons," that process input data and produce output predictions.
Input Layer: This layer receives the raw data. In the case of handwritten digit recognition, the data comes from images of digits rendered on a 28x28 pixel grid. Each pixel is represented by a grayscale value between 0 and 1, resulting in 784 input neurons (since 28 x 28 = 784).
Hidden Layers: These are intermediate layers between the input and output layers. In the example, there are two hidden layers with 16 neurons each. These layers perform computations on the input data to detect patterns and features.
Output Layer: This layer provides the final prediction. For digit recognition, there are 10 neurons in the output layer, each corresponding to one of the digits from 0 to 9. The neuron with the highest activation indicates the network's prediction.
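To make these sizes concrete, here is a minimal NumPy sketch (my own illustration, not code from the lesson) that builds randomly initialized weight matrices and bias vectors for this 784-16-16-10 layout and counts the parameters:

import numpy as np

layer_sizes = [784, 16, 16, 10]  # input layer, two hidden layers, output layer

# One weight matrix and one bias vector per pair of adjacent layers
weights = [np.random.randn(n_out, n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.random.randn(n_out) for n_out in layer_sizes[1:]]

total_params = sum(w.size for w in weights) + sum(b.size for b in biases)
print(total_params)  # 13002 weights and biases

This total of roughly 13,000 weights and biases is the figure referred to later in the lesson.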
Activation Functions
Each neuron computes a weighted sum of its inputs and applies an activation function to determine its output. Common activation functions include:
Sigmoid Function: Squashes input values to a range between 0 and 1.
ReLU (Rectified Linear Unit): Outputs zero if the input is negative and outputs the input itself if positive.
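Both activations are one-liners; a quick NumPy sketch (illustrative, not tied to any particular library's implementation):

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # zero for negative inputs, the input itself for positive inputs
    return np.maximum(0.0, z)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # roughly [0.12, 0.5, 0.88]
print(relu(np.array([-2.0, 0.0, 2.0])))     # [0.0, 0.0, 2.0]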
Training the Neural Network
Objective
The goal is to adjust the network's weights and biases so that it can accurately classify new, unseen images of handwritten digits.
Process Overview
Initialization: Start with random weights and biases. This means the network initially makes random predictions.
Forward Propagation: Input data is fed through the network to generate an output. For example, an image of the digit '3' is processed, and the network produces an output that may not correctly identify it as '3' due to the random initialization.
Cost Function Calculation: A cost (or loss) function quantifies how incorrect the network's predictions are. A common choice is the Mean Squared Error (MSE), calculated by averaging the squares of the differences between the predicted outputs and the actual labels:

C = (1/n) Σᵢ (ŷᵢ − yᵢ)²

where n is the number of output neurons, ŷᵢ is the predicted activation of output neuron i, and yᵢ is the corresponding target value. (A short code sketch of the full loop appears after this overview.)
Gradient Descent: This is an optimization algorithm used to minimize the cost function by adjusting the weights and biases in the direction that most reduces the cost.
Gradient: The gradient of the cost function with respect to the weights and biases indicates how to change them to reduce the cost.
Learning Rate: A parameter that determines the size of the steps taken during optimization.
Backpropagation: An algorithm to efficiently compute the gradients of the cost function with respect to each weight and bias by propagating the error backward through the network.
Update Weights and Biases: Adjust the weights and biases in the opposite direction of the gradient (hence "gradient descent").
Repeat: The process is repeated for many iterations (epochs) until the network's performance is satisfactory.
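Putting these steps together, here is a minimal sketch of the loop in Python. To stay short it trains a toy model with a single weight and bias and estimates the gradient with finite differences rather than backpropagation; all names are illustrative:

import numpy as np

# Toy data: learn y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

def cost(params):
    w, b = params
    predictions = w * X + b                  # forward propagation
    return np.mean((predictions - Y) ** 2)   # mean squared error

params = np.random.randn(2)                  # random initialization
learning_rate = 0.05
eps = 1e-6

for epoch in range(2000):                    # repeat for many epochs
    grad = np.zeros_like(params)
    for i in range(len(params)):             # estimate the gradient numerically
        step = np.zeros_like(params)         # (backpropagation computes this efficiently)
        step[i] = eps
        grad[i] = (cost(params + step) - cost(params - step)) / (2 * eps)
    params -= learning_rate * grad           # step against the gradient

print(params)  # approaches [2.0, 1.0]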
Gradient Descent Explained
Single Variable Function
Imagine a simple function f(x) with one input x and one output.
The goal is to find the value of x that minimizes f(x).
Compute the derivative f′(x), which gives the slope of the function at x.
If f′(x) > 0, decrease x to move toward the minimum.
If f′(x) < 0, increase x.
This process is akin to a ball rolling downhill toward the lowest point.
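A few lines of Python make this concrete, using f(x) = (x − 3)² as a stand-in function (any smooth single-variable function would do):

def f_prime(x):
    # derivative of f(x) = (x - 3)**2
    return 2 * (x - 3)

x = 10.0             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * f_prime(x)  # step against the slope

print(x)  # very close to 3, the minimum of f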
Multivariable Function
For a function f(x) with several inputs, where x is a vector, we use the gradient ∇f(x).
The gradient points in the direction of the steepest ascent.
Moving in the opposite direction of the gradient leads us toward the minimum.
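The same loop, sketched for the two-variable function f(x, y) = x² + 3y² (chosen purely for illustration); the gradient is just the vector of partial derivatives:

import numpy as np

def grad_f(v):
    x, y = v
    # gradient of f(x, y) = x**2 + 3*y**2
    return np.array([2 * x, 6 * y])

v = np.array([4.0, -2.0])
learning_rate = 0.1
for _ in range(200):
    v -= learning_rate * grad_f(v)  # step opposite the gradient

print(v)  # approaches [0, 0], the minimum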
Application to Neural Networks
The cost function C depends on all the weights and biases w and b in the network.
The gradient ∇C is computed with respect to all these parameters.
Adjusting the weights and biases by moving in the opposite direction of ∇C reduces the cost.
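In code, this update is applied to every weight matrix and bias vector at once. A sketch of a single step, with zero-filled placeholder arrays standing in for the real components of ∇C:

import numpy as np

layer_sizes = [784, 16, 16, 10]
weights = [np.random.randn(n_out, n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.random.randn(n_out) for n_out in layer_sizes[1:]]

# Placeholders: in practice these come from backpropagation (next section)
grad_w = [np.zeros_like(w) for w in weights]
grad_b = [np.zeros_like(b) for b in biases]

learning_rate = 0.1
for layer in range(len(weights)):
    weights[layer] -= learning_rate * grad_w[layer]  # move against the gradient
    biases[layer] -= learning_rate * grad_b[layer]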
Backpropagation
Purpose
Efficiently compute the gradient ∇C for all weights and biases in the network.
Essential for training large networks where computing gradients directly would be computationally infeasible.
How It Works
Forward Pass: Compute the activations of each neuron from the input layer to the output layer.
Compute Output Error: Calculate the difference between the predicted output and the actual output at the output layer.
Backward Pass: Propagate the error backward through the network:
Chain Rule: Use calculus's chain rule to compute how changes in the weights and biases affect the cost function.
Layer by Layer: Starting from the output layer, compute the error at each neuron and how much each weight and bias contributed to it.
Gradient Calculation: Obtain the gradients of the cost function with respect to each weight and bias.
Update Parameters: Adjust the weights and biases using these gradients.
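Here is a compact sketch of all five steps for a network with one hidden layer, sigmoid activations, and the cost 0.5 * sum((output − target)²); it is my own minimal illustration, fit to a single training example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny network: 4 inputs -> 5 hidden -> 3 outputs, one training example
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
x, y = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
learning_rate = 0.5

for _ in range(500):
    # Forward pass: activations layer by layer
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Output error for the cost 0.5 * sum((a2 - y)**2)
    delta2 = (a2 - y) * sigmoid_prime(z2)

    # Backward pass: the chain rule carries the error to the hidden layer
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

    # Gradients of the cost with respect to each weight and bias
    grad_W2, grad_b2 = np.outer(delta2, a1), delta2
    grad_W1, grad_b1 = np.outer(delta1, x), delta1

    # Update parameters by stepping against the gradient
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print(np.round(a2, 2))  # output moves toward the target [0, 1, 0]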
Understanding the Network's Learning
Cost Function Minimization
The network learns by finding the set of weights and biases that minimize the cost function.
This process is susceptible to finding local minima, which may not be the best possible solution.
Visualization Challenges
Visualizing a function with 13,000 inputs (weights and biases) is impossible in a physical sense.
However, understanding that each parameter adjustment aims to reduce the overall cost is key.
Gradient Interpretation
The gradient's components indicate the relative importance of each weight and bias.
Larger gradient components mean that adjusting those weights or biases will have a more significant impact on reducing the cost.
Performance and Limitations
Initial Results
After training, the network achieves around 96% accuracy on unseen test data.
Misclassifications often occur on ambiguous or poorly written digits.
Analyzing Hidden Layers
The hope is that hidden layers learn to detect edges, lines, and other patterns.
Visualizing the weights associated with neurons in the hidden layers shows that they do not always learn such interpretable features.
The network might find a local minimum that allows it to classify correctly without learning human-interpretable patterns.
Overconfidence and Generalization
When presented with random noise, the network confidently predicts a digit instead of expressing uncertainty.
This behavior indicates that the network has not truly understood the concept of digits but has memorized patterns from the training data.
Advanced Considerations
Overfitting
The network performs well on training data but may not generalize effectively to new, diverse data.
Overfitting occurs when the network learns the training data too well, including its noise and outliers.
Recent Research
Studies have shown that neural networks can memorize random labels given enough capacity (weights and biases).
This raises concerns about whether the network is learning meaningful patterns or merely memorizing data.
Optimization Landscape
The cost function's optimization landscape is complex, with many local minima.
Research suggests that, for structured data, the local minima found are of similar quality, allowing for effective training despite the complex landscape.
Improving Neural Networks
Regularization Techniques
Methods like dropout, L1/L2 regularization, and data augmentation help prevent overfitting.
They encourage the network to learn more general features rather than memorizing specific instances.
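For example, L2 regularization adds a penalty proportional to the sum of squared weights to the cost, nudging every weight toward zero unless the data justifies keeping it large. A small sketch of the idea (lam, the regularization strength, is a name chosen here for illustration):

import numpy as np

def l2_penalty(weights, lam=0.01):
    # Extra term added to the cost: lam * (sum of all squared weights)
    return lam * sum(np.sum(w ** 2) for w in weights)

def l2_gradient(w, lam=0.01):
    # Corresponding extra term in each weight's gradient
    return 2 * lam * w

weights = [np.random.randn(16, 784), np.random.randn(10, 16)]
print(l2_penalty(weights))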
Convolutional Neural Networks (CNNs)
CNNs are designed to better capture spatial hierarchies in data.
They use convolutional layers to detect edges, patterns, and textures, leading to improved performance on image data.
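To see what a single convolutional filter does, here is a NumPy sketch that slides a hand-coded 3x3 vertical-edge kernel over a tiny image; a CNN learns many such kernels from data instead of hand-coding them:

import numpy as np

def convolve2d(image, kernel):
    # Valid cross-correlation: apply the kernel to every 3x3 patch
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1.0, 0.0, -1.0],    # responds to vertical edges
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # right half bright, left half dark
print(convolve2d(image, kernel))         # nonzero only near the vertical edge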
Hyperparameter Tuning
Adjusting the number of layers, neurons, activation functions, and learning rate can impact performance.
Finding the optimal configuration requires experimentation and validation.
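Even the learning rate alone changes behavior dramatically. A tiny illustrative comparison on the one-dimensional function used earlier:

def minimize(learning_rate, steps=50):
    # Gradient descent on f(x) = (x - 3)**2, starting from x = 10
    x = 10.0
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)
    return x

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(lr, minimize(lr))
# 0.01 converges slowly, 0.1 and 0.5 reach the minimum, 1.1 overshoots and diverges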
Recommended Resources
Michael Nielsen's Book: An excellent resource for learning about neural networks and deep learning, complete with code examples.
Chris Olah's Blog: Provides intuitive explanations and visualizations of neural network concepts.
Distill: A platform featuring interactive articles on machine learning research.


