Gradient Descent - Lesson 2
- ASHWIN CHAUHAN
- Nov 16, 2024
- 5 min read

Neural Network Structure Recap
A neural network is a computational model inspired by the human brain's neural architecture. It consists of layers of interconnected nodes, or "neurons," that process input data and produce output predictions.
Input Layer: This layer receives the raw data. In the case of handwritten digit recognition, the data comes from images of digits rendered on a 28x28 pixel grid. Each pixel is represented by a grayscale value between 0 and 1, resulting in 784 input neurons (since 28 x 28 = 784).
Hidden Layers: These are intermediate layers between the input and output layers. In the example, there are two hidden layers with 16 neurons each. These layers perform computations on the input data to detect patterns and features.
Output Layer: This layer provides the final prediction. For digit recognition, there are 10 neurons in the output layer, each corresponding to one of the digits from 0 to 9. The neuron with the highest activation indicates the network's prediction.
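To make these sizes concrete, here is a minimal NumPy sketch (my own illustration, not code from the lesson) that builds randomly initialized weight matrices and bias vectors for this 784-16-16-10 layout and counts the parameters:

import numpy as np

layer_sizes = [784, 16, 16, 10]  # input layer, two hidden layers, output layer

# One weight matrix and one bias vector per pair of adjacent layers
weights = [np.random.randn(n_out, n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.random.randn(n_out) for n_out in layer_sizes[1:]]

total_params = sum(w.size for w in weights) + sum(b.size for b in biases)
print(total_params)  # 13002 weights and biases

This total of roughly 13,000 weights and biases is the figure referred to later in the lesson.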
Activation Functions
Each neuron computes a weighted sum of its inputs and applies an activation function to determine its output. Common activation functions include:
Sigmoid Function: Squashes input values to a range between 0 and 1.
ReLU (Rectified Linear Unit): Outputs zero if the input is negative and outputs the input itself if positive.
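Both activations are one-liners; a quick NumPy sketch (illustrative, not tied to any particular library's implementation):

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # zero for negative inputs, the input itself for positive inputs
    return np.maximum(0.0, z)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # roughly [0.12, 0.5, 0.88]
print(relu(np.array([-2.0, 0.0, 2.0])))     # [0.0, 0.0, 2.0]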
Training the Neural Network
Objective
The goal is to adjust the network's weights and biases so that it can accurately classify new, unseen images of handwritten digits.
Process Overview
Initialization: Start with random weights and biases. This means the network initially makes random predictions.
Forward Propagation: Input data is fed through the network to generate an output. For example, an image of the digit '3' is processed, and the network produces an output that may not correctly identify it as '3' due to the random initialization.
Cost Function Calculation: A cost (or loss) function quantifies how incorrect the network's predictions are. A common choice is the Mean Squared Error (MSE), calculated by averaging the squares of the differences between the predicted outputs and the actual labels:

C = (1/n) Σᵢ (ŷᵢ − yᵢ)²

where n is the number of output neurons, ŷᵢ is the predicted activation of output neuron i, and yᵢ is the corresponding target value. (A short code sketch of the full loop appears after this overview.)
Gradient Descent: This is an optimization algorithm used to minimize the cost function by adjusting the weights and biases in the direction that most reduces the cost.
Gradient: The gradient of the cost function with respect to the weights and biases indicates how to change them to reduce the cost.
Learning Rate: A parameter that determines the size of the steps taken during optimization.
Backpropagation: An algorithm to efficiently compute the gradients of the cost function with respect to each weight and bias by propagating the error backward through the network.
Update Weights and Biases: Adjust the weights and biases in the opposite direction of the gradient (hence "gradient descent").
Repeat: The process is repeated for many iterations (epochs) until the network's performance is satisfactory.
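Putting these steps together, here is a minimal sketch of the loop in Python. To stay short it trains a toy model with a single weight and bias and estimates the gradient with finite differences rather than backpropagation; all names are illustrative:

import numpy as np

# Toy data: learn y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

def cost(params):
    w, b = params
    predictions = w * X + b                  # forward propagation
    return np.mean((predictions - Y) ** 2)   # mean squared error

params = np.random.randn(2)                  # random initialization
learning_rate = 0.05
eps = 1e-6

for epoch in range(2000):                    # repeat for many epochs
    grad = np.zeros_like(params)
    for i in range(len(params)):             # estimate the gradient numerically
        step = np.zeros_like(params)         # (backpropagation computes this efficiently)
        step[i] = eps
        grad[i] = (cost(params + step) - cost(params - step)) / (2 * eps)
    params -= learning_rate * grad           # step against the gradient

print(params)  # approaches [2.0, 1.0]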
Gradient Descent Explained
Single Variable Function
Imagine a simple function f(x) with one input x and one output.
The goal is to find the value of x that minimizes f(x).
Compute the derivative f′(x), which gives the slope of the function at x.
If f′(x) > 0, decrease x to move toward the minimum.
If f′(x) < 0, increase x.
This process is akin to a ball rolling downhill toward the lowest point.
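A few lines of Python make this concrete, using f(x) = (x − 3)² as a stand-in function (any smooth single-variable function would do):

def f_prime(x):
    # derivative of f(x) = (x - 3)**2
    return 2 * (x - 3)

x = 10.0             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * f_prime(x)  # step against the slope

print(x)  # very close to 3, the minimum of f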
Multivariable Function
For a function f(x) with several inputs, where x is a vector, we use the gradient ∇f(x).
The gradient points in the direction of the steepest ascent.
Moving in the opposite direction of the gradient leads us toward the minimum.
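The same loop, sketched for the two-variable function f(x, y) = x² + 3y² (chosen purely for illustration); the gradient is just the vector of partial derivatives:

import numpy as np

def grad_f(v):
    x, y = v
    # gradient of f(x, y) = x**2 + 3*y**2
    return np.array([2 * x, 6 * y])

v = np.array([4.0, -2.0])
learning_rate = 0.1
for _ in range(200):
    v -= learning_rate * grad_f(v)  # step opposite the gradient

print(v)  # approaches [0, 0], the minimum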
Application to Neural Networks
The cost function C depends on all the weights and biases w and b in the network.
The gradient ∇C is computed with respect to all these parameters.
Adjusting the weights and biases by moving in the opposite direction of ∇C reduces the cost.
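In code, this update is applied to every weight matrix and bias vector at once. A sketch of a single step, with zero-filled placeholder arrays standing in for the real components of ∇C:

import numpy as np

layer_sizes = [784, 16, 16, 10]
weights = [np.random.randn(n_out, n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.random.randn(n_out) for n_out in layer_sizes[1:]]

# Placeholders: in practice these come from backpropagation (next section)
grad_w = [np.zeros_like(w) for w in weights]
grad_b = [np.zeros_like(b) for b in biases]

learning_rate = 0.1
for layer in range(len(weights)):
    weights[layer] -= learning_rate * grad_w[layer]  # move against the gradient
    biases[layer] -= learning_rate * grad_b[layer]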
Backpropagation
Purpose
Efficiently compute the gradient ∇C for all weights and biases in the network.
Essential for training large networks where computing gradients directly would be computationally infeasible.
How It Works
Forward Pass: Compute the activations of each neuron from the input layer to the output layer.
Compute Output Error: Calculate the difference between the predicted output and the actual output at the output layer.
Backward Pass: Propagate the error backward through the network:
Chain Rule: Use calculus's chain rule to compute how changes in the weights and biases affect the cost function.
Layer by Layer: Starting from the output layer, compute the error at each neuron and how much each weight and bias contributed to it.
Gradient Calculation: Obtain the gradients of the cost function with respect to each weight and bias.
Update Parameters: Adjust the weights and biases using these gradients.
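Here is a compact sketch of all five steps for a network with one hidden layer, sigmoid activations, and the cost 0.5 * sum((output − target)²); it is my own minimal illustration, fit to a single training example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny network: 4 inputs -> 5 hidden -> 3 outputs, one training example
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
x, y = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
learning_rate = 0.5

for _ in range(500):
    # Forward pass: activations layer by layer
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Output error for the cost 0.5 * sum((a2 - y)**2)
    delta2 = (a2 - y) * sigmoid_prime(z2)

    # Backward pass: the chain rule carries the error to the hidden layer
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

    # Gradients of the cost with respect to each weight and bias
    grad_W2, grad_b2 = np.outer(delta2, a1), delta2
    grad_W1, grad_b1 = np.outer(delta1, x), delta1

    # Update parameters by stepping against the gradient
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print(np.round(a2, 2))  # output moves toward the target [0, 1, 0]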
Understanding the Network's Learning
Cost Function Minimization
The network learns by finding the set of weights and biases that minimize the cost function.
This process is susceptible to finding local minima, which may not be the best possible solution.
Visualization Challenges
Visualizing a function with 13,000 inputs (weights and biases) is impossible in a physical sense.
However, understanding that each parameter adjustment aims to reduce the overall cost is key.
Gradient Interpretation
The gradient's components indicate the relative importance of each weight and bias.
Larger gradient components mean that adjusting those weights or biases will have a more significant impact on reducing the cost.
Performance and Limitations
Initial Results
After training, the network achieves around 96% accuracy on unseen test data.
Misclassifications often occur on ambiguous or poorly written digits.
Analyzing Hidden Layers
The hope is that hidden layers learn to detect edges, lines, and other patterns.
Visualizing the weights associated with neurons in the hidden layers shows that they do not always learn such interpretable features.
The network might find a local minimum that allows it to classify correctly without learning human-interpretable patterns.
Overconfidence and Generalization
When presented with random noise, the network confidently predicts a digit instead of expressing uncertainty.
This behavior indicates that the network has not truly understood the concept of digits but has memorized patterns from the training data.
Advanced Considerations
Overfitting
The network performs well on training data but may not generalize effectively to new, diverse data.
Overfitting occurs when the network learns the training data too well, including its noise and outliers.
Recent Research
Studies have shown that neural networks can memorize random labels given enough capacity (weights and biases).
This raises concerns about whether the network is learning meaningful patterns or merely memorizing data.
Optimization Landscape
The cost function's optimization landscape is complex, with many local minima.
Research suggests that, for structured data, the local minima found are of similar quality, allowing for effective training despite the complex landscape.
Improving Neural Networks
Regularization Techniques
Methods like dropout, L1/L2 regularization, and data augmentation help prevent overfitting.
They encourage the network to learn more general features rather than memorizing specific instances.
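For example, L2 regularization adds a penalty proportional to the sum of squared weights to the cost, nudging every weight toward zero unless the data justifies keeping it large. A small sketch of the idea (lam, the regularization strength, is a name chosen here for illustration):

import numpy as np

def l2_penalty(weights, lam=0.01):
    # Extra term added to the cost: lam * (sum of all squared weights)
    return lam * sum(np.sum(w ** 2) for w in weights)

def l2_gradient(w, lam=0.01):
    # Corresponding extra term in each weight's gradient
    return 2 * lam * w

weights = [np.random.randn(16, 784), np.random.randn(10, 16)]
print(l2_penalty(weights))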
Convolutional Neural Networks (CNNs)
CNNs are designed to better capture spatial hierarchies in data.
They use convolutional layers to detect edges, patterns, and textures, leading to improved performance on image data.
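To see what a single convolutional filter does, here is a NumPy sketch that slides a hand-coded 3x3 vertical-edge kernel over a tiny image; a CNN learns many such kernels from data instead of hand-coding them:

import numpy as np

def convolve2d(image, kernel):
    # Valid cross-correlation: apply the kernel to every 3x3 patch
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1.0, 0.0, -1.0],    # responds to vertical edges
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # right half bright, left half dark
print(convolve2d(image, kernel))         # nonzero only near the vertical edge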
Hyperparameter Tuning
Adjusting the number of layers, neurons, activation functions, and learning rate can impact performance.
Finding the optimal configuration requires experimentation and validation.
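Even the learning rate alone changes behavior dramatically. A tiny illustrative comparison on the one-dimensional function used earlier:

def minimize(learning_rate, steps=50):
    # Gradient descent on f(x) = (x - 3)**2, starting from x = 10
    x = 10.0
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)
    return x

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(lr, minimize(lr))
# 0.01 converges slowly, 0.1 and 0.5 reach the minimum, 1.1 overshoots and diverges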
Recommended Resources
Michael Nielsen's Book: An excellent resource for learning about neural networks and deep learning, complete with code examples.
Chris Olah's Blog: Provides intuitive explanations and visualizations of neural network concepts.
Distill: A platform featuring interactive articles on machine learning research.


