Introduction to Deep Learning
After the "AI winter," neural networks returned in the late 2000s. Geoffrey Hinton, Yann LeCun, and Yoshua Bengio revisited earlier concepts using more computational power and large datasets. We cover the principles of deep learning and the concepts that explain its behavior.
Core Components
Deep learning consists of core components and techniques for building and training deep neural networks.
- Large Datasets: Deep learning models have millions of parameters and require correspondingly large amounts of data. A supervised deep learning model typically needs thousands of examples per class to perform adequately, and millions to reach or surpass human-level performance. Large datasets are needed to pin down the complex decision boundaries these models are capable of learning.
- Deep Architectures: The term "deep" refers to multiple layers of neurons (hidden layers). Architectures can have dozens or hundreds of layers (e.g., ResNet with 152 layers), with each layer containing thousands of neurons. This depth lets the network learn a hierarchical representation of data, where earlier layers learn simple features (edges in an image) and later layers compose these into abstract concepts (objects or faces).
- Hardware Acceleration (GPUs): Training neural networks involves many matrix and vector operations. Graphics Processing Units (GPUs), originally designed for rendering graphics, handle this computation through parallel architecture. GPUs for general-purpose computing (GPGPU) have accelerated training times, making it possible to train large models.
- Optimization and Regularization: a handful of techniques make training deep networks tractable and keep them from overfitting.
  * Stochastic Gradient Descent (SGD): Instead of calculating the gradient over the entire dataset, SGD and its variants (e.g., minibatch SGD) estimate the gradient using a small, random subset of the data. This makes each update far cheaper, and the gradient noise helps the model escape poor local optima.
  * Rectified Linear Units (ReLU): The activation function \(\max(0, h)\) has become the standard for most hidden layers. ReLUs mitigate the "vanishing gradient" problem that plagues sigmoidal functions, allowing faster and more stable training, and they promote sparsity in the activations.
  * Dropout: A regularization technique in which randomly chosen neurons are temporarily dropped (set to zero) during each training step. This prevents neurons from co-adapting and forces the network to learn redundant features. At test time all neurons are used, but their outputs are scaled down to account for the dropout applied during training. The procedure effectively trains and averages an ensemble of many thinned sub-networks.
- Local Connectivity and Weight Sharing: Instead of connecting every neuron in one layer to every neuron in the next, a neuron in a convolutional layer connects only to a small, localized region of the input (its receptive field). The set of weights defining this connection (a filter, or kernel) is shared across the entire input, so the same filter detects a feature (e.g., a vertical edge) regardless of where it appears in the image. This drastically reduces the parameter count and builds in translation invariance.
- Pooling: After a convolutional layer extracts features, a pooling layer down-samples the representation; max-pooling, for example, keeps only the maximum value from each small region of the feature map. This reduces the spatial dimensions, lowering the computational cost of subsequent layers, and provides a degree of local translation invariance.
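The optimization and regularization ideas above can be combined in one small, self-contained sketch. Everything here is invented for illustration (a two-layer NumPy MLP on a toy regression task); it uses minibatch SGD, ReLU activations, and "inverted" dropout, a common variant that scales activations during training so that no rescaling is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (invented for illustration): predict the sum of
# absolute values of 4 inputs, a function a small ReLU network can fit.
X = rng.normal(size=(512, 4))
y = np.abs(X).sum(axis=1, keepdims=True) + 0.1 * rng.normal(size=(512, 1))

# Two-layer MLP: 4 -> 32 -> 1.
W1 = rng.normal(scale=0.5, size=(4, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)

lr, p_drop, batch = 0.01, 0.2, 32

for step in range(3000):
    idx = rng.choice(len(X), size=batch, replace=False)   # random minibatch
    xb, yb = X[idx], y[idx]

    # Forward pass: ReLU hidden layer, then inverted dropout.
    h_pre = xb @ W1 + b1
    h = np.maximum(0.0, h_pre)                            # ReLU: max(0, h)
    mask = (rng.random(h.shape) > p_drop) / (1.0 - p_drop)
    h_drop = h * mask                                     # training only

    pred = h_drop @ W2 + b2

    # Backward pass for mean squared error on the minibatch.
    g_pred = 2.0 * (pred - yb) / batch
    gW2 = h_drop.T @ g_pred
    gb2 = g_pred.sum(axis=0)
    g_h = (g_pred @ W2.T) * mask * (h_pre > 0)
    gW1 = xb.T @ g_h
    gb1 = g_h.sum(axis=0)

    # SGD step: the minibatch gradient is a noisy estimate of the full one.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# At test time all neurons are active; inverted dropout needs no rescaling.
test_pred = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
mse = float(((test_pred - y) ** 2).mean())
```

After training, `mse` should fall below the variance of the targets (i.e., the model beats a predict-the-mean baseline), though the exact value depends on the random seed and hyperparameters chosen here.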
Convolutional Neural Networks (CNNs)
For spatial data tasks such as image recognition, Convolutional Neural Networks (CNNs) are the standard choice. CNNs combine two of the principles introduced above, local connectivity with weight sharing and pooling, to process spatial hierarchies. A CNN architecture alternates convolutional layers with pooling layers, building up a hierarchy of increasingly complex and abstract features before passing them to fully connected layers for classification.
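Weight sharing and pooling can be illustrated with a from-scratch NumPy sketch (the images and filter weights below are invented; this is not a production implementation). The same 3x3 vertical-edge filter produces an identical peak response no matter which column the edge sits in, and max-pooling then shrinks the feature map:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: one shared kernel slides over
    every spatial position of the input (local connectivity)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling: keep the largest activation in
    each size x size window, shrinking the feature map."""
    h, w = fmap.shape
    h2, w2 = h // size, w // size
    trimmed = fmap[:h2 * size, :w2 * size]
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))

# Two 8x8 toy "images", each with a vertical edge at a different column.
img_a = np.zeros((8, 8)); img_a[:, 4:] = 1.0
img_b = np.zeros((8, 8)); img_b[:, 6:] = 1.0

# A classic vertical-edge filter (invented example weights).
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)

fmap_a = conv2d(img_a, edge_filter)
fmap_b = conv2d(img_b, edge_filter)

# Weight sharing: the same filter fires on both edges, just at shifted
# positions, so the peak response is identical for both images.
same_peak = fmap_a.max() == fmap_b.max()

pooled = max_pool(fmap_a)   # 6x6 feature map -> 3x3 after 2x2 pooling
```

A real CNN learns the filter weights by gradient descent rather than hand-coding them, but the sliding-window mechanics are the same.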
Why Deep Networks Work
A growing body of research tries to explain why deep learning works as well as it does. One influential explanation involves compositional functions.
Many phenomena can be described as a composition of simpler functions. An image of a face is composed of eyes, a nose, and a mouth, which are composed of shapes, lines, and textures. A function that models this structure is a compositional function.
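A toy sketch of this idea (the "feature" functions below are invented stand-ins, not a real vision pipeline): a complex function is written as a composition of simple stages, mirroring how a deep network stacks layers.

```python
import numpy as np

# Hypothetical feature stages, invented purely to illustrate composition.
def edges(x):                       # simplest local features
    return np.maximum(0.0, x)

def shapes(e):                      # combine features along the last axis
    return np.tanh(e.sum(axis=-1))

def face_score(s):                  # combine shapes into one abstract score
    return 1.0 / (1.0 + np.exp(-s))

x = np.random.default_rng(1).normal(size=(2, 3, 4))
y = face_score(shapes(edges(x)))    # f = face_score o shapes o edges
```

Each stage only has to solve a simple local problem; the complexity of the overall function comes from composing the stages, not from any single one.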
Approximation theory shows that both shallow (one hidden layer) and deep networks can approximate any continuous function, but deep networks are more efficient at representing compositional functions. A shallow network approximating a complex compositional function requires an exponentially large number of neurons. A deep network, by mirroring the compositional structure in its layered architecture, achieves the same accuracy with fewer parameters. This avoids the curse of dimensionality for this class of functions.
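One common way such results are formalized is the following (the exact exponents depend on the smoothness assumptions made about the constituent functions; treat the expressions below, with \(m\) the assumed smoothness and \(n\) the input dimension, as indicative rather than definitive):

```latex
% Units needed to approximate an n-variable function to accuracy \epsilon:
% a shallow network approximating a generic function of smoothness m needs
N_{\text{shallow}} = O\!\left(\epsilon^{-n/m}\right),
% while a deep network whose layers mirror a binary-tree compositional
% structure (constituent functions of two variables, smoothness m) needs only
N_{\text{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right).
```

The key point is that the deep network's cost grows linearly in \(n\) while the shallow network's grows exponentially, which is the precise sense in which depth avoids the curse of dimensionality for compositional functions.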
Recent theory suggests that the optimization landscape of over-parameterized deep networks (where parameters exceed training examples) is well-behaved. The abundance of parameters creates many global optima that SGD can find, reducing the problem of getting stuck in poor local minima.
Conclusion
Deep learning marks a departure from traditional machine learning, which relied heavily on hand-crafted feature engineering. By learning features directly from data in a hierarchical fashion, deep models perform well on tasks ranging from image and speech recognition to natural language processing and reinforcement learning. Together, deep architectures, hardware acceleration, and improved optimization account for much of the recent progress in artificial intelligence.