Improving optimization of convolutional neural networks through parameter fine-tuning
Nicholas Becherer · John Pecarina · Scott Nykl · Kenneth Hopkinson
Air Force Institute of Technology, Dayton, OH 45433, USA
In recent years, convolutional neural networks have achieved state-of-the-art performance in a number of computer vision problems such as image classification. Prior research has shown that a transfer learning technique known as parameter fine-tuning, wherein a network is pre-trained on a different dataset and its learned weights are then transferred, can boost the performance of these networks. However, the question of identifying the best source dataset and learning strategy for a given target domain remains largely unexplored. Thus, this research presents and evaluates various transfer learning methods for fine-grained image classification, as well as their effect on ensemble networks. The results clearly demonstrate the effectiveness of parameter fine-tuning over random initialization. We find that training should not be reduced after transferring weights; that larger, more similar source tasks tend to transfer best; and that a fine-tuned network can often outperform an ensemble of randomly initialized networks. The experimental framework and findings will help practitioners train models with improved accuracy.
Convolutional neural networks; Transfer learning; Computer vision; Parameter fine-tuning
1 Introduction
Convolutional neural networks (CNNs) are machine
learning models that extend the traditional artificial neural
network by adding increased depth and additional
constraints to the early layers. Recent work has focused on
tuning their architecture to achieve maximum performance
on benchmarks such as the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) [1, 2].
CNNs are not a new topic in the field of computer
vision. They can trace their origins back to the early 1980s
with Fukushima’s Neocognitron [4]. More directly, they
were shown to be highly effective in the 1990s when used
for handwritten digit recognition and eventually in industry
for automated check readers [5, 6]. They rely on several
successive convolutional layers to extract information from
an image. Since convolution is a shift-and-slide operation,
the features it extracts are robust to translations in the
data. Most importantly, these convolutional layers are fully
learnable through the backpropagation algorithm, meaning they
can identify low- and high-level patterns through supervised
training [7].
However, they fell out of favor in the new millennium because
of their difficulty scaling to larger problems [8]. Problems
beyond the optical character recognition or low-resolution
imagery were either too computationally expensive or
lacked enough training data to avoid overfitting.
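To make the shift-and-slide behavior concrete, the following is a minimal NumPy sketch of a single convolutional filter (a toy example of our own, not any network from the papers cited above). When the input edge is displaced by one pixel, the filter's response pattern is displaced by the same amount, which is why the extracted features tolerate translation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel across the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

kernel = np.array([[-1.0, 1.0]])  # simple horizontal-gradient (edge) filter

edge = np.zeros((6, 6))
edge[:, 3:] = 1.0                 # vertical edge at column 3

shifted = np.zeros((6, 6))
shifted[:, 4:] = 1.0              # the same edge, one pixel to the right

r1 = conv2d(edge, kernel)
r2 = conv2d(shifted, kernel)
# r2 is simply r1 displaced by one column: same response, new location.
```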
Recently, they have stepped back into the spotlight as
these problems have been overcome. In 2012, Krizhevsky
et al. [1] leveraged several recent advances to overcome
these issues in the 2012 ILSVRC. First, they used
NVIDIA’s CUDA programming language to implement their
CNN on a highly parallel GPU, reducing run time by orders
of magnitude [9]. Second, the ImageNet competition
included a dataset on the scale of millions of images
automatically sourced from the Internet [10]. Combined
with several new techniques such as dropout regularization
[11] and simple data augmentation, they presented a model
dubbed AlexNet that won the competition. Since the
introduction of AlexNet, the winning entries for the
ImageNet competition have all been CNNs [2, 12, 13]. These
newer CNNs have largely advanced the field by making the
basic CNN architecture deeper. The Oxford Visual Geometry
Group's VGG network experimented with 11–19 learnable
weight layers, finding that 19 was the optimal depth [13]. The
current leading network, GoogLeNet, has 6.7 million
learnable weights across 22 layers [2]. Others have focused
on improving the performance of CNNs through data
augmentation and training techniques [14]. Yet ultimately,
all techniques still required large amounts of training data
to be effective.
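Of the techniques above, dropout regularization admits a particularly compact illustration. The sketch below is our own minimal NumPy version, using the "inverted" scaling common in modern implementations (rather than AlexNet's original test-time rescaling); it randomly zeroes units during training so that no unit can be relied upon individually:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop, training=True):
    """Inverted dropout: zero each unit with probability p_drop during
    training, rescaling survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

x = np.ones((4, 8))
y = dropout(x, p_drop=0.5)  # roughly half the units zeroed, survivors scaled to 2.0
```

At test time (`training=False`) the layer is the identity, so no rescaling is needed at inference.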
Large amounts of data are needed because CNN training is an
extremely complex optimization problem. CNNs typically use
stochastic gradient descent (SGD) to find minima of a loss
function, and SGD requires large labeled training datasets
to minimize effectively. Since SGD is a greedy method, it is
not guaranteed to find the global minimum; this means that
the initialization of the weights can affect the final
outcome. These weights are usually initialized by sampling
from a Gaussian distribution [15]. However, it has been
shown that a transfer learning technique known as
parameter fine-tuning can improve the performance of a
CNN compared to random initialization (sometimes to a
substantial degree) [3].
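The contrast between the two initialization schemes can be sketched as follows. This is a schematic NumPy illustration with hypothetical layer names and shapes, not the architecture used in the experiments: parameter fine-tuning copies the transferable feature-extraction layers from a source network and re-initializes only the task-specific classifier from a Gaussian, so SGD starts from learned features rather than from pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_init(shape, std=0.01):
    """Standard random initialization: small zero-mean Gaussian weights [15]."""
    return rng.normal(0.0, std, size=shape)

# Weights of a (pretend) source network trained on a large source dataset.
source_net = {
    "conv1": gaussian_init((32, 3, 3, 3)),   # shared feature-extraction layers
    "conv2": gaussian_init((64, 32, 3, 3)),
    "fc": gaussian_init((1000, 1024)),       # 1000-way source-task classifier
}

def fine_tune_init(source, n_target_classes):
    """Copy the transferable layers; re-initialize only the classifier
    head, whose shape depends on the number of target classes."""
    target = {k: v.copy() for k, v in source.items() if k != "fc"}
    target["fc"] = gaussian_init((n_target_classes, 1024))
    return target

target_net = fine_tune_init(source_net, n_target_classes=37)
# SGD training on the target dataset then proceeds from these weights
# rather than from an entirely random starting point.
```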
However, there is a transfer learning technique that ca (...truncated)