End-to-End Supermask Pruning:
Learning to Prune Image Captioning Models
Abstract
With the advancement of deep models, research on image captioning has achieved a remarkable gain in raw performance over the last decade, along with increasing model complexity and computational cost. Surprisingly, however, compression of deep networks for the image captioning task has received little to no attention. For the first time in image captioning research, we provide an extensive comparison of various unstructured weight pruning methods on three popular image captioning architectures, namely Soft-Attention, Up-Down and Object Relation Transformer. Following this, we propose a novel end-to-end weight pruning method that performs gradual sparsification based on weight sensitivity to the training loss. The pruning schemes are then extended with encoder pruning, where we show that conducting decoder pruning and training simultaneously prior to encoder pruning provides good overall performance. Empirically, we show that an 80% to 95% sparse network (up to a 75% reduction in model size) can either match or outperform its dense counterpart. The code and pretrained models for Up-Down and Object Relation Transformer, capable of achieving CIDEr scores above 120 on the MSCOCO dataset at only 8.7 MB and 14.5 MB in model size (96% and 94% size reductions against their dense versions), are publicly available at https://github.com/jiahuei/sparse-image-captioning.
keywords:
image captioning, deep network compression, deep learning
1 Introduction
Over the past decade, continuous research on image captioning using deep neural networks (DNNs) has led to a steady improvement in overall model performance. For instance, CIDEr scores vedantam2015cider (Consensus-based Image Description Evaluation, a widely-used metric for caption quality that measures the level of consensus between generated captions and ground-truth captions) of state-of-the-art (SOTA) models have doubled from 66 points karpathy2015deep to 130 points and beyond herdade2019image ; cornia2020meshed on the MSCOCO dataset lin2014microsoft . However, such gains are usually achieved at the expense of model size using heavily parameterised models, where decoder size has quadrupled from 12 million xu2015show to 55 million herdade2019image parameters (see Table 2 for details).
In an effort to reduce model size, various pruning techniques have been proposed to remove unimportant weights from a network. Generally, weight pruning offers a multitude of benefits: it provides opportunities for improvements in i) speed, ii) storage, and iii) energy consumption, especially during the deployment stage. For speed, highly-sparse models with significantly fewer non-zero parameters can enjoy faster runtimes when combined with efficient SpMM kernels elsen2020fast ; wang2020sparsert . This is particularly true for Recurrent Neural Networks (RNNs) and Transformers, whose matrix-multiplication computations are bottlenecked by bandwidth kalchbrenner2018efficient . For storage, compressed models are easier to deploy onto mobile devices. Moreover, compressing SOTA model checkpoints into tens of MB can potentially accelerate the dissemination of research findings, result reproduction and experimentation. Finally, for energy consumption, small RNN kernels produced via pruning can be stored in on-chip SRAM cache with lower energy requirements rather than in DRAM han2015learning , reducing the carbon footprint.
While there is no shortage of pruning methods for image classification and translation tasks zhu2017prune ; chirkova2018bayesian ; louizos2018learning ; lee2018snip (see Sec. 2 for more), their applicability to multimodal contexts such as image captioning is still underexplored. To the best of our knowledge, there is only one prior work that involved pruning an image captioning model, namely that of Dai et al. dai2020grow . We hypothesise that this lack of progress is due to several difficulties. Firstly, weights are shared and reused across time steps, complicating variational pruning methods proposed for feed-forward networks chirkova2018bayesian . Secondly, naively performing structured pruning on Long Short-Term Memory (LSTM) kernels can lead to invalid units due to inconsistent dimensions wen2018learning . Thirdly, whereas the small CIFAR datasets allow quick experimentation and iteration for Convolutional Neural Network (CNN) pruning louizos2018learning ; frankle2018lottery , there is no equivalent dataset for image captioning. Finally, image captioning is an inherently complex multimodal task; thus any proposed method must perform well in both the image and language domains.
To this end, this paper attempts to answer the following questions:

(1) Which weight pruning method produces the best results on image captioning models?

(2) Is there an ideal sparsity at which a sparse model can match or even outperform its dense counterpart?

(3) What is the ideal prune-finetune sequence for pruning both the pretrained encoder and decoder?

(4) Can a sparse captioning model outperform a smaller but dense model?

We address these questions with an extensive comparison of various unstructured weight pruning methods on three SOTA image captioning architectures, namely Soft-Attention (SA) xu2015show , Up-Down (UD) anderson2018bottom and Object Relation Transformer (ORT) herdade2019image . Following this, we propose a novel end-to-end weight pruning method that performs gradual sparsification while maintaining overall model performance. The pruning schemes are then extended with encoder pruning, where several prune-finetune sequences are explored. Empirically, we show that conducting decoder pruning and training simultaneously prior to encoder pruning-and-finetuning provides better raw performance. Also, we show that for a given performance level, a large-sparse LSTM captioning model is better than a small-dense one in terms of model costs.
In summary, the core contributions of this paper are threefold. Firstly, this is the first extensive attempt at exploring unstructured model pruning for the image captioning task. Empirically, we show that 80% to 95% sparse networks can either match or even slightly outperform their dense counterparts (Sec. 5.2). In addition, we propose a pruning method, Supermask Pruning (SMP), that performs continuous and gradual sparsification during the training stage based on parameter sensitivity, in an end-to-end fashion. Secondly, we investigate the ideal way to combine pruning with finetuning of a pretrained CNN, and show that decoder pruning and training should both be done before pruning the encoder (Sec. 5.4). Finally, we release pretrained sparse models for UD and ORT that are capable of achieving CIDEr scores above 120 on the MSCOCO dataset, yet are only 8.7 MB (a 96% reduction compared to dense UD) and 14.5 MB (a 94% reduction compared to dense ORT) in model size (Fig. 1 and Sec. 5.3). Our code and pretrained models are publicly available at https://github.com/jiahuei/sparse-image-captioning.
2 Related Works
2.1 Image Captioning
Since the advent of deep neural networks, research on image captioning has been characterised by numerous architectural innovations in pursuit of raw performance (see hossain2019comprehensive for a complete survey). The first major innovation came in the form of an end-to-end captioning network that directly generates a caption given an image karpathy2015deep ; tan2019phrase . Next came visual attention, in which one or more CNN feature maps are used to guide and condition the caption generation process xu2015show ; fu2018image . There are also numerous works that used attributes as a way to directly inject salient information into the decoder chen2018show ; ding2019neural ; ji2021divergent . Following that, anderson2018bottom employed an object detector to generate image features as a form of hard attention; this, along with the Transformer, became a popular captioning paradigm herdade2019image ; cornia2020meshed ; wang2020word . Concurrently, substantial effort has been put into reinforcement learning, which allows non-differentiable caption metrics to be used as optimisation objectives rennie2017self ; chen2018temporal . While these methods have been successful in advancing SOTA performance, minimal effort has gone toward reducing model cost para2017exp ; tan2019comic , which is the main motivation of this work.
2.2 Structured or Channel Pruning
Structured pruning is a coarse-grain pruning technique whereby entire rows, columns or channels of fully-connected or convolutional weights are removed. There is extensive prior work in this direction targeted at feed-forward CNNs, including luo2020autopruner ; zhuang2018discrimination ; lin2020hrank ; li2020eagleeye ; lin2020channel , just to name a few. At the same time, structured pruning of RNNs is also widely explored wen2018learning ; yu2019learning ; wen2020structured .
Since structured pruning reduces model dimensions, the resulting network is more amenable to runtime speedup. However, this advantage comes with several costs: (a) Architectural constraints: for gated RNNs such as LSTMs, structured pruning requires that the pruned rows and columns of the recurrent weight kernels be aligned with each other; otherwise it may lead to invalid units wen2018learning . The same is true for attention kernels, which are extensively used in modern captioning architectures. (b) Lower sparsity: structured pruning usually provides lower sparsity for a given performance loss crowley2018pruning ; liu2019rethinking , often in the range of 40% (1.7×) to 90% (10×). In contrast, we demonstrate that unstructured pruning can prune an order of magnitude more, at 99% (100×), while maintaining performance (see Fig. 4).
2.3 Unstructured Pruning
Recently, unstructured pruning has enjoyed growing support, including fast SpMM kernels kalchbrenner2018efficient ; elsen2020fast ; wang2020sparsert , as well as block-sparsity support by the NVIDIA Ampere GPU (https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth) and the HuggingFace Transformers library (https://github.com/huggingface/pytorch_block_sparse). While there exist numerous unstructured pruning methods chirkova2018bayesian ; louizos2018learning ; wang2019eigendamage , we focus on methods applied to RNN and NLP models with the following characteristics: (a) Straightforward: reasonably simple to implement and integrate into a standard deep network training workflow. (b) Effective: able to prune at least 80% of parameters without compromising performance. (c) Efficient: does not require expensive iterative pruning and retraining cycles. Thus, we arrive at the following pruning methods as a solid starting point for exploring image captioning model pruning:
(1) Hard / one-shot magnitude-based pruning: see2016compression first investigated three magnitude-based schemes for translation models with multi-layer LSTMs, namely class-blind, class-uniform and class-distribution. Class-blind removes parameters with the smallest absolute values regardless of weight class. In contrast, class-uniform prunes every layer to the same sparsity level. Class-distribution han2015learning prunes parameters smaller than a global factor of each class's standard deviation. Experiments found that class-blind produced the best results.
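To make the distinction concrete, the class-blind and class-uniform schemes can be sketched in a few lines of NumPy (the function names are ours; this is a simplified sketch, not the implementation of see2016compression):

```python
import numpy as np

def prune_class_blind(weights, sparsity):
    """Class-blind: zero the globally-smallest magnitudes across all layers."""
    all_w = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * all_w.size)
    threshold = np.sort(np.abs(all_w))[k]  # k-th smallest magnitude overall
    return [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]

def prune_class_uniform(weights, sparsity):
    """Class-uniform: prune every layer to the same sparsity level."""
    pruned = []
    for w in weights:
        k = int(sparsity * w.size)
        threshold = np.sort(np.abs(w).ravel())[k]
        pruned.append(np.where(np.abs(w) < threshold, 0.0, w))
    return pruned
```

Class-distribution would instead derive a per-class threshold from a global factor times each class's weight standard deviation.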
(2) Gradual magnitude-based pruning: first introduced by narang2017exploring to prune parameters gradually over the course of training, it was extended by zhu2017prune via a simplified pruning curve with fewer hyperparameters. The simplified schedule has a single phase, governed by a cubic function that determines the sparsity level at each training step. Their method was tested on deep CNN, stacked LSTM and GNMT models for classification, language modelling and translation.
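The cubic schedule of zhu2017prune interpolates the sparsity from an initial value to the final target over a pruning window; a minimal sketch:

```python
def gradual_sparsity(step, s_init, s_final, start_step, end_step):
    """Cubic sparsity schedule (Zhu & Gupta, 2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (t_end - t0))**3
    Sparsity rises quickly at first, then flattens as it nears the target."""
    if step <= start_step:
        return s_init
    if step >= end_step:
        return s_final
    frac = (step - start_step) / (end_step - start_step)
    return s_final + (s_init - s_final) * (1.0 - frac) ** 3
```

At each pruning step, the smallest-magnitude weights are masked until the layer reaches the sparsity dictated by the schedule.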
(3) SNIP: lee2018snip proposed a saliency criterion for identifying structurally important connections. The criterion is computed as the absolute magnitude of the derivative of the training loss with respect to a set of multiplicative pruning masks. Guided by this criterion, single-shot pruning of CNNs and RNNs is performed at initialisation, prior to training. It was evaluated on image classification.
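For a multiplicative mask c applied at initialisation (c = 1 everywhere), the chain rule gives dL/dc = w * dL/dw, so the saliency reduces to |w * dL/dw|. Below is a small NumPy sketch for a linear layer with squared loss (our own simplification for illustration, not the authors' code):

```python
import numpy as np

def snip_scores(w, x, y):
    """SNIP saliency for y_hat = x @ w under 0.5 * mean squared error.
    The gradient is computed analytically; scores are normalised to sum to 1."""
    grad_w = x.T @ (x @ w - y) / len(x)
    scores = np.abs(w * grad_w)
    return scores / scores.sum()

def snip_mask(scores, keep_ratio):
    """Single-shot pruning at initialisation: keep the top-k connections."""
    k = int(round(keep_ratio * scores.size))
    idx = np.argsort(scores.ravel())[-k:]
    mask = np.zeros(scores.size)
    mask[idx] = 1.0
    return mask.reshape(scores.shape)
```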
(4) Lottery ticket (LT): a seminal work by frankle2018lottery which put forth "The Lottery Ticket Hypothesis": there exists a subnetwork within a randomly initialised dense neural network that, when trained in isolation, can match the test accuracy of the original network. By iteratively pruning networks and resetting them to their original initialisation values, the authors found sparse networks that reach the original dense accuracy within equal or shorter training iterations. It was tested on CNNs for image classification.
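In the one-shot variant used in our comparisons, the protocol amounts to train, prune, reset, retrain. A sketch with a placeholder `train_fn` (illustrative only; real training is a full optimisation loop):

```python
import numpy as np

def one_shot_lottery_ticket(w_init, train_fn, sparsity):
    """One-shot lottery ticket: train the dense network, derive a magnitude
    mask, reset surviving weights to their initial values, then retrain."""
    w_trained = train_fn(w_init)
    k = int(sparsity * w_trained.size)
    threshold = np.sort(np.abs(w_trained).ravel())[k]
    mask = (np.abs(w_trained) >= threshold).astype(w_init.dtype)
    return train_fn(w_init * mask) * mask  # retrain the reset, masked network
```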
(5) Supermask: the work by zhou2019deconstructing explored various aspects of Lottery Tickets in order to determine the reason behind their success. In the process, the authors discovered that binary pruning masks can be learned in an end-to-end fashion. However, as formulated, only the masks are optimised, and there is no straightforward way to control network sparsity. Another work by srinivas2017training optimised both the masks and weights of CNNs, yet similarly, its final sparsity is only influenced indirectly via a set of regularisation hyperparameters. In this work, we extend Supermask with a novel sparsity loss (Eq. (6)) to directly control the final sparsity of the network.
Among the prior works on model pruning, only the work by Dai et al. dai2020grow involved image captioning. However, there are several differences with our work: (a) only the HLSTM cell is pruned; (b) CNN weight pruning is not investigated; (c) the grow-and-prune (GP) method dai2019nest that they used requires expensive and time-consuming "grow" and "prune-retrain" cycles. In contrast, our approach prunes both encoder and decoder in parallel with regular training. Nevertheless, we provide a performance comparison with HLSTM in Table 3.
3 Supermask Revisited
Supermask zhou2019deconstructing is a network training method proposed by Zhou et al. as part of their work on studying the Lottery Ticket phenomenon frankle2018lottery . Their work aimed to uncover the critical elements behind the good performance of "winning tickets": sparse networks that emerge from iterative prune-reset cycles. In the process, the authors discovered that an untrained, randomly initialised network can attain test performance significantly better than chance. This is achieved by applying a set of well-chosen masks to the network weights, effectively pruning it. These masks were hence named "Supermasks", in that they boost performance even without training of the underlying weights.
3.1 Learning Supermasks
Supermasks are learned in an end-to-end fashion via stochastic gradient descent (SGD). For every weight matrix $W$ to be pruned, a gating matrix $M$ with the same shape as $W$ is created. This gating matrix operates as a masking mechanism that determines which of the parameters are involved in both the forward execution and the backpropagation of the graph. For a model with $L$ layers, we now have two sets of parameters: gating parameters $\{M_l\}$ and network parameters $\{W_l\}$. To this end, the effective weight tensor is computed following Eq. (1):
$W_{\mathrm{eff}} = W \odot M^{b}$ (1)
where $W$ and $M^{b}$ are the original weight matrix and the gating matrix, both of the same shape; superscript $b$ indicates binary variables; and $\odot$ is element-wise multiplication.
In order to achieve the desired masking effect, $M^{b}$ must contain only "hard" binary values, i.e. entries in $\{0, 1\}$. Therefore, the matrix $M$ containing continuous values is transformed into the binary matrix $M^{b}$ using a composite function $g = g_{s} \circ g_{f}$. Here, $g_{f}$ is a pointwise function that squeezes continuous values into the interval $[0, 1]$, whereas $g_{s}$ is another pointwise function that samples from the output of $g_{f}$. This is shown in Eq. (2):
$M^{b} = g_{s}(g_{f}(M))$ (2)
Sampling from $g_{f}(M)$ is done by treating its outputs as Bernoulli random variables and performing an "unbiased draw": each gating value $p$ is binarised to 1 with probability $p$ and to 0 otherwise, i.e. $g_{s}(p) \sim \mathrm{Bernoulli}(p)$. The sigmoid function $\sigma$ is employed as $g_{f}$. Finally, the effective weight can be computed as follows by modifying Eq. (1):
$W_{\mathrm{eff}} = W \odot g_{s}(\sigma(M))$ (3)
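A minimal NumPy sketch of this forward computation (Eqs. (1) to (3)) is given below; the names are illustrative, and the gradient path through the non-differentiable draw (handled later via the straight-through estimator) is omitted:

```python
import numpy as np

def sigmoid(m):
    """g_f: squeeze continuous gating values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-m))

def bernoulli_draw(p, rng):
    """g_s, the unbiased draw: each gate is 1 with probability p, else 0."""
    return (rng.random(p.shape) < p).astype(p.dtype)

def effective_weight(w, m, rng):
    """W_eff = W * g_s(sigmoid(M)): mask the weights with sampled binary gates."""
    return w * bernoulli_draw(sigmoid(m), rng)
```

Highly negative gating values are thus almost never kept, while highly positive ones are almost always kept.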
Before training, all gating variables are initialised with the same constant value, whereas the weights of the network are initialised randomly. The authors found that the stochastic sampling performed by $g_{s}$ helped to mitigate the bias arising from the constant initialisation by injecting stochasticity into the training process.
Although Supermask is an effective pruning technique, the formulation as presented does not allow easy control of the final network sparsity. Instead, the pruning ratios are indirectly controlled via the initialisation magnitude of the pruning masks. To address this limitation, we propose the Supermask Pruning (SMP) method, explained next.
4 Supermask Pruning (SMP)
In this paper, we propose a simple yet effective method to directly control the final weight sparsity of models pruned under the Supermask framework. To achieve this, a novel sparsity loss is formulated which allows one to drive the sparsity level of the gating variables to a user-specified level $s_{\mathrm{target}}$. We name our method Supermask Pruning (SMP); an overview is illustrated in Fig. 2, and the complete algorithm is given in Algorithm 1.
4.1 Sparsity Loss
Technically, a straightforward way to influence the sparsity and pruning rate of Supermask is to introduce an $\ell_{1}$ regularisation term on the binary masks as follows:
$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N}\,\mathrm{NNZ}(M^{b})$ (4)
where $\mathrm{NNZ}(\cdot)$ is the number of non-zero (NNZ) gating parameters; $N$ is the total number of gating parameters; and $t$ and $T$ are the current and final training steps respectively. Such a regularisation term as formulated in Eq. (4) applies downward pressure on the magnitudes of the gating parameters over the course of training, so that by the end of training, most gating parameters have magnitudes smaller than zero. Ideally, these negative-valued gating parameters represent the least important weights, which can thus be removed without significant performance impact. At the same time, smaller gating magnitudes cause more weights to be dropped more frequently, which in turn allows the network to learn to depend on fewer parameters.
However, while naively applying the regularisation term (Eq. (4)) can produce networks with the desired sparsities, it does not achieve optimal performance. Our preliminary experiments found that constant application of $\mathcal{L}_{\mathrm{reg}}$ causes weights to be dropped too early in the training process; in other words, it leads to an over-aggressive pruning schedule. To mitigate this, we propose to perform loss annealing by applying a variable weight $\omega_{t}$ to the regularisation term in the cost function.
Our idea is that at the beginning of training, $\omega_{t}$ is set to zero to allow network learning to progress without any pruning. As training progresses, $\omega_{t}$ is gradually increased, pushing the model towards a sparse solution. Specifically, this loss annealing follows an inverted cosine curve. Our experiments found that such gradual weight pruning produces better results, which is consistent with the observations in narang2017exploring ; zhu2017prune . Thus, our final sparsity loss is given as:
$\omega_{t} = \frac{1}{2}\left[1 - \cos\left(\pi\, t / T\right)\right]$ (5)
$\mathcal{L}_{s} = \omega_{t}\,\max\!\left(0,\; \frac{\mathrm{NNZ}(\hat{M})}{N} - \left(1 - s_{\mathrm{target}}\right)\right)$ (6)
Note that to compute $\mathcal{L}_{s}$, it is necessary to sample from the gating parameters. However, instead of using $g_{s}$ as the sampling function, we perform a "maximum-likelihood (ML) draw" srinivas2017training on $\sigma(M)$ in order to ensure determinism when calculating the sparsity. The ML draw thresholds each value at 0.5, i.e. $g_{\mathrm{ML}}(p) = \mathbb{1}[p > 0.5]$, yielding the deterministic binary mask $\hat{M} = g_{\mathrm{ML}}(\sigma(M))$. For a model with $L$ layers and gating variables $M_{1}, \dots, M_{L}$, this computation takes the following form:
$\mathrm{NNZ}(\hat{M}) = \sum_{l=1}^{L}\left\|\,g_{\mathrm{ML}}(\sigma(M_{l}))\,\right\|_{1}$ (7)
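Putting the pieces together, the annealing weight, the ML draw and the resulting sparsity loss can be sketched as follows. This is a hypothetical rendition for illustration (one plausible form of the loss is shown, penalising only the NNZ budget that exceeds the target), not the released implementation:

```python
import math
import numpy as np

def anneal_weight(step, total_steps):
    """Inverted cosine annealing: 0 at the start of training, 1 at the end."""
    return 0.5 * (1.0 - math.cos(math.pi * min(1.0, step / total_steps)))

def ml_draw(p):
    """Maximum-likelihood draw: deterministic threshold at 0.5."""
    return (p > 0.5).astype(np.float64)

def model_sparsity(gates):
    """Fraction of pruned gates, computed from the ML-drawn masks of all layers."""
    sigmoid = lambda m: 1.0 / (1.0 + np.exp(-m))
    nnz = sum(ml_draw(sigmoid(m)).sum() for m in gates)
    total = sum(m.size for m in gates)
    return 1.0 - nnz / total

def sparsity_loss(gates, step, total_steps, target_sparsity):
    """Annealed sparsity loss: zero once the target sparsity is reached."""
    shortfall = max(0.0, target_sparsity - model_sparsity(gates))
    return anneal_weight(step, total_steps) * shortfall
```

The loss vanishes once the measured sparsity reaches the target, so pruning pressure is applied only while the network is above its NNZ budget.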
As both sampling functions $g_{s}$ and $g_{\mathrm{ML}}$ are non-differentiable, gradient backpropagation has to be performed via an estimator. On this front, bengio2013estimating explored several gradient estimators for stochastic neurons, and found the straight-through estimator (STE) to be simple yet performant. Hence, backpropagation is performed by treating both sampling functions as identity functions, such that the gradients are estimated as:
$\frac{\partial \mathcal{L}}{\partial M} \approx \frac{\partial \mathcal{L}}{\partial M^{b}}$ (8)
Finally, given an image $I$ and caption $S$, the overall cost function for training the image captioning model with gating variables is a weighted combination of the captioning loss $\mathcal{L}_{\mathrm{cap}}$ and the sparsity loss $\mathcal{L}_{s}$:
$\mathcal{L} = \mathcal{L}_{\mathrm{cap}}(I, S) + \lambda_{s}\,\mathcal{L}_{s}$ (9)
Intuitively, the captioning loss term provides a supervised way to learn the saliency of each parameter: important parameters are retained with higher probability, whereas unimportant ones are dropped more frequently. On the other hand, the sparsity regularisation term pushes down the average value of the gating parameters so that most of them have a value of less than 0.5 after sigmoid activation. The hyperparameter $\lambda_{s}$ determines the weightage of $\mathcal{L}_{s}$. If $\lambda_{s}$ is set too low, the target sparsity level might not be attained (see Sec. 5.7). Visualisations of the training progression are given in Sec. 5.6.
4.2 Inference
After model training is completed, all weight matrices are transformed into sparse matrices via element-wise multiplication with their binary masks. This is done by sampling from $\sigma(M)$ using $g_{\mathrm{ML}}$, after which the gating parameters $M$ can be discarded. In other words, the final weights are calculated as:
$W_{\mathrm{final}} = W \odot g_{\mathrm{ML}}(\sigma(M))$ (10)
The final sparse network can then be stored in an appropriate sparse matrix format such as Coordinate List (COO) or Compressed Sparse Row (CSR) in order to realise compression in terms of storage size. This can be done easily in PyTorch, which supports saving parameters in COO format (as of release 1.6). Alternatively, general-purpose compression algorithms such as gzip can be used to compress the model.
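As a quick illustration of why sparse checkpoints compress well even under a generic algorithm, the sketch below compares the gzip-compressed sizes of a dense matrix and a roughly 95%-sparse one (sizes are indicative only; dedicated COO/CSR storage reduces the footprint further by keeping only indices and non-zero values):

```python
import gzip
import io
import numpy as np

def compressed_size(arr):
    """Size in bytes of the gzip-compressed raw buffer of `arr`."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(arr.tobytes())
    return buf.getbuffer().nbytes

rng = np.random.default_rng(0)
dense = rng.normal(size=(512, 512)).astype(np.float32)
sparse = dense * (rng.random(dense.shape) < 0.05)  # keep ~5% of entries
```

On this example the sparse matrix compresses to a small fraction of the dense matrix's size, since the long runs of zero bytes are highly compressible.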
5 Experiments
In this section, we first present the setup of our experiments, followed by the results obtained from over 6,000 GPU hours using 2 Titan X GPUs.
5.1 Experiment Setup
Architectures: Three popular image captioning architectures are used in this work: Soft-Attention (SA) xu2015show , Up-Down (UD) anderson2018bottom and Object Relation Transformer (ORT) herdade2019image . SA consists of Inception-V1 ioffe2015batch and a single-layer LSTM or GRU with a single-head attention function. Other details such as context size, attention size and image augmentation follow tan2019comic . For UD and ORT, we reuse the public implementations (https://github.com/ruotianluo/self-critical.pytorch/tree/3.2 and https://github.com/yahoo/object_relation_transformer).
Hyperparameters: For all training, we utilise Adam kingma2014adam as the optimiser, with a non-default epsilon value for SA and UD. The SA models were trained for 30 epochs, whereas the UD and ORT models were trained for 15 epochs. A cosine LR schedule was used for SA and UD, whereas ORT follows herdade2019image . Following narang2017exploring and han2015learning , lower dropout rates are used for sparse networks to account for their reduced capacity. The rest follows tan2019comic .
For Supermask Pruning (SMP), training of the gating variables is done with a higher constant learning rate (LR) of 100 without annealing. This requirement of a higher LR is also noted in zhou2019deconstructing . All gating variables are initialised to the same constant value.
The other pruning methods are trained as follows. Hard: pruning is applied after decoder training is completed; the model is then retrained for 10 epochs. Gradual: pruning begins after the first epoch is completed and ends at half of the total epochs, following the heuristics outlined in narang2017exploring , with a pruning frequency of 1,000 steps. We use the standard scheme where each layer is uniformly pruned. SNIP: pruning is done at initialisation using one batch of data; the implementation is based on the authors' code (https://github.com/namhoonlee/snip-public). Lottery Ticket: winning tickets are produced using hard-blind, hard-uniform and gradual pruning. For a fair comparison with other single-shot pruning methods, we follow the one-shot protocol instead of the iterative protocol.
Inference is performed using beam search without length normalisation.
Datasets: Experiments are performed on MSCOCO lin2014microsoft which is a public English captioning dataset. Following prior captioning works, we utilise the “Karpathy” split karpathy2015deep , which assigns 5,000 images for validation, 5,000 for testing and the rest for training. Preprocessing of captions is done following tan2019comic .
Metrics: Evaluation scores are obtained using the publicly available MSCOCO evaluation toolkit (https://github.com/salaniz/pycocoevalcap), which computes BLEU, METEOR, ROUGE-L, CIDEr and SPICE (B, M, R, C, S).
5.2 Pruning Image Captioning Models
In this section, we attempt to answer Questions (1) and (2) in Sec. 1 via extensive performance comparisons of the pruning methods at multiple sparsity levels. We first present the pruning results on SA in Fig. 3, followed by UD and ORT in Fig. 4. Pruning is applied to all learnable parameters except for normalisation layers and biases. All results herein were obtained using teacher-forcing with cross-entropy loss.
Which pruning method produces the best results? Our proposed end-to-end Supermask Pruning (SMP) method provides good performance relative to the dense baselines. This observation holds even at high pruning ratios of 95% and above. In particular, the relative drops in CIDEr scores for UD and ORT remain marginal even at extreme pruning rates. This is in contrast with competing methods, whose performance drops are double or even triple ours, especially on SA and UD. To further support this observation, we compute the uniqueness and length of captions produced by our sparse SMP models. The results in Table 1 show that both are largely unaffected by the pruning rate.
Among the competing methods, gradual pruning generally outperforms hard pruning, especially at higher sparsity levels when NNZ falls to 0.6 M and below. On the other hand, the results of LTs indicate that model resetting in a one-shot scenario does not outperform direct application of the underlying pruning method. We note that better results have been reported using iterative prune-reset-train cycles; however, that would lead to excessively long training times and unfair comparisons with the other pruning methods.
Another notable result is the relatively poor performance of SNIP when applied to image captioning. We can observe in Fig. 3 that the performance of SNIP is acceptable at 80% sparsity only; any higher sparsity level quickly leads to a collapse in caption quality, as indicated by the metric scores. We tried accumulating the saliency criterion across 100 batches in an attempt to improve the result, but the improvement was limited, with a huge gap from the baseline (to ensure there were no critical errors in our implementation, we successfully reproduced the results for LSTM-b on MNIST with a lower error rate averaged across 20 runs). All in all, these results reflect the difficulty of pruning generative models, as well as the importance of testing on larger datasets.
Is there an ideal sparsity? A broad trend that emerged from Figs. 3 and 4 is that model performance depends more on the number of NNZ parameters remaining after pruning than on the sparsity level itself. Both the UD and ORT models, which are substantially larger than the SA model, can achieve correspondingly higher sparsity. On the extreme end, we were able to prune 99.1% of parameters from the networks while suffering only small CIDEr drops for both UD and ORT.
In addition, there are indeed ideal sparsity levels at which sparse models can either match or outperform their dense counterparts. This occurs at 80% sparsity for SA, and at 95% sparsity for both UD and ORT. We did not investigate these models further at lower sparsities: although it is reasonable to expect better performance, the model sizes also increase substantially.
All in all, these results showcase the strength of SMP across pruning ratios from 80% to 99.1%, maintaining good performance relative to the dense baselines and other pruning methods.
5.3 SOTA Comparison
In this section, we compare models pruned using our proposed SMP against both HLSTM by Dai et al. dai2020grow ; dai2019nest and standard captioning SOTA approaches.
HLSTM comparison: In Table 3, we provide compression rate and model performance comparisons with dai2020grow . Our SMP models are SA models trained and finetuned using teacher-forcing. As can be seen, both SMP models at 90% and 95% sparsity outperform HLSTM on both BLEU-4 and CIDEr despite having smaller RNN sizes. Furthermore, SMP does not require the expensive and time-consuming "grow-prune-retrain" cycles required by dai2019nest .
SOTA comparison: To demonstrate that sparse SMP models are competitive with standard SOTA works, we compare UD and ORT models pruned using SMP against several SOTA approaches in Table 2. We optimised our models for BLEU-4 and CIDEr using SCST rennie2017self , but with the mean of rewards as the baseline, following cornia2020meshed ; luo2020better . Sparse models are saved in PyTorch COO format. For float16 models, weights are converted back to single precision prior to computation.
From the results, it is evident that our pruned models are still capable of obtaining good captioning performance. In fact, our 95% sparse UD and ORT models managed to outperform their original dense counterparts. This is consistent with the findings in Sec. 5.2, which identified 95% sparsity as ideal. Finally, despite their relatively small model sizes of 10 MB and 21 MB, our 99.1% sparse models provided good results as well. The 99.1% sparse UD model, in particular, is able to match the dense UD model on CIDEr while outperforming it on BLEU-4.
5.4 Pruning Sequence for Encoder
In this section, we attempt to answer Question (3): what is the ideal prune-finetune sequence for the encoder? To answer this, we devised three prune-finetune schemes for the SA model as follows:
Scheme A: Start from scratch: Train the decoder while pruning both the encoder and decoder. Then, finetune both with gating frozen (i.e. not updated).
Scheme B: Start from a trained decoder: Finetune and prune both the encoder and decoder.
Scheme C: Start from a trained and pruned decoder: Finetune both the encoder and decoder, but only prune the encoder. The decoder's pruning masks are left frozen.
We paired each of the schemes with three pruning methods from the previous section, namely i) class-blind hard pruning, ii) gradual pruning and iii) SMP. All learnable parameters were pruned except for normalisation layers and biases. For schemes where the gating parameters are frozen, we still apply $g_{s}$ to sample from $\sigma(M)$; however, we also found that there is minimal difference in final performance when $g_{\mathrm{ML}}$ is used instead. Scheme A is not evaluated for hard-blind pruning, as that method requires a trained model prior to pruning.
From Fig. 5, it is evident that Scheme A produces polarised results: it is the best when paired with SMP, yet the worst with gradual and hard-blind pruning. On the other hand, Scheme C is consistently favoured over Scheme B for all three pruning methods. This shows that better performance can be attained when pruning and training of the decoder are done in parallel rather than separately.
Comparing the three pruning methods, we can see that the trends are consistent with the results obtained for decoder pruning in the previous section. Across different sparsity levels, our SMP method produces the best performance. At 80% sparsity, there is barely any performance loss relative to the baselines, with only a marginal drop in CIDEr score for LSTM and no difference for GRU. At the other extreme, with only 2.5% of parameters remaining, our models still achieved reasonable CIDEr scores for both LSTM and GRU, while gradual and hard-blind pruning scored considerably lower.
5.5 LargeSparse vs SmallDense
Can a sparse model outperform a smaller dense model? Towards that end, empirical results are given in Table 4. All models are based on SA in Sec. 5.2 but with different CNN and LSTM sizes. The CNNs are MobileNet-V1 variants: Dense-L and the sparse models use a larger width multiplier than Dense-M and Dense-S. Moreover, Dense-M and Dense-S use smaller word embedding, attention and LSTM sizes. The FLOP counts are for generating a 9-word caption from a 224×224 image using beam search (the average caption length is 9). Sparse models are pruned using SMP.
Comparing models with similar metric scores, large-sparse models often have smaller NNZ and FLOP counts than their dense counterparts. Notably, a 95% sparse model can provide performance comparable to Dense-M, which is larger and heavier ( NNZ and FLOP). This further showcases the strength of model pruning and solidifies the observations made in works on RNN pruning zhu2017prune ; narang2017exploring .
5.6 Qualitative Results and Visualisations
In this section, we present examples of generated captions, as well as visualisations of the training progression, final layer-wise sparsities and weight distributions of sparse SMP models.
Qualitative results: Figure 6 shows the captions produced by our sparse UD and ORT models from Table 2. From the samples, we can see that the overall caption quality is satisfactory, with sufficient details such as umbrellas, living room, fence and school bus. Object counts are largely correct except for the 5th image, in which a single bird is mistaken for two. The last image shows captions with bad endings, which is a side-effect of SCST optimisation.
Training progression: Meanwhile, in Fig. 7, we can observe the effects of the cosine annealing from Eq. (5) and the sparsity regularisation weightage from Eq. (9) on the final weighted sparsity loss term. Loss annealing allows the model to focus on learning useful representations for the captioning task during the early stages of training, and then move towards a sparse solution during the middle to late stages once training has stabilised. Note that although both figures show that sparsity levels only start to increase at around 25% of the total training steps, the pruning process actually begins much earlier: the average value of the gating variables starts to decrease around 10% into training, and continues to drop towards throughout the later stages of the training process. We can also observe that the training loss (XE loss) remains relatively stable throughout the training and pruning process, even for a 97.5%-sparse model.
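The annealing behaviour described above can be sketched as follows. We assume the common inverted-cosine form 0.5·(1 − cos(πt/T)) for the loss weight; the exact shape of Eq. (5) may differ, but the qualitative effect is the same: the sparsity term is nearly switched off early in training and ramps up smoothly in the middle-to-late stages.

```python
import math

def annealed_weight(step, total_steps, max_weight=1.0):
    """Inverted-cosine annealing of the sparsity loss weight.

    Near zero at the start (model focuses on the captioning task),
    half of max_weight at the midpoint, and max_weight at the end
    (model is pushed towards a sparse solution)."""
    progress = min(step / total_steps, 1.0)
    return max_weight * 0.5 * (1.0 - math.cos(math.pi * progress))
```

For example, at 10% of training the weight is still only about 2.4% of its maximum, which is consistent with pruning beginning gently well before sparsity becomes visible.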
Layer-wise sparsities: For the InceptionV1 encoder (Fig. 7(a)) pruned using SMP or hard-blind pruning, we can see that earlier convolution layers with fewer parameters are pruned less heavily than later layers. This behaviour is consistent with findings in elsen2020fast . We can also see that the convolution kernel of the second branch of each Inception module is pruned the most compared to the rest.
For the LSTM decoder (Fig. 7(b)), SMP and hard-blind pruning consistently prune the “QK” layer (the second layer of the 2-layer attention MLP) the least, whereas the “Key” and “Query” layers are pruned most heavily. Finally, “Embedding” consistently receives more pruning than “Output” despite having fewer parameters. This may indicate that there is substantial information redundancy in the word embedding matrix, as noted in shi2018structured ; tan2019comic .
For the MobileNetV1 encoder (Fig. 7(c)), SMP consistently prunes the pointwise (1×1) convolution kernels significantly more than the depthwise kernels. This is a desirable outcome, as pointwise operations overwhelmingly dominate the computation budget of separable convolutions, both in terms of FLOP count and parameters howard2017mobilenets ; elsen2020fast .
Weight distribution: Lastly, the distributions of non-zero weights are visualised via kernel density estimation plots in Fig. 9. We can see that the remaining weights follow a bimodal distribution centred around zero. Notably, a considerable number of small-magnitude weights remain after fine-tuning, even for extremely sparse models.
5.7 Ablation Studies
This section investigates the impact of hyperparameters on the performance of SMP.
Table 5 shows the effect of different gating initialisation values . From the results, we can establish that the best overall performance is achieved when . This can be attributed to the fact that an initialisation value of allows model parameters to be retained with high probability during the early stages of training, leading to better convergence. This observation is also consistent with the works of zhu2017prune ; narang2017exploring , where it is found that gradual pruning can lead to better model performance. Thus, we recommend setting .
Table 6 shows the effect of the sparsity regularisation weightage . This is an important hyperparameter that affects the final sparsity level at convergence, with higher sparsity targets requiring larger values. From the results, we can see that low values lead to insufficient sparsity. At the same time, we found that setting to a large value does not necessarily degrade final performance, as values of 80 and 120 were used for the UD and ORT models in Fig. 4.
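One simple way such a weightage could enter the objective is as a hinge penalty that vanishes once the target sparsity is reached; this is a hedged sketch, not necessarily the exact form of Eq. (9), but it illustrates why a large weightage need not hurt final performance, since the penalty contributes nothing after the target is met.

```python
def sparsity_loss(current_sparsity, target_sparsity, weightage):
    """Hinge-style sparsity penalty (illustrative form).

    Penalises the model only while it is less sparse than the target;
    once current_sparsity >= target_sparsity the term is exactly zero,
    so even a large weightage stops influencing training."""
    return weightage * max(target_sparsity - current_sparsity, 0.0)
```

For instance, at 50% sparsity with a target of 80% and a weightage of 80, the penalty is 24; once the model reaches or exceeds 80% sparsity, the penalty drops to zero regardless of the weightage.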
6 Discussion
In the formulation of SMP, the sparsity loss is annealed using an inverted cosine curve defined in Eq. (5). This annealing schedule is inspired by works on gradual pruning, as well as works on the Variational Recurrent Auto-Encoder (VRAE) for text generation. In particular, narang2017exploring found that gradual pruning is 7% to 9% better than hard pruning. Our experiments comparing gradual-uniform and hard-uniform pruning found this to be generally true, especially at high sparsity levels (see Sec. 5.2 and 5.4). Meanwhile, loss annealing is also used to train VRAEs for text generation in the form of Kullback–Leibler (KL) annealing bowman2016generating . Specifically, the KL regularisation term is gradually introduced during training in order to shift the model from a vanilla RAE to a VRAE. In the same spirit, SMP gradually transitions the model from dense to sparse, as shown in Fig. 7.
Another perspective on the effectiveness of gradual pruning or sparsity annealing can be found in the notion of “Information Plasticity” introduced by achille2018critical . In that work, it is found that DNN optimisation exhibits two distinct learning phases: a critical “memorisation phase” during which the information stored in the weights, as measured by Fisher Information, rapidly increases, followed by a “forgetting phase” in which the amount of information gradually decreases and the network becomes less adaptable to change. This suggests that an ideal pruning schedule should impose sparsity constraints after the network has passed its critical learning phase, but while it is still plastic enough to adapt to such changes.
Beyond this, sensitivity-based pruning is another equally important aspect of SMP. In SNIP lee2018snip , weights are pruned based on the absolute magnitude of the derivative of the training loss with respect to the multiplicative pruning masks. In contrast, SMP achieves this by updating the gating parameters according to their gradients. This crucial difference means that whereas SNIP removes the weights with the least influence on the training loss regardless of sign, SMP will also remove weights with a negative influence (i.e. those that increase the training loss).
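The contrast can be illustrated with a toy numerical example (a sketch under our own assumptions; function names, the learning rate and the gradient values are hypothetical). Given mask gradients dL/dm for three weights, a SNIP-style criterion keeps those with the largest magnitude regardless of sign, while a gradient-descent step on the gating parameters pushes the gate of a harmful weight (positive dL/dm: removing it would lower the loss) towards zero and opens the gate of a helpful one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def snip_keep(grad_wrt_mask, keep_ratio):
    """SNIP-style criterion (sketch): keep the weights with the largest
    |dL/dm|, ignoring the sign of the gradient."""
    k = max(int(round(keep_ratio * grad_wrt_mask.size)), 1)
    threshold = np.sort(np.abs(grad_wrt_mask))[-k]
    return np.abs(grad_wrt_mask) >= threshold

def smp_gate_step(w_g, grad_wrt_mask, lr=10.0):
    """SMP-style update (sketch): descend the loss gradient through the
    sigmoid gate.  Positive dL/dm (weight hurts the loss) drives the gating
    parameter down; negative dL/dm (weight helps) drives it up."""
    sig = sigmoid(w_g)
    return w_g - lr * grad_wrt_mask * sig * (1.0 - sig)

g = np.array([0.9, -0.9, 0.01])        # harmful, helpful, negligible weight
keep = snip_keep(g, 2.0 / 3.0)         # SNIP keeps both large-|g| weights
w_new = smp_gate_step(np.zeros(3), g)  # SMP closes the harmful weight's gate
```

Here SNIP retains the harmful weight (its gradient magnitude is large) and drops only the negligible one, whereas the gate update pushes the harmful weight towards removal while keeping the helpful weight open.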
Moreover, SNIP computes sensitivity at initialisation using one or more batches of training data. This implies that SNIP can be sensitive to the choice of weight initialisation scheme, as stated in the paper. In contrast, SMP performs continuous and gradual sparsification throughout the training process, making it less sensitive to weight initialisation. In fact, Section 5.2 shows that SMP can be used on a variety of architectures, each with its own set of initialisation schemes.
By combining these insights, we realise several benefits. Firstly, SMP achieves good performance across sparsity levels from 80% to 99.1% (a 111× reduction in NNZ parameters). This is in contrast with competing methods zhu2017prune ; see2016compression , where there is a significant performance drop-off starting from a sparsity level of 90% (see Sec. 5.2). Secondly, our SMP sparsity loss allows explicit control of the overall pruning ratio and desired compression by simply specifying the target sparsity . The pruning ratio for each layer is also determined automatically (see Fig. 8). In contrast, works like srinivas2017training ; louizos2018learning control sparsity levels indirectly via a set of regularisation hyperparameters. Last but not least, SMP can be easily implemented on top of any model and integrated seamlessly into a typical training process. Only 2 main hyperparameters need to be tuned (the gating learning rate and ), instead of up to 4 as in zhu2017prune ; narang2017exploring . Since pruning is performed in parallel with training, we avoid the complexities and costs associated with iterative train-and-prune dai2020grow or reinforcement learning techniques he2018amc . The complexities associated with variational pruning chirkova2018bayesian , such as the local reparameterisation trick, can also be avoided.
7 Conclusion and Future Work
This paper presented empirical results on the effectiveness of unstructured weight pruning methods on various image captioning architectures, including RNN and Transformer architectures. In addition, we presented an effective end-to-end weight pruning method – Supermask Pruning – that performs continuous and gradual sparsification based on parameter sensitivity. Subsequently, the pruning schemes were extended with encoder pruning, where we showed that conducting decoder pruning and training simultaneously provides good performance. We also demonstrated that with appropriate pruning methods, ideal sparsity levels can be found in the range of 80% to 95%; these sparse networks can match or outperform their dense counterparts. Finally, we showed that for a given performance level, a large-sparse LSTM captioning model outperforms a small-dense one in terms of model costs. In short, this is the first extensive attempt at exploring unstructured model pruning for image captioning. We hope that this work can spur new research interest in this direction and subsequently serve as a benchmark for future image captioning pruning works.
We believe that this work opens up a sea of directions for future work. Firstly, optimised sparse matrix multiplication kernels and block-sparsity patterns can be implemented in order to realise speed-ups at inference time. Secondly, there are many other pruning methods that are yet to be tested, including variational pruning and saliency-based methods.
References
 (1) R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: IEEE CVPR, 2015, pp. 4566–4575.
 (2) A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: IEEE CVPR, 2015, pp. 3128–3137.
 (3) S. Herdade, A. Kappeler, K. Boakye, J. Soares, Image captioning: Transforming objects into words, in: NeurIPS, 2019, pp. 11137–11147.
 (4) M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara, Meshed-Memory Transformer for Image Captioning, in: IEEE/CVF CVPR, 2020, pp. 10578–10587.
 (5) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: ECCV, 2014, pp. 740–755.
 (6) K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: ICML, 2015, pp. 2048–2057.
 (7) E. Elsen, M. Dukhan, T. Gale, K. Simonyan, Fast sparse ConvNets, in: IEEE/CVF CVPR, 2020, pp. 14629–14638.
 (8) Z. Wang, SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference, in: Proceedings of the ACM International Conference on PACT, 2020, pp. 31–42.
 (9) N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, K. Kavukcuoglu, Efficient Neural Audio Synthesis, in: ICML, Vol. 80, 2018, pp. 2415–2424.
 (10) S. Han, J. Pool, J. Tran, W. J. Dally, Learning both weights and connections for efficient neural networks, in: NIPS, 2015, pp. 1135–1143.
 (11) M. Zhu, S. Gupta, To prune, or not to prune: exploring the efficacy of pruning for model compression, ICLR, Workshop Track Proceedings (2018).
 (12) N. Chirkova, E. Lobacheva, D. Vetrov, Bayesian Compression for Natural Language Processing, in: Proceedings of EMNLP, 2018, pp. 2910–2915.
 (13) C. Louizos, M. Welling, D. P. Kingma, Learning Sparse Neural Networks through L0 Regularization, ICLR (2018).
 (14) N. Lee, T. Ajanthan, P. H. Torr, SNIP: Single-shot network pruning based on connection sensitivity, ICLR (2019).
 (15) X. Dai, H. Yin, N. Jha, Grow and Prune Compact, Fast, and Accurate LSTMs, IEEE Transactions on Computers 69 (3) (2020) 441–452.
 (16) W. Wen, Y. He, S. Rajbhandari, M. Zhang, W. Wang, F. Liu, B. Hu, Y. Chen, H. Li, Learning Intrinsic Sparse Structures within Long Short-Term Memory, ICLR (2018).
 (17) J. Frankle, M. Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks, ICLR (2019).
 (18) P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: IEEE CVPR, 2018, pp. 6077–6086.
 (19) M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM CSUR 51 (6) (2019) 1–36.
 (20) Y. H. Tan, C. S. Chan, Phrase-based image caption generator with hierarchical LSTM network, Neurocomputing 333 (2019) 86–100.
 (21) K. Fu, J. Li, J. Jin, C. Zhang, Image-text surgery: Efficient concept learning in image captioning by generating pseudo-pairs, IEEE TNNLS 29 (12) (2018) 5910–5921.
 (22) H. Chen, G. Ding, Z. Lin, S. Zhao, J. Han, Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning, in: IJCAI, 2018, pp. 606–612.
 (23) G. Ding, M. Chen, S. Zhao, H. Chen, J. Han, Q. Liu, Neural image caption generation with weighted training and reference, Cognitive Computation 11 (6) (2019) 763–777.
 (24) J. Ji, Z. Du, X. Zhang, Divergent-convergent attention for image captioning, Pattern Recognition 115 (2021) 107928.
 (25) Q. Wang, W. Huang, X. Zhang, X. Li, Word-sentence framework for remote sensing image captioning, IEEE Transactions on Geoscience and Remote Sensing (2020).
 (26) S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-Critical Sequence Training for Image Captioning, in: IEEE CVPR, 2017, pp. 1179–1195.
 (27) H. Chen, G. Ding, S. Zhao, J. Han, Temporal-difference learning with sampling baseline for image captioning, in: AAAI, 2018, pp. 6706–6713.
 (28) S. N. Parameswaran, Exploring Memory and Time Efficient Neural Networks for Image Captioning, in: NCVPRIPG, Springer, 2017, pp. 338–347.
 (29) J. H. Tan, C. S. Chan, J. H. Chuah, COMIC: Toward A Compact Image Captioning Model With Attention, IEEE TMM 21 (10) (2019) 2686–2696.
 (30) J.-H. Luo, J. Wu, AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference, Pattern Recognition 107 (2020) 107461.
 (31) Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, J. Zhu, Discrimination-aware channel pruning for deep neural networks, in: NeurIPS, 2018, pp. 875–886.
 (32) M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, L. Shao, HRank: Filter Pruning using High-Rank Feature Map, in: IEEE/CVF CVPR, 2020, pp. 1529–1538.
 (33) B. Li, B. Wu, J. Su, G. Wang, L. Lin, EagleEye: Fast Subnet Evaluation for Efficient Neural Network Pruning, in: ECCV, 2020, pp. 639–654.
 (34) M. Lin, R. Ji, Y. Zhang, B. Zhang, Y. Wu, Y. Tian, Channel Pruning via Automatic Structure Search, in: IJCAI, 2020, pp. 673–679.
 (35) N. Yu, C. Weber, X. Hu, Learning Sparse Hidden States in Long ShortTerm Memory, in: ICANN, Springer, 2019, pp. 288–298.
 (36) L. Wen, X. Zhang, H. Bai, Z. Xu, Structured pruning of recurrent neural networks through neuron selection, Neural Networks 123 (2020) 134–141.
 (37) E. J. Crowley, J. Turner, A. Storkey, M. O’Boyle, Pruning neural networks: Is it time to nip it in the bud?, in: Workshop on Compact DNNs with Industrial Applications, NIPS, 2018, pp. 1–10.
 (38) Z. Liu, M. Sun, T. Zhou, G. Huang, T. Darrell, Rethinking the value of network pruning, ICLR (2018).
 (39) C. Wang, R. Grosse, S. Fidler, G. Zhang, EigenDamage: Structured Pruning in the KroneckerFactored Eigenbasis, in: ICML, 2019, pp. 6566–6575.
 (40) A. See, M.-T. Luong, C. D. Manning, Compression of Neural Machine Translation Models via Pruning, in: Proceedings of The SIGNLL CoNLL, ACL, 2016, pp. 291–301.
 (41) S. Narang, E. Elsen, G. Diamos, S. Sengupta, Exploring Sparsity in Recurrent Neural Networks, ICLR (2017).
 (42) H. Zhou, J. Lan, R. Liu, J. Yosinski, Deconstructing lottery tickets: Zeros, signs, and the supermask, in: NeurIPS, 2019, pp. 3597–3607.
 (43) S. Srinivas, A. Subramanya, R. Venkatesh Babu, Training sparse neural networks, in: IEEE CVPR Workshops, 2017, pp. 138–145.
 (44) X. Dai, H. Yin, N. Jha, NeST: A neural network synthesis tool based on a grow-and-prune paradigm, IEEE Transactions on Computers 68 (10) (2019) 1487–1497.
 (45) Y. Bengio, N. Léonard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432 (2013).
 (46) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML, Vol. 37, 2015, pp. 448–456.
 (47) D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, ICLR (2014).
 (48) S. Ye, J. Han, N. Liu, Attentive Linear Transformation for Image Captioning, IEEE TIP 27 (11) (2018) 5514–5524.
 (49) J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng, T. Tan, Learning visual relationship and contextaware attention for image captioning, Pattern Recognition 98 (2020) 107075.
 (50) R. Luo, A Better Variant of SelfCritical Sequence Training, arXiv preprint arXiv:2003.09971 (2020).
 (51) K. Shi, K. Yu, Structured Word Embedding for Low Memory Neural Network Language Model, in: Interspeech, 2018, pp. 1254–1258.
 (52) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
 (53) S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, S. Bengio, Generating Sentences from a Continuous Space, in: Proceedings of the SIGNLL CoNLL, ACL, 2016, pp. 10–21.
 (54) A. Achille, M. Rovere, S. Soatto, Critical learning periods in deep networks, ICLR (2018).
 (55) Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, S. Han, AMC: AutoML for model compression and acceleration on mobile devices, in: ECCV, 2018, pp. 784–800.