1 Introduction
Owing to the rapid development of neural sequence-to-sequence modeling [sutskever2014sequence, bahdanau2014neural], deep neural network (DNN)-based end-to-end automatic speech recognition (ASR) systems have become almost as effective as the traditional hidden Markov model-based systems [chiu2018state, luscher2019rwth, karita2019a]. Various models and approaches have been proposed for improving the performance of the autoregressive (AR) end-to-end ASR model with the encoder-decoder architecture based on recurrent neural networks (RNNs) [chorowski2015attention, chan2016listen, kim2017joint] and Transformers [vaswani2017attention, dong2018speech, karita2019improving].
Contrary to the autoregressive framework, non-autoregressive (NAR) sequence generation has attracted attention, including the revisitation of connectionist temporal classification (CTC) [graves2006connectionist, libovicky2018end] and the growing interest in the non-autoregressive Transformer (NAT) [gu2017non]. While an autoregressive model requires L iterations to generate an L-length target sequence, a non-autoregressive model requires only a constant number of iterations, independent of the length of the target sequence. Despite this limitation on the number of decoding iterations, some recent studies in neural machine translation have successfully shown the effectiveness of non-autoregressive models, achieving results comparable to those of autoregressive models. Different types of non-autoregressive models have been proposed based on iterative refinement decoding
[lee2018deterministic], insertion- or edit-based sequence generation [stern2019insertion, gu2019levenshtein], the masked language model objective [ghazvininejad2019mask, ghazvininejad2020semi, saharia2020non], and generative flow [ma2019flowseq].
Some attempts have also been made to realize non-autoregressive models in speech recognition. CTC introduces a frame-wise latent alignment to represent the alignment between the input speech frames and the output tokens [graves2014towards]. While CTC makes use of dynamic programming to efficiently calculate the most probable alignment, the strong conditional independence assumption between output tokens results in poor performance compared to autoregressive models [battenberg2017exploring]. On the other hand, [chen2019non] trains a Transformer encoder-decoder in a mask-predict manner [ghazvininejad2019mask]: target tokens are randomly masked and predicted conditioning on the unmasked tokens and the input speech. To generate the output sequence in parallel during inference, the target sequence is initialized as all masked tokens, and the output length is predicted by finding the position of the end-of-sequence token. However, with this prediction of the output length, the model is known to be vulnerable to long output sequences: at the beginning of decoding, the model is likely to make more mistakes in predicting a long masked sequence, propagating the errors to later decoding steps. [chan2020imputer]
proposes Imputer, which performs mask prediction over CTC's latent alignments to eliminate the output length prediction. However, unlike mask-predict, Imputer requires more computation in each iteration, proportional to the square of the input length in the self-attention layers, and the total computational cost can become very large.
Our work aims to obtain a non-autoregressive end-to-end ASR model that generates sequences at the token level with low computational cost. The proposed Mask CTC framework trains a Transformer encoder-decoder model with both CTC and mask-predict objectives. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence tokens are masked based on the CTC probabilities. The masked low-confidence tokens are predicted conditioning on the high-confidence tokens, not only in the past but also in the future context. The advantages of Mask CTC are summarized as follows.
No requirement for output length prediction: Predicting the output token length from input speech is rather challenging because the length of the input utterances varies greatly depending on the speaking rate or the duration of silence. By initializing the target sequence with the CTC outputs, Mask CTC does not have to predict the output length at the beginning of decoding.
Accurate and fast decoding: We observed that the CTC outputs themselves are quite accurate. Mask CTC not only retains the correct tokens in the CTC outputs but also recovers the output errors by considering the entire context. Token-level iterative decoding with a small number of masks makes the model well-suited for use in real scenarios.
2 Mask CTC framework
The objective of end-to-end ASR is to model the joint probability of an L-length output sequence Y = (y_l ∈ V | l = 1, ..., L) given a T-length input sequence X = (x_t ∈ R^D | t = 1, ..., T). Here, y_l is an output token at position l in the vocabulary V, and x_t is a D-dimensional acoustic feature at frame t.
The following subsections first explain a conventional autoregressive framework based on an attention-based encoder-decoder and CTC. Then a non-autoregressive model trained with mask prediction is explained, and finally, the proposed Mask CTC decoding method is introduced.
2.1 Attention-based encoder-decoder
The attention-based encoder-decoder models the joint probability of Y given X by factorizing the probability based on the probabilistic left-to-right chain rule as follows:

P_att(Y | X) = ∏_{l=1}^{L} P(y_l | y_{<l}, X).   (1)
The model estimates the output token y_l at each time step l conditioning on the previously generated tokens y_{<l} in an autoregressive manner. In general, the ground truth tokens are used as the history tokens during training, and the predicted tokens are used during inference.
2.2 Connectionist temporal classification
CTC predicts a frame-level alignment A = (a_t ∈ V ∪ {<blank>} | t = 1, ..., T) between the input sequence X and the output sequence Y by introducing a special <blank> token. The alignment is predicted with the conditional independence assumption between the output tokens as follows:

P(A | X) = ∏_{t=1}^{T} P(a_t | X).   (2)
Considering the probability distribution over all possible alignments, CTC models the joint probability of Y given X as follows:

P_ctc(Y | X) = ∑_{A ∈ β^{-1}(Y)} P(A | X),   (3)

where β^{-1}(Y) returns all possible alignments compatible with Y. The summation of the probabilities over all of the alignments can be computed efficiently by using dynamic programming.
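To make the collapse operation β concrete, here is a minimal sketch (the function name and blank symbol are illustrative, not from the paper); Eq. (3) sums P(A | X) over every alignment A that this function maps to Y:

```python
BLANK = "<b>"  # stand-in symbol for CTC's <blank> token

def collapse(alignment):
    """The CTC collapse function (beta): merge consecutive repeated
    tokens, then remove blanks, mapping a frame-level alignment to an
    output token sequence."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != BLANK:
            out.append(a)
        prev = a
    return out

# Two different alignments that beta maps to the same output "cat";
# both therefore belong to the inverse image beta^{-1}("cat"):
print(collapse(["c", "c", BLANK, "a", "t", "t"]))   # ['c', 'a', 't']
print(collapse([BLANK, "c", BLANK, "a", "a", "t"]))  # ['c', 'a', 't']
```

The dynamic-programming trick in CTC avoids enumerating β^{-1}(Y) explicitly; the sketch above only illustrates which alignments the sum in Eq. (3) ranges over.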
To achieve robust alignment training and fast convergence, an end-to-end ASR model based on the attention-based encoder-decoder framework is often trained jointly with CTC [kim2017joint, karita2019improving]. The objective of the autoregressive joint CTC-attention model is defined as follows by combining Eq. (1) and Eq. (3):

L_AR = λ log P_ctc(Y | X) + (1 − λ) log P_att(Y | X),   (4)

where λ (0 ≤ λ ≤ 1) is a tunable parameter.
2.3 Joint CTC-CMLM non-autoregressive ASR
Mask CTC adopts non-autoregressive speech recognition [chen2019non] based on a conditional masked language model (CMLM) [ghazvininejad2019mask], where the model is trained to predict masked tokens in the target sequence [devlin2019bert]. (Note that the CMLM is used as an ASR decoder network conditioned on the encoder output, and it is different from an external language model often used in shallow fusion during decoding.) Taking advantage of the Transformer's parallel computation [vaswani2017attention], the CMLM can predict any arbitrary subset of masked tokens in the target sequence by attending to the entire sequence, including tokens in both the past and the future.
The CMLM predicts a set of masked tokens Y_mask conditioning on the input sequence X and the observed (unmasked) tokens Y_obs as follows:

P_cmlm(Y_mask | Y_obs, X) = ∏_{y ∈ Y_mask} P(y | Y_obs, X),   (5)

where Y_obs = Y \ Y_mask. During training, the ground truth tokens are randomly replaced by a special <MASK> token, and the CMLM is trained to predict the original tokens conditioning on the input sequence X and the unmasked tokens Y_obs. The number of tokens to be masked is sampled from a uniform distribution between 1 and L as in [ghazvininejad2019mask]. During inference, the target sequence is gradually generated in a constant number of iterations by the iterative decoding algorithm [ghazvininejad2019mask], which repeatedly masks and predicts subsets of the target sequence.
We observed that applying the original CMLM to non-autoregressive speech recognition shows poor performance, with the problem of skipping and repeating output tokens. To deal with this, we found that joint training with CTC, similar to [kim2017joint], explicitly provides the model with absolute positional information (via CTC's conditional independence between frames) and improves the model performance reasonably well. With the CTC objective from Eq. (3) and Eq. (5), the objective of joint CTC-CMLM training for the non-autoregressive ASR model is defined as follows:

L_NAR = γ log P_ctc(Y | X) + (1 − γ) log P_cmlm(Y_mask | Y_obs, X),

where γ (0 ≤ γ ≤ 1) is a tunable parameter.
2.4 Mask CTC decoding
Non-autoregressive models must know the length of the output sequence to predict the entire sequence in parallel. For example, at the beginning of CMLM decoding, the output length must be given to initialize the target sequence with masked tokens. To deal with this problem in machine translation, the output length is predicted by training a fertility model [gu2017non] or by introducing a special <LENGTH> token in the encoder [ghazvininejad2019mask]. In speech recognition, however, due to the different characteristics of the input acoustic signals and the output linguistic symbols, predicting the output length appears to be rather challenging; e.g., the length of input utterances with the same transcription varies greatly depending on the speaking rate or the duration of silence. [chen2019non] simply makes the decoder predict the position of the <EOS> token to handle the output length. However, their analysis shows that this prediction is vulnerable to long output sequences, because the model is likely to make more mistakes in predicting a long masked sequence, and the errors are propagated to later decoding steps, which degrades the recognition performance. To compensate for this problem, they use beam search with CTC and a language model to obtain reasonable performance, which slows down the overall decoding and makes the advantage of the non-autoregressive framework less effective.
To tackle this problem regarding the initialization of the target sequence, we consider using the CTC outputs as the initial sequence for decoding. Figure 1 shows the decoding of Mask CTC based on the inference of CTC. The CTC outputs are first obtained through a single calculation of the encoder, and the decoder then refines the CTC outputs by attending to the whole sequence.
In this work, we use the "greedy" result of CTC, Ŷ = (ŷ_i | i = 1, ..., L'), which is obtained without using prefix search [graves2006connectionist], to keep the inference algorithm non-autoregressive. The errors caused by the conditional independence assumption are expected to be corrected by the CMLM decoder. The posterior probability of ŷ_i is approximately calculated by using the frame-level CTC probabilities as follows:

P̂(ŷ_i | X) = max_j P(a_j | X),   (6)

where j indexes the consecutive identical alignment frames a_j that correspond to the aggregated token ŷ_i. Then, part of Ŷ is masked out based on the confidence P̂(ŷ_i | X) as follows:
y_i = ŷ_i      if P̂(ŷ_i | X) ≥ P_thres,
y_i = <MASK>   otherwise,   (7)

where P_thres is a threshold that decides whether the target token is masked or not. Finally, Y_mask is predicted conditioning on the high-confidence tokens Y_obs and the input sequence X as in Eq. (5).
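The masking step of Eqs. (6)–(7) can be sketched as follows; the token sequence and per-token confidences are toy values (in the real model the confidences come from the frame-level CTC probabilities), and the function name is our own:

```python
MASK = "<MASK>"

def mask_low_confidence(ctc_tokens, confidences, p_thres):
    """Keep high-confidence greedy-CTC tokens as Y_obs and replace the
    rest with <MASK>; return the initial sequence and masked indices."""
    y_init, masked_pos = [], []
    for i, (tok, p) in enumerate(zip(ctc_tokens, confidences)):
        if p < p_thres:
            y_init.append(MASK)
            masked_pos.append(i)
        else:
            y_init.append(tok)
    return y_init, masked_pos

# Toy greedy CTC output with per-token confidences:
tokens = ["s", "e", "a", "f", "o", "o", "d"]
confs  = [0.99, 0.42, 0.99, 0.99, 0.55, 0.99, 0.99]
y_init, pos = mask_low_confidence(tokens, confs, p_thres=0.9)
# y_init -> ['s', '<MASK>', 'a', 'f', '<MASK>', 'o', 'd'], pos -> [1, 4]
```

With a high threshold such as the 0.999 used for WSJ, only tokens the CTC head is very sure about survive unmasked.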
We also investigated applying one of the iterative decoding methods, called easy-first [goldberg2010efficient, chen2019non]. Starting with the masked CTC output, the masked tokens are gradually predicted by confidence based on the CMLM probability. In the n-th decoding iteration, each masked position i is updated as follows:

ŷ_i = argmax_w P_cmlm(y_i = w | Y_obs^{(n)}, X),   (8)

where the top C masked tokens with the highest probabilities are predicted in each iteration. By defining C = ⌈L_mask / K⌉, where L_mask is the number of masked tokens, the total number of decoding iterations can be kept to a constant K.
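The easy-first loop might be sketched as below. Here `predict` stands in for the CMLM decoder, returning a (token, probability) guess for every masked position, and C = ⌈L_mask / K⌉ predictions are committed per iteration; all names are illustrative assumptions, not ESPnet code.

```python
import math

MASK = "<MASK>"

def easy_first_decode(y_init, predict, n_iter):
    """Fill masked positions over at most n_iter iterations, committing
    the C most confident CMLM predictions in each iteration."""
    y = list(y_init)
    l_mask = sum(t == MASK for t in y)
    if l_mask == 0:
        return y
    c = math.ceil(l_mask / n_iter)  # C = ceil(L_mask / K)
    while MASK in y:
        cand = predict(y)  # {position: (token, probability)}
        best = sorted(cand.items(), key=lambda kv: -kv[1][1])[:c]
        for pos, (tok, _) in best:
            y[pos] = tok
    return y

# Toy stand-in for the CMLM: always guesses the reference token with a
# fixed confidence per position.
ref  = ["s", "e", "a", "f", "o", "o", "d"]
conf = [0.9, 0.8, 0.9, 0.9, 0.7, 0.9, 0.9]

def predict(y):
    return {i: (ref[i], conf[i]) for i, t in enumerate(y) if t == MASK}

print(easy_first_decode(["s", MASK, "a", "f", MASK, "o", "d"], predict, n_iter=2))
```

With n_iter = 2 and two masks, one token is committed per iteration, the more confident position first; setting n_iter equal to the number of masks reproduces the one-mask-per-iteration setting reported in the experiments.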
With the proposed non-autoregressive training and decoding of Mask CTC, the model does not have to take care of predicting the output length. Moreover, decoding by refining the CTC outputs with mask prediction is expected to compensate for the errors that come from the conditional independence assumption.
3 Experiments
To evaluate the effectiveness of Mask CTC, we conducted speech recognition experiments comparing different end-to-end ASR models using ESPnet [watanabe2018espnet]. The performance of the models was evaluated based on character error rates (CERs) or word error rates (WERs) without relying on external language models.
Table 1: Word error rates (WERs) and real time factors (RTFs) for WSJ.

Model                             Iterations  dev93  eval92  RTF
Autoregressive
  CTC-attention [kim2017joint]    –           14.4   11.3    0.97
   + beam search                  –           13.5   10.9    4.62
Non-autoregressive
  CTC                             1           22.2   17.9    0.03
  Mask CTC ()                     1           16.3   12.9    0.03
  Mask CTC                        1           15.7   12.5    0.04
  Mask CTC                        5           15.5   12.2    0.05
  Mask CTC                        10          15.5   12.1    0.07
  Mask CTC                        #mask       15.4   12.1    0.13
  CTC [chan2020imputer]           1           –      15.2    –
  Imputer (IM) [chan2020imputer]  8           –      16.5    –
  Imputer (DP) [chan2020imputer]  8           –      12.7    –
Table 2: Word error rates (WERs) for Voxforge.

Model                           Iterations  Dev   Test
Autoregressive
  CTC-attention [kim2017joint]  –           35.5  35.5
   + beam search                –           35.4  35.7
Non-autoregressive
  CTC                           1           53.8  56.1
  Mask CTC ()                   1           41.6  40.2
  Mask CTC                      1           41.3  40.1
  Mask CTC                      5           40.7  39.4
  Mask CTC                      10          40.5  39.2
  Mask CTC                      #mask       40.4  39.0
3.1 Datasets
The experiments were carried out using three tasks with different languages and amounts of training data: the 81-hour Wall Street Journal (WSJ) in English [paul1992design], the 581-hour Corpus of Spontaneous Japanese (CSJ) in Japanese [maekawa2003corpus], and the 16-hour Voxforge in Italian [voxforge]. For the network inputs, we used 80 mel-scale filterbank coefficients with three-dimensional pitch features and applied SpecAugment [park2019specaugment] during model training. For the tokenization of the targets, we used characters: the Latin alphabet for English and Italian, and Japanese syllable characters (Kana) and Chinese characters (Kanji) for Japanese.
3.2 Experimental setup
For the experiments in all of the tasks, we adopted the same encoder-decoder architecture as [karita2019improving], which consists of Transformer self-attention layers with 4 attention heads, 256 hidden units, and a 2048-dimensional feed-forward inner layer. The encoder consisted of 12 self-attention layers with convolutional layers for down-sampling, and the decoder consisted of 6 self-attention layers. With the mask-predict objective, training the Mask CTC model required more epochs to converge (about 200 – 500) than the autoregressive models (about 50 – 100). The final autoregressive model was obtained by averaging the model parameters of the last 10 epochs as in [karita2019a]. For the Mask CTC model, we found that the model performance was significantly improved by averaging the model parameters of the 10 – 30 epochs with the top validation accuracies. For the threshold P_thres in Eq. (7), we used 0.999, 0.999, and 0.9 for WSJ, Voxforge, and CSJ, respectively. For all of the tasks, the loss weights λ in Eq. (4) and γ in the joint CTC-CMLM objective of Sec. 2.3 were set to 0.3 and 0.3, respectively.
3.3 Evaluated models

CTC-attention: An autoregressive model trained with the joint CTC-attention objective as in Eq. (4). During inference, joint CTC-attention decoding is applied with beam search [hori2017joint].

CTC: A non-autoregressive model simply trained with the CTC objective.

Mask CTC: The proposed non-autoregressive model trained with the joint CTC-CMLM objective of Sec. 2.3 and decoded as described in Sec. 2.4.
3.4 Results
Table 3: Character error rates (CERs) and sentence error rates (SERs) for CSJ.

Model                           Eval1         Eval2         Eval3
                                CER    SER    CER    SER    CER    SER
Autoregressive
  CTC-attention [kim2017joint]  6.37   57.0   4.76   53.7   5.40   39.6
   + beam search                6.21   56.8   4.50   53.4   5.15   40.1
Non-autoregressive
  CTC                           6.51   59.7   4.71   59.5   5.49   44.5
  Mask CTC ()                   6.56   60.3   4.69   57.0   4.97   41.9
  Mask CTC ()                   6.56   58.7   4.57   55.5   4.96   40.7
Table 1 shows the results for WSJ based on WERs and real time factors (RTFs), which were measured for decoding eval92 on an Intel(R) Core(TM) i9-7980XE 2.60 GHz CPU. Comparing the results for the non-autoregressive models, we can see that the greedy CTC outputs of Mask CTC outperformed the simple CTC model thanks to training with the mask-predict objective. By applying the refinement based on the proposed CTC masking, the model performance was steadily improved. The performance was further improved by increasing the number of decoding iterations, and the best performance was obtained with #mask iterations, i.e., one mask predicted in each iteration. The results of Mask CTC are reasonable compared to the results of prior work [chan2020imputer]. Our models also approached the results of the autoregressive models from the initial CTC result. In terms of the decoding speed measured in RTF, Mask CTC is, at most, 116 times faster than the autoregressive model with beam search. Since most of the CTC outputs are fairly accurate and the number of masks is quite small, there was little degradation in speed as the number of decoding iterations was increased.
Figure 2 shows an example decoding process for a sample in the WSJ evaluation set. Here, we can see that the CTC outputs include errors mainly coming from substitutions due to incomplete word spelling. By applying Mask CTC decoding, the spelling errors were successfully recovered by considering the conditional dependence between characters at the word level. However, as can be seen in the error for "sifood," Mask CTC cannot recover errors derived from character-level insertions or deletions, because the length allocated to each word is fixed by the CTC outputs.
Table 2 shows WERs for Voxforge. Mask CTC yielded better scores than the standard CTC model, similar to the results for WSJ, demonstrating that our model can be adapted to other languages with a relatively small amount of training data.
Table 3 shows character error rates (CERs) and sentence error rates (SERs) for CSJ. While Mask CTC performed quite close to, or even better than, the autoregressive model in terms of CERs, the results showed only a small improvement over the simple CTC model compared to the results of the aforementioned tasks. Since Japanese includes a large number of characters and the characters themselves often form words, the simple CTC model seems to handle the short-range dependences between characters reasonably well, yielding almost the same scores without applying Mask CTC. However, when we look at the sentence-level results, we observe clear improvements for all of the evaluation sets, again showing that our model effectively recovers the CTC errors by considering the conditional dependence.
These experimental results on different tasks indicate that the Mask CTC framework is especially effective for languages whose tokens are small units (e.g., the Latin alphabet and other phonemic scripts). Investigating its effectiveness with byte pair encodings (BPEs) [sennrich2016neural] for such languages is left as future work.
4 Conclusions
This paper proposed Mask CTC, a novel non-autoregressive end-to-end speech recognition framework, which generates a sequence by refining the CTC outputs based on mask prediction. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence masked tokens are iteratively refined conditioning on the other unmasked tokens and the input speech features. The experimental comparisons demonstrated that Mask CTC outperformed the standard CTC model while keeping the decoding fast. Mask CTC approached the results of autoregressive models; especially for CSJ, the results were comparable or even better. Our future plan is to reduce the gap between the masking strategies used in training (random masking) and inference (masking based on CTC outputs). Furthermore, we plan to explore the integration of external language models (e.g., BERT [devlin2019bert]) into the Mask CTC framework.