The decoding command Kaldi officially provides is online2-wav-nnet3-latgen-faster. I already posted its source code in "基于kaldi的iOS語(yǔ)音識(shí)別(本地)+05+解碼", so here I will walk through its decoding process in detail; the improvements and modifications will come up later, when we cover real-time streaming decoding.
First, let's look at how it loads the model:
TransitionModel trans_model;
nnet3::AmNnetSimple am_nnet;
{
  bool binary;
  Input ki(nnet3_rxfilename, &binary);
  trans_model.Read(ki.Stream(), binary);
  am_nnet.Read(ki.Stream(), binary);
  SetBatchnormTestMode(true, &(am_nnet.GetNnet()));
  SetDropoutTestMode(true, &(am_nnet.GetNnet()));
  nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));
}
nnet3_rxfilename is the final.mdl model file we pass in.
In nnet3, AmNnetSimple is a standard acoustic-model class; it delegates the actual neural-network operations to the Nnet class.
The HMM model in Kaldi is really just a TransitionModel object. It describes the HMM topology of the phones, stores the information relating pdf-ids to transition-ids, and can convert between these various identifiers.
We won't go into AmNnetSimple or TransitionModel in depth here; knowing what they are for is enough. The sketch below gives a taste of the conversions involved.
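A minimal illustration (not part of this binary; the calls are from the public APIs of Kaldi's TransitionModel and AmNnetSimple):

int32 tid = 100;  // some transition-id, e.g. read off a lattice arc
int32 pdf_id = trans_model.TransitionIdToPdf(tid);   // index into the nnet output layer
int32 phone = trans_model.TransitionIdToPhone(tid);  // the phone this transition belongs to
// The acoustic model's output dimension must agree with the transition model:
KALDI_ASSERT(am_nnet.NumPdfs() == trans_model.NumPdfs());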
nnet3::DecodableNnetSimpleLoopedInfo decodable_info(decodable_opts, &am_nnet);
This object holds precomputed information that is shared by all the decodable objects the program creates.
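decodable_opts here is an nnet3::NnetSimpleLoopedComputationOptions parsed from the command line. A rough sketch of the fields that matter most (the values shown are typical for chain models, not required defaults):

nnet3::NnetSimpleLoopedComputationOptions decodable_opts;
decodable_opts.acoustic_scale = 1.0;          // chain models are decoded with scale 1.0
decodable_opts.frame_subsampling_factor = 3;  // nnet emits one output per 3 input frames
decodable_opts.frames_per_chunk = 50;         // nnet evaluation chunk size, in frames
nnet3::DecodableNnetSimpleLoopedInfo decodable_info(decodable_opts, &am_nnet);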
fst::Fst<fst::StdArc> *decode_fst = ReadFstKaldiGeneric(fst_rxfilename);
fst::SymbolTable *word_syms = NULL;
if (word_syms_rxfilename != "")
  if (!(word_syms = fst::SymbolTable::ReadText(word_syms_rxfilename)))
    KALDI_ERR << "Could not read symbol table from file "
              << word_syms_rxfilename;
fst_rxfilename corresponds to the HCLG.fst file, and word_syms_rxfilename corresponds to the words.txt file.
SequentialTokenVectorReader spk2utt_reader(spk2utt_rspecifier);
RandomAccessTableReader<WaveHolder> wav_reader(wav_rspecifier);
CompactLatticeWriter clat_writer(clat_wspecifier);
Here we create the spk2utt_reader (speaker-to-utterances mapping), wav_reader (the audio to be recognized), and clat_writer (lattice output) objects, as the example invocation below shows.
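For reference, a typical invocation looks roughly like this (all file names are placeholders; the five positional arguments map to nnet3_rxfilename, fst_rxfilename, spk2utt_rspecifier, wav_rspecifier and clat_wspecifier, in that order):

online2-wav-nnet3-latgen-faster --online=true \
  --config=online.conf --word-symbol-table=words.txt \
  final.mdl HCLG.fst ark:spk2utt scp:wav.scp ark:lat.1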
for (; !spk2utt_reader.Done(); spk2utt_reader.Next()) {
  std::string spk = spk2utt_reader.Key();
  const std::vector<std::string> &uttlist = spk2utt_reader.Value();
  OnlineIvectorExtractorAdaptationState adaptation_state(
      feature_info.ivector_extractor_info);
  for (size_t i = 0; i < uttlist.size(); i++) {
    std::string utt = uttlist[i];
    if (!wav_reader.HasKey(utt)) {
      KALDI_WARN << "Did not find audio for utterance " << utt;
      num_err++;
      continue;
    }
    ...
  }
}
We can ignore the speaker-level bookkeeping in this loop for now; everything that follows runs once per utterance inside the inner loop.
const WaveData &wave_data = wav_reader.Value(utt);
// get the data for channel zero (if the signal is not mono, we only
// take the first channel).
SubVector<BaseFloat> data(wave_data.Data(), 0);
The audio data is read by its utterance id.
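As a small aside (these lines are not in the binary), WaveData stores the samples as a matrix with one row per channel, which is why taking channel zero is simply taking row 0:

int32 num_channels = wave_data.Data().NumRows();  // one row per channel
int32 num_samples = wave_data.Data().NumCols();   // samples per channel
BaseFloat duration = wave_data.Duration();        // num_samples / SampFreq()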
OnlineNnet2FeaturePipeline feature_pipeline(feature_info);
feature_pipeline.SetAdaptationState(adaptation_state);
OnlineSilenceWeighting silence_weighting(trans_model,
                                         feature_info.silence_weighting_config,
                                         decodable_opts.frame_subsampling_factor);
SingleUtteranceNnet3Decoder decoder(decoder_opts,
                                    trans_model,
                                    decodable_info,
                                    *decode_fst, &feature_pipeline);
OnlineNnet2FeaturePipeline ties together the different parts of the feature-processing pipeline for the neural network.
OnlineSilenceWeighting is responsible for tracking the best-path traceback from the decoder (which is cheap to obtain) and computing a weighting of the data based on the classification of frames as silence (or non-silence); as the main loop below shows, these weights feed the i-vector estimation.
SingleUtteranceNnet3Decoder decodes a single utterance using the online nnet3 setup.
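feature_info itself is built earlier in main() from the command-line options, roughly like this (po is the ParseOptions object):

OnlineNnet2FeaturePipelineConfig feature_opts;
feature_opts.Register(&po);  // registers --mfcc-config, --ivector-extraction-config, etc.
po.Read(argc, argv);
OnlineNnet2FeaturePipelineInfo feature_info(feature_opts);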
BaseFloat samp_freq = wave_data.SampFreq();
int32 chunk_length;
if (chunk_length_secs > 0) {
  chunk_length = int32(samp_freq * chunk_length_secs);
  if (chunk_length == 0) chunk_length = 1;
} else {
  chunk_length = std::numeric_limits<int32>::max();  // process the whole file in one chunk
}
int32 samp_offset = 0;
std::vector<std::pair<int32, BaseFloat> > delta_weights;
while (samp_offset < data.Dim()) {
  int32 samp_remaining = data.Dim() - samp_offset;
  int32 num_samp = chunk_length < samp_remaining ? chunk_length
                                                 : samp_remaining;
  SubVector<BaseFloat> wave_part(data, samp_offset, num_samp);
  feature_pipeline.AcceptWaveform(samp_freq, wave_part);
  samp_offset += num_samp;
  decoding_timer.WaitUntil(samp_offset / samp_freq);  // simulate audio arriving in real time
  if (samp_offset == data.Dim()) {
    // no more input. flush out last frames
    feature_pipeline.InputFinished();
  }
  if (silence_weighting.Active() &&
      feature_pipeline.IvectorFeature() != NULL) {
    // Reweight frames for i-vector estimation based on the silence classification.
    silence_weighting.ComputeCurrentTraceback(decoder.Decoder());
    silence_weighting.GetDeltaWeights(feature_pipeline.NumFramesReady(),
                                      &delta_weights);
    feature_pipeline.IvectorFeature()->UpdateFrameWeights(delta_weights);
  }
  decoder.AdvanceDecoding();
  if (do_endpointing && decoder.EndpointDetected(endpoint_opts)) {
    break;
  }
}
The audio is handed to feature_pipeline one chunk at a time, and after each chunk decoding is advanced with decoder.AdvanceDecoding().
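For example, with this binary's default chunk_length_secs of 0.18 and 16 kHz audio, each iteration pushes int32(16000 * 0.18) = 2880 samples into the pipeline; with --chunk-length set to a value <= 0, the whole file is consumed in a single pass.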
decoder.FinalizeDecoding();
This completes the decoding; the result can now be extracted from the decoder.
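After FinalizeDecoding(), the binary pulls the lattice out of the decoder, writes it, and (in its GetDiagnosticsAndPrintOutput helper) maps the best path's word-ids back to words through word_syms. Condensed, the flow looks roughly like this:

CompactLattice clat;
decoder.GetLattice(true /* end_of_utterance */, &clat);
clat_writer.Write(utt, clat);  // write the full lattice

// Best path -> word sequence, using the words.txt symbol table.
CompactLattice best_path_clat;
CompactLatticeShortestPath(clat, &best_path_clat);
Lattice best_path_lat;
ConvertLattice(best_path_clat, &best_path_lat);
std::vector<int32> alignment, words;
LatticeWeight weight;
GetLinearSymbolSequence(best_path_lat, &alignment, &words, &weight);
for (size_t i = 0; i < words.size(); i++)
  std::cerr << word_syms->Find(words[i]) << ' ';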