Friday, 22 February 2019

Audio-Linguistic Embeddings for Spoken Sentences. (arXiv:1902.07817v1 [cs.SD])

We propose spoken sentence embeddings which capture both acoustic and linguistic content. While existing works operate at the character, phoneme, or word level, our method learns long-term dependencies by modeling speech at the sentence level. Formulated as an audio-linguistic multitask learning problem, our encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outperform phoneme- and word-level baselines on speech recognition and emotion recognition tasks. Ablation studies show that our embeddings can better model high-level acoustic concepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.
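The abstract only outlines the architecture at a high level. The sketch below is one plausible reading of that multitask encoder-decoder setup in PyTorch, not the authors' code: a shared sentence-level encoder produces a single embedding per utterance, which feeds two heads, one reconstructing the acoustic feature sequence and one predicting a text-derived linguistic target. The LSTM encoder, all dimensions, and the MSE losses are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpokenSentenceEmbedder(nn.Module):
    # Hypothetical model; feature sizes are placeholders, not from the paper.
    def __init__(self, n_acoustic=80, n_linguistic=300, d_embed=512):
        super().__init__()
        # Shared encoder: consumes a whole utterance of acoustic frames
        # (e.g., log-mel filterbanks) and summarizes it into one vector.
        self.encoder = nn.LSTM(n_acoustic, d_embed, batch_first=True)
        # Head 1: reconstruct the acoustic features at every frame.
        self.acoustic_head = nn.Linear(d_embed, n_acoustic)
        # Head 2: predict a sentence-level linguistic target, e.g. an
        # averaged text embedding of the transcript (an assumption here).
        self.linguistic_head = nn.Linear(d_embed, n_linguistic)

    def forward(self, frames):  # frames: (batch, time, n_acoustic)
        states, (h_n, _) = self.encoder(frames)
        embedding = h_n[-1]  # (batch, d_embed) sentence embedding
        acoustic_recon = self.acoustic_head(states)      # per-frame output
        linguistic_pred = self.linguistic_head(embedding)  # per-sentence output
        return embedding, acoustic_recon, linguistic_pred

# Multitask objective: a sum of the two reconstruction losses, so the
# embedding is pushed to retain both acoustic and linguistic content.
model = SpokenSentenceEmbedder()
frames = torch.randn(4, 200, 80)       # 4 utterances, 200 frames each
text_targets = torch.randn(4, 300)     # assumed text-derived targets
emb, ac_out, li_out = model(frames)
loss = nn.functional.mse_loss(ac_out, frames) \
     + nn.functional.mse_loss(li_out, text_targets)
loss.backward()
```

After training, only the shared encoder would be kept: its sentence embedding is what gets evaluated on downstream tasks such as speech recognition and emotion recognition.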



from cs updates on arXiv.org https://ift.tt/2VczYuq
