Result: End-to-End Lip-Reading Open Cloud-Based Speech Architecture.

Title:
End-to-End Lip-Reading Open Cloud-Based Speech Architecture.
Authors:
Jeon S; Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.
Kim MS; Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.
Source:
Sensors (Basel, Switzerland) [Sensors (Basel)] 2022 Apr 12; Vol. 22 (8). Date of Electronic Publication: 2022 Apr 12.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: MDPI
Country of Publication: Switzerland
NLM ID: 101204366
Publication Model: Electronic
Cited Medium: Internet
ISSN: 1424-8220 (Electronic)
Linking ISSN: 14248220
NLM ISO Abbreviation: Sensors (Basel)
Subsets: MEDLINE
Imprint Name(s):
Original Publication: Basel, Switzerland : MDPI, c2000-
Grant Information:
NRF-2018X1A3A1069795 National Research Foundation of Korea
Contributed Indexing:
Keywords: application programming interface; audio-visual speech recognition; deep neural networks; lip-reading; multi-modal interaction
Entry Date(s):
Date Created: 20220423 Date Completed: 20220426 Latest Revision: 20220429
Update Code:
20250114
PubMed Central ID:
PMC9029225
DOI:
10.3390/s22082938
PMID:
35458932
Database:
MEDLINE

Further Information

Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs), and noise-robust ASRs intended for deployment in diverse environments are under active development. This study proposes a noise-robust extension of OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated on the Google Voice Command Dataset v2 to identify the best-performing one. Based on this evaluation, the Microsoft API was integrated with Google's pretrained word2vec model to enrich the recognized keywords with semantic information. The extracted word vector was then integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. The vectors extracted from the API and vision branches were concatenated and then classified. The proposed architecture improved the average OCSR API accuracy by 14.42%, measured with standard ASR evaluation metrics across a range of signal-to-noise ratios. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.
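The late-fusion step described in the abstract (concatenating the word2vec keyword vector with the 3D-CNN visual feature vector, then classifying) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the 300-dimensional text vector matches Google's pretrained word2vec embeddings, but the 512-dimensional visual feature size, the 35-class keyword set, and the single linear classification layer are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical dimensions: 300-d word2vec keyword embedding (the size of
# Google's pretrained model), 512-d visual feature from the 3D-CNN
# lip-reading branch, and 35 keyword classes (all illustrative choices).
TEXT_DIM, VISUAL_DIM, NUM_CLASSES = 300, 512, 35

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(text_vec, visual_vec, weights, bias):
    """Late fusion: concatenate the two modality vectors, then score
    each keyword class with a single linear layer plus softmax."""
    fused = text_vec + visual_vec  # list concatenation -> 812-d vector
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy inputs standing in for real model outputs.
text_vec = [random.gauss(0, 1) for _ in range(TEXT_DIM)]
visual_vec = [random.gauss(0, 1) for _ in range(VISUAL_DIM)]
weights = [[random.gauss(0, 0.01) for _ in range(TEXT_DIM + VISUAL_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(text_vec, visual_vec, weights, bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

In a real system the text vector would come from looking up the OCSR API's transcribed keyword in the word2vec model, and the visual vector from the final pooled layer of one of the three 3D-CNN variants; the classifier would be trained jointly on the concatenated features.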