Comparative Analysis of Speech-to-Text APIs for Supporting Communication of the Deaf Community
DOI: https://doi.org/10.56705/ijodas.v6i3.327

Keywords: Speech-to-Text, API, Word Error Rate (WER), Words Per Minute (WPM), Deaf Community

Abstract
Hearing impairment can profoundly affect the mental and emotional state of those who experience it, hindering communication and delaying direct access to information by forcing reliance on interpreters. Advances in assistive technology, especially speech recognition systems that convert spoken language into written text (speech-to-text), offer a way to close this gap. Implementation, however, faces challenges related to the accuracy of each speech-to-text Application Programming Interface (API), which in turn requires an appropriate deep learning model. This study analyzes and compares the performance of three speech-to-text API services (Deepgram API, Google API, and Whisper AI) on Word Error Rate (WER) and Words Per Minute (WPM) to determine the most suitable API for a web-based real-time transcription system built with JavaScript and hosted on Glitch.com. Each service was tested by measuring its error rate and transcription speed; a lower WER and a higher WPM indicate better performance. On average, Whisper AI achieved a WER of 0% across all word categories, but its speed was lower than that of the other two APIs. The Deepgram API showed the best balance between accuracy and speed, with an average WER of 13.78% and 67 WPM. The Google API performed stably, but its WER was slightly higher than Deepgram's. Based on these results, the Deepgram API was judged the most suitable for live transcription: it produces fast transcriptions with an acceptable error rate, significantly increasing the accessibility of information for the deaf community.
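As a point of reference for the two metrics, the sketch below shows how they can be computed in JavaScript, the language used for the system described above. WER is taken in its standard form, WER = (S + D + I) / N: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the API's hypothesis, divided by the number of reference words; WPM is the transcribed word count over elapsed minutes. This is an illustrative implementation, not the study's own evaluation code.

// Word Error Rate: WER = (S + D + I) / N, computed as the word-level
// Levenshtein distance between reference and hypothesis, divided by
// the number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.toLowerCase().trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // dp[i][j] = minimum edits needed to align the first i reference
  // words with the first j hypothesis words (standard Levenshtein DP).
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution or match
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// Words Per Minute: transcribed word count over elapsed minutes.
function wordsPerMinute(transcript, elapsedSeconds) {
  const words = transcript.trim().split(/\s+/).filter(Boolean).length;
  return words / (elapsedSeconds / 60);
}

// One substitution against a five-word reference gives WER = 20%.
console.log(wordErrorRate("the quick brown fox jumps",
                          "the quick brown dog jumps")); // 0.2

A browser-based real-time pipeline of the kind evaluated here can be outlined with the standard Web Speech API, which is how browsers such as Chrome expose Google's recognizer to JavaScript. The sketch below is a rough outline under that assumption; the element ID is hypothetical and the study's actual implementation is not reproduced here.

// A minimal sketch of a browser-side live-captioning loop using the
// standard Web Speech API. Illustrative only; not the study's code.
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new Recognition();
recognition.continuous = true;      // keep listening across utterances
recognition.interimResults = true;  // stream partial hypotheses for low latency

const captionsEl = document.getElementById("captions"); // hypothetical output element
let startedAt = null;

recognition.onstart = () => { startedAt = performance.now(); };

recognition.onresult = (event) => {
  // Concatenate all hypotheses (interim and final) received so far.
  let transcript = "";
  for (let i = 0; i < event.results.length; i++) {
    transcript += event.results[i][0].transcript + " ";
  }
  captionsEl.textContent = transcript;

  // Live WPM estimate: word count over minutes elapsed since start.
  const minutes = (performance.now() - startedAt) / 60000;
  const words = transcript.trim().split(/\s+/).filter(Boolean).length;
  console.log("WPM:", (words / minutes).toFixed(1));
};

recognition.start(); // requires microphone permission in the browser

Deepgram or Whisper integrations would swap this recognizer for streaming or batched calls to their respective services, while the WER and WPM helpers above stay unchanged.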
License
Copyright (c) 2025 Anik Nur Handayani, Hariyono Hariyono, Ahmad Munjin Nasih, Rochmawati Rochmawati, Imanuel Hitipeuw, Harits Ar Rosyid, Jevri Tri Ardiansah, Rafli Indar Praja, Ahmad Nurdiansyah, Desi Fatkhi Azizah

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.