The development of ASR (Automatic Speech Recognition) technology enables fast creation of accurate transcripts, supporting diagnosis, session analysis, and monitoring of therapy outcomes. In this article, we present how transcription increases diagnostic precision, improves therapy quality, and enables effective tracking of therapeutic changes over time.
Transcription of diagnostic conversations preserves session content faithfully and in detail, including the patient’s account of emotional experiences and, where the transcript is annotated, subtle aspects of speech such as tone of voice or pauses. This is extremely important for a precise understanding of the clinical picture. Being able to reread and compare conversations in text form makes it easier to verify diagnostic hypotheses and to observe patient progress over the course of multi-faceted therapy. In a study by Miner et al. (2020), an ASR system operating in compliance with HIPAA standards achieved a word error rate (WER) of about 25% and a sensitivity of up to 80% in recognizing depressive symptoms, which demonstrates the real potential of this technology in psychological and psychiatric diagnostics.
Although a word error rate at this level still leaves room for improvement, the results suggest that automatic transcripts can support analysis of the patient’s language and serve as a valuable aid to clinical reflection, helping to identify key symptoms and emotional issues.
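To make the WER figure concrete: the metric is the word-level edit distance between the reference transcript and the ASR output (substitutions, deletions, and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation. The function name and the example sentences are illustrative assumptions and are not taken from Miner et al. (2020).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the ASR output drops "not" and mishears "well".
reference = "i have not been sleeping well lately"
hypothesis = "i have been sleeping while lately"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")
```

In this invented example two errors in seven reference words give a WER of roughly 0.29, and one of them (the dropped "not") reverses the meaning of the sentence. This is why a single aggregate WER value says little about how clinically harmful individual errors are.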
Transcription of therapy sessions allows for detailed analysis of the therapist’s competencies – from identifying key moments, such as breakthrough questions, to detecting subtle intervention errors. Thanks to automation, a full session record can be quickly obtained to evaluate communication skills and relationship quality.
Research by Flemotomos et al. (2022) shows that AI-based tools can effectively model the dynamics of therapy sessions by analyzing the proportions of therapist and patient speaking time, the types of questions asked, and the level of expressed empathy. From these data, quality indicators for interventions are generated, supporting both training and supervision.
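One of the simpler indicators mentioned above, the proportion of speaking time per speaker, can be sketched as follows. The sketch assumes the recording has already been diarized into `(speaker, start_seconds, end_seconds)` segments. That tuple format, the function name, and the sample values are assumptions made for illustration and do not reflect the pipeline used by Flemotomos et al. (2022).

```python
from collections import defaultdict


def speaking_time_share(segments):
    """Return each speaker's fraction of the total speaking time.

    `segments` is an iterable of (speaker, start_seconds, end_seconds) tuples.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {speaker}, {start}, {end}")
        totals[speaker] += end - start

    total = sum(totals.values())
    if total == 0:
        return {speaker: 0.0 for speaker in totals}
    return {speaker: duration / total for speaker, duration in totals.items()}


# Hypothetical diarized segments from the first 80 seconds of a session.
segments = [
    ("therapist", 0.0, 12.0),
    ("patient", 12.0, 42.0),
    ("therapist", 42.0, 50.0),
    ("patient", 50.0, 80.0),
]
shares = speaking_time_share(segments)
for speaker, share in shares.items():
    print(f"{speaker}: {share:.0%}")
```

With these invented segments the therapist accounts for 25% of the speaking time and the patient for 75%.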
Similarly, an earlier study by Flemotomos et al. (2021) demonstrated the effectiveness of BERT-based models in cognitive behavioral therapy (CBT). A model trained on over 1,100 CBT session transcripts achieved F1 ≈ 0.73 when classifying sessions as high or low quality according to the Cognitive Therapy Rating Scale (CTRS).
Chen et al. (2021) extended this approach with a hierarchical BERT+LSTM model that assesses session quality at the level of conversation segments, allowing more precise identification of local intervention patterns and therapeutic gaps.
Interest in multimodal analysis of recordings is also growing. Ali et al. (2024) report that the Gemini 1.5 model, using both audio and text data, achieved F1 = 0.68 and a balanced accuracy of 77% in classifying conditions such as depression and PTSD without specialized adaptation, which suggests potential for diagnosis and for monitoring therapy progress.
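The metrics quoted in these studies (sensitivity, F1, balanced accuracy) all derive from the same confusion matrix. As a minimal sketch for a binary classifier: sensitivity is the share of true positive cases that are detected, F1 is the harmonic mean of precision and sensitivity, and balanced accuracy is the mean of sensitivity and specificity. The function name and the labels below are invented for illustration and do not reproduce data from any of the cited studies.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, sensitivity, specificity, F1 and balanced accuracy.

    Labels are expected to be 1 (positive class) or 0 (negative class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    specificity = ratio(tn, tn + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical labels: 1 = condition present, 0 = condition absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy is often reported alongside F1 when classes are imbalanced, because it weights the positive and negative class equally regardless of how many examples each contains.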
On the other hand, the popular Whisper model (OpenAI), although it handles multilingual transcription well, is prone to so-called hallucination: generating content that the speaker never actually said. According to reporting by AP, Wired, and CIO, estimates of how often this happens vary widely, from about 1% to about 40% of transcriptions containing completely fabricated phrases. In a clinical context this poses a significant threat to documentation reliability and therapy quality (Koenecke, 2024).
Despite these limitations, ASR and AI systems provide strong support in therapeutic education, enabling on-demand feedback and supervision based on the analysis of specific session fragments. Automating transcript analysis removes much of the time-consuming manual coding, which improves the efficiency and scalability of therapist training and can, in turn, translate into better clinical outcomes (Flemotomos et al., 2022).
Comparing transcripts from different stages of therapy enables a more objective assessment of changes in narrative, level of emotional expression, and coherence of speech. Tools such as LLUNA achieve high agreement with expert assessment (κ = 0.74–0.89), which indicates that automatic analysis of narrative progress can be effective. Such technologies make it possible to detect subtle linguistic changes that may indicate improved well-being, integration of experiences, or development of the patient’s mentalization abilities.
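Agreement values such as κ correct raw agreement for the agreement expected by chance. The sketch below assumes the reported κ is Cohen's kappa for two raters (the text does not specify the variant) and uses invented labels. It is not the evaluation procedure of the LLUNA authors.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten transcript excerpts.
expert = ["improved", "improved", "stable", "stable", "improved",
          "stable", "improved", "stable", "improved", "stable"]
model = ["improved", "improved", "stable", "improved", "improved",
         "stable", "improved", "stable", "improved", "stable"]
kappa = cohens_kappa(expert, model)
print(f"kappa = {kappa:.2f}")
```

Here the two raters agree on nine of ten excerpts (90% raw agreement), while chance agreement is 50%, giving κ of about 0.80.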
Automation also allows for the identification of so‑called breakthrough moments (insight moments), that is, session fragments characterized by a clear leap in the patient’s understanding, affect, or motivation. Mapping such moments on a timeline gives the therapist insight into the process dynamics and the effectiveness of applied techniques.
Thanks to the ability to store and compare transcripts, therapists gain access to consistent documentation that enables detection of symptom relapse, stagnation in the therapeutic process, or sudden regressions, which is especially important when working with patients with affective or personality disorders (Flemotomos et al., 2022).
Transcripts of psychotherapy sessions significantly contribute to the unification of clinical documentation. Their presence enables the creation of repeatable protocols that support interdisciplinary therapeutic teams, facilitate quality audits, and increase the transparency of the therapist’s work.
As early as the 1990s, Mergenthaler and Stinson (1992) emphasized the importance of transcription standards for ensuring consistency in session analysis. Today, thanks to advances in NLP and machine learning, advanced models (e.g., BERT) can be used to evaluate therapeutic interventions automatically, as shown by the model of Flemotomos et al. (2021), which achieved F1 ≈ 0.73 in CBT session assessment according to the CTRS.
Standardized transcripts also allow for comparison of the effectiveness of different therapeutic approaches and their compliance with accepted guidelines. The use of quantitative analysis (e.g., frequency of reflecting feelings, proportions of speaking time) allows not only assessment of individual therapists but also entire therapeutic centers. This can form a foundation for implementing quality systems based on empirical data, as well as arguments for institutions financing psychological care.
In the context of education, uniform transcription standards increase the validity and reliability of supervision. Students and young therapists can use standardized cases and compare interventions with recommended techniques, which promotes the development of competencies and professional ethics (Flemotomos et al., 2022).
Despite many benefits, the use of transcription and ASR technology in clinical practice faces several significant limitations. One key issue remains the word error rate (WER), which, depending on the model and recording conditions, ranges from 25% to 34% (Miner et al., 2020). Such a high error level may lead to distortions in content analysis, particularly for patients with speech disorders, poor articulation, or those using colloquial language, dialects, or slang.
Equally problematic is the phenomenon of so‑called “hallucination” in transcriptions generated by some models, e.g., Whisper (OpenAI). Such artifacts pose a serious threat in the context of clinical documentation and, in extreme cases, may lead to erroneous diagnostic or therapeutic decisions.
Another challenge involves ethical and legal issues. Each use of transcription requires the patient’s informed consent, in compliance with GDPR (in the European Union) and HIPAA (in the USA). This means ensuring full data anonymization, secure storage of recordings and transcripts, and implementation of access control procedures.
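As a rough illustration of one small step in such a procedure, the sketch below replaces e-mail addresses, phone-like numbers, and a caller-supplied list of names with placeholders before a transcript is stored. The patterns, placeholder tokens, and function name are assumptions made for this example. Pattern-based redaction of this kind does not amount to full anonymization and is not sufficient on its own for GDPR or HIPAA compliance, since free text can still contain indirect identifiers.

```python
import re

# Simplified patterns for illustration only; real transcripts need broader coverage.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text, known_names=()):
    """Replace e-mails, phone-like numbers and known names with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    for name in known_names:
        # Whole-word, case-sensitive match on each name supplied by the caller.
        text = re.sub(r"\b" + re.escape(name) + r"\b", "[NAME]", text)
    return text


# Hypothetical transcript line with invented contact details.
sample = "My name is Anna, you can reach me at anna.k@example.com or 555-010-0199."
redacted = redact(sample, known_names=["Anna"])
print(redacted)
```

E-mail addresses are replaced first so that names or digits embedded in an address are removed together with it rather than partially rewritten.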
It should also be emphasized that even the best AI models do not replace clinical interpretation. Automatic classifiers may overlook cultural context, individual speech style, or meanings assigned by the patient to specific words. Therefore, the therapist’s role in interpreting results and validating their accuracy remains crucial (Chen et al., 2021).
AI‑supported transcription is a valuable diagnostic, therapeutic, and research tool. It enables more precise diagnosis, more effective therapy, and monitoring of changes over time. Despite technological and ethical challenges, its use can significantly improve the quality and safety of mental health care.