Lightweight and Perceptually-Guided Voice Conversion for Electrolaryngeal Speech

Benedikt Mayrhofer1,3, Franz Pernkopf1, Philipp Aichinger2,3, Martin Hagmüller1,3
1Signal Processing and Speech Communication Laboratory, Graz University of Technology
2Department of Otorhinolaryngology, Div. Phoniatrics-Logopedics, Medical University of Vienna
3Comprehensive Centre for AI in Medicine, Medical University of Vienna
{benedikt.mayrhofer, hagmueller, pernkopf}@tugraz.at, philipp.aichinger@meduniwien.ac.at


Abstract

Electrolaryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, which reduce naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC [1] framework to this setting: we remove the pitch and energy modules and combine self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), substantially reduces the character error rate (CER) of EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion (VC) architectures to EL voice rehabilitation, while also identifying prosody generation and intelligibility as the main remaining bottlenecks.


Model Architecture

Figure 1: Lightweight StreamVC-based [1] architecture adapted for EL to HE voice conversion.
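The guided fine-tuning objective combines a reconstruction term with a perceptual (WavLM feature-distance) term and a human-feedback (predicted naturalness) term. The sketch below only illustrates this combination: the feature extractor, the MOS predictor, and the weights `lam_wavlm` / `lam_hf` are illustrative stand-ins, not the actual implementation or values used in the paper.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def guided_loss(y_hat, y_ref, wavlm, mos_predictor,
                lam_wavlm=1.0, lam_hf=0.1):
    """Sketch of a guided VC loss (weights are illustrative).

    y_hat, y_ref:   converted and reference signals (or spectra)
    wavlm:          stand-in for a WavLM-style feature extractor
    mos_predictor:  stand-in for a human-feedback naturalness predictor
    """
    recon = l1(y_hat, y_ref)                     # reconstruction term
    perceptual = l1(wavlm(y_hat), wavlm(y_ref))  # distance in feature space
    hf = -mos_predictor(y_hat)                   # reward high predicted MOS
    return recon + lam_wavlm * perceptual + lam_hf * hf
```

With an identity "feature extractor" and a constant MOS predictor, a perfect reconstruction is driven purely by the human-feedback term, which is the intended behavior of such a reward-style component.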

Time Alignment

Figure 2: Whisper-based EL–HE time-alignment and preprocessing procedure.
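The alignment in Figure 2 rests on matched word timestamps obtained from Whisper transcriptions of the EL and HE recordings. A minimal sketch of the mapping step, assuming anchor pairs of matched word onsets (EL time, HE time) have already been extracted, is:

```python
import bisect

def warp_time(t_el, anchors):
    """Map an EL-side time (seconds) to HE time via piecewise-linear
    interpolation between matched (t_el, t_he) word-onset anchors.

    Anchors must be sorted by EL time; times outside the anchor range
    are extrapolated from the nearest segment.
    """
    el_ts = [a[0] for a in anchors]
    i = bisect.bisect_right(el_ts, t_el) - 1
    i = max(0, min(i, len(anchors) - 2))  # clamp to a valid segment
    (e0, h0), (e1, h1) = anchors[i], anchors[i + 1]
    return h0 + (t_el - e0) * (h1 - h0) / (e1 - e0)
```

In the full procedure, a warp of this kind would drive resampling of one signal so that both channels line up for stereo playback; anchor extraction and resampling details here are illustrative, not the exact pipeline.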

Time Alignment Audio Examples

Illustrative examples of the EL–HE time-alignment procedure. Stereo playback uses EL on the left channel and HE on the right channel.

dialog_47 (common-dialogs_0047):
  - HE (unmodified)
  - EL (unmodified)
  - EL–HE aligned (stereo; left: EL, right: HE)
  - Waveform overlay

dialog_04 (common-dialogs_0004):
  - HE (unmodified)
  - EL (unmodified)
  - EL–HE aligned (stereo; left: EL, right: HE)
  - Waveform overlay

Voice Conversion Samples

For each EL source utterance, we show conversions to two healthy target speakers (one male, one female). The same source is reused across targets to enable direct comparison between model variants.

Systems compared for each EL source utterance: HE target reference, w/o guided loss, +WavLM+HF, +WEO+HF, +BNF+HF, FreeVC [2], XVC [3], QuickVC [4].

EL source utterances:
  - EL01Fsen0010
  - EL02Msen0164
  - EL02Msen0277
  - EL04Msen0030
  - EL07FNS00906

Baseline systems FreeVC, XVC, and QuickVC are evaluated in a zero-shot setting without EL-specific fine-tuning. All proposed models are trained on parallel EL–HE data.

Additional Objective Results

The following table provides additional objective results that could not be included in the paper due to space constraints. All results follow the same evaluation protocol as described in the manuscript. Mean values are reported. Lower is better for CER and Log-F0 RMSE; higher is better for all other metrics.
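For reference, the two "lower is better" metrics can be computed as sketched below: a Levenshtein-based character error rate and a voiced-frame log-F0 RMSE. This is a minimal sketch; details such as Whisper text normalization and F0 extraction may differ from the actual evaluation scripts.

```python
import math

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def log_f0_rmse(f0_ref, f0_conv):
    """RMSE between log-F0 tracks on frames voiced (F0 > 0) in both."""
    pairs = [(a, b) for a, b in zip(f0_ref, f0_conv) if a > 0 and b > 0]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((math.log(a) - math.log(b)) ** 2
                         for a, b in pairs) / len(pairs))
```

Note that CER can exceed 100% (as for some zero-shot baselines below) when the hypothesis contains more errors, e.g. insertions, than the reference has characters.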

Method            CER (%, Whisper)  wvMOS  SIG   BAK   OVRL  SIM   Log-F0 RMSE
HE Ground Truth          2.88        4.00  3.48  4.11  3.20  0.89      –
EL Source               88.18       -0.28  3.14  3.12  2.41  0.55     0.62

Proposed methods (trained on parallel EL–HE data):
w/o guided loss         53.72        3.17  3.29  3.88  2.90  0.86     0.35
+BNF                    60.41        3.07  3.32  3.95  2.95  0.83     0.36
+F0                     68.52        3.35  3.38  4.00  3.05  0.88     0.34
+BNF+HF                 55.41        3.82  3.42  3.98  3.07  0.88     0.35
+BNF+HF+F0              64.71        3.88  3.47  4.05  3.16  0.87     0.35
+WavLM+BNF+HF           52.46        3.73  3.39  3.96  3.03  0.87     0.34
+WavLM+HF+F0            46.69        3.70  3.42  3.97  3.06  0.87     0.35
+WavLM+HF               41.93        3.76  3.43  4.00  3.09  0.87     0.34
+WavLM+WEO+HF           47.10        3.70  3.38  3.98  3.02  0.87     0.34
+WavLM                  40.88        3.26  3.32  3.93  2.94  0.84     0.34
+WEO+HF+F0              54.37        3.64  3.43  4.00  3.08  0.80     0.34
+WEO+HF                 44.94        3.69  3.39  4.02  3.05  0.86     0.34
+WEO                    46.63        3.01  3.38  3.94  3.01  0.83     0.35

Baselines (zero-shot):
FreeVC [2]             140.31        3.52  3.27  3.99  2.91  0.71     0.40
XVC [3]                 61.24        3.59  3.32  4.02  3.00  0.69     0.37
QuickVC [4]            157.87        3.49  3.34  4.02  3.00  0.69     0.41

– : not reported.



References

  1. Y. Yang et al., “StreamVC: Real-Time Low-Latency Voice Conversion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 11016–11020.
  2. J. Li et al., “FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5.
  3. H. Guo et al., “Using Joint Training Speaker Encoder With Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion,” in Proc. IEEE Autom. Speech Recognit. Understanding Workshop (ASRU), Taipei, Taiwan, Dec. 2023, pp. 1–8.
  4. H. Guo et al., “QuickVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model Using ISTFT for Faster Conversion,” in Proc. IEEE Autom. Speech Recognit. Understanding Workshop (ASRU), Taipei, Taiwan, Dec. 2023, pp. 1–7.