Lightweight and Perceptually-Guided Voice Conversion for Electrolaryngeal Speech

Benedikt Mayrhofer1,3, Franz Pernkopf1, Philipp Aichinger2,3, Martin Hagmüller1,3
1Signal Processing and Speech Communication Laboratory, Graz University of Technology
2Department of Otorhinolaryngology, Div. Phoniatrics-Logopedics, Medical University of Vienna
3Comprehensive Centre for AI in Medicine, Medical University of Vienna
{benedikt.mayrhofer, hagmueller, pernkopf}@tugraz.at, philipp.aichinger@meduniwien.ac.at


Abstract

Electrolaryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, which reduce naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC [1] framework to this setting: we remove the pitch and energy modules and combine self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), substantially reduces the character error rate (CER) of EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion (VC) architectures to EL voice rehabilitation, while also identifying prosody generation and intelligibility as the main remaining bottlenecks.


Model Architecture

Figure 1: Lightweight StreamVC-based [1] architecture adapted for EL to HE voice conversion.
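The guided fine-tuning objective combines a reconstruction term with a perceptual (WavLM feature-distance) term and a human-feedback (predicted naturalness) term. The sketch below only illustrates this combination: the feature extractor, the MOS predictor, and the weights `lam_wavlm` / `lam_hf` are illustrative stand-ins, not the actual implementation or values used in the paper.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def guided_loss(y_hat, y_ref, wavlm, mos_predictor,
                lam_wavlm=1.0, lam_hf=0.1):
    """Sketch of a guided VC loss (weights are illustrative).

    y_hat, y_ref:   converted and reference signals (or spectra)
    wavlm:          stand-in for a WavLM-style feature extractor
    mos_predictor:  stand-in for a human-feedback naturalness predictor
    """
    recon = l1(y_hat, y_ref)                     # reconstruction term
    perceptual = l1(wavlm(y_hat), wavlm(y_ref))  # distance in feature space
    hf = -mos_predictor(y_hat)                   # reward high predicted MOS
    return recon + lam_wavlm * perceptual + lam_hf * hf
```

With an identity "feature extractor" and a constant MOS predictor, a perfect reconstruction is driven purely by the human-feedback term, which is the intended behavior of such a reward-style component.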

Time Alignment

Figure 2: Whisper-based EL–HE time-alignment and preprocessing procedure.
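The alignment in Figure 2 rests on matched word timestamps obtained from Whisper transcriptions of the EL and HE recordings. A minimal sketch of the mapping step, assuming anchor pairs of matched word onsets (EL time, HE time) have already been extracted, is:

```python
import bisect

def warp_time(t_el, anchors):
    """Map an EL-side time (seconds) to HE time via piecewise-linear
    interpolation between matched (t_el, t_he) word-onset anchors.

    Anchors must be sorted by EL time; times outside the anchor range
    are extrapolated from the nearest segment.
    """
    el_ts = [a[0] for a in anchors]
    i = bisect.bisect_right(el_ts, t_el) - 1
    i = max(0, min(i, len(anchors) - 2))  # clamp to a valid segment
    (e0, h0), (e1, h1) = anchors[i], anchors[i + 1]
    return h0 + (t_el - e0) * (h1 - h0) / (e1 - e0)
```

In the full procedure, a warp of this kind would drive resampling of one signal so that both channels line up for stereo playback; anchor extraction and resampling details here are illustrative, not the exact pipeline.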

Time Alignment Audio Examples

Illustrative examples of the EL–HE time-alignment procedure. Stereo playback uses EL on the left channel and HE on the right channel.

dialog_47 (common-dialogs_0047):
  - HE (unmodified)
  - EL (unmodified)
  - EL–HE aligned (stereo; left: EL, right: HE)
  - Waveform overlay

dialog_04 (common-dialogs_0004):
  - HE (unmodified)
  - EL (unmodified)
  - EL–HE aligned (stereo; left: EL, right: HE)
  - Waveform overlay

Voice Conversion Samples

For each EL source utterance, we show conversions to two healthy target speakers (one male, one female). The same source is reused across targets to enable direct comparison between model variants.

Systems compared for each EL source utterance: HE target reference, w/o guided loss, +WavLM+HF, +WEO+HF, +BNF+HF, FreeVC [2], XVC [3], QuickVC [4].

EL source utterances:
  - EL01Fsen0010
  - EL02Msen0164
  - EL02Msen0277
  - EL04Msen0030
  - EL07FNS00906

Baseline systems FreeVC, XVC, and QuickVC are evaluated in a zero-shot setting without EL-specific fine-tuning. All proposed models are trained on parallel EL–HE data.

Additional Objective Results

The following table provides additional objective results that could not be included in the paper due to space constraints. All results follow the same evaluation protocol as described in the manuscript. Mean values are reported. Lower is better for CER and Log-F0 RMSE; higher is better for all other metrics.
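For reference, the two "lower is better" metrics can be computed as sketched below: a Levenshtein-based character error rate and a voiced-frame log-F0 RMSE. This is a minimal sketch; details such as Whisper text normalization and F0 extraction may differ from the actual evaluation scripts.

```python
import math

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def log_f0_rmse(f0_ref, f0_conv):
    """RMSE between log-F0 tracks on frames voiced (F0 > 0) in both."""
    pairs = [(a, b) for a, b in zip(f0_ref, f0_conv) if a > 0 and b > 0]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((math.log(a) - math.log(b)) ** 2
                         for a, b in pairs) / len(pairs))
```

Note that CER can exceed 100% (as for some zero-shot baselines below) when the hypothesis contains more errors, e.g. insertions, than the reference has characters.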

Method            CER (%, Whisper)  wvMOS  SIG   BAK   OVRL  SIM   Log-F0 RMSE
HE Ground Truth          2.88        4.00  3.48  4.11  3.20  0.89      –
EL Source               88.18       -0.28  3.14  3.12  2.41  0.55     0.62

Proposed methods (trained on parallel EL–HE data):
w/o guided loss         53.72        3.17  3.29  3.88  2.90  0.86     0.35
+BNF                    60.41        3.07  3.32  3.95  2.95  0.83     0.36
+F0                     68.52        3.35  3.38  4.00  3.05  0.88     0.34
+BNF+HF                 55.41        3.82  3.42  3.98  3.07  0.88     0.35
+BNF+HF+F0              64.71        3.88  3.47  4.05  3.16  0.87     0.35
+WavLM+BNF+HF           52.46        3.73  3.39  3.96  3.03  0.87     0.34
+WavLM+HF+F0            46.69        3.70  3.42  3.97  3.06  0.87     0.35
+WavLM+HF               41.93        3.76  3.43  4.00  3.09  0.87     0.34
+WavLM+WEO+HF           47.10        3.70  3.38  3.98  3.02  0.87     0.34
+WavLM                  40.88        3.26  3.32  3.93  2.94  0.84     0.34
+WEO+HF+F0              54.37        3.64  3.43  4.00  3.08  0.80     0.34
+WEO+HF                 44.94        3.69  3.39  4.02  3.05  0.86     0.34
+WEO                    46.63        3.01  3.38  3.94  3.01  0.83     0.35

Baselines (zero-shot):
FreeVC [2]             140.31        3.52  3.27  3.99  2.91  0.71     0.40
XVC [3]                 61.24        3.59  3.32  4.02  3.00  0.69     0.37
QuickVC [4]            157.87        3.49  3.34  4.02  3.00  0.69     0.41

– : not reported.



References

  1. Y. Yang et al., “StreamVC: Real-Time Low-Latency Voice Conversion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 11016–11020.
  2. J. Li et al., “FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5.
  3. H. Guo et al., “Using Joint Training Speaker Encoder With Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion,” in Proc. IEEE Autom. Speech Recognit. Understanding Workshop (ASRU), Taipei, Taiwan, Dec. 2023, pp. 1–8.
  4. H. Guo et al., “QuickVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model Using ISTFT for Faster Conversion,” in Proc. IEEE Autom. Speech Recognit. Understanding Workshop (ASRU), Taipei, Taiwan, Dec. 2023, pp. 1–7.