Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, all of which reduce naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC [1] framework to this setting: we remove the pitch and energy modules and combine self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm the influence of these losses: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), reduces the character error rate (CER) of EL inputs from 88.18% to 41.93%, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion (VC) architectures to EL voice rehabilitation, while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
Illustrative examples of the EL–HE time-alignment procedure. Stereo playback uses EL on the left channel and HE on the right channel.
| Example | Stage | Audio / Visualization |
|---|---|---|
| dialog_47 | HE (unmodified) | |
| | EL (unmodified) | |
| | EL–HE aligned (stereo) | Left: EL · Right: HE |
| | Waveform overlay | |
| dialog_04 | HE (unmodified) | |
| | EL (unmodified) | |
| | EL–HE aligned (stereo) | Left: EL · Right: HE |
| | Waveform overlay | |

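The stereo layout used for the aligned examples (EL on the left channel, HE on the right) can be sketched as follows. This is a minimal illustration and not the authors' tooling: the function names `interleave_stereo` and `write_stereo_wav`, the 16 kHz sample rate, the zero-padding of the shorter signal, and the sine-wave stand-ins for the recordings are all assumptions, and the time-alignment step itself is not shown.

```python
import io
import math
import struct
import wave


def interleave_stereo(left, right):
    """Interleave two mono int16 sample lists as L, R, L, R, ...

    The shorter signal is zero-padded (an assumption made for this sketch).
    """
    n = max(len(left), len(right))
    left = list(left) + [0] * (n - len(left))
    right = list(right) + [0] * (n - len(right))
    frames = []
    for l_sample, r_sample in zip(left, right):
        frames.extend((l_sample, r_sample))
    return frames


def write_stereo_wav(fileobj, left, right, sample_rate=16000):
    """Write a 16-bit stereo WAV with `left` and `right` as the two channels."""
    frames = interleave_stereo(left, right)
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(2)   # channel 0 = left, channel 1 = right
        wav.setsampwidth(2)   # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(frames), *frames))


# Toy signals standing in for already-aligned EL and HE recordings.
sr = 16000
el = [int(8000 * math.sin(2 * math.pi * 100 * i / sr)) for i in range(1600)]
he = [int(8000 * math.sin(2 * math.pi * 180 * i / sr)) for i in range(800)]

buffer = io.BytesIO()
write_stereo_wav(buffer, el, he, sample_rate=sr)
```

Played back over headphones, a file built this way lets a listener hear any residual timing offset between the two channels directly.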
For each EL source utterance, we show conversions to two healthy target speakers (one male, one female). The same source is reused across targets to enable direct comparison between model variants.
| EL source | HE Target | w/o guided loss | +WavLM+HF | +WEO+HF | +BNF+HF | FreeVC [2] | XVC [3] | QuickVC [4] |
|---|---|---|---|---|---|---|---|---|
| EL01Fsen0010 | | | | | | | | |
| EL02Msen0164 | | | | | | | | |
| EL02Msen0277 | | | | | | | | |
| EL04Msen0030 | | | | | | | | |
| EL07FNS00906 | | | | | | | | |

Baseline systems FreeVC, XVC, and QuickVC are evaluated in a zero-shot setting without EL-specific fine-tuning. All proposed models are trained on parallel EL–HE data.
The following samples provide additional qualitative results that could not be included in the paper due to space constraints. All samples follow the same evaluation protocol as described in the manuscript. Mean values are reported. Lower is better for CER and Log-F0 RMSE; higher is better for all other metrics.
| Method | CER (%) Whisper | wvMOS | SIG | BAK | OVRL | SIM | Log-F0 RMSE |
|---|---|---|---|---|---|---|---|
| HE Ground Truth | 2.88 | 4.00 | 3.48 | 4.11 | 3.20 | 0.89 | – |
| EL Source | 88.18 | -0.28 | 3.14 | 3.12 | 2.41 | 0.55 | 0.62 |
| Proposed Methods (trained on EL–HE data) | |||||||
| w/o guided loss | 53.72 | 3.17 | 3.29 | 3.88 | 2.90 | 0.86 | 0.35 |
| +BNF loss | 60.41 | 3.07 | 3.32 | 3.95 | 2.95 | 0.83 | 0.36 |
| +F0 loss | 68.52 | 3.35 | 3.38 | 4.00 | 3.05 | 0.88 | 0.34 |
| +BNF+HF loss | 55.41 | 3.82 | 3.42 | 3.98 | 3.07 | 0.88 | 0.35 |
| +BNF+HF+F0 loss | 64.71 | 3.88 | 3.47 | 4.05 | 3.16 | 0.87 | 0.35 |
| +WavLM+BNF+HF loss | 52.46 | 3.73 | 3.39 | 3.96 | 3.03 | 0.87 | 0.34 |
| +WavLM+HF+F0 loss | 46.69 | 3.70 | 3.42 | 3.97 | 3.06 | 0.87 | 0.35 |
| +WavLM+HF loss | 41.93 | 3.76 | 3.43 | 4.00 | 3.09 | 0.87 | 0.34 |
| +WavLM+WEO+HF loss | 47.10 | 3.70 | 3.38 | 3.98 | 3.02 | 0.87 | 0.34 |
| +WavLM loss | 40.88 | 3.26 | 3.32 | 3.93 | 2.94 | 0.84 | 0.34 |
| +WEO+HF+F0 loss | 54.37 | 3.64 | 3.43 | 4.00 | 3.08 | 0.80 | 0.34 |
| +WEO+HF loss | 44.94 | 3.69 | 3.39 | 4.02 | 3.05 | 0.86 | 0.34 |
| +WEO loss | 46.63 | 3.01 | 3.38 | 3.94 | 3.01 | 0.83 | 0.35 |
| Baselines (zero-shot) | |||||||
| FreeVC [2] | 140.31 | 3.52 | 3.27 | 3.99 | 2.91 | 0.71 | 0.40 |
| XVC [3] | 61.24 | 3.59 | 3.32 | 4.02 | 3.00 | 0.63 | 0.37 |
| QuickVC [4] | 157.87 | 3.49 | 3.34 | 4.02 | 3.00 | 0.69 | 0.41 |
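
CER values above 100%, as reported for FreeVC and QuickVC in the table above, can look surprising. Under the standard definition (character-level edit distance divided by reference length), this happens whenever the recognizer outputs many more characters than the reference contains, since insertions count as errors. The sketch below illustrates that definition only; the helper names `edit_distance` and `cer` are hypothetical, and the text normalization applied to the Whisper transcripts in the actual evaluation is not specified here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r_char in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h_char in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r_char == h_char else 1)
            deletion = prev[j] + 1        # drop a reference character
            insertion = curr[j - 1] + 1   # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substitution in a five-character reference.
print(cer("hello", "hallo"))        # 20.0

# Ten inserted characters against a two-character reference.
print(cer("hi", "hi there you"))    # 500.0
```

Because the denominator is the reference length, a recognizer that hallucinates long transcripts for hard-to-understand input can push the mean CER well past 100%.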