📝 Publications

# denotes co-first authors

🎙 Singing Voice Synthesis

NeurIPS 2024 (Spotlight)

GTSinger is a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks.
Our work has been promoted by multiple media outlets and forums.

ACL 2025 (Findings)

STARS is a unified framework for singing transcription, alignment, and refined style annotation based on hierarchical representation learning.

ACL 2025 (Findings)

TCSinger 2 is a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.

ACM-MM 2025

PSA-MOS provides 50 hours of high-quality spatial audio recordings, with detailed localization annotations and fine-grained MOS ratings.
MESA is a multimodal evaluation framework for spatial audio playback systems that correlates strongly with human perceptual assessments.

ACM-MM 2025

MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts.
ISDrama is the first immersive spatial drama generation model through multimodal prompting.

NeurIPS 2025

The largest recorded spatial audio dataset contains four scenarios: daily life, singing, music, and speech, with a total duration of 500 hours.
It supports multiple spatial audio tasks, including audio spatialization, spatial TTA, and sound event localization and detection (SELD).

AACL-IJCNLP ASAudio: A Survey of Advanced Spatial Audio Research, Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, et al.

EMNLP 2025 (Findings)

VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.

IEEE-TVCG Interactive Table Synthesis with Natural Language, Yanwei Huang, Yunfan Zhou, Ran Chen, Changhao Pan, Xinhuan Shu, Di Weng, Yingcai Wu.