📝 Publications

# denotes co-first authors

🎙 Singing Voice Synthesis

NeurIPS 2024 (Spotlight)

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Yu Zhang, Changhao Pan#, Wenxiang Guo#, et al.

Hugging Face Demo

  • GTSinger is a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks.
  • Our work has been covered by multiple media outlets and forums, including Weixin (WeChat) and Zhihu.

ACL 2025 (Findings)

STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
Wenxiang Guo#, Yu Zhang#, Changhao Pan#, et al.

Project

  • STARS is a unified framework for singing transcription, alignment, and refined style annotation, based on hierarchical representation learning.

ACL 2025 (Findings)

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang#, Wenxiang Guo#, Changhao Pan#, et al.

Project

  • TCSinger 2 is a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.

👂 Spatial Audio

ACM-MM 2025

A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference
Changhao Pan#, Wenxiang Guo, Yu Zhang, et al.

Hugging Face Project

  • PSA-MOS provides 50 hours of high-quality spatial audio recordings, with detailed localization annotations and fine-grained MOS ratings.
  • MESA is a multimodal evaluation framework for spatial audio playback systems that correlates strongly with human perceptual assessments.

ACM-MM 2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang#, Wenxiang Guo#, Changhao Pan#, et al.

Hugging Face Project

  • MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts.
  • ISDrama is the first model for immersive spatial drama generation through multimodal prompting.

Submitted to NeurIPS 2025

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
Wenxiang Guo#, Changhao Pan#, Zhiyuan Zhu#, Xintong Hu#, et al.

Hugging Face Demo

  • MRSAudio is the largest recorded spatial audio dataset, covering four scenarios (daily life, singing, music, and speech) with a total duration of 500 hours.
  • It supports multiple spatial audio tasks, including audio spatialization, spatial TTA, and sound event localization and detection (SELD).

🎼 Music Generation

Preprint

Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang#, Wenxiang Guo#, Changhao Pan#, et al.

Project

  • VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.

Others