Richard's website


I am a Machine Learning Research Scientist at Apple MLR, primarily working on natural language processing and large model pretraining.

I received my PhD from the University of Waterloo, where I worked on language modeling and unsupervised machine learning under the supervision of Ming Li.
Before that, I worked with Chengqing Zong on spoken language understanding.

I have served as a PC member of ACL (2020-2024), EMNLP (2019-2023), ICML (2022-2023), NeurIPS (2023), ICLR (2023-2024), AAAI (2020), and COLING (2020-2024). I received an ICML Outstanding Reviewer award (2022). I am an organizer of the Embodied AI Workshop at CVPR 2024.

My recent research focuses on the following topics:

  • Long-form sequence modeling
  • LLM Factuality and Evaluation
  • Multilingual NLP

Selected Publications

Y Zhang*, H Bai*, R Zhang*, J Gu, S Zhai, J Susskind, N Jaitly. How Far Are We from Intelligent Visual Deductive Reasoning? arXiv preprint arXiv:2403.04732. 2024. (*equal)

Z Wu, H Bai, A Zhang, J Gu, VG Vydiswaran, N Jaitly, Y Zhang. Divide-or-Conquer? Which Part Should You Distill Your LLM? arXiv preprint arXiv:2402.15000. 2024.

P Maini, S Seto, H Bai, D Grangier, Y Zhang, N Jaitly. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. arXiv preprint arXiv:2401.16380. 2024.

S Zheng*, H Bai*, Y Zhang, Y Su, X Niu, N Jaitly. KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn’t Know. arXiv preprint arXiv:2312.11539. 2023. (*equal)

A Mousavi, X Zhan, H Bai, P Shi, T Rekatsinas, B Han, Y Li, J Pound, … Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation. arXiv preprint arXiv:2309.11669. 2023.

H Bai. Novel Methods for Natural Language Modeling and Pretraining. University of Waterloo. 2023.

P Shi, L Song, L Jin, H Mi, H Bai, J Lin, D Yu. Cross-lingual Text-to-SQL Semantic Parsing with Representation Mixup. Findings of the Association for Computational Linguistics: EMNLP 2022, 5296-5306. 2022.

Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu. ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech. (preprint) [pdf][code]

Peng Shi, Rui Zhang, He Bai, Jimmy Lin. XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing. Findings of EMNLP 2022. [pdf][code]

He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang. A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing. ICML 2022 (full paper) [pdf][code].

He Bai, Tong Wang, Alessandro Sordoni, Peng Shi. Better Language Model with Hypernym Class Prediction. ACL 2022 (full paper) [pdf] [code].

Peng Shi, Rui Zhang, He Bai, Jimmy Lin. Cross-Lingual Training with Dense Retrieval for Document Retrieval. EMNLP-MSR 2021 (workshop paper) [pdf].

He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, Ming Li. Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation. ACL-SRW 2021 (workshop paper) [pdf] [code].

He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li. Segatron: Segment-aware Transformer for Language Modeling and Understanding. AAAI 2021. (full paper) [pdf] [code]

Peng Shi, He Bai, Jimmy Lin. Cross-Lingual Training of Neural Models for Document Ranking. EMNLP Findings 2020. (short paper) [pdf] [code]

He Bai, Yu Zhou, Jiajun Zhang and Chengqing Zong. Memory Consolidation for Contextual Spoken Language Understanding with Dialogue Logistic Inference. ACL 2019. (short paper) [pdf] [code]

He Bai, Yu Zhou, Jiajun Zhang, Liang Zhao, Mei-Yuh Hwang and Chengqing Zong. Source Critical Reinforcement Learning for Transferring Spoken Language Understanding to a New Language. COLING 2018. (full paper) [pdf]


I am a final year Ph.D. candidate researching Natural Language Processing at the University of Waterloo. I work with Ming Li on language modeling and unsupervised machine learning methods. Before that, I worked with Chengqing Zong on spoken language understanding during my master’s study.

In general, my research investigates how to represent language for computing. Lately, I am obsessed with language modeling, which represents language via neural computing, because of its unsupervised and task-agnostic nature. I am also interested in multilingual problems and acoustic sequence modeling.

My thesis concerns modeling text and speech sequences to achieve lower perplexity and better generation and to benefit downstream language tasks; specifically, we address the problem of modeling text and text-speech sequences with Transformer-based language models. My favorite works during my Ph.D. study are Segment-Aware Language Modeling, Hypernym-Instructed Language Modeling, and Alignment-Aware Acoustic and Text Modeling.

