About
I am a Machine Learning Research Scientist at Apple MLR, primarily working on natural language processing and large model pretraining.
I received my PhD from the University of Waterloo, where I worked on language modeling and unsupervised machine learning under the supervision of Ming Li.
Before that, I worked with Chengqing Zong on spoken language understanding.
I have served as a PC member for ACL (2020-2024), EMNLP (2019-2023), ICML (2022-2023), NeurIPS (2023), ICLR (2023-2024), AAAI (2020), and COLING (2020-2024). I received an ICML Outstanding Reviewer award (2022). I am an organizer of the Embodied AI Workshop at CVPR 2024.
My recent research focuses on the following topics:
- Long-form sequence modeling
- LLM factuality and evaluation
- Multilingual NLP
Selected Publications
Y Zhang*, H Bai*, R Zhang*, J Gu, S Zhai, J Susskind, N Jaitly. How Far Are We from Intelligent Visual Deductive Reasoning? arXiv preprint arXiv:2403.04732. 2024. (*equal)
Z Wu, H Bai, A Zhang, J Gu, VG Vydiswaran, N Jaitly, Y Zhang. Divide-or-Conquer? Which Part Should You Distill Your LLM? arXiv preprint arXiv:2402.15000. 2024.
P Maini, S Seto, H Bai, D Grangier, Y Zhang, N Jaitly. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. arXiv preprint arXiv:2401.16380. 2024.
S Zheng*, H Bai*, Y Zhang, Y Su, X Niu, N Jaitly. KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn’t Know. arXiv preprint arXiv:2312.11539. 2023. (*equal)
A Mousavi, X Zhan, H Bai, P Shi, T Rekatsinas, B Han, Y Li, J Pound, … Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation. arXiv preprint arXiv:2309.11669. 2023.
H Bai. Novel Methods for Natural Language Modeling and Pretraining. University of Waterloo. 2023.
P Shi, L Song, L Jin, H Mi, H Bai, J Lin, D Yu. Cross-lingual Text-to-SQL Semantic Parsing with Representation Mixup. Findings of the Association for Computational Linguistics: EMNLP 2022, 5296-5306. 2022.
Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu. ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech. (preprint) [pdf][code]
Peng Shi, Rui Zhang, He Bai, Jimmy Lin. XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing. Findings of EMNLP 2022. [pdf][code]
He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang. A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing. ICML 2022 (full paper) [pdf][code].
He Bai, Tong Wang, Alessandro Sordoni, Peng Shi. Better Language Model with Hypernym Class Prediction. ACL 2022 (full paper) [pdf] [code].
Peng Shi, Rui Zhang, He Bai, Jimmy Lin. Cross-Lingual Training with Dense Retrieval for Document Retrieval. EMNLP-MRL 2021 (workshop paper) [pdf].
He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, Ming Li. Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation. ACL-SRW 2021 (workshop paper) [pdf] [code].
He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li. Segatron: Segment-Aware Transformer for Language Modeling and Understanding. AAAI 2021. (full paper) [pdf] [code]
Peng Shi, He Bai, Jimmy Lin. Cross-Lingual Training of Neural Models for Document Ranking. EMNLP Findings 2020. (short paper) [pdf] [code]
He Bai, Yu Zhou, Jiajun Zhang and Chengqing Zong. Memory Consolidation for Contextual Spoken Language Understanding with Dialogue Logistic Inference. ACL 2019. (short paper) [pdf] [code]
He Bai, Yu Zhou, Jiajun Zhang, Liang Zhao, Mei-Yuh Hwang and Chengqing Zong. Source Critical Reinforcement Learning for Transferring Spoken Language Understanding to a New Language. COLING 2018. (full paper) [pdf]