I lead a team of applied scientists at Amazon focused on training large language models. My work centers on developing methods to improve LLMs’ knowledge and reasoning capabilities, with the goal of making them more useful and reliable tools for people.
More broadly, I’m interested in advancing the capabilities of foundation models through better data, evaluation, and understanding.
I previously worked on information extraction from natural language and semi-structured text. I hold a PhD in Computer Science & Engineering from the University of Washington, where I was advised by Hannaneh Hajishirzi and additionally supervised by Xin Luna Dong. Our work on the Ceres project was the subject of Luna's keynote address at the 2020 Conference on Information and Knowledge Management (CIKM).
My background also includes a BA in English from Harvard University and an MA in Interdisciplinary Computer Science from Mills College. Prior to my PhD, I worked with the Big Data Analytics & Machine Intelligence group at NASA Langley Research Center, taught at Peking University, and did some time in the business world.
Publications
"ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer"
Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li
in International Conference on Learning Representations (ICLR), 2026
"Train a Unified Multimodal Data Quality Classifier with Synthetic Data"
Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li
in Findings of the Association for Computational Linguistics: EMNLP, 2025
"DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities"
Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar, Colin Lockard, Binxuan Huang, Ritesh Sarkhel, Haodong Wang
in Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2025
"Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents"
Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar
in Proceedings of the VLDB Endowment (PVLDB), 2023
"Extracting Shopping Interest-Related Product Types from the Web"
Yinghao Li, Colin Lockard, Prashant Shiralkar, Chao Zhang
in Proceedings of the Association for Computational Linguistics (ACL), 2023
"PLAtE: A Large-scale Dataset for List Page Web Extraction"
Aidan San, Yuan Zhuang, Jan Bakus, Colin Lockard, David Ciemiewicz, Sandeep Atluri, Yangfeng Ji, Kevin Small, Heba Elfardy
in Proceedings of the Association for Computational Linguistics (ACL), 2023
"DOM-LM: Learning Generalizable Representations for HTML Documents"
Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun
in CoRR, 2022
"TCN: Table Convolutional Network for Web Table Interpretation"
Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, Meng Jiang
in Proceedings of the Web Conference (WWW), 2021
"ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages"
Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi
in Proceedings of the Association for Computational Linguistics (ACL), 2020
[slides]
"OpenCeres: When Open Information Extraction Meets the Semi-Structured Web"
Colin Lockard, Prashant Shiralkar, Xin Luna Dong
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
[slides] [Expanded SWDE dataset]
"OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference"
Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Xin Luna Dong, Andrew McCallum
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
"CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web"
Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar
in Proceedings of the VLDB Endowment (PVLDB), 2018
[slides]
"Semi-Supervised Event Extraction with Paraphrase Clusters"
James Ferguson, Colin Lockard, Daniel S. Weld, Hannaneh Hajishirzi
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018
"University of Washington TAC-KBP 2016 System Description"
James Ferguson, Colin Lockard, Natalie Hawkins, Stephen Soderland, Hannaneh Hajishirzi, Daniel S. Weld
in Proceedings of TAC-KBP, 2017
Service
I regularly serve as a reviewer for conferences and journals in AI and NLP, including ACL, NAACL, EMNLP, KDD, AAAI, AKBC, and TKDE.