Colin Lockard

I lead a team of applied scientists at Amazon focused on creating training data for large language models. My work centers on developing methods to improve LLMs’ knowledge and reasoning capabilities, with the goal of making them more useful and reliable tools for people.

More broadly, I’m interested in advancing the capabilities of foundation models through better data, evaluation, and understanding.

I previously worked on information extraction from natural language and semi-structured text. I hold a PhD in Computer Science & Engineering from the University of Washington, where I was advised by Hannaneh Hajishirzi and additionally supervised by Xin Luna Dong. Our work on the Ceres project was the subject of Luna's keynote address at the 2020 Conference on Information and Knowledge Management (CIKM).

My background also includes a BA in English from Harvard University and an MA in Interdisciplinary Computer Science from Mills College. Prior to my PhD, I worked with the Big Data Analytics & Machine Intelligence group at NASA Langley Research Center, taught at Peking University, and did some time in the business world.

Publications

"Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents"
Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar
in Proceedings of the VLDB Endowment (PVLDB), 2023
"Extracting Shopping Interest-Related Product Types from the Web"
Yinghao Li, Colin Lockard, Prashant Shiralkar, Chao Zhang
in Proceedings of the Association for Computational Linguistics (ACL), 2023
"PLAtE: A Large-scale Dataset for List Page Web Extraction"
Aidan San, Yuan Zhuang, Jan Bakus, Colin Lockard, David Ciemiewicz, Sandeep Atluri, Yangfeng Ji, Kevin Small, Heba Elfardy
in Proceedings of the Association for Computational Linguistics (ACL), 2023
"DOM-LM: Learning Generalizable Representations for HTML Documents"
Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun
in CoRR, 2022
"TCN: Table Convolutional Network for Web Table Interpretation"
Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, Meng Jiang
in Proceedings of the Web Conference (WWW), 2021
"ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages"
Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi
in Proceedings of the Association for Computational Linguistics (ACL), 2020
[slides]
"OpenCeres: When Open Information Extraction Meets the Semi-Structured Web"
Colin Lockard, Prashant Shiralkar, Xin Luna Dong
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
[slides] [Expanded SWDE dataset]
"OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference"
Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Xin Luna Dong, Andrew McCallum
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
"CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web"
Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar
in Proceedings of the VLDB Endowment (PVLDB), 2018
[slides]
"Semi-Supervised Event Extraction with Paraphrase Clusters"
James Ferguson, Colin Lockard, Daniel S. Weld, Hannaneh Hajishirzi
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018
"University of Washington TAC-KBP 2016 System Description"
James Ferguson, Colin Lockard, Natalie Hawkins, Stephen Soderland, Hannaneh Hajishirzi, Daniel S. Weld
in Proceedings of TAC-KBP, 2017

Tutorials

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web @ KDD 2020

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web @ ACL 2020

Web-scale Knowledge Collection @ WSDM 2020

Service

I regularly serve as a reviewer for conferences and journals in AI and NLP, including ACL, NAACL, EMNLP, KDD, AAAI, AKBC, and TKDE.