Expanded SWDE Dataset

This page hosts additional labels augmenting the Structured Web Data Extraction (SWDE) dataset. The purpose of this expanded dataset is to motivate research in Open Information Extraction (OpenIE) from semi-structured web sources. The original SWDE dataset contains HTML pages from 80 websites in 8 verticals, along with labels for some of the relations present on those pages. This expansion provides additional annotations for 21 sites in 3 of those verticals, labeling all binary relations that involve the topic entity of each page.


The expanded label set can be downloaded here. The corresponding HTML files should be obtained from the original SWDE page.

Any use of this dataset should cite the originating paper as well as the original SWDE paper:

"OpenCeres: When Open Information Extraction Meets the Semi-Structured Web"
Colin Lockard, Prashant Shiralkar, Xin Luna Dong
in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019

"From One Tree to a Forest: a Unified Solution for Structured Web Data Extraction"
Qiang Hao, Rui Cai, Yanwei Pang, and Lei Zhang
in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pp. 775-784, Beijing, China, July 24-28, 2011