Open Information Extraction from the Web
Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni
Turing Center
Department of Computer Science and Engineering
University of Washington
Traditionally, Information Extraction (IE) has focused on
satisfying precise, narrow, pre-specified requests from small
homogeneous corpora (e.g., extract the location and time of
seminars from a set of announcements). Shifting to a new domain
requires the user to name the target relations and to manually
create new extraction rules or hand-tag new training examples.
This manual labor scales linearly with the number of target
relations.
This paper introduces Open IE (OIE), a new extraction paradigm
where the system makes a single data-driven pass over its corpus
and extracts a large set of relational tuples without requiring
any human input. The paper also introduces TEXTRUNNER, a fully
implemented, highly scalable OIE system where the tuples are
assigned a probability and indexed to support efficient
extraction and exploration via user queries.
We report on experiments over a 9,000,000 Web page corpus that
compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE
system. TEXTRUNNER achieves an error reduction of 33% on a
comparable set of extractions. Furthermore, in the amount of
time it takes KNOWITALL to perform extraction for a handful of
pre-specified relations, TEXTRUNNER extracts a far broader set
of facts reflecting orders of magnitude more relations,
discovered on the fly. We report statistics on TEXTRUNNER’s
11,000,000 highest probability tuples, and show that they
contain over 1,000,000 concrete facts and over 6,500,000 more
abstract assertions.