
Open Information Extraction from the Web










Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni



Turing Center

Department of Computer Science and Engineering
University of Washington











Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations.



This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TextRunner, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries.
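
As a rough illustration of what "tuples assigned a probability and indexed to support user queries" could look like, here is a minimal sketch. It is not TextRunner's implementation: the `Extraction` and `TupleIndex` names, the simple whitespace tokenization, the in-memory inverted index, and the sample tuples with their probabilities are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """A relational tuple (arg1, relation, arg2) with a confidence score.

    Hypothetical structure for illustration; field names are assumptions.
    """
    arg1: str
    relation: str
    arg2: str
    probability: float


class TupleIndex:
    """Toy in-memory inverted index from lowercase tokens to tuple ids."""

    def __init__(self):
        self._tuples = []
        self._postings = defaultdict(set)

    def add(self, extraction):
        # Index every whitespace-separated token of the tuple's three fields.
        tuple_id = len(self._tuples)
        self._tuples.append(extraction)
        text = " ".join((extraction.arg1, extraction.relation, extraction.arg2))
        for token in text.lower().split():
            self._postings[token].add(tuple_id)

    def query(self, *terms, min_probability=0.0):
        # Return tuples containing all query terms, most probable first.
        if not terms:
            return []
        id_sets = [self._postings.get(term.lower(), set()) for term in terms]
        matching = set.intersection(*id_sets)
        hits = [
            self._tuples[i]
            for i in matching
            if self._tuples[i].probability >= min_probability
        ]
        return sorted(hits, key=lambda e: e.probability, reverse=True)


# Made-up sample data; the probabilities are illustrative only.
index = TupleIndex()
index.add(Extraction("Edison", "invented", "the phonograph", 0.93))
index.add(Extraction("Edison", "was born in", "Ohio", 0.71))
index.add(Extraction("Berlin", "is the capital of", "Germany", 0.88))

for hit in index.query("edison"):
    print(hit.arg1, "|", hit.relation, "|", hit.arg2, "|", hit.probability)
```

Ranking query results by probability lets a user see the most trusted extractions first, and the `min_probability` cutoff is one simple way to trade recall for precision when exploring the tuples.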



We report on experiments over a 9,000,000 Web page corpus that compare TextRunner with KnowItAll, a state-of-the-art Web IE system. TextRunner achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KnowItAll to perform extraction for a handful of pre-specified relations, TextRunner extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TextRunner’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
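
The abstract does not spell out how the 33% error reduction is computed. Assuming it is a relative reduction in error rate on the comparable set of extractions, it could be written as follows, where $e_{K}$ and $e_{T}$ are symbols introduced here (not in the original) for the error rates of KnowItAll and TextRunner:

```latex
\[
  \text{error reduction} \;=\; \frac{e_{K} - e_{T}}{e_{K}}
\]
```

For example, with purely illustrative error rates of 0.30 and 0.20 (not figures from the paper), the reduction would be $(0.30 - 0.20)/0.30 \approx 0.33$.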


