Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
2-2011
Abstract
While tuple extraction for a given relation has been an active research area, its dual problem of pattern search (to find and rank patterns in a principled way) has not been studied explicitly. In this paper, we propose and address the problem of pattern search, in addition to tuple extraction. As our objectives, we stress reusability for pattern search and scalability of tuple extraction, such that our approach can be applied to very large corpora like the Web. As the key foundation, we propose a conceptual model PRDualRank to capture the notion of precision and recall for both tuples and patterns in a principled way, leading to the "rediscovery" of the Pattern-Relation Duality: the formal quantification of the reinforcement between patterns and tuples with the metrics of precision and recall. We also develop a concrete framework for PRDualRank, guided by the principles of a perfect sampling process over a complete corpus. Finally, we evaluated our framework over the real Web. Experiments show that on all three target relations our principled approach greatly outperforms the previous state-of-the-art system in both effectiveness and efficiency. In particular, we improved optimal F-score by up to 64%.
Keywords
Algorithms, Experimentation, Design, Performance
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
WSDM '11: Proceedings of the 4th International Conference on Web Search & Data Mining: Hong Kong, China, February 9-12
First Page
825
Last Page
834
ISBN
9781450304931
Identifier
10.1145/1935826.1935933
Publisher
ACM
City or Country
New York
Citation
FANG, Yuan and CHANG, Kevin Chen-Chuan.
Searching patterns for relation extraction over the Web: Rediscovering the pattern-relation duality. (2011). WSDM '11: Proceedings of the 4th International Conference on Web Search & Data Mining: Hong Kong, China, February 9-12. 825-834.
Available at: https://ink.library.smu.edu.sg/sis_research/4063
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.