Publication Type
Conference Proceeding Article
Version
submittedVersion
Publication Date
3-2004
Abstract
The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, because most wrapper-generation systems focus on extracting data at the page level, data extraction at the site level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website's skeleton, i.e., the underlying hyperlink structure that is used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages at the next level of the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing these pages using the same process. Our experiments on real-life websites show that the algorithm achieves high recall with moderate precision.
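The recursive construction described in the abstract can be illustrated with a minimal sketch. It is not the Sew algorithm itself: the abstract does not say how navigation links are identified, so `pick_navigation_group` below is a hypothetical stand-in heuristic (take the largest group of not-yet-seen links). The names `build_skeleton`, `fetch_groups`, `max_depth`, and the toy `SITE` map are likewise assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

# A page is modeled as a list of hyperlink groups; each group is a list of target URLs.
LinkGroups = List[List[str]]


def pick_navigation_group(groups: LinkGroups, visited: Set[str]) -> List[str]:
    """Hypothetical stand-in heuristic: the largest group of not-yet-seen links."""
    best: List[str] = []
    for group in groups:
        fresh = [url for url in group if url not in visited]
        if len(fresh) > len(best):
            best = fresh
    return best


def build_skeleton(
    url: str,
    fetch_groups: Callable[[str], LinkGroups],
    visited: Optional[Set[str]] = None,
    max_depth: int = 3,
) -> Dict:
    """Recursively build a tree of pages by following the chosen navigation links."""
    if visited is None:
        visited = set()
    visited.add(url)
    node: Dict = {"url": url, "children": []}
    if max_depth == 0:
        return node
    nav_links = pick_navigation_group(fetch_groups(url), visited)
    # Reserve all siblings first so one sibling is not rediscovered as another's child.
    visited.update(nav_links)
    for link in nav_links:
        node["children"].append(
            build_skeleton(link, fetch_groups, visited, max_depth - 1)
        )
    return node


# Toy in-memory site (assumed data): URL -> hyperlink groups found on that page.
SITE: Dict[str, LinkGroups] = {
    "/": [["/news", "/sports"], ["/about"]],
    "/news": [["/", "/news", "/sports"], ["/news/a", "/news/b"]],
    "/sports": [["/", "/news", "/sports"], ["/sports/x"]],
}

skeleton = build_skeleton("/", lambda u: SITE.get(u, []))
```

In this sketch, pages are supplied by a callback rather than fetched over HTTP, so the traversal can be exercised without network access. Following only one link group per page is a simplification of the stand-in heuristic, not a claim about the paper's method.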
Discipline
Databases and Information Systems | Numerical Analysis and Scientific Computing
Publication
Database Systems for Advanced Applications: 9th International Conference, DASFAA 2004, Jeju Island, Korea, March 17-19, 2004: Proceedings
Volume
2973
First Page
799
Last Page
811
ISBN
9783540210474
Identifier
10.1007/978-3-540-24571-1_70
Publisher
Springer Verlag
City or Country
Berlin
Citation
LIU, Zehua; NG, Wee-Keong; and LIM, Ee Peng.
An Automated Algorithm for Extracting Website Skeleton. (2004). Database Systems for Advanced Applications: 9th International Conference, DASFAA 2004, Jeju Island, Korea, March 17-19, 2004: Proceedings. 2973, 799-811.
Available at: https://ink.library.smu.edu.sg/sis_research/1040
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
http://dx.doi.org/10.1007/978-3-540-24571-1_70
Included in
Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons