PTIT Knowledge Portal

International journal article

Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures

Nguyễn Minh Tuấn

Web crawling is a fundamental process in many applications such as search engines, data mining, and content integration. Traditional web parsers often collect data directly from websites and struggle with the dynamic and heterogeneous nature of modern web content, leading to inaccuracies and inefficiencies. This paper examines the implementation and evaluation of machine learning-based parsers aimed at enhancing the accuracy and adaptability of web crawling systems. By combining Google search with machine learning models, we aim to enhance the ability of parsers to understand and extract relevant information from diverse web pages. We integrate state-of-the-art natural language processing techniques and semantic analysis to develop parsers capable of handling complex and varied content structures. Through extensive experiments and evaluations on real-world web data, our model demonstrates the superiority of machine learning-based parsers over conventional methods. The results show significant improvements in parsing accuracy and efficiency, underscoring the potential of machine learning to transform web crawling practices. As an application, we crawl information on trending majors that high school students are interested in pursuing at universities in Vietnam. The suggested websites relate to university training programs, so that students can understand their options and make appropriate decisions about continuing their higher education.

Published in:

Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures


Publisher:

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

Location:


Keywords:

Recursive web crawling, Machine learning-based parsers, Gaussian Process classification, Support Vector Machines (SVM), Web data extraction.