Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures
Nguyễn Minh Tuấn
Web crawling is a fundamental process in many applications, such as search engines, data mining, and
content integration. Traditional web parsers often collect data directly from websites or struggle with the
dynamic and heterogeneous nature of modern web content, leading to inaccuracies and inefficiencies. This paper
examines the implementation and evaluation of machine learning-based parsers aimed at enhancing the accuracy
and adaptability of web crawling systems. By combining Google search with machine learning models, we aim
to improve the ability of parsers to understand and extract relevant information from diverse web pages.
We integrate state-of-the-art natural language processing techniques and semantic analysis to develop parsers
capable of handling complex and varied content structures. Through extensive experiments and evaluations on
real-world web data, our model demonstrates the superiority of machine learning-based parsers over conventional
methods. The results show significant improvements in parsing accuracy and efficiency, highlighting the potential
of machine learning to transform web crawling practices. As an application, we crawl information on
trending majors that high school students in Vietnam are interested in pursuing at university. The suggested
websites relate to university training programs and help students understand their options and make appropriate
decisions about continuing their higher education.
Published in:
Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures
Publication date:
2026
Publisher:
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
Location:
Keywords:
Recursive web crawling, Machine learning-based parsers, Gaussian Process classification, Support Vector Machines (SVM), Web data extraction.
Related papers
A Study on Fusion Strategies of Facial Landmark-Based Heatmap for Facial Expression Recognition
Đỗ Hồng Quân
A Novel Network Attack Detection Platform Targeting the AMF Component in the 5G Network Infrastructure
Nguyễn Huy Trung
LogMerge: improved log parsing based on two-step clustering combined with low-level token processing
Viet Le Hai