PTIT Knowledge Portal

Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures

Nguyễn Minh Tuấn

Web crawling is a fundamental process in many applications such as search engines, data mining, and content integration. Traditional web parsers often collect data directly from websites or struggle with the dynamic and heterogeneous nature of modern web content, leading to inaccuracies and inefficiencies. This paper examines the implementation and evaluation of machine learning-based parsers aimed at enhancing the accuracy and adaptability of web crawling systems. By combining Google search with machine learning models, we aim to improve the ability of parsers to understand and extract relevant information from diverse web pages. We integrate state-of-the-art natural language processing techniques and semantic analysis to develop parsers capable of handling complex and varied content structures. Extensive experiments and evaluations on real-world web data demonstrate the superiority of machine learning-based parsers over conventional methods. The results show significant improvements in parsing accuracy and efficiency, highlighting the potential of machine learning to transform web crawling practices. As an application, we crawl information on trending majors that high school students are interested in pursuing at universities in Vietnam. The suggested websites relate to university training, helping students understand their options and make appropriate decisions about continuing their higher education.

Published in:

Improving the Web Crawling Accuracy with Machine Learning Based on Parsers Using Linguistic Structures

Publication date:

2026

DOI:

https://wseas.com/journals/isa/2026/a325109-1241.pdf

Publisher:

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

Location:

Keywords:

Recursive web crawling, Machine learning-based parsers, Gaussian Process classification, Support Vector Machines (SVM), Web data extraction.

Related papers

DistilBERT for Efficient and Accurate Email Phishing Detection: A Benchmark Against Machine and Deep Learning Models

Đàm Minh Lịnh

Toward Robust Malware Detection: A Survey of Datasets, Techniques, and Practical Challenges

Huỳnh Trọng Thưa

Transfer Learning with Particle Swarm Optimization for Durian Leaf Disease Image Classification

Trần Nguyễn Phi Hùng

Reasoning-Centric Fake News Detection: A Comprehensive Survey of Architectures, Benchmarks, and Open Challenges

Nguyễn Thanh Sơn

Optimizing predictive accuracy in general medical exams using hybrid machine learning and metaheuristic optimization methods

Nguyễn Minh Tuấn

A Smart Rule-Generation Approach for Network Intrusion Prevention Systems

Nguyễn Huy Trung

Hybrid quantum–chaotic key expansion enhances QKD rates using the Lorenz system

Pobporn Danvirutai