PTIT Knowledge Portal

International Articles

Exploring Linguistic Patterns through Machine Learning: Evidence from Logistic Regression Analysis

Nguyễn Minh Tuấn

This study examines how machine learning techniques can detect and interpret linguistic patterns in Vietnamese text, with logistic regression used as a core baseline model. The proposed framework integrates linguistic theory with computational analysis to uncover phonological, morphological, syntactic, and semantic structures within a multi-domain Vietnamese text classification corpus. After data preprocessing, tokenization, and stopword removal, several feature extraction strategies, including TF-IDF, n-grams, and linguistically enriched features such as part-of-speech and morphological cues, were applied to represent both surface-level and deep linguistic regularities. Multiple models, including Logistic Regression, CNN, Bi-LSTM with Attention, and a fine-tuned PhoBERT transformer, were trained and evaluated using standard classification metrics. Experimental results reveal that the Bi-LSTM with Attention model achieved the highest F1-score (0.80), outperforming both the baseline and CNN models, while PhoBERT suffered from overfitting and limited generalization. Analysis of feature weights and attention distributions further highlights meaningful dependencies across linguistic levels, demonstrating the value of machine learning in uncovering structured linguistic insights. The findings contribute to computational linguistics research by providing a scalable, data-driven approach for studying linguistic patterns in low-resource languages such as Vietnamese.
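The baseline described in the abstract (TF-IDF over n-grams, a logistic regression classifier, and inspection of feature weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy sentences, the two labels (0 = sports, 1 = economy), the whitespace tokenizer, and all function names and hyperparameters below are assumptions made for illustration. It uses only the Python standard library.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Whitespace-tokenize and return all 1..n_max-grams (assumed tokenizer)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def fit_tfidf(docs):
    """Learn a vocabulary index and smoothed IDF weights from the documents."""
    df = Counter()
    for d in docs:
        df.update(set(ngrams(d)))
    vocab = {term: i for i, term in enumerate(sorted(df))}
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Map a document to an L2-normalized sparse TF-IDF vector {index: value}."""
    tf = Counter(t for t in ngrams(doc) if t in vocab)
    vec = {vocab[t]: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {i: v / norm for i, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, dim, lr=0.5, epochs=300, l2=0.001):
    """Binary logistic regression trained with plain SGD and L2 on the weights."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = sigmoid(b + sum(w[i] * v for i, v in x.items())) - t
            for i, v in x.items():
                w[i] -= lr * (g * v + l2 * w[i])
            b -= lr * g
    return w, b

def predict(x, w, b):
    return 1 if sigmoid(b + sum(w[i] * v for i, v in x.items())) >= 0.5 else 0

def top_features(w, vocab, k=3):
    """Return the k n-grams with the largest positive weights (class 1 cues)."""
    terms = sorted(vocab, key=lambda t: w[vocab[t]], reverse=True)
    return terms[:k]

# Hypothetical toy corpus: label 0 = sports, label 1 = economy.
docs = [
    "đội tuyển thắng trận bóng đá",
    "cầu thủ ghi bàn trong trận đấu",
    "huấn luyện viên khen cầu thủ",
    "giá vàng tăng mạnh hôm nay",
    "ngân hàng giảm lãi suất",
    "thị trường chứng khoán tăng điểm",
]
labels = [0, 0, 0, 1, 1, 1]

vocab, idf = fit_tfidf(docs)
X = [transform(d, vocab, idf) for d in docs]
w, b = train_logreg(X, labels, dim=len(vocab))

print([predict(x, w, b) for x in X])
print(top_features(w, vocab))
```

Because the model is linear, each learned weight attaches directly to one n-gram, which is what makes a feature-weight analysis of the kind mentioned in the abstract possible. A real experiment would replace the whitespace tokenizer with a Vietnamese word segmenter and evaluate on held-out data.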

Published in:

Exploring Linguistic Patterns through Machine Learning: Evidence from Logistic Regression Analysis


Publisher:

oeil

Location:


Keywords:

machine learning, logistic regression, linguistic patterns, computational linguistics, data-driven linguistics, predictive modeling, corpus analysis