WiT: Wood Species Identification via a Hybrid CNN–Transformer With Query-Guided Cross-Attention
Wood species identification is essential for forestry monitoring and biodiversity conservation, yet remains challenging owing to high inter-class similarity and strong intra-class variation in macroscopic wood images. Convolutional Neural Networks (CNNs) effectively capture local anatomical patterns, whereas Vision Transformers (ViTs) model long-range spatial dependencies but can be sensitive to surface noise and repetitive textures. Existing hybrid CNN–Transformer approaches mainly rely on parallel processing and generic feature fusion, without explicitly reflecting the hierarchical local-to-global perception strategy employed by human wood taxonomists. In this paper, we propose WiT, an effective wood species identification framework based on a perception-aligned sequential hybrid CNN–Transformer architecture with a Query-Guided Cross-Attention (QGCA) module. WiT first extracts refined local anatomical features using a CNN backbone and then performs global structural modeling with a transformer encoder; this alignment with expert inspection is architectural in nature rather than formally enforced. A query-guided attention module adaptively integrates local and global representations, emphasizing discriminative anatomical cues while reducing the influence of irrelevant surface artifacts. Extensive experiments on six macroscopic wood datasets, including the newly curated IC4SD-VN99 benchmark, show that WiT achieves an average accuracy of 98.37% and a macro F1-score of 98.12% (mean ± std across three independent random seeds). Across the evaluated benchmarks, WiT provides competitive and stable performance, outperforming all ablation variants in accuracy and in F1-score on five of six datasets, with a negligible difference on IC4SD-VN99 (< 0.01 pp).
With an inference latency of 11.95 ms on an NVIDIA L4 GPU (batch size = 1) at a computational cost of 13.33 GFLOPs and 52.96 M parameters, the proposed framework offers a practical balance between recognition performance and efficiency for laboratory-grade forestry applications.
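The reported metrics (macro F1-score, summarized as mean ± std over three seeds) can be illustrated with a minimal sketch using only the Python standard library. The function names `macro_f1` and `summarize_over_seeds`, the toy labels, and the choice of the sample standard deviation are assumptions made for illustration; the abstract does not state which standard-deviation estimator was used, and none of the numbers below come from the paper.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return mean(f1_scores)


def summarize_over_seeds(scores):
    """Return (mean, sample standard deviation) of one metric across seeds.

    Assumption: sample std (n - 1 denominator); the paper may use another.
    """
    return mean(scores), stdev(scores)


# Toy labels for three hypothetical species classes (not data from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
toy_macro_f1 = macro_f1(y_true, y_pred)

# Hypothetical per-seed scores, used only to show the aggregation step.
seed_mean, seed_std = summarize_over_seeds([1.0, 2.0, 3.0])
```

Because macro averaging weights every class equally, a rare species that is often misclassified lowers the score as much as a common one, which is why it is usually reported alongside accuracy for fine-grained datasets with many classes.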
Published in:
WiT: Wood Species Identification via a Hybrid CNN–Transformer With Query-Guided Cross-Attention
Keywords:
Wood species identification; Hybrid CNN–Transformer; Perception-aligned; Query-guided cross-attention; Fine-grained image classification; Forestry monitoring