Dynamic Token Selection for Aerial-Ground Person Re-Identification

Yuhai Wang*, Maryam Pishgar
University of Southern California
*Corresponding Author
IEEE International Conference on Multimedia & Expo (ICME) 2025

Figure: the architecture of the Dynamic Token Selective Transformer and the Visual Token Selector.

Abstract

Aerial-Ground Person Re-identification (AGPReID) holds significant practical value but faces unique challenges due to pronounced variations in viewing angle, lighting conditions, and background interference. Traditional methods, which often analyze the entire image globally, are frequently inefficient and susceptible to irrelevant information.

In this paper, we propose a novel Dynamic Token Selective Transformer (DTST) tailored for AGPReID, which dynamically selects pivotal tokens to concentrate on pertinent regions.

Specifically, we segment the input image into multiple tokens, with each token representing a unique region or feature within the image. Using a Top-k strategy, we extract the k most significant tokens that contain vital information essential for identity recognition. Subsequently, an attention mechanism is employed to discern interrelations among diverse tokens, thereby enhancing the representation of identity features. Extensive experiments on benchmark datasets showcase the superiority of our method over existing works. Notably, on the CARGO dataset, our proposed method achieves a 1.18% mAP improvement over the second-best method. In addition, we comprehensively analyze the impact of the number of selected tokens, token insertion positions, and the number of attention heads on model performance.

Experimental Validation Video

Approach

We propose the Dynamic Token Selective Transformer (DTST), built on the View-Decoupled Transformer (VDT), to tackle the view-discrepancy challenge in AGPReID. Input images from both aerial and ground views are tokenized into a sequence of tokens. To encompass both global and view-specific details, meta tokens and view tokens are added to these image tokens before they are fed into the VDT.
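The input construction above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the function name, embedding dimension, and patch count are assumptions, and in the real model the meta and view tokens are learnable parameters rather than random vectors.

```python
import numpy as np

def tokenize_with_prefix(patch_tokens, meta_token, view_token):
    """Prepend a meta token (global, view-unrelated cue) and a view token
    (aerial/ground cue) to the patch-token sequence before the VDT.

    patch_tokens: (num_patches, dim) array of image patch embeddings
    meta_token, view_token: (dim,) embeddings (learnable in practice)
    """
    return np.vstack([meta_token[None, :], view_token[None, :], patch_tokens])

rng = np.random.default_rng(0)
dim, num_patches = 64, 196                      # e.g. a 14x14 patch grid
patches = rng.standard_normal((num_patches, dim))
meta = rng.standard_normal(dim)
view = rng.standard_normal(dim)

seq = tokenize_with_prefix(patches, meta, view)
print(seq.shape)  # (198, 64): 196 patch tokens plus meta and view tokens
```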

We also introduce the Visual Token Selector (VTS) into each VDT block, designed to dynamically refine the token representation by selecting the most informative tokens for subsequent analysis. This module aims to reduce redundancy and enhance the model's ability to focus on critical regions, thereby optimizing computational efficiency while preserving feature quality. The VTS mechanism can be understood as a dynamic token selection process that leverages attention scores to determine the importance of each token.
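The score-based Top-k selection described above can be sketched as below. This is a hedged NumPy sketch under simplifying assumptions: tokens are scored here by the attention they receive from a single meta-token query, whereas the paper's VTS may derive importance scores differently.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_token_select(tokens, meta_query, k):
    """Score each token by scaled dot-product attention from the meta
    token, then keep only the top-k tokens (original order preserved)."""
    dim = tokens.shape[1]
    scores = softmax(tokens @ meta_query / np.sqrt(dim))  # (n,) weights
    top = np.sort(np.argsort(scores)[::-1][:k])           # top-k indices
    return tokens[top], top

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # patch tokens from one block
meta_q = rng.standard_normal(64)          # query derived from meta token

kept, idx = visual_token_select(tokens, meta_q, k=98)
print(kept.shape)  # (98, 64): half the tokens retained for the next block
```

Sorting the surviving indices back into ascending order keeps the retained tokens in their original spatial arrangement, which matters for any position-sensitive processing downstream.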

The following figures illustrate the architecture of the proposed DTST and the VTS.

DTST Overview

The framework incorporates N Token Selection VDT blocks, where each block consists of an encoder layer and a visual token selector. The loss function is designed to account for both view-related and view-unrelated features, while an orthogonal loss ensures that these features remain independent from each other, further enhancing feature disentanglement and robustness.
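The orthogonality constraint mentioned above can be illustrated with a toy loss term. This is a minimal sketch, assuming the penalty is the squared cosine similarity between the view-related and view-unrelated (meta) features; the paper's exact orthogonal-loss formulation may differ.

```python
import numpy as np

def orthogonal_loss(f_view, f_meta):
    """Squared cosine similarity between the view-related feature and the
    view-unrelated (meta) feature; zero when the two are orthogonal."""
    cos = f_view @ f_meta / (np.linalg.norm(f_view) * np.linalg.norm(f_meta))
    return float(cos ** 2)

# Orthogonal features incur no penalty; aligned features are maximally penalized.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(orthogonal_loss(a, b))  # 0.0
print(orthogonal_loss(a, a))  # 1.0
```

Minimizing this term pushes the two feature subspaces apart, which is what keeps view-related and view-unrelated information disentangled.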


VTS Overview

The VTS is designed to dynamically refine the token representation by selecting the most informative tokens for subsequent analysis, reducing redundancy and sharpening the model's focus on critical regions while preserving feature quality.

Visual Token Selector

Experiments

We evaluate the proposed DTST on two benchmark datasets: CARGO and AG-ReID.

The results demonstrate that our method outperforms existing works in terms of mAP and rank-1 accuracy.


1. CARGO Result


2. AG-ReID Result


BibTeX

@article{wang2024dynamic,
  title={Dynamic Token Selection for Aerial-Ground Person Re-Identification},
  author={Wang, Yuhai and Pishgar, Maryam},
  journal={arXiv preprint arXiv:2412.00433},
  year={2024}
}