CN-122023960-A - Feature labeling model training method, feature labeling method, device and medium

CN 122023960 A

Abstract

The application relates to the field of computer technology, and in particular to a feature labeling model training method, a feature labeling method, a device, and a medium. It aims to solve the technical problems of existing video labeling approaches: poor temporal understanding, a single-task structure, restrictive data formats, and high labeling cost. To this end, the application labels a training data set with labeling targets of a preset type, obtaining a structured labeling result for each video frame sequence in the training data set. A feature labeling model is then trained on the labeled training data set, yielding a trained feature labeling model. This effectively trains the model's detection, tracking, and feature description capabilities, enables it to handle more complex feature labeling tasks, and provides continuous target tracking, target identity consistency, and cross-frame feature summarization.

Inventors

  • LIU JIAJUN
  • TIAN JUNGANG
  • LI ZHIXING
  • WANG HAO

Assignees

  • 浙江智谱新篇科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-29

Claims (10)

  1. A method for training a feature labeling model, the method comprising: labeling a training data set with labeling targets of a preset type to obtain a structured labeling result for each video frame sequence in the training data set, wherein the structured labeling result comprises a target ID and spatial information of the labeling target in each video frame of the sequence, and global summarized features of the labeling target across all video frames of the sequence; and training the feature labeling model on the labeled training data set to obtain a trained feature labeling model.
  2. The method of claim 1, wherein labeling the training data set with labeling targets of a preset type to obtain a structured labeling result for each video frame sequence comprises: for each video frame sequence in the training data set, labeling the labeling targets of the sequence based on a preset first labeling model to obtain a model labeling result; and obtaining the structured labeling result from the model labeling result.
  3. The method of claim 2, wherein labeling the labeling targets of the video frame sequence based on the preset first labeling model to obtain a model labeling result comprises: labeling the labeling targets of the sequence based on the first labeling model and a preset first prompt, and obtaining the model labeling result; wherein the first prompt is prompt text instructing the first labeling model to generate a model labeling result comprising a target ID and spatial information of the labeling target in each video frame of the sequence, and global summarized features of the labeling target across all video frames of the sequence.
  4. The method of claim 2, wherein obtaining the structured labeling result from the model labeling result comprises: manually verifying and correcting the model labeling result to obtain the structured labeling result.
  5. The method of claim 1, wherein training the feature labeling model on the labeled training data set to obtain a trained feature labeling model comprises: inputting the video frame sequences containing the structured labeling results into the feature labeling model, and training the feature labeling model to obtain a trained feature labeling model.
  6. The method of claim 5, wherein the feature labeling model is a second large labeling model, and training the feature labeling model to obtain a trained feature labeling model comprises: setting a preset second prompt, wherein the second prompt is prompt text instructing the second large labeling model to label labeling targets of the preset type and perform temporal association; and training the second large labeling model on the second prompt and the video frame sequences containing the structured labeling results to obtain a trained second large labeling model.
  7. A feature labeling method, the method comprising: labeling a passenger video frame sequence based on a trained feature labeling model to obtain a feature labeling result for the passenger video frame sequence; wherein the feature labeling model is trained by the feature labeling model training method of any one of claims 1 to 6.
  8. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements the feature labeling model training method of any one of claims 1 to 6.
  9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements the feature labeling method of claim 7.
  10. A computer-readable storage medium storing program code, wherein the program code is adapted to be loaded and executed by a processor to perform the feature labeling model training method of any one of claims 1 to 6 or the feature labeling method of claim 7.
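As an editorial illustration only (not part of the claims), the structured labeling result of claim 1 — per-frame target IDs and spatial information plus a cross-frame global feature summary — could be represented roughly as follows. All field and type names here are hypothetical assumptions; the patent does not specify a concrete data format:

```python
from dataclasses import dataclass, field, asdict
import json

# Hypothetical schema for the structured labeling result of claim 1.
# Field names are illustrative assumptions, not taken from the patent.

@dataclass
class FrameLabel:
    frame_index: int   # position of the frame within the video frame sequence
    target_id: int     # identity kept consistent for the same target across frames
    bbox: tuple        # spatial information: (x, y, width, height)

@dataclass
class StructuredLabel:
    sequence_id: str                                      # which video frame sequence this belongs to
    frames: list = field(default_factory=list)            # per-frame target ID + spatial info
    global_features: dict = field(default_factory=dict)   # cross-frame feature summary

# Example: one target tracked across two frames, with a global summary.
label = StructuredLabel(
    sequence_id="seq_0001",
    frames=[
        FrameLabel(frame_index=0, target_id=1, bbox=(10, 20, 50, 100)),
        FrameLabel(frame_index=1, target_id=1, bbox=(14, 22, 50, 100)),
    ],
    global_features={"target_id": 1,
                     "description": "adult wearing a red coat, walking left to right"},
)

# Serialize to JSON, e.g. for storage alongside the video frame sequence.
print(json.dumps(asdict(label), indent=2))
```

Keeping the per-frame entries and the global summary in one record is what lets a single training example carry both spatial supervision and cross-frame identity supervision.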

Description

Feature labeling model training method, feature labeling method, device and medium

Technical Field

The application relates to the field of computer technology, and in particular to a feature labeling model training method, a feature labeling method, a device, and a medium.

Background

Existing multi-modal large models (such as GLM-4.5V and the Qwen2.5-VL series) perform well on understanding tasks involving static images, but have significant limitations on tasks involving temporal information, mainly in the following respects:

1. Lack of temporal consistency understanding. Existing models process video independently, frame by frame, and lack an effective internal mechanism to associate the same target across different frames, i.e., to perform temporal association. As a result, they cannot stably track a target or understand the evolution of its behavior.
2. Single task definition. Traditional fine-tuning methods generally target a single task (such as detection only, or recognition only), and lack a unified framework that combines detection, tracking, feature description, and other tasks for end-to-end learning. The resulting fragmentation of model capabilities makes it difficult to accomplish the complex task of "input a video, output temporally consistent IDs and feature descriptions".
3. Data format limitations. The common annotation formats of object detection and image description datasets treat each frame independently. For example, the bounding boxes of the same person in different frames are independent, anonymous annotations; without a unique ID that persists throughout the video, a unified cross-frame feature summary cannot be provided. Such formats cannot train a model to form temporal identity consensus or cross-frame feature summarization capability.
4. High data labeling cost. Constructing a high-quality temporally labeled video dataset requires substantial manual effort to label, track, and assign IDs frame by frame, which is time-consuming, labor-intensive, and error-prone, and has become a bottleneck for deploying the technology.

Accordingly, there is a need in the art for a new feature labeling scheme that addresses the above problems.

Disclosure of the Invention

To overcome the above defects, the application is provided to solve, or at least partially solve, the technical problems of poor temporal understanding, a single-task structure, restrictive data formats, and high labeling cost in existing video labeling approaches.

In a first aspect, a feature labeling model training method is provided, the method comprising: labeling a training data set with labeling targets of a preset type to obtain a structured labeling result for each video frame sequence in the training data set, wherein the structured labeling result comprises a target ID and spatial information of the labeling target in each video frame of the sequence, and global summarized features of the labeling target across all video frames of the sequence; and training the feature labeling model on the labeled training data set to obtain a trained feature labeling model.

In one technical scheme of the feature labeling model training method, labeling the training data set with labeling targets of a preset type to obtain a structured labeling result for each video frame sequence comprises: for each video frame sequence in the training data set, labeling the labeling targets of the sequence based on a preset first labeling model to obtain a model labeling result; and obtaining the structured labeling result from the model labeling result.
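Purely as an illustrative sketch (not part of the patent text), the prompt-driven labeling step above — a preset prompt instructing a first labeling model to emit per-frame IDs, spatial information, and a cross-frame summary — might look like the following. The prompt wording and the `run_labeling_model` helper are hypothetical stand-ins, since the patent specifies no concrete model API:

```python
# Illustrative sketch of the prompt-driven labeling step.
# FIRST_PROMPT paraphrases the kind of instruction the "first prompt" describes;
# run_labeling_model is a hypothetical placeholder for the first labeling model.

FIRST_PROMPT = (
    "For every frame in the video sequence, detect each target of the preset type, "
    "assign it a target ID that stays consistent across frames, report its bounding "
    "box, and finally summarize the target's global features across all frames."
)

def run_labeling_model(frames, prompt):
    """Hypothetical stand-in for the preset first labeling model.

    Returns a model labeling result: per-frame (target_id, bbox) entries
    plus a global feature summary per target.
    """
    result = {"frames": [], "global_features": {}}
    for i, _frame in enumerate(frames):
        # A real model would run detection and tracking here;
        # this placeholder emits one fixed target per frame.
        result["frames"].append({"frame_index": i, "target_id": 1, "bbox": (0, 0, 1, 1)})
    result["global_features"][1] = "placeholder cross-frame summary"
    return result

model_result = run_labeling_model(frames=["frame0", "frame1", "frame2"],
                                  prompt=FIRST_PROMPT)
print(len(model_result["frames"]))  # one entry per input frame
```

In the scheme described above, this raw model labeling result would then be manually verified and corrected to produce the final structured labeling result.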
In one technical scheme of the feature labeling model training method, labeling the labeling targets of the video frame sequence based on the preset first labeling model to obtain a model labeling result comprises: labeling the labeling targets of the sequence based on the first labeling model and a preset first prompt, and obtaining the model labeling result; wherein the first prompt is prompt text instructing the first labeling model to generate a model labeling result comprising a target ID and spatial information of the labeling target in each video frame of the sequence, and global summarized features of the labeling target across all video frames of the sequence.

In one technical scheme of the feature labeling model training method, obtaining the structured labeling result from the model labeling result comprises: manually verifying and correcting the model labeling result to obtain the structured labeling result.

In one technical scheme of the feature labeling model training method, training the feature labeling model on the labeled training data set to obtain a feature labeling model