CN-121982617-A - Door entry and exit detection method and device based on visual language model
Abstract
The disclosure provides a door entry and exit detection method and device based on a visual language model. The method includes: obtaining a scene video including video of an in-out event of one or more doors in a target scene; obtaining visual identifications of the one or more doors in the target scene, wherein the visual identifications indicate the position of each door in the target scene; and inputting a prompt text and at least one of (i) an identification image including the visual identifications, together with the scene video, and (ii) an identification video including the visual identifications, the identification video being obtained by identifying the scene video with the visual identifications, into the visual language model to detect the door in-out event in the scene video.
Inventors
- GUAN YIPENG
Assignees
- 墨芯人工智能科技(深圳)有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260325
Claims (19)
- 1. A door entry and exit detection method based on a visual language model, the method comprising: acquiring a scene video, wherein the scene video comprises videos of in-out events of one or more doors in a target scene; obtaining visual identifications of the one or more doors in the target scene, wherein the visual identifications indicate a position of each door of the one or more doors in the target scene; and inputting a prompt text and at least one of the following into the visual language model to detect a door entry and exit event in the scene video: (i) an identification image comprising the visual identifications, and the scene video; or (ii) an identification video comprising the visual identifications, the identification video being obtained by identifying the scene video with the visual identifications; wherein the prompt text is used for guiding the visual language model to infer changes in timing information in the scene video based on the visual identifications, so as to detect the entry and exit events of the one or more doors in the scene video.
- 2. The method of claim 1, wherein the visual identifications are determined by: acquiring a scene image of the target scene, the scene image comprising the one or more doors in the target scene; performing instance segmentation on the door regions in the scene image to obtain the positions of the one or more doors in the target scene based on the region masks corresponding to the one or more doors obtained by the instance segmentation; and respectively allocating visual identification features to the one or more doors based on the positions of the one or more doors in the target scene, wherein doors at different positions are allocated different visual identification features as the visual identifications of the doors.
- 3. The method of claim 2, wherein the one or more doors are sequentially assigned different ones of the visual identification features based on relative spatial relationships of the one or more doors in the scene image.
- 4. The method according to claim 2, wherein the identification image is obtained by overlaying, based on the region masks, the allocated visual identification features on the regions of the corresponding doors in the scene image.
- 5. The method of claim 2, wherein the identification video is obtained by overlaying, based on the region masks, the assigned visual identification features on the regions of the corresponding doors in each frame of the scene video.
- 6. The method of claim 2, wherein the visual identification features comprise one or more of a color, a texture, vertex coordinates of the one or more doors, and a door number.
- 7. The method of claim 1, wherein the prompt text indicates a statistical rule for counting the entry and exit events occurring at the one or more doors and an output structure of the visual language model, wherein the statistical rule includes counting a single entry event or a single exit event occurring at a single door as one entry or one exit, and the output structure includes at least a door position at which the entry and exit event occurs, the entry and exit event corresponding to the door position, and a number of entries and/or exits corresponding to the entry and exit event.
- 8. The method of claim 1, wherein the visual language model is a Transformer-based multimodal model.
- 9. A training method for a visual language model for door entry and exit detection, the method comprising: acquiring a scene video sample, wherein the scene video sample comprises videos of in-out events of one or more doors in a target scene; constructing a sample data set corresponding to the target scene, wherein the sample data set comprises: (i) an identification image including visual identifications of the one or more doors, the scene video sample, and annotation information corresponding to the scene video sample; or (ii) an identification video comprising the visual identifications, obtained based on the scene video sample, and annotation information corresponding to the identification video; wherein the visual identifications indicate a location of each of the one or more doors in the target scene, and the annotation information indicates a correct number of in-out events for the one or more doors; inputting a prompt text and the sample data set into the visual language model to obtain a predicted number of in-out events of the one or more doors in the target scene output by the visual language model, wherein the prompt text is used for guiding the visual language model to infer changes in timing information in the scene video sample based on the visual identifications, so as to detect the in-out events of the one or more doors in the scene video sample; and fine-tuning parameters of the visual language model by means of supervised fine-tuning, using the difference between the predicted number and the correct number.
- 10. The method according to claim 9, further comprising: constructing a preference pair set, wherein the preference pair set comprises positive annotation information and negative annotation information corresponding to the scene video sample, the positive annotation information comprising annotation information indicating a correct number of in-out events occurring at the one or more doors, and the negative annotation information comprising annotation information indicating an incorrect number of in-out events occurring at the one or more doors; and performing preference optimization training on the visual language model by using the preference pair set.
- 11. The method of claim 9, wherein the visual identifications are determined by: acquiring a scene image of the target scene, the scene image comprising the one or more doors in the target scene; performing instance segmentation on the door regions in the scene image, and acquiring the positions of the one or more doors in the target scene based on region masks corresponding to the one or more doors obtained by the instance segmentation; and respectively assigning visual identification features to the one or more doors based on the positions of the one or more doors in the target scene, wherein doors located at different positions are assigned different visual identification features as the visual identifications of the doors.
- 12. The method of claim 11, wherein the one or more doors are sequentially assigned different ones of the visual identification features based on relative spatial relationships of the one or more doors in the scene image.
- 13. The method according to claim 11, wherein the identification image is obtained by overlaying, based on the region masks, the allocated visual identification features on the regions of the corresponding doors in the scene image.
- 14. The method of claim 11, wherein the identification video is obtained by overlaying, based on the region masks, the assigned visual identification features on the regions of the corresponding doors in each frame of the scene video.
- 15. The method of claim 11, wherein the visual identification features comprise one or more of a color, a texture, vertex coordinates of the one or more doors, and a door number.
- 16. An apparatus for door entry and exit detection based on a visual language model, the apparatus comprising: a scene video acquisition unit configured to acquire a scene video comprising videos of entry and exit events of one or more doors in a target scene; a visual identification acquisition unit configured to acquire visual identifications of the one or more doors in the target scene, wherein the visual identifications indicate a position of each of the one or more doors in the target scene; and a model reasoning unit configured to input a prompt text and at least one of the following into the visual language model to detect door entry and exit events in the scene video: (i) an identification image comprising the visual identifications, and the scene video; or (ii) an identification video comprising the visual identifications, the identification video being obtained by identifying the scene video with the visual identifications; wherein the prompt text is used for guiding the visual language model to infer changes in timing information in the scene video based on the visual identifications, so as to detect the entry and exit events of the one or more doors in the scene video.
- 17. A computer device, comprising: at least one processor; and a memory having a computer program stored thereon, wherein the computer program, when executed by the at least one processor, causes the at least one processor to perform the method of any of claims 1-15.
- 18. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-15.
- 19. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-15.
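As an illustration of the prompt text referenced in claim 7 (a counting rule plus a required output structure), the following is a minimal Python sketch of how such a prompt might be composed. The function name, the exact wording of the rule, and the JSON schema are assumptions for illustration; the claims do not fix any concrete phrasing.

```python
import json

def build_prompt(door_ids):
    """Compose a prompt combining the statistical rule (each single entry or
    exit event at a single door counts once) with the output structure the
    model is asked to fill in. Illustrative wording only."""
    rule = ("Count each single entry event at a single door as one entry, "
            "and each single exit event as one exit.")
    # Output structure: door position, event type, and entry/exit counts.
    schema = {door: {"entries": 0, "exits": 0} for door in door_ids}
    return (f"The doors in the video are marked with ids {door_ids}. "
            f"{rule} Report results as JSON with this structure: "
            f"{json.dumps(schema)}")

prompt = build_prompt(["door_1", "door_2"])
```

A prompt built this way gives the visual language model both the rule to apply and a machine-parseable format for the detected events.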
Description
Door entry and exit detection method and device based on visual language model

Technical Field

The present disclosure relates to door entry and exit detection, and more particularly, to a door entry and exit detection method and apparatus based on a visual language model, a training method for a visual language model for door entry and exit detection, and a computer device, a computer readable storage medium, and a computer program product for performing the aforementioned detection method and training method.

Background

With the rapid development of artificial intelligence technology toward multimodal fusion, the Visual Language Model (VLM) has become one of the core technologies in the field of multimodal intelligence. A visual language model is a multimodal generative AI model integrating visual processing and natural language understanding. By combining a visual encoder with a large language model, it breaks through the limitations of traditional single-modality models and enables joint understanding, reasoning, and response generation over visual information such as images and videos together with text. It can adapt to a wide variety of scene requirements and serves as a key carrier connecting visual perception and language interaction. At present, visual language models, by virtue of their cross-modal collaborative processing capability, have been widely applied across many industries and scenarios, gradually driving the intelligent upgrading of each field.

Disclosure of Invention

The present disclosure provides a door entry and exit detection method and apparatus based on a visual language model, a training method for a visual language model for door entry and exit detection, and a computer device, a computer readable storage medium, and a computer program product for performing the aforementioned detection method and training method.
According to one aspect of the disclosure, a door access detection method based on a visual language model is provided, the method comprising: obtaining a scene video comprising video of access events of one or more doors in a target scene; obtaining visual identifications of the one or more doors in the target scene, wherein the visual identifications indicate the position of each door in the target scene; and inputting a prompt text and at least one of (i) an identification image comprising the visual identifications, together with the scene video, and (ii) an identification video comprising the visual identifications, the identification video being obtained by identifying the scene video with the visual identifications, into the visual language model, wherein the prompt text is used for guiding the visual language model to infer changes in timing information in the scene video based on the visual identifications so as to detect access events of the one or more doors in the scene video. In some embodiments, the visual identifications are determined by: obtaining a scene image of the target scene, the scene image including the one or more doors in the target scene; performing instance segmentation on door regions in the scene image to obtain positions of the one or more doors in the target scene based on region masks corresponding to the one or more doors obtained by the instance segmentation; and respectively assigning visual identification features to the one or more doors based on the positions of the one or more doors in the target scene, wherein doors located at different positions are assigned different visual identification features as the visual identifications of the doors. In some embodiments, the one or more doors are sequentially assigned different ones of the visual identification features based on relative spatial relationships of the one or more doors in the scene image.
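The determination of visual identifications described above (instance segmentation, region masks, position-based assignment of distinct features) can be sketched as follows. This is an illustrative sketch only: it assumes the region masks have already been produced by an off-the-shelf instance segmentation model and are given as boolean NumPy arrays, orders the doors left to right by mask centroid as one possible relative spatial relationship, and uses an assumed color palette and door-naming scheme.

```python
import numpy as np

# Distinct colors serving as the visual identification features (assumed palette).
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]

def assign_door_ids(masks):
    """Given boolean region masks from instance segmentation (one per door),
    order the doors left to right by mask centroid and assign each a distinct
    color and sequential number as its visual identification."""
    centroids = [np.argwhere(m).mean(axis=0) for m in masks]  # (row, col) centers
    order = sorted(range(len(masks)), key=lambda i: centroids[i][1])  # sort by x
    return {f"door_{rank + 1}": {"mask_index": i,
                                 "color": PALETTE[rank % len(PALETTE)]}
            for rank, i in enumerate(order)}

# Two toy 4x6 masks: one door on the right edge, one on the left edge.
h, w = 4, 6
m_right = np.zeros((h, w), bool); m_right[:, 4:] = True
m_left = np.zeros((h, w), bool); m_left[:, :2] = True
ids = assign_door_ids([m_right, m_left])
```

Because assignment is driven by position rather than input order, the leftmost door always receives the first feature, which matches the sequential, spatial-relationship-based assignment described above.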
In some embodiments, the identification image is obtained by overlaying, based on the region masks, the allocated visual identification features on the regions of the corresponding doors in the scene image. In some embodiments, the identification video is obtained by overlaying, based on the region masks, the assigned visual identification features on the regions of the corresponding doors in each frame of the scene video. In some embodiments, the visual identification features include one or more of a color, a texture, vertex coordinates of the one or more doors, and a door number. In some embodiments, the prompt text indicates a statistical rule for counting the entrance and exit events occurring at the one or more doors and an output structure of the visual language model, wherein the statistical rule includes counting a single entrance event or a single exit event occurring at a single door as one entrance or one exit, and the output structure includes at least the door position at which an entrance or exit event occurs, the entrance or exit event corresponding to that door position, and the number of entries and/or exits corresponding to the event. In some embodiments, the visual language model is a Transformer-based multimodal model.
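The mask-based overlay described in these embodiments, blending a door's assigned color onto its region in every frame to produce the identification video, can be illustrated as below. The alpha-blending scheme, array shapes, and blend factor are assumptions for illustration; the disclosure does not specify how the visual identification features are rendered onto the frames.

```python
import numpy as np

def overlay_ids(frames, masks, colors, alpha=0.5):
    """Blend each door's assigned color over its region mask in every frame,
    yielding the identification video (assumed alpha blending)."""
    out = []
    for frame in frames:  # each frame: (H, W, 3) uint8
        f = frame.astype(float)
        for mask, color in zip(masks, colors):
            # Blend only the pixels inside this door's region mask.
            f[mask] = (1 - alpha) * f[mask] + alpha * np.array(color, float)
        out.append(f.astype(np.uint8))
    return out

# Toy example: three gray 4x6 frames, one door region, red identification.
frames = [np.full((4, 6, 3), 100, np.uint8) for _ in range(3)]
mask = np.zeros((4, 6), bool); mask[1:3, 1:3] = True
marked = overlay_ids(frames, [mask], [(255, 0, 0)])
```

Applying the same per-mask blend to every frame keeps each door's identification spatially fixed across the video, which is what lets the model tie observed timing changes back to a specific door.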