CN-121982452-A - Implementation method for recognizing figurines and driving dialogue based on a visual model

CN121982452A

Abstract

The invention discloses an implementation method for recognizing figurines and driving dialogue based on a visual model, and belongs to the technical field of artificial intelligence and intelligent interaction. The method solves the technical problem that existing figurine display scenarios lack figurine recognition and interaction functions. The technical scheme comprises three steps: training and deploying a visual recognition model for figurines, configuring a multi-level recognition model strategy, and executing a figurine-recognition-driven intelligent interaction flow. Finally, a camera is used to detect and recognize a figurine, a feature tag set covering the figurine's brand IP image and wearing details is obtained, and the tag set is passed as a parameter to a streaming end-to-end voice large model cloud service that drives immersive dialogue interaction matching the figurine's style.

Inventors

  • DUAN WEIWEI

Assignees

  • 段巍巍

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (3)

  1. An implementation method for recognizing figurines and driving dialogue based on a visual model, characterized by comprising the following steps: Step S1, training and deploying a visual recognition model for figurines, comprising selecting a visual recognition base model, collecting training samples, annotating sample labels, training and validating the model, converting the model format, and deploying the system; Step S2, configuring a multi-level recognition model strategy, comprising a first-level classification model for recognizing brand IP, an optional second-level classification model for recognizing sub-product images under a brand IP class, and an optional third-level model that uses the original base model to recognize the figurine's wearing features; and Step S3, executing the figurine recognition and intelligent interaction flow, comprising periodically calling a camera to photograph the figurine placement area, submitting the image to the multi-level visual recognition models to form a structured figurine feature tag set from their outputs, playing a welcome introduction voice according to the tag set, and, depending on the network connection state, calling an end-to-end voice large model cloud service or local voice with the tag set as a parameter to perform intelligent dialogue interaction.
  2. The implementation method according to claim 1, characterized in that annotating the sample labels in step S1 requires two layers of classification labels, namely a brand IP label and a sub-product image label under the brand IP class, and that training and validating the model in step S1 requires training two classification models, one on the brand IP labels and one on the sub-product image labels under the brand IP class.
  3. The implementation method according to claim 1, wherein in step S3 the end-to-end voice large model cloud service is called with the tag set as a parameter to implement streaming intelligent dialogue interaction over the WebSocket protocol, as sketched below.
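For illustration, the streaming call of claim 3 might look like the following sketch, using the `websockets` Python package. The endpoint URL, message schema, and play_audio() helper are invented placeholders: the patent does not specify the cloud service's actual API, only that the tag set is passed as a parameter over the WebSocket protocol.

```python
# Hedged sketch of claim 3: streaming dialogue with an end-to-end voice
# large model cloud service over WebSocket, with the recognized figurine
# tag set as a session parameter. Endpoint and schema are assumptions.
import asyncio
import json

import websockets  # pip install websockets


def play_audio(pcm: bytes):
    """Stub: hand synthesized speech to the device's audio output."""


async def stream_dialogue(tag_set: dict, audio_chunks):
    uri = "wss://voice-llm.example.com/v1/dialogue"  # placeholder endpoint
    async with websockets.connect(uri) as ws:
        # Configure the session with the structured figurine feature tags,
        # so the model replies in the recognized brand IP's persona.
        await ws.send(json.dumps({"type": "session.start", "tags": tag_set}))

        async def sender():
            for chunk in audio_chunks:           # raw PCM frames of user speech
                await ws.send(chunk)             # binary frames upstream
            await ws.send(json.dumps({"type": "session.end"}))

        async def receiver():
            async for message in ws:
                if isinstance(message, bytes):   # binary frames: reply speech
                    play_audio(message)
                else:
                    print("event:", json.loads(message))

        await asyncio.gather(sender(), receiver())


# Example tag set produced by the multi-level recognition flow of claim 1.
tags = {"brand_ip": "brand_a_puppy", "sub_product": "chef_puppy",
        "features": ["spatula"]}
# asyncio.run(stream_dialogue(tags, audio_chunks=[]))
```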

Description

Implementation method for recognizing figurines and driving dialogue based on a visual model

Technical Field

The invention relates to an artificial intelligence recognition and dialogue implementation method for figurines, and in particular to a system implementation method for recognizing figurines based on a visual model and driving intelligent dialogue interaction through the recognized features.

Background

With the rapid development of artificial intelligence technology, intelligent interactive systems are widely used in various fields. Traditional figurines are mainly for static viewing and lack the ability to interact with users. In recent years, technicians have begun to explore applying intelligent recognition and dialogue techniques to figurine display in order to enhance the user experience. However, existing figurine recognition generally requires implanting an RFID chip into the figurine and relies on a specific base: this approach requires modifying the figurine itself, has weak compatibility, and cannot be adapted to the many figurine products already on the market. In addition, existing intelligent dialogue systems adopt a traditional three-stage processing mode of speech recognition, text large model, and speech synthesis; the staged processing results in large interaction latency and an unsmooth user experience.

Disclosure of Invention

To solve the problems in the prior art that figurine recognition requires a specific structure and a specific base and is difficult to make compatible with existing stock figurines, the invention provides an implementation method for recognizing figurines and driving dialogue based on a visual model, which can realize figurine recognition and intelligent dialogue interaction in the style of a specific figurine IP by only developing a mobile phone APP, or by only retrofitting a figurine tray, figurine showcase, or the like with an embedded system. The method comprises the following steps.

Step S1, training and deploying a visual recognition model for figurines, comprising selecting a visual recognition base model, collecting training samples, annotating sample labels, training and validating the model, converting the model format, and deploying the system. First, a stable visual target recognition model suitable for the deployment environment is selected: if a figurine cabinet is retrofitted, a nano-scale YOLO model suitable for embedded systems is selected as the base model; if the method is implemented as a mobile phone APP, a small visual base model meeting the APP's compatibility requirements is likewise selected. Further, a multi-angle, multi-illumination sampling method with the figurine as the origin is adopted, and cameras at different display distances are simulated through multiple rotation radii to collect sample videos and images of figurines of various brand IPs. Further, after extracting frames from the sample videos, multi-level classification labels are annotated with an image annotation tool, the samples are divided into a training set, a validation set, and a test set according to a proportion, and two-stage model training is performed on a CPU or GPU according to the multi-level labels (the first-stage model is trained on brand IP classes, and the second-stage model on sub-product image classes under a brand IP). Further, the model is converted into a format suitable for the target operating system, and the software environment on which the model depends is installed.
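For concreteness, the two-stage training and format conversion of step S1 might look like the following sketch, assuming the Ultralytics YOLO classification variant as the base model. The nano checkpoint name, dataset paths, epochs, and image size are illustrative assumptions; the patent only calls for a nano-scale YOLO base model.

```python
# Minimal sketch of step S1, under the assumption that Ultralytics YOLO
# classification models are used. Paths and hyperparameters are invented.
from ultralytics import YOLO  # pip install ultralytics

# First-stage model: brand IP classification. The dataset directory is
# expected to contain train/ val/ test/ splits with one folder per class.
brand_model = YOLO("yolov8n-cls.pt")
brand_model.train(data="datasets/brand_ip", epochs=50, imgsz=224)

# Second-stage model: sub-product image classes under one brand IP.
sub_model = YOLO("yolov8n-cls.pt")
sub_model.train(data="datasets/brand_a_sub_products", epochs=50, imgsz=224)

# Convert to a format suited to the target system: e.g., ONNX for an
# embedded runtime, or TFLite for a mobile phone APP.
brand_model.export(format="onnx")
sub_model.export(format="tflite")
```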
Step S2, configuring a multi-level recognition model strategy, comprising a first-level classification model for recognizing brand IP (e.g., a Brand A puppy), a second-level classification model (optionally configured on demand) for recognizing sub-product images under a brand IP class (e.g., a chef puppy), and a third-level model (optionally configured on demand) that uses the original base model to recognize the figurine's wearing features (e.g., a spatula).

Step S3, executing the figurine recognition and intelligent interaction flow, comprising periodically calling a camera to photograph the figurine placement area, submitting the image to the multi-level visual recognition models and forming a structured figurine feature tag set from the models that hit, playing a welcome introduction voice according to the tag set, and, if a network connection exists, passing the recognized feature tags as parameters over the WebSocket protocol to the streaming end-to-end voice large model cloud service to realize streaming intelligent dialogue interaction; if no network connection exists, locally stored voice is used for the interaction. A sketch of this loop follows.
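The periodic recognition loop of step S3 could be sketched as follows, assuming OpenCV for camera capture and the two classification models trained in step S1. The model paths, 0.6 confidence threshold, 5-second polling period, and the welcome/dialogue hooks are illustrative assumptions, not values from the patent.

```python
# Hedged sketch of the step S3 loop: periodic capture, multi-level
# cascade inference, structured tag set, then welcome voice + dialogue.
import time

import cv2  # pip install opencv-python
from ultralytics import YOLO

brand_model = YOLO("brand_ip_cls.pt")      # first level: brand IP class
sub_model = YOLO("sub_product_cls.pt")     # second level (optional)


def recognize(frame):
    """Run the multi-level cascade; return a structured tag set or None."""
    brand = brand_model(frame, verbose=False)[0].probs
    if float(brand.top1conf) < 0.6:        # assumed hit threshold
        return None                        # no figurine recognized
    tags = {"brand_ip": brand_model.names[brand.top1]}
    sub = sub_model(frame, verbose=False)[0].probs
    if float(sub.top1conf) >= 0.6:
        tags["sub_product"] = sub_model.names[sub.top1]
    return tags


cap = cv2.VideoCapture(0)                  # camera facing the placement area
while True:
    ok, frame = cap.read()
    if ok and (tags := recognize(frame)):
        print("play welcome voice for:", tags)
        # With a network connection, open the WebSocket dialogue session
        # (see the sketch after the claims) with `tags` as the parameter;
        # otherwise fall back to locally stored voice.
    time.sleep(5)                          # assumed polling interval
```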