CN-122023967-A - Industry large model construction system and equipment for live broadcast scene
Abstract
The invention discloses an industry large model construction system and equipment oriented to live broadcast scenes, and relates to the technical field of large model processing for the live broadcast industry. The system constructs an initial live broadcast scene industry large model; extracts visual features containing semantic information from a target live broadcast scene picture; converts the visual features into a feature vector sequence; calculates the semantic confidence and spatial association of each visual feature vector; and clips and encodes the vectors accordingly. The semantic representation of the visual feature vectors is then fused with text data; the context information of the target live broadcast scene is structured and written into the context information key values of the initial model; those key values are incrementally updated to obtain updated context information key values; cloud-edge collaborative training of the initial model yields fusion model training parameters; and the initial model is updated with these parameters to obtain the final target live broadcast scene industry large model.
Inventors
- Xiong Wenchang
- Yuan Liang
Assignees
- 广州灵鲸科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-03
Claims (10)
- 1. An industry large model construction system oriented to a live broadcast scene, characterized by comprising the following modules: an initial model construction module, used for constructing an initial live broadcast scene industry large model; a visual feature extraction module, used for extracting visual features containing semantic information from a target live broadcast scene picture; a visual feature processing module, used for converting the visual features containing semantic information into a feature vector sequence and calculating the semantic confidence and spatial association of each visual feature vector in the sequence; a clipping and encoding module, used for clipping and encoding the visual feature vectors according to the semantic confidence and spatial association of each visual feature vector in the sequence; a fusion alignment module, used for fusing the semantic representation of the visual feature vectors with text data to obtain a multi-modal visual-semantic fusion representation; a context maintenance module, used for reading the context information of the target live broadcast scene, representing it structurally, and writing the structured context information into the context information key values of the initial live broadcast scene industry large model; an incremental update module, used for incrementally updating the context information key values of the initial live broadcast scene industry large model to obtain updated context information key values; an asymmetric collaborative training module, used for carrying out cloud-edge collaborative training on the initial live broadcast scene industry large model to obtain fusion model training parameters; and a model update module, used for updating the initial live broadcast scene industry large model based on the fusion model training parameters to obtain a final target live broadcast scene industry large model.
- 2. The industry large model construction system for a live broadcast scene according to claim 1, wherein the specific process of extracting visual features containing semantic information from the target live broadcast scene picture comprises: deploying a preset vision-language model at a target live broadcast scene edge server to obtain live real-time picture data; inputting the live real-time picture data into the preset vision-language model for semantic analysis; and obtaining, through the visual feature extraction layer of the preset vision-language model, visual features containing semantic information.
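The claim leaves the concrete vision-language model open. Below is a minimal Python sketch of the edge-side data flow only, with a stub standing in for the preset model; the stub class, its pooling scheme, and all dimensions are illustrative assumptions, not the patent's method.

```python
import numpy as np

class StubVisionLanguageModel:
    """Stand-in for the preset vision-language model deployed on the
    edge server (hypothetical; the patent names no concrete model).
    It only shows the data flow: frame in, semantic feature vectors out."""
    def __init__(self, feature_dim: int = 16):
        self.feature_dim = feature_dim

    def extract_features(self, frame: np.ndarray) -> np.ndarray:
        # A real model would run a visual feature extraction layer here;
        # the stub pools the frame into a fixed number of region vectors.
        n_regions = 4
        pooled = frame.reshape(n_regions, -1).mean(axis=1, keepdims=True)
        return np.repeat(pooled, self.feature_dim, axis=1)

def process_live_frame(model: StubVisionLanguageModel,
                       frame: np.ndarray) -> np.ndarray:
    """Edge-side step from claim 2: feed one live picture to the model
    and return visual features carrying semantic information."""
    return model.extract_features(frame)
```

In practice the stub would be replaced by a real vision-language encoder; only the frame-to-feature-vector interface is what claim 2 describes.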
- 3. The industry large model construction system for a live broadcast scene according to claim 1, wherein the visual feature processing module comprises a visual feature vector generation unit, a semantic confidence evaluation unit and a spatial association calculation unit; the visual feature vector generation unit is used for converting visual features containing semantic information into a visual feature vector sequence; the semantic confidence evaluation unit is used for calculating the semantic confidence of each visual feature vector in the sequence; and the spatial association calculation unit is used for calculating the spatial association of each visual feature vector in the sequence.
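The claims do not fix formulas for semantic confidence or spatial association. One plausible reading, sketched here purely as an assumption, takes confidence as the max softmax probability of a per-vector classifier head and spatial association as the mean cosine similarity of a vector with its sequence neighbours:

```python
import numpy as np

def semantic_confidence(logits: np.ndarray) -> np.ndarray:
    """Confidence per feature vector, taken here as the max softmax
    probability of a per-vector classifier head (an assumption; the
    patent does not give a formula). logits: (N, num_classes)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numeric stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def spatial_association(features: np.ndarray) -> np.ndarray:
    """Spatial association per vector, taken here as the mean cosine
    similarity with its immediate neighbours in the sequence."""
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = (norm[:-1] * norm[1:]).sum(axis=1)   # similarity of i and i+1
    assoc = np.zeros(len(features))
    assoc[0], assoc[-1] = sims[0], sims[-1]     # endpoints: one neighbour
    assoc[1:-1] = (sims[:-1] + sims[1:]) / 2    # interior: both neighbours
    return assoc
```

Both quantities come out in convenient ranges (confidence in (0, 1], association in [-1, 1]), which makes the threshold comparison of claim 4 straightforward.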
- 4. The industry large model construction system for a live broadcast scene according to claim 1, wherein the specific process of clipping the visual feature vectors comprises: acquiring the semantic confidence of each visual feature vector calculated by the semantic confidence evaluation unit; presetting a semantic confidence threshold and a spatial association threshold, and comparing the semantic confidence and spatial association of each visual feature vector with the respective preset thresholds; removing the visual feature vectors whose semantic confidence is smaller than the preset semantic confidence threshold; and removing the visual feature vectors whose spatial association is smaller than the preset spatial association threshold, so as to obtain a clipped visual feature vector sequence.
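The clipping rule of claim 4 reduces to two threshold comparisons. A minimal sketch (the threshold values are illustrative; the claim leaves them to be preset by the implementer):

```python
import numpy as np

def clip_feature_vectors(features: np.ndarray,
                         confidence: np.ndarray,
                         association: np.ndarray,
                         conf_threshold: float = 0.5,
                         assoc_threshold: float = 0.3) -> np.ndarray:
    """Keep only the vectors meeting both preset thresholds; removing
    low-confidence then low-association vectors, as claim 4 does in two
    passes, is equivalent to this single mask."""
    keep = (confidence >= conf_threshold) & (association >= assoc_threshold)
    return features[keep]
```

Clipping before encoding shrinks the sequence the downstream modules must process, which is the point of doing it on the edge side.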
- 5. The industry large model construction system for a live broadcast scene according to claim 1, wherein the fusion alignment module comprises a visual semantic token generation unit, a text token acquisition unit and a fusion alignment unit; the visual semantic token generation unit is used for converting the semantic representation of the visual feature vectors into a visual semantic token sequence that can be directly input into the initial live broadcast scene industry large model; the text token acquisition unit is used for converting text data in the target live broadcast scene into a text token sequence that can be directly input into the initial live broadcast scene industry large model; and the fusion alignment unit fuses the visual semantic token sequence and the text token sequence to obtain a multi-modal visual-semantic fusion representation, which serves as the input of the initial live broadcast scene industry large model.
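Claim 5 does not fix the fusion operator. One common multimodal scheme, shown here as an assumption, projects the visual feature vectors into the text embedding space and concatenates the two token sequences along the sequence axis:

```python
import numpy as np

class FusionAligner:
    """Sketch of the fusion alignment unit: a linear projection maps
    visual feature vectors into the text embedding space, then the two
    token sequences are concatenated. The projection is randomly
    initialised here; in practice it would be trained."""
    def __init__(self, visual_dim: int, text_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((visual_dim, text_dim)) * 0.02

    def fuse(self, visual_feats: np.ndarray,
             text_embeds: np.ndarray) -> np.ndarray:
        visual_tokens = visual_feats @ self.proj       # (Nv, text_dim)
        # Visual tokens first, text tokens after, as one model input.
        return np.concatenate([visual_tokens, text_embeds], axis=0)
```

Because both modalities end up in the same embedding dimension, the fused sequence can be fed to the initial model exactly like an ordinary text token sequence.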
- 6. The industry large model construction system for a live broadcast scene according to claim 1, wherein the context maintenance module specifically comprises a context information structuring unit, a writing unit and a maintenance unit; the context information structuring unit is used for structurally representing the read context information of the target live broadcast scene; the writing unit is used for writing the structured context information of the target live broadcast scene into the context information key values of the initial live broadcast scene industry large model; and the maintenance unit is used for carrying out cache updating, expiration cleaning and capacity control on the context information of the target live broadcast scene in the key values.
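The maintenance unit's three duties (cache updating, expiration cleaning, capacity control) can be sketched as a key-value store with TTL expiration and LRU eviction; both policies are illustrative choices, since the claim names the duties but not the algorithms:

```python
import time
from collections import OrderedDict

class ContextKVStore:
    """Sketch of the context-maintenance key-value store."""
    def __init__(self, capacity: int = 100, ttl_seconds: float = 300.0):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (value, write_time)

    def write(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now)
        self._store.move_to_end(key)  # cache updating: refresh recency
        self._evict(now)

    def read(self, key, now=None):
        now = time.time() if now is None else now
        item = self._store.get(key)
        if item is None or now - item[1] > self.ttl:
            self._store.pop(key, None)  # drop expired entry on access
            return None
        self._store.move_to_end(key)
        return item[0]

    def _evict(self, now):
        # Expiration cleaning, then capacity control (least recent first).
        expired = [k for k, (_, t) in self._store.items() if now - t > self.ttl]
        for k in expired:
            del self._store[k]
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)
```

The explicit `now` parameter is only there to make the policy deterministic to exercise; a production store would use the wall clock.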
- 7. The industry large model construction system for a live broadcast scene according to claim 1, wherein the incremental update module comprises a context retrieval unit and an incremental calculation unit; the context retrieval unit is used for retrieving, from the context information key values of the initial live broadcast scene industry large model, the historical context information key values matching the current input information; and the incremental calculation unit is used for incrementally updating the current input information by means of an incremental cache update mechanism, according to the historical context information key values acquired by the context retrieval unit.
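The incremental cache update mechanism is not specified further; a minimal sketch, assuming a shallow merge of the matching historical entry with the current input:

```python
def incremental_update(store: dict, key: str, new_info: dict) -> dict:
    """Retrieve the historical entry matching the key, overlay only the
    fields carried by the current input, and write the merged entry
    back (a shallow merge is an assumption; the patent only requires
    that the update be incremental rather than a full rewrite)."""
    history = store.get(key, {})
    merged = {**history, **new_info}   # new fields overwrite stale ones
    store[key] = merged
    return merged
```

The benefit over a full rewrite is that unchanged context fields survive each turn, so the model's context key values accumulate instead of being rebuilt.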
- 8. The industry large model construction system for a live broadcast scene according to claim 1, wherein the asymmetric collaborative training module specifically comprises a cloud model training unit, an edge model training unit, and a parameter sharing and transmission unit; the cloud model training unit is used for acquiring target live broadcast scene data and using it as training data to train the initial live broadcast scene industry large model uploaded to the cloud server, obtaining cloud model training parameters; the edge model training unit is used for acquiring local data of the target live broadcast scene and using it as training data to train the initial live broadcast scene industry large model uploaded to the edge server, obtaining edge model training parameters; and the parameter sharing and transmission unit fuses the cloud model training parameters and the edge model training parameters to obtain the fusion model training parameters.
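Claim 8's parameter fusion can be sketched as a per-tensor convex combination of cloud and edge parameters, in the style of federated averaging; the weighting is an illustrative assumption, since the claim only requires that the two parameter sets be fused:

```python
import numpy as np

def fuse_parameters(cloud_params: dict, edge_params: dict,
                    cloud_weight: float = 0.7) -> dict:
    """Fuse cloud and edge training parameters tensor by tensor.
    cloud_weight is a hypothetical knob: higher values trust the
    cloud model trained on the broader scene data more."""
    w = cloud_weight
    return {name: w * cloud_params[name] + (1 - w) * edge_params[name]
            for name in cloud_params}
```

The fused parameters then drive the model update module of claim 1, yielding the final target live broadcast scene industry large model.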
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the industry large model construction system for a live broadcast scene according to any of claims 1-8.
- 10. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed, implements the industry large model construction system for a live broadcast scene according to any of claims 1-8.
Description
Industry large model construction system and equipment for live broadcast scene

Technical Field

The invention relates to the technical field of large model processing for the live broadcast industry, and in particular to an industry large model construction system and equipment for a live broadcast scene.

Background

With the rapid development of the live broadcast industry, live broadcast platforms have an increasingly urgent demand for intelligent applications, and the rapid development of large language model technology facilitates context-understanding applications in the live broadcast industry. The prior art has the following problems: in a traditional live broadcast scene, understanding of the live broadcast environment and the context is inconsistent and real-time reasoning efficiency is limited, and a traditional general-purpose large language model faces technical bottlenecks in edge-side semantic visual analysis, visual-text semantic alignment, and continuous context maintenance. An industry large model construction system oriented to the live broadcast scene is therefore needed to solve these problems.
Disclosure of Invention

In order to solve the above technical problems, one aspect of the present invention provides an industry large model construction system for a live broadcast scene, which comprises the following modules: an initial model construction module, used for constructing an initial live broadcast scene industry large model; a visual feature extraction module, used for extracting visual features containing semantic information from a target live broadcast scene picture; a visual feature processing module, used for converting the visual features into a feature vector sequence and calculating the semantic confidence and spatial association of each visual feature vector in the sequence; a clipping and encoding module, used for clipping and encoding the visual feature vectors according to the semantic confidence and spatial association of each visual feature vector; a fusion alignment module, used for fusing the semantic representation of the visual feature vectors with text data to obtain a multi-modal visual-semantic fusion representation; a context maintenance module, used for reading the context information of the target live broadcast scene, representing it structurally, and writing the structured context information into the context information key values of the initial live broadcast scene industry large model; an incremental update module, used for incrementally updating those context information key values to obtain updated context information key values; an asymmetric collaborative training module, used for carrying out cloud-edge collaborative training on the initial live broadcast scene industry large model to obtain fusion model training parameters; and a model update module, used for updating the initial live broadcast scene industry large model based on the fusion model training parameters to obtain a final target live broadcast scene industry large model.

In a preferred embodiment, the specific process of extracting visual features containing semantic information from the target live broadcast scene picture includes: deploying a preset vision-language model at a target live broadcast scene edge server to obtain live real-time picture data; inputting the live real-time picture data into the preset vision-language model for semantic analysis; and obtaining, through the visual feature extraction layer of the preset vision-language model, visual features containing semantic information.

In a preferred embodiment, the visual feature processing module comprises a visual feature vector generation unit, a semantic confidence evaluation unit and a spatial association calculation unit; the visual feature vector generation unit is used for converting visual features containing semantic information into a visual feature vector sequence; the semantic confidence evaluation unit is used for calculating the semantic confidence of each visual feature vector in the sequence; and the spatial association calculation unit is used for calculating the spatial association of each visual feature vector in the sequence.

In a preferred embodiment, the specific process of clipping the visual feature vectors comprises: acquiring the semantic confidence of each visual feature vector calculated by the semantic confidence evaluation unit; presetting a semantic confidence threshold and a spatial association threshold, and comparing the semantic confidence of each calculated visual feature vector with the