CN-122020553-A - Live broadcast scene automatic identification and adaptation method and system based on AI

Abstract

The invention provides an AI-based live broadcast scene automatic identification and adaptation method and system, belonging to the technical field of artificial intelligence. The method collects multi-modal data from a live broadcast stream in real time to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters. It performs dynamic region segmentation on the visual frame sequence, extracts voiceprint features from the audio waveform, clusters keywords from the text semantic stream, and fuses the network state parameters to generate an original feature set for the live scene. By fusing the multi-modal data and performing scene semantic modeling, the method captures the deep visual, audio, textual, and network-state features of the broadcast, so that scene types and semantic information can be identified accurately even in complex composite scenes, greatly improving the accuracy and comprehensiveness of scene recognition.

Inventors

  • ZHANG XIAOLI
  • GUO HAIJUAN

Assignees

  • 北京普汇润泽科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (10)

  1. An AI-based live broadcast scene automatic identification and adaptation method, characterized by comprising the following steps: S1, acquiring multi-modal data of a live broadcast stream in real time to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters; S2, carrying out multi-modal feature fusion processing on the original feature set of the live scene, constructing a structured scene fingerprint, carrying out scene semantic modeling, identifying the basic scene type, further analyzing deep semantic features, and generating composite scene semantic description data; S3, carrying out dynamic resource demand prediction on the composite scene semantic description data, obtaining and analyzing the corresponding result, and generating a resource collaborative configuration scheme; S4, issuing real-time instructions to edge computing nodes, driving the four modules to operate cooperatively, and, after the edge nodes complete local optimization, uploading key scene features to the cloud for global model updating, forming an edge-cloud collaborative feedback closed loop; and S5, performing end-to-end experience generation according to the dynamically optimized scene adaptation parameter set, completing the related operations under a low-delay constraint, scoring the generated result in real time through an experience quality assessment model, triggering a re-identification process when the score falls below a threshold, forming full-process closed-loop control, and finally generating exclusive live broadcast experience data.
  2. The AI-based live scene automatic identification and adaptation method according to claim 1, wherein S1 comprises: S11, carrying out real-time multi-modal data acquisition on the live broadcast stream to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters; S12, synchronously preprocessing the visual frame sequence, the audio waveform, the text semantic stream, and the network state parameters to align the multi-modal data timestamps and generate a time-synchronized multi-modal data set; S13, extracting inter-frame difference information from the visual frame sequence in the time-synchronized multi-modal data set and performing dynamic region segmentation to generate a foreground target region set and a background environment region set (a segmentation sketch follows the claims); S14, taking the audio waveform in the time-synchronized multi-modal data set and combining it with the temporal change features of the foreground target region set to extract voiceprint features, generating a human-voice feature vector and an environment-sound feature vector; and S15, extracting the text semantic stream from the time-synchronized multi-modal data set, carrying out keyword clustering based on word segmentation results and word frequency statistics to generate topic semantic clusters (a clustering sketch also follows the claims), and carrying out feature dimension calibration by fusing the foreground target region set, the background environment region set, the human-voice feature vector, the environment-sound feature vector, the topic semantic clusters, and the network state parameters to generate the live broadcast scene original feature set.
  3. The AI-based live scene automatic identification and adaptation method according to claim 1, wherein S2 comprises: S21, carrying out multi-modal feature fusion processing based on the original feature set of the live scene, integrating the information of each dimension by feature-dimension weighting and splicing to generate an initial scene feature matrix; S22, constructing a semantic association network based on the structured scene fingerprint and performing scene semantic modeling; S23, matching a preset scene feature library through the semantic association network and identifying the basic scene type; and S24, further analyzing deep semantic features in combination with temporal change features and interactive behavior features, and integrating the basic scene type and the deep semantic features to generate composite scene semantic description data.
  4. The AI-based live scene automatic identification and adaptation method according to claim 3, wherein S21 comprises: S211, calling the original feature set of the live broadcast scene and splitting the visual, audio, text, and network features of each dimension to generate single-dimension feature subsets; S212, calculating the information contribution degree of each standardized single-dimension feature set to generate a set of feature weight coefficients, and weighting the standardized single-dimension feature sets based on the feature weight coefficients to generate weighted single-dimension feature sets; S213, splicing the weighted single-dimension feature sets in order and integrating the information of each dimension to generate the initial scene feature matrix; S214, screening highly overlapping features based on a feature correlation matrix to generate a redundant feature list, and removing the features on the redundant feature list from the initial scene feature matrix to generate a redundancy-removed feature matrix; and S215, performing dimension compression on the redundancy-removed feature matrix while retaining the core feature information, generating a low-dimensional core feature matrix, and constructing the structured scene fingerprint based on the low-dimensional core feature matrix (a fusion sketch covering claims 4 and 5 follows the claims).
  5. The AI-based live scene automatic identification and adaptation method according to claim 4, wherein S215 comprises: extracting the redundancy-removed feature matrix and calculating the variance contribution value of each feature to generate a feature variance contribution sequence; performing secondary numerical normalization on the core feature candidate set to eliminate numerical interference during dimension compression and generate a normalized candidate feature matrix; performing dimension conversion on the normalized candidate feature matrix with a feature space mapping algorithm to reduce the feature dimensionality and generate a low-dimensional transition feature matrix; calculating the information coincidence degree between the low-dimensional transition feature matrix and the redundancy-removed feature matrix to generate an information retention evaluation result; and merging the numerical information of the low-dimensional core feature matrix with the structural information of the feature association spectrum to generate the structured scene fingerprint.
  6. The AI-based live scene automatic identification and adaptation method according to claim 1, wherein S3 comprises: S31, constructing a resource demand prediction model based on the composite scene semantic description data, inputting scene dynamic change features and historical resource configuration data, and performing dynamic resource demand prediction to generate a coding complexity index, a beauty-rendering priority, a subtitle generation strategy, and a directing switching frequency; S32, analyzing the coding resource requirements of different video regions based on the coding complexity index and dynamically distributing code rates to the coding module to generate a differential code rate configuration scheme; S33, analyzing the temporal rhythm and importance of the text semantic stream with reference to the subtitle generation strategy and adjusting the parameters of a natural language processing model to generate a subtitle generation parameter set; S34, analyzing the temporal association of multi-camera signals according to the directing switching frequency and intelligently mixing the multi-camera signals to generate a shot switching sequence and mixing rules; and S35, fusing the differential code rate configuration scheme, the computing power allocation sequence, the subtitle generation parameter set, the shot switching sequence, and the mixing rules to generate the resource collaborative configuration scheme.
  7. The AI-based live scene automatic identification and adaptation method according to claim 6, wherein S32 comprises: calling the coding complexity index and the video region division data, establishing a correspondence between each region and its complexity index, and generating a region complexity correspondence table; setting basic code rate intervals for the different complexity levels and generating a set of regional basic code rates in combination with each region's area ratio; correcting the values in the regional basic code rate set with a code rate adjustment coefficient to obtain a bandwidth-adapted target code rate for each region and generate a regional target code rate list; integrating the regional target code rate list to generate the differential code rate configuration scheme; retrieving the beauty-rendering priority data and the list of rendering tasks currently to be executed, establishing a mapping between tasks and priorities, and generating a task priority comparison table; sorting the rendering tasks from high to low priority to generate an ordered task queue, analyzing the rendering parameters and effect requirements of each task in the queue, estimating the computing power it requires, and generating task computing power demand values; and monitoring the current running state of the graphics processing unit, counting the occupied and remaining computing power to obtain the total available computing power, allocating the available computing power to each task in proportion to its priority and demand to generate task computing power allocation results, and integrating the allocation results in the order of the task queue to generate the computing power allocation sequence (an allocation sketch follows the claims).
  8. The AI-based live scene automatic identification and adaptation method according to claim 1, wherein S4 comprises: S41, generating standardized control instructions based on the resource collaborative configuration scheme and issuing real-time instructions to the edge computing nodes, wherein each edge computing node parses the standardized control instructions and drives the coding module, beauty module, subtitle module, and directing module to operate cooperatively, completing the real-time processing of the live stream; S42, after an edge node completes local optimization, extracting the key scene features and resource consumption data from the processing and generating edge-node processing feedback data; S43, uploading the edge-node processing feedback data to the cloud, where the global model is updated in combination with the feedback data from multiple edge nodes to correct the scene recognition and resource prediction parameters; and S44, forming the edge-cloud collaborative feedback closed loop, continuously improving the generation accuracy of the structured scene fingerprint through global model updates, and generating the dynamically optimized scene adaptation parameter set based on the optimized structured scene fingerprint.
  9. The AI-based live scene automatic identification and adaptation method according to claim 1, wherein S5 comprises: S51, analyzing the end-to-end experience generation requirements based on the dynamically optimized scene adaptation parameter set, and starting coding quality adjustment, beauty effect adaptation, dynamic subtitle generation, and directing picture switching under a low-delay constraint; S52, after the end-to-end experience generation is completed, generating initial live broadcast experience data and calling the experience quality assessment model to score it in real time, quantifying picture clarity, audio stability, subtitle accuracy, and directing fluency; S53, comparing the score with a preset threshold and triggering the re-identification process when the score falls below the threshold (a scoring sketch follows the claims); and S54, forming full-process closed-loop control of understanding, decision, execution, and feedback, continuously iterating to optimize the effect of each stage, and finally generating the exclusive live broadcast experience data.
  10. A system for implementing the AI-based live scene automatic identification and adaptation method of claim 1, the system comprising: a data acquisition module, used for carrying out real-time multi-modal data acquisition on the live broadcast stream to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters; a feature fusion module, used for carrying out multi-modal feature fusion processing on the original feature set of the live scene, constructing a structured scene fingerprint, carrying out scene semantic modeling based on the structured scene fingerprint, identifying the basic scene type, further analyzing deep semantic features, and generating composite scene semantic description data; a demand prediction module, used for carrying out dynamic resource demand prediction on the composite scene semantic description data, obtaining and analyzing the corresponding result, and generating a resource collaborative configuration scheme; a dynamic optimization module, used for issuing real-time instructions to the edge computing nodes through the resource collaborative configuration scheme, driving the four modules to operate cooperatively, and, after the edge nodes complete local optimization, uploading key scene features to the cloud for global model updating, forming an edge-cloud collaborative feedback closed loop; and a closed-loop control module, used for performing end-to-end experience generation according to the dynamically optimized scene adaptation parameter set, completing the related operations under a low-delay constraint, scoring the generated result in real time through the experience quality assessment model, triggering the re-identification process when the score falls below a threshold, forming full-process closed-loop control of understanding, decision, execution, and feedback, and finally generating exclusive live broadcast experience data.
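
Illustrative sketches

The inter-frame difference segmentation of S13 in claim 2 admits a direct reading: difference consecutive frames, binarize, and keep the large moving blobs as the foreground target region set. Below is a minimal sketch assuming OpenCV and NumPy; the threshold and minimum blob area are illustrative choices, not values from the patent.

```python
import cv2
import numpy as np

def segment_dynamic_regions(prev_frame, curr_frame, diff_thresh=25, min_area=500):
    """Split a frame into a foreground target region set and a background mask."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Inter-frame difference information (S13), binarized into a motion mask.
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, motion_mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    # Close small gaps so moving objects form contiguous blobs.
    kernel = np.ones((5, 5), np.uint8)
    motion_mask = cv2.morphologyEx(motion_mask, cv2.MORPH_CLOSE, kernel)
    # Keep only blobs large enough to count as foreground targets.
    contours, _ = cv2.findContours(motion_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    foreground_regions = [cv2.boundingRect(c) for c in contours
                          if cv2.contourArea(c) >= min_area]
    background_mask = cv2.bitwise_not(motion_mask)  # background environment set
    return foreground_regions, background_mask
```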
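
The keyword clustering of S15 in claim 2 is described only as word segmentation plus word-frequency statistics followed by clustering. A minimal stand-in, assuming scikit-learn (the patent names no library): TF-IDF term statistics clustered with k-means into topic semantic clusters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_keywords(text_chunks, n_topics=5):
    """Assign each chunk of the live text stream to a topic semantic cluster."""
    # The default tokenizer stands in for the claimed word-segmentation step;
    # Chinese text would need a real segmenter (e.g. jieba) plugged in here.
    vectorizer = TfidfVectorizer(max_features=2000)
    term_matrix = vectorizer.fit_transform(text_chunks)  # word-frequency stats
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(term_matrix)
    return labels
```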
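
Claims 4 and 5 describe weighting standardized single-dimension feature sets by information contribution, splicing them, pruning correlated columns, and compressing dimensions into the structured scene fingerprint. The sketch below is one plausible reading: variance share stands in for "information contribution degree", and PCA stands in for the unnamed "feature space mapping algorithm", folding the variance-contribution ranking and the mapping into a single step. Both substitutions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_scene_fingerprint(feature_sets, corr_thresh=0.95, var_keep=0.90):
    """feature_sets: per-modality arrays of shape (n_samples, d_i),
    assumed already standardized as claim 4 requires."""
    # S212: information contribution degree, read here as each modality's
    # share of the total variance.
    contrib = np.array([f.var(axis=0).sum() for f in feature_sets])
    weights = contrib / contrib.sum()
    # S213: ordered splicing into the initial scene feature matrix.
    matrix = np.hstack([f * w for f, w in zip(feature_sets, weights)])
    # S214: drop one column of every highly correlated pair (redundancy list).
    corr = np.abs(np.corrcoef(matrix, rowvar=False))
    redundant = np.unique(np.where(np.triu(corr, k=1) > corr_thresh)[1])
    pruned = np.delete(matrix, redundant, axis=1)
    # S215: dimension compression retaining `var_keep` of the variance,
    # yielding the low-dimensional core feature matrix ("fingerprint").
    return PCA(n_components=var_keep).fit_transform(pruned)
```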
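
Claim 7 pairs two proportional allocators: per-region code rates derived from a complexity index and area ratio, and GPU computing power shares derived from task priority and estimated demand. The field names and the linear weighting below are assumptions; the patent specifies the inputs and outputs but not the formulas.

```python
def allocate_region_bitrates(regions, total_kbps, adjust=1.0):
    """regions: dicts with 'complexity' (0-1) and 'area_ratio'; returns the
    differential code rate configuration as kbps per region."""
    raw = [r["complexity"] * r["area_ratio"] for r in regions]
    scale = total_kbps * adjust / sum(raw)  # code rate adjustment coefficient
    return [round(w * scale) for w in raw]

def allocate_compute(tasks, total_tflops):
    """tasks: dicts with 'priority' (higher first) and 'demand'; returns the
    ordered task queue with proportional computing power shares attached."""
    ordered = sorted(tasks, key=lambda t: t["priority"], reverse=True)
    weight_sum = sum(t["priority"] * t["demand"] for t in ordered)
    for t in ordered:
        t["tflops"] = total_tflops * t["priority"] * t["demand"] / weight_sum
    return ordered
```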
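
S52-S53 in claim 9 reduce to a weighted score over the four named quality measures and a threshold gate that triggers re-identification. A minimal sketch; the weights and the 0-1 metric scale are illustrative assumptions.

```python
QOE_WEIGHTS = {"picture_clarity": 0.35, "audio_stability": 0.25,
               "subtitle_accuracy": 0.20, "directing_fluency": 0.20}

def score_experience(metrics, threshold=0.7):
    """metrics: the four 0-1 quality measurements named in claim 9."""
    score = sum(QOE_WEIGHTS[k] * metrics[k] for k in QOE_WEIGHTS)
    # S53: a score below the preset threshold triggers re-identification.
    return score, score < threshold
```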

Description

Live broadcast scene automatic identification and adaptation method and system based on AI

Technical Field

The invention provides an AI-based live broadcast scene automatic identification and adaptation method and system, and belongs to the technical field of artificial intelligence.

Background

With the current rapid development of the live broadcast industry, users have increasingly demanding requirements for the live experience and expect a dedicated, high-quality viewing experience across different scenes. However, current live scene recognition and adaptation techniques have many limitations. The prior art mostly adopts single-modality recognition, relying only on visual, audio, or textual information alone, so that scenes are judged one-sidedly. For example, a visual model may classify a scene as a conference merely because several people are shown seated, while ignoring the sales pitch in the audio or the product images shared on screen, leading to misclassification. Meanwhile, scene definitions are rigid: a preset, limited set of categories struggles with complex and changeable composite scenes, and a case such as a live education session that includes an experimental demonstration is difficult to classify accurately. The adaptation stage also suffers. Parameter adjustment lags scene switches by several seconds, seriously harming the consistency of the user experience. Moreover, modules such as encoding, beautification, and subtitling operate independently; resource scheduling lacks coordination, and computing power and bandwidth cannot be allocated dynamically according to scene priority, causing either wasted resources or under-provisioned key stages. Finally, existing schemes focus on improving single functions, such as more accurate classification, filter switching, or intelligent beautification; they cannot deeply fuse the deep semantic features of a scene with multi-modal sensing, dynamic resource scheduling, and an edge-cloud collaborative architecture into a complete adaptive live broadcast operating system, and thus cannot achieve end-to-end, scene-exclusive experience generation.
Disclosure of Invention

The invention provides an AI-based live broadcast scene automatic identification and adaptation method and system to solve the problems noted in the background. The method comprises the following steps: S1, acquiring multi-modal data of a live broadcast stream in real time to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters; S2, carrying out multi-modal feature fusion processing on the original feature set of the live scene, constructing a structured scene fingerprint, carrying out scene semantic modeling based on the structured scene fingerprint, identifying the basic scene type, and further analyzing deep semantic features to generate composite scene semantic description data; S3, carrying out dynamic resource demand prediction on the composite scene semantic description data, obtaining and analyzing the corresponding result, and generating a resource collaborative configuration scheme; S4, issuing real-time instructions to the edge computing nodes through the resource collaborative configuration scheme, driving the four modules to operate cooperatively, and, after the edge nodes complete local optimization, uploading key scene features to the cloud for global model updating, forming an edge-cloud collaborative feedback closed loop; and S5, performing end-to-end experience generation according to the dynamically optimized scene adaptation parameter set, completing the related operations under a low-delay constraint, scoring the generated result in real time through the experience quality assessment model, triggering a re-identification process when the score falls below a threshold, forming full-process closed-loop control, and finally generating exclusive live broadcast experience data.

The system for realizing the AI-based live scene automatic identification and adaptation method comprises: a data acquisition module, used for carrying out real-time multi-modal data acquisition on the live broadcast stream to generate a visual frame sequence, an audio waveform, a text semantic stream, and network state parameters; a feature fusion module, used for carrying out multi-modal feature fusion processing on the original feature set of the live scene, constructing a structured scene fingerprint, carrying out scene semantic modeling based on the structured scene fingerprint, identifying the basic scene type, further analyzing deep semantic features, and generating composite scene semantic description data; and a demand prediction module, a dynamic optimization module, and a closed-loop control module, which respectively perform steps S3, S4, and S5 described above.
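
Read end to end, S1-S5 form a sense, model, plan, act, and evaluate loop. The skeleton below names each stage after the disclosure's steps; every callable is a placeholder to be supplied by the corresponding module, not a real API, and the score-keyed result dict is an assumed convention.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AdaptationPipeline:
    collect: Callable[[Any], Any]         # S1: multimodal acquisition
    fuse_and_model: Callable[[Any], Any]  # S2: fingerprint + scene semantics
    plan_resources: Callable[[Any], Any]  # S3: collaborative configuration
    run_on_edge: Callable[[Any], Any]     # S4: edge execution, cloud feedback
    generate: Callable[[Any], dict]       # S5: returns {"score": float, ...}
    threshold: float = 0.7

    def step(self, stream, max_retries=2):
        for _ in range(max_retries + 1):
            scene = self.fuse_and_model(self.collect(stream))
            plan = self.plan_resources(scene)
            self.run_on_edge(plan)
            result = self.generate(plan)
            if result["score"] >= self.threshold:
                break  # experience accepted; otherwise re-identify (S5 gate)
        return result  # the "exclusive live broadcast experience data"
```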