CN-121999759-A - Network broadcasting method and system realized by intelligent generation technology

Abstract

The invention relates to the technical field of network audio and discloses a network broadcasting method and system implemented using intelligent generation technology. The method comprises: obtaining a target broadcast duration of a podcast program and generating a script text; performing content type identification on the script text and content-aware duration estimation based on the content type, and generating a duration-annotated script carrying estimated-duration annotations; performing semantically complete unit identification to obtain a connecting sentence set and a content sentence set, and constructing a semantic dependency topological graph; performing semantic-aware truncation on the duration-annotated script according to the semantic dependency topological graph to obtain truncated script content; generating a compensating ending text and appending it to the truncated script content to generate a complete truncated script text; and submitting the complete truncated script text to speech synthesis to obtain the network broadcast audio. The invention achieves efficient duration compression while strictly meeting the target broadcast duration constraint, and effectively preserves the integrity of the script's semantic reference relationships after truncation.

Inventors

WU HAO
HOU DAWEI
MA CHENYANG
GU GUOYING
LI RUOXUAN
XU XIONG
YAN WENWEN
GUO CHUNHUI

Assignees

Jiangsu Broadcasting Corporation (江苏省广播电视总台)

Dates

Publication Date: 2026-05-08
Application Date: 2026-04-10

Claims (10)

1. A network broadcasting method implemented using intelligent generation technology, characterized by comprising the following steps: acquiring a target broadcast duration of a podcast program, and generating a script text according to the target broadcast duration; performing content type identification on the script text, performing content-aware duration estimation on the script text based on the content type, generating a duration-annotated script carrying estimated-duration annotations, performing semantically complete unit identification on the duration-annotated script to obtain a connecting sentence set and a content sentence set, and constructing a semantic dependency topological graph; based on the estimated-duration annotations and the target broadcast duration, performing semantic-aware truncation on the duration-annotated script according to the semantic dependency topological graph to obtain truncated script content, generating a compensating ending text based on the truncated script content, and appending the compensating ending text to the truncated script content to generate a complete truncated script text; and submitting the complete truncated script text to speech synthesis to obtain network broadcast audio implemented using the intelligent generation technology.
2. The network broadcasting method implemented using intelligent generation technology according to claim 1, wherein generating the script text according to the target broadcast duration comprises: calculating a base word-count upper limit based on the target broadcast duration and a preset reference per-character duration coefficient; pre-scanning the received original material text, counting the expected proportions of number-dense paragraphs and term-dense paragraphs in the original material text, and generating a content density correction coefficient based on the expected proportions; and calculating a corrected word-count upper limit based on the base word-count upper limit and the content density correction coefficient, and generating the script text with the corrected word-count upper limit as a generation constraint parameter.
3. The network broadcasting method implemented using intelligent generation technology according to claim 2, wherein performing content type identification on the script text comprises: traversing the script text; identifying character sequences formed by consecutive digit characters in the script text and marking them as a digit string type; identifying character sequences formed by consecutive capital letters that do not form a complete English word and marking them as an English abbreviation type; identifying character sequences formed by two or more consecutive punctuation marks and marking them as a continuous punctuation pause type; uniformly marking the digit string type and the English abbreviation type as high-expansion-coefficient content types, marking the continuous punctuation pause type as a time-domain pause content type, and marking the remaining character sequences in the script text as a reference content type; and counting the distribution positions of the high-expansion-coefficient content types and the time-domain pause content type in the script text to obtain a content type distribution map.
4. The network broadcasting method implemented using intelligent generation technology according to claim 3, wherein performing content-aware duration estimation on the script text based on the content type comprises: dividing the script text into continuous text segments, and determining, according to a preset acoustic duration reference, a base pronunciation duration of the reference content type and a reference silence duration of the time-domain pause content type in each continuous text segment; performing syllable decomposition on the high-expansion-coefficient content types in the continuous text segments according to the content type distribution map, and scaling the base pronunciation duration based on the actual number of syllables after decomposition to obtain a base expansion duration; if a character spacing distance is smaller than a preset broadcast cognitive safety distance threshold, determining that the corresponding region forms a dense broadcast interval, and constructing an information density penalty factor according to the character spacing distance, wherein the information density penalty factor is inversely proportional to the character spacing distance and takes a preset maximum value as the character spacing distance approaches zero; performing time-domain feature aggregation on the base pronunciation duration, the reference silence duration and the rhythm-corrected expansion duration in the continuous text segment to obtain a segment estimated duration; and attaching a cumulative estimated duration sequence to the script text in the form of structured annotations to obtain the duration-annotated script carrying the estimated-duration annotations.
5. The network broadcasting method implemented using intelligent generation technology according to claim 4, wherein performing semantically complete unit identification on the duration-annotated script to obtain a connecting sentence set and a content sentence set and constructing a semantic dependency topological graph comprises: performing semantic structure analysis on the text content of the duration-annotated script to identify the semantically complete units in the text content; scanning for connective words at the sentence-initial position of each semantically complete unit, and at the same time scanning whether the interior of the semantically complete unit carries an anaphoric component that refers back to preceding content; looking up the corresponding time interval in the cumulative estimated duration sequence based on the start character position and the end character position of the semantically complete unit, and taking the difference between the end cumulative duration and the start cumulative duration of the time interval as the exact duration cost of the semantically complete unit; and, for each connecting sentence among the semantically complete units that carries an anaphoric component, identifying the target semantically complete unit it refers to and adding a directed dependency edge from the connecting sentence node to the target semantically complete unit node.
6. The network broadcasting method implemented using intelligent generation technology according to claim 5, wherein performing semantic-aware truncation on the duration-annotated script according to the semantic dependency topological graph comprises: calculating the sum of all exact duration costs in the connecting sentence set and the content sentence set to obtain a total full-text duration; in response to the total full-text duration exceeding the target broadcast duration, determining a leaf node set of nodes with zero degree in the semantic dependency topological graph, wherein zero degree indicates that the current node does not depend on any other semantically complete unit; traversing all start nodes of dependency edges in the semantic dependency topological graph whose end point is the current removed node, marking each start node that has lost its semantic reference target as a dangling node and removing the dangling node, and updating the total full-text duration according to the exact duration cost of the dangling node; repeating the steps of removing the current removed node from the removal candidate sequence, determining and removing dangling nodes, and updating the total full-text duration until the total full-text duration does not exceed the target broadcast duration; and extracting the text content of the nodes retained in the semantic dependency topological graph in the order of their start character positions in the script text and splicing it in sequence to obtain the truncated script content.
7. The network broadcasting method implemented using intelligent generation technology according to claim 6, wherein generating a compensating ending text based on the truncated script content comprises: determining a total truncated-script duration based on the total full-text duration after the semantic-aware truncation; taking a remaining available duration as a time constraint, and extracting content summaries from the removed node set to generate a news-flash short-sentence set; and taking the news-flash short-sentence set as a generation constraint input, generating a compensating ending text whose duration does not exceed the remaining available duration, wherein the compensating ending text summarizes the removed node set and closes with a preset closing phrase, and appending the compensating ending text to the end of the truncated script content to generate the complete truncated script text.
8. A network broadcasting system implemented using intelligent generation technology, using the method of any one of claims 1-7, characterized by comprising: a script text generation module, configured to obtain the target broadcast duration of a podcast program and generate a script text according to the target broadcast duration; a duration annotation module, configured to perform content type identification on the script text, perform content-aware duration estimation on the script text based on the content type, generate a duration-annotated script carrying estimated-duration annotations, perform semantically complete unit identification on the duration-annotated script to obtain a connecting sentence set and a content sentence set, and construct a semantic dependency topological graph; a text truncation module, configured to perform, based on the estimated-duration annotations and the target broadcast duration, semantic-aware truncation on the duration-annotated script according to the semantic dependency topological graph to obtain truncated script content, generate a compensating ending text based on the truncated script content, and append the compensating ending text to the truncated script content to generate a complete truncated script text; and a network broadcast generation module, configured to submit the complete truncated script text to speech synthesis to obtain network broadcast audio implemented using the intelligent generation technology.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
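
As an illustration of the content type identification recited in claim 3, the following Python sketch tags digit strings, English abbreviations and runs of punctuation with a regular expression and returns their positions as a content type distribution map. It is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `content_type_map`, the label strings, the two-letter minimum for an abbreviation and the tiny `KNOWN_WORDS` lexicon (a stand-in for a real English dictionary check) are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a real English lexicon: an all-caps run found here
# counts as a complete English word and is therefore NOT an abbreviation.
KNOWN_WORDS = {"NEWS", "TODAY"}

# One alternative per content type named in claim 3.
TOKEN = re.compile(
    r"(?P<digits>\d+)"        # consecutive digit characters
    r"|(?P<caps>[A-Z]{2,})"   # consecutive capital letters (assumed: 2 or more)
    r"|(?P<punct>[^\w\s]{2,})"  # two or more consecutive punctuation marks
)

Span = Tuple[int, int, str]  # (start, end, content type label)


def content_type_map(text: str) -> List[Span]:
    """Return the positions of non-reference content types in `text`.

    Anything not covered by a returned span is the reference content type.
    """
    spans: List[Span] = []
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind == "caps" and m.group() in KNOWN_WORDS:
            continue  # complete English word: stays reference content
        if kind == "punct":
            label = "time_domain_pause"
        else:
            # digit strings and English abbreviations are merged into one type
            label = "high_expansion"
        spans.append((m.start(), m.end(), label))
    return spans
```

Calling `content_type_map("GDP grew 5 percent... NEWS")` marks `GDP` and `5` as high-expansion content and `...` as a time-domain pause, while `NEWS` stays reference content because it is in the assumed lexicon. A production system would need a proper dictionary lookup and a language-specific definition of punctuation.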

Description

Network broadcasting method and system realized by intelligent generation technology

Technical Field

The invention relates to the technical field of network audio, and in particular to a network broadcasting method and system implemented using intelligent generation technology.

Background

In recent years, content generation technology built around large language models has been widely introduced into podcast production. By combining automatic script generation with text-to-speech (TTS) synthesis, it enables end-to-end automated production from material input to playable audio. To conform to platform scheduling specifications, a linear pipeline of target duration, word-count estimation, script generation and speech synthesis is generally adopted: the average speech rate and the character count serve as the basis for estimating the broadcast duration, content that runs over time is handled after generation by hard truncation or simple positional truncation, and the result is finally submitted to the engine for audio synthesis and output.
However, in real streaming media distribution scenarios with strict time limits, a linear pipeline based on basic character counts struggles to achieve the expected broadcast precision and content continuity. On the one hand, existing duration prediction mechanisms rely only on the basic character count: they do not separate out the physical acoustic expansion effects of heterogeneous text such as digit strings and English abbreviations, nor do they account for the cumulative delay that densely distributed heterogeneous text imposes on the actual broadcast rhythm. The duration calculated from plain text therefore deviates from the actual rendering time, which wastes underlying rendering compute and causes real synthesis timeouts. On the other hand, when a timeout triggers forced compression, existing truncation strategies ignore the syntactic dependency relationships and logical skeleton of the script text. Such coarse truncation without semantic continuity not only breaks the logic of the network broadcast program and leaves context dangling, making the listening experience stiff, but also permanently discards the core factual information carried in the cut segments as invalid data, causing one-way information loss. As a result, the service requirements of high-concurrency, high-quality intelligent audio broadcasting cannot be met.

Disclosure of Invention

The present invention has been made in view of the above-described problems.
In order to solve the above technical problems, the invention provides the following technical solution: a network broadcasting method implemented using intelligent generation technology, comprising the following steps: acquiring a target broadcast duration of a podcast program, and generating a script text according to the target broadcast duration; performing content type identification on the script text, performing content-aware duration estimation on the script text based on the content type, generating a duration-annotated script carrying estimated-duration annotations, performing semantically complete unit identification on the duration-annotated script to obtain a connecting sentence set and a content sentence set, and constructing a semantic dependency topological graph; based on the estimated-duration annotations and the target broadcast duration, performing semantic-aware truncation on the duration-annotated script according to the semantic dependency topological graph to obtain truncated script content, generating a compensating ending text based on the truncated script content, and appending the compensating ending text to the truncated script content to generate a complete truncated script text; and submitting the complete truncated script text to speech synthesis to obtain network broadcast audio implemented using the intelligent generation technology.
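The semantic-aware truncation (cut-off) step above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patented procedure: the names `Unit` and `truncate_script` are hypothetical; a "leaf" is taken to be a unit with no outgoing dependency edge; removal candidates are tried from the end of the script backward, because the text does not say how the candidate sequence is ordered; and dangling-node removal is assumed to cascade.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Unit:
    start: int   # start character position of the unit in the script text
    text: str    # text content of the semantically complete unit
    cost: float  # exact duration cost in seconds


def truncate_script(
    units: Dict[str, Unit],
    deps: Dict[str, Set[str]],
    target: float,
) -> Tuple[str, List[str]]:
    """Remove units until the total duration fits `target`.

    deps[a] holds the units that `a` refers back to (directed edges a -> b).
    Returns the spliced text of the retained units and the removed unit ids.
    """
    kept = set(units)
    removed: List[str] = []
    total = sum(u.cost for u in units.values())

    def drop(node: str) -> None:
        nonlocal total
        stack = [node]
        while stack:
            n = stack.pop()
            if n not in kept:
                continue
            kept.discard(n)
            removed.append(n)
            total -= units[n].cost
            # Start nodes of edges ending at n have lost their reference
            # target: they are dangling and are removed as well (assumed
            # to cascade).
            stack.extend(s for s in kept if n in deps.get(s, ()))

    # Assumed candidate order: leaves (no outgoing edge), last in script first.
    candidates = sorted(
        (n for n in units if not deps.get(n)),
        key=lambda n: units[n].start,
        reverse=True,
    )
    for node in candidates:
        if total <= target:
            break
        drop(node)

    ordered = sorted(kept, key=lambda n: units[n].start)
    return "".join(units[n].text for n in ordered), removed
```

The list of removed unit ids is returned alongside the text because the compensating ending text is generated from what was cut. If the candidates run out before the target is met, this sketch simply returns what remains; a real system would need a fallback for that case.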
Generating the script text according to the target broadcast duration comprises: calculating a base word-count upper limit based on the target broadcast duration and a preset reference per-character duration coefficient; pre-scanning the received original material text, counting the expected proportions of number-dense paragraphs and term-dense paragraphs in the original material text, and generating a content density correction coefficient based on the expected proportions; and calculating a corrected word-count upper limit based on the base word-count upper limit and the content density correction coefficient, and generating the script text with the corrected word-count upper limit as a generation constraint parameter. The invention relates to a preferable scheme of an internet broadca