CN-116778962-B - Word granularity timestamp determining method, electronic equipment and storage medium

Abstract

Embodiments of the invention disclose a method, an electronic device, and a storage medium for determining word granularity timestamps. The time point corresponding to the probability peak of the current word is determined as the starting point of the current word; the ending point of the current word is determined according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak; and the timestamp of the current word is then determined from the starting point and the ending point. This addresses two technical problems, namely the time offset of word granularity timestamps and the inclusion of non-speech durations (such as silence or noise) in a word's timestamp, thereby improving the accuracy of word granularity timestamps.

Inventors

SONG SHASHA
WEI GUANGHUI
LI ZHIFEI

Assignees

上海墨百意信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20230619

Claims (7)

1. A method for determining word granularity timestamps, the method comprising: determining a probability peak of each word in the target audio; determining the time point corresponding to the probability peak of the current word as the starting point of the current word; in response to the current word not being the tail word of the target audio, determining the ending point of the current word according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak; and determining the timestamp of the current word according to the starting point and the ending point; wherein determining the ending point of the current word according to the difference comprises: in response to the difference being not greater than a first predetermined duration, determining the time point corresponding to the adjacent next probability peak as the ending point of the current word; wherein determining the timestamp of the current word according to the starting point and the ending point comprises: moving the starting point and the ending point forward by a third predetermined duration, respectively; and determining the timestamp of the current word according to the moved starting point and ending point; and wherein the starting point is the time point corresponding to the probability peak of the current word, the ending point is the time point corresponding to the next probability peak adjacent to the probability peak of the current word, and the third predetermined duration is smaller than the difference between the starting point and the ending point.
2. The method of claim 1, wherein determining the probability peak of each word in the target audio comprises: decoding the target audio with a preset decoding algorithm to determine the probability peak of each word in the target audio.
3. The method of claim 1, wherein determining the ending point of the current word according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak comprises: in response to the difference being greater than the first predetermined duration, determining the time point obtained by moving the starting point backward by a second predetermined duration as the ending point of the current word.
4. The method of claim 3, wherein determining the timestamp of the current word according to the starting point and the ending point comprises: moving the starting point forward by a third predetermined duration; and determining the timestamp of the current word according to the moved starting point and the ending point.
5. The method of claim 1, wherein the method further comprises: in response to the current word being the tail word of the target audio, determining the difference between the time point corresponding to the probability peak of the current word and the time point at which the target audio ends; in response to that difference being not greater than a second predetermined duration, determining the time point at which the target audio ends as the ending point of the current word; and in response to that difference being greater than the second predetermined duration, determining the time point obtained by moving the starting point backward by the second predetermined duration as the ending point of the current word.
6. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-5.
7. A computer readable storage medium, on which computer program instructions are stored, which computer program instructions, when executed by a processor, implement the method of any of claims 1-5.
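
The branching in claims 1 and 3 to 5 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `Thresholds` and `word_timestamps`, the use of integer milliseconds, and the clamp at zero are assumptions. "Forward" is read here as "earlier in time" and "backward" as "later in time". The claims state no shift for the tail word, so none is applied to it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Thresholds:
    """Hypothetical container for the three predetermined durations (ms)."""
    first: int   # largest peak-to-peak gap still treated as continuous speech
    second: int  # word length assumed when the gap (or the audio tail) is long
    third: int   # shift toward earlier time applied to peak-based points


def word_timestamps(peaks: List[int], audio_end: int,
                    t: Thresholds) -> List[Tuple[int, int]]:
    """Map per-word peak times (ms, ascending) to (start, end) timestamps."""
    stamps = []
    for i, start in enumerate(peaks):
        if i < len(peaks) - 1:              # not the tail word
            nxt = peaks[i + 1]
            if nxt - start <= t.first:      # claim 1: short gap
                # End at the next peak, then shift both points earlier.
                s, e = start - t.third, nxt - t.third
            else:                           # claims 3 and 4: long gap
                # End a fixed duration after the start; shift only the start.
                s, e = start - t.third, start + t.second
        else:                               # claim 5: tail word
            if audio_end - start <= t.second:
                s, e = start, audio_end
            else:
                s, e = start, start + t.second
        # Assumption: clamp so a shifted start never precedes the audio.
        stamps.append((max(s, 0), e))
    return stamps


if __name__ == "__main__":
    t = Thresholds(first=400, second=300, third=100)
    print(word_timestamps([300, 500, 1500], audio_end=1800, t=t))
    # [(200, 400), (400, 800), (1500, 1800)]
```

In the example, the 1000 ms gap between the second and third peaks exceeds the first predetermined duration, so the second word is capped at 300 ms after its peak instead of absorbing the gap.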

Description

Word granularity timestamp determining method, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a word granularity timestamp determining method, an electronic device, and a storage medium.

Background

With the development of speech recognition technology, the accuracy requirements for word granularity timestamps in speech recognition keep increasing. In the prior art, a CTC (Connectionist Temporal Classification) algorithm takes the start time of a word from the time of the previous probability peak and the end time of the word from the time of the current word's probability peak, so that the peak times serve as the word's timestamp. However, the probability peak predicted by CTC lags slightly behind the time point of the actual pronunciation, so using the peak times directly as the timestamp may make the start time and end time of the word inaccurate. In addition, because the audio may contain silence, noise, and the like, the duration of such non-speech segments may be included in the word's timestamp, which reduces the accuracy of word granularity timestamps.

Disclosure of Invention

In view of the above, the present invention aims to provide a method, an electronic device, and a storage medium for determining word granularity timestamps, so as to solve the technical problems of the time offset of word granularity timestamp information and the inclusion of non-speech durations in a word's timestamp, and thereby improve the accuracy of word granularity timestamp information.
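
For contrast, the prior-art scheme described in the Background (a word runs from the previous probability peak to its own peak) can be sketched as below. The function name `baseline_timestamps` and the choice to start the first word at time 0 are assumptions made only for illustration.

```python
from typing import List, Tuple


def baseline_timestamps(peaks: List[int]) -> List[Tuple[int, int]]:
    """Prior-art style stamps: each word spans [previous peak, own peak] in ms."""
    stamps = []
    prev = 0  # assumption: the first word starts at the beginning of the audio
    for p in peaks:
        stamps.append((prev, p))
        prev = p
    return stamps


if __name__ == "__main__":
    print(baseline_timestamps([300, 500, 1500]))
    # [(0, 300), (300, 500), (500, 1500)]
```

The third word is stamped as lasting 1000 ms, which shows how a pause between two peaks ends up inside a word's timestamp. Because every boundary sits on a peak, the lag of the peak behind the actual pronunciation also carries over unchanged.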
In a first aspect, a method for determining a word granularity timestamp is provided, the method comprising: determining a probability peak of each word in the target audio; determining the time point corresponding to the probability peak of the current word as the starting point of the current word; in response to the current word not being the tail word of the target audio, determining the ending point of the current word according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak; and determining the timestamp of the current word according to the starting point and the ending point.

In some embodiments, determining the probability peak of each word in the target audio includes: decoding the target audio with a preset decoding algorithm to determine the probability peak of each word in the target audio.

In some embodiments, determining the ending point of the current word according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak includes: in response to the difference being not greater than a first predetermined duration, determining the time point corresponding to the adjacent next probability peak as the ending point of the current word.

In some embodiments, determining the timestamp of the current word according to the starting point and the ending point includes: moving the starting point and the ending point forward by a third predetermined duration, respectively; and determining the timestamp of the current word according to the moved starting point and ending point.
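
The text says only that the peaks come from "a preset decoding algorithm". As one possible reading, the sketch below assumes greedy CTC decoding over a per-frame posterior matrix and takes, for each emitted token, the frame where its probability is highest. The function name `ctc_peaks`, the blank index 0, and the 40 ms frame step are hypothetical choices, not values from the source.

```python
from typing import List, Tuple


def ctc_peaks(posteriors: List[List[float]], blank: int = 0,
              frame_ms: int = 40) -> List[Tuple[int, int, float]]:
    """Return (token_id, peak_time_ms, peak_prob) for each greedily decoded token.

    posteriors[t][k] is the probability of token k at frame t.
    """
    peaks: List[Tuple[int, int, float]] = []
    prev = blank
    for t, frame in enumerate(posteriors):
        tok = max(range(len(frame)), key=frame.__getitem__)  # argmax over tokens
        prob = frame[tok]
        if tok != blank:
            if tok != prev:
                # A new token begins: record its first frame as the peak so far.
                peaks.append((tok, t * frame_ms, prob))
            elif prob > peaks[-1][2]:
                # Same token repeated: keep the frame with the highest probability.
                peaks[-1] = (tok, t * frame_ms, prob)
        prev = tok
    return peaks


if __name__ == "__main__":
    post = [
        [0.90, 0.05, 0.05],  # blank
        [0.20, 0.70, 0.10],  # token 1
        [0.10, 0.80, 0.10],  # token 1, higher probability
        [0.90, 0.05, 0.05],  # blank
        [0.10, 0.10, 0.80],  # token 2
    ]
    print(ctc_peaks(post))
    # [(1, 80, 0.8), (2, 160, 0.8)]
```

The peak times returned here are what the method then uses as word starting points; mapping token ids to words is left out of the sketch.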
In some embodiments, determining the ending point of the current word according to the difference between the time point corresponding to the probability peak of the current word and the time point corresponding to the adjacent next probability peak includes: in response to the difference being greater than the first predetermined duration, determining the time point obtained by moving the starting point backward by a second predetermined duration as the ending point of the current word.

In some embodiments, determining the timestamp of the current word according to the starting point and the ending point includes: moving the starting point forward by a third predetermined duration; and determining the timestamp of the current word according to the moved starting point and the ending point.

In some embodiments, the third predetermined duration is less than the difference between the starting point and the ending point.

In some embodiments, the method further comprises: in response to the current word being the tail word of the target audio, determining the difference between the time point corresponding to the probability peak of the current word and the time point at which the target audio ends; determining the time point at which the target audio ends as the ending point of the current word in response to the difference between the time point corresponding to the probability peak of the current word and t