CN-121998802-A - Music teaching system based on speech recognition
Abstract
The invention discloses a music teaching system based on speech recognition, relating to the technical field of music feature fusion. A technical feature quantization module calculates a rhythm time-value deviation rate, a pitch accuracy rate, a musical interval tension scalar, a mode type scalar, a prosody boundary detection intensity, a linearity scalar, a discontinuity scalar and a semantic clarity scalar; an empathic reasoning module constructs an emotion theory mapping model and, based on it, generates an emotion expression scalar, an artistic connotation scalar and a sound narrative continuity scalar; and a teaching application module generates a personalized empathic dialogue instruction and a Dalcroze eurhythmics teaching instruction targeting the current technical feature deviation. The invention forms a complete system for deep perception, philosophical cognition and embodied teaching guidance of music performance, achieves deep fusion of technical features and application features, unifies previously isolated objective scoring and subjective artistic guidance, and addresses the core problems of the disconnect between technique and art and of rigid feedback.
Inventors
- XU JING
- LI PEIRUI
- ZHANG XINCHENG
- ZHAO ZIYU
- YANG XIAO
- HUANG YAN
- ZUO LINGHUI
- LI JIANG
Assignees
- 四川云数赋智教育科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-23
Claims (10)
- 1. A music teaching system based on speech recognition, characterized by comprising a technical feature quantization module, an empathic reasoning module and a teaching application module: the technical feature quantization module is used for acquiring music performance data of a learner through speech recognition and the musical instrument digital interface (MIDI), and, after performing time-series alignment based on dynamic time warping on the music performance data, calculating a rhythm value deviation rate, a pitch accuracy rate, a musical interval tension scalar, a mode type scalar, a prosody boundary detection intensity, a linearity scalar, a discontinuity scalar and a semantic clarity scalar; the empathic reasoning module is used for constructing an emotion theory mapping model and generating an emotion expression scalar, an artistic connotation scalar and a sound narrative continuity scalar based on the emotion theory mapping model; and the teaching application module is used for feeding the emotion expression scalar, the sound narrative continuity scalar and the artistic connotation scalar as conditions into a condition-driven dialogue generation model, to generate a personalized empathic dialogue instruction and a Dalcroze eurhythmics teaching instruction targeting the current technical feature deviation.
- 2. The music teaching system based on speech recognition of claim 1, wherein the technical feature quantization module includes: the music performance data comprises audio data of the learner's singing and playing, musical instrument digital interface data, and lyric text corresponding to the audio data; when the audio data of the learner's singing and playing comprises an original sound wave signal acquired through a microphone while singing or playing a non-MIDI instrument, the non-MIDI instruments comprise the human voice and acoustic instruments, and the original sound wave signal comprises the amplitude sample sequence, fundamental frequency track, loudness energy track and Mel-frequency cepstral coefficients of the sound signal stored in FLAC format on the time axis; the musical instrument digital interface data comprises the digital instruction sequence generated when the learner plays an electronic instrument, the digital instruction sequence comprising note-on events, note-off events, pitch data, velocity values and control change events; a note-on event includes the timestamp at which a key is pressed and the velocity value of the note; a note-off event includes the timestamp at which the key is released; the pitch data comprises discrete pitch values, represented by natural numbers 1-127, recorded in the musical score; the velocity value is a value attached to a note-on event and records the speed and pressure with which the key is pressed; the control change events include changes in the usage state of the pedal, modulation wheel and pitch-bend wheel controllers during the performance.
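To make the event structure of claim 2 concrete, here is a minimal Python sketch of pairing note-on and note-off events into note records carrying onset, offset, pitch and velocity. The simplified event dictionaries are an illustrative assumption, not the patent's actual data format.

```python
# Hypothetical event format: dicts with "type", "time", "pitch", "velocity".
from dataclasses import dataclass

@dataclass
class Note:
    onset: float       # timestamp of the note-on event (key pressed)
    offset: float      # timestamp of the note-off event (key released)
    pitch: int         # discrete pitch value (claim 2: natural number 1-127)
    velocity: int      # key-press speed/pressure attached to the note-on

def pair_events(events):
    """Pair note-on and note-off events of the same pitch into Note records."""
    open_notes, notes = {}, []
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["type"] == "note_on" and ev.get("velocity", 0) > 0:
            open_notes[ev["pitch"]] = ev
        elif ev["type"] in ("note_off", "note_on"):  # note_on with velocity 0 acts as note_off
            on = open_notes.pop(ev["pitch"], None)
            if on is not None:
                notes.append(Note(on["time"], ev["time"], ev["pitch"], on["velocity"]))
    return notes

events = [
    {"type": "note_on", "time": 0.00, "pitch": 60, "velocity": 90},
    {"type": "note_off", "time": 0.48, "pitch": 60},
    {"type": "note_on", "time": 0.50, "pitch": 64, "velocity": 75},
    {"type": "note_off", "time": 1.02, "pitch": 64},
]
print(pair_events(events))
```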
- 3. The music teaching system based on speech recognition of claim 2, wherein: technical features of the music are extracted after time-series alignment of the audio data and the musical instrument digital interface data of the learner's singing and playing, the technical features comprising the rhythm value deviation rate, the pitch accuracy rate, the musical interval tension scalar and the mode type scalar; the time-series alignment includes: converting the standard music score uploaded by the learner into a standard note event set, whose mathematical expression is $S = \{e_1, e_2, \dots, e_n\}$, $e_j = (T_j, P_j)$, where $S$ is the note event set, $e_j$ is the $j$-th standard note event, $T_j$ is the standard timestamp of the note (the midpoint between the note-on and note-off timestamps of the current note), $P_j$ is the standard pitch data of the note, and $n$ is the total number of notes in the standard score; converting the audio data of the learner's actual singing and playing into an actual note event set, which comprises the actual note events of all of the learner's actual singing and playing, has the same structure as the standard note event set, and whose actual note events each comprise the actual timestamp and the actual pitch data of a note; constructing a cost matrix of dimension $m \times n$, $m$ being the number of actual note events in the actual note event set, the cost matrix representing the local matching cost of matching note $i$ in the performed sequence against note $j$ in the standard sequence; the local matching cost is calculated as $c(i,j) = \alpha\,|p_i - P_j| + \beta\,|t_i - T_j|$, where $c(i,j)$ is the local matching cost, $p_i$ is the actual pitch data of the $i$-th note in the actual note event set, $t_i$ is the actual timestamp of the $i$-th note in the actual note event set, and $\alpha$ and $\beta$ are preset weight coefficients.
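A short sketch of the cost-matrix construction of claim 3, assuming the local cost $c(i,j) = \alpha|p_i - P_j| + \beta|t_i - T_j|$ reconstructed above; note events are represented as hypothetical (timestamp, pitch) tuples, with timestamps taken as note-on/note-off midpoints.

```python
import numpy as np

def local_cost_matrix(actual, standard, alpha=1.0, beta=1.0):
    """Local matching costs c(i, j) = alpha*|p_i - P_j| + beta*|t_i - T_j|."""
    C = np.zeros((len(actual), len(standard)))
    for i, (t_i, p_i) in enumerate(actual):
        for j, (T_j, P_j) in enumerate(standard):
            C[i, j] = alpha * abs(p_i - P_j) + beta * abs(t_i - T_j)
    return C

standard = [(0.25, 60), (0.75, 64), (1.25, 67)]   # (timestamp, pitch)
actual = [(0.30, 60), (0.80, 63), (1.20, 67)]
print(local_cost_matrix(actual, standard))
```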
- 4. The music teaching system based on speech recognition of claim 3, wherein: an $m \times n$ cumulative cost matrix $D$ is initialized, the initialization comprising setting the start point of the cumulative cost matrix $D$ to $D(1,1) = c(1,1)$, setting the first row to $D(1,j) = D(1,j-1) + c(1,j)$, following the rule that a first-row node can only be reached horizontally from its left predecessor, and setting the first column to $D(i,1) = D(i-1,1) + c(i,1)$, following the rule that a first-column node can only be reached from its upper predecessor; using dynamic programming, each element of the cumulative cost matrix $D$ is computed iteratively to find the minimum cumulative cost path from the start point to the end point, the iterative computation comprising: element $D(i,j)$ of the cumulative cost matrix equals the current local matching cost plus the minimum cumulative cost among the three adjacent predecessor nodes, expressed mathematically as $D(i,j) = c(i,j) + \min\{D(i-1,j),\, D(i,j-1),\, D(i-1,j-1)\}$; after the iterative computation is complete, the bottom-right element $D(m,n)$ of the cumulative cost matrix is the minimum total matching cost; backtracking in reverse from the minimum total matching cost $D(m,n)$ to the origin $D(1,1)$ along the path yielding the minimum cumulative cost (the path formed by the predecessor nodes chosen when selecting the minimum cumulative cost), the backtracked path is the optimal alignment path, and the optimal alignment path comprises index pairs $(i, j)$, yielding the temporally matched pairs of actual note events and standard note events.
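The dynamic-programming recursion and backtracking of claim 4 can be sketched as follows; this is a standard DTW implementation over a local cost matrix such as the one built above, using 0-based indexing rather than the claim's 1-based notation.

```python
import numpy as np

def dtw_align(C):
    """Cumulative cost matrix and backtracked optimal alignment path for C (m x n)."""
    m, n = C.shape
    D = np.zeros((m, n))
    D[0, 0] = C[0, 0]
    for j in range(1, n):                  # first row: horizontal moves only
        D[0, j] = D[0, j - 1] + C[0, j]
    for i in range(1, m):                  # first column: vertical moves only
        D[i, 0] = D[i - 1, 0] + C[i, 0]
    for i in range(1, m):                  # D(i,j) = c(i,j) + min of 3 predecessors
        for j in range(1, n):
            D[i, j] = C[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the bottom-right element to the origin along minimal predecessors.
    path, i, j = [(m - 1, n - 1)], m - 1, n - 1
    while i > 0 or j > 0:
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            _, (i, j) = min((D[i - 1, j - 1], (i - 1, j - 1)),
                            (D[i - 1, j], (i - 1, j)),
                            (D[i, j - 1], (i, j - 1)))
        path.append((i, j))
    return D[m - 1, n - 1], path[::-1]     # minimum total cost, index pairs (i, j)

cost, path = dtw_align(np.array([[0.1, 2.0, 3.0], [2.0, 0.2, 2.0], [3.0, 2.0, 0.1]]))
print(cost, path)
```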
- 5. The music teaching system based on speech recognition of claim 4, wherein: the mathematical expression of the rhythm value deviation rate is $R = \sum_{i=1}^{N} w_i \cdot \frac{|d_i - D_j|}{D_j}$, where $N$ is the total number of notes, $d_i$ is the duration of the $i$-th note in the actual note event set, the duration being the difference between the note-off and note-on timestamps, $D_j$ is the duration of the $j$-th note in the standard note event set to which the $i$-th actual note is temporally matched, and $w_i$ is the weight of the $i$-th note; the calculation of $w_i$ comprises: extracting key local features of the note, the key local features comprising the note's velocity value, the relative value of the note's duration, the note's signal-to-noise ratio, boundary notes and downbeat notes; the relative value of a note's duration is the proportion of the note's duration in the total duration of the piece; the note signal-to-noise ratio is the local signal-to-noise ratio of note $i$ over its duration; boundary notes are notes on potential prosody boundaries; a downbeat note is the first note of each measure conforming to the score's meter; the set of all key local features of any one note forms a one-dimensional local note feature vector, and the weight of the current note is calculated from the local note feature vector as $w_i = \frac{\exp(\gamma^{\top} f_i)}{\sum_{k=1}^{N} \exp(\gamma^{\top} f_k)}$, where $f_i$ is the local note feature vector of note $i$ and $\gamma$ is a preset feature sensitivity parameter; the mathematical expression of the pitch accuracy is $a_i = \max\!\left(0,\; 1 - \frac{|\delta_i|}{\delta_{\max}}\right)$, $A = \frac{1}{N}\sum_{i=1}^{N} a_i$, where $\delta_i$ is the pitch deviation of the $i$-th note, $\delta_{\max}$ is a preset maximum allowable pitch deviation threshold, and $A$ is the pitch accuracy; the mathematical expression of the musical interval tension scalar is $T = \frac{1}{K}\sum_{k=1}^{K} g(I_k)$, where $T$ is the musical interval tension scalar, $K$ is the total number of consecutive note pairs or chords, $I_k$ is the interval within the $k$-th note pair or chord, measured in semitones, and $g$ is a discrete interval tension function; the calculation of the mode type scalar comprises: collecting a pitch frequency histogram, namely the distribution histogram of the occurrence frequencies of each of the 12 chromatic pitch classes in the piece of music in the audio data and the musical instrument digital interface data; normalizing the pitch frequency histogram to obtain the actual scale distribution; calculating the cosine similarity between the actual scale distribution and each target mode template to obtain mode matching scores; and collecting all the cosine similarities into a one-dimensional vector to obtain the mode type scalar, whose expression is $M = \big(\cos(h, \tau_1), \cos(h, \tau_2), \dots\big)$, where $M$ is the mode type scalar, $\tau_l$ is a target mode template, and $\cos(h, \tau_l)$ is the cosine similarity between the actual scale distribution $h$ and the target mode template $\tau_l$; the highest-scoring element of the mode type scalar represents the dominant mode of the current piece.
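A sketch of the four metrics in claim 5, under the formula reconstructions given above. The softmax-style weight normalization in particular is an assumption read from the verbal definitions; the feature values, sensitivity vector and mode templates below are illustrative.

```python
import numpy as np

def note_weights(features, gamma):
    """Softmax-style weights from local note feature vectors (one row per note)."""
    scores = np.asarray(features, dtype=float) @ np.asarray(gamma, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def rhythm_deviation_rate(d_act, d_std, weights):
    """R = sum_i w_i * |d_i - D_j| / D_j over DTW-matched note pairs."""
    d_act, d_std, w = map(np.asarray, (d_act, d_std, weights))
    return float(np.sum(w * np.abs(d_act - d_std) / d_std))

def pitch_accuracy(deviations, max_dev):
    """Per-note accuracy a_i = max(0, 1 - |delta_i|/delta_max), averaged over notes."""
    a = np.clip(1.0 - np.abs(np.asarray(deviations, dtype=float)) / max_dev, 0.0, None)
    return float(a.mean())

def mode_type_scalar(hist, templates):
    """Cosine similarity between the normalized pitch-class histogram and each template."""
    h = np.asarray(hist, dtype=float)
    h = h / h.sum()
    return [float(h @ t / (np.linalg.norm(h) * np.linalg.norm(t)))
            for t in (np.asarray(t, dtype=float) for t in templates)]

w = note_weights([[0.8, 0.1], [0.5, 0.3], [0.9, 0.2]], gamma=[1.0, 2.0])
print(rhythm_deviation_rate([0.50, 0.52, 0.45], [0.50, 0.50, 0.50], w))
print(pitch_accuracy([0.2, -0.6, 0.1], max_dev=1.0))
```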
- 6. The music teaching system based on speech recognition of claim 5, wherein: prosodic structure analysis is performed on the audio data, extracting the prosody boundary detection intensity and a prosody processing scalar, the prosodic structure analysis comprising: traversing the loudness energy track of the audio data and marking as a potential prosody boundary any region where the acquired loudness energy is below a preset loudness energy threshold and the duration exceeds a minimum inter-phrase pause threshold; detecting whether the actual duration of the note preceding a potential prosody boundary exceeds the standard duration; if so, marking the note as a phrase-final lengthened note, computing the difference between the actual duration and the standard duration, recorded as the delay length, and computing the proportion of the delay length in the total duration of the piece corresponding to the current audio data, recorded as the duration extension proportion, the standard duration being the mean of the actual durations of all notes of the current audio data; detecting whether the fundamental frequency of the note at a potential prosody boundary falls; if so, marking the note as a phrase-ending marker note and computing the mean difference between the fundamental frequency of the current note and that of the preceding note, recorded as the fundamental frequency fall amplitude; multiplying the delay length, the duration extension proportion and the fundamental frequency fall amplitude of each note on a potential prosody boundary to obtain that note's intensity feature value, and averaging the intensity feature values of all notes on potential prosody boundaries to obtain the prosody boundary detection intensity scalar; and computing the product of the prosody boundary detection intensity scalar and the rhythm value deviation rate to obtain the prosody processing scalar.
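A minimal sketch of the two steps of claim 6: scanning the loudness track for pause regions, and forming the per-boundary intensity as the product of delay length, duration-extension proportion and fundamental-frequency fall. The frame hop, thresholds and per-note inputs are illustrative assumptions.

```python
import numpy as np

def find_pause_regions(loudness, hop_s, loud_thresh, min_pause_s):
    """Regions where loudness stays below loud_thresh for at least min_pause_s seconds."""
    below = np.asarray(loudness) < loud_thresh
    regions, start = [], None
    for k, b in enumerate(below):
        if b and start is None:
            start = k
        elif not b and start is not None:
            if (k - start) * hop_s >= min_pause_s:
                regions.append((start * hop_s, k * hop_s))
            start = None
    if start is not None and (len(below) - start) * hop_s >= min_pause_s:
        regions.append((start * hop_s, len(below) * hop_s))
    return regions

def boundary_intensity(boundary_notes, piece_duration, std_duration):
    """Mean of delay * extension-proportion * F0-fall over all boundary notes.

    boundary_notes: (actual_duration, f0_fall) for the note before each pause
    region; f0_fall is the mean F0 drop relative to the preceding note.
    """
    vals = []
    for dur, f0_fall in boundary_notes:
        delay = max(dur - std_duration, 0.0)
        vals.append(delay * (delay / piece_duration) * max(f0_fall, 0.0))
    return float(np.mean(vals)) if vals else 0.0

loudness = [0.9, 0.8, 0.02, 0.01, 0.02, 0.85, 0.9]
print(find_pause_regions(loudness, hop_s=0.1, loud_thresh=0.05, min_pause_s=0.2))
print(boundary_intensity([(0.7, 12.0), (0.6, 8.0)], piece_duration=30.0, std_duration=0.5))
```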
- 7. The music teaching system based on speech recognition of claim 6, wherein: musical structure analysis is performed on the audio data and the musical instrument digital interface data, extracting the linearity scalar and the discontinuity scalar, the musical structure analysis comprising: executing a pattern matching algorithm on the standard note event set to identify and acquire a motif set, the identification and acquisition of the motif set comprising: encoding a relative feature vector for every note in the standard note event set, the relative feature vector comprising a relative pitch feature and a relative rhythm feature, the relative pitch feature being the number of semitones between the current note and the preceding note, and the relative rhythm feature being the ratio of the current note's duration to the preceding note's duration; presetting a minimum motif length and a maximum motif length, motif length being measured in number of notes; traversing the standard note event set to obtain all complete melody fragments whose lengths lie between the minimum and maximum motif lengths, computing, with the complete melody fragment as the unit of measure, the note similarity between any two complete melody fragments using the longest common subsequence (LCS) algorithm, thereby obtaining the note similarity between each complete melody fragment and every other complete melody fragment in the standard note event set; counting, for each complete melody fragment, the number of its note similarities exceeding a preset note similarity threshold, and outputting the current complete melody fragment as a valid motif fragment when that count exceeds a preset minimum repetition count threshold; iteratively checking whether each valid motif fragment contains a shorter valid motif fragment; if so, outputting the shorter valid motif fragment as a basic motif fragment, otherwise outputting the valid motif fragment itself as a basic motif fragment, and combining all basic motif fragments into the motif set; recording the pitch contour, rhythm pattern and digital instruction sequence of each basic motif fragment, and structuring each element of the motif set into a feature vector comprising pitch contour, rhythm pattern and internal interval tension to obtain a motif feature set; traversing the standard note event set to obtain every occurrence position of each basic motif fragment in the piece, and identifying the development variation type at each occurrence position, the development variation types comprising repetition, inversion, expansion and fragmentation; the pitch variation features and rhythm variation features of each development variation type are obtained using big data, the development variation types together with their pitch and rhythm variation features are encoded without repetition, and the codes are combined into a variation coding vector; extracting the actual motif melody fragment of the actual note event set at the corresponding occurrence position, obtaining the variation coding vector of the actual motif melody fragment, and checking whether the variation coding vector of the actual motif melody fragment matches the variation coding vector of the basic motif fragment at the corresponding occurrence position, the matching logic being element-wise mathematical comparison of the non-repeating codes at corresponding positions of the variation coding vectors; based on the variation coding vectors of the actual motif melody fragments and of the basic motif fragments at corresponding positions, the linearity is calculated as $L = \frac{1}{K}\sum_{k=1}^{K} d_k$, with $d_k = \lVert v_k - v'_k \rVert_2$, where $L$ is the linearity, $K$ is the total number of tracked variations, $d_k$ is the Euclidean distance between the variation coding vectors of the actual motif melody fragment and of the basic motif fragment at corresponding positions, $v_k$ is the variation coding vector of the $k$-th basic motif fragment, and $v'_k$ is the variation coding vector of the $k$-th actual motif melody fragment; identifying musical interval tension scalars exceeding a dissonance threshold, monitoring the switching frequency of the mode type scalar within a unit time window, and identifying abrupt increases of the rhythm value deviation rate at non-rhythmic boundaries, an abrupt increase representing a discontinuous jump in the sound wave values; the discontinuity is calculated as $U = \bar{T} \cdot n_s \cdot \Delta R$, where $U$ is the discontinuity, $\bar{T}$ is the mean of the interval tension scalars exceeding the dissonance threshold, $n_s$ is the number of mode type scalar switches, and $\Delta R$ is the mean amplitude of the abrupt changes in the rhythm value deviation rate.
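A sketch of two pieces of claim 7: the relative-feature encoding with LCS note similarity, and the reconstructed linearity formula. Fragment enumeration, variation typing and the motif-length and repetition thresholds are omitted; notes are hypothetical (pitch, duration) tuples.

```python
import numpy as np

def relative_features(notes):
    """(semitone interval, duration ratio) of each note relative to its predecessor."""
    return [(p1 - p0, round(d1 / d0, 2))
            for (p0, d0), (p1, d1) in zip(notes, notes[1:])]

def lcs_similarity(a, b):
    """Longest-common-subsequence length between two relative-feature sequences,
    normalized by the longer sequence's length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

def linearity(basic_codes, actual_codes):
    """L = (1/K) * sum_k ||v_k - v'_k||_2 over tracked development variations."""
    return float(np.mean([np.linalg.norm(np.asarray(v) - np.asarray(v2))
                          for v, v2 in zip(basic_codes, actual_codes)]))

frag_a = relative_features([(60, 0.5), (62, 0.5), (64, 1.0)])
frag_b = relative_features([(67, 0.5), (69, 0.5), (71, 1.0)])   # transposed repeat
print(lcs_similarity(frag_a, frag_b))                           # 1.0: same relative shape
print(linearity([[1, 0, 0], [0, 1, 0]], [[1, 0, 0], [0, 0, 1]]))
```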
- 8. The music teaching system based on speech recognition of claim 7, wherein: the standard tone of each Chinese character in the lyric text is looked up, the standard tones comprising the yin level tone (阴平), the yang level tone (阳平), the rising tone (上声) and the falling tone (去声), and a pitch trend is obtained from the standard tone, the correspondence between standard tone and pitch trend being: the yin level tone and the yang level tone correspond to a level trend, the rising tone to a rising trend, and the falling tone to a falling trend; the actual pitch trend of the fundamental frequency track of each Chinese character in the corresponding original sound wave signal segment and the standard pitch trend of the standard score for the corresponding segment are extracted, and the tone-melody violation rate and the semantic clarity scalar are calculated from the actual and standard pitch trends; whether the actual pitch trend of each Chinese character matches the standard pitch trend at the corresponding position is compared; if not, the character is marked as a tone-melody violation, and the ratio of the number of characters marked as tone-melody violations to the total number of characters in the original sound wave signal gives the tone-melody violation rate; subtracting the tone-melody violation rate from 1 gives the semantic clarity scalar.
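The tone-to-trend mapping and violation-rate calculation of claim 8 reduce to a simple comparison; a minimal sketch, with the tone labels and trend strings as illustrative placeholders:

```python
# Simplified tone-to-pitch-trend mapping from claim 8: both level tones map to
# a level contour, the rising tone to a rise, the falling tone to a fall.
TONE_TO_TREND = {
    "yin_ping": "level", "yang_ping": "level",
    "rising": "rise", "falling": "fall",
}

def tone_melody_violation_rate(char_tones, actual_trends):
    """Fraction of characters whose sung pitch trend contradicts the lexical tone."""
    violations = sum(1 for tone, trend in zip(char_tones, actual_trends)
                     if TONE_TO_TREND[tone] != trend)
    return violations / len(char_tones)

rate = tone_melody_violation_rate(
    ["yin_ping", "rising", "falling"], ["level", "fall", "fall"])
semantic_clarity = 1.0 - rate   # claim 8: semantic clarity scalar = 1 - violation rate
print(rate, semantic_clarity)
```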
- 9. The music teaching system based on speech recognition of claim 8, wherein the empathic reasoning module comprises an emotion theory mapping model construction unit and an empathy vector extraction unit: the emotion theory mapping model construction unit is used for constructing the emotion theory mapping model, which comprises: taking the rhythm value deviation rate and the mode type scalar as input features; weighting and summing the elements of the mode type scalar by multiplying them with a weight matrix, the weight matrix comprising the emotional tendency of each mode type in the mode type scalar, and normalizing the weighted sum to obtain the emotion polarity feature; an emotion polarity feature greater than 0.5 and less than 1 represents positive emotion, and an emotion polarity feature greater than 0 and less than 0.5 represents negative emotion; splicing the rhythm value deviation rate, the difference between 1 and the rhythm value deviation rate, and the emotion polarity feature into a fused feature vector, feeding it through a first fully connected layer with a ReLU nonlinear activation, then a second fully connected layer with a ReLU nonlinear activation, then a linear layer with a Softmax activation, yielding the emotion expression scalar, each element of which represents the matching probability between the performer's music performance and one emotion; the empathy vector extraction unit includes: performing artistic connotation mapping on the discontinuity scalar and the interval tension scalar according to a Nietzschean philosophy interpretation rule to obtain the artistic connotation scalar, the Nietzschean philosophy interpretation rule comprising: feeding the discontinuity scalar, the linearity scalar and the interval tension scalar into the emotion theory mapping model at the input positions of the rhythm value deviation rate, the difference between 1 and the rhythm value deviation rate, and the emotion polarity feature, and outputting the spirit component feature; feeding the difference between 1 and the rhythm value deviation rate, the rhythm value deviation rate, and the interval tension scalar into the emotion theory mapping model at the same input positions, and outputting the emotion component feature; splicing the spirit component feature and the emotion component feature to obtain the artistic connotation scalar; the semantic clarity scalar and the prosody boundary detection intensity scalar are fed into the emotion theory mapping model, and the sound narrative continuity scalar is output.
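A minimal forward-pass sketch of the emotion theory mapping model of claim 9 in plain NumPy. The sigmoid used to normalize the weighted mode-type sum is an assumption (the claim says only "normalization"), and the layer sizes and randomly initialized weights are illustrative; in the patent these would be learned.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def emotion_expression(rhythm_dev, mode_scalar, mode_weights, params):
    """Fuse features and run FC+ReLU -> FC+ReLU -> linear+Softmax, per claim 9."""
    weighted = np.asarray(mode_scalar) @ np.asarray(mode_weights)  # emotional-tendency weighting
    polarity = 1.0 / (1.0 + np.exp(-weighted))                     # assumed normalization to (0, 1)
    x = np.array([rhythm_dev, 1.0 - rhythm_dev, polarity])         # fused feature vector
    h1 = np.maximum(params["W1"] @ x + params["b1"], 0.0)          # first FC layer + ReLU
    h2 = np.maximum(params["W2"] @ h1 + params["b2"], 0.0)         # second FC layer + ReLU
    return softmax(params["W3"] @ h2 + params["b3"])               # linear layer + Softmax

rng = np.random.default_rng(0)
params = {"W1": rng.normal(size=(8, 3)), "b1": np.zeros(8),
          "W2": rng.normal(size=(8, 8)), "b2": np.zeros(8),
          "W3": rng.normal(size=(4, 8)), "b3": np.zeros(4)}
probs = emotion_expression(0.12, [0.8, 0.1, 0.1], rng.normal(size=3), params)
print(probs)   # per-emotion matching probabilities (four illustrative emotions)
```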
- 10. The music teaching system based on speech recognition of claim 9, wherein the teaching application module includes: feeding the emotion expression scalar, the sound narrative continuity scalar and the artistic connotation scalar as conditions into a condition-driven dialogue generation model (a large language model, LLM) to generate a personalized empathic dialogue targeting the current technical feature deviation, a technical feature deviation arising when any one of the rhythm value deviation rate, pitch accuracy rate, musical interval tension scalar or mode type scalar exceeds its preset threshold range; and performing index matching in a preset Dalcroze eurhythmics teaching instruction library according to the technical feature deviation type contained in the personalized empathic dialogue, and generating body movement instructions from the matched instruction template and the specific deviation values, the body movement instructions comprising a stride-change instruction for correcting rhythm, a body-extension instruction for perceiving intervals, and a mimetic action instruction for embodying the mode type.
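A hypothetical sketch of the teaching application module of claim 10: the scalar conditions are serialized into an LLM prompt, and the deviation type indexes a small eurhythmics instruction library. The prompt wording, library keys and templates are all illustrative assumptions, not the patent's library contents.

```python
# Illustrative Dalcroze eurhythmics instruction templates, keyed by deviation type.
INSTRUCTION_LIBRARY = {
    "rhythm": "Walk the beat: lengthen your stride by {dev:.0%} on each downbeat.",
    "interval": "Stretch both arms apart in proportion to the interval of {dev} semitones.",
    "mode": "Mimic the character of the dominant mode with a {dev} gesture.",
}

def build_dialogue_prompt(emotion, narrative, connotation, deviation_type, dev):
    """Serialize the three condition scalars into a prompt for the dialogue LLM."""
    return (
        f"Conditions: emotion={emotion:.2f}, narrative_continuity={narrative:.2f}, "
        f"artistic_connotation={connotation:.2f}.\n"
        f"Write an empathic coaching message for a {deviation_type} deviation of {dev}."
    )

def eurhythmics_instruction(deviation_type, dev):
    """Index-match the deviation type and fill the template with the deviation value."""
    return INSTRUCTION_LIBRARY[deviation_type].format(dev=dev)

print(build_dialogue_prompt(0.82, 0.67, 0.54, "rhythm", 0.15))
print(eurhythmics_instruction("rhythm", 0.15))
```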
Description
Music teaching system based on speech recognition

Technical Field

The invention relates to the technical field of music feature fusion, in particular to a music teaching system based on speech recognition.

Background

Music education is a key way to cultivate aesthetic literacy and creativity. Traditional music teaching relies heavily on the personal experience and subjective judgment of teachers, making large-scale, personalized and standardized teaching difficult to achieve. With the development of artificial intelligence technology, music teaching applications that score pitch and rhythm using speech recognition are emerging. Chinese patent application No. CN202510554921.0 discloses a music teaching system based on artificial intelligence, comprising a neuro-cognitive adaptation module, a music DNA map construction module, a cognitive load monitoring module, an anti-AI-dependence regulation module and a multi-source data fusion unit. That system constructs a dynamically evolving personalized teaching model through AI-driven multi-modal data fusion and real-time analysis; its music DNA map, based on quantum-enhanced modeling technology, continuously updates and predicts the skill development track, and its AI teaching strategy achieves millisecond-level dynamic adjustment according to the learner's neural feedback and behavior data, thereby markedly improving skill-mastery efficiency and knowledge retention and overcoming the static, one-dimensional limitations of traditional teaching systems.

Most prior-art systems, however, score mechanically: feedback is rigid, and they cannot, like a human teacher, provide guidance with emotional warmth and artistic inspiration, so deep musical understanding is hard to build. They can only evaluate quantifiable technical indexes and cannot understand or evaluate the perceptual elements of music, such as emotional expression, sound narrative and cultural connotation, leaving teaching objectives one-sided. They remain at the level of listen-and-imitate practice and score feedback, lack deep fusion with mature music teaching methods, and cannot convert abstract musical concepts into intuitive bodily experience.

Disclosure of Invention

The invention solves the technical problems identified above: mechanical scoring and rigid feedback; the inability to provide guidance with emotional warmth and artistic inspiration as a human teacher would; evaluation limited to quantifiable technical indexes, with perceptual elements of music such as emotional expression, sound narrative and cultural connotation left unevaluated and teaching objectives one-sided; and the failure to fuse deeply with mature music teaching methods or to convert abstract musical concepts into intuitive bodily experience.
In order to solve the above technical problems, the invention provides the following technical scheme: a music teaching system based on speech recognition comprises a technical feature quantization module, an empathic reasoning module and a teaching application module. The technical feature quantization module is used for acquiring music performance data of a learner through speech recognition and the musical instrument digital interface (MIDI), and, after performing time-series alignment based on dynamic time warping on the music performance data, calculating a rhythm value deviation rate, a pitch accuracy rate, a musical interval tension scalar, a mode type scalar, a prosody boundary detection intensity, a linearity scalar, a discontinuity scalar and a semantic clarity scalar. The empathic reasoning module is used for constructing an emotion theory mapping model and generating an emotion expression scalar, an artistic connotation scalar and a sound narrative continuity scalar based on the emotion theory mapping model. The teaching application module is used for feeding the emotion expression scalar, the sound narrative continuity scalar and the artistic connotation scalar as conditions into a condition-driven dialogue generation model, to generate a personalized empathic dialogue instruction and a Dalcroze eurhythmics teaching instruction targeting the current technical feature deviation. Preferably, the technical feature quantization module includes: the music performance data comprises audio data of the learner's singing and playing, musical instrument digital interface data, and lyric text corresponding to the audio data; when the audio data of the learner's singing and playing comprises an original sound wave signal acq