
CN-121859270-B - Multi-modal interaction data processing method and system for children's picture book reading


Abstract

The invention provides a multi-modal interaction data processing method and system for children's picture book reading, in the technical field of data processing. The method comprises: processing a multi-modal raw interaction data set in parallel to extract and output visual semantic features, interactive speech semantic features, and interaction behavior semantic features; dynamically determining at least three key semantic anchors within the visual and interactive speech semantic features, where the visual key semantic anchors correspond to the spatial coordinates of the core character and the center points of key text regions on the current page, and the interactive speech key semantic anchors correspond to the timestamp of the core emotion frame and the position of the query word in the speech spectrum; and constructing, based on these key semantic anchors, a dynamic semantic geometric relation graph in a cross-modal joint feature space. The invention achieves accurate, adaptive fusion of cross-modal features and effectively improves both the accuracy of processing children's picture book reading interaction data and the appropriateness of the interactive responses.

Inventors

  • GAO XUEFEI
  • CHEN QICHUAN
  • CHEN WEIBIN

Assignees

  • Xiamen Sandu Education Technology Co., Ltd. (厦门三读教育科技有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-03-17

Claims (9)

  1. A multi-modal interaction data processing method for children's picture book reading, characterized by comprising the following steps: processing a multi-modal raw interaction data set in parallel, and extracting and outputting visual semantic features, interactive speech semantic features, and interaction behavior semantic features; dynamically determining key semantic anchors within the visual semantic features and the interactive speech semantic features, wherein the key semantic anchors comprise visual key semantic anchors and interactive speech key semantic anchors, the visual key semantic anchors comprise a candidate visual space anchor and a candidate text region anchor, the interactive speech key semantic anchors comprise a candidate emotion time anchor and a candidate query semantic anchor, the candidate visual space anchor corresponds to the spatial coordinates of the core character on the current page, the candidate text region anchor corresponds to the center point of the key text region on the current page, the candidate emotion time anchor corresponds to the timestamp of the core emotion frame in the speech spectrum, and the candidate query semantic anchor corresponds to the position of the query word in the speech spectrum; constructing, based on the key semantic anchors, a dynamic semantic geometric relation graph in a cross-modal joint feature space, wherein a semantic ellipse is generated with the visual key semantic anchors as foci, and a semantic perception circle is generated whose center is obtained by a weighted calculation over the core emotion frame timestamp anchor and the query word position anchor among the interactive speech key semantic anchors; calculating the geometric inclusion degree and the area overlap ratio between the semantic ellipse and the semantic perception circle, and generating a cross-modal semantic fusion adjustment coefficient; performing weighted fusion of the visual semantic features, the interactive speech semantic features, and the interaction behavior semantic features using the cross-modal semantic fusion adjustment coefficient, to generate a multi-modal joint semantic representation; performing joint reasoning based on the multi-modal joint semantic representation, analyzing and outputting an interaction intention judgment result and a cognitive state evaluation result for the child, and iteratively updating the multi-modal joint semantic representation according to these results to generate a coherent reading session context representation; and generating and outputting, according to the coherent reading session context representation and the interaction intention judgment result, a personalized interaction response instruction adapted to the evolution of the story plot.
  2. The multi-modal interaction data processing method for children's picture book reading according to claim 1, wherein processing the multi-modal raw interaction data set in parallel, and extracting and outputting visual semantic features, interactive speech semantic features, and interaction behavior semantic features, comprises: separating page image data, the child's speech data, touch pressure data, and three-dimensional posture data from the acquired multi-modal raw interaction data set; performing edge-side preprocessing on the page image data to obtain a preprocessed page image, performing page feature segmentation and object recognition on the preprocessed page image, and locating text regions and illustration element regions; performing noise reduction and endpoint detection on the child's speech data to obtain clean speech segments, performing acoustic feature analysis and speech recognition on the clean speech segments to extract acoustic features from the speech spectrum and transcribe them into text, and analyzing and outputting interactive speech semantic features covering reading content, question content, and tone; and synchronizing and temporally aligning the touch pressure data and the three-dimensional posture data to obtain a synchronized behavior sequence, and generating and outputting interaction behavior semantic features characterizing reading rhythm and concentration state by analyzing page-turning pressure change patterns, page-turning angle sequences, and gaze focus trajectories from the synchronized behavior sequence (a preprocessing and alignment sketch follows the claims).
  3. The multi-modal interaction data processing method for children's picture book reading according to claim 2, wherein dynamically determining the key semantic anchors within the visual semantic features and the interactive speech semantic features, the key semantic anchors comprising visual key semantic anchors and interactive speech key semantic anchors, the visual key semantic anchors comprising a candidate visual space anchor and a candidate text region anchor, the interactive speech key semantic anchors comprising a candidate emotion time anchor and a candidate query semantic anchor, wherein the candidate visual space anchor corresponds to the spatial coordinates of the core character on the current page, the candidate text region anchor corresponds to the center point of the key text region on the current page, the candidate emotion time anchor corresponds to the timestamp of the core emotion frame in the speech spectrum, and the candidate query semantic anchor corresponds to the position of the query word in the speech spectrum, comprises: identifying, based on the visual semantic features, the core character object and the key text region on the current page, extracting the bounding-box center coordinates of the core character object in the page image coordinate system as the candidate visual space anchor, and taking the center point of the key text region as the candidate text region anchor; and identifying, based on the interactive speech semantic features, emotion fluctuation frames and question sentence patterns in the child's speech segments, locating the central timestamp of the emotion fluctuation frames on the speech time axis as the candidate emotion time anchor, and locating the position index of the question word within the question sentence pattern in the text sequence as the candidate query semantic anchor (see the anchor-extraction sketch after the claims).
  4. The multi-modal interaction data processing method for children's picture book reading according to claim 3, wherein constructing, based on the key semantic anchors, a dynamic semantic geometric relation graph in the cross-modal joint feature space, wherein the semantic ellipse is generated with the visual key semantic anchors as foci and the semantic perception circle is generated with a center obtained by a weighted calculation over the core emotion frame timestamp anchor and the query word position anchor, comprises the following steps: taking the core character spatial coordinates and the key text region center point among the visual key semantic anchors as the two foci, and determining the ellipse from a preset semantic association distance threshold; performing a weighted calculation over the core emotion frame timestamp anchor and the query word position anchor among the interactive speech key semantic anchors, taking the result as the circle center, and calculating the perception radius from the acoustic feature intensity and the emotion confidence in the interactive speech semantic features; and mapping the semantic ellipse and the semantic perception circle into the same cross-modal joint feature space coordinate system to form the dynamic semantic geometric relation graph, wherein the semantic ellipse represents the distribution range of the page's visual semantics, the semantic perception circle represents the perception range of the child's speech interaction semantics, and together they constitute a geometric constraint framework for cross-modal semantic association (see the geometry sketch after the claims).
  5. The multi-modal interaction data processing method for children's picture book reading according to claim 4, wherein calculating the geometric inclusion degree and the area overlap ratio between the semantic ellipse and the semantic perception circle and generating the cross-modal semantic fusion adjustment coefficient comprises: obtaining, in the common coordinate system of the dynamic semantic geometric relation graph, the focal positions and the major and minor axis lengths of the semantic ellipse, and the center coordinates and perception radius of the semantic perception circle; calculating, from the obtained center coordinates and focal positions, the relative position of the semantic perception circle with respect to the semantic ellipse, judging whether the circle center lies inside the semantic ellipse or on its boundary, and computing a corresponding first geometric inclusion value; calculating the area of the intersection region between the semantic perception circle and the semantic ellipse and dividing it by the area of the semantic perception circle to obtain a second area overlap value; and weighting and synthesizing the first geometric inclusion value and the second area overlap value, combined with the visual semantic distribution confidence represented by the semantic ellipse, to generate the cross-modal semantic fusion adjustment coefficient (see the fusion-coefficient sketch after the claims).
  6. The multi-modal interaction data processing method for children's picture book reading according to claim 5, wherein performing weighted fusion of the visual semantic features, the interactive speech semantic features, and the interaction behavior semantic features using the cross-modal semantic fusion adjustment coefficient to generate the multi-modal joint semantic representation comprises: acquiring the cross-modal semantic fusion adjustment coefficient, and synchronously reading the visual semantic features, the interactive speech semantic features, and the interaction behavior semantic features; inputting the cross-modal semantic fusion adjustment coefficient into a preset weight mapping function and computing, respectively, a visual weight for the visual semantic features, a speech weight for the interactive speech semantic features, and a behavior weight for the interaction behavior semantic features; performing, according to the visual, speech, and behavior weights, a weighted summation of the corresponding visual, interactive speech, and interaction behavior semantic features to obtain a preliminary fused feature vector; and normalizing and reducing the dimensionality of the preliminary fused feature vector, and outputting a multi-modal joint semantic representation accurately bound to the semantic content of the current page (see the weighted-fusion sketch after the claims).
  7. The multi-modal interaction data processing method for children's picture book reading according to claim 6, wherein performing joint reasoning based on the multi-modal joint semantic representation, analyzing and outputting the interaction intention judgment result and the cognitive state evaluation result for the child, and iteratively updating the multi-modal joint semantic representation according to these results to generate the coherent reading session context representation, comprises the following steps: acquiring the multi-modal joint semantic representation corresponding to the current page; inputting the multi-modal joint semantic representation into a pre-trained joint intention-and-cognition reasoning model to extract the multi-modal semantic association patterns it contains, and performing pattern matching and reasoning against a plot knowledge graph to output a multi-modal semantic matching and reasoning result; analyzing and outputting the interaction intention judgment result and the cognitive state evaluation result from the multi-modal semantic matching and reasoning result; and temporally fusing the multi-modal joint semantic representation, the interaction intention judgment result, and the cognitive state evaluation result with the historical reading session context representation maintained in real time, thereby updating and generating the coherent reading session context representation at the current moment (see the context-update sketch after the claims).
  8. The multi-modal interaction data processing method for children's picture book reading according to claim 7, wherein generating and outputting, according to the coherent reading session context representation and the interaction intention judgment result, a personalized interaction response instruction adapted to the evolution of the picture book plot comprises: acquiring the reading session context representation and the interaction intention judgment result at the current moment; predicting, based on the current reading session context representation combined with the plot knowledge graph, the position of the current reading progress within the story and the direction of the subsequent plot, and generating a plot adaptation strategy; matching and retrieving an optimal response template from a preset multi-modal response template library according to the interaction intention judgment result and the plot adaptation strategy; and filling the key semantic information from the reading session context representation into the optimal response template, generating a specific personalized interaction response instruction, and outputting it to the interaction execution terminal (see the template-filling sketch after the claims).
  9. A multi-modal interaction data processing system for children's picture book reading, the system implementing the method of any one of claims 1 to 8 and comprising: a preprocessing and extraction module for processing the multi-modal raw interaction data set in parallel, and extracting and outputting visual semantic features, interactive speech semantic features, and interaction behavior semantic features; a key semantic anchor locating module for dynamically determining key semantic anchors within the visual and interactive speech semantic features, wherein the key semantic anchors comprise visual key semantic anchors and interactive speech key semantic anchors, the visual key semantic anchors comprise a candidate visual space anchor and a candidate text region anchor, the interactive speech key semantic anchors comprise a candidate emotion time anchor and a candidate query semantic anchor, the candidate visual space anchor corresponds to the spatial coordinates of the core character on the current page, the candidate text region anchor corresponds to the center point of the key text region on the current page, the candidate emotion time anchor corresponds to the timestamp of the core emotion frame in the speech spectrum, and the candidate query semantic anchor corresponds to the position of the query word in the speech spectrum; a geometric relation construction module for constructing, based on the key semantic anchors, a dynamic semantic geometric relation graph in a cross-modal joint feature space, wherein a semantic ellipse is generated with the visual key semantic anchors as foci, and a semantic perception circle is generated with a center obtained by a weighted calculation over the core emotion frame timestamp anchor and the query word position anchor; a fusion coefficient calculation module for calculating the geometric inclusion degree and the area overlap ratio between the semantic ellipse and the semantic perception circle and generating a cross-modal semantic fusion adjustment coefficient; a multi-modal feature fusion module for performing weighted fusion of the visual semantic features, the interactive speech semantic features, and the interaction behavior semantic features using the cross-modal semantic fusion adjustment coefficient, to generate a multi-modal joint semantic representation; an intention and cognition reasoning module for performing joint reasoning based on the multi-modal joint semantic representation, analyzing and outputting the interaction intention judgment result and the cognitive state evaluation result for the child, and iteratively updating the multi-modal joint semantic representation according to these results to generate a coherent reading session context representation; and a personalized response generation module for generating and outputting, according to the coherent reading session context representation and the interaction intention judgment result, personalized interaction response instructions adapted to the evolution of the plot.
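
The sketches below illustrate the main computational steps named in claims 2 through 8. They are minimal, hedged reconstructions in Python, not the patented implementation: every helper name, weighting constant, and data layout is an assumption introduced for illustration. First, the synchronization and temporal alignment of the behavior streams from claim 2, assuming a nearest-timestamp matching strategy:

```python
# Minimal sketch of the claim-2 alignment step: synchronize touch-pressure
# and 3D-posture streams onto a common clock by nearest-timestamp matching.
# The sampling rates and field layout are illustrative assumptions.
import numpy as np

def align_behavior_streams(pressure_t, pressure_v, posture_t, posture_v):
    """Resample the posture stream at the pressure timestamps.

    pressure_t: (N,) timestamps in seconds; pressure_v: (N,) pressures
    posture_t:  (M,) timestamps in seconds; posture_v:  (M, 3) Euler angles
    Returns a synchronized behavior sequence of shape (N, 4).
    """
    # Index of the nearest posture sample for every pressure sample.
    idx = np.searchsorted(posture_t, pressure_t)
    idx = np.clip(idx, 1, len(posture_t) - 1)
    left_closer = (pressure_t - posture_t[idx - 1]) < (posture_t[idx] - pressure_t)
    idx = np.where(left_closer, idx - 1, idx)
    # Stack pressure and matched posture into one time-aligned sequence.
    return np.column_stack([pressure_v, posture_v[idx]])

# Example: a 50 Hz pressure stream and a 30 Hz posture stream over 2 seconds.
t_p = np.arange(0, 2, 1 / 50)
t_g = np.arange(0, 2, 1 / 30)
sync = align_behavior_streams(t_p, np.random.rand(len(t_p)),
                              t_g, np.random.rand(len(t_g), 3))
print(sync.shape)  # (100, 4)
```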
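Next, the claim-3 anchor extraction: bounding-box centers for the two visual anchors, the highest-confidence emotion frame for the emotion time anchor, and the first question word in the transcript for the query semantic anchor. The detector outputs and the question-word list are illustrative assumptions:

```python
# Minimal sketch of claim-3 anchor extraction. The box format, the emotion
# frame scoring, and QUERY_WORDS are invented for illustration.
from dataclasses import dataclass

@dataclass
class Anchors:
    visual_space: tuple   # bounding-box center of the core character (x, y)
    text_region: tuple    # center point of the key text region (x, y)
    emotion_time: float   # timestamp of the core emotion frame (seconds)
    query_index: int      # position of the query word in the transcript

QUERY_WORDS = {"what", "why", "where", "who", "how"}

def extract_anchors(char_box, text_box, emotion_frames, transcript):
    """char_box/text_box: (x0, y0, x1, y1); emotion_frames: [(t, confidence)]."""
    cx = ((char_box[0] + char_box[2]) / 2, (char_box[1] + char_box[3]) / 2)
    tx = ((text_box[0] + text_box[2]) / 2, (text_box[1] + text_box[3]) / 2)
    # Core emotion frame = the frame with the highest emotion confidence.
    t_emotion = max(emotion_frames, key=lambda f: f[1])[0]
    # Position index of the first query word in the transcribed question.
    words = transcript.lower().split()
    q_idx = next((i for i, w in enumerate(words) if w in QUERY_WORDS), -1)
    return Anchors(cx, tx, t_emotion, q_idx)

print(extract_anchors((40, 60, 120, 200), (30, 210, 220, 250),
                      [(0.4, 0.62), (1.1, 0.93)], "why is the fox sad"))
```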
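The claim-4 geometry, assuming all anchors have already been mapped into a shared two-dimensional joint feature space. The ellipse takes the two visual anchors as foci, with the preset semantic association distance threshold fixing the major axis; the circle center mixes the two speech anchors, and the radius formula from acoustic intensity and emotion confidence is an assumed weighting:

```python
# Minimal sketch of the claim-4 construction of the semantic ellipse and the
# semantic perception circle. Weights and the radius formula are assumptions.
import numpy as np

def semantic_ellipse(f1, f2, assoc_threshold):
    """Ellipse with the two visual anchors as foci; the preset semantic
    association distance threshold fixes the major axis length 2a."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    c = np.linalg.norm(f2 - f1) / 2          # half focal distance
    a = max(assoc_threshold / 2, c + 1e-9)   # semi-major axis (keep a > c)
    b = np.sqrt(a**2 - c**2)                 # semi-minor axis
    return {"center": (f1 + f2) / 2, "f1": f1, "f2": f2, "a": a, "b": b}

def semantic_circle(p_emotion, p_query, w_emotion, intensity, confidence):
    """Circle center = weighted mix of the emotion-time and query-word
    anchors; radius grows with acoustic intensity and emotion confidence."""
    p_emotion = np.asarray(p_emotion, float)
    p_query = np.asarray(p_query, float)
    center = w_emotion * p_emotion + (1 - w_emotion) * p_query
    radius = 0.5 * intensity + 0.5 * confidence   # assumed weighting
    return {"center": center, "r": radius}

ell = semantic_ellipse((0.2, 0.5), (0.6, 0.5), assoc_threshold=1.0)
cir = semantic_circle((0.45, 0.55), (0.5, 0.4), 0.6, intensity=0.7, confidence=0.9)
```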
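The claim-5 coefficient. Since a circle-ellipse intersection has no simple closed-form area, the overlap is estimated here by Monte Carlo sampling; the inclusion falloff and the synthesis weights are assumptions:

```python
# Minimal sketch of claim 5: a first "geometric inclusion" value from the
# circle center's position relative to the ellipse, a second "area overlap"
# value (intersection area / circle area) estimated by Monte Carlo sampling,
# and their weighted synthesis with the visual-distribution confidence.
import numpy as np

def ellipse_level(p, ell):
    """< 1 inside the ellipse, == 1 on the boundary, > 1 outside."""
    x, y = np.asarray(p, float) - ell["center"]
    return (x / ell["a"])**2 + (y / ell["b"])**2

def fusion_coefficient(ell, cir, vis_confidence, w_incl=0.5, w_ovl=0.3,
                       n_samples=200_000, seed=0):
    level = ellipse_level(cir["center"], ell)
    inclusion = 1.0 if level < 1 else (0.5 if np.isclose(level, 1) else
                                       1.0 / (1.0 + level))  # soft falloff outside
    # Monte Carlo estimate of |circle ∩ ellipse| / |circle|.
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n_samples)
    rad = cir["r"] * np.sqrt(rng.uniform(0, 1, n_samples))  # uniform in disk
    pts = cir["center"] + np.column_stack([rad * np.cos(theta),
                                           rad * np.sin(theta)])
    dx = (pts[:, 0] - ell["center"][0]) / ell["a"]
    dy = (pts[:, 1] - ell["center"][1]) / ell["b"]
    overlap = np.mean(dx**2 + dy**2 <= 1.0)
    # Weighted synthesis with the visual-distribution confidence (claim 5).
    return w_incl * inclusion + w_ovl * overlap + (1 - w_incl - w_ovl) * vis_confidence

# Toy ellipse and circle standing in for the outputs of the claim-4 sketch.
ell = {"center": np.array([0.4, 0.5]), "a": 0.5, "b": 0.46}
cir = {"center": np.array([0.47, 0.49]), "r": 0.25}
print(round(fusion_coefficient(ell, cir, vis_confidence=0.85), 3))
```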
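The claim-6 fusion, assuming the three modality features have already been projected to a common dimension. The softmax weight mapping and the random-projection stand-in for dimensionality reduction are illustrative choices, not the patented functions:

```python
# Minimal sketch of claim 6: map the fusion adjustment coefficient to
# per-modality weights, take a weighted sum of the three feature vectors,
# then L2-normalize and (optionally) reduce dimensionality.
import numpy as np

def weight_mapping(coeff):
    """Higher coefficient -> stronger vision/speech coupling; a softmax
    keeps the three weights positive and summing to one (assumption)."""
    logits = np.array([coeff, coeff, 0.5])   # vision, speech, behavior
    e = np.exp(logits - logits.max())
    return e / e.sum()

def fuse(v_vis, v_sp, v_beh, coeff, out_dim=None, seed=0):
    w = weight_mapping(coeff)
    fused = w[0] * v_vis + w[1] * v_sp + w[2] * v_beh   # weighted summation
    fused = fused / (np.linalg.norm(fused) + 1e-12)     # normalization
    if out_dim is not None:                             # toy dimension reduction
        rng = np.random.default_rng(seed)               # random projection stand-in
        fused = rng.standard_normal((out_dim, fused.size)) @ fused
    return fused

d = 256
rep = fuse(np.random.rand(d), np.random.rand(d), np.random.rand(d),
           coeff=0.74, out_dim=64)
print(rep.shape)  # (64,)
```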
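The claim-7 context update. The patent specifies temporal fusion of the joint representation, the intention result, and the cognition result with the stored historical context, but fixes no formula; an exponential moving average whose gain scales with the intention confidence and cognition score is one plausible reading:

```python
# Minimal sketch of the claim-7 temporal fusion step. The EMA rule and the
# equal weighting of intention and cognition scores are assumptions.
import numpy as np

def update_context(prev_ctx, joint_rep, intent_conf, cog_score, alpha=0.3):
    """prev_ctx, joint_rep: (D,) vectors; intent_conf, cog_score in [0, 1].
    More confident, cognitively engaged turns pull the context harder."""
    gain = alpha * (0.5 * intent_conf + 0.5 * cog_score)
    ctx = (1 - gain) * prev_ctx + gain * joint_rep
    return ctx / (np.linalg.norm(ctx) + 1e-12)

ctx = np.zeros(64)
for turn in range(3):   # three simulated reading turns
    rep = np.random.default_rng(turn).standard_normal(64)
    ctx = update_context(ctx, rep, intent_conf=0.8, cog_score=0.6)
print(ctx[:4])
```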
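Finally, the claim-8 response generation: retrieve a template matching the intention label and plot adaptation strategy, then fill its slots with key semantic information from the session context. The template library and slot names are invented for illustration:

```python
# Minimal sketch of claim 8. A real system would rank templates by score;
# here the first matching entry is taken.
TEMPLATES = [
    {"intent": "ask_why", "strategy": "foreshadow",
     "text": "Good question! {character} is {emotion} because of what "
             "happens next. Shall we turn the page and find out?"},
    {"intent": "point_at", "strategy": "describe",
     "text": "That's {character}! Can you see what {character} is holding?"},
]

def respond(intent, strategy, slots):
    # Retrieve the optimal (here: first matching) response template.
    tpl = next(t for t in TEMPLATES
               if t["intent"] == intent and t["strategy"] == strategy)
    return tpl["text"].format(**slots)   # fill key semantic information

print(respond("ask_why", "foreshadow",
              {"character": "the fox", "emotion": "sad"}))
```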

Description

Multi-modal interaction data processing method and system for children's picture book reading

Technical Field

The invention relates to the technical field of data processing, and in particular to a multi-modal interaction data processing method and system for children's picture book reading.

Background

With the deep integration of artificial intelligence technology into early childhood education, multi-modal interaction technology has gradually become a core enabling technology of intelligent picture book reading products for children. Current multi-modal interaction data processing schemes for children's picture books share a technical defect: the fusion of cross-modal semantic features lacks dynamic geometric association constraints based on precise semantic anchors. They rely on static weighting or simple concatenation, and cannot establish an accurate geometric mapping and dynamic association mechanism grounded in the core semantics shared across the visual, speech, and behavioral modalities. As a result, the accuracy of cross-modal semantic fusion is insufficient, and such schemes struggle to adapt to the dynamically changing semantic associations of a children's picture book reading scenario.

For example, in the specific scenario where a child points at the core character on a page and asks a voice question with a questioning emotion, the prior art cannot accurately perform geometric association matching between the spatial semantic information of the core character at the visual level and the temporal semantic information of the query word and emotion at the speech level. Static fusion logic loses key cross-modal semantic association information, so the generated multi-modal joint semantic representation cannot accurately reflect the core semantics of the current page or the child's real interaction state. Lacking the constraint of dynamic semantic geometric association, the fused multi-modal semantic representation binds poorly to the picture book plot and the child's interaction intention, the accuracy of interaction intention judgment and cognitive state evaluation is low, the output interactive response is difficult to match to the rhythm of the plot's evolution and the child's real-time reading needs, and personalized, coherent reading interaction guidance cannot be achieved.

Disclosure of Invention

The invention aims to solve the above technical problems by providing a multi-modal interaction data processing method and system for children's picture book reading that achieves accurate, adaptive fusion of cross-modal features and effectively improves both the accuracy of processing children's picture book reading interaction data and the appropriateness of the interactive responses.
In order to solve the above technical problems, the technical scheme of the invention is as follows. In a first aspect, a multi-modal interaction data processing method for children's picture book reading comprises: processing a multi-modal raw interaction data set in parallel, and extracting and outputting visual semantic features, interactive speech semantic features, and interaction behavior semantic features; dynamically determining key semantic anchors within the visual semantic features and the interactive speech semantic features, wherein the key semantic anchors comprise visual key semantic anchors and interactive speech key semantic anchors, the visual key semantic anchors comprise a candidate visual space anchor and a candidate text region anchor, the interactive speech key semantic anchors comprise a candidate emotion time anchor and a candidate query semantic anchor, the candidate visual space anchor corresponds to the spatial coordinates of the core character on the current page, the candidate text region anchor corresponds to the center point of the key text region on the current page, the candidate emotion time anchor corresponds to the timestamp of the core emotion frame in the speech spectrum, and the candidate query semantic anchor corresponds to the position of the query word in the speech spectrum; constructing, based on the key semantic anchors, a dynamic semantic geometric relation graph in a cross-modal joint feature space, wherein a semantic ellipse is generated with the visual key semantic anchors as foci, and a semantic perception circle is generated with a center obtained by a weighted calculation over the core emotion frame timestamp anchor and the query word position anchor; and calculating the geometric inclusion degree and the area overlap ratio between the semantic ellipse and the semantic perception circle, and generating a cross-modal semantic fusion adjustment coefficient. A hypothetical end-to-end pass chaining the sketches shown after the claims is given below.
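
The following fragment chains the illustrative sketches from the claims section over one reading turn. It assumes those sketch functions are in scope; all coordinates are normalized to [0, 1], and every number is fabricated for illustration, not taken from the patent:

```python
# Hypothetical end-to-end pass over one reading turn, chaining the hedged
# sketches shown after the claims (extract_anchors, semantic_ellipse,
# semantic_circle, fusion_coefficient, fuse, update_context, respond).
import numpy as np

anchors = extract_anchors(char_box=(0.10, 0.15, 0.30, 0.50),
                          text_box=(0.10, 0.55, 0.60, 0.65),
                          emotion_frames=[(0.4, 0.62), (1.1, 0.93)],
                          transcript="why is the fox sad")
ell = semantic_ellipse(anchors.visual_space, anchors.text_region,
                       assoc_threshold=1.0)
cir = semantic_circle(p_emotion=(0.25, 0.40), p_query=(0.30, 0.45),
                      w_emotion=0.6, intensity=0.7, confidence=0.9)
coeff = fusion_coefficient(ell, cir, vis_confidence=0.85)

v_vis, v_sp, v_beh = (np.random.rand(256) for _ in range(3))
rep = fuse(v_vis, v_sp, v_beh, coeff, out_dim=64)
ctx = update_context(np.zeros(64), rep, intent_conf=0.8, cog_score=0.6)
print(respond("ask_why", "foreshadow",
              {"character": "the fox", "emotion": "sad"}))
```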