CN-121979475-A - Webpage interaction method and related equipment based on semantic analysis
Abstract
The application relates to the technical field of webpage interaction and provides a webpage interaction method based on semantic analysis, together with related equipment. The method comprises: capturing a user voice instruction issued by a user and acquiring a plurality of controls of a target webpage; performing semantic analysis on the user voice instruction and screening a plurality of candidate controls from all the controls based on the semantic analysis result; performing multi-modal scoring on each candidate control respectively to obtain the feature score of that candidate control; determining a target control from all the candidate controls according to all the feature scores and determining the visibility state of the target control; and performing interaction control on the target control based on the visibility state and the user voice instruction to obtain an interaction result. The method can improve the effectiveness and universality of voice-controlled webpage interaction.
Inventors
- TANG MING
- ZENG HUAN
Assignees
- 湖南小算科技信息有限公司 (Hunan Xiaosuan Technology Information Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-27
Claims (10)
- 1. A webpage interaction method based on semantic analysis, characterized by comprising the following steps: capturing a user voice instruction issued by a user, and acquiring a plurality of controls of a target webpage; performing semantic analysis on the user voice instruction, and screening a plurality of candidate controls from all the controls based on the semantic analysis result; performing multi-modal scoring on each candidate control respectively to obtain the feature score of that candidate control, wherein the feature scores are used for identifying and screening the candidate controls in the target webpage; determining a target control from all the candidate controls according to all the feature scores, and determining the visibility state of the target control, wherein the visibility state describes whether the target control is visible in the target webpage; and performing interaction control on the target control based on the visibility state and the user voice instruction to obtain an interaction result.
- 2. The webpage interaction method according to claim 1, wherein the acquiring of the plurality of controls of the target webpage comprises: identifying a plurality of webpage elements of the target webpage; and, for each webpage element respectively: calculating a visibility score, a semantic relatedness score and an interactivity score of the webpage element, and calculating a comprehensive score of the webpage element based on the visibility score, the semantic relatedness score and the interactivity score; and judging whether the comprehensive score meets a preset operability condition, and if so, taking the webpage element as a control.
- 3. The webpage interaction method according to claim 2, wherein the screening of the plurality of candidate controls from all the controls based on the semantic analysis result comprises: performing semantic matching on each control based on the semantic analysis result to obtain the semantic matching degree between each control and the user voice instruction; and, for each control respectively, taking the control as a candidate control if the semantic matching degree corresponding to the control is greater than or equal to a preset semantic matching degree threshold.
- 4. The webpage interaction method according to claim 1, wherein the multi-modal scoring of the candidate control to obtain the feature score of the candidate control comprises: calculating a visual feature score according to the visual features of the candidate control; calculating a text feature score according to the text features of the candidate control; calculating a structural feature score according to the structural features of the candidate control; calculating an intention matching score between the candidate control and the user voice instruction; and calculating the feature score of the candidate control based on the visual feature score, the text feature score, the structural feature score and the intention matching score.
- 5. The webpage interaction method according to claim 4, wherein the calculating of the feature score of the candidate control based on the visual feature score, the text feature score, the structural feature score and the intention matching score comprises: calculating the feature score S by the formula S = w1·Sv + w2·St + w3·Ss + w4·Si; wherein w1 represents the weight of the visual feature score, Sv represents the visual feature score, w2 represents the weight of the text feature score, St represents the text feature score, w3 represents the weight of the structural feature score, Ss represents the structural feature score, w4 represents the weight of the intention matching score, and Si represents the intention matching score.
- 6. The webpage interaction method according to claim 1, wherein the visibility state is visible or invisible, and the determining of the visibility state of the target control comprises: judging whether the webpage area corresponding to the target control meets a direct visibility condition; if the direct visibility condition is met, the visibility state of the target control is visible; if the direct visibility condition is not met, performing page scrolling on the target webpage, and judging, based on the page scrolling result, whether the webpage area corresponding to the target control meets a scrolling visibility condition; if the scrolling visibility condition is met, the visibility state of the target control is visible; and if the scrolling visibility condition is not met, the visibility state of the target control is invisible.
- 7. The webpage interaction method according to claim 6, wherein the performing of interaction control on the target control based on the visibility state and the user voice instruction to obtain the interaction result comprises: when the visibility state is visible, performing interaction control on the target control according to the semantic analysis result of the user voice instruction to obtain the interaction result.
- 8. A webpage interaction device based on semantic analysis, comprising: a capturing module, used for capturing a user voice instruction issued by a user and acquiring a plurality of controls of a target webpage; an analysis module, used for performing semantic analysis on the user voice instruction and screening a plurality of candidate controls from all the controls based on the semantic analysis result; a scoring module, used for performing multi-modal scoring on each candidate control respectively to obtain the feature score of that candidate control, wherein the feature scores are used for identifying and screening the candidate controls in the target webpage; a determining module, used for determining a target control from all the candidate controls according to all the feature scores and determining the visibility state of the target control, wherein the visibility state describes whether the target control is visible in the target webpage; and an interaction module, used for performing interaction control on the target control based on the visibility state and the user voice instruction to obtain an interaction result.
- 9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the webpage interaction method based on semantic analysis according to any one of claims 1 to 7.
- 10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the webpage interaction method based on semantic analysis according to any one of claims 1 to 7.
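As a non-authoritative sketch of the control extraction and candidate screening described in claims 2 and 3, the following TypeScript fragment models each webpage element with three precomputed scores and applies two thresholds. All names, the simple-mean "comprehensive score", and the threshold values are illustrative assumptions; the patent does not specify them.

```typescript
// Illustrative sketch of claims 2-3. All names, the use of a simple mean
// for the comprehensive score, and the thresholds are assumptions only.

interface PageElement {
  id: string;
  visibilityScore: number;    // claim 2: visibility score
  relatednessScore: number;   // claim 2: semantic relatedness score
  interactivityScore: number; // claim 2: interactivity score
}

// Claim 2's "comprehensive score", modeled here as a simple mean.
function comprehensiveScore(e: PageElement): number {
  return (e.visibilityScore + e.relatednessScore + e.interactivityScore) / 3;
}

// Claim 2's "preset operability condition": keep elements whose
// comprehensive score reaches a threshold; those become controls.
function extractControls(
  elems: PageElement[],
  operabilityThreshold = 0.5,
): PageElement[] {
  return elems.filter((e) => comprehensiveScore(e) >= operabilityThreshold);
}

// Claim 3: keep controls whose semantic matching degree against the parsed
// voice instruction meets a preset threshold. `match` stands in for a real
// semantic matcher over the semantic analysis result.
function screenCandidates(
  controls: PageElement[],
  match: (e: PageElement) => number,
  matchThreshold = 0.6,
): PageElement[] {
  return controls.filter((e) => match(e) >= matchThreshold);
}
```

In this sketch an element first has to pass the operability condition to count as a control at all, and only controls are then compared against the voice instruction, mirroring the two-stage screening in the claims.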
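The weighted-sum scoring of claims 4 and 5 can be sketched as a pure function; choosing the highest-scoring candidate then corresponds to claim 1's "determining a target control from all candidate controls". The field names, symbols and weight values below are illustrative assumptions, not values from the patent.

```typescript
// Illustrative sketch of the multi-modal feature score in claims 4-5.
// All names and weight values are assumptions for demonstration only.

interface CandidateControl {
  id: string;
  visualScore: number;     // Sv: visual feature score
  textScore: number;       // St: text feature score
  structuralScore: number; // Ss: structural feature score
  intentScore: number;     // Si: intention matching score
}

interface Weights { wV: number; wT: number; wS: number; wI: number; }

// Claim 5's formula: S = w1*Sv + w2*St + w3*Ss + w4*Si.
function featureScore(c: CandidateControl, w: Weights): number {
  return w.wV * c.visualScore + w.wT * c.textScore +
         w.wS * c.structuralScore + w.wI * c.intentScore;
}

// Pick the candidate with the highest feature score as the target control.
function selectTarget(
  cands: CandidateControl[],
  w: Weights,
): CandidateControl | null {
  let best: CandidateControl | null = null;
  let bestScore = -Infinity;
  for (const c of cands) {
    const s = featureScore(c, w);
    if (s > bestScore) { bestScore = s; best = c; }
  }
  return best;
}
```

With equal weights, a candidate whose intention matching score is high can outrank one with better text features alone, which is the point of combining the four modalities rather than matching on DOM text only.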
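The visibility determination of claim 6 can likewise be sketched with plain rectangle geometry standing in for real DOM measurements (in a browser these would come from something like `getBoundingClientRect`). The direct visibility condition is modeled as rect-viewport intersection, and the scrolling check as re-testing after a clamped scroll; all names and the specific scroll policy are assumptions.

```typescript
// Illustrative sketch of the visibility determination in claim 6.
// Rect geometry stands in for real DOM measurements; the names and the
// scroll-to-top-edge policy are assumptions for demonstration only.

interface Rect { x: number; y: number; width: number; height: number; }

// "Direct visibility condition": the control's rect intersects the viewport.
function intersects(control: Rect, viewport: Rect): boolean {
  return control.x < viewport.x + viewport.width &&
         control.x + control.width > viewport.x &&
         control.y < viewport.y + viewport.height &&
         control.y + control.height > viewport.y;
}

type VisibilityState = "visible" | "invisible";

// Claim 6: visible if directly in view; otherwise scroll the page and
// re-check the "scrolling visibility condition"; invisible if still
// out of view after scrolling.
function determineVisibility(
  control: Rect,
  viewport: Rect,
  pageHeight: number,
): { state: VisibilityState; scrollY: number } {
  if (intersects(control, viewport)) {
    return { state: "visible", scrollY: viewport.y };
  }
  // Scroll so the control's top edge enters the viewport, clamped to the
  // scrollable range of the page.
  const maxScroll = Math.max(0, pageHeight - viewport.height);
  const scrollY = Math.min(Math.max(0, control.y), maxScroll);
  const scrolled: Rect = { ...viewport, y: scrollY };
  if (intersects(control, scrolled)) {
    return { state: "visible", scrollY };
  }
  return { state: "invisible", scrollY };
}
```

Claim 7 then only performs the interaction when the returned state is visible, so an "invisible" result would end the interaction attempt rather than act on an off-screen control.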
Description
Webpage interaction method and related equipment based on semantic analysis

Technical Field

The application relates to the technical field of webpage interaction, and in particular to a webpage interaction method based on semantic analysis and related equipment.

Background

As Web technology stacks are widely adopted by terminals such as vehicle-mounted cabins and smart televisions, users urgently require "native-application-level" voice control of Web pages, covering core interaction scenarios such as clicking, inputting, selecting, scrolling and multimedia control. However, the prior art lacks a unified semantic abstraction layer and therefore cannot realize universal webpage voice interaction that requires no page modification, works across devices, and remains safe and controllable. The specific deficiencies are as follows:

1. Strong dependence on the web page's own semantic annotation, ineffective on third-party web pages (lack of universality and zero-modification capability): Mainstream webpage voice control schemes all require the web page to provide structural semantic support, with specific implementations including marking interactive attributes on Document Object Model (DOM) elements, exposing a voice-specific JavaScript API (e.g., handleIntent, exposeActions), and integrating a specific voice assistant SDK (e.g., the Alexa Web API). The fundamental defect of this approach is that third-party web page developers lack the motivation to adapt: mainstream media and e-commerce web pages generally carry no semantic annotation, and annotating a dynamic Single Page Application (SPA) is exponentially harder, so zero-modification voice control cannot be realized in practice and universality is severely lacking.

2. Continuous DOM scanning and continuous listening incur high performance overhead (unsuitable for resource-limited devices such as vehicle head units): Existing schemes commonly adopt an implementation of full DOM scanning, continuous MutationObserver monitoring and high-frequency polling. This approach can still run on a personal computer (PC), but on computation-limited devices such as vehicle head units and televisions (TVs) it causes a series of problems: high central processing unit (CPU) occupancy blocks the JS thread, page rendering drops frames or stutters, WebView resource consumption affects core service functions, and the performance requirements of safety-critical scenarios such as vehicles cannot be met.

3. A static DOM cannot represent the real structure of modern web pages (virtual DOM / shadow DOM / asynchronous rendering): The virtual DOM of frameworks such as React and Vue, the shadow DOM of Web Components, and modern front-end techniques such as SPA dynamic routing, lazy loading and asynchronous rendering mean that a static DOM structure cannot reflect the real UI state of a page: controls do not exist before virtual DOM rendering completes, the context isolation of the shadow DOM prevents access from external nodes, element IDs and class names change frequently under dynamic rendering, and asynchronous loading makes the timing of analysis uncontrollable. Existing analysis approaches based on a static DOM cannot acquire the real semantics, and control positioning stability is extremely poor.

4. Lacking a unified abstraction of vision and speech, it is difficult to implement "what is visible can be spoken": Existing schemes rely only on DOM text, IDs or tag attributes without introducing visual-layer understanding, and therefore cannot judge the actual visibility, occlusion state, real-time multimedia state or visual layout structure of elements. This lack of visual information prevents the system from matching what the user sees with the user's voice instructions, and the absence of multi-modal collaboration capability has become the core technical barrier to realizing "what is visible can be spoken".

5. Static rule matching cannot automatically understand unknown web pages: In the prior art, action mappings, intention schemas, voice dictionaries and lists of operable elements are predefined by developers, so new web pages cannot be adapted automatically, the voice capability of an updated web page fails immediately, a large amount of manual maintenance is required, costs are high, stability is poor, and generalized interaction requirements cannot be supported.

6. Failure to be generic across Web kernels: Different devices adopt differentiated Web engines: Android is based on Chromium/WebView, most vehicle head units use customized Chromium or self-developed kernels, and TVs commonly use WebKit/Qt WebE