CN-122019096-A - AI desktop workstation system based on multi-mode interaction and working method
Abstract
The invention discloses an AI desktop workstation system based on multi-modal interaction and a working method thereof, comprising a hardware layer, a perception layer, an AI decision layer and an application layer. The hardware layer consists of an integrated high-performance computing unit, a multi-modal signal acquisition module, a dedicated AI processing chip (NPU) and at least two display output interfaces; the perception layer collects four types of multi-modal data from the user in real time, namely voice instructions, gesture actions, facial expressions and eye movement tracks; the AI decision layer serves as the system's core control unit and contains a central task scheduling engine; and the application layer is in instruction connection with the AI decision layer. By integrating the dedicated AI processing chip and the multi-modal acquisition module in the hardware layer, generating office context data in real time through the perception layer, and dynamically scheduling multi-modal data processing in the AI decision layer, the invention solves the traditional workstation's problems of a single interaction mode, rigid hardware resource allocation and a lack of context awareness, and offers the advantages of improved multi-modal data processing efficiency, linked execution of intelligent office tasks and an enhanced user interaction experience.
Inventors
- WEI LUN
- LI QINGQING
- HUANG YIRAN
Assignees
- Hangzhou Lingfeng Intelligent Technology Co., Ltd. (杭州灵峰智能科技有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (10)
- 1. An AI desktop workstation system based on multi-modal interaction, comprising a hardware layer, a perception layer, an AI decision layer and an application layer, wherein: the hardware layer consists of an integrated high-performance computing unit, a multi-modal signal acquisition module, a dedicated AI processing chip (NPU) and at least two display output interfaces; the perception layer is in signal connection with the hardware layer and is used for acquiring four types of multi-modal data in real time, namely the user's voice instructions, gesture actions, facial expressions and eye movement tracks, extracting the software interface elements and document type information displayed on the current screen through an image recognition algorithm, and generating an office context data packet; the AI decision layer serves as the system's core control unit and contains a central task scheduling engine, the central task scheduling engine is connected with an AI capability module through a data bus, and the AI capability module comprises a Natural Language Processing (NLP) module, a Computer Vision (CV) module and a user behavior analysis module; and the application layer is in instruction connection with the AI decision layer and comprises a real-time conference summary generation module, an intelligent data insight visualization module and a cross-document knowledge base retrieval recommendation module.
- 2. The multi-modal interaction-based AI desktop workstation system of claim 1, wherein the multi-modal signal acquisition module comprises a high-definition camera, a microphone array and an eye tracking sensor, and wherein the computing power of the dedicated AI processing chip (NPU) is not lower than 10 TOPS, supporting INT8/FP16 mixed-precision computation.
- 3. The multi-modal interaction-based AI desktop workstation system of claim 2, wherein the central task scheduling engine is capable of receiving the multi-modal data and office context data packets output by the perception layer, analyzing the degree of user behavior association through a feature fusion algorithm, and dynamically identifying the user's office intention.
- 4. The multi-modal interaction-based AI desktop workstation system of claim 1, wherein the high-performance computing unit of the hardware layer adopts a multi-core processor with a clock frequency of not less than 3.0 GHz and a memory capacity of not less than 32 GB supporting the DDR5 memory protocol, and the display output interfaces comprise HDMI 2.1 and DisplayPort 1.4 interfaces supporting 4K@60Hz dual-screen extended display.
- 5. The multi-modal interaction-based AI desktop workstation system of claim 1, wherein the image recognition algorithm of the perception layer employs a lightweight CNN model with no more than 5M parameters and an inference time of no more than 100 ms on the dedicated AI processing chip (NPU), and the office context data packet further includes the current system time, the edit duration of the opened document, and the software operation history.
- 6. The multi-modal interaction-based AI desktop workstation system of claim 3, wherein the feature fusion algorithm of the AI decision layer employs an attention mechanism to assign weight coefficients to the speech, image and eye movement data respectively, the weight coefficients being dynamically adjusted according to the user's historical operational preferences.
- 7. The multi-modal interaction-based AI desktop workstation system of claim 1, wherein the real-time conference summary generation module of the application layer supports multi-language transcription in Chinese, English and Japanese with a transcription accuracy of not less than 95%, and the intelligent data insight visualization module supports three chart types, namely line charts, bar charts and heat maps, and automatically recommends the optimal chart style according to the data dimensions.
- 8. A multi-modal interaction-based AI desktop workstation working method, applied to the multi-modal interaction-based AI desktop workstation system of any one of claims 1 to 7, comprising the steps of: S1, initializing the hardware layer, whereby the multi-modal signal acquisition module enters a real-time monitoring state, the dedicated AI processing chip (NPU) loads model parameters, and the high-performance computing unit establishes communication links with all layers; S2, the perception layer collects user operation data, wherein the high-definition camera captures gesture and facial expression images at a sampling frame rate of not less than 30 fps, the microphone array collects voice instructions and, after noise reduction, converts them into 16 kHz mono audio data, the eye tracking sensor records eye movement track coordinates at a sampling frequency of not less than 120 Hz, and meanwhile the name of the currently running software process and the document format information are extracted through a screen capture technique to generate an office context data set containing a time stamp; S3, the AI decision layer receives the data set output by S2, and the central task scheduling engine feeds the multi-modal data into the corresponding AI modules: the Natural Language Processing (NLP) module performs semantic analysis on the voice instructions, the Computer Vision (CV) module performs feature extraction on the gesture and expression images, and the user behavior analysis module establishes a user operation sequence model in combination with the office context data set; S4, the central task scheduling engine fuses the output results of all the AI modules, compares them with a preset office task library through an intention matching algorithm, and determines the user's target task: if the target task is a meeting record class, it sends a meeting summary generation instruction to the application layer; if the target task is a data analysis class, it sends a data visualization instruction; and if the target task is an information retrieval class, it sends a knowledge base retrieval instruction; S5, after the application layer receives the instruction, the real-time conference summary generation module automatically transcribes the voice content and extracts key issues, the intelligent data insight visualization module generates a trend chart based on the Excel/CSV file currently opened by the user, and the cross-document knowledge base retrieval recommendation module screens the Top 5 most relevant documents from the local document base and the cloud knowledge base according to the context keywords and pushes them to the display output interface; and S6, after task execution is completed, the AI decision layer monitors the user's feedback operations on the result through the user behavior analysis module, and if the user modifies the summary content or adjusts the chart style, the AI decision layer automatically updates the task execution model to improve subsequent intention recognition accuracy.
- 9. The multi-modal interaction-based AI desktop workstation working method of claim 8, wherein in step S2 the eye tracking sensor locates the gaze coordinates by corneal reflection, and the screen capture technique uses low-level system API calls so as to avoid causing lag in software operation.
- 10. The multi-modal interaction-based AI desktop workstation working method of claim 8, wherein the intention matching algorithm in step S4 adopts a deep learning model whose training data set comprises more than 100,000 user operation samples from office scenarios; the preset office task library comprises four types of primary tasks, namely conference collaboration, data processing, document management and schedule reminding, and each type of primary task is subdivided into at least five types of secondary subtasks; and in step S6 the user feedback operations include mouse click confirmation, voice command correction and keyboard editing modification, and the AI decision layer updates the task execution model through a reinforcement learning algorithm, using the user feedback results as reward signals and optimizing the weight distribution logic of the feature fusion algorithm.
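The 5M-parameter budget in claim 5 can be sanity-checked with simple arithmetic over an assumed layer configuration. The layer shapes and fully connected head below are illustrative assumptions, not taken from the patent:

```python
# Parameter count of a hypothetical lightweight CNN against the claim-5
# budget of 5M parameters. Layer shapes are assumed for illustration.
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a k x k convolution: weights plus biases."""
    return c_in * c_out * k * k + c_out

layers = [(3, 32, 3), (32, 64, 3), (64, 128, 3), (128, 256, 3)]
total = sum(conv_params(*layer) for layer in layers)
total += 256 * 100 + 100  # a small fully connected head (assumed)
print(total, total <= 5_000_000)  # 414116 True
```

Even with several more convolutional stages, such a network stays well under the claimed limit, which is what makes sub-100 ms inference on a 10 TOPS NPU plausible.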
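The attention-based weighting of claim 6 can be sketched as a softmax over per-modality scores, with an additive bias standing in for the user's historical operational preferences. All scores and bias values below are assumptions for illustration; the patent does not specify the scoring function:

```python
import math

def softmax(scores: dict) -> dict:
    """Normalize per-modality scores into weight coefficients summing to 1."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def fuse(features: dict, scores: dict, history_bias: dict):
    """Attention-style fusion: history bias shifts scores, softmax yields weights."""
    biased = {k: scores[k] + history_bias.get(k, 0.0) for k in scores}
    weights = softmax(biased)
    fused = sum(weights[k] * features[k] for k in features)
    return fused, weights

# Scalar features keep the sketch short; real features would be vectors.
fused, w = fuse({"speech": 0.9, "image": 0.4, "gaze": 0.2},
                {"speech": 1.2, "image": 0.8, "gaze": 0.5},
                {"speech": 0.3})  # user historically favors voice commands
```

The dynamic adjustment described in the claim corresponds to updating `history_bias` as user preferences accumulate, which shifts the weight coefficients without retraining the scoring model.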
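Steps S4 and S5 of claim 8 amount to matching a fused intent against a preset task library and dispatching the corresponding application-layer instruction. The task classes mirror the claim; the score-based matching below is an illustrative placeholder for the patent's intention matching algorithm:

```python
# Hypothetical mapping from claim-8 task classes to application-layer
# instructions; instruction names are assumptions, not from the patent.
TASK_LIBRARY = {
    "meeting_record": "GENERATE_SUMMARY",           # real-time conference summary
    "data_analysis": "VISUALIZE_DATA",              # intelligent data insight
    "information_retrieval": "SEARCH_KNOWLEDGE_BASE",  # cross-document retrieval
}

def match_intent(intent_scores: dict):
    """Pick the highest-scoring task class and its dispatch instruction (S4)."""
    task = max(intent_scores, key=intent_scores.get)
    return task, TASK_LIBRARY[task]

task, instruction = match_intent(
    {"meeting_record": 0.15, "data_analysis": 0.70, "information_retrieval": 0.15})
print(task, instruction)  # data_analysis VISUALIZE_DATA
```

In the claimed system the `intent_scores` would come from fusing the NLP, CV and behavior-analysis outputs of step S3 rather than being supplied directly.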
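The feedback loop of step S6 in claim 10 treats user reactions as reward signals that adjust the fusion weight distribution. Since the patent does not name a specific reinforcement learning algorithm, the bandit-style update below is a minimal stand-in, with all values assumed:

```python
def update_weights(weights: dict, chosen_modality: str,
                   reward: float, lr: float = 0.1) -> dict:
    """Nudge the weight of the modality that drove the decision toward the
    observed reward, then renormalize so the weights still sum to 1."""
    w = dict(weights)
    w[chosen_modality] += lr * (reward - w[chosen_modality])
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

w0 = {"speech": 0.5, "image": 0.3, "gaze": 0.2}
# User confirmed the result (reward 1.0) after a speech-driven action,
# e.g. a mouse click confirmation from the claim-10 feedback list.
w1 = update_weights(w0, "speech", 1.0)
```

Corrective feedback (e.g. a voice command correction) would carry a low or zero reward, shifting weight away from the modality that produced the wrong intent.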
Description
AI desktop workstation system based on multi-mode interaction and working method
Technical Field
The application relates to the technical field of intelligent office equipment, and in particular to an AI desktop workstation system based on multi-modal interaction and a working method thereof.
Background
In current office environments, conventional desktop workstations expose a number of drawbacks. Functionally, they are often limited to single-task processing and fail to meet the diversified demands of complex office scenarios. Software applications are mutually independent, data are difficult to share and circulate, and isolated data islands form. For example, when performing data analysis, a user may need to export data from one data processing application and manually import it into report-making software, which is cumbersome and error-prone. When facing complex office tasks, such as data analysis, report writing and participation in conference communication, the user needs to switch frequently among devices such as a computer, a mobile phone and a conference tablet, and to jump back and forth among various office applications. This not only increases the complexity of operation but also wastes a great deal of time, severely affecting working efficiency. While some voice assistants and intelligent software products are already on the market, their integration with hardware is relatively loose. In actual office work, these intelligent tools cannot deeply understand the office context, and it is difficult for them to provide consistent, intelligent services. For example, when a user is editing a document, the voice assistant cannot accurately interpret the user's instructions in light of the current document content and provide targeted assistance.
The existence of these problems highlights the urgent need to develop a completely new desktop workstation system that deeply fuses AI capabilities with hardware.
Disclosure of Invention
The application provides an AI desktop workstation system based on multi-modal interaction and a working method thereof, which have the advantages of improving multi-modal data processing efficiency, realizing linked execution of intelligent office tasks and enhancing the user interaction experience. The application provides an AI desktop workstation system based on multi-modal interaction, which comprises a hardware layer, a perception layer, an AI decision layer and an application layer. The hardware layer is composed of an integrated high-performance computing unit, a multi-modal signal acquisition module, a dedicated AI processing chip (NPU) and at least two display output interfaces. The perception layer is in signal connection with the hardware layer and is used for acquiring four types of multi-modal data in real time, namely the user's voice instructions, gesture actions, facial expressions and eye movement tracks, and for extracting the software interface elements and document type information displayed on the current screen through an image recognition algorithm to generate an office context data packet. The AI decision layer serves as the system's core control unit and contains a central task scheduling engine, which is connected with an AI capability module through a data bus; the AI capability module comprises a Natural Language Processing (NLP) module, a Computer Vision (CV) module and a user behavior analysis module. The application layer is in instruction connection with the AI decision layer, comprises a real-time conference summary generation module, an intelligent data insight visualization module and a cross-document knowledge base retrieval recommendation module, and can automatically call the corresponding module to execute an office task according to the intention instruction output by the AI decision layer. Furthermore, the multi-modal signal acquisition module comprises a high-definition camera, a microphone array and an eye tracking sensor, wherein the computing power of the dedicated AI processing chip (NPU) is not lower than 10 TOPS, supporting INT8/FP16 mixed-precision computation. Further, the central task scheduling engine can receive the multi-modal data and the office context data packet output by the perception layer, analyze the degree of user behavior association through the feature fusion algorithm, and dynamically identify the user's office intention. Furthermore, the high-performance computing unit of the hardware layer adopts a multi-core processor with a clock frequency of not less than 3.0 GHz and a memory capacity of not less than 32 GB supporting the DDR5 memory protocol; the display output interfaces comprise HDMI 2.1 and DisplayPort 1.4 interfaces and support 4K@60Hz dual-screen extended display. Furthermore, the image recognition algorithm of the perception layer adopts a lightweight CNN model, the model pa