US-12625672-B2 - Server and electronic device for processing user utterance and operating method thereof by selecting among a plurality of electronic devices one device based on a sum of scores

Abstract

An intelligent server for processing a user utterance is provided. The intelligent server includes a memory configured to store context information including information on electronic devices and information on domains corresponding to the electronic devices, and a processor configured to, based on a target utterance received from any one of the electronic devices and the context information, generate combinations of electronic device information and domain information capable of processing the target utterance, determine reference information for processing the target utterance among the context information, calculate a quality of service score for each combination referring to the reference information, determine a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the quality of service score, and transmit a command to process the target utterance with the target domain, to the target electronic device.
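
The flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Combination` type, the `generate_combinations` and `select_target` helpers, the layout of the context dictionary, the example scores, and the rule of picking the highest score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Combination:
    """Hypothetical pairing of one electronic device with one of its domains."""
    device: str
    domain: str


def generate_combinations(context: dict, function: str) -> list:
    """Pair each device with each of its domains that supports `function`.

    `context` maps a device name to {domain name: set of supported functions};
    this layout is an assumption, not taken from the patent.
    """
    return [
        Combination(device, domain)
        for device, domains in context.items()
        for domain, functions in domains.items()
        if function in functions
    ]


def select_target(
    combos: list, qos_score: Callable[[Combination], float]
) -> Optional[Combination]:
    """Return the combination with the highest QoS score (assumed selection rule)."""
    return max(combos, key=qos_score) if combos else None


# Example data, invented for illustration only.
context = {
    "speaker": {"music_app": {"play_music"}, "radio_app": {"play_radio"}},
    "phone": {"music_app": {"play_music"}},
}
scores = {("speaker", "music_app"): 9.0, ("phone", "music_app"): 6.0}

combos = generate_combinations(context, "play_music")
target = select_target(combos, lambda c: scores[(c.device, c.domain)])
```

Scoring device and domain jointly, rather than choosing the device first and the domain second, is the point of the sketch: every candidate pair is ranked on the same scale before a target is chosen.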

Inventors

SangMin Park
Jaeyung Yeo
Gajin SONG

Assignees

SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date: 2026-05-12
Application Date: 2022-08-19
Priority Date: 2021-09-24

Claims (20)

1. An intelligent server for processing a user utterance, the server comprising: memory storing instructions; and at least one processor including processing circuitry, wherein the instructions that, when executed by the at least one processor individually or collectively, cause the server to: based on a target utterance received from any one of at least one electronic device and context information comprising information on each of at least one electronic device and information on at least one domain corresponding to each of the at least one electronic device, generate at least one combination of electronic device information and domain information capable of processing the target utterance, determine reference information among the context information based on the target utterance, calculate a quality of service (QoS) score for each of the at least one combination referring to the reference information, determine a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the QoS score, and transmit a command to process the target utterance with the target domain, to the target electronic device, wherein the calculating of the QoS score includes: calculating a controllability score for each of the at least one combination based on specification information of each of at least one electronic device and on whether at least one domain supports a function associated with the target utterance; and calculating the QoS score by a sum of the controllability score, a functionality score, an accessibility score, and a functional performance robustness score for each of the at least one domain corresponding to each of the at least one electronic device, and wherein the controllability score is associated with a degree of a controllability of each of the at least one combination, the functionality score is associated with sharing frequency of each of the at least one combination, the accessibility score is associated with complexity of authentication of each of the at least one combination, and the functional performance robustness score is associated with a collision between domains.
2. The server of claim 1, wherein the context information comprises permanent context information that does not change in real time and instant context information that changes in real time.
3. The server of claim 2, wherein the permanent context information comprises at least one of network information on the at least one electronic device, account information on the at least one electronic device, information on whether the at least one electronic device is a professional device, and performance information on the at least one domain.
4. The server of claim 2, wherein the instant context information comprises at least one of user preference information of the at least one domain, execution history information of the at least one domain, and utterance history information received by the at least one electronic device.
5. The server of claim 1, wherein the instructions that, when executed by the at least one processor individually or collectively, further cause the electronic device to, in response to the target utterance being an utterance related to playing music, determine information on whether an electronic device is a professional device and current volume information of the electronic device, to be the reference information.
6. The server of claim 1, wherein the instructions that, when executed by the at least one processor individually or collectively, further cause the electronic device to, in response to the target utterance being an utterance for playing at maximum volume, determine information on whether an electronic device is a professional device, maximum volume information of the electronic device, and information on whether a domain has an amplification function, to be the reference information.
7. The server of claim 1, wherein the instructions that, when executed by the at least one processor individually or collectively, further cause the electronic device to, in response to the target utterance being an utterance for sound quality, determine information on sound quality of the electronic device and information on sound quality of the domain, to be the reference information.
8. The server of claim 1, wherein the domain is software configured to process an utterance through a corresponding electronic device, and wherein the software comprises at least one of an application, a program for providing a service in a form of a widget, and a web app.
9. A method for processing a user utterance in an intelligent server, the method comprising: receiving a target utterance from any one of at least one electronic device; generating at least one combination of electronic device information and domain information, capable of processing the target utterance, based on the target utterance and context information, the context information including information on each of the at least one electronic device and information on at least one domain corresponding to each of the at least one electronic device; determining reference information among the context information based on the target utterance; calculating a quality of service (QoS) score for each of the at least one combination referring to the reference information; determining a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the QoS score; and transmitting a command to process the target utterance with the target domain, to the target electronic device, wherein the calculating of the QoS score includes: calculating a controllability score for each of the at least one combination based on specification information of each of at least one electronic device and on whether at least one domain supports a function associated with the target utterance; and calculating the QoS score by a sum of the controllability score, a functionality score, an accessibility score, and a functional performance robustness score for each of the at least one domain corresponding to each of the at least one electronic device, and wherein the controllability score is associated with a degree of a controllability of each of the at least one combination, the functionality score is associated with sharing frequency of each of the at least one combination, the accessibility score is associated with complexity of authentication of each of the at least one combination, and the functional performance robustness score is associated with a collision between domains.
10. The method of claim 9, wherein the context information comprises permanent context information that does not change in real time and instant context information that changes in real time.
11. The method of claim 10, wherein the permanent context information comprises at least one of network information on the at least one electronic device, account information on the at least one electronic device, information on whether the at least one electronic device is a professional device, and performance information on the at least one domain.
12. The method of claim 10, wherein the instant context information comprises at least one of user preference information of the at least one domain, execution history information of the at least one domain, and utterance history information received by the at least one electronic device.
13. The method of claim 9, wherein the determining of the reference information comprises, in response to the target utterance being an utterance related to playing music, determining information on whether an electronic device is a professional device and current volume information of the electronic device, to be the reference information.
14. The method of claim 9, wherein the determining of the reference information comprises, in response to the target utterance being an utterance for playing at maximum volume, determining information on whether an electronic device is a professional device, maximum volume information of the electronic device, and information on whether a domain has an amplification function, to be the reference information.
15. The method of claim 9, wherein the determining of the reference information comprises, in response to the target utterance being an utterance for sound quality, determining information on sound quality of the electronic device and information on sound quality of the domain, to be the reference information.
16. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform a method comprising: receiving a target utterance from any one of at least one electronic device; generating at least one combination of electronic device information and domain information, capable of processing the target utterance, based on the target utterance and context information, the context information including information on each of the at least one electronic device and information on at least one domain corresponding to each of the at least one electronic device; determining reference information among the context information based on the target utterance; calculating a quality of service (QoS) score for each of the at least one combination referring to the reference information; determining a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the QoS score; and transmitting a command to process the target utterance with the target domain, to the target electronic device, wherein the calculating of the QoS score includes: calculating a controllability score for each of the at least one combination based on specification information of each of at least one electronic device and on whether at least one domain supports a function associated with the target utterance; and calculating the QoS score by a sum of the controllability score, a functionality score, an accessibility score, and a functional performance robustness score for each of the at least one domain corresponding to each of the at least one electronic device, and wherein the controllability score is associated with a degree of a controllability of each of the at least one combination, the functionality score is associated with sharing frequency of each of the at least one combination, the accessibility score is associated with complexity of authentication of each of the at least one combination, and the functional performance robustness score is associated with a collision between domains.
17. An electronic device for processing a user utterance, comprising: memory storing computer-executable instructions; and at least one processor including processing circuitry, wherein the instructions that, when executed by the at least one processor individually or collectively, cause the electronic device to: based on a target utterance received from at least one electronic device and context information comprising information on each of at least one electronic device comprising the electronic device and information on at least one domain corresponding to each of the at least one electronic device, generate at least one combination of electronic device information and domain information capable of processing the target utterance, determine reference information for processing the target utterance among the context information, calculate a quality of service (QoS) score for each of the at least one combination referring to the reference information, determine a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the QoS score, and transmit a command to process the target utterance with the target domain, to the target electronic device, wherein the calculating of the QoS score includes: calculating a controllability score for each of the at least one combination based on specification information of each of at least one electronic device and on whether at least one domain supports a function associated with the target utterance; and calculating the QoS score by a sum of the controllability score, a functionality score, an accessibility score, and a functional performance robustness score for each of the at least one domain corresponding to each of the at least one electronic device, and wherein the controllability score is associated with a degree of controllability of each of the at least one combination, the functionality score is associated with sharing frequency of each of the at least one combination, the accessibility score is associated with complexity of authentication of each of the at least one combination, and the functional performance robustness score is associated with a collision between domains.
18. The electronic device of claim 17, wherein the instructions that, when executed by the at least one processor individually or collectively, further cause the electronic device to, in response to the target utterance being an utterance related to playing music, determine information on whether an electronic device is a professional device and current volume information of the electronic device, to be the reference information.
19. The electronic device of claim 17, wherein the instructions that, when executed by the at least one processor individually or collectively, further cause the electronic device to, in response to the target utterance being an utterance for playing at maximum volume, determine information on whether an electronic device is a professional device, maximum volume information of the electronic device, and information on whether a domain has an amplification function, to be the reference information.
20. The electronic device of claim 17, further comprising: a memory configured to store a database including the context information and the domain information.
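
The QoS calculation and the utterance-dependent choice of reference information recited in the claims can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the `ComponentScores` type, the `REFERENCE_KEYS` table, the utterance-type labels, the context key names, and the numeric values are all hypothetical. The claims state only that the QoS score is a sum of four component scores; how each component is computed is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Four component scores for one device/domain combination (names assumed)."""
    controllability: float  # device specifications and domain function support
    functionality: float    # associated with sharing frequency
    accessibility: float    # associated with complexity of authentication
    robustness: float       # associated with collisions between domains

    def qos(self) -> float:
        """QoS score as the plain sum of the four component scores."""
        return (
            self.controllability
            + self.functionality
            + self.accessibility
            + self.robustness
        )


# Hypothetical mapping from an utterance type to the context keys used as
# reference information, mirroring the music, maximum-volume, and
# sound-quality cases in the dependent claims.
REFERENCE_KEYS = {
    "play_music": ("is_professional_device", "current_volume"),
    "play_max_volume": (
        "is_professional_device",
        "max_volume",
        "domain_has_amplification",
    ),
    "sound_quality": ("device_sound_quality", "domain_sound_quality"),
}


def reference_information(context: dict, utterance_type: str) -> dict:
    """Select from `context` only the entries relevant to the utterance type."""
    keys = REFERENCE_KEYS.get(utterance_type, ())
    return {key: context[key] for key in keys if key in context}


# Example values, invented for illustration only.
example_scores = ComponentScores(3.0, 2.0, 1.0, 4.0)
example_context = {
    "is_professional_device": True,
    "current_volume": 5,
    "max_volume": 10,
}
example_reference = reference_information(example_context, "play_music")
```

Keeping the reference keys in a table separates the question of which context matters for an utterance from the question of how that context is scored.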

Description

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation application, claiming priority under § 365(c) of an International application No. PCT/KR2022/010924, filed on Jul. 26, 2022, which is based on and claims the benefit of a Korean patent application number 10-2021-0126219, filed on Sep. 24, 2021, in the Korean Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.

BACKGROUND

1. Field

The disclosure relates to an intelligent server and an electronic device for processing a user utterance and an operating method thereof.

2. Description of Related Art

Electronic devices including a voice assistant function that provides a service based on user utterance are being widely distributed. The electronic device may recognize the user utterance through an artificial intelligence server and may figure out the meaning and intent of the user utterance. The artificial intelligence server may infer an intent of a user by interpreting an utterance of the user, perform tasks according to the inferred intent, and perform tasks according to the intent of the user expressed through interaction, in a natural language, between the user and the artificial intelligence server. At the moment an utterance is made, the artificial intelligence server may analyze various pieces of information on a situation related to the utterance to figure out an intent of the utterance.

The above information is presented as background information only to assist with an understanding of the disclosure. No determination has been made, and no assertion is made, as to whether any of the above might be applicable as prior art with regard to the disclosure.

SUMMARY

As electronic devices capable of performing various functions, such as smart watches, smart refrigerators, and/or smart speakers, are increasing, it is becoming important for an artificial intelligence server to be able to determine which device is to process an utterance.
An artificial intelligence server may determine, according to a predefined policy, which electronic device has priority to process a user utterance, and after the electronic device to process the user utterance is determined, the server may determine an application of the electronic device to process the user utterance. For example, after the electronic device is determined, an application to process the utterance among applications in the electronic device may be determined by classifying an intent of the user utterance. However, a method of determining the application to process the utterance after the electronic device is determined considers only whether the application supports the utterance, and does not consider the service quality of the application.

Aspects of the disclosure are to address at least the above-mentioned problems and/or disadvantages and to provide at least the advantages described below. Accordingly, an aspect of the disclosure is to provide a server and an electronic device for processing a user utterance and an operating method thereof.

Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.

In accordance with an aspect of the disclosure, an intelligent server for processing a user utterance is provided.
The intelligent server includes a memory configured to store context information including information on each of at least one electronic device and information on at least one domain corresponding to each of the at least one electronic device, and a processor configured to, based on a target utterance received from any one of the at least one electronic device and the context information, generate at least one combination of electronic device information and domain information capable of processing the target utterance, determine reference information for processing the target utterance among the context information, calculate a quality of service score for each of the at least one combination referring to the reference information, determine a target combination of a target electronic device and a target domain corresponding to the target electronic device based on the quality of service score, and transmit a command to process the target utterance with the target domain, to the target electronic device.

In accordance with an aspect of the disclosure, a method for processing a user utterance in an intelligent server is provided. The method includes receiving a target utterance from any one of at least one electronic device, generating at least one combination of electronic device information and domain information, capable of processing the target utterance, based on the target utterance and context information, the context information including information on each of the at least one electronic device and information on at least one domain corresponding to each of the one or