CN-121985421-A - Internet of vehicles resource allocation method, equipment, computer program product and computer readable storage medium

CN121985421A

Abstract

The application provides an internet of vehicles resource allocation method, equipment, computer program product, and computer readable storage medium. The method comprises: obtaining a plurality of online interaction sample data generated by real-time interaction between an initial online reinforcement learning model and an internet of vehicles environment, and obtaining a pre-trained target offline reinforcement learning model, wherein each online interaction sample datum at least comprises a current state, an action, a reward, and a next state; determining a target value function based on the online interaction sample data, the target offline reinforcement learning model, and the initial online reinforcement learning model; updating network parameters in the initial online reinforcement learning model with the target value function to obtain a target online reinforcement learning model; and obtaining the current state of each vehicle in the internet of vehicles environment, and determining the allocation of spectrum resources and transmission power for a communication link in that environment based on the target online reinforcement learning model and each current state, wherein the communication link at least comprises a vehicle-to-infrastructure communication link and a vehicle-to-vehicle communication link.

Inventors

  • XIE HONGMING

Assignees

  • 中移(苏州)软件技术有限公司 (China Mobile (Suzhou) Software Technology Co., Ltd.)
  • 中国移动通信集团有限公司 (China Mobile Communications Group Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. A method for allocating internet of vehicles resources, the method comprising: acquiring a plurality of online interaction sample data generated by real-time interaction between an initial online reinforcement learning model and an internet of vehicles environment, and acquiring a pre-trained target offline reinforcement learning model, wherein each online interaction sample datum at least comprises a current state, an action, a reward, and a next state; determining a target value function based on the online interaction sample data, the target offline reinforcement learning model, and the initial online reinforcement learning model; updating network parameters in the initial online reinforcement learning model with the target value function to obtain a target online reinforcement learning model; and acquiring the current state of each vehicle in the internet of vehicles environment, and determining the allocation of spectrum resources and transmission power for a communication link in the internet of vehicles environment based on the target online reinforcement learning model and each current state, wherein the communication link at least comprises a vehicle-to-infrastructure communication link and a vehicle-to-vehicle communication link.
  2. The method of claim 1, wherein obtaining the pre-trained target offline reinforcement learning model comprises: acquiring a plurality of historical interaction sample data; constructing a first loss function based on the plurality of historical interaction sample data using a conservative value function algorithm; and iteratively updating parameters of an initial offline reinforcement learning model based on the first loss function by gradient descent until a training termination condition is met, to obtain the target offline reinforcement learning model.
  3. The method of claim 1, wherein determining the target value function based on the online interaction sample data, the target offline reinforcement learning model, and the initial online reinforcement learning model comprises: randomly sampling a batch of samples from the online interaction sample data as target online interaction sample data; and determining the target value function from the target online interaction sample data using the target offline reinforcement learning model and the initial online reinforcement learning model, respectively.
  4. The method of claim 3, wherein determining the target value function from the target online interaction sample data using the target offline reinforcement learning model and the initial online reinforcement learning model, respectively, comprises: for each target online interaction sample datum, inputting the sample datum into the target offline reinforcement learning model to obtain a first value function; inputting the sample datum into the initial online reinforcement learning model to obtain a second value function; and determining the target value function from the first value function and the second value function through a fusion rule.
  5. The method of claim 4, wherein determining the target value function from the first value function and the second value function through the fusion rule comprises: for each target online interaction sample datum, comparing the magnitudes of the first value function and the second value function, and taking the larger of the two as the target value function.
  6. The method of claim 1, wherein updating network parameters in the initial online reinforcement learning model with the target value function to obtain the target online reinforcement learning model comprises: for each target online interaction sample datum, acquiring the current value function estimated by the initial online reinforcement learning model for the current state and action in that sample datum; constructing a second loss function based on a plurality of the target value functions and the current value functions; and iteratively updating network parameters in the initial online reinforcement learning model based on the second loss function by gradient descent until a training stopping condition is met, to obtain the target online reinforcement learning model.
  7. The method of claim 1, wherein determining the allocation of spectrum resources and transmission power for a communication link in the internet of vehicles environment based on the target online reinforcement learning model and each current state comprises: processing each current state with the target online reinforcement learning model to obtain a resource allocation action for each vehicle, wherein the resource allocation action comprises selecting a spectrum resource block and determining a transmission power level; and allocating spectrum resources and transmission power to each vehicle's communication link based on each resource allocation action.
  8. An internet of vehicles resource allocation apparatus, comprising: a memory for storing computer-executable instructions or computer programs; and a processor for implementing the method of any one of claims 1 to 7 when executing the computer-executable instructions or computer programs stored in the memory.
  9. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
  10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the method of any one of claims 1 to 7.
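Claims 3 to 5 describe sampling a batch of online transitions and, for each sample, fusing the offline and online value estimates by keeping the larger of the two as the target. A minimal tabular sketch of that fusion rule, assuming Q-value arrays indexed by state and action (the function name, array shapes, and discount factor are illustrative, not from the patent):

```python
import numpy as np

def fused_targets(batch, q_offline, q_online, gamma=0.99):
    """Per-sample target value: reward plus the discounted maximum over
    next-state actions of the elementwise-larger of the offline and
    online value functions (the fusion rule of claim 5)."""
    targets = []
    for (s, a, r, s_next) in batch:
        q_off = q_offline[s_next]        # offline model's values for s'
        q_on = q_online[s_next]          # online model's values for s'
        fused = np.maximum(q_off, q_on)  # keep the larger value function
        targets.append(r + gamma * fused.max())
    return np.array(targets)

# toy setup: 3 states, 2 actions, two sampled (s, a, r, s') transitions
q_offline = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]])
q_online  = np.array([[0.9, 0.7], [0.4, 0.6], [0.1, 0.2]])
batch = [(0, 1, 1.0, 2), (1, 0, 0.5, 0)]
print(fused_targets(batch, q_offline, q_online))
```

Taking the elementwise maximum lets the pre-trained offline model backstop the online model's early, noisy estimates, which is one plausible reading of how the fusion mitigates the sample-scarcity problem the background section describes.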

Description

Internet of vehicles resource allocation method, equipment, computer program product and computer readable storage medium

Technical Field

The present application relates to the internet of vehicles, and in particular to an internet of vehicles resource allocation method, apparatus, computer program product, and computer readable storage medium.

Background

In the field of internet of vehicles resource allocation, the prior art generally models the resource scheduling problem as a partially observable Markov decision process and applies multi-agent reinforcement learning for online learning. In the related art, each vehicle is treated as an agent; samples are obtained through real-time interaction with the internet of vehicles environment and stored in an experience pool, and each agent's target network is trained on randomly sampled batches to obtain a better policy. However, the related art relies solely on an online reinforcement learning algorithm, which makes insufficient use of the limited samples available during training. In the highly dynamic scenario of real-time internet of vehicles scheduling in particular, the algorithm can hardly accumulate sufficiently diverse experience in a short time; the resulting sample scarcity limits the model's ability to accurately model the complex environment. Meanwhile, modeling the resource scheduling problem as a partially observable Markov decision process leaves the encoding of global state features insufficient, so value function estimates are biased and complex factors such as cross traffic flow and congestion are difficult to account for comprehensively. The final resource allocation is therefore inaccurate and cannot meet the internet of vehicles' requirements for high-reliability, low-latency communication.
Disclosure of Invention

The embodiments of the application provide an internet of vehicles resource allocation method, equipment, computer program product, and computer readable storage medium, so that spectrum and power allocation decisions are more accurate and the internet of vehicles' requirements for high real-time performance and high reliability are met. The technical scheme of the embodiments of the application is realized as follows. An embodiment of the application provides a method for allocating internet of vehicles resources, comprising the following steps: acquiring a plurality of online interaction sample data generated by real-time interaction between an initial online reinforcement learning model and an internet of vehicles environment, and acquiring a pre-trained target offline reinforcement learning model, wherein each online interaction sample datum at least comprises a current state, an action, a reward, and a next state; determining a target value function based on the online interaction sample data, the target offline reinforcement learning model, and the initial online reinforcement learning model; updating network parameters in the initial online reinforcement learning model with the target value function to obtain a target online reinforcement learning model; and acquiring the current state of each vehicle in the internet of vehicles environment, and determining the allocation of spectrum resources and transmission power for a communication link in the internet of vehicles environment based on the target online reinforcement learning model and each current state, wherein the communication link at least comprises a vehicle-to-infrastructure communication link and a vehicle-to-vehicle communication link.
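The update step described above (and elaborated in claim 6) amounts to regressing the online model's current value estimates toward the fused target values via a squared-error "second loss". A tabular sketch of one such gradient step, assuming a simple learning rate and target values computed elsewhere (all names and numbers are illustrative):

```python
import numpy as np

def update_online_q(q_online, batch, targets, lr=0.1):
    """One gradient-descent step on the squared error between the online
    model's current Q(s, a) and the fused target values (the second loss)."""
    q = q_online.copy()
    for (s, a, r, s_next), y in zip(batch, targets):
        td_error = y - q[s, a]     # gradient of 0.5*(y - Q)^2 w.r.t. Q
        q[s, a] += lr * td_error   # move the estimate toward the target
    return q

# toy example: start from zero estimates and two sampled transitions
q_online = np.zeros((3, 2))
batch = [(0, 1, 1.0, 2), (1, 0, 0.5, 0)]
targets = np.array([1.297, 1.49])
q_new = update_online_q(q_online, batch, targets)
print(q_new[0, 1], q_new[1, 0])
```

In the patent's setting the value function is a neural network rather than a table, so the per-entry update would be replaced by backpropagation through the second loss, but the regression target is the same.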
In the above solution, obtaining the pre-trained target offline reinforcement learning model includes: acquiring a plurality of historical interaction sample data; constructing a first loss function based on the plurality of historical interaction sample data using a conservative value function algorithm; and iteratively updating parameters of an initial offline reinforcement learning model based on the first loss function by gradient descent until a training termination condition is met, to obtain the target offline reinforcement learning model. In the above solution, determining the target value function based on the online interaction sample data, the target offline reinforcement learning model, and the initial online reinforcement learning model includes: randomly sampling a batch of samples from the online interaction sample data as target online interaction sample data; and determining the target value function from the target online interaction sample data using the target offline reinforcement learning model and the initial online reinforcement learning model, respectively. In the above solution, the determining, for each of the target online interaction sample data, the target value function based on the target online interaction
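The "conservative value function algorithm" used to build the first loss is not spelled out in the text; it resembles conservative Q-learning (CQL), which adds a penalty pushing down value estimates for actions absent from the logged data. A minimal sketch of such a first loss on one batch of historical transitions, assuming a tabular Q-function and a penalty weight `alpha` (both are hypothetical choices, not taken from the patent):

```python
import numpy as np

def conservative_loss(q, batch, gamma=0.99, alpha=1.0):
    """CQL-style first loss: TD error on logged transitions plus a
    penalty that lowers Q-values for actions the historical data
    did not take, keeping offline estimates conservative."""
    td_terms, penalty_terms = [], []
    for (s, a, r, s_next) in batch:
        target = r + gamma * q[s_next].max()
        td_terms.append((q[s, a] - target) ** 2)
        # log-sum-exp over all actions minus the logged action's value
        lse = np.log(np.exp(q[s]).sum())
        penalty_terms.append(lse - q[s, a])
    return np.mean(td_terms) + alpha * np.mean(penalty_terms)

# toy historical batch over 3 states and 2 actions
q = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]])
batch = [(0, 0, 1.0, 1), (1, 1, 0.5, 2)]
print(conservative_loss(q, batch))
```

Minimizing this loss by gradient descent until a termination condition is met would yield the target offline reinforcement learning model described above; the conservatism matters because the offline model's values are later fused with the online model's and should not be overestimates.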