CN-121710974-B - Phased array phase control method and device based on reinforcement learning
Abstract
The invention discloses a phased array phase control method based on reinforcement learning, belonging to the technical field of wireless communication. The phase of the phased array is controlled through a deep reinforcement learning algorithm, which improves control adaptability and accuracy and allows the control to adapt to environmental changes and device aging. An improved particle swarm algorithm is adopted to update the strategy network, which improves the quality of the strategy-network update and further enhances the phase control accuracy of the phased array.
Inventors
- LEI QI
- HOU XIAOPENG
Assignees
- 四川博谱微波科技有限公司 (Sichuan Bopu Microwave Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-12
Claims (9)
- 1. A reinforcement learning-based phased array phase control method, comprising: collecting a real-time state of a phased array, and, based on the real-time state, selecting a phase control action for each antenna unit of the phased array in an action space by means of a strategy network; adjusting the beam direction of the phased array according to the phase control action of each antenna unit, and acquiring a communication performance index with the adjusted phased array, wherein the communication performance index is the signal-to-interference-plus-noise ratio (SINR); obtaining a reward according to the communication performance index, acquiring the state at the next moment, and constructing an experience sample from the real-time state, the phase control action, the reward, and the state at the next moment, wherein the reward is obtained as (formula reconstructed from the symbol definitions): r_t = α·SINR_t − β·‖φ_t − φ_{t−1}‖, where r_t is the reward at the current time t, SINR_t is the communication performance index at time t, φ_t is the phase vector at the current time t, φ_{t−1} is the phase vector at the previous time t−1, α is the communication weight coefficient, and β is the smoothing weight coefficient; constructing an experience replay pool from the experience samples, and updating the experience samples in the pool with a first-in first-out strategy; each time the experience samples in the replay pool are monitored to reach a preset quantity, randomly sampling a plurality of experience samples from the pool, and updating the strategy network with the sampled experience samples; and controlling the phase of the phased array with the updated strategy network, thereby completing the reinforcement learning-based phased array phase control.
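The replay mechanics of claim 1 (FIFO experience pool, update trigger at a preset sample count, reward shaped by SINR and phase smoothness) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the reward uses the reconstructed form r_t = α·SINR_t − β·‖φ_t − φ_{t−1}‖, and all names (`ReplayPool`, `alpha`, `beta`) are illustrative.

```python
import collections
import math
import random

def reward(sinr_t, phase_t, phase_prev, alpha=1.0, beta=0.1):
    """Reconstructed reward: alpha * SINR_t - beta * ||phase_t - phase_{t-1}||."""
    smooth = math.sqrt(sum((a - b) ** 2 for a, b in zip(phase_t, phase_prev)))
    return alpha * sinr_t - beta * smooth

class ReplayPool:
    """Experience replay pool with first-in first-out eviction."""
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest sample when full (FIFO)
        self.buf = collections.deque(maxlen=capacity)

    def add(self, state, action, r, next_state):
        self.buf.append((state, action, r, next_state))

    def ready(self, preset):
        # update trigger: pool has reached the preset sample quantity
        return len(self.buf) >= preset

    def sample(self, k):
        # random mini-batch for the strategy-network update
        return random.sample(list(self.buf), k)
```

A usage sketch: after each interaction step, `pool.add(...)` stores the transition, and once `pool.ready(n)` holds, `pool.sample(batch)` feeds the network update.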
- 2. The reinforcement learning-based phased array phase control method of claim 1, wherein the real-time state includes channel state information fed back by the receiving end, a received signal-to-noise ratio, and/or the phase control action at the previous moment.
- 3. The reinforcement learning-based phased array phase control method of claim 1, wherein the phase control actions include phase offsets for the individual antenna elements in the phased array.
- 4. The reinforcement learning-based phased array phase control method of claim 1, wherein updating the strategy network with the sampled experience samples comprises: updating the network parameters of the strategy network by gradient descent based on the experience samples.
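Claim 4's gradient-descent update is the standard rule θ ← θ − lr·∂L/∂θ applied to parameters fitted on the sampled experience. A minimal stand-alone sketch (the function names and the toy loss are illustrative, not from the patent):

```python
def gradient_step(params, grad_fn, lr=0.01):
    """One gradient-descent update: theta <- theta - lr * dL/dtheta."""
    grads = grad_fn(params)
    return [p - lr * g for p, g in zip(params, grads)]

# Example with the toy loss L(theta) = theta^2, whose gradient is 2*theta:
# starting from theta = 1.0 with lr = 0.1, one step yields theta = 0.8.
```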
- 5. The reinforcement learning-based phased array phase control method of claim 1, wherein updating the strategy network with the sampled experience samples comprises: generating a particle swarm based on the network parameters of the strategy network; for each particle in the swarm, acquiring the loss function value corresponding to the particle from the sampled experience samples, and determining the particle with the minimum loss function value as the instantaneous optimal particle; based on the instantaneous optimal particle, performing rapid solution-space exploration of the particles in the swarm using a double-optimal learning strategy with nonlinearly adjusted weight and learning coefficients, and determining the particles after rapid exploration; performing enhanced solution-space exploration on those particles using a nonlinear information exchange strategy improved with a fuzzy information quantity, and determining the particles after enhanced exploration; performing global greedy exploration on those particles using a sine Lévy flight strategy with time-adaptive adjustment, and determining the particles after global greedy exploration; and acquiring the number of completed iterations and judging whether it is greater than or equal to a preset maximum number of iterations: if so, determining the global optimal particle from the particles after global greedy exploration, taking the network parameters of the global optimal particle as the network parameters of the strategy network, and finishing the update; otherwise, returning to the step of acquiring the instantaneous optimal particle.
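The outer loop of claim 5 can be sketched as below: seed a swarm from the current network parameters (each particle is one flattened parameter vector), pick the minimum-loss particle as the instantaneous optimum each iteration, move the rest toward it, and return the global best as the new parameters. This is a simplified stand-in — the three exploration stages (double-optimal learning, fuzzy information exchange, sine Lévy flight) are collapsed into a single attraction-plus-noise step, and `sigma`, `n_particles`, and `max_iter` are illustrative.

```python
import random

def pso_update_policy(params, loss_fn, n_particles=10, max_iter=20, sigma=0.1):
    """Sketch of the claim-5 outer loop: swarm seeded around params,
    iterated toward the instantaneous optimum, best particle returned."""
    swarm = [[p + random.gauss(0, sigma) for p in params] for _ in range(n_particles)]
    swarm[0] = list(params)  # keep the current parameters as one particle
    for _ in range(max_iter):
        best = min(swarm, key=loss_fn)  # instantaneous optimal particle
        swarm = [
            [x + random.random() * (b - x) + random.gauss(0, sigma * 0.1)
             for x, b in zip(particle, best)]
            for particle in swarm
        ]
        swarm[0] = best  # elitism: never discard the best particle found
    return min(swarm, key=loss_fn)
```

Because the best particle is always retained, the returned parameters are never worse (under `loss_fn`) than the input parameters.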
- 6. The reinforcement learning-based phased array phase control method of claim 5, wherein, based on the instantaneous optimal particle, the double-optimal learning strategy with nonlinearly adjusted weight and learning coefficients is adopted to rapidly explore the solution space of the particles in the particle swarm, and the particles after rapid exploration are determined, comprising: constructing a nonlinear adjustment function f(x), where x is the variable and e is the natural constant [explicit form of f not preserved in the source text]; according to f, acquiring the velocity adjustment weight ω, the first learning factor c1, and the second learning factor c2 as functions of an adjustment parameter, the number of completed iterations iter, and the maximum number of iterations iter_max [explicit forms not preserved in the source text]; and, according to ω, c1, and c2, rapidly exploring the solution space of the particles in the swarm, the particles after rapid exploration being determined by (update equations reconstructed from the symbol definitions): v_i(iter+1) = ω·v_i(iter) + c1·r1·(p_i − x_i(iter)) + c2·r2·(g − x_i(iter)); x_i' = x_i(iter) + v_i(iter+1), where x_i(iter) is the i-th particle in the iter-th iteration, x_i' is the i-th particle after rapid solution-space exploration, v_i(iter+1) is the update velocity of the i-th particle in iteration iter+1, v_i(iter) is its update velocity in iteration iter, i = 1, 2, ..., M, M is the total number of particles, r1 is a first random number in (0, 1), r2 is a second random number in (0, 1), p_i is the historical optimum of x_i, and g is the instantaneous optimal particle.
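The "double-optimal" update of claim 6 is the classic two-attractor particle-swarm rule, v_i ← ω·v_i + c1·r1·(p_i − x_i) + c2·r2·(g − x_i), pulling each particle toward both its historical optimum p_i and the instantaneous optimum g. A minimal sketch; since the patent's nonlinear schedules for ω, c1, c2 are not preserved in the source, they are passed in as plain arguments here (an assumption):

```python
import random

def velocity_update(x, v, p_best, g_best, w, c1, c2):
    """One double-optimal PSO step:
    v' = w*v + c1*r1*(p_best - x) + c2*r2*(g_best - x);  x' = x + v'."""
    r1, r2 = random.random(), random.random()
    v_new = [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
             for xi, vi, pb, gb in zip(x, v, p_best, g_best)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new
```

A sanity property: a particle sitting at both optima with zero velocity does not move; with w = 1 and zero learning factors, the particle simply coasts.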
- 7. The reinforcement learning-based phased array phase control method of claim 6, wherein performing the enhanced solution-space exploration on the particles after rapid solution-space exploration using the nonlinear information exchange strategy improved with a fuzzy information quantity, and determining the particles after enhanced exploration, comprises: acquiring the loss function values of the particles after rapid exploration, and, from these values, acquiring the attribution degree of each such particle relative to the instantaneous optimal particle (formula reconstructed from the symbol definitions): μ_j = (L_max − L_j) / (L_max − L*), where μ_j is the attribution degree of the j-th particle after rapid exploration, L* is the loss function value of the instantaneous optimal particle, L_max is the maximum loss function value among all particles after rapid exploration, and L_j is the loss function value of the j-th particle; according to the attribution degree, obtaining the degree ν_j to which the j-th particle after rapid exploration is not at the current optimal position of the instantaneous optimal particle, using an adjustable parameter λ with value in [0.8, 1] [explicit form not preserved in the source text]; according to μ_j and ν_j, obtaining the fuzzy information quantity K, where max is the maximum function [explicit form not preserved in the source text]; according to K, acquiring a first nonlinear exchange coefficient and a second nonlinear exchange coefficient using a third and a fourth random number, each randomly 1 or −1 [explicit forms not preserved in the source text]; and, according to the two nonlinear exchange coefficients, performing enhanced solution-space exploration on the particles after rapid exploration, the j-th enhanced particle being determined from x_j'(iter), the j-th particle after rapid exploration in the iter-th iteration, and a randomly selected, different enhanced particle [explicit update equation not preserved in the source text].
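Claim 7's attribution degree is described only through its symbols (loss of the instantaneous optimum L*, maximum loss L_max, particle loss L_j). One reconstruction consistent with those symbols is μ_j = (L_max − L_j)/(L_max − L*), which normalizes each particle's loss so that the worst particle gets μ = 0 and the instantaneous optimum gets μ = 1. The sketch below implements that assumed form only; it is not the patent's verified formula:

```python
def attribution_degree(loss_j, loss_best, loss_max):
    """Assumed attribution degree: 1 at the instantaneous optimum,
    0 at the worst particle, linear in between."""
    if loss_max == loss_best:  # degenerate swarm: all particles equally good
        return 1.0
    return (loss_max - loss_j) / (loss_max - loss_best)
```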
- 8. The reinforcement learning-based phased array phase control method of claim 7, wherein performing the global greedy exploration on the particles after enhanced solution-space exploration using the sine Lévy flight strategy with time-adaptive adjustment, and determining the particles after global greedy exploration, comprises: obtaining the golden angle (formula reconstructed from the symbol definitions): θ_g = 2π·(1 − 1/φ), with φ = (1 + √5)/2, where θ_g is the golden angle and φ is the golden ratio; generating a Lévy flight factor, and performing a global search on the particles after enhanced exploration according to the golden angle and the Lévy flight factor, the m-th global search particle being (reconstructed): X_m'' = X_m' + r5·s·sin(θ_g)·L, where X_m' is the m-th particle after enhanced exploration in the iter-th iteration, X_m'' is the m-th global search particle, r5 is a fifth random number in (0, 1), s is the step-size scaling factor, and L is a random step vector with the same dimension as X_m'; and judging whether the loss function value of each particle after enhanced exploration is smaller than that of the corresponding global search particle: if so, taking the original particle after enhanced exploration as the particle after global greedy exploration; otherwise, taking the global search particle as the particle after global greedy exploration.
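Claim 8 combines three standard ingredients: the golden angle θ_g = 2π(1 − 1/φ) ≈ 137.5°, a Lévy-distributed random step (sketched here with Mantegna's algorithm, a common choice that the patent does not specify), and a greedy accept-if-better rule. A minimal illustration under those assumptions, with `scale` and `lam` as illustrative parameters:

```python
import math
import random

PHI = (1 + math.sqrt(5)) / 2                 # golden ratio
GOLDEN_ANGLE = 2 * math.pi * (1 - 1 / PHI)   # ~2.39996 rad (~137.5 degrees)

def levy_step(dim, lam=1.5):
    """Mantegna's algorithm: a heavy-tailed random step vector."""
    num = math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
    den = math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2)
    sigma = (num / den) ** (1 / lam)
    return [random.gauss(0, sigma) / abs(random.gauss(0, 1)) ** (1 / lam)
            for _ in range(dim)]

def greedy_levy_search(x, loss_fn, scale=0.01):
    """Sine-Levy candidate; kept only if it lowers the loss (greedy rule)."""
    step = levy_step(len(x))
    candidate = [xi + random.random() * scale * math.sin(GOLDEN_ANGLE) * s
                 for xi, s in zip(x, step)]
    return candidate if loss_fn(candidate) < loss_fn(x) else x
```

The greedy rule guarantees the loss never increases across search steps, which is what makes the exploration "greedy" in the claim's sense.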
- 9. A reinforcement learning-based phased array phase control apparatus for performing the reinforcement learning-based phased array phase control method of any one of claims 1 to 8, comprising: an action control module for collecting the real-time state of the phased array and, based on the real-time state, selecting the phase control action of each antenna unit of the phased array in the action space by means of a strategy network; an index acquisition module for adjusting the beam direction of the phased array according to the phase control action of each antenna unit and acquiring the communication performance index with the adjusted phased array; a sample construction module for obtaining the reward according to the communication performance index, acquiring the state at the next moment, and constructing an experience sample from the real-time state, the phase control action, the reward, and the state at the next moment; a sample updating module for constructing an experience replay pool from the experience samples and updating the experience samples in the pool with a first-in first-out strategy; a network updating module for randomly sampling a plurality of experience samples from the replay pool once the experience samples in the pool are monitored to reach a preset quantity, and updating the strategy network with the sampled experience samples; and a reinforcement control module for controlling the phase of the phased array with the updated strategy network, thereby completing the reinforcement learning-based phased array phase control.
Description
Phased array phase control method and device based on reinforcement learning

Technical Field

The invention belongs to the technical field of wireless communication, and particularly relates to a phased array phase control method and device based on reinforcement learning.

Background

Conventional phased array phase control methods typically rely on accurate channel state information (CSI) to calculate optimal phase weights using beamforming algorithms (e.g., codebook search or convex-optimization-based methods). In practical communication environments, however, the channel state tends to be time-varying and difficult to acquire accurately; in high-speed mobile scenarios in particular, channel estimation and feedback incur significant overhead and delay. In addition, traditional optimization algorithms have high computational complexity and struggle to meet the requirements of low-delay communication. Reinforcement learning can therefore be adopted to improve control precision: through interaction between an agent and the environment, it learns an optimal strategy from reward feedback, requires no accurate channel model in advance, and has strong adaptability. However, during training, and especially in the parameter-update stage of the strategy network, existing reinforcement learning methods often suffer from slow convergence and easily become trapped in local optima. The traditional gradient descent method has limited search capability on complicated non-convex optimization problems, so the beam-pointing control precision of the phased array remains low and the communication performance is limited.
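The codebook search mentioned in the background can be illustrated with a minimal sketch: given a known channel vector, each antenna element's phase is picked from a quantized codebook so as to co-phase its channel coefficient and maximize the array gain. This is a generic textbook-style example, not the patent's method; the function names and the per-element exhaustive search are illustrative.

```python
import cmath
import math

def codebook_beamform(channel, bits=3):
    """Per-element exhaustive codebook search: pick the quantized phase phi_n
    maximizing Re(h_n * e^{j*phi_n}), which co-phases the channel terms."""
    levels = [2 * math.pi * k / 2 ** bits for k in range(2 ** bits)]
    phases = []
    for h in channel:
        phases.append(max(levels, key=lambda p: (h * cmath.exp(1j * p)).real))
    return phases

def array_gain(channel, phases):
    """Magnitude of the coherently combined signal |sum_n h_n e^{j*phi_n}|."""
    return abs(sum(h * cmath.exp(1j * p) for h, p in zip(channel, phases)))
```

With an ideal channel the quantized weights recover the full array gain (equal to the number of elements); under quantization error the gain degrades gracefully, which is the overhead/precision trade-off the background paragraph refers to.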
Disclosure of Invention

The invention provides a phased array phase control method and device based on reinforcement learning, which solve the problems that traditional optimization algorithms have high computational complexity and struggle to meet low-delay communication requirements, and that the strategy network in traditional reinforcement learning is updated inefficiently and easily falls into local optima, leaving the beam control precision insufficient; the control precision is thereby further improved. In one aspect, the present invention provides a phased array phase control method based on reinforcement learning, including: collecting a real-time state of a phased array, and, based on the real-time state, selecting a phase control action for each antenna unit of the phased array in an action space by means of a strategy network; adjusting the beam direction of the phased array according to the phase control action of each antenna unit, and acquiring a communication performance index with the adjusted phased array; obtaining a reward according to the communication performance index, acquiring the state at the next moment, and constructing an experience sample from the real-time state, the phase control action, the reward, and the state at the next moment; constructing an experience replay pool from the experience samples, and updating the experience samples in the pool with a first-in first-out strategy; each time the experience samples in the replay pool are monitored to reach a preset quantity, randomly sampling a plurality of experience samples from the pool and updating the strategy network with the sampled experience samples; and controlling the phase of the phased array with the updated strategy network, thereby completing the reinforcement learning-based phased array phase control. Further, the real-time state includes channel state information fed back by the receiving end, a received signal-to-noise ratio, and/or the phase control action at the previous moment. Further, the phase control action includes a phase offset for each antenna element in the phased array. Further, obtaining the reward according to the communication performance index includes (formula reconstructed from the symbol definitions): r_t = α·SINR_t − β·‖φ_t − φ_{t−1}‖, where r_t is the reward at the current time t, SINR_t is the communication performance index at time t, φ_t is the phase vector at the current time t, φ_{t−1} is the phase vector at the previous time t−1, α is the communication weight coefficient, and β is the smoothing weight coefficient. Further, updating the strategy network with the sampled experience samples includes: updating the network parameters of the strategy network by gradient descent based on the experience samples. Further, updating the strategy network with the sampled experience samples includes: generating a particle swarm based on the network parameters of the strategy network; for each particle in the swarm, acquiring the loss function value corresponding to the particle from the sampled experience samples, and determining the particle with the minimum loss function value as the instantaneous optimal particle.