CN-122024054-A - Fruit stalk instance segmentation method and system for a string tomato picking robot

Abstract

The invention provides a fruit stalk instance segmentation method and system for a string tomato picking robot. The method comprises: constructing a string tomato fruit stalk data set and applying augmentation preprocessing; constructing a YOLO11-FFTDA instance segmentation model; extracting features in the backbone network with a feature pyramid that includes a frequency pyramid attention module, which enhances edge details in the frequency domain through the fast Fourier transform; introducing a C2PSADA hybrid attention module at the fusion stage of the neck network, which uses a deformable attention mechanism to generate adaptive sampling points that fit the geometry of the fruit stalks and suppress background noise, while an independent high-resolution branch captures tiny targets; outputting the class, bounding box and mask of the fruit stalks through a decoupled segmentation head; and calculating the optimal cutting point from the skeleton center extracted from the mask, then combining it with the hand-eye calibration matrix so that the robot arm can be accurately positioned for picking. The invention realizes closed-loop control from visual perception to physical operation and significantly improves the picking success rate and environmental adaptability of the string tomato picking robot.

Inventors

QIN CHENGJIN
LIN YANGTIAN
ZHENG HANCHEN
YANG YIZHOU
GONG LIANG
ZHANG SHUO
CHENG LEYE
LIU CHENGLIANG

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-28

Claims (10)

1. A fruit stalk instance segmentation method for a string tomato picking robot, characterized by comprising the following steps: step S1, acquiring string tomato images in a real orchard environment, performing pixel-level polygon annotation of the fruit stalks in the images, applying data augmentation preprocessing, and constructing a string tomato fruit stalk instance segmentation data set; step S2, constructing a YOLO11-FFTDA instance segmentation network model comprising, in sequence, a backbone network, a neck network and a segmentation head; step S3, inputting the preprocessed image into the backbone network and extracting multi-scale features with an integrated convolutional feature pyramid module, wherein the convolutional feature pyramid module comprises frequency pyramid attention modules arranged in parallel that enhance the edge details of the features through a frequency-domain transform; step S4, inputting the feature maps output by the backbone network into the neck network, performing multi-scale fusion with a path aggregation network, and introducing a C2PSADA hybrid attention module at the deep feature fusion stage, which uses a deformable attention mechanism to generate adaptive sampling points that fit the irregular shape of the fruit stalks and suppress background noise; step S5, leading out of the neck network a tiny-target detection branch that is independent of deep downsampling, the branch being connected directly to a high-resolution feature layer to capture tiny fruit stalk targets; step S6, processing the fused features with a decoupled segmentation head, optimizing the model parameters based on a loss function, and outputting the class confidence, bounding box coordinates and pixel-level mask of the tomato fruit stalks; and step S7, calculating the pixel coordinates of the optimal cutting point of the fruit stalk from the output pixel-level mask, converting the cutting point coordinates into three-dimensional coordinates in the robot arm base coordinate system by combining depth information with the hand-eye calibration matrix, planning the motion trajectory of the robot arm, and controlling the end effector to complete the cutting action.
2. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 1, wherein the frequency pyramid attention module in step S3 adopts a pyramid structure of multiple branches arranged in parallel, comprising a feature-preserving branch, a global context branch and a cascaded frequency-domain pyramid branch.
3. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 2, wherein the processing performed by each branch of the frequency pyramid attention module in step S3 comprises: the feature-preserving branch processes the input features with a convolution and retains the original spatial features; the global context branch extracts global context information of the image using downsampling and average pooling operations; the cascaded frequency-domain pyramid branch uses a deep cascade structure to progressively strengthen tiny features, in which the features pass through a convolution with a first dilation rate and are input into a frequency-domain attention unit for a first frequency-domain enhancement, then pass through a convolution with a second dilation rate and are input into a frequency-domain attention unit for a second frequency-domain enhancement, then pass through a convolution with a third dilation rate and are input into a frequency-domain attention unit for a third frequency-domain enhancement, and then pass through a convolution with a fourth dilation rate and are input into a frequency-domain attention unit for a fourth frequency-domain enhancement; and the output features of the three branches are fused to generate the final enhanced feature map.
4. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 3, wherein, in the cascaded frequency-domain pyramid branch, the enhancement performed by the frequency-domain attention unit comprises: transforming the spatial-domain features into the frequency domain with a two-dimensional fast Fourier transform, the result being a complex tensor containing amplitude and phase information; decomposing the frequency-domain features and mapping them to a query, a key and a value; introducing a learnable frequency mask matrix and computing the weighted frequency-domain attention, wherein the frequency mask matrix serves as prior knowledge and is multiplied element-wise with the attention map to explicitly enhance specific frequency components, the similarity between query and key is computed by a dot product, and the square root of the feature dimension of the key vector serves as the scaling factor; and restoring the enhanced frequency-domain features to the spatial domain with a two-dimensional inverse fast Fourier transform, which maps the weighted frequency-domain features to the output spatial-domain features.
5. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 1, wherein the C2PSADA hybrid attention module in step S4 integrates a polarized self-attention submodule and a deformable attention submodule in series; the polarized self-attention submodule performs orthogonal compression and enhancement of the features in the channel and spatial dimensions while retaining high-resolution information; and the deformable attention submodule performs sparse sampling in space and deforms its receptive field by learning offsets so as to match the shape of the fruit stalk.
6. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 5, wherein the computation performed by the deformable attention submodule comprises: passing the input feature map through a linear projection layer to generate a query, and generating sampling offsets with a lightweight offset network; performing bilinear interpolation sampling of the input features at the offset coordinates to obtain sparse features; projecting the sparse features to generate a deformable key and a deformable value; computing the dot-product similarity between the query and the key, scaled by the square root of the feature dimension of the key vector, and adding a relative position bias matrix to obtain a normalized attention weight matrix, wherein the key vector and the value vector are obtained with separate learnable linear projection weight matrices, the query vector is dense, and the relative position bias matrix supplements the spatial geometric information between the sampling points and the query points; and using the attention weight matrix to form a weighted sum of the value vectors and obtaining the final features through an output linear projection layer.
7. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 5, wherein the tiny-target detection branch in step S5 is configured by: extracting the 4x-downsampled feature stream from the P2 layer of the backbone network; passing the feature stream directly through an independent convolution module to adjust the number of channels, without any subsequent downsampling compression; and connecting the adjusted feature stream to a dedicated P2/4 segmentation head for identifying and locating tiny fruit stalk targets with an area smaller than 32x32 pixels.
8. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 1, wherein, in step S6, the loss function used to optimize the model parameters comprises a classification loss, a localization loss and a mask loss.
9. The fruit stalk instance segmentation method for a string tomato picking robot according to claim 1, wherein converting the cutting point coordinates into three-dimensional coordinates in the robot arm base coordinate system in step S7 comprises: extracting the skeleton center of the pixel-level mask output in step S6 as the pixel coordinates of the cutting point; calculating the coordinates of the cutting point in the camera frame by combining the depth information with the camera intrinsic parameters; calculating the coordinates of the cutting point in the robot arm base frame using the hand-eye matrix b T c; and planning the motion trajectory of the robot arm and driving the end effector to the calculated base-frame position to complete the cutting.
10. A fruit stalk instance segmentation system for a string tomato picking robot, comprising: a data acquisition and preprocessing module for acquiring string tomato images in a real orchard environment, performing pixel-level polygon annotation of the fruit stalks in the images, applying data augmentation preprocessing, and constructing a string tomato fruit stalk instance segmentation data set; a model construction module for constructing a YOLO11-FFTDA instance segmentation network model comprising, in sequence, a backbone network, a neck network and a segmentation head; a feature extraction and frequency-domain enhancement module for inputting the preprocessed image into the backbone network and extracting multi-scale features with an integrated convolutional feature pyramid module, wherein the convolutional feature pyramid module comprises frequency pyramid attention modules arranged in parallel that enhance the edge details of the features through a frequency-domain transform; a feature fusion and geometry-adaptive focusing module for inputting the feature maps output by the backbone network into the neck network, performing multi-scale fusion with a path aggregation network, and introducing a C2PSADA hybrid attention module at the deep feature fusion stage, which uses a deformable attention mechanism to generate adaptive sampling points that fit the irregular shape of the fruit stalks and suppress background noise; a tiny-target detection branch module for leading out of the neck network a tiny-target detection branch that is independent of deep downsampling, the branch being connected directly to a high-resolution feature layer to capture tiny fruit stalk targets; a multi-task decoupled output module for processing the fused features with a decoupled segmentation head, optimizing the model parameters based on a loss function, and outputting the class confidence, bounding box coordinates and pixel-level mask of the tomato fruit stalks; and a visual servoing and picking execution module for calculating the pixel coordinates of the optimal cutting point of the fruit stalk from the output pixel-level mask, converting the cutting point coordinates into three-dimensional coordinates in the robot arm base coordinate system by combining depth information with the hand-eye calibration matrix, planning the motion trajectory of the robot arm, and controlling the end effector to complete the cutting action.
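
The frequency-domain attention unit described in claim 4 can be illustrated with a minimal NumPy sketch. The formulas of the claim are not reproduced in this text, so every concrete choice below is an assumption made for illustration: the features are a real (C, H, W) array, each frequency bin is treated as one token whose features are the stacked real and imaginary parts, the projection matrices `w_q`, `w_k`, `w_v` are hypothetical stand-ins for learned weights, and the frequency mask multiplies the softmax attention map element-wise.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def frequency_attention(x, w_q, w_k, w_v, mask):
    """Sketch of a frequency-domain attention unit (layout is an assumption).

    x: (C, H, W) real features; w_q, w_k: (2C, d); w_v: (2C, 2C);
    mask: (H*W, H*W) frequency mask applied to the attention map.
    """
    c, h, w = x.shape
    spec = np.fft.fft2(x, axes=(-2, -1))  # complex spectrum, shape (C, H, W)
    # One token per frequency bin; real and imaginary parts stacked as features.
    tokens = np.concatenate([spec.real, spec.imag], axis=0).reshape(2 * c, h * w).T
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d_k = k.shape[-1]
    # Scaled dot-product attention, then element-wise weighting by the mask.
    attn = softmax(q @ k.T / np.sqrt(d_k)) * mask
    out = (attn @ v).T.reshape(2 * c, h, w)
    weighted = out[:c] + 1j * out[c:]  # back to a complex spectrum
    # Inverse FFT; the imaginary residue is discarded in this sketch.
    return np.fft.ifft2(weighted, axes=(-2, -1)).real

rng = np.random.default_rng(0)
c, h, w = 2, 4, 4
x = rng.standard_normal((c, h, w))
w_q, w_k, w_v = (0.1 * rng.standard_normal((2 * c, 2 * c)) for _ in range(3))
y = frequency_attention(x, w_q, w_k, w_v, np.ones((h * w, h * w)))
print(y.shape)  # (2, 4, 4)
```

An all-ones mask leaves the attention map unchanged, while an all-zeros mask suppresses every frequency component and yields a zero output; a trained mask would sit between these extremes. A real implementation would run inside the network on batched tensors and learn all weights jointly.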
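
The coordinate conversion of step S7 (claims 9 and 10) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: it assumes an undistorted pinhole camera with intrinsic matrix `K`, a depth value measured along the optical axis in meters, and a 4x4 homogeneous hand-eye matrix `T_base_cam` that maps camera-frame points into the robot arm base frame. The helper `cutting_pixel` is a hypothetical stand-in for the skeleton center: it simply picks the mask pixel closest to the mask centroid.

```python
import numpy as np

def cutting_pixel(mask):
    """Stand-in for the skeleton center: the mask pixel nearest the centroid."""
    vs, us = np.nonzero(mask)  # row (v) and column (u) indices of mask pixels
    du, dv = us - us.mean(), vs - vs.mean()
    i = int(np.argmin(du ** 2 + dv ** 2))
    return int(us[i]), int(vs[i])

def pixel_to_base(u, v, depth, K, T_base_cam):
    """Back-project pixel (u, v) at the given depth into the arm base frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole model: homogeneous camera-frame coordinates of the cutting point.
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    # Hand-eye transform from the camera frame to the base frame.
    return (T_base_cam @ p_cam)[:3]

# Hypothetical calibration values, chosen only to make the arithmetic easy.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.1, 0.0, 0.5]  # camera offset from the base, in meters

mask = np.zeros((480, 640), dtype=bool)
mask[230:251, 380] = True            # a thin vertical stalk in column 380
u, v = cutting_pixel(mask)           # (380, 240)
p_base = pixel_to_base(u, v, 0.5, K, T_base_cam)
print(np.round(p_base, 3))           # approximately [0.15, 0.0, 1.0]
```

For an eye-in-hand camera, `T_base_cam` would be the product of the current end-effector pose and the fixed camera-to-flange calibration, so it changes with every arm pose; for a fixed camera it is constant.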

Description

Fruit stalk instance segmentation method and system for a string tomato picking robot

Technical Field

The invention relates to the fields of automated agricultural picking and image recognition, and in particular to a fruit stalk instance segmentation method and system for a string tomato picking robot.

Background

In modern agricultural production, the continuous growth of the global population and the arrival of an aging society are making the shortage of agricultural labor increasingly severe. To ensure food security and improve production efficiency, the shift from passive manual agriculture to active intelligent agriculture has become an inevitable trend. Intelligent harvesting technology for labor-intensive cash crops such as tomatoes, that is, the development of agricultural robots able to operate autonomously in unstructured environments, is a research hotspot at the intersection of agricultural engineering and artificial intelligence.

For a string tomato picking robot, the success rate and working efficiency of picking are directly determined by the visual perception system. Unlike the picking of spherical fruits such as apples and citrus, picking string tomatoes requires more than identifying the fruit: the core task is to accurately locate and segment the fruit stalk connecting the fruit to the stem, so that a collision-free cutting path can be planned for the end effector of the robot arm.

Early fruit stalk localization methods relied mainly on traditional image processing algorithms such as color-space thresholding, edge detection and morphological filtering. However, a real orchard environment is highly unstructured: illumination conditions change drastically, and branches and leaves grow in disorder.
Traditional methods rely on hand-designed shallow features, have poor robustness, and struggle when the color of the fruit stalk is similar to that of the background branches and leaves or when the fruit stalk is occluded by fruits or leaves.

In recent years, with the rapid development of deep learning, object detection and instance segmentation algorithms based on convolutional neural networks (CNNs), such as Mask R-CNN, YOLACT and the YOLO series, have made remarkable progress in agricultural vision. In particular, single-stage instance segmentation models (such as YOLOv-seg) have become the preferred choice for agricultural robot vision systems because they balance inference speed and accuracy well. Nevertheless, when existing general-purpose state-of-the-art (SOTA) models are applied directly to the string tomato fruit stalk segmentation task, fundamental technical bottlenecks remain that are difficult to overcome, mainly in the following two aspects:

First, the high-frequency features of tiny objects are severely lost in deep networks. Tomato fruit stalks usually appear in the image as very elongated structures whose width often occupies only a few pixels of the whole image. Existing convolutional neural networks downsample through successive convolution and pooling operations during feature extraction, which mathematically resembles a low-pass filter. As the number of network layers increases, the model tends to retain large-area, smooth low-frequency information (e.g., leaf texture, fruit surfaces), while the high-frequency signals that represent the edges, textures and topology of small fruit stalks are gradually smoothed or even filtered out. As a result, the fruit stalk features become blurred in the deep feature maps, and missed detections are very likely.
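
The low-pass behavior of repeated downsampling described above can be made concrete with a small NumPy sketch. It is only an illustration: 2x2 average pooling is used here as an assumed stand-in for the strided convolutions and pooling layers of a real backbone, and the 32x32 image with a one-pixel-wide line is a toy example.

```python
import numpy as np

def avg_pool2(img):
    """2x2 average pooling with stride 2 (height and width must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.zeros((32, 32))
img[:, 13] = 1.0  # a one-pixel-wide vertical "stalk" on an empty background

peaks = []
x = img
for _ in range(3):  # three stride-2 stages: 32 -> 16 -> 8 -> 4 pixels
    x = avg_pool2(x)
    peaks.append(float(x.max()))

print(peaks)  # [0.5, 0.25, 0.125]
```

Each stage halves the peak response of the thin line, whereas a region wider than the pooling window would keep its full value. After three stages the line is an eighth of its original contrast, which illustrates why thin structures fade in deep, heavily downsampled feature maps.
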
Current feature pyramid network (FPN) fusion strategies focus mainly on scaling across spatial scales, lack an explicit mechanism for retaining frequency-domain information, and therefore cannot fundamentally solve the attenuation of high-frequency features.

Second, robustness to noise interference and occlusion in complex backgrounds is insufficient. In naturally grown tomato plants, the background is filled with petioles, main stems and vines whose color and texture are highly similar to those of the target fruit stalks. Meanwhile, fruit stalks often bend irregularly and are easily occluded by fruits, leaves or other branches. Conventional convolution kernels have a fixed geometry (e.g., a 3x3 square), so their receptive field is limited and they have difficulty adaptively fitting the non-rigid, elongated and irregular geometry of the fruit stalks. Existing attention mechanisms (such as SE, CBAM, or the global self-attention in ViT) typically perform dense computation over a regular grid. When such mechanisms are used to handle occlusion, attention is easily dispersed onto occluding objects or background noise, the foreground and background cannot be effectively distinguished, the edges of the segmentation mask are rough, and even a large number of false detections