CN-116796170-B - Data processing method, electronic device and medium
Abstract
The invention provides a data processing method, an electronic device, and a medium, and relates to technical fields such as big data and deep learning. A specific implementation comprises: acquiring a plurality of object data; in response to determining that the number of the plurality of object data is greater than a preset number threshold, determining, for candidate object data among the plurality of object data, neighboring object data associated with the candidate object data according to a feature vector of the candidate object data and a first distance threshold; in response to the neighboring object data existing in the plurality of object data, generating first candidate characterization data according to the candidate object data and the neighboring object data; processing the plurality of object data based on the candidate object data, the neighboring object data, and the first candidate characterization data to obtain a first characterization data set; and, in response to determining that the number of first characterization data contained in the first characterization data set is less than or equal to the preset number threshold, determining the first characterization data set as a first sample data set.
Inventors
- WANG MIAOJUN
- JIAO MENG
- XIE DONG
- LIU XIAOJIA
Assignees
- 湖北星纪魅族科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20230619
Claims (16)
- 1. A data processing method, comprising: operation S210, acquiring a plurality of object data, wherein types of the object data include a text type, an image type, a video type, and an audio type; operation S220, determining that the number of the plurality of object data is greater than a preset number threshold, and determining, for any one candidate object data of the plurality of object data, neighboring object data associated with the candidate object data from the plurality of object data according to a feature vector of the candidate object data and a first distance threshold; operation S230, in response to the neighboring object data existing in the plurality of object data, generating first candidate characterization data according to the candidate object data and the neighboring object data, wherein the first candidate characterization data is used for replacing the candidate object data and the neighboring object data; operation S240, processing the plurality of object data based on the candidate object data, the neighboring object data, and the first candidate characterization data to obtain a first characterization data set, including deleting the candidate object data and the neighboring object data from the plurality of object data to obtain a candidate object data set, and determining the first characterization data set based on the candidate object data set and the first candidate characterization data; and operation S250, determining that the number of first characterization data contained in the first characterization data set is less than or equal to the preset number threshold, and determining the first characterization data set as a first sample data set; and, if the number of first characterization data contained in the first characterization data set is greater than the preset number threshold, determining another candidate object data from the plurality of object data, updating the candidate object data with the other candidate object data, and repeating operations S220 to S250 until the number of first characterization data contained in the first characterization data set is less than or equal to the preset number threshold.
- 2. The method of claim 1, wherein the first characterization data set includes a plurality of first characterization data, and wherein, in response to detecting that the candidate object data set does not include object data and determining that the number of the plurality of first characterization data is greater than the preset number threshold, the method further comprises: determining a point density for each of the first characterization data, including determining, for each of the first characterization data, a number of neighboring characterization data associated with the first characterization data from the plurality of first characterization data based on a second distance threshold and a feature vector of the first characterization data; determining the first characterization data with the highest point density among the plurality of first characterization data as candidate first characterization data; determining a number of second candidate characterization data associated with the candidate first characterization data based on the point density of the candidate first characterization data; extracting characterization data from the plurality of first characterization data based on the number of second candidate characterization data to obtain a second characterization data set; and determining that the number of second characterization data contained in the second characterization data set is less than or equal to the preset number threshold, and determining the second characterization data set as a second sample data set.
- 3. The method of claim 2, wherein the extracting characterization data from the plurality of first characterization data based on the number of second candidate characterization data to obtain a second characterization data set comprises: determining neighboring characterization data associated with the candidate first characterization data from the plurality of first characterization data according to the feature vector of the candidate first characterization data and the second distance threshold; clustering the candidate first characterization data and the neighboring characterization data based on the number of second candidate characterization data to obtain that number of second candidate characterization data; deleting the candidate first characterization data and the neighboring characterization data from the plurality of first characterization data to obtain a candidate characterization data set; and determining the second characterization data set according to the candidate characterization data set and the number of second candidate characterization data.
- 4. The method of claim 3, wherein, in response to determining that the number of second characterization data contained in the second characterization data set is greater than the preset number threshold, the method further comprises: determining affected first characterization data, among the plurality of first characterization data, that is associated with the candidate first characterization data; determining a point density of the affected first characterization data and point densities of the number of second candidate characterization data, respectively; determining, as candidate second characterization data, the second characterization data having the highest point density in the second characterization data set, based on the point density of the affected first characterization data, the point densities of the number of second candidate characterization data, and the point densities of the characterization data in the candidate characterization data set other than the affected first characterization data; and updating the point density of the candidate first characterization data based on the point density of the candidate second characterization data, and repeating the operation of determining whether the number of second characterization data contained in the second characterization data set is less than or equal to the preset number threshold.
- 5. The method of claim 3, wherein the determining the first characterization data with the highest point density among the plurality of first characterization data as candidate first characterization data comprises: sorting the plurality of first characterization data according to their respective point densities to obtain a characterization data sequence; splitting the characterization data sequence into a plurality of first characterization data subsequences based on computing resources, wherein each first characterization data subsequence comprises a preset number of first characterization data; determining, for each first characterization data subsequence, the first characterization data with the highest point density among the preset number of first characterization data as first subsequence characterization data; and determining the first subsequence characterization data with the highest point density among the plurality of first subsequence characterization data as the candidate first characterization data.
- 6. The method of claim 5, wherein, in response to determining that the number of second characterization data contained in the second characterization data set is greater than the preset number threshold, the method further comprises: determining affected first characterization data, among the plurality of first characterization data, that is associated with the candidate first characterization data; storing the affected first characterization data and the number of second candidate characterization data into a characterization data sequence to be sorted; for each first characterization data subsequence, deleting from the first characterization data subsequence the characterization data related to the candidate first characterization data, the neighboring characterization data, and the affected first characterization data, to obtain a second characterization data subsequence; determining, respectively, the point density of the affected first characterization data and the point densities of the number of second candidate characterization data in the characterization data sequence to be sorted; distributing the affected first characterization data and the number of second candidate characterization data in the characterization data sequence to be sorted among the plurality of second characterization data subsequences, according to the point density of the affected first characterization data, the point densities of the number of second candidate characterization data, and the point densities of the characterization data in the plurality of second characterization data subsequences, to obtain a plurality of third characterization data subsequences; determining, for each third characterization data subsequence, the characterization data with the highest point density in the third characterization data subsequence as second subsequence characterization data; determining the second subsequence characterization data with the highest point density among the plurality of second subsequence characterization data as candidate second characterization data; and updating the point density of the candidate first characterization data based on the point density of the candidate second characterization data, and repeating the operation of determining whether the number of second characterization data contained in the second characterization data set is less than or equal to the preset number threshold.
- 7. The method of claim 3, wherein each of the object data includes an object data identification, and wherein the extracting characterization data from the plurality of first characterization data based on the number of second candidate characterization data to obtain a second characterization data set further comprises: determining, for each of the second candidate characterization data, the candidate first characterization data and the neighboring characterization data associated with the second candidate characterization data; determining the object data identification of the candidate first characterization data and the object data identification of the neighboring characterization data according to the object data identifications corresponding to the object data; determining an object data identification of the second candidate characterization data based on the object data identification of the candidate first characterization data and the object data identification of the neighboring characterization data; and associating the object data identification of the second candidate characterization data with the second candidate characterization data.
- 8. The method of claim 7, wherein the determining the object data identification of the second candidate characterization data based on the object data identification of the candidate first characterization data and the object data identification of the neighboring characterization data comprises: determining, among the object data identifications of the candidate first characterization data and of the neighboring characterization data, the number of object data identifications belonging to the same data identification category; and determining the object data identification with the largest number as the object data identification of the second candidate characterization data.
- 9. The method of any of claims 2 to 8, wherein the number of second candidate characterization data is in positive correlation with the point density of the candidate first characterization data.
- 10. The method of any one of claims 1 to 8, further comprising: in response to determining that there is no neighboring object data associated with the candidate object data in the plurality of object data, determining the candidate object data as the first candidate characterization data.
- 11. The method of any one of claims 1 to 8, wherein each of the object data includes an object data identification, and wherein the generating first candidate characterization data according to the candidate object data and the neighboring object data further comprises: determining the object data identification of the candidate object data and the object data identification of the neighboring object data according to the object data identifications corresponding to the object data; determining an object data identification of the first candidate characterization data based on the object data identification of the candidate object data and the object data identification of the neighboring object data; and associating the object data identification of the first candidate characterization data with the first candidate characterization data.
- 12. The method of claim 11, wherein the determining the object data identification of the first candidate characterization data based on the object data identification of the candidate object data and the object data identification of the neighboring object data comprises: determining, among the object data identifications of the candidate object data and of the neighboring object data, the number of object data identifications belonging to the same data identification category; and determining the object data identification with the largest number as the object data identification of the first candidate characterization data.
- 13. A training method for a target object detection model, comprising: acquiring a sample data set; and iteratively training a deep learning model using the sample data set until an output result of the deep learning model meets an iteration stop condition or the cumulative number of training iterations reaches a preset iteration count threshold, so as to obtain the target object detection model; wherein the sample data set is obtained using the method of any one of claims 1 to 12.
- 14. A target object detection method, comprising: inputting data to be processed into a target object detection model to obtain a detection result for the data to be processed; wherein the target object detection model is trained using the method of claim 13.
- 15. An electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 14.
- 16. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 14.
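The identification rule in claims 8 and 12 reads as a majority vote over the identifications of the merged data. The following Python sketch illustrates that reading only; the function name `majority_identification` is hypothetical, and the tie-breaking behavior (the candidate's own identification wins a tie) is an assumption, since the claims do not say how ties are resolved.

```python
from collections import Counter
from typing import Hashable, Iterable


def majority_identification(
    candidate_id: Hashable, neighbor_ids: Iterable[Hashable]
) -> Hashable:
    """Return the identification that occurs most often among the candidate
    and its neighbors (hypothetical helper, not taken from the patent).

    Assumption: on a tie, the identification seen first wins. The candidate
    is listed first, so its identification is preferred.
    """
    counts = Counter([candidate_id, *neighbor_ids])
    # most_common orders equal counts by first occurrence.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_identification("cat", ["dog", "dog"]))
```

With no neighbors, the function simply returns the candidate's own identification, which matches the case in claim 10 where the candidate object data itself becomes the first candidate characterization data.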
Description
Data processing method, electronic device and medium
Technical Field
The present invention relates to the field of data processing technologies, in particular to technical fields such as big data, artificial intelligence, and deep learning, and more specifically to a data processing method and apparatus, a training method and apparatus for a target object detection model, a target object detection method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
In fields such as deep learning and machine learning, model training generally requires a large amount of sample data. However, when the amount of sample data is too large, training efficiency suffers, and the excess sample data may also cause local noise aggregation, which reduces the accuracy and precision of the model.
Disclosure of Invention
The invention provides a data processing method and apparatus, a training method and apparatus for a target object detection model, a target object detection method and apparatus, an electronic device, a storage medium, and a computer program product.
According to one aspect of the invention, a data processing method is provided. The method comprises: acquiring a plurality of object data; in response to determining that the number of the plurality of object data is greater than a preset number threshold, determining, for any one candidate object data of the plurality of object data, neighboring object data associated with the candidate object data from the plurality of object data according to a feature vector of the candidate object data and a first distance threshold; in response to the neighboring object data existing in the plurality of object data, generating first candidate characterization data according to the candidate object data and the neighboring object data, wherein the first candidate characterization data is used for replacing the candidate object data and the neighboring object data; processing the plurality of object data based on the candidate object data, the neighboring object data, and the first candidate characterization data to obtain a first characterization data set; and, in response to determining that the number of first characterization data contained in the first characterization data set is less than or equal to the preset number threshold, determining the first characterization data set as a first sample data set. According to an embodiment of the invention, processing the plurality of object data based on the candidate object data, the neighboring object data, and the first candidate characterization data to obtain the first characterization data set comprises: deleting the candidate object data and the neighboring object data from the plurality of object data to obtain a candidate object data set; and determining the first characterization data set according to the candidate object data set and the first candidate characterization data.
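The first-stage reduction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: feature vectors are plain tuples of floats, distance is Euclidean, the candidate is simply the next remaining item, and the first candidate characterization data is taken to be the mean of the candidate and its neighbors. The text here does not specify how the candidate is chosen or how the characterization data is generated, and the names `mean_vector` and `reduce_by_neighbor_merging` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean (assumed way of building characterization data)."""
    count = len(vectors)
    return tuple(sum(column) / count for column in zip(*vectors))


def reduce_by_neighbor_merging(
    objects: Sequence[Sequence[float]],
    count_threshold: int,
    first_distance_threshold: float,
) -> List[Vector]:
    """Replace a candidate and its neighbors by one characterization vector
    until the total count drops to the threshold or no object data remains."""
    remaining: List[Vector] = [tuple(v) for v in objects]  # candidate object data set
    characterizations: List[Vector] = []  # first candidate characterization data

    while remaining and len(remaining) + len(characterizations) > count_threshold:
        candidate = remaining.pop(0)  # assumption: take the next remaining item
        neighbors, others = [], []
        for vector in remaining:
            if math.dist(candidate, vector) <= first_distance_threshold:
                neighbors.append(vector)
            else:
                others.append(vector)
        # With no neighbors, the candidate itself becomes the characterization data.
        characterizations.append(mean_vector([candidate, *neighbors]))
        remaining = others  # candidate and neighbors are deleted

    # First characterization data set: merged representatives plus untouched data.
    return characterizations + remaining


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 10.0)]
    print(reduce_by_neighbor_merging(data, 3, 0.5))
```

If the loop exhausts the object data while the result is still larger than the threshold (for example, when no two items are within the first distance threshold), the returned set would need the further density-based reduction that the embodiments describe for that case.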
According to an embodiment of the invention, the first characterization data set comprises a plurality of first characterization data. In response to detecting that the candidate object data set does not comprise object data and determining that the number of the plurality of first characterization data is greater than the preset number threshold, the data processing method further comprises: determining the point density of each first characterization data; determining the first characterization data with the highest point density among the plurality of first characterization data as candidate first characterization data; determining the number of second candidate characterization data associated with the candidate first characterization data based on the point density of the candidate first characterization data; extracting characterization data from the plurality of first characterization data based on the number of second candidate characterization data to obtain a second characterization data set; and, in response to determining that the number of second characterization data contained in the second characterization data set is less than or equal to the preset number threshold, determining the second characterization data set as a second sample data set.
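The density step can be illustrated with a short Python sketch. It follows the claims in taking the point density of a first characterization data to be the number of other characterization data within the second distance threshold. Euclidean distance is an assumption, and the rule mapping density to the number of second candidate characterization data (`second_candidate_count` with a `keep_ratio` parameter) is purely hypothetical: the claims only require that the number be positively correlated with the point density.

```python
import math
from typing import List, Sequence, Tuple


def point_densities(
    points: Sequence[Sequence[float]], second_distance_threshold: float
) -> List[int]:
    """For each point, count the other points within the distance threshold."""
    densities = []
    for i, p in enumerate(points):
        count = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= second_distance_threshold
        )
        densities.append(count)
    return densities


def pick_densest(
    points: Sequence[Sequence[float]], second_distance_threshold: float
) -> Tuple[int, int]:
    """Return (index, density) of the point with the highest point density.
    Assumption: ties go to the earliest point."""
    densities = point_densities(points, second_distance_threshold)
    index = max(range(len(points)), key=densities.__getitem__)
    return index, densities[index]


def second_candidate_count(density: int, keep_ratio: float = 0.5) -> int:
    """Hypothetical monotone rule: denser regions keep more representatives."""
    return max(1, math.ceil(density * keep_ratio))


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (9.0, 9.0)]
    idx, dens = pick_densest(pts, 0.25)
    print(idx, dens, second_candidate_count(dens))
```

The brute-force density computation is quadratic in the number of points; it is used here only to keep the sketch self-contained.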
According to an embodiment of the invention, extracting characterization data from the plurality of first characterization data based on the number of second candidate characterization data to obtain the second characterization data set comprises: determining neighboring characterization data associated with the candidate first characterization data from the plurality of first characterization data according to the feature vector of the candidate first characterization data and a second distance threshold; clustering the candidate first characterization data and the neighboring characterization data based on the number of second candidate characterization data to obtain that number of second candidate characterization data; and deleting the candidate first characterization data and the neighboring characterization data in the plurality of first characterization