US-20260126995-A1 - TOOL FOR ACCURATELY DETECTING THE USE OF THIRD-PARTY LIBRARIES IN APPLICATIONS
Abstract
Disclosed herein is a method performed by one or more computing devices to detect a use of third-party software libraries in an application. The method includes performing static and dynamic analysis of the application to detect one or more signals, generating a tree data structure representing hierarchical component names associated with the one or more signals, wherein each node of the tree data structure represents a path/sub-path of a hierarchical component name, annotating each of one or more nodes of the tree data structure to indicate signals associated with the path/sub-path represented by the node, determining a confidence score for each of the one or more nodes based on the signals, identifying nodes of the tree data structure having a confidence score that meets a threshold confidence score, and reporting one or more of the paths or sub-paths represented by the identified nodes as being associated with third-party software libraries.
Inventors
- Álvaro FEAL
- Narseo VALLINA-RODRIGUEZ
- Joel REARDON
- Serge EGELMAN
- Robert Richter
- Nathaniel Good
Assignees
- AppCensus, Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2023-10-17
Claims (20)
- 1. A method performed by one or more computing devices to accurately detect a use of third-party software libraries in an application, the method comprising: performing static analysis of the application and dynamic analysis of the application to detect one or more signals indicative of the use of third-party libraries in the application; generating a tree data structure representing hierarchical component names associated with the one or more signals, wherein each level of the tree data structure represents a level of a component name hierarchy, wherein each node of the tree data structure represents a path or sub-path of a hierarchical component name; annotating each of one or more nodes of the tree data structure to indicate signals associated with the path or sub-path represented by the node; determining a confidence score for each of the one or more nodes based on the signals associated with the path or sub-path represented by the node; identifying nodes of the tree data structure having a confidence score that meets a threshold confidence score; and reporting one or more of the paths or sub-paths represented by the identified nodes as being associated with third-party software libraries.
- 2. The method of claim 1, wherein the static analysis detects static analysis signals associated with the application, wherein the static analysis signals include one or more of: a third-party class name signal, a class name cross reference signal, a uniform resource locator signal, a manifest file signal, and a configuration file signal.
- 3. The method of claim 2, wherein the dynamic analysis detects dynamic analysis signals associated with runtime behavior of the application, wherein the dynamic analysis signals include one or more of: a network communication signal and a class loaded during runtime signal.
- 4. The method of claim 3, wherein each of the one or more signals is assigned a weight representing a confidence level provided by the signal, wherein a confidence score for a node of the tree data structure is calculated based on summing weights of respective signals associated with the path or sub-path represented by the node.
- 5. The method of claim 4, wherein the class name cross reference signal, the network communication signal, and/or the class loaded during runtime signal are assigned higher weights than the class name signal, the URL signal, and the manifest file signal.
- 6. The method of claim 1, wherein the one or more paths or sub-paths that are reported are paths or sub-paths represented by those of the identified nodes that do not have any child nodes having a confidence score that meets the threshold confidence score.
- 7. The method of claim 1, further comprising: generating a fingerprint of non-obfuscated code that has been determined to be included in a third-party software library, wherein the fingerprint of the non-obfuscated code is generated based on code features of the non-obfuscated code that are not expected to change with obfuscation; storing the fingerprint of the non-obfuscated code and the non-obfuscated code itself in a data storage; determining whether obfuscated code included in the application matches the fingerprint of the non-obfuscated code; and responsive to determining that the obfuscated code matches the fingerprint of the non-obfuscated code, deobfuscating the obfuscated code using the non-obfuscated code.
- 8. The method of claim 7, wherein the code features of the non-obfuscated code that are not expected to change with obfuscation include one or more of: function signatures and string constants appearing in code.
- 9. The method of claim 1, further comprising: determining a level of similarity between a string associated with a signal and a hierarchical component name; and associating the signal with the hierarchical component name in response to a determination that the level of similarity between the string associated with the signal and the hierarchical component name meets a threshold similarity level.
- 10. A set of one or more non-transitory machine-readable storage media storing instructions which, when executed by one or more processors of one or more computing devices, causes the one or more computing devices to perform operations for accurately detecting a use of third-party software libraries in an application, the operations comprising: performing static analysis of the application and dynamic analysis of the application to detect one or more signals indicative of the use of third-party libraries in the application; generating a tree data structure representing hierarchical component names associated with the one or more signals, wherein each level of the tree data structure represents a level of a component name hierarchy, wherein each node of the tree data structure represents a path or sub-path of a hierarchical component name; annotating each of one or more nodes of the tree data structure to indicate signals associated with the path or sub-path represented by the node; determining a confidence score for each of the one or more nodes based on the signals associated with the path or sub-path represented by the node; identifying nodes of the tree data structure having a confidence score that meets a threshold confidence score; and reporting one or more of the paths or sub-paths represented by the identified nodes as being associated with third-party software libraries.
- 11. The set of one or more non-transitory machine-readable storage media of claim 10, wherein the static analysis detects static analysis signals associated with the application, wherein the static analysis signals include one or more of: a third-party class name signal, a class name cross reference signal, a uniform resource locator signal, a manifest file signal, and a configuration file signal.
- 12. The set of one or more non-transitory machine-readable storage media of claim 11, wherein the dynamic analysis detects dynamic analysis signals associated with runtime behavior of the application, wherein the dynamic analysis signals include one or more of: a network communication signal and a class loaded during runtime signal.
- 13. The set of one or more non-transitory machine-readable storage media of claim 12, wherein each of the one or more signals is assigned a weight representing a confidence level provided by the signal, wherein a confidence score for a node of the tree data structure is calculated based on summing weights of respective signals associated with the path or sub-path represented by the node.
- 14. The set of one or more non-transitory machine-readable storage media of claim 13, wherein the class name cross reference signal, the network communication signal, and/or the class loaded during runtime signal are assigned higher weights than the class name signal, the URL signal, and the manifest file signal.
- 15. The set of one or more non-transitory machine-readable storage media of claim 11, wherein the one or more paths or sub-paths that are reported are paths or sub-paths represented by those of the identified nodes that do not have any child nodes having a confidence score that meets the threshold confidence score.
- 16. The set of one or more non-transitory machine-readable storage media of claim 11, wherein the operations further comprise: generating a fingerprint of non-obfuscated code that has been determined to be included in a third-party software library, wherein the fingerprint of the non-obfuscated code is generated based on code features of the non-obfuscated code that are not expected to change with obfuscation; storing the fingerprint of the non-obfuscated code and the non-obfuscated code itself in a data storage; determining whether obfuscated code included in the application matches the fingerprint of the non-obfuscated code; and responsive to determining that the obfuscated code matches the fingerprint of the non-obfuscated code, deobfuscating the obfuscated code using the non-obfuscated code.
- 17. The set of one or more non-transitory machine-readable storage media of claim 16, wherein the code features of the non-obfuscated code that are not expected to change with obfuscation include one or more of: function signatures and string constants appearing in code.
- 18. The set of one or more non-transitory machine-readable storage media of claim 11, wherein the operations further comprise: determining a level of similarity between a string associated with a signal and a class hierarchical component name; and associating the signal with the class hierarchical component name in response to a determination that the level of similarity between the string associated with the signal and the class hierarchical component name meets a threshold similarity level.
- 19. A computing device, comprising: one or more processors; and a set of one or more non-transitory machine-readable storage media storing instructions which, when executed by the one or more processors, causes the computing device to: perform static analysis of an application and dynamic analysis of the application to detect one or more signals indicative of a use of third-party libraries in the application; generate a tree data structure representing hierarchical component names associated with the one or more signals, wherein each level of the tree data structure represents a level of a component name hierarchy, wherein each node of the tree data structure represents a path or sub-path of a hierarchical component name; annotate each of one or more nodes of the tree data structure to indicate signals associated with the path or sub-path represented by the node; determine a confidence score for each of the one or more nodes based on the signals associated with the path or sub-path represented by the node; identify nodes of the tree data structure having a confidence score that meets a threshold confidence score; and report one or more of the paths or sub-paths represented by the identified nodes as being associated with third-party software libraries.
- 20. The computing device of claim 19, wherein the static analysis detects static analysis signals associated with the application and the dynamic analysis detects dynamic analysis signals associated with runtime behavior of the application, wherein the static analysis signals include one or more of: a third-party class name signal, a class name cross reference signal, a uniform resource locator (URL) signal, a manifest file signal, and a configuration file signal, and wherein the dynamic analysis signals include one or more of: a network communication signal and a class loaded during runtime signal.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 63/379,877 filed Oct. 17, 2022, which is hereby incorporated by reference.
TECHNICAL FIELD
Embodiments of the invention relate to the field of automated software detection, and more specifically, to a tool for detecting the use of third-party software libraries in applications.
BACKGROUND
Third-party software libraries (also referred to as third-party libraries) such as software development kits (SDKs) are fundamental to the development of modern applications. Third-party libraries provide application developers with functionality for performing a variety of tasks. For example, third-party libraries may provide functionality related to cryptography, graphics, anti-fraud, cross-platform development, and/or application integration with online platforms. The use of third-party libraries is considered good software engineering practice because it facilitates code reuse. Popular third-party libraries also tend to be more extensively tested and thus more reliable.
Despite the convenience provided by third-party libraries, the use of third-party libraries in mobile applications can have negative security and/or privacy consequences. From a security perspective, application developers may not diligently update the third-party libraries included in their software, thereby exposing users of those applications to unpatched vulnerabilities. From a privacy perspective, third-party libraries may collect personal or sensitive data for secondary purposes such as advertising or user tracking. In the case of the Android operating system, third-party libraries execute with the same user ID and privileges as the host application, so they automatically gain access to the same set of permissions that the user granted to the host application.
In some cases, a third-party library provider may even require (or recommend) in its documentation that application developers expand the set of permissions requested by applications to enable the features of the third-party library. This phenomenon may lead to over-privileging, where certain permissions are not necessary for core application functionality but instead facilitate secondary usage by a third-party SDK provider. These privacy issues are aggravated by the lack of mechanisms in mobile operating systems to discern whether permissions are being requested by the application for legitimate reasons, to enable the functionality of the application, or by third-party libraries for secondary purposes.
The ability to accurately detect third-party libraries (e.g., software development kits [SDKs]) and to characterize their behavior is vital for analyzing the security and/or privacy risks of software and its supply chain. This is especially true in the case of mobile applications (also referred to as “apps”) due to the increasing presence of potentially intrusive third-party libraries in mobile applications that are used for analytics and advertising purposes.
Existing third-party library detection tools (e.g., Exodus, LibRadar, and LibScout) suffer from coverage and accuracy limitations due to their reliance on (1) a database of pre-defined code fingerprints of third-party libraries, and (2) static analysis methods that inspect an application's code to determine whether the mobile application's code matches any of the pre-defined code fingerprints. Therefore, to be effective, current static analysis methods require keeping the database of code fingerprints updated, which can be challenging, especially since new third-party library versions are constantly being released or updated, and third-party libraries are merged as a result of company acquisitions or mergers.
Moreover, a static analysis approach to detecting third-party libraries is becoming increasingly ineffective as mobile application developers rely on code obfuscation tools to obfuscate their code and protect their intellectual property. Also, a static analysis approach may produce false negatives (e.g., if a third-party library is dynamically loaded at runtime) and false positives (e.g., if legacy or dead code associated with an SDK is conspicuously present but never executes).
Some existing third-party library detection tools perform dynamic analysis to detect third-party libraries. For example, a dynamic analysis approach may analyze the traffic generated by a mobile application while the mobile application is being executed to detect third-party host names contacted by the mobile application, using those host names as a proxy to infer the presence of third-party libraries in the mobile application. However, the fact that a mobile application contacts a third-party host name does not necessarily mean that the application includes a third-party library associated with that host name. For instance, some third-party libraries allow mobile application developers to integrate multiple ad networks and analytics services using t