US-20260126976-A1 - TRANSPILER TO EXTRACT AND USE INTERMEDIATE REPRESENTATIONS OF A CODE BASE

Abstract

Provided is a process including: obtaining, with a computer system, access to a code base; decomposing, with the computer system, the code base into parts; classifying, with the computer system, the parts according to content type; selecting, with the computer system, processing templates based on the content types, with at least some different content types having different selected processing templates; and generating natural language documentation for the parts, with one or more generative language models, using the processing templates selected for the parts.
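
As a minimal sketch of the classify-then-select step recited above, the following Python fragment maps each part to a content type and picks a processing template for it. The content types, the extension-based classifier, the template wording, and the `model` callable are all assumptions made for illustration; the disclosure does not specify them here, and a real embodiment might classify parts in other ways.

```python
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical processing templates keyed by content type (not from the source).
TEMPLATES = {
    "source_code": "Explain what this code does and how its parts interact:\n{content}",
    "configuration": "Describe each setting in this configuration and its effect:\n{content}",
    "other": "Summarize this file:\n{content}",
}

# Hypothetical classifier table: file extension -> content type.
EXTENSION_TO_TYPE = {
    ".py": "source_code",
    ".go": "source_code",
    ".yaml": "configuration",
    ".json": "configuration",
}


def classify(path: str) -> str:
    """Classify a part by its file extension, falling back to 'other'."""
    return EXTENSION_TO_TYPE.get(PurePosixPath(path).suffix.lower(), "other")


def document_parts(parts: dict[str, str], model: Callable[[str], str]) -> dict[str, str]:
    """Generate documentation per part using the template selected for its type.

    `model` stands in for a generative language model: prompt in, text out.
    """
    docs = {}
    for path, content in parts.items():
        template = TEMPLATES[classify(path)]
        docs[path] = model(template.format(content=content))
    return docs
```

Passing the model as a plain callable keeps the sketch independent of any particular language-model API.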

Inventors

Daniel Westbrook Hensley
Adam Kevin Tilton

Assignees

Driver AI, Inc.

Dates

Publication Date: 20260507
Application Date: 20251229

Claims (20)

1. A method, comprising: obtaining, with a computer system, access to a code base; decomposing, with the computer system, the code base into parts; receiving, with the computer system, a processing template, wherein the processing template comprises a natural-language prompt; and generating natural-language documentation for the code base, wherein: the natural-language documentation is generated by a generative language model, the natural-language documentation is based on the prompt of the processing template, and the natural-language documentation is hierarchically structured based on the parts.
2. The method of claim 1, wherein the hierarchical structure of the natural-language documentation is based on logic of the template.
3. The method of claim 1, further comprising: receiving, with the computer system, a second processing template, wherein the second processing template comprises a second natural-language prompt, and wherein the natural-language documentation is based on the second natural-language prompt.
4. The method of claim 1, wherein the processing template specifies an output of the generative language model, and wherein the natural-language documentation comprises the output.
5. The method of claim 1, wherein the processing template comprises instructions for processing the code base, and wherein the generative language model generates the natural-language documentation based on the instructions.
6. The method of claim 1, wherein generating the natural-language documentation comprises generating a structured outline defining sections and subsections of the natural-language documentation.
7. The method of claim 6, wherein the structured outline is generated separately from content of the natural-language documentation.
8. The method of claim 7, wherein the content of the natural-language documentation is generated iteratively, by the generative language model, for each of the sections and subsections of the natural-language documentation.
9. The method of claim 1, further comprising: constructing a directed acyclic graph (DAG), wherein nodes of the DAG correspond to respective portions of the code base; and for at least some leaf nodes of the DAG corresponding to files in a structured language among files of the code base, decomposing the respective files with a parser to form an abstract syntax tree (AST), parse tree, or symbol table.
10. The method of claim 1, further comprising: indexing the natural-language documentation; constructing a directed acyclic graph (DAG), wherein nodes of the DAG correspond to respective portions of the code base; receiving a query; determining which nodes of the DAG are responsive to the query; and returning an indication of a portion of the code base responsive to the query.
11. The method of claim 10, further comprising: indexing the natural-language documentation, wherein nodes of the DAG responsive to the query are further determined based on the indices of the natural-language documentation.
12. The method of claim 1, further comprising: receiving a .json file comprising a hierarchical taxonomy; and classifying the parts based on the hierarchical taxonomy of the .json file.
13. The method of claim 12, wherein the hierarchical structure of the natural-language documentation is based on the classification of the parts.
14. The method of claim 1, wherein the processing template comprises means for specifying processing of at least one portion of the natural-language documentation.
15. The method of claim 1, further comprising: obtaining a diff between a previous version of the code base and an updated version of the code base; determining which portions of a hierarchical tree of intermediate representations of the code base are affected by the diff; automatically updating the portions of the hierarchical tree of intermediate representations determined to be affected by the diff; and generating updated natural-language documentation based on the updated portions of the hierarchical tree of intermediate representations.
16. The method of claim 1, wherein: generating the natural-language documentation comprises iteratively refining intermediate representations to produce higher-level abstractions of the natural-language documentation relative to abstractions of the natural-language documentation prior to the refining; and the parts are classified based at least in part on metadata extracted from the code base, including file types, versioning information, or dependency relationships.
17. The method of claim 1, wherein: the natural-language documentation comprises at least four of the following: technical documentation, architecture descriptions, getting-started guides, user guides, product briefs, application-landscape summaries, block diagrams, audio summaries, video summaries, tutorials, application notes, code base summaries, dependency graphs, compliance documentation, security analysis reports, memory safety reports, performance analysis reports, testing documentation, application-program interface descriptions, internal documentation, system architecture diagrams, executive summaries, user persona definitions, user stories, and user journeys.
18. The method of claim 1, wherein decomposing the code base into the parts comprises parsing abstract syntax trees (ASTs) to determine logical boundaries of the code base.
19. The method of claim 1, wherein generating the natural-language documentation comprises performing multi-pass processing of intermediate representations, the multi-pass processing comprising: generating initial intermediate representations for respective parts of the code base, producing an initial intermediate representation at a first level of granularity; refining the initial intermediate representations through successive processing passes with the generative language model, each pass incorporating additional contextual information; synthesizing the refined intermediate representations into higher-level abstractions relative to abstractions prior to synthesizing, including aggregated summaries for the code base; and using the higher-level abstractions to produce the natural-language documentation.
20. A tangible, non-transitory, machine-readable medium storing instructions that, when executed, effectuate operations comprising: obtaining, with a computer system, access to a code base; decomposing, with the computer system, the code base into parts; receiving, with the computer system, a processing template, wherein the processing template comprises a natural-language prompt; and generating natural-language documentation for the code base, wherein: the natural-language documentation is generated by a generative language model, the natural-language documentation is based on the prompt of the processing template, and the natural-language documentation is hierarchically structured based on the parts.
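
By way of illustration only, the AST-based decomposition recited in claim 18 might be sketched as follows for a code base written in Python, using the standard-library `ast` module. The function name `decompose`, the choice of top-level functions and classes as the "logical boundaries," and the shape of the returned records are assumptions of this sketch, not features recited in the claims.

```python
import ast


def decompose(source: str) -> list[dict]:
    """Split Python source into parts at top-level function and class boundaries."""
    tree = ast.parse(source)
    parts = []
    for node in tree.body:
        # Only top-level definitions are treated as parts in this sketch.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            parts.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,      # 1-indexed line of the def/class keyword
                "end_line": node.end_lineno,    # last line of the definition
                "text": ast.get_source_segment(source, node),
            })
    return parts
```

Each returned record carries the line span of its part, so generated documentation can later be associated back to the corresponding portion of the source file.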

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. non-provisional patent application Ser. No. 19/044,565, filed Feb. 3, 2025, titled TRANSPILER TO EXTRACT AND USE INTERMEDIATE REPRESENTATIONS OF A CODE BASE, issued as U.S. Pat. No. 12,511,105, which claims the benefit of U.S. provisional patent application 63/549,385, filed Feb. 2, 2024, titled TRANSPILER TO EXTRACT AND USE INTERMEDIATE REPRESENTATIONS OF A CODE BASE. The entire content of each of the afore-listed patent filings is hereby incorporated by reference for all purposes.

BACKGROUND

1. Field

The present disclosure relates generally to artificial intelligence and, more specifically, to transpilers to extract and use intermediate representations of a code base.

2. Description of the Related Art

In a variety of situations, it can be useful to generate natural language text about another document or corpus of documents. In some cases, that corpus or document may be in a structured language, like portions or all of a code base, or in some cases, that document may be unstructured natural language text, such as novels, research papers, litigation discovery productions or responses, screenplays, transcripts, plays, email repositories, and the like. The generated text about that source material may take a variety of forms, including explanations, summaries, expositions, timelines, technical documentation, and many other examples described in the application that follows.

SUMMARY

The following is a non-exhaustive listing of some aspects of the present techniques. These and other aspects are described in the following disclosure.

Some aspects include a process including: obtaining, with a computer system, access to a code base; decomposing, with the computer system, the code base into parts; classifying, with the computer system, the parts according to content type; selecting, with the computer system, processing templates based on the content types, with at least some different content types having different selected processing templates; and generating natural language documentation for the parts, with one or more generative language models, using the processing templates selected for the parts.

Some aspects include a process including: obtaining, with a computer system, access to a code base; decomposing, with the computer system, the code base into parts; generating, with the computer system, documentation for the parts with a language model; associating, with the computer system, the documentation with the parts; indexing, with the computer system, the documentation; obtaining, with the computer system, a query searching for content in the code base; searching, with the computer system, using the index, the code base based on the generated documentation to identify documentation corresponding to the query and, then, content in the code base associated with the identified documentation; and responding, with the computer system, to the query, by identifying the content in the code base associated with the identified documentation.

Some aspects include a tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations including the above-mentioned process.

Some aspects include a system, including: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate operations of the above-mentioned process.
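
The documentation-based search process summarized above (index the generated documentation, match a query against it, then return the associated code) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain keyword inverted index rather than whatever indexing an actual embodiment employs, and the names `tokenize`, `build_index`, and `search`, along with the use of file paths as part identifiers, are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index: documentation token -> ids of the documented parts."""
    index = defaultdict(set)
    for part_id, doc in docs.items():
        for token in tokenize(doc):
            index[token].add(part_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return part ids whose documentation shares tokens with the query, best first."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for part_id in index.get(token, ()):
            scores[part_id] += 1
    # Sort by descending overlap, breaking ties by part id for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

Because the index is built over the natural-language documentation rather than the code itself, a query phrased in ordinary language can surface a part whose source text never contains the query's words.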

BRIEF DESCRIPTION OF THE DRAWINGS

The above-mentioned aspects and other aspects of the present techniques will be better understood when the present application is read in view of the following figures in which like numbers indicate similar or identical elements:

FIG. 1 illustrates an example of a computing system with a transpiler and a retrieval augmented generation system in accordance with some embodiments of the present techniques;
FIG. 2 illustrates an example of a process to generate documentation for a code base in accordance with some embodiments of the present techniques;
FIG. 3 illustrates a hierarchical pre-structure of intermediate representations documenting various parts of a code base in accordance with some embodiments of the present techniques;
FIG. 4 illustrates an example of a process by which documentation may be updated in response to an update to a code base in accordance with some embodiments of the present techniques;
FIG. 5 illustrates an example of a process by which a code base may be searched in accordance with some embodiments of the present techniques; and
FIG. 6 illustrates an example of a computing device by which the computing systems and processes described herein may be implemented.

While the present techniques are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. The drawings may not be