US-12619398-B1 - Systems and methods for generating segment-specific source code for mainframe-source artifact


Abstract

Techniques for generating segment code for a legacy mainframe-source artifact that include generating an abstract-syntax-tree (CAA-AST) from a compiler-analysis artifact (CAA) that is generated by compiling a mainframe-source artifact (MSA), and identifying a plurality of logical segments of the MSA based on the CAA-AST. For each logical segment identified, determining a segment descriptor that includes a segment identifier and a segment label, generating a segment-code prompt based on the segment descriptor, and applying the segment-code prompt to a large-language model (LLM) to generate segment code, where the segment code of the plurality of logical segments is integrated to form integrated project code for the MSA.
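The flow summarized above (CAA-AST, then logical segments, then segment descriptors, then prompts, then LLM output, then integration) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the dict-shaped AST, the rule that each `paragraph` node is one segment, the `SEG-nnn` identifier format, the prompt wording, and the `stub_llm` stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SegmentDescriptor:
    segment_id: str  # segment identifier (hypothetical SEG-nnn format)
    label: str       # segment label (here, the paragraph name)


def identify_segments(caa_ast: Dict) -> List[SegmentDescriptor]:
    """Treat each top-level 'paragraph' node as one logical segment (an assumption)."""
    descriptors: List[SegmentDescriptor] = []
    for node in caa_ast.get("children", []):
        if node.get("type") == "paragraph":
            segment_id = f"SEG-{len(descriptors) + 1:03d}"
            descriptors.append(SegmentDescriptor(segment_id, node["name"]))
    return descriptors


def build_prompt(descriptor: SegmentDescriptor) -> str:
    """Build a segment-code prompt from the descriptor (wording is illustrative)."""
    return (
        f"Generate Java code for segment {descriptor.segment_id} "
        f"labeled {descriptor.label}."
    )


def generate_project(caa_ast: Dict, llm: Callable[[str], str]) -> Dict[str, str]:
    """Apply one prompt per segment to the LLM and collect the results by segment id."""
    project: Dict[str, str] = {}
    for descriptor in identify_segments(caa_ast):
        project[descriptor.segment_id] = llm(build_prompt(descriptor))
    return project


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only echoes the prompt."""
    return f"// generated from prompt: {prompt}"


# Hypothetical AST with two procedure paragraphs and one data item.
example_ast = {
    "type": "program",
    "children": [
        {"type": "paragraph", "name": "VALIDATE-INPUT"},
        {"type": "data", "name": "WS-TOTAL"},
        {"type": "paragraph", "name": "POST-LEDGER"},
    ],
}
project = generate_project(example_ast, stub_llm)
```

Passing the model as a callable keeps the sketch runnable without any external service; a real deployment would substitute an actual LLM client for `stub_llm`.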

Inventors

Senthilkumar RAMAKRISHNAN
Leigh-Ann Russell
Michael Keslar
Karthikeyan Nallathambi

Assignees

THE BANK OF NEW YORK MELLON

Dates

Publication Date: 20260505
Application Date: 20250919

Claims (20)

1. A computer-implemented method for generating segment code for a legacy mainframe-source artifact, the method comprising: receiving, from a compiler, a compiler-analysis artifact (CAA) generated by compiling a mainframe-source artifact (MSA); generating, from the CAA, an abstract-syntax-tree (CAA-AST); identifying, based on the CAA-AST, a plurality of logical segments of the MSA; for each logical segment: determining a segment descriptor comprising a segment identifier and a segment label; generating a segment-code prompt based on the segment descriptor; and generating, by applying the segment-code prompt to a large-language model (LLM), segment code; and integrating the segment code of the plurality of logical segments to form integrated project code for the MSA.
2. The method of claim 1, wherein the CAA comprises a SYSADATA file emitted by a mainframe COBOL compiler.
3. The method of claim 1, wherein the segment-code prompt is generated using a prompt template selected from a template library based on the segment label and a target programming language.
4. The method of claim 1, wherein the segment code comprises one or more Java classes that implement business logic extracted from the corresponding logical segment.
5. The method of claim 1, wherein integrating the segment code further comprises deduplicating overlapping artifacts based on segment identifiers and emitting a consolidated project structure.
6. The method of claim 1, further comprising generating segment documentation for the plurality of logical segments by: generating a segment-documentation prompt for each logical segment; applying the segment-documentation prompt to a second LLM to generate segment documentation; and integrating the segment documentation into integrated project documentation for the MSA.
7. The method of claim 1, wherein the segment code is generated by a code-generation engine implemented as a software agent.
8. A system comprising: a processor; and non-transitory computer-readable storage medium comprising program instructions stored thereon that are executable by the processor to cause the following operations for generating segment code for a legacy mainframe-source artifact: receiving, from a compiler, a compiler-analysis artifact (CAA) generated by compiling a mainframe-source artifact (MSA); generating, from the CAA, an abstract-syntax-tree (CAA-AST); identifying, based on the CAA-AST, a plurality of logical segments of the MSA; for each logical segment: determining a segment descriptor comprising a segment identifier and a segment label; generating a segment-code prompt based on the segment descriptor; and generating, by applying the segment-code prompt to a large-language model (LLM), segment code; and integrating the segment code of the plurality of logical segments to form integrated project code for the MSA.
9. The system of claim 8, wherein the CAA comprises a SYSADATA file emitted by a mainframe COBOL compiler.
10. The system of claim 8, wherein the segment-code prompt is generated using a prompt template selected from a template library based on the segment label and a target programming language.
11. The system of claim 8, wherein the segment code comprises one or more Java classes that implement business logic extracted from the corresponding logical segment.
12. The system of claim 8, wherein integrating the segment code further comprises deduplicating overlapping artifacts based on segment identifiers and emitting a consolidated project structure.
13. The system of claim 8, the operations further comprising generating segment documentation for the plurality of logical segments by: generating a segment-documentation prompt for each logical segment; applying the segment-documentation prompt to a second LLM to generate segment documentation; and integrating the segment documentation into integrated project documentation for the MSA.
14. The system of claim 8, wherein the segment code is generated by a code-generation engine implemented as a software agent.
15. Non-transitory computer-readable storage medium comprising program instructions stored thereon that are executable by a processor to cause the following operations for generating segment code for a legacy mainframe-source artifact: receiving, from a compiler, a compiler-analysis artifact (CAA) generated by compiling a mainframe-source artifact (MSA); generating, from the CAA, an abstract-syntax-tree (CAA-AST); identifying, based on the CAA-AST, a plurality of logical segments of the MSA; for each logical segment: determining a segment descriptor comprising a segment identifier and a segment label; generating a segment-code prompt based on the segment descriptor; and generating, by applying the segment-code prompt to a large-language model (LLM), segment code; and integrating the segment code of the plurality of logical segments to form integrated project code for the MSA.
16. The medium of claim 15, wherein the CAA comprises a SYSADATA file emitted by a mainframe COBOL compiler.
17. The medium of claim 15, wherein the segment-code prompt is generated using a prompt template selected from a template library based on the segment label and a target programming language.
18. The medium of claim 15, wherein the segment code comprises one or more Java classes that implement business logic extracted from the corresponding logical segment.
19. The medium of claim 15, wherein integrating the segment code further comprises deduplicating overlapping artifacts based on segment identifiers and emitting a consolidated project structure.
20. The medium of claim 15, the operations further comprising generating segment documentation for the plurality of logical segments by: generating a segment-documentation prompt for each logical segment; applying the segment-documentation prompt to a second LLM to generate segment documentation; and integrating the segment documentation into integrated project documentation for the MSA.
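
Two of the dependent claims lend themselves to a short illustration: selecting a prompt template by segment label and target language (claims 3, 10, and 17), and deduplicating artifacts by segment identifier during integration (claims 5, 12, and 19). The sketch below is one possible reading under stated assumptions, not the claimed implementation. The template library contents, the label names (`file-io`, `business-rule`), the fallback template, and the "first artifact per segment identifier wins" rule are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical template library keyed by (segment label, target language).
TEMPLATE_LIBRARY: Dict[Tuple[str, str], str] = {
    ("file-io", "java"): "Translate the file-handling segment {segment_id} into a Java class.",
    ("business-rule", "java"): "Translate the business rule in segment {segment_id} into a Java class.",
}

# Fallback used when no specific template exists (an assumption of this sketch).
DEFAULT_TEMPLATE = "Translate segment {segment_id} ({label}) into {language}."


def select_template(label: str, language: str) -> str:
    """Pick a template by segment label and target language, else the default."""
    return TEMPLATE_LIBRARY.get((label, language.lower()), DEFAULT_TEMPLATE)


def render_prompt(segment_id: str, label: str, language: str) -> str:
    """Fill the selected template; placeholders a template does not use are ignored."""
    template = select_template(label, language)
    return template.format(segment_id=segment_id, label=label, language=language)


def integrate(artifacts: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Merge (segment_id, path, source) artifacts into one path-to-source mapping.

    Deduplication rule assumed here: keep the first artifact seen for each
    segment identifier and drop later ones.
    """
    seen_ids = set()
    project: Dict[str, str] = {}
    for segment_id, path, source in artifacts:
        if segment_id in seen_ids:
            continue  # overlapping artifact for a segment already integrated
        seen_ids.add(segment_id)
        project[path] = source
    return project


prompt = render_prompt("SEG-001", "file-io", "Java")
merged = integrate([
    ("SEG-001", "src/A.java", "class A {}"),
    ("SEG-001", "src/A2.java", "class A {}"),
    ("SEG-002", "src/B.java", "class B {}"),
])
```

Keying the library on the pair of label and language keeps template selection a single dictionary lookup; other deduplication policies (for example, preferring the most recent artifact) would only change the body of `integrate`.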

Description

RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/856,052, filed Aug. 1, 2025 and titled “SYSTEMS AND METHODS FOR TRANSFORMING LEGACY MAINFRAME CODE INTO INTEGRATED PROJECT CODE AND DOCUMENTATION”, the disclosure of which is incorporated herein by reference in its entirety.

FIELD

Embodiments relate generally to legacy code transformation and, more particularly, to systems and methods for converting source programs developed for computing environments into modern software artifacts, such as integrated source code and documentation.

BACKGROUND

Mainframe computing systems have been widely used for decades in enterprise environments to support critical business operations such as transaction processing, batch computing, financial record-keeping, and customer data management. These systems typically execute programs written in legacy programming languages tailored for mainframe architectures, such as COBOL, PL/I, Job Control Language (JCL), CICS command-level code, and IMS control blocks. Mainframe applications are often large, monolithic, and deeply integrated into enterprise workflows, with production environments that may encompass hundreds of millions of lines of source code.

Modern software development increasingly favors modular, distributed systems built using object-oriented or service-oriented architectures, often implemented in widely adopted programming languages such as Java, Python, or Go. These languages are typically deployed in cloud-native environments, emphasize maintainability and scalability, and are supported by contemporary toolchains for build automation, testing, and documentation. Java in particular is frequently used to implement microservices and is supported by extensive developer ecosystems and frameworks.
Many organizations seek to migrate or transform their legacy mainframe applications into modern software stacks to reduce operational costs, improve agility, and ensure long-term maintainability.

SUMMARY

Although mainframe systems continue to support core enterprise functions, transforming these systems into modern, maintainable software environments presents significant technical and operational challenges. Many legacy applications were developed decades ago using languages such as COBOL, JCL, and CICS, and often span millions of lines of tightly coupled, procedural code. Over time, institutional knowledge of these systems has diminished. Organizations now face a growing shortage of subject-matter experts (SMEs) capable of interpreting, modifying, or migrating mainframe code, as the workforce skilled in these technologies continues to retire or shift to other domains.

Compounding the problem is a lack of accessible, up-to-date documentation for many legacy applications. In many cases, the original program specifications are incomplete, outdated, or nonexistent. This limits the ability of developers to understand business logic embedded in the code and slows down maintenance and transformation efforts.

Conventional modernization techniques, such as manual rewrites, line-by-line code converters, or lift-and-shift rehosting, often fail to deliver the semantic clarity and modularity required for modern application architectures. Moreover, these approaches are typically expensive, time-consuming, and can be dependent on brittle, grammar-based parsers that must be tuned to each legacy environment's idiosyncrasies.

Additionally, existing tools generally operate using static analysis alone and lack access to runtime insights that could improve transformation fidelity. They may generate syntactically correct target code, but with poor alignment to actual usage patterns, performance characteristics, or domain-specific business boundaries.
Most solutions also treat code generation and documentation as distinct, disconnected processes, leading to gaps in traceability and limiting the maintainability of the resulting system. Moreover, existing approaches generally do not leverage compiler-generated analysis artifacts, such as SYSADATA files, to produce an abstract syntax tree, segment that tree, and drive coordinated operation of the documentation engine, code-generation engine, chat engine, and code-migration engine described in this disclosure.

Provided are systems and methods for transforming legacy mainframe software artifacts into integrated project code and accompanying documentation using a compiler-guided transformation pipeline. In some instances, rather than relying on text-based parsing of source files, the disclosed system compiles a legacy mainframe-source artifact (MSA) using a mainframe compiler to produce a compiler object artifact (COA) and a compiler analysis artifact (CAA). The CAA provides a fully parsed, compiler-resolved representation of the program (including expanded copybooks, data definitions, symbol references, and control-flow structure), which is converted to machine-readable XML and parsed to form a canonical a