Provisioning Synthetic Data with GenAI at Enterprise Scale
Part 2: Scalability Requirements for Managing the Full Data Provisioning Life Cycle
Part 1 of this series focused on how to leverage generative AI (GenAI) tools for provisioning synthetic data to ensure data quality in a complex enterprise environment. It described the limitations and risk factors presented by GenAI tools and their large language models (LLMs). Part 2 examines the essential factors for provisioning data on a global scale, along with strategies for leveraging GenAI using a single data platform at enterprise scale.
Managing the Data Provisioning Life Cycle
An enterprise-class service delivery platform is a robust, scalable, and secure framework designed to support the delivery of a wide range of services and capabilities within a large organization. GenAI technologies are a class of interactive tools meant to be operated by individual users. They were not designed as a platform capable of managing the full life cycle for test data provisioning.
The data provisioning life cycle starts with a distributed self-service portal where developers, testers, and data scientists requisition their data. In the GenRocket platform, these requests are processed through a streamlined, automated methodology to model, design, deploy, and manage the precise data needed for each data requirement. The cycle is complete when the required data is delivered to train a machine learning model or fed into an automated testing framework in the CI/CD release pipeline.
GenAI tools will never live up to this standard of service delivery. We know this because over the past 10 years, GenRocket has been working with global enterprise customers to build out a test data provisioning platform with true enterprise scalability.
Here are the important elements of an enterprise-class service delivery platform for Synthetic Test Data Automation (TDA) presented as a checklist to assess the enterprise scalability of GenAI.
One Data Provisioning Platform: Because data requirements vary from application to application, data science, development, and test teams must be supported by a single platform for provisioning any kind of data. Otherwise, support staff must learn multiple tools, increasing the cost of licenses, training, and administration. In addition, organizations with a large number of agile teams adopt synthetic data gradually, not all at once. Therefore, the platform must offer a blend of synthetic and production data solutions to meet the diverse requirements of a global workforce without the disruption of rapid technology change.
The GenRocket Solution: GenRocket’s TDA platform supports the subsetting and masking of production data, as well as the generation of synthetic data. This allows engineers and testing personnel to continue to use production data for testing because it is familiar to them. It also allows a graceful transition to synthetic data to improve coverage, strengthen data security, and shorten provisioning times.
Strong Security Framework: Any enterprise-class service delivery platform must provide security at multiple levels. It must employ a service architecture that provides a secure operating environment, access controls for global user communities, and absolute protection of data privacy, especially with regard to the use of production data.
The GenRocket Solution: GenRocket’s secure hybrid cloud service architecture offers the best of both worlds. Cloud-based collaboration and self-service support globally distributed teams, while local runtime engines for data delivery operate behind the corporate firewall. Extensive role-based, multi-factor access controls are centrally managed. The modeling, design, and generation of synthetic data require access only to metadata (e.g., a database schema or DDL file), which provides a blueprint for the data generation process. GenRocket never exposes customer data and never generates synthetic data outside the corporate firewall.
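To make the metadata-only idea concrete, here is a minimal Python sketch of generating rows from column definitions alone. It is an illustration under stated assumptions, not GenRocket's implementation: the `SCHEMA` dictionary, the `generate_rows` function, and the three type names are hypothetical stand-ins for the kind of information a DDL file would supply.

```python
import random
import string

# Hypothetical schema metadata (column name -> type), the kind of information
# a DDL file would provide. No production rows are read at any point.
SCHEMA = {
    "customer_id": "int",
    "name": "str",
    "balance": "float",
}

def generate_rows(schema, count, seed=0):
    """Generate `count` synthetic rows from column metadata alone."""
    rng = random.Random(seed)  # seeded so repeated runs give identical data
    makers = {
        "int": lambda i: i + 1,  # sequential surrogate key starting at 1
        "str": lambda i: "".join(rng.choices(string.ascii_lowercase, k=8)),
        "float": lambda i: round(rng.uniform(0, 10_000), 2),
    }
    return [
        {col: makers[kind](i) for col, kind in schema.items()}
        for i in range(count)
    ]

rows = generate_rows(SCHEMA, 3)
```

Because the generator sees only column names and types, nothing sensitive has to leave the environment where the real data lives, and the fixed seed makes the output repeatable from one test run to the next.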
Centralized Platform Management: To properly deploy and manage a global platform, centralized management is required. The ability to add users, assign roles and responsibilities, track utilization and adoption, control system configuration options, and manage distributed system resources is essential.
The GenRocket Solution: To manage the GenRocket platform, one or more organization administrators are assigned to the system. Org Admins have the ability to set up and maintain all users and system resources. This includes security controls, configuration options, the management of distributed runtime engines and their repositories, and centralized control over the organization’s Test Data Project libraries.
Distributed Self Service: This capability is required to support a large, distributed team of users associated with one or more value streams within the organization. A self-service portal must allow users to submit test data requests to a centralized team of test data experts and track the resolution of those requests.
The GenRocket Solution: A capability called G-portal provides an easy-to-use ticketing system for test data requests. Through G-portal, users describe the nature of the data they require for a given test case and can download a preconfigured Test Data Case for execution in their local data environment. Users can also make quick changes to their Test Data Cases using a feature called G-questionnaire.
Automated Process Flow: Traditional Test Data Management (TDM) is known to be complex and cumbersome, and it often results in test data that fails to meet all test case objectives. This outdated provisioning process needs modernization and automation to make it simpler and faster. GenAI tools lack the fundamental capabilities to make test data provisioning an efficient and accelerated process on a global scale.
The GenRocket Solution: GenRocket employs a four-step methodology for test data provisioning with intelligent automation at each stage. That includes the process of data modeling, the design of Test Data Cases for multiple categories of testing, and the seamless integration with automation tools for development and testing. Additionally, Test Data Cases can be stored and categorized in a global repository, re-used and repurposed, version controlled, and synchronized between distributed user environments.
Automated Change Management: Because applications and their databases are dynamic, enterprise-class platforms must have the ability to adapt to these changes. Database tables can be added or modified, so generated test data must reflect those changes. As new software features are added, new data requirements will emerge. An enterprise-class data provisioning platform must manage these changes and ensure that test data stays up to date across the test data provisioning life cycle.
The GenRocket Solution: GenRocket incorporates intelligent automation to detect changes in data structures and automatically update its Test Data Projects to reflect those changes. Similarly, when test data engineers design or modify Test Data Cases, those changes are synchronized across a distributed repository where they are stored for user access. This makes the platform adaptable to change, while streamlining the data provisioning process.
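As a rough illustration of what detecting a change in data structures involves, the Python sketch below compares two versions of a table's column metadata and reports what was added, removed, or retyped. The `diff_schema` function and the sample schemas are hypothetical, and the sketch makes no claim about how GenRocket performs this step internally.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and report the differences."""
    added = {col: kind for col, kind in new.items() if col not in old}
    removed = [col for col in old if col not in new]
    retyped = {
        col: (old[col], new[col])
        for col in old
        if col in new and old[col] != new[col]
    }
    return {"added": added, "removed": removed, "retyped": retyped}

# Hypothetical before/after metadata for a single table.
old_schema = {"id": "int", "email": "varchar(64)", "fax": "varchar(20)"}
new_schema = {"id": "bigint", "email": "varchar(64)", "phone": "varchar(20)"}

changes = diff_schema(old_schema, new_schema)
```

A provisioning tool could use a report like `changes` to decide which data generators to add, retire, or reconfigure before the next test run.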
Automated Data Delivery: A test data provisioning platform must fully automate and orchestrate the delivery of the right data into the right test at the right time. This implies seamless integration with all appropriate tools in the test environment, so data delivery is fully aligned with automated build, test, and release procedures.
The GenRocket Solution: GenRocket’s test data provisioning platform fully automates the delivery of test data into a test case within a CI/CD pipeline. With GenRocket, a single command can invoke an executable Test Data Case (or a suite of Test Data Cases) during automated test procedures to deliver controlled and conditioned test data on demand and in real time. GenRocket is able to integrate with any development tool or test automation framework.
Scalable Performance: An enterprise-class solution for test data provisioning must be scalable in terms of both user capacity and processing power. The system must be capable of supporting thousands of users who might simultaneously request and receive test data needed for automated testing. In terms of performance, the solution must be capable of generating billions of rows of synthetic data in a short period of time without degrading the services provided to other users.
The GenRocket Solution: The GenRocket platform was designed with this kind of scalability in mind. Its modular and distributed architecture lends itself to unlimited scalability. Any number of runtime engines can be made operational to support greater user capacity and processing loads. GenRocket’s partition engine allows synthetic data to be generated as parallel processes and delivered as a consolidated data set with billions of rows of data in a matter of minutes or hours. Users who are generating synthetic data on separate runtime engine instances will not be impacted by this heavy processing load.
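The general pattern behind partitioned generation can be sketched in a few lines of Python: split the requested row range into non-overlapping partitions, generate each one concurrently, and concatenate the results in order. This is a simplified illustration, not GenRocket's partition engine. All function names are hypothetical, and a thread pool is used only to keep the example portable; a production system would run partitions in separate processes or on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_ranges(total_rows, partitions):
    """Split [0, total_rows) into contiguous, non-overlapping ranges."""
    base, extra = divmod(total_rows, partitions)
    ranges, start = [], 0
    for p in range(partitions):
        size = base + (1 if p < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

def generate_partition(bounds):
    """Generate the rows for one partition; ids are unique across partitions."""
    start, stop = bounds
    return [{"id": i, "amount": (i * 37) % 1000} for i in range(start, stop)]

def generate_partitioned(total_rows, partitions):
    """Generate all partitions concurrently and consolidate them in order."""
    ranges = partition_ranges(total_rows, partitions)
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        # map() yields results in input order, so the merged set stays sorted.
        chunks = list(pool.map(generate_partition, ranges))
    return [row for chunk in chunks for row in chunk]

rows = generate_partitioned(10, 3)
```

Because each partition owns a disjoint id range, the workers never need to coordinate with one another, which is what lets this style of generation scale out by adding more workers.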
Analytics & Reporting: For administrators to manage the expansion and adoption of synthetic data generation technology, a number of key metrics must be tracked to ensure the most efficient performance and service delivery available from the platform. In addition to utilization metrics and adoption rate, the platform should also track value metrics that allow an ongoing assessment of performance, efficiencies, and continuous improvement.
The GenRocket Solution: GenRocket offers the only test data provisioning solution that monitors the volume and frequency of synthetic data design along with data generation rates by users and Agile teams. It tracks the number of Test Data Projects created and Test Data Cases that have been designed. It also tracks the components used to generate synthetic data, such as rules and queries. And it tracks how Test Data Cases are combined into Stories and Epics for more advanced categories of testing. This helps to monitor and manage the continuous improvement of the system as it delivers greater test coverage, accelerated test cycle times, and improved team efficiencies. And it allows time savings and test coverage to be calculated as a return on your technology investment.
Platform Extensibility: An enterprise-class test data provisioning platform must be extensible to support new forms of functionality, such as ways to leverage artificial intelligence to advance the state of the technology. Ultimately, test data provisioning can become a process that not only keeps pace with the speed of development but continues to raise the quality of software released to production. This will lead to new and innovative features that will make the organization more competitive in the marketplace.
The GenRocket Solution: The GenRocket platform has been continuously enhanced and refined over the past decade to form a highly extensible platform with more than 60 integrated components representing the industry’s most advanced Synthetic Test Data Automation solution. GenRocket continues to add new intelligent data generators, and our technology roadmap is guided by the most effective use of artificial intelligence across the platform. One example is the recent addition of an intelligent data generator that integrates with generative AI tools. It brings GenAI into the GenRocket platform and gives GenAI a new level of management and scalability. This will be covered more deeply in the next section.
Integrating GenAI with GenRocket – The Best of Both Worlds
As covered in Part 1 of this series, GenAI is well suited for generating conversational or textual content based on a pre-engineered “prompt” that defines the content generation requirements. GenRocket, on the other hand, excels at tabular synthetic data generation, where its TDA platform gives fine-grained control over the volume, variety, and format of data for complex data environments and generates that data quickly and accurately.
The best of both worlds would be to combine the two technologies and leverage GenAI for its ability to generate predefined text and GenRocket to generate controlled and conditioned tabular data. The GenRocket platform can integrate these two complementary synthetic data generation processes and provide merged output directly into a test or development framework. Additionally, the enterprise scalability features inherent in the GenRocket platform (described above) can be used to deploy the integrated solution on a global scale.
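The shape of such a merge can be sketched in Python: a rule-based generator produces the controlled tabular columns, a text generator fills in the free-text column from a prompt built per row, and the two are combined into one record. Everything here is hypothetical. In particular, `generate_text` is a placeholder that returns canned text where a real GenAI call would go, and none of the names reflect GenRocket's actual generator interfaces.

```python
def generate_tabular(count):
    """Rule-based tabular data: sequential ids and a cycling priority value."""
    priorities = ("low", "medium", "high")
    return [
        {"ticket_id": 1000 + i, "priority": priorities[i % 3]}
        for i in range(count)
    ]

def generate_text(prompt):
    """Placeholder for a GenAI call; returns canned text so the sketch runs offline."""
    return f"[generated text for: {prompt}]"

def merge(rows, text_fn):
    """Attach a generated free-text field to each controlled tabular row."""
    merged = []
    for row in rows:
        # The prompt is conditioned on the row so the text matches the data.
        prompt = f"Write a {row['priority']}-priority support ticket description."
        merged.append({**row, "description": text_fn(prompt)})
    return merged

records = merge(generate_tabular(3), generate_text)
```

Passing the text generator in as a function keeps the tabular logic independent of any particular GenAI service, so the placeholder can be swapped for a real client without touching the rest of the pipeline.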
This allows a graceful migration from traditional Test Data Management (TDM) systems and their overdependence on production data to the more secure and streamlined use of synthetic data for all categories of testing. Using one data platform consolidates data delivery into a single scalable Test Data Automation (TDA) platform that enables self-service and integrates with any automated dev and test framework.
With the appropriate integration of GenAI technology, developers and testers can have easy access to quality synthetic data for fully automated testing as well as comprehensive training data for building accurate machine learning models for their AI-based applications.
GenAI integration is just one example of the extensibility of GenRocket’s platform to incorporate new technologies and support emerging use cases and applications. If you would like to learn how GenRocket can provide a single data platform for your organization’s data provisioning requirements, be sure to schedule a personalized demo with one of our synthetic data experts.
Stay tuned for Part 3 in this series on leveraging GenAI for synthetic data provisioning, where we present a real-world application of the integration strategies described in this article.