Consideration of Design and Regulatory Agency Requirements
As embedded and server-based machine learning becomes increasingly prevalent, verifying system performance for agency and regulatory compliance requires up-front design consideration.
For instance, in a medical device program, thousands of patient data samples may be gathered from existing sources or, in some instances, obtained during clinical data-gathering exercises using existing monitoring devices or new devices created for the specific application under development. In some cases, the equipment collecting the data may have no predicate or standardized mechanism of collection, and may collect data in a manner not yet characterized.
In this case, the collected data becomes the gold standard, not only for training the machine learning algorithm but also for verifying that the algorithm is indeed performing as required.
Up-front design choices will ensure the data remain usable during development and throughout the life of the product: how collected data will be stored, how they will be partitioned into the required data sets, how they will be over-read to confirm that the algorithm output matches the annotated input data, and how an evolving database will be updated and maintained.
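As one concrete illustration, set membership can be derived deterministically from a stable record identifier rather than from a random draw, so that records never migrate between sets as the database grows. The following is a minimal sketch in Python, assuming each record carries a unique ID; the helper name and the 70/30 split are illustrative choices, not prescribed by any guidance.

import hashlib

# Hashing a stable record ID means a record always lands in the same
# set, even as new data are added to the evolving database, so records
# never migrate between the development and validation sets.

DEV_FRACTION = 0.7  # illustrative split; the actual ratio is a program-level design choice

def assign_partition(record_id: str) -> str:
    """Deterministically map a record ID to 'development' or 'validation'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "development" if bucket < DEV_FRACTION else "validation"

print(assign_partition("patient-0042/ecg-session-3"))  # stable across runs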
Development Set: The database used to develop the algorithms, consisting of over-read or annotated data. The development data set should ideally contain representative data for all of the expected parameters, as well as conditions that may be encountered in a normal clinical setting, such as line noise in an ECG or EEG signal.
Validation Set: The database used to test and report on the performance of the algorithm. This database should cover all expected normal parameter variation, abnormal conditions, expected noise, and so on. In addition, this database requires ‘full space’ coverage of the problem space.
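A simple audit can make the coverage requirement testable. The sketch below assumes each validation record is tagged with a condition label; the label names are hypothetical examples, and a real program would derive the required list from its requirements specification.

from collections import Counter

# Hypothetical condition labels standing in for the 'full space' categories.
REQUIRED_CONDITIONS = {
    "normal_sinus", "bradycardia", "tachycardia",   # expected normal variation
    "atrial_fibrillation",                          # abnormal condition
    "line_noise", "motion_artifact",                # expected noise
}

def audit_coverage(validation_records):
    """Count conditions present and report any required conditions missing."""
    seen = Counter(r["condition"] for r in validation_records)
    missing = REQUIRED_CONDITIONS - seen.keys()
    return seen, missing

records = [{"condition": "normal_sinus"}, {"condition": "line_noise"}]
counts, missing = audit_coverage(records)
print("coverage counts:", dict(counts))
print("missing required conditions:", sorted(missing))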
Typically, a set of tools is developed to automate the process: converting raw data into a common format, supporting over-reading, applying the data to the algorithm, and reporting the results of the algorithm output. This set of tools must itself be validated prior to final verification testing of the algorithm.
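The shape of such a toolchain might look like the following sketch. Every name here (Annotation, run_algorithm, verification_report) is a hypothetical placeholder; the point is the flow from over-read truth to an automated agreement report.

from dataclasses import dataclass

@dataclass
class Annotation:
    record_id: str
    truth_label: str  # the over-reader's adjudicated label (ground truth)

def run_algorithm(record_id: str) -> str:
    """Placeholder for the algorithm under test."""
    return "normal_sinus"

def verification_report(annotations: list) -> float:
    """Compare algorithm output against over-read truth; report agreement."""
    matches = sum(run_algorithm(a.record_id) == a.truth_label for a in annotations)
    return matches / len(annotations)

annotated = [Annotation("rec-001", "normal_sinus"),
             Annotation("rec-002", "atrial_fibrillation")]
print(f"agreement: {verification_report(annotated):.0%}")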
Verification will be complicated if an applicant cannot convincingly establish ground truth ("ground truthing" refers to the process of gathering objective, provable reference data for a test), because the FDA's approach is predicated on the ability to confidently establish what an image or a classifier output accurately represents. Performance in finding that truth must then be compared between people and machine: clinical studies need to compare human readers assisted by the software with readers working without it to determine the validity of the software.
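The reader-study comparison reduces to measuring each arm against the same ground truth. The sketch below shows the arithmetic for one common endpoint, sensitivity; the counts are invented purely to illustrate the calculation and do not come from any study.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of truly positive cases the reader correctly identified."""
    return true_positives / (true_positives + false_negatives)

# Invented illustrative counts: 100 positive cases per arm, same ground truth.
unaided  = sensitivity(true_positives=80, false_negatives=20)  # readers alone
assisted = sensitivity(true_positives=90, false_negatives=10)  # readers + software
print(f"unaided: {unaided:.0%}, assisted: {assisted:.0%}, delta: {assisted - unaided:+.0%}")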
In the 2012 guidance document, the FDA lists the information it needs for review of software employing deep neural networks, including the algorithm design, features, models, classifiers, the data sets used to train and test the algorithm, and the test data hygiene employed. Although this guidance document applies to imaging and image classification, it represents the agency's thinking toward future machine learning submission and performance evaluation requirements.
Adaptive systems that evolve over time based on new data collected after the device goes to market present a unique problem. When medical devices change, new approvals must be sought, so the FDA needs to determine when changes to adaptive systems constitute “new changes,” thereby triggering the need for new validation.
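One way a manufacturer might operationalize this question internally is a change-control gate: re-run the locked validation set after each model update and escalate any update whose performance leaves a predefined band. This is a speculative sketch; the thresholds are assumptions for illustration, and what constitutes a new change is the agency's determination, not this check's.

BASELINE_SENSITIVITY = 0.92  # performance at clearance (hypothetical value)
ALLOWED_DRIFT = 0.02         # illustrative acceptance band, a design assumption

def requires_review(updated_sensitivity: float) -> bool:
    """Flag any model update whose validation performance leaves the band."""
    return abs(updated_sensitivity - BASELINE_SENSITIVITY) > ALLOWED_DRIFT

print(requires_review(0.95))  # True: outside the band, escalate for review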
In meeting agency requirements, data management will prove to be key.