Guiding Principles for Real-World Data Used to Generate Real-World Evidence (Trial)

Promulgated by: Center for Drug Evaluation, National Medical Products Administration (NMPA-CDE). Document No.: NMPA-CDE Announcement 2021 No. 27 (国家药监局药审中心通告2021年第27号). Issued on April 13, 2021. Effective upon publication.

Background and Purpose

Real-world evidence forms an important component of the evidentiary chain for evaluating drug efficacy and safety; related concepts and applications are addressed in the Guiding Principles for Real-World Evidence Supporting Drug Development and Review (Trial) (2020). Real-world data, in turn, is the foundation for generating real-world evidence: without high-quality, fit-for-purpose real-world data, real-world evidence cannot be produced.

Real-world data means all data collected in the course of daily practice that relates to patients’ health status and/or their diagnosis, treatment, and healthcare. Not all real-world data, once analysed, can generate real-world evidence; only real-world data that satisfies suitability requirements, and that is subjected to appropriate and adequate analysis, has the potential to form real-world evidence. At present, the recording, collection, and storage processes for real-world data often lack rigorous quality control, giving rise to problems such as incomplete data, inconsistent data standards, non-uniform data models, and divergent descriptive methods — all of which impede the effective use of real-world data.

These guiding principles supplement the Guiding Principles for Real-World Evidence Supporting Drug Development and Review (Trial) and provide specific requirements and guidance recommendations on the definition, sources, assessment, governance, standards, security and compliance, quality assurance, and suitability of real-world data, in order to help sponsors better conduct data governance and assess real-world data suitability, and to be fully prepared to generate valid real-world evidence.

Section I — Overview

Real-world evidence is an important component of the evidentiary chain for evaluating drug efficacy and safety. Real-world data is the foundation for producing real-world evidence. Real-world data means all data collected in the course of daily practice that relates to patients’ health status and/or their diagnosis, treatment, and healthcare.

Not all real-world data, once analysed, can generate real-world evidence; only real-world data that satisfies suitability requirements, and that is subjected to appropriate and adequate analysis, may form real-world evidence. Current processes for recording, collecting, and storing real-world data often lack rigorous quality control; problems such as incomplete data, inconsistent data standards, non-uniform data models, and divergent descriptive methods may arise, obstructing the effective use of real-world data. Accordingly, how to render collected real-world data fit — or to render it fit after governance — for the analytical purposes required by clinical research, and how to assess whether real-world data is suitable for generating real-world evidence, are the key questions in using real-world data to produce real-world evidence that supports drug regulatory decisions.

These principles, as a supplement to the Guiding Principles for Real-World Evidence Supporting Drug Development and Review (Trial), provide specific requirements and guidance recommendations on the definition, sources, assessment, governance, standards, security and compliance, quality assurance, and suitability of real-world data, in order to help sponsors better conduct data governance, assess real-world data suitability, and be fully prepared to generate valid real-world evidence.

Section II — Sources and Current State of Real-World Data

Real-world data relevant to drug development principally includes data recorded during the diagnostic and treatment process in real-world healthcare settings (such as electronic medical records) and data from various observational studies. Such data may have been collected prior to the commencement of a real-world study, or may be newly collected for the purpose of conducting a real-world study.

(I) Common Principal Sources of Real-World Data

In China, real-world data sources mainly include hospital information system data, medical insurance payment data, registry study data, active drug safety surveillance data, and natural population cohort data, among others. Common sources, classified by functional type, are set out below.

1. Hospital Information System Data

Hospital information system data encompasses structured and unstructured, digital or non-digital patient records — such as a patient’s demographic characteristics, clinical characteristics, diagnoses, treatments, laboratory test results, safety information, and clinical outcomes — typically stored in disparate information systems within healthcare institutions, such as electronic medical records/electronic health records, laboratory information management systems, picture archiving and communication systems, and radiology information management systems. Some healthcare institutions have established institution-level research data platforms on the basis of data integration platforms or clinical data centres, consolidating information from outpatient, inpatient, and follow-up encounters into data directly usable for clinical research. Some regional medical databases, using relatively centralised physical environments for cross-institution clinical data storage and processing, are characterised by large storage capacity and diverse data types, and may serve as potential sources of real-world data.

Hospital information system data is based on records of the clinical diagnostic and treatment process, covering a broad range of clinical outcomes and drug exposures; electronic medical record data in particular is widely used in real-world research.

2. Medical Insurance Payment Data

Medical insurance payment data in China has two main sources: (i) basic medical insurance systems established and uniformly administered by governments and healthcare institutions, containing structured data fields relating to basic patient information, healthcare service utilisation, prescriptions, settlements, and medical claims; and (ii) commercial health insurance databases, established by insurers, with data classified by insurance company claim payments and policy duration, and relatively simple in data dimensions. Medical insurance systems, as a source of real-world data, are more commonly used for health technology assessment and pharmacoeconomic research.

3. Registry Study Data

Registry study data is data collected through organised systems using observational research methods from clinical and other sources, usable for evaluating clinical outcomes in specified disease, health-condition, or exposure populations. Registry studies are mainly classified, according to the characteristics of the defined study population, into medical product registries, disease registries, and health service registries; in China, the first two categories are predominant. Drug registries supported by healthcare institutions and enterprises, in particular, observe patients using a given drug, with a focus on monitoring clinical efficacy across different indications or observing adverse reactions.

The advantage of registry study databases lies in their use of specific patient populations as the study group, integrating multiple data sources such as clinical diagnosis and treatment data and medical insurance payment data; data collection is relatively standardised, generally including patient-reported data and long-term follow-up data; observed outcome indicators are usually rich, with the advantages of relatively high accuracy and strong structuring. Such data is well suited for evaluating drug efficacy, safety, economics, and adherence, and may also be used for research on natural disease history and prognosis.

4. Active Drug Safety Surveillance Data

Active drug safety surveillance data is principally used for drug safety research and pharmacoepidemiology research. Data is collected through national or regional drug safety surveillance networks from sources including healthcare institutions, pharmaceutical companies, medical literature, online media, and patient-reported outcomes. In addition, proprietary drug safety surveillance databases established by healthcare institutions and enterprises may also form part of this data source.

5. Natural Population Cohort Data

Natural population cohort data refers to data obtained through long-term prospective dynamic tracking and observation of healthy and/or patient populations. It is characterised by uniform standards, information-based data sharing, long time spans, and relatively large sample sizes. Such real-world data can help build risk models for common diseases and support the precise identification of target populations for drug development.

6. Omics Data

Omics data, as an important underpinning of precision medicine, principally includes genomic, epigenetic, transcriptomic, proteomic, and metabolomic data; these data characterise patients’ genetic, physiological, and biological features from a systems-biology perspective. Omics data typically requires combination with clinical data to become fit-for-purpose real-world data.

7. Death Registry Data

Population death registration is the continuous and complete collection and recording of death information for a country’s citizens. China currently has four systems for collecting population death information, administered respectively under the Chinese Center for Disease Control and Prevention, the National Health Commission, the Ministry of Public Security, and the Ministry of Civil Affairs. Population death registry data encompasses all information in medical death certificates, recording detailed causes and times of death; it may serve as a data source for population cause-specific mortality rates and clinical outcomes of major diseases.

8. Patient-Reported Outcome Data

Patient-reported outcomes are indicators measuring and evaluating disease outcomes from the patient’s own perspective, encompassing symptoms, physiological factors, psychological factors, and satisfaction with healthcare services. Patient-reported outcomes are increasingly important in drug evaluation systems. They may be recorded on paper or electronically (the latter are called electronic patient-reported outcomes); the rise and application of electronic patient-reported outcomes makes it possible to interface patient-reported outcomes with electronic medical record systems to form a complete patient-level data flow.

9. Individual Health Monitoring Data from Mobile Devices

Personal health monitoring data may be collected in real time from individual physiological signs through mobile devices (such as smartphones and wearable devices). Such data is commonly generated in the course of ordinary people’s self-health management, healthcare institutions’ monitoring of patients with chronic diseases, and health insurers’ assessment of the health status of insured populations; it is typically stored in wearable-device enterprise databases, healthcare institution databases, and commercial insurance company data systems. Given the convenience and immediacy of wearable devices in collecting physiological data, interfacing with electronic health data can form more complete real-world data.

10. Other Purpose-Specific Data

(1) Public Health Surveillance Data

China has established a series of public health surveillance databases — covering, for example, infectious disease surveillance and monitoring of adverse events following immunisation — that record data usable for analysing the incidence of infectious diseases and the rates of general and abnormal adverse reactions to vaccines.

(2) Patient Follow-up Data

In real-world clinical diagnostic and treatment settings, in-hospital electronic medical record data often does not capture certain important clinical indicators, such as overall survival, five-year survival rates, and adverse reaction information; supplementary long-term follow-up data is therefore required to form fit-for-purpose real-world data. Patient follow-up data principally refers to data that hospital follow-up departments or authorised third-party service providers collect from discharged patients, by correspondence, telephone, outpatient visits, text messages, or online follow-up, for purposes including ascertaining clinical endpoints, rehabilitation guidance, medication reminders, and satisfaction surveys. Such data is typically stored in hospital follow-up data systems; linkage with medical record data enables the integration of multi-source clinical data for exploring disease mechanisms, development patterns, treatment methods, and prognostic factors.

(3) Patient Medication Data

Patient medication data from the diagnostic and treatment process — including patient information, drug specifications, dosage and usage, and adverse reactions — is typically stored in hospital pharmacy management information systems, pharmaceutical e-commerce platforms, pharmaceutical enterprise product traceability and drug safety information databases, and drug use surveillance platforms. With the spread of remote diagnosis and treatment and internet-enabled chronic disease management, patient out-of-hospital medication data stored in prescription circulation platforms or pharmaceutical e-commerce platforms is gradually increasing; the effective use or linkage of such data may serve as a source of real-world data recording the patient-level diagnostic and treatment process.

As medical information technology continues to develop, new types and sources of real-world data will continuously emerge; however, their specific application depends on the clinical research question to be addressed and the suitability of the data for supporting the generation of real-world evidence.

(II) Major Challenges in Real-World Data Application

From the perspective of data sources, compared with randomised controlled trial (RCT) data, real-world data in most cases lacks rigorous quality control in its recording, collection, and storage processes, giving rise to problems such as incomplete data, missing key variables, and inaccurate records. These deficiencies in data quality significantly affect subsequent data governance and use, and may even affect data traceability, making it difficult for researchers to identify problems and to verify and correct them. Changes in factors such as patient disease course, treatment location, time, and space may lead to gaps in information about patients’ disease status and related factors, presenting challenges for the systematic evaluation of disease status and outcomes in clinical research. Selective data collection — in particular in registry study data — is a potential source of research result bias.

Real-world data sources are relatively independent and closed, data management systems are varied, data storage is dispersed, data standards are inconsistent, and data is difficult to integrate and exchange across systems; together these produce pronounced data fragmentation and information silos. Because electronic medical record data is highly sensitive, such systems are generally managed in a closed manner, and use of the data may be subject to restrictions. Electronic medical records may also be affected by subjective narrative descriptions and variation among recorders, which influences the objective assessment of clinical outcomes. Moreover, in the absence of uniform standards, data types are diverse, encompassing structured data as well as unstructured and semi-structured data such as text, images, and video; redundant and duplicated data may also arise during recording, collection, and storage, further increasing the difficulty of data processing.

Section III — Real-World Data Suitability Assessment

Suitability assessment of real-world data should be based on the specific research purpose and regulatory decision-making use.

(I) Data Governance and Data Management of Real-World Data

In terms of when the research is conducted, real-world data may be collected retrospectively or prospectively. Retrospectively collected data generally requires data governance; it principally originates from previously conducted retrospective, prospective, and retrospective-prospective observational studies. Prospectively collected data requires data management; it principally originates from prospective observational studies yet to be conducted, or from pragmatic clinical trials. Collection of such data resembles RCT data collection: databases are established and data is collected through electronic data capture systems in accordance with the research protocol, so the data is prospective, planned, structured, and standardised. Where a study uses both previously collected data and data still to be collected (for example, a retrospective-prospective study beginning from the present), the retrospectively collected data requires governance, while the prospectively collected data is handled using data management methods. A key point is that the database resulting from governance of the historical data should match the prospectively designed database. For single-arm clinical trials using external controls, historical control data requires governance, while concurrent control data may be handled using data management methods.

The suitability assessment of real-world data is primarily directed at retrospectively collected data, but also has guidance value for prospectively collected data.

The suitability assessment may be divided into two stages. The first stage involves the preliminary assessment and selection of source data on dimensions including accessibility, ethics, compliance, representativeness, completeness of key variables, sample size, and source data activity status, to determine whether the data satisfies the basic analytical requirements of the research protocol. The second stage involves assessment and analysis of data relevance and reliability, as well as of the data governance mechanism (data standards and common data model) adopted or to be adopted — specifically whether the governed data is suitable for generating real-world evidence (see Figure 1). If the real-world data is prospectively collected, the first stage of preliminary suitability assessment is not required.

(II) Suitability Assessment of Source Data

Source data satisfying basic analytical requirements should, at a minimum, meet the following conditions:

1. The database is active and data is accessible.

During the research period, the database should be continuously in an active state; all recorded data should be accessible — meaning there are usage rights to the data — and should be capable of evaluation by third parties, in particular regulatory authorities.

2. Data use complies with ethical and security requirements.

The use of source data should comply with the requirements of ethical review regulations and with relevant data security and privacy protection requirements.

3. Coverage of key variables.

Source data is typically incomplete, but should have a certain degree of coverage; it should at least include outcome variables, exposure/intervention variables, demographic variables, and important covariates relevant to the research purpose.

4. Sufficient sample size.

The potential for a significant reduction in source data case numbers after data governance should be fully considered and anticipated, in order to ensure the sample size required for statistical analysis.
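
As a rough illustration of conditions 3 and 4, the sketch below computes per-variable coverage and the number of complete cases in a handful of made-up records. The variable names, the records, and the use of complete-case counting as a proxy for post-governance sample size are assumptions for illustration only; an actual screening would follow the key variables and sample size defined in the research protocol.

```python
# Hypothetical key variables for an illustrative study.
KEY_VARIABLES = ["outcome", "exposure", "age", "sex"]


def is_present(value):
    """Treat None and empty strings as missing; 0 is a valid value."""
    return value is not None and value != ""


def variable_coverage(records, variables):
    """Fraction of records with a non-missing value, per variable."""
    total = len(records)
    return {
        var: (sum(1 for r in records if is_present(r.get(var))) / total if total else 0.0)
        for var in variables
    }


def complete_case_count(records, variables):
    """Number of records in which every key variable is present."""
    return sum(1 for r in records if all(is_present(r.get(v)) for v in variables))


# Made-up source records.
records = [
    {"outcome": 1, "exposure": "drug_a", "age": 64, "sex": "F"},
    {"outcome": 0, "exposure": "drug_a", "age": None, "sex": "M"},
    {"outcome": None, "exposure": "drug_b", "age": 71, "sex": "F"},
    {"outcome": 1, "exposure": "drug_b", "age": 58, "sex": ""},
]

coverage = variable_coverage(records, KEY_VARIABLES)
n_complete = complete_case_count(records, KEY_VARIABLES)
```

Comparing `n_complete` with the planned sample size gives an early, conservative signal of how much attrition governance might cause.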

(III) Suitability Assessment of Governed Data

The suitability assessment of governed real-world data is primarily based on data relevance and reliability.

1. Relevance Assessment

Relevance assessment aims to evaluate whether real-world data is closely related to the clinical question of interest, with a focus on the coverage of key variables, the accuracy of exposure/intervention and clinical outcome definitions, the representativeness of the target population, and the fusion of multi-source heterogeneous data.

(1) Coverage of key variables and information

Real-world data should contain important variables and information related to clinical outcomes, such as drug use, patient demographic and clinical characteristics, covariates, outcome variables, follow-up time, and potential safety information. Where some of the above variables are missing, a thorough assessment should be made of whether reliable statistical methods can be used to impute them, and of the possible effect on causal inference results.

(2) Accuracy of exposure/intervention and clinical outcome definitions

Selecting and accurately defining clinically meaningful outcomes, and accurately defining exposures/interventions, are critical for real-world research; these definitions should be consistent with the clinical significance or theoretical basis of the research question. The definition of a clinical outcome should specify the diagnostic criteria on which it is based, the measurement methods and their quality control (if any), measurement instruments (such as scales), calculation methods, measurement time points, variable types, variable type conversions (such as from quantitative to qualitative), and endpoint event adjudication mechanisms (such as the operating mechanism of the endpoint event adjudication committee). When different data sources use inconsistent definitions for clinical outcomes, a unified clinical outcome definition should be established and a reliable conversion method adopted. The definition of exposures/interventions should account for the reasonableness of the time window.

(3) Representativeness of the target population

One of the advantages of real-world research over traditional RCTs is a broader representativeness of the target population. Therefore, when formulating inclusion and exclusion criteria, these should, as far as possible, be consistent with the target population in the real-world setting.

(4) Fusion of multi-source heterogeneous data

Real-world data often consists of heterogeneous data from multiple sources, which must be linked, fused, and homogenised at the individual level. Accurate individual-level linkage through identifier variables should therefore be used, so that key variables from the different data sources can be integrated using a common data model or data standards.
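
Individual-level linkage through an identifier variable might look like the following minimal sketch. It assumes two hypothetical sources (an electronic medical record extract and a registry in which each patient should appear once) that share a `patient_key` field; the field names and records are invented. Records that cannot be linked unambiguously are reported rather than silently merged.

```python
def link_by_identifier(emr_rows, registry_rows, key="patient_key"):
    """Deterministic one-to-one linkage on a shared identifier.

    Returns (linked, unmatched, ambiguous): merged rows, plus the keys of
    EMR rows with no registry match or with more than one registry match.
    """
    registry_index = {}
    for row in registry_rows:
        registry_index.setdefault(row[key], []).append(row)

    linked, unmatched, ambiguous = [], [], []
    for row in emr_rows:
        matches = registry_index.get(row[key], [])
        if len(matches) == 1:
            # Prefix registry fields so their origin stays visible after merging.
            extra = {f"registry_{k}": v for k, v in matches[0].items() if k != key}
            linked.append({**row, **extra})
        elif not matches:
            unmatched.append(row[key])
        else:
            ambiguous.append(row[key])
    return linked, unmatched, ambiguous


# Made-up example data.
emr = [
    {"patient_key": "P1", "diagnosis": "I10"},
    {"patient_key": "P2", "diagnosis": "E11"},
    {"patient_key": "P3", "diagnosis": "J45"},
]
registry = [
    {"patient_key": "P1", "enrolled": "2020-01-15"},
    {"patient_key": "P3", "enrolled": "2020-02-01"},
    {"patient_key": "P3", "enrolled": "2020-06-30"},
]

linked, unmatched, ambiguous = link_by_identifier(emr, registry)
```

Keeping the unmatched and ambiguous keys makes the linkage rate reportable, which supports the transparency expectations described later in this section.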

2. Reliability Assessment

The reliability of real-world data is primarily assessed from the perspectives of data completeness, accuracy, transparency, quality control, and quality assurance.

(1) Completeness

Completeness refers to the degree of missing data, including missing variables and missing variable values. For different studies, the degree of missingness, the distribution of missingness, the reasons for missingness, and the missing mechanisms of variable values will vary; these should be described in detail. When the proportion of missing data in a given study significantly exceeds the proportion in comparable studies, the uncertainty of research conclusions is increased; in such cases careful consideration should be given to whether the data can serve as data supporting the generation of real-world evidence. A detailed analysis of the reasons for missingness is helpful for an overall judgment of data reliability. Where imputation of missing data is involved, an appropriate imputation method should be adopted based on a reasonable assumption about the missing mechanism.

(2) Accuracy

Accuracy refers to whether the data is consistent with the objective characteristics it describes, including whether the source data is accurate, whether variable values are within a reasonable range, whether the trend in outcome variables over time is reasonable, and whether code mapping relationships correspond uniquely. Data accuracy requires identification and verification against a relatively authoritative reference — for example, whether an endpoint event has been adjudicated by an independent endpoint event adjudication committee.
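
Two of the accuracy checks named above, value ranges and unique code mapping, can be sketched as follows. The plausibility limits, field names, and local-to-standard code pairs are hypothetical; real limits and mappings would come from the data governance plan.

```python
# Hypothetical plausibility limits, for illustration only.
PLAUSIBLE_RANGES = {"systolic_bp_mmhg": (50, 300), "age_years": (0, 120)}


def out_of_range(records, ranges):
    """List (row index, variable, value) for values outside their limits."""
    findings = []
    for index, record in enumerate(records):
        for variable, (low, high) in ranges.items():
            value = record.get(variable)
            if value is not None and not (low <= value <= high):
                findings.append((index, variable, value))
    return findings


def non_unique_mappings(mapping_pairs):
    """Return local codes that map to more than one standard code."""
    targets = {}
    for local_code, standard_code in mapping_pairs:
        targets.setdefault(local_code, set()).add(standard_code)
    return {local: sorted(codes) for local, codes in targets.items() if len(codes) > 1}


# Made-up records and mapping table.
records = [
    {"systolic_bp_mmhg": 128, "age_years": 67},
    {"systolic_bp_mmhg": 1280, "age_years": 54},
    {"systolic_bp_mmhg": None, "age_years": 140},
]
mapping_pairs = [("HTN", "I10"), ("DM2", "E11"), ("DM2", "E11.9"), ("HTN", "I10")]

range_findings = out_of_range(records, PLAUSIBLE_RANGES)
mapping_conflicts = non_unique_mappings(mapping_pairs)
```

The functions only report findings; whether a flagged value is an error still has to be verified against the source.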

(3) Transparency

The transparency of real-world data means that the governance plan and governance process for real-world data are clear and transparent. It should be ensured that key exposure/intervention variables, covariates, and outcome variables in the analytical dataset can be traced back to the source data, reflecting the extraction, cleaning, transformation, and standardisation process. Whether processed manually or through automated procedures, the standard operating procedures for data governance and verification and validation documents should be clearly recorded and archived — in particular issues reflecting data credibility, such as the degree of missingness, variable value ranges, methods for calculating derived variables, and mapping relationships. The data governance plan should be formulated in advance in accordance with the research purpose, and consistency between the data governance process and the governance plan should be ensured. Data transparency also includes transparency of data accessibility, information sharing among databases, and methods for protecting patient privacy. Where algorithms are used to define research cohorts, the development and validation of the algorithm should also be transparent.

(4) Quality control

Quality control refers to the techniques and activities implemented to verify that each step of data governance meets quality requirements. Quality control assessment includes, but is not limited to: whether data extraction, secure processing, cleaning, and structuring, and subsequent storage, transmission, analysis, and submission steps all have quality control in place to ensure that all data is reliable and that data processing is correct; and whether a complete, standardised, and reliable data governance plan is followed, supported by corresponding data quality inspection and system validation procedures, to ensure that the data governance system operates normally and in a steady state and that the accuracy and reliability of real-world data are assured.

(5) Quality assurance

Quality assurance refers to systematic measures to prevent, detect, and correct data errors or problems arising during the research process. Quality assurance for real-world data is closely associated with regulatory compliance and should run through every step of data governance. Considerations include, but are not limited to: whether a research plan, protocol, and statistical analysis plan relating to real-world data have been established; whether there are corresponding standard operating procedures; whether there is a clear process and qualified personnel for data collection; whether a common definition framework (i.e., a data dictionary) is used; whether a common time framework for collecting key data variables is observed; whether the technical methods used to capture data elements comply with pre-specified technical specifications and operating procedures, including integration of data from various sources, recording of drug use and laboratory test data, follow-up records, and linkage with other databases; whether data entry is timely and transmission is secure; and whether requirements for regulatory authority on-site inspection and access to source data and source documents are satisfied.

Section IV — Real-World Data Governance

Data governance means the processing applied to raw data, for a specific clinical research question, to render the data suitable for statistical analysis. Its content includes, but is not limited to: data security processing, data extraction (including from multiple data sources), data cleaning (logical checks and abnormal data processing, and missing data processing), data transformation (data standards, common data model, normalisation, natural language processing, medical coding, derived variable calculation), data transmission and storage, and data quality control.

(I) Personal Information Protection and Data Security Processing

Real-world research involving personal information protection should comply with national information security technical specifications and relevant regulations on the security management of medical big data. Sensitive personal information should undergo de-identification processing, ensuring that sensitive personal information cannot be matched and re-identified from the data; and technical and managerial measures should be taken to prevent the leakage, damage, loss, or tampering of personal information.

Data security processing should be based on the types, quantities, nature, and content of the various data involved in the research — especially sensitive personal information — and should establish data encryption technical requirements, risk assessment, and emergency response operating procedures for each stage of data governance, and conduct audits of the effectiveness of security measures.
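
One common building block for de-identification is replacing direct identifiers with a keyed-hash token, sketched below with Python's standard `hmac` and `hashlib` modules. This is an illustrative assumption, not a technique prescribed by these principles, and the field names and key are invented. Keyed hashing alone does not guarantee that records cannot be re-identified, since combinations of remaining fields may still single out a person.

```python
import hashlib
import hmac

# Hypothetical names of fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "national_id", "phone"}


def pseudonymise(record, secret_key, id_field="national_id"):
    """Drop direct identifiers and add a keyed-hash token in their place."""
    token = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["patient_token"] = token
    return cleaned


# Demo key only; a real key would be generated and stored under access control.
demo_key = b"demo-key-not-for-production"
raw = {
    "name": "Example Patient",
    "national_id": "ID-0001",
    "phone": "000-0000",
    "diagnosis": "I10",
}
safe = pseudonymise(raw, demo_key)
```

Because the same key and identifier always yield the same token, records for one person can still be linked across extracts without exposing the identifier itself.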

(II) Data Extraction

An appropriate method should be selected for data extraction based on factors such as the storage format of source data, whether it is electronic data, and whether it contains unstructured data. The following principles should be observed during data extraction:

The data extraction method should be validated to ensure that the extracted data meets the requirements of the research protocol. Data extraction should ensure consistency between the extracted raw data and the source data; timestamp management should be applied to the extracted raw data and the source data.

Using data extraction tools that are interoperable with or integrated into the source data system can reduce errors in data transcription, thereby improving data accuracy and the quality and efficiency of data collection in clinical research.
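
Consistency between extracted data and source data, together with timestamp management, could be checked along the lines of the sketch below. It records a manifest (extraction time, row count, and a SHA-256 fingerprint of the selected fields) when data leaves the source, and later verifies a received copy against it. The manifest layout and field names are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """SHA-256 over a canonical JSON rendering of the rows (row order matters)."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(source_rows, fields):
    """Describe what was extracted from the source, and when."""
    projected = [{f: row.get(f) for f in fields} for row in source_rows]
    return {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "fields": list(fields),
        "row_count": len(projected),
        "sha256": fingerprint(projected),
    }


def verify_extract(extracted_rows, manifest):
    """True when the received copy matches the manifest exactly."""
    return (
        len(extracted_rows) == manifest["row_count"]
        and fingerprint(extracted_rows) == manifest["sha256"]
    )


# Made-up source rows; "ward" is not part of the extraction.
source = [
    {"visit_id": "V1", "drug": "drug_a", "dose_mg": 10, "ward": "3B"},
    {"visit_id": "V2", "drug": "drug_b", "dose_mg": 5, "ward": "2A"},
]
manifest = build_manifest(source, ["visit_id", "drug", "dose_mg"])

good_copy = [
    {"visit_id": "V1", "drug": "drug_a", "dose_mg": 10},
    {"visit_id": "V2", "drug": "drug_b", "dose_mg": 5},
]
bad_copy = [
    {"visit_id": "V1", "drug": "drug_a", "dose_mg": 100},  # transcription error
    {"visit_id": "V2", "drug": "drug_b", "dose_mg": 5},
]
good_ok = verify_extract(good_copy, manifest)
bad_ok = verify_extract(bad_copy, manifest)
```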

(III) Data Cleaning

Data cleaning refers to the removal of duplicated or redundant data from extracted raw data, logical checks of variable values and processing of abnormal values, and handling of missing data. Note that a data correction must be traceable to signed confirmation by the principal investigator or the party responsible for the source data; where it is not, the data should not be modified, so as to preserve data authenticity.

First, duplicated data and irrelevant data should be removed while preserving data integrity. Duplicated data may arise during the merging of different data sources and must be removed. At the same time, inaccuracies in the mapping relationship between data sources and the common data model may result in the collection of data irrelevant to the research objective; removing unnecessary observations from the dataset can reduce unnecessary effort.

Logical checks and abnormal data processing should then be conducted. Logical checks can reveal errors in raw data or in the data extraction process — for example, a discharge date earlier than an admission date, a date of birth inconsistent with age by calculation, laboratory test results outside the realistic range, and qualitative judgment results inconsistent with the judgment criteria defined in the protocol. Processing of abnormal data must be carried out with great care to avoid resulting bias. Any errors and abnormal data identified should be verified through further investigation before modifying the data; all modifications should be documented.

Finally, missing data should be handled during statistical analysis. For different studies, the degree, reasons, and missing mechanisms of variable values will vary. Where imputation of missing data is involved, an appropriate imputation method should be adopted based on a reasonable assumption about the missing mechanism.
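
The first two cleaning steps described above, removing duplicates and running logical checks, can be sketched as follows. The field names, the duplicate key, and the one-year tolerance between recorded and computed age are assumptions for illustration. The checks only flag records; in line with the text, nothing is modified before the finding has been verified.

```python
from datetime import date


def drop_duplicates(rows, key_fields):
    """Keep the first row for each key; return (kept, removed) for the record."""
    seen, kept, removed = set(), [], []
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


def age_on(day, birth_date):
    """Age in completed years on a given day."""
    before_birthday = (day.month, day.day) < (birth_date.month, birth_date.day)
    return day.year - birth_date.year - before_birthday


def logic_issues(row, age_tolerance_years=1):
    """Return descriptions of logical inconsistencies; the row is not changed."""
    issues = []
    if row["discharge_date"] < row["admission_date"]:
        issues.append("discharge date earlier than admission date")
    computed = age_on(row["admission_date"], row["birth_date"])
    if abs(computed - row["recorded_age"]) > age_tolerance_years:
        issues.append(f"recorded age {row['recorded_age']} but computed age {computed}")
    return issues


# Made-up rows: V1 appears twice, V2 has inconsistent dates and age.
visit_1 = {
    "visit_id": "V1", "admission_date": date(2021, 3, 1),
    "discharge_date": date(2021, 3, 5), "birth_date": date(1950, 6, 15),
    "recorded_age": 70,
}
visit_2 = {
    "visit_id": "V2", "admission_date": date(2021, 3, 10),
    "discharge_date": date(2021, 3, 2), "birth_date": date(1980, 1, 1),
    "recorded_age": 55,
}
rows = [visit_1, dict(visit_1), visit_2]

kept, removed = drop_duplicates(rows, ["visit_id", "admission_date"])
flagged = {}
for row in kept:
    issues = logic_issues(row)
    if issues:
        flagged[row["visit_id"]] = issues
```

Returning the removed rows and the flagged issues, rather than discarding them, leaves a record that can be archived with the governance documentation.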

(IV) Data Transformation

Data transformation is the process of uniformly converting cleaned raw data, including its data format standards, medical terminology, coding standards, and derived variable calculations, into fit-for-use real-world data in accordance with the corresponding standards of the analytical database.

For the transformation of free-text data, reliable natural language processing algorithms may be used to improve transformation efficiency, provided that data transformation accuracy and traceability are ensured.

When calculating derived variables, the raw data variables and values used for calculation, the calculation method, and the definition of derived variables should be clearly specified, and timestamp management should be applied, to ensure data accuracy and traceability.
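
One way to keep a derived variable traceable is to store, alongside its value, the inputs used, the calculation definition, and the time of derivation. The sketch below does this for body mass index; the choice of BMI, the field names, and the record layout are assumptions made for illustration.

```python
from datetime import datetime, timezone


def derive_bmi(rec: dict) -> dict:
    """Derive BMI (kg/m^2) and record what it was computed from, how, and when."""
    weight_kg, height_cm = rec["weight_kg"], rec["height_cm"]
    bmi = round(weight_kg / (height_cm / 100) ** 2, 1)
    return {
        "patient_id": rec["patient_id"],
        "variable": "bmi",
        "definition": "weight_kg / (height_cm / 100) ** 2, rounded to 1 decimal",
        "inputs": {"weight_kg": weight_kg, "height_cm": height_cm},
        "value": bmi,
        "derived_at": datetime.now(timezone.utc).isoformat(),  # timestamp of derivation
    }


derived = derive_bmi({"patient_id": "P001", "weight_kg": 70.0, "height_cm": 175.0})
print(derived["value"], derived["derived_at"])
```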

(V) Data Transmission and Storage

The transmission and storage of real-world data should be based on a trusted network security environment and controlled throughout the full lifecycle from data collection, processing, and analysis through to destruction. Encryption protection should be applied during both data transmission and storage. In addition, approval procedures for operational settings, role-based access controls, and minimum-authorisation access control policies should be established; the establishment of automated audit systems to monitor and record data processing and access activities is encouraged.
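
Role-based access control under a minimum-authorisation policy can be sketched as a deny-by-default permission lookup in which every access attempt is recorded. The roles, actions, and log layout below are hypothetical; encryption and network security are outside the scope of this sketch.

```python
# Hypothetical roles and permissions; real ones would come from approved operational settings.
ROLE_PERMISSIONS = {
    "data_manager": {"read", "write"},
    "statistician": {"read"},
    "auditor": {"read_audit_log"},
}

access_log = []  # every attempt is recorded, whether it is allowed or not


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: only actions explicitly granted to the role are permitted."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    access_log.append({"role": role, "action": action, "allowed": allowed})
    return allowed


print(is_allowed("statistician", "read"))   # granted explicitly
print(is_allowed("statistician", "write"))  # not granted, so denied
print(is_allowed("visitor", "read"))        # unknown role, so denied
```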

(VI) Data Quality Control

Data quality control is key to ensuring the integrity, accuracy, and transparency of research data. Data quality control requires establishing a comprehensive real-world data quality management system and standard operating procedures. Recommended principles include:

1. Ensuring the accuracy and authenticity of source data.

For electronic medical records serving as a key data source, medical record quality control standards should be in place to satisfy analytical requirements. Disease descriptions, diagnoses, and medication information originating from outpatient visits require supporting evidence chains. For any modification made during the entry process, confirmation and signature by the responsible person are required, along with the reason for modification, ensuring that a complete audit trail is maintained.

2. Giving full consideration to data integrity issues during data extraction.

Extraction fields should be assessed and established, and corresponding verification rules and database architecture developed.

3. Formulating a comprehensive data quality management plan.

A plan covering both system-based and manual quality control should be formulated to ensure data accuracy and completeness. For key variables, comprehensive verification and source document review should be conducted; other variables may be sampled in accordance with actual circumstances. For example, demographic information, numerical variable thresholds, and coding mapping relationships may be sampled at a certain proportion to verify their accuracy and reasonableness.
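
The split between full verification of key variables and proportional sampling of other variables can be sketched as follows. The variable names, the 10% proportion, and the fixed random seed are assumptions for illustration; the guidance does not prescribe a sampling proportion.

```python
import random


def qc_sample(record_ids, key_variables, other_variables, proportion, seed=0):
    """Plan source verification: all records for key variables, a random sample for the rest."""
    ids = list(record_ids)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced and audited
    n = min(len(ids), max(1, round(len(ids) * proportion)))
    plan = {v: list(ids) for v in key_variables}
    for v in other_variables:
        plan[v] = sorted(rng.sample(ids, n))
    return plan


ids = [f"P{i:03d}" for i in range(1, 101)]
plan = qc_sample(ids, ["primary_outcome"], ["sex", "birth_date"], proportion=0.1)
print({variable: len(selected) for variable, selected in plan.items()})
```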

(VII) Common Data Model

The common data model is a data model for the rapid centralisation and standardised processing of multi-source heterogeneous data under a multidisciplinary collaboration model; its main function is to convert source data with different standards into a uniform structure, format, and terminology, so as to enable data integration across databases/datasets.

Because multi-source data are complex in structure and type and differ in sample size and standards, converting source data into a common data model requires extraction, transformation, and loading of the source data; the source data should be made consistent, both syntactically and semantically, with the structure and terminology of the target analytical database (see Figure 2).

An ideal common data model should adhere to the following principles:

1. A common data model may be defined as a data governance mechanism through which source data can be standardised into a common structure, format, and terminology, thereby permitting data integration across multiple databases/datasets. A common data model should have the capability to access source data, be a dynamically extensible and continuously improvable data model, and have version control.
2. The definition, measurement, merging, recording, and corresponding validation of variables in the common data model should remain transparent; rules for converting data from multiple databases should be clear and consistent.
3. Common variables or concepts relating to safety and efficacy should all be mapped to the common data model so as to be applicable to different clinical research questions, and should be verifiable against established or known research results.
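
As a minimal sketch of converting source data with different structures and terminology into one common form, the example below maps records from two hypothetical source systems onto the same field names and sex terminology, and tags each output with its source and a model version. All source names, field names, codes, and the version string are invented for illustration and do not correspond to any published common data model.

```python
# Hypothetical, explicit conversion rules: source field name -> common field name.
FIELD_MAP = {
    "hospital_a": {"pid": "patient_id", "sex_cd": "sex", "dx": "diagnosis_code"},
    "registry_b": {"subject": "patient_id", "gender": "sex", "icd10": "diagnosis_code"},
}
# Hypothetical terminology mapping for one variable.
SEX_TERMS = {"1": "male", "2": "female", "M": "male", "F": "female"}
CDM_VERSION = "0.1"  # version of the mapping rules, kept with every record


def to_cdm(source: str, row: dict) -> dict:
    """Map one source row to the common structure; unmapped fields or terms raise KeyError."""
    out = {target: row[src] for src, target in FIELD_MAP[source].items()}
    out["sex"] = SEX_TERMS[out["sex"]]
    out["source"] = source
    out["cdm_version"] = CDM_VERSION
    return out


a = to_cdm("hospital_a", {"pid": "A-1", "sex_cd": "1", "dx": "E11.9"})
b = to_cdm("registry_b", {"subject": "B-7", "gender": "F", "icd10": "I10"})
print(a)
print(b)
```

Raising an error on an unmapped field or term, rather than passing the value through, is a choice made here so that gaps in the mapping rules surface during conversion instead of in the analytical database.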

(VIII) Real-World Data Governance Plan

A real-world data governance plan should be formulated in advance, synchronised with the overall project research plan. If the governance plan requires revision during the course of the research, the review authority should be consulted and an updated governance plan submitted simultaneously. The plan should state the purpose of using real-world data for regulatory decision-making and the research design employing real-world data. It should also describe the real-world source data, including but not limited to:

1. The type of real-world data source/source files, for example health information system data, disease registry study data, and medical insurance data.
2. An appropriate evaluation of the prior use of the real-world data source/source files, and a statement of the reasons for adopting them.
3. The governance of real-world data, i.e., the governance process from real-world data sources to the analytical database.
4. The data model and data standards adopted.
5. The method for handling missing data.
6. Measures taken to reduce or control potential biases arising from the use of real-world data.
7. Quality control and quality assurance.
8. The suitability assessment of real-world data.

Section V — Compliance, Security, and Quality Management System for Real-World Data

(I) Data Compliance

Real-world data originates from diverse sources including individual patients’ diagnostic and treatment processes. The collection, processing, and use of such data involve ethical and patient privacy issues. In order to fully protect the safety and rights and interests of patients, the acquisition and use of real-world data for real-world research must pass review and approval by an ethics committee. Persons involved in real-world data governance are required to strictly observe the requirements of relevant laws and regulations; sponsors should strictly enforce these requirements and fulfil their obligations of protection and management.

(II) Data Security Management

Data security management should be carried out in accordance with national laws and regulations and industry regulatory requirements; necessary security protections should be applied to information systems and network facilities bearing health and medical data, as well as to cloud platforms. The scope of data security protection should cover every lifecycle stage, including data collection, extraction, transmission, storage, exchange, and destruction. Encryption technology should be used to ensure the integrity, confidentiality, and traceability of data throughout the processes of collection, extraction, transmission, and storage; where data is transmitted via media, the media should be subject to controls. Different protective measures should be adopted for different data formats on different media, and corresponding access control mechanisms should be established; access records should be reviewed, registered, archived, and audited.

Data auditing and related operating procedures provide a record and basis for data collection, extraction, transmission, maintenance, storage, sharing, and use; they should include personnel audits, management audits, and technical audits. Medical information system activity audit policies and appropriate standard operating procedures should be formulated and deployed. The scope of auditing should cover any operation on data in any state — including log-in, creation, modification, and deletion of records — all of which should automatically generate time-stamped audit records, including but not limited to authorisation information, time of operation, reason for operation, content of operation, operator identity, and signature; these records should be available for audit. Audit records should be stored securely and an access control policy established.
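
The audit record fields listed above can be represented as an immutable, time-stamped entry that is created automatically for each operation. The sketch below is an in-memory illustration only: the class name, field names, and example values are hypothetical, and a production system would also need secure storage and access control for the log itself.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be altered once created
class AuditRecord:
    operator: str       # operator identity
    authorisation: str  # authorisation information, e.g. the role acted under
    operation: str      # e.g. "log-in", "create", "modify", "delete"
    content: str        # content of the operation
    reason: str         # reason for the operation
    signature: str      # signature of the operator
    timestamp: str      # time of operation, generated automatically


audit_log = []


def record_operation(operator, authorisation, operation, content, reason, signature):
    """Create a time-stamped audit entry and add it to the log."""
    entry = AuditRecord(operator, authorisation, operation, content, reason, signature,
                        datetime.now(timezone.utc).isoformat())
    audit_log.append(entry)
    return entry


entry = record_operation(
    operator="u1024",
    authorisation="data_manager",
    operation="modify",
    content="discharge_date: 2020-03-05 -> 2020-03-15",
    reason="corrected after source document review",
    signature="u1024/signed",
)
print(asdict(entry))
```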

(III) Quality Management System

A complete quality management system should be established to standardise real-world data processing procedures and to continuously optimise and improve them in the course of actual operations. Basic quality elements should cover:

1. Operating procedures covering the full lifecycle management of real-world data, established to ensure the quality of real-world data.
2. Computerised system functions that meet the management requirements of real-world data and comply with relevant regulatory requirements for computerised systems.
3. A comprehensive personnel management system, with persons involved in data collection, governance, and analysis trained appropriately and meeting competency requirements for their responsibilities, and with standardised management of personnel permissions.
4. A risk management process for every stage from data collection through data submission.
5. Standard information and document management specifications (for paper and electronic media), formulated to ensure that records of real-world data processing procedures are complete, accurate, and transparent, protecting data security and compliance.

Section VI — Communication with Regulatory Authorities

To ensure that the quality of real-world data meets regulatory requirements, applicants are encouraged to communicate with regulatory authorities in a timely manner. Before a real-world study formally commences, communication should take place, based on the overall development strategy and the specific research protocol, on whether real-world data can support the generation of real-world evidence, including the accessibility of real-world data, whether the sample size is sufficiently large, whether the data governance plan is reasonable and feasible, and whether data quality can be assured. During the course of the research, if the data governance plan is adjusted in response to changes in study implementation, the sponsor should weigh the potential impact of the adjustment on the study objectives, give the regulatory authority adequate justification for the adjustment, obtain its agreement, and simultaneously submit an updated research protocol and data governance plan. After completion of the research and before submission, the sponsor may consult with the regulatory authority regarding submission materials and databases.

Annex 1 — Glossary

Electronic Medical Record (EMR): An electronic record of health-related information on an individual patient, created, collected, managed, and accessed by authorised clinical professionals within a healthcare institution.

Electronic Health Record (EHR): An electronic record of health-related information on an individual patient that complies with nationally recognised interoperability standards and that can be created, managed, and consulted by authorised clinical professionals across multiple healthcare institutions.

Observational Study: A study that, based on a specific research question, does not impose active intervention and uses general or clinical populations as its subjects to explore the causal relationship between exposure/treatment and outcomes.

Patient-Reported Outcome (PRO): An indicator measuring and evaluating disease outcomes from the patient’s own perspective, encompassing symptoms, physiological factors, psychological factors, and satisfaction with healthcare services. Records may be on paper or electronic (the latter are called electronic patient-reported outcomes, ePRO).

Edit Check (Logical Check): A check on the validity of clinical research data entered into a computer system, primarily evaluating whether the entered data has logical errors with respect to the expected numerical logic, value ranges, or value attributes.

Data Standard: A set of rules governing how a specific type of data is to be structured, defined, formatted, or exchanged among computer systems. Data standards enable submitted materials to be predictable and consistent, and in a form usable by information technology systems or scientific tools.

Data Cleaning: The process of identifying and correcting noise in data so as to minimise the impact of noise on data analysis results. Noise in data principally includes incomplete data, redundant data, conflicting data, and erroneous data.

Data Linkage: The merging, association, and combination of data and information from multiple sources to form a unified dataset.

Data Element: A single observation of a study participant recorded in clinical research — for example, date of birth, white blood cell count, pain severity, and other clinical observations.

Data Curation (Data Governance): The governance performed on raw data — targeted at a specific clinical research question — in order to render it suitable for statistical analysis; its content at minimum includes data extraction (from multiple data sources), data security processing, data cleaning (logical checks and abnormal data processing, and data integrity processing), data transformation (common data model, normalisation, natural language processing, medical coding, derived variable calculation), data quality control, and data transmission and storage.

Common Data Model (CDM): A data model for the rapid centralisation and standardised processing of multi-source heterogeneous data under a multidisciplinary collaboration model; its main function is to convert source data with different data standards into a uniform structure, format, and terminology, so as to enable data integration across databases/datasets.

Source Data: All information in original records and certified copies of original records of clinical symptoms, observations, and other activities in clinical research used to reconstruct and evaluate the study. Source data is contained in source documents (including original records or their valid copies).

Real-World Data (RWD): All data collected in the course of daily practice that relates to patients’ health status and/or their diagnosis, treatment, and healthcare. Not all real-world data, once analysed, can become real-world evidence; only real-world data satisfying suitability requirements may potentially generate real-world evidence.

Real-World Research/Study (RWR/RWS): A research process targeting a clinical research question, in which data relating to the health status and/or diagnosis, treatment, and healthcare of research subjects collected in a real-world setting (real-world data), or aggregate data derived from such data, is analysed to obtain clinical evidence (real-world evidence) on the use value and potential benefits and risks of a drug.

Real-World Evidence (RWE): Clinical evidence on drug use and potential benefits and risks obtained through appropriate and adequate analysis of fit-for-purpose real-world data.
