Open-Source Does Not Mean Open Data — Zhang Ping on Training-Data Compliance for Open-Source AI

Editor’s Note — DCC.

Zhang Ping, professor at Peking University Law School, published this piece in 人民论坛 (People’s Tribune) — a state-affiliated theoretical journal of People’s Daily Publishing Group — as part of its 前沿 (“Frontier”) column on emerging legal and policy questions. The piece takes direct aim at two confusions that have dominated Chinese open-source AI discussion in the wake of DeepSeek, Qwen, and the broader open-weight wave: the conflation of “model weight open-source” with “training data open-source,” and the inference from “available on the internet” to “available for training.” DCC reproduces Zhang’s framework with framing for overseas counsel structuring China-related AI-model and training-data deployments.

The two misconceptions

Two patterns Zhang sees repeatedly in Chinese practitioner discussion:

Misconception 1: Open-source AI means training data has no copyright protection. Wrong. Open-source is conditional authorization based on a license. The licensor retains copyright; the licensee gets specified rights only within the license scope. “Publicly accessible” content on the internet is not the same as “available for training”: most internet content is protected by copyright, the Personal Information Protection Law (PIPL), or trade-secret regimes.

Misconception 2: Algorithm open-source compels simultaneous training-data publication. Wrong. Model weights and training data are two distinct objects subject to different legal rules. An enterprise can open-source the model architecture and weights while maintaining commercial autonomy over the training corpus. Doing so is both legally compliant and commercially coherent — and is, in fact, the standard practice for most major open-weight releases (the model is open; the data is not).

The misconceptions are not academic. They drive operational behavior: teams scraping web data on the assumption that “open = public domain,” teams publishing model weights and assuming the training data must follow, teams negotiating with data suppliers under wrong assumptions about default rights. Each produces a downstream compliance failure.

The full-chain risk

Open-source AI training-data use creates risk at every stage of the data lifecycle. Zhang divides the picture into three stages:

Acquisition stage

The dominant operational mode is automated crawling at scale. The legal problem is that license-chain traceability collapses as volume grows. A scraped page may host content under multiple licenses, with licensing terms buried in linked agreements or invisible metadata. Aggregation across millions of sources produces a corpus where the original license terms are no longer traceable per item.

This produces what Zhang calls a “license laundering” (许可洗钱) effect — a striking term that captures how copyright-protected content becomes operationally indistinguishable from public-domain content once it’s been processed through a crawling and tokenization pipeline. The downstream operator cannot, in practice, separate the legitimately licensed content from the infringing content in the resulting corpus. From a compliance posture, every byte in the corpus carries inherited license-uncertainty.

Processing stage

Once acquired, training data that contains personal information (PI) becomes subject to protection obligations under the PIPL, and these obligations are technically difficult to discharge. Two specific gaps:

The right of deletion (删除权). PIPL Article 47 establishes the deletion right; for personal information in a training corpus, exercising the right is technically non-trivial. Once a model has been trained on a dataset, removing a specific data point requires retraining or specialized “machine unlearning” techniques that are still maturing. The legal right exists; the operational mechanism is incomplete.
Purpose limitation. PIPL Article 6 limits processing to the disclosed purpose. Training data that was lawfully collected for a stated purpose (e.g., medical-records research) cannot, without additional consent, be redirected to a different purpose (e.g., training a foundation model for general use). Compliance teams underestimate how aggressively this constrains corpus repurposing.
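
The purpose-limitation constraint can be expressed as a gate at corpus assembly. The following is a minimal sketch, not a statement of what the PIPL requires in any given case; the `SourceRecord` schema, the `consented_purposes` field, and the purpose labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRecord:
    """One corpus item plus the purposes disclosed at collection (hypothetical schema)."""
    record_id: str
    contains_pi: bool
    consented_purposes: frozenset = field(default_factory=frozenset)


def may_use_for(record: SourceRecord, intended_purpose: str) -> bool:
    """Return True only if the intended use is covered by the disclosed purposes.

    Records without personal information pass this gate unchanged;
    other gates (licensing, trade secrets) would still apply to them.
    """
    if not record.contains_pi:
        return True
    return intended_purpose in record.consented_purposes


records = [
    SourceRecord("r1", True, frozenset({"medical_research"})),
    SourceRecord("r2", True, frozenset({"medical_research", "foundation_model_training"})),
    SourceRecord("r3", False),
]

# Keep only records whose disclosed purposes cover the new use.
usable = [r.record_id for r in records if may_use_for(r, "foundation_model_training")]
print(usable)
```

Here `r1` is held back because its only disclosed purpose is medical research, which mirrors the repurposing example above. In practice the hard part is recording the disclosed purposes reliably at collection time, which this sketch simply assumes has been done.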

Output stage

The model may, in inference, reproduce specific expressions from the training corpus — verbatim or near-verbatim. This triggers two distinct legal vectors:

Copyright infringement. Where the reproduced expression is identifiable as a copyrighted work, the model output may infringe; the deployer is exposed under direct or indirect infringement analysis.
PI re-identification. Where the reproduced expression contains personal information from the training corpus, the model output may constitute an unauthorized disclosure of personal information — even if the training input was processed with appropriate consent.

The output-stage risk is structurally novel because it implicates not the data acquisition or processing decisions but the model’s emergent behavior. Standard compliance-review postures designed for static-data flows don’t capture it.

The four-tier differentiated governance framework

Zhang’s most operationally useful contribution is a four-tier classification of training data with corresponding compliance gates. The tiers:

Tier 1 — Open-license or public-domain data

Lowest risk. Data under open licenses (Creative Commons, Apache, MIT, etc.) or in the public domain. Compliance posture: prioritize use; document the license; preserve attribution where required.

Tier 2 — Publicly accessible but with unclear licensing

Moderate risk. Data scraped from the web with unclear or untraceable license terms. Compliance posture: active license-chain verification before inclusion. If verification fails, exclude from corpus or restrict downstream model use. Crucially, publicly accessible ≠ licensed for training — Tier 2 requires affirmative documentation, not absence of explicit prohibition.

Tier 3 — Data containing personal information

High risk. Compliance posture: strict PIPL handling. De-identification or anonymization at the earliest possible pipeline point. A personal information protection impact assessment (PIIA) prior to inclusion. Documentation of legal basis (consent, contractual necessity, statutory exemption). Separate handling protocols for sensitive PI under PIPL Article 28.

Tier 4 — Important data or trade secrets

Highest risk. Data within the important data category under the Data Security Law (DSL) classification regime, or third-party trade secrets. Compliance posture: highest-tier security protection. Access controls, encryption, audit logging, segregation from general-corpus pipelines. Separate review and approval gates. For important data, additional consideration of cross-border restrictions under the Measures for the Security Assessment of Data Export.

The four-tier framework is the operational analog of the three-tier data classification (general / important / core) Wang Qinglan walked through for general data assets, applied specifically to AI training corpora.
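
For teams that want to encode the tiers in a pipeline, one possible shape is sketched below. Two assumptions are ours rather than Zhang's: that a record matching several tiers is assigned the highest-risk one, and that the three boolean inputs have already been determined upstream. The names `Tier`, `classify`, and `GATES` are illustrative.

```python
from enum import IntEnum


class Tier(IntEnum):
    OPEN_LICENSE = 1         # open-license or public-domain data
    UNCLEAR_LICENSE = 2      # publicly accessible, licensing unclear
    PERSONAL_INFO = 3        # contains personal information
    IMPORTANT_OR_SECRET = 4  # important data or trade secrets


def classify(*, verified_open_license: bool, contains_pi: bool,
             important_or_trade_secret: bool) -> Tier:
    """Assign the highest-risk tier whose condition holds (an assumed precedence rule)."""
    if important_or_trade_secret:
        return Tier.IMPORTANT_OR_SECRET
    if contains_pi:
        return Tier.PERSONAL_INFO
    if verified_open_license:
        return Tier.OPEN_LICENSE
    # No verified license means Tier 2, never Tier 1:
    # the absence of an explicit prohibition is not documentation.
    return Tier.UNCLEAR_LICENSE


# Compliance gates per tier, paraphrased from the descriptions above.
GATES = {
    Tier.OPEN_LICENSE: ["document license", "preserve attribution"],
    Tier.UNCLEAR_LICENSE: ["verify license chain", "exclude or restrict if unverified"],
    Tier.PERSONAL_INFO: ["de-identify early", "impact assessment", "document legal basis"],
    Tier.IMPORTANT_OR_SECRET: ["access controls", "encryption", "audit logging",
                               "segregated pipeline", "separate approval"],
}

tier = classify(verified_open_license=True, contains_pi=True,
                important_or_trade_secret=False)
print(tier.name, GATES[tier])
```

The example record is openly licensed but contains personal information, so under the assumed precedence rule it lands in Tier 3 and inherits the Tier 3 gates.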

The four operational pathways

Zhang proposes four concrete pathways for enterprises operationalizing the framework.

Pathway 1 — Strengthen authorization contracting with data suppliers

Enterprises sourcing training data from third-party suppliers should contract for:

Complete data source proof. The supplier must provide documentation of where the data was collected and from whom.
Authorization-chain documentation. Full traceability from the original rights-holder through any intermediate licensees to the supplier.
Title-warranty clauses. Embedded warranties shifting infringement liability to the supplier in case of defects.

This shifts the license-laundering risk back up the supply chain to the party best positioned to verify provenance — the supplier — rather than concentrating it at the deployer.
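
Authorization-chain documentation lends itself to a mechanical completeness check before legal review. The sketch below is a simplification under stated assumptions: the `Grant` schema is hypothetical, each grant is reduced to two yes/no permissions, and every link is required to allow onward sublicensing because the supplier must in turn pass rights to the enterprise. Real license terms are rarely this binary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One link in an authorization chain (hypothetical schema)."""
    grantor: str
    grantee: str
    permits_training: bool
    permits_sublicensing: bool


def chain_is_traceable(chain: list[Grant], rights_holder: str, supplier: str) -> bool:
    """Check that the grants connect the rights-holder to the supplier without gaps."""
    if not chain:
        return False
    if chain[0].grantor != rights_holder or chain[-1].grantee != supplier:
        return False
    for i, link in enumerate(chain):
        # Every link must cover training use and allow the rights to be passed on.
        if not (link.permits_training and link.permits_sublicensing):
            return False
        # The next grant must start with the party this grant ended with.
        if i + 1 < len(chain) and chain[i + 1].grantor != link.grantee:
            return False
    return True


complete = [
    Grant("publisher", "aggregator", True, True),
    Grant("aggregator", "supplier", True, True),
]
broken = [
    Grant("publisher", "aggregator", True, True),
    Grant("reseller", "supplier", True, True),  # no grant from aggregator to reseller
]

print(chain_is_traceable(complete, "publisher", "supplier"))
print(chain_is_traceable(broken, "publisher", "supplier"))
```

A check like this only confirms that the paperwork is internally continuous; whether each underlying grant is valid remains a question for counsel and for the title warranty.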

Pathway 2 — Classification and grading system

Maintain a routine data inventory and an asset ledger that records, for each data category, its source, authorization form, applicable scope, and compliance status. Apply differentiated access controls aligned to the four-tier classification.
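
A ledger of this kind can start as a very small data structure. The sketch below is illustrative only: the field names follow the four attributes listed above, while the role names, the clearance ceilings in `ROLE_MAX_TIER`, and the status labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    """One row of a training-data asset ledger (field names are illustrative)."""
    category: str
    source: str
    authorization_form: str  # e.g. "CC BY 4.0", "supplier contract"
    applicable_scope: str    # what the authorization covers
    compliance_status: str   # e.g. "cleared", "pending", "excluded"
    tier: int                # 1-4, following the four-tier classification


# Hypothetical clearance ceilings: a role may read entries up to this tier.
ROLE_MAX_TIER = {"ml_engineer": 2, "privacy_officer": 3, "security_lead": 4}


def can_access(role: str, entry: LedgerEntry) -> bool:
    """Differentiated access: unknown roles get no access at all."""
    return entry.tier <= ROLE_MAX_TIER.get(role, 0)


ledger = [
    LedgerEntry("encyclopedia text", "public dump", "CC BY-SA", "training, with attribution", "cleared", 1),
    LedgerEntry("forum posts", "web crawl", "none documented", "unknown", "pending", 2),
    LedgerEntry("support tickets", "internal CRM", "user consent", "service improvement", "pending", 3),
]

# Routine inventory: which categories still block inclusion in the corpus?
pending = [e.category for e in ledger if e.compliance_status != "cleared"]
print(pending)
print(can_access("ml_engineer", ledger[2]))
```

Keeping the tier on each row lets the same ledger drive both the inventory report and the access decision.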

Pathway 3 — Technical defenses

Pre-training automated tools. Remove personal information; identify high-copyright-risk data; flag potential trade-secret content. Apply before the data enters the training pipeline.
Output filtering mechanisms. At inference, intercept outputs that may reproduce training-corpus expressions verbatim. The output-stage risk needs an output-stage control.
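
One common way to approximate an output-stage control is to index word n-grams from protected reference texts and flag any output that shares one. The sketch below is a minimal illustration of that idea, not a description of any particular vendor's filter or of a method Zhang prescribes. The class name `VerbatimFilter` and the window size are assumptions, and whitespace tokenization is a simplification that would not work for Chinese text, where a proper tokenizer or character n-grams would be needed.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows; empty if there are fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


class VerbatimFilter:
    """Flags outputs that share any n-gram with an indexed protected text."""

    def __init__(self, protected_texts: list[str], n: int = 8):
        self.n = n
        self.index: set[tuple[str, ...]] = set()
        for text in protected_texts:
            # Lowercase + whitespace split: a deliberately crude normalization.
            self.index |= ngrams(text.lower().split(), n)

    def overlaps(self, output: str) -> bool:
        candidate = ngrams(output.lower().split(), self.n)
        return not self.index.isdisjoint(candidate)


protected = ["the quick brown fox jumps over the lazy dog near the river bank"]
flt = VerbatimFilter(protected, n=5)

copied = "He wrote that the quick brown fox jumps over a fence"
original = "A fast brown fox leaps over a sleepy dog"

print(flt.overlaps(copied))    # shares a 5-word run with the protected text
print(flt.overlaps(original))  # no shared 5-word run
```

The window size trades recall against false positives: a small `n` blocks common phrases, a large `n` misses lightly edited copies. Near-verbatim reproduction would need fuzzier matching than this exact-overlap test provides.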

Pathway 4 — Public corpus infrastructure development

The supply-side fix Zhang advocates: expand compliant public corpora. Government data, public cultural resources, scientific data — released under standardized authorization terms with quality control and continuous updating. The aim is to provide enterprises with a high-volume, low-risk, well-documented training-data source — reducing the operational incentive to scrape questionable web content.

What this tells overseas compliance teams

Stop conflating “model open-source” with “data open-source.” They’re distinct legal objects. Standard global practice (open weights + closed training corpus, or open weights + partially-documented training corpus) is legally coherent in China; the conflation is the compliance failure mode, not the structure.
“Publicly accessible” is not a legal status — it is a technical state. Web-accessibility does not entail training-use license. Compliance reviews of training-data sourcing should specifically reject “scraped from public web” as a sufficient documentation standard. Replace with affirmative license documentation per data source.
The license-laundering risk is structural. Scrape-aggregate-train pipelines obscure the license-chain in ways that cannot be unwound post hoc. The compliance posture has to be designed at the acquisition stage; downstream remediation is not, in practice, available.
The four-tier framework maps cleanly to internal corpus governance. Multinationals building or licensing AI models with any China-data exposure should map their training corpora against Zhang’s four tiers and document the compliance gates per tier. The framework is portable; many of its requirements (de-identification at pipeline entry, license-chain documentation, output filtering) are operational baseline globally.
The output-stage filtering capability is the under-deployed control. Most compliance attention focuses on data acquisition and processing; the inference-stage reproduction risk is where the most visible failures occur in practice. Build the output filtering before the regulator surfaces a verbatim-reproduction case against your model.

The deeper point Zhang lands at the close of her piece: open-source and compliance are not in tension; they are both preconditions for sustainable AI industry development. China’s AI international competitiveness, she argues, requires both “continuous technological breakthroughs” and “solid legal infrastructure.” Compliance governance is not a constraint on innovation — it is the condition that lets innovation continue. The framing is consistent with the broader Chinese-regulator posture: the regime is trying to enable the AI industry, not suppress it, but is willing to absorb friction in the build-out to keep the architecture intact.

— 张平, 前沿 | 开源人工智能训练数据的合规治理 (Frontier: Compliance Governance of Open-Source AI Training Data), 人民论坛 (People’s Tribune), April 1, 2026. Original article (Chinese).

Not legal advice. The above is DCC’s structured summary of Zhang’s analysis, with framing for overseas counsel; the four-tier classification framework and the four operational pathways are Zhang’s.