13. Data Provenance and Supply Chain Security

13.1 Understanding Data Provenance in AI/LLM Systems
The Data Lifecycle in AI Systems
Why Provenance Matters
Provenance vs. Data Lineage vs. Data Governance
Concept | Focus | Purpose
Chain of Custody for AI Data
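One way to make a chain of custody tamper-evident is to hash-chain each handling step, so altering any earlier record invalidates every later hash. A minimal standard-library sketch (field names and actors are illustrative, not a standard schema):

```python
# Hypothetical hash-chained chain-of-custody log for a dataset.
# Each entry commits to the previous entry's hash, so tampering with any
# step breaks verification of all subsequent entries.
import hashlib
import json

def add_custody_entry(log, actor, action, artifact_digest):
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    entry = {
        "actor": actor,                      # who touched the data
        "action": action,                    # e.g. "collected", "cleaned", "labeled"
        "artifact_digest": artifact_digest,  # sha256 of the data at this step
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)
    return entry

def verify_custody_log(log):
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

A production system would additionally sign each entry with the actor's key; the hash chain alone only proves ordering and integrity, not authorship.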
13.2 The AI/LLM Supply Chain Landscape

Overview of Supply Chain Components

Upstream Dependencies
Pre-trained Models
Datasets
Embedding Services
Lateral Dependencies
Code and Frameworks
Infrastructure
APIs and Services
Downstream Dependencies
Fine-tuning and Customization
Production Data
The "Trust But Verify" Problem
13.3 Supply Chain Attack Surfaces
13.3.1 Model Supply Chain
Pre-trained Model Repositories
Example Attack
Model Type | Name
Model Weights and Checkpoint Integrity
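The basic integrity check for any downloaded checkpoint is to compare its digest against the digest published by the model's distributor before loading it. A minimal sketch (file path and expected digest are placeholders):

```python
# Verify a downloaded model checkpoint against a published SHA-256 digest
# before loading. Streaming in chunks keeps memory use flat even for
# multi-gigabyte weight files.
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(path, expected_digest):
    actual = sha256_file(path)
    if actual != expected_digest.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

Note that a checksum only detects accidental or in-transit corruption; if the repository itself is compromised, the attacker can publish a matching digest, which is why signatures (covered in 13.5.1) matter.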
Model Poisoning During Training
Example Backdoor
13.3.2 Training Data Supply Chain
Public Datasets
Risks
Scraped Web Data
Attack Scenario
Step | Attacker Action | Impact
Crowdsourced Data and Annotations
13.3.3 Code and Framework Dependencies
ML Framework Vulnerabilities
Python Package Ecosystem
Attack Vectors
Historical Example: ua-parser-js (2021)
Container Images
13.3.4 Infrastructure and Platform Dependencies
Cloud Model APIs
Supply Chain Risk Example
Vector Databases and Embedding Services
GPU Compute Providers
13.3.5 Third-Party Integrations
Plugins and Extensions
Plugin Risks
Example Attack Vector
13.4 Common Supply Chain Vulnerabilities
13.4.1 Model Poisoning and Backdoors

Definition
Attack Mechanics
Training-Time Poisoning

Example
Example Type | Input | Output
Input Type | Behavior
Inference-Time Attacks
Trojan Triggers in Models
Real-World Examples
BadNets (2017)
Poisoning Language Models
Federated Learning Attacks
13.4.2 Data Poisoning
Clean-Label Poisoning
Label Flipping
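Label flipping can be simulated directly when red teaming a training pipeline: flip the labels of a small fraction of examples and measure the accuracy drop of the retrained model. A standard-library sketch of the flipping step (binary labels assumed for simplicity):

```python
# Illustrative label-flipping poisoner: flips the labels of a chosen
# fraction of examples. A real experiment would retrain the model on the
# poisoned labels and compare accuracy against the clean baseline.
import random

def flip_labels(labels, fraction, classes=(0, 1), seed=0):
    rng = random.Random(seed)       # fixed seed for reproducible tests
    labels = list(labels)
    n_flip = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_flip):
        labels[i] = next(c for c in classes if c != labels[i])
    return labels
```

For example, flipping 10% of 100 positive labels leaves 90 positives, a poisoning rate small enough to evade casual inspection yet often sufficient to degrade a classifier on targeted inputs.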
Web Scraping Manipulation
Attack Methodology
Step | Action | Details
Example Product Recommendation Attack
Step | Strategy Details
Adversarial Data Injection in Fine-Tuning
RLHF (Reinforcement Learning from Human Feedback) Poisoning
13.4.3 Dependency Confusion and Substitution
Typosquatting in Package Repositories
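Typosquats are typically one or two keystrokes away from a popular package name, so a simple edit-distance screen over a requested dependency catches many of them. A sketch (the threshold of two edits is a heuristic; real tooling also checks homoglyphs and word reordering):

```python
# Heuristic typosquatting check: flag requested package names that sit
# within a small edit distance of a well-known package.
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def possible_typosquats(name, popular, max_edits=2):
    return [p for p in popular
            if p != name and edit_distance(name, p) <= max_edits]
```

For example, `possible_typosquats("reqeusts", ["requests", "numpy", "torch"])` flags `requests`, since the transposition is two single-character edits away.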
Malicious Package Injection
Attack Flow
Step | Attack Flow
Dependency Confusion Attack
Package Location | Package Name
Real-World Example (2021)
Compromised Maintainer Accounts
13.4.4 Model Extraction and Theft
Stealing Proprietary Models via API Access
Model Extraction Techniques
Query-based Extraction
Effectiveness
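The principle behind query-based extraction fits in a toy example: treat the victim as a black-box scoring function and recover its parameters purely from probe queries. Real attacks target neural networks with thousands to millions of queries; the linear case below just shows why API access alone can leak a model:

```python
# Toy query-based extraction. The "victim" hides its parameters behind an
# API; the attacker recovers them exactly from two probe queries.
def victim_api(x, _w=3.0, _b=-1.0):
    # secret parameters the attacker never sees directly
    return _w * x + _b

def extract_linear(query):
    y0 = query(0.0)      # intercept leaks from the x = 0 probe
    y1 = query(1.0)      # slope leaks from one more probe
    return y1 - y0, y0   # (w, b)
```

Against a genuine neural model the attacker instead trains a surrogate on (input, output) pairs harvested from the API, which is why per-client query budgets and output truncation are common defenses.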
Knowledge Distillation as a Theft Vector
Reconstruction Attacks on Model Weights
13.4.5 Compromised Updates and Patches
Malicious Model Updates
Attack
Step | Event | Status
Backdoored Library Versions
SolarWinds-Style Supply Chain Attacks
Potential ML Equivalent
Automatic Update Mechanisms as Attack Vectors
13.5 Provenance Tracking and Verification
13.5.1 Model Provenance
Model Cards (Documentation Standards)
Example Model Card Template
Cryptographic Signing of Model Weights
Process
Step | Action
Tools
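The signing process above can be sketched in a few lines. Real model signing uses asymmetric keys (e.g. Sigstore or GPG, so consumers verify with a public key); a keyed HMAC stands in below purely so the example stays standard-library only:

```python
# Simplified signing sketch over a weights manifest. The manifest maps
# file names to SHA-256 digests; the signature covers the whole manifest,
# so swapping any file's digest invalidates it. HMAC is a symmetric
# stand-in for the asymmetric signatures used in practice.
import hashlib
import hmac
import json

def sign_manifest(manifest, key):
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_manifest(manifest, signature, key):
    expected = sign_manifest(manifest, key)
    return hmac.compare_digest(expected, signature)
```

The constant-time comparison (`hmac.compare_digest`) matters even in this toy version: naive string comparison leaks timing information about how many leading characters of the signature match.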
Provenance Metadata
13.5.2 Data Provenance
Source Tracking for Training Data
Example Data Provenance Record
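A data provenance record typically captures where the data came from, when, under what license, and what was done to it. The sketch below is a hypothetical record; the field names are illustrative, not a standard schema:

```python
# Hypothetical data provenance record. Every value here is a placeholder;
# the point is the shape: source, time, license, integrity digest, and an
# auditable list of transformations.
provenance_record = {
    "dataset_id": "news-corpus-v2",            # internal identifier
    "source_url": "https://example.com/data",  # origin of the raw data
    "collected_at": "2024-01-15T00:00:00Z",
    "collection_method": "web_scrape",
    "license": "CC-BY-4.0",
    "sha256": "0" * 64,                        # digest of the raw snapshot
    "transformations": [
        {"step": "dedup", "tool": "custom", "records_removed": 1204},
        {"step": "pii_scrub", "tool": "custom", "records_modified": 87},
    ],
}
```

Keeping the raw-snapshot digest alongside the transformation log is what lets an auditor replay preprocessing and confirm the training set really derives from the claimed source.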
Transformation and Preprocessing Logs
Attribution and Licensing Information
Data Freshness and Staleness Indicators
13.5.3 Code and Dependencies Provenance
Software Bill of Materials (SBOM) for AI Systems
Example SBOM for ML Project
Tools for SBOM Generation
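As a minimal illustration of what SBOM tooling automates, the current Python environment can be inventoried with nothing but the standard library. Real generators (e.g. CycloneDX or Syft) add hashes, licenses, and transitive relationships on top of this:

```python
# Minimal SBOM sketch for the running Python environment, standard library
# only. "sbom_version" is a made-up field, not a CycloneDX/SPDX schema.
from importlib.metadata import distributions

def build_sbom():
    components = []
    for dist in distributions():
        components.append({
            "name": dist.metadata["Name"],
            "version": dist.version,
            "type": "python-package",
        })
    return {"sbom_version": "0.1-draft", "components": components}
```

Even this crude inventory is enough to diff two environments or cross-reference installed versions against a CVE feed, which is the core workflow the scanners in the next subsection implement.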
Dependency Trees and Vulnerability Scanning
Code Signing and Attestation
Build Reproducibility
13.5.4 Provenance Documentation Standards
Model Cards (Google, Mitchell et al. 2019)
Data Sheets for Datasets (Gebru et al. 2018)
Data Sheet Sections
Nutrition Labels for AI Systems
Supply Chain Transparency Reports
13.6 Red Teaming Supply Chain Security
13.6.1 Reconnaissance and Mapping
Identification Tasks
1. Model Dependencies
2. Data Dependencies
3. Code Dependencies
4. Infrastructure Dependencies
Building a Supply Chain Attack Tree
Attack Vector | Sub-Techniques
13.6.2 Integrity Verification Testing
Verifying Model Weight Checksums and Signatures
Test Procedure
Testing for Backdoors and Trojan Triggers
Approach 1: Behavioral Testing
Approach 2: Statistical Analysis
Approach 3: Model Inspection Tools
Validating Training Data Authenticity
13.6.3 Dependency Analysis
Scanning for Known Vulnerabilities (CVEs)
Example Output
Package | Version | CVE | Description | Severity | Fixed In
Testing for Dependency Confusion
Test Procedure
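The core test is to take every internal-only package name and check whether the same (normalized) name exists on the public index. In practice the public-name set comes from querying PyPI (e.g. the `https://pypi.org/pypi/<name>/json` endpoint); it is passed in below so the sketch stays offline and testable:

```python
# Dependency-confusion check: flag internal package names that collide
# with names on the public index. Uses PEP 503 name normalization, under
# which e.g. "acme_utils" and "acme.utils" are the same package name.
import re

def normalize(name):
    # PEP 503: lowercase, with runs of -, _, . collapsed to a single dash
    return re.sub(r"[-_.]+", "-", name).lower()

def confusable_packages(internal_names, public_names):
    public = {normalize(n) for n in public_names}
    return sorted(n for n in internal_names if normalize(n) in public)
```

Any name this flags is a candidate for an attacker to publish publicly with a higher version number, which misconfigured resolvers will prefer over the internal copy.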
Evaluating Transitive Dependencies
13.6.4 Simulating Supply Chain Attacks
Test 1: Model Injection Simulation (in isolated test environment)
Test 2: Data Poisoning Simulation
Test 3: Dependency Confusion Attack Simulation
13.6.5 Third-Party Risk Assessment
Evaluating Vendor Security Postures
Security Questionnaire Template
Testing API Provider Security
Assessing Plugin Ecosystem Risks
13.7 Real-World Supply Chain Attack Scenarios
Scenario 1: Poisoned Pre-trained Model from Public Repository
Attack Setup
Attack Execution (Scenario 1)
Impact (Scenario 1)
Detection
Mitigation
Scenario 2: Malicious Python Package in ML Dependencies
Attack Setup
Attack Execution (Scenario 2)
Impact (Scenario 2)
Real-World Example
Detection and Mitigation
Scenario 3: Compromised Training Data via Web Scraping
Attack Scenario: "Operation Poison Well"
Attack Execution (Scenario 3)
Impact (Scenario 3)
Defense
Scenario 4: Cloud API Provider Compromise
Attack Scenario
Attack Execution (Scenario 4)
Impact (Scenario 4)
Real-World Parallel
Mitigation
Scenario 5: Insider Threat in Fine-Tuning Pipeline
Attack Scenario
Attack Execution (Scenario 5)
Impact (Scenario 5)
Detection
Mitigation
13.8 Conclusion
Chapter Takeaways
Recommendations for Red Teamers
Recommendations for Defenders
Future Considerations
Next Steps