Public Domain as Lowest Risk Data

For AI training data, public domain sources represent the lowest legal and ethical risk tier. No copyright claims, no license restrictions, no terms of service violations.

Public Domain Sources

  • Government publications (US federal)
  • Works with expired copyright
  • Explicitly dedicated works (CC0)
  • Facts and data (not creative expression)

Risk Hierarchy

  1. Lowest: Public domain, CC0
  2. Low: Permissive licenses (CC-BY, MIT)
  3. Medium: Copyleft licenses (GPL, CC-SA)
  4. High: Fair use claims on copyrighted material
  5. Highest: Terms of service violations, scraped personal data

Practical Considerations

  • Public domain != public availability (hosting still costs)
  • Quality may be lower (no commercial curation)
  • Coverage gaps in specialized domains
  • Still need provenance for reproducibility

Strategy

Start with public domain foundations. Layer proprietary data where public domain is insufficient. Document risk assessment for each data source.

Related: 04-atom—data-provenance, 04-atom—data-governance