Open Source Data Assets for Enterprise AI Enrichment

Internal whitepaper prepared for Legal, Privacy, and Global Trade review. December 2025.

Purpose

License assessment and compliance framework for integrating open source datasets with enterprise data systems to support Microsoft Copilot deployment.

Key Contribution

A three-tier license framework for evaluating open source data:

TierLicense TypeRiskCompliance
1Public Domain (CC0, US Gov, PDDL)LowestNone required
2Attribution Required (CC BY, Apache, BSD)LowSimple workflow
3ShareAlike (CC BY-SA, ODbL)Requires assessmentDerivative work determination

Connection to Garden Content

This whitepaper is the practical implementation of 04-molecule—reference-data-multiplier:

“Integrating permissively-licensed open source datasets with proprietary business data creates a semantic enrichment layer that enhances AI system performance.”

The datasets inventoried connect to earlier work:

Datasets Evaluated

Tier 1 (Public Domain): Wikidata, BLS OEWS, BLS ORS, SOC System, ISCO-08

Tier 2 (Attribution): O*NET, ESCO, Canadian SCT, OSMT/RSDs, Common Core Ontologies, WordNet, GraphGen4Code, CodeOntology, ATOMIC 2020, Freebase

Tier 3 (ShareAlike): DBpedia, YAGO 4.5, ConceptNet

Excluded: BabelNet (non-commercial), SFIA (commercial license), Lightcast (subscription), OpenCyc (discontinued), NELL (no license)

Extracted Content

Atoms:

Molecules:

Key Recommendations

  1. Prioritize Tier 1 (public domain) for immediate deployment
  2. Establish attribution workflow for Tier 2
  3. Get legal guidance on ShareAlike before using Tier 3
  4. Maintain provenance documentation in data catalog
  5. Quarterly license review as datasets update

Related: 04-molecule—reference-data-multiplier, 04-atom—data-governance, 06-atom—entity-linking