Open Source Data Assets for Enterprise AI Enrichment
Internal whitepaper prepared for Legal, Privacy, and Global Trade review. December 2025.
Purpose
This whitepaper provides a license assessment and compliance framework for integrating open source datasets with enterprise data systems in support of a Microsoft Copilot deployment.
Key Contribution
A three-tier license framework for evaluating open source data:
| Tier | License Type | Risk | Compliance |
|---|---|---|---|
| 1 | Public Domain (CC0, US Gov, PDDL) | Lowest | None required |
| 2 | Attribution Required (CC BY, Apache, BSD) | Low | Simple workflow |
| 3 | ShareAlike (CC BY-SA, ODbL) | Requires assessment | Derivative work determination |
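The table above can be read as a lookup from license to tier. The following is a minimal sketch of that lookup, not the whitepaper's own tooling: the names `Tier`, `TierPolicy`, `LICENSE_TIERS`, and `classify` are hypothetical, and the license strings are simply the labels used in the table rather than a complete or canonical identifier list.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PUBLIC_DOMAIN = 1
    ATTRIBUTION = 2
    SHAREALIKE = 3


@dataclass(frozen=True)
class TierPolicy:
    risk: str
    compliance: str


# Risk and compliance text copied from the tier table.
POLICIES = {
    Tier.PUBLIC_DOMAIN: TierPolicy("Lowest", "None required"),
    Tier.ATTRIBUTION: TierPolicy("Low", "Simple workflow"),
    Tier.SHAREALIKE: TierPolicy("Requires assessment", "Derivative work determination"),
}

# Assumption: licenses are keyed by the short labels used in the table,
# upper-cased so that lookups are case-insensitive.
LICENSE_TIERS = {
    "CC0": Tier.PUBLIC_DOMAIN,
    "US GOV": Tier.PUBLIC_DOMAIN,
    "PDDL": Tier.PUBLIC_DOMAIN,
    "CC BY": Tier.ATTRIBUTION,
    "APACHE": Tier.ATTRIBUTION,
    "BSD": Tier.ATTRIBUTION,
    "CC BY-SA": Tier.SHAREALIKE,
    "ODBL": Tier.SHAREALIKE,
}


def classify(license_name: str) -> Optional[Tier]:
    """Return the tier for a license label, or None if it is not in the table."""
    key = " ".join(license_name.upper().split())  # normalize case and spacing
    return LICENSE_TIERS.get(key)


if __name__ == "__main__":
    for name in ["CC0", "cc by-sa", "Proprietary"]:
        tier = classify(name)
        if tier is None:
            print(f"{name}: not covered by the framework")
        else:
            policy = POLICIES[tier]
            print(f"{name}: Tier {int(tier)}, risk={policy.risk}, compliance={policy.compliance}")
```

Returning `None` for an unknown license, rather than defaulting to a tier, keeps unclassified licenses visible so they can be routed to review.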
Connection to Garden Content
This whitepaper is the practical implementation of 04-molecule—reference-data-multiplier:
“Integrating permissively-licensed open source datasets with proprietary business data creates a semantic enrichment layer that enhances AI system performance.”
The datasets inventoried connect to earlier work:
- Knowledge graphs (Wikidata, DBpedia) → 06-molecule—qualitative-research-knowledge-graph, 06-atom—entity-linking-dimensionality
- Skills taxonomies (O*NET, ESCO) → workforce analytics applications
- Technical ontologies → domain-specific enrichment
Datasets Evaluated
Tier 1 (Public Domain): Wikidata, BLS OEWS, BLS ORS, SOC System, ISCO-08
Tier 2 (Attribution): O*NET, ESCO, Canadian SCT, OSMT/RSDs, Common Core Ontologies, WordNet, GraphGen4Code, CodeOntology, ATOMIC 2020, Freebase
Tier 3 (ShareAlike): DBpedia, YAGO 4.5, ConceptNet
Excluded: BabelNet (non-commercial), SFIA (commercial license), Lightcast (subscription), OpenCyc (discontinued), NELL (no license)
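One way to make this inventory queryable is a small lookup that returns a dataset's tier or its exclusion reason. The sketch below is illustrative only: `INVENTORY`, `EXCLUDED`, `NEXT_STEP`, and `status` are hypothetical names, and the next-step wording paraphrases the recommendations in this note.

```python
# Dataset names and groupings are taken from the inventory above.
INVENTORY = {
    1: ["Wikidata", "BLS OEWS", "BLS ORS", "SOC System", "ISCO-08"],
    2: [
        "O*NET", "ESCO", "Canadian SCT", "OSMT/RSDs", "Common Core Ontologies",
        "WordNet", "GraphGen4Code", "CodeOntology", "ATOMIC 2020", "Freebase",
    ],
    3: ["DBpedia", "YAGO 4.5", "ConceptNet"],
}

EXCLUDED = {
    "BabelNet": "non-commercial",
    "SFIA": "commercial license",
    "Lightcast": "subscription",
    "OpenCyc": "discontinued",
    "NELL": "no license",
}

# Assumption: one next step per tier, paraphrased from the recommendations.
NEXT_STEP = {
    1: "eligible for immediate deployment",
    2: "set up attribution workflow",
    3: "get legal guidance on ShareAlike",
}


def status(dataset: str) -> str:
    """Describe where a dataset sits in the inventory."""
    if dataset in EXCLUDED:
        return f"excluded ({EXCLUDED[dataset]})"
    for tier, names in INVENTORY.items():
        if dataset in names:
            return f"Tier {tier}: {NEXT_STEP[tier]}"
    return "not evaluated"


if __name__ == "__main__":
    for name in ["Wikidata", "ESCO", "ConceptNet", "BabelNet", "SomeOtherDataset"]:
        print(f"{name}: {status(name)}")
```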
Extracted Content
Atoms:
- 04-atom—license-tier-framework
- 04-atom—sharealike-derivative-ambiguity
- 04-atom—public-domain-lowest-risk
Molecules:
Key Recommendations
- Prioritize Tier 1 (public domain) datasets for immediate deployment
- Establish an attribution workflow for Tier 2 datasets
- Obtain legal guidance on ShareAlike obligations before using Tier 3 datasets
- Maintain provenance documentation in the data catalog
- Review licenses quarterly, since terms can change as datasets update
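The provenance and review recommendations could be captured as one catalog record per dataset. This is a sketch under stated assumptions, not a prescribed schema: `ProvenanceRecord` and its fields are hypothetical, "quarterly" is approximated as 90 days, and the dates in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: a quarterly review cadence is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class ProvenanceRecord:
    dataset: str
    license_name: str
    tier: int                  # 1, 2, or 3 per the license framework
    retrieved_on: date         # when the dataset snapshot was obtained
    last_license_review: date  # when the license terms were last checked

    def review_due(self, today: date) -> bool:
        """True once the review interval has elapsed since the last check."""
        return today - self.last_license_review >= REVIEW_INTERVAL

    def needs_attribution(self) -> bool:
        """Tier 2 and Tier 3 licenses require attribution; Tier 1 does not."""
        return self.tier >= 2

    def needs_legal_guidance(self) -> bool:
        """Tier 3 (ShareAlike) use is gated on legal guidance."""
        return self.tier == 3


if __name__ == "__main__":
    # Illustrative dates only.
    record = ProvenanceRecord(
        dataset="Wikidata",
        license_name="CC0",
        tier=1,
        retrieved_on=date(2025, 9, 1),
        last_license_review=date(2025, 9, 1),
    )
    print(record.review_due(date(2025, 12, 15)))
    print(record.needs_attribution())
```

Keeping the last review date on the record itself means an overdue review can be found with a simple scan of the catalog.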
Related: 04-molecule—reference-data-multiplier, 04-atom—data-governance, 06-atom—entity-linking