How we use public-domain dictionaries without copying restricted data
WordFor's results come from real dictionaries. But "real dictionaries" includes a messy mix of licenses — some truly public domain, some copyleft, some proprietary. To keep the visible product clean and redistributable, WordFor sorts every source into three lanes and is strict about which lane is allowed to put text in front of you.
Lane 1: the clean public-domain core (visible)
The words and definitions you actually see come only from sources that are public domain or openly licensed for redistribution:
- Open English WordNet (CC BY 4.0) — modern senses and synonyms.
- Webster's 1913 (public domain) — the classic unabridged base.
- Century Dictionary (public domain) — broad historical coverage.
- Chambers's Twentieth Century Dictionary, 1908 (public domain).
- The public-domain Webster-derived subset of GCIDE — only the 1913-Webster lineage, never GPL additions.
- A small set of CC0 definitions for modern words missing from the old dictionaries.
Each visible entry carries a source label, and an automated audit fails the build if any restricted text ever lands in this lane.
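As a rough illustration of what such an audit could look like, here is a minimal sketch. It assumes each visible entry is a record carrying a `source` label and that the build keeps an allow-list of Lane 1 source identifiers. The names `LANE1_SOURCES`, `Entry`, `LicenseAuditError`, and `audit_visible_entries`, and the identifier strings themselves, are hypothetical and are not taken from WordFor's actual code.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the Lane 1 sources listed above.
LANE1_SOURCES = frozenset({
    "oewn",           # Open English WordNet (CC BY 4.0)
    "webster1913",    # Webster's 1913 (public domain)
    "century",        # Century Dictionary (public domain)
    "chambers1908",   # Chambers's Twentieth Century Dictionary, 1908
    "gcide-webster",  # public-domain Webster-derived subset of GCIDE
    "cc0-modern",     # CC0 definitions for modern words
})


@dataclass(frozen=True)
class Entry:
    word: str
    definition: str
    source: str  # the source label every visible entry carries


class LicenseAuditError(Exception):
    """Raised to fail the build when a visible entry is not from Lane 1."""


def audit_visible_entries(entries):
    """Return the number of entries checked; raise if any source is outside Lane 1."""
    entries = list(entries)
    offenders = [e for e in entries if e.source not in LANE1_SOURCES]
    if offenders:
        listing = ", ".join(f"{e.word} ({e.source})" for e in offenders)
        raise LicenseAuditError(f"restricted source in visible lane: {listing}")
    return len(entries)
```

This sketch checks source labels only. It would not catch restricted text that had been given a Lane 1 label by mistake, so a real audit would likely need additional checks.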
Lane 2: build-time-only signals (never shown)
Some excellent resources are copyleft (CC-BY-SA) or otherwise unsuitable for redistribution. WordFor still learns from them at build time — for ranking and quality scoring — but never copies their text into the product:
- Wiktionary and ConceptNet (CC-BY-SA) — used only as cross-validation and centrality signals; their definitions are not redistributed.
- Moby Thesaurus — enriches synonym suggestions.
- Pronunciation and frequency lists — used to weight, not to display.
The distinction is deliberate: a ranking signal derived at build time is not the same as shipping someone else's copyrighted text.
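One way to picture a build-time signal is a cross-validation score: for each candidate word that already comes from Lane 1, count how many restricted sources also list it, and keep only that number. The sketch below assumes the restricted sources have been reduced to plain sets of words. The function name `cross_validation_scores` and its inputs are hypothetical and are not WordFor's actual scoring.

```python
def cross_validation_scores(candidates, restricted_indexes):
    """Map each Lane 1 candidate to the fraction of restricted indexes that contain it.

    Only the candidates are iterated, so a word that appears solely in a
    restricted index never enters the output; only numeric scores leave the build.
    """
    if not restricted_indexes:
        return {word: 0.0 for word in candidates}
    total = len(restricted_indexes)
    return {
        word: sum(word in index for index in restricted_indexes) / total
        for word in candidates
    }
```

For example, with candidates `["glad", "cheerful", "merry"]` and two restricted word sets `{"glad", "cheerful", "joyful"}` and `{"glad"}`, the result is `{"glad": 1.0, "cheerful": 0.5, "merry": 0.0}`. The word "joyful" exists only in a restricted set, so it is absent from the output.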
Lane 3: optional / research-only packs (blocked from core)
Sources that are GPL, proprietary, edition-unverified, or jurisdiction-sensitive are kept out of the visible core entirely. Examples: the full GPL GCIDE, aggregator sites, and historical works whose specific scan or edition we haven't verified as public domain. Some may later ship as clearly labelled optional packs, but never silently in the core.
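The three lanes can be read as a simple classification rule. The sketch below is a deliberately partial illustration: the license strings, the `edition_verified` flag, and the names `Lane` and `assign_lane` are assumptions made for this example, and it does not cover every Lane 2 source described above. Its one design choice worth noting is that anything unrecognized or unverified defaults to the blocked lane.

```python
from enum import Enum


class Lane(Enum):
    VISIBLE = 1       # Lane 1: text may be shown
    BUILD_SIGNAL = 2  # Lane 2: build-time signals only
    BLOCKED = 3       # Lane 3: kept out of the core


# Hypothetical license labels; a real build would need a fuller mapping.
REDISTRIBUTABLE = frozenset({"public-domain", "CC0", "CC BY 4.0"})
SIGNAL_ONLY = frozenset({"CC BY-SA"})


def assign_lane(license_id, edition_verified=True):
    """Pick a lane for a source; unknown or unverified sources are blocked."""
    if not edition_verified:
        return Lane.BLOCKED
    if license_id in REDISTRIBUTABLE:
        return Lane.VISIBLE
    if license_id in SIGNAL_ONLY:
        return Lane.BUILD_SIGNAL
    return Lane.BLOCKED
```

Under these assumptions, `assign_lane("GPL")` and `assign_lane("public-domain", edition_verified=False)` both land in `Lane.BLOCKED`, which mirrors the rule that unverified editions stay out of the core even when the work itself may be public domain.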
Why this matters
- You can trust the labels. Every visible word can be traced to a license-clean source.
- The product is redistributable. No copyleft or proprietary text rides along.
- Quality without compromise. Restricted resources still improve ranking as build-time signals, so you get good results without any of their text shipping in the product.
For how those sources turn into a ranked list, see how WordFor ranks candidate words.