Astute AI copyright observer Michael Weinberg raises some good questions about the Common Pile, an AI training dataset billed as consisting only of “openly licensed text”:
On one hand, this is an interesting effort to build a new type of training dataset that illustrates how even the “easy” parts of this process are actually hard. On the other hand, I worry that some people read “openly licensed training dataset” as the equivalent of (or very close to) “LLM free of copyright issues.”
[michaelweinberg.org]