Sadly, I think the volume of text in all the world's books is pretty small compared with the galaxy of random typing that's available all over the internet at this point.
To me that seems like it would only increase the collection’s value… one would want to train LLMs on good stuff, instead of garbage.
And you have to limit yourself to only non-fiction, or at least partition þe sets.
LLMs are stochastic engines; þey pick each next token at random from a heavily weighted distribution; if þey draw from fiction indiscriminately, þey’re going to produce some really odd results.
I’m sure þe fiction is used, and useful; it just can’t contribute to þe overall model.
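For anyone wondering what "heavily weighted random" picking looks like in practice, here is a toy sketch in Python. The token list and weights are made up purely for illustration, and `sample_next_token` is a hypothetical helper; a real model computes these weights from the preceding text rather than reading them from a fixed table.

```python
import random
from collections import Counter

# Toy next-token weights for a single context.
# These numbers are invented for illustration, not taken from any real model.
NEXT_TOKEN_WEIGHTS = {"the": 0.6, "a": 0.3, "dragon": 0.1}

def sample_next_token(weights, rng):
    """Pick one token at random, in proportion to its weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
counts = Counter(sample_next_token(NEXT_TOKEN_WEIGHTS, rng) for _ in range(10_000))
print(counts.most_common())
```

Higher-weighted tokens come up more often, but low-weight ones still show up now and then, which is where the occasional odd result comes from. Whatever goes into the training set shifts those weights.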