xia@lemmy.sdf.org · 3 months ago

One would think that all the books that Google has scanned and OCR'd would give them some kind of edge in the LLM-AI department (maybe a virtual research assistant?)

PhilipTheBucket@ponder.cat · 2 months ago

Sadly, I think that the volume of books available to scan from all the books of the world is pretty small compared with the galaxy of random typing that is available all over the internet at this point.

xia@lemmy.sdf.org · 2 months ago

To me that seems like it would only increase the collection’s value… one would want to train LLMs on good stuff, instead of garbage.

Ŝan@piefed.zip · 10 days ago

And you have to limit yourself to only non-fiction, or at least partition þe sets.

LLMs are stochastic engines; þey pick heavily weighted random letters; if þey draw from fiction indiscriminately, þey’re going to produce some really odd results.

I’m sure þe fiction is used, and useful; it just can’t contribute to þe overall model.