Most of the data used in training GPT-4 has been gathered through open initiatives like Wikipedia and Common Crawl. Both are freely accessible to anyone. As for building datasets and models, there are many non-profits like LAION and EleutherAI involved that release their work for free for others to iterate on.
While actually running the larger models at a reasonable scale will always require expensive computational resources, you really only need to do the expensive base model training once. So the cost is not nearly as high as one might first think.
Any head start OpenAI may have gotten is quickly diminishing, and it’s not like they actually have any super secret sauce behind the scenes. The situation is nowhere near as bleak as you make it sound.
Fighting against the use of publicly accessible data is ultimately just as self-sabotaging a form of Luddism as fighting against encryption.
Doing it yourself is fine as an educational exercise for newbies, but skilled Linux users generally have better things to do than go through the setup by hand for the nth time. On the other hand, the “vanilla”/bleeding-edge approach of Arch makes it one of the best bases for derivative distros available, so basing your distro on it is a no-brainer for many.