Machine learning

The Pile is a large dataset for training large language models. It can be found here. It is an 825 GiB open-source dataset that includes books, GitHub repositories, web pages, chat logs, and papers in medicine, physics, mathematics, computer science, and philosophy.