The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while remaining usable by any tool that expects uncompressed data.
Edit: It looks like some of this data is already compressed, so maybe not.
Note that you also need about 5TB of disk for the full decompressed dataset. However, only the Common Crawl portion is compressed as jsonl.zst; everything else is uncompressed jsonl.
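For anyone working with the compressed shards directly: the .jsonl.zst files can be streamed without decompressing them to disk first. A minimal Python sketch, assuming the third-party zstandard package is installed; the file path is just a placeholder:

    import io
    import json
    import zstandard

    path = "common_crawl/shard_000.jsonl.zst"  # hypothetical file name

    with open(path, "rb") as fh:
        dctx = zstandard.ZstdDecompressor()
        # stream_reader yields a binary file-like object over the decompressed bytes
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                record = json.loads(line)
                # ... process one JSON record at a time ...

Streaming this way keeps memory use roughly constant regardless of shard size, which matters when the decompressed data runs to several terabytes.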