
I made sure to include that in my blog post - along with a note that you need 2.67TB of disk space first!


> you need 2.67TB of disk space

The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while still being usable to any tool that expects uncompressed data.

Edit: It looks like some of this data is already compressed, so maybe not.
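A quick way to sanity-check the "compresses pretty well" claim is to compress some JSONL with a general-purpose codec and look at the ratio. This is a minimal sketch using only Python's stdlib zlib; the record fields here are made-up placeholders, not the actual dataset schema, but the repeated keys and similar text are exactly the kind of redundancy transparent filesystem compression exploits:

```python
import json
import zlib

# Synthetic JSONL records (hypothetical fields, not the real schema).
# Repeated keys and similar text across records give the compressor
# lots of redundancy to work with, just like real crawl data.
records = [
    {"url": f"https://example.com/page/{i}",
     "text": "some crawled page text " * 20}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

compressed = zlib.compress(raw, 6)
ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} B, compressed: {len(compressed)} B, ratio: {ratio:.2%}")
```

Of course, this only holds for data that isn't already compressed; running a codec over existing .zst files buys roughly nothing, which is the caveat above.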


Note that you also need about 5TB of disk space for the fully decompressed dataset. However, only the Common Crawl portion is compressed as jsonl.zst; everything else is uncompressed JSONL.



