
For every 1 robots.txt that is genuinely configured, there are 9 that make absolutely no sense at all.

Worse. GETing the robots.txt automatically flags you as a 'bot'!

So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that the cheapest hired junior webadmin copy/pasted there from some reddit comment, we now have to jump through hoops such as fetching the robots.txt from a separate VPN, etc.
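For what it's worth, the well-behaved version of this is straightforward: Python's standard library ships a robots.txt parser, so a crawler can check a URL against the rules before fetching it. A minimal sketch (the user agent string and paths here are made up for illustration):

```python
# Sketch: checking a URL against robots.txt rules with the stdlib parser.
# Here we parse a robots.txt given as a string; a real crawler would fetch
# it with rp.set_url(...) + rp.read() -- which is exactly the GET that some
# sites flag as bot traffic, as described above.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/index.html"))
```

The irony the parent describes is that even this polite check requires a GET of /robots.txt first, which is itself treated as a bot signal.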



Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages mentioned only in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
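The honeypot idea above is simple to sketch: list a path in robots.txt that is linked nowhere else, and treat any request to it as proof the client ignored the rules. All the names here (the trap path, the block duration, the in-memory block list) are illustrative, not anyone's actual setup:

```python
# Sketch of a robots.txt honeypot: a trap path that appears only in a
# Disallow rule. Any IP requesting it is blocked for 24 hours.
import time

TRAP_PATH = "/do-not-crawl/"     # listed only in robots.txt, linked nowhere
BLOCK_SECONDS = 24 * 3600
blocked = {}                     # ip -> timestamp when the block expires

def is_blocked(ip, now):
    until = blocked.get(ip)
    return until is not None and now < until

def handle_request(ip, path, now=None):
    """Return an HTTP-style status code for this request."""
    now = time.time() if now is None else now
    if is_blocked(ip, now):
        return 403                       # still serving out a block
    if path.startswith(TRAP_PATH):
        blocked[ip] = now + BLOCK_SECONDS  # sprang the trap: block 24h
        return 403
    return 200                           # normal request

print(handle_request("192.0.2.7", "/do-not-crawl/page", now=0))   # trips trap
print(handle_request("192.0.2.7", "/index.html", now=60))         # blocked
print(handle_request("192.0.2.7", "/index.html", now=86401))      # expired
```

A production version would keep the block list in a firewall or shared store rather than process memory, but the logic is just this: the only way to learn the trap path is to read robots.txt and then violate it.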



