
I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.


> I do not understand why the scrapers do not do it in a smarter way

If you mean scrapers in terms of the bots, it is because they are basically scraping web content generically over HTTP(S), without any protocol-specific optimisations at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, plain web scraping generally proves sufficient, so any push to write git-specific optimisations for the bots would come from academic interest rather than actual need.

If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.

If you mean scrapers in terms of the people writing them, then the fact that plain web scraping is sufficient, as mentioned above, is likely the significant factor.

> why the scrapers do not do it in a smarter way

A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or don't even have the foresight to realise, how much it might inconvenience¹ anyone else.

----

[0] the fact this load might be inconvenient to you is immaterial to the scraper

[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how could the inconvenience little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.


Because that kind of optimization takes effort. And a lot of it.

Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.
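
A rough sketch of the Git-specific part, assuming a hypothetical repo URL and cache path (detecting that a site is a Git web UI in the first place is the hard bit left out here):

    import pathlib
    import subprocess

    # Hypothetical repo URL and cache location, purely for illustration.
    REPO_URL = "https://example.org/project.git"
    LOCAL = pathlib.Path("/var/cache/scraper/project.git")

    def sync_repo():
        if LOCAL.exists():
            # Incremental fetch: only new objects cross the wire.
            subprocess.run(["git", "--git-dir", str(LOCAL), "fetch", "--all", "--prune"],
                           check=True)
        else:
            # Bare mirror clone: every branch, no working tree needed.
            subprocess.run(["git", "clone", "--mirror", REPO_URL, str(LOCAL)], check=True)

    def blame(path, rev="HEAD"):
        # Roughly the same data the /blame web pages render, computed locally
        # against the mirror instead of hammering the web UI.
        out = subprocess.run(
            ["git", "--git-dir", str(LOCAL), "blame", rev, "--", path],
            check=True, capture_output=True, text=True)
        return out.stdout

Run sync_repo() on whatever daily schedule you like and everything blame/log-shaped comes out of the local mirror for free.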

The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.


Because they don't have any reason to give any shits. 90% of their collected data is probably completely useless, but they don't have any incentive to stop collecting useless data, since their compute and bandwidth are completely free (someone else pays for them).

They don't even use the Wikipedia dumps. They're extremely stupid.

Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.


The way most scrapers work (I've written plenty of them) is that you just basically get the page and all the links and just drill down.
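
In toy form it's basically this (a sketch with requests and BeautifulSoup; robots.txt, retries, politeness and so on all omitted):

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=100):
        # Fetch a page, collect its links, queue anything on the same host
        # we haven't seen yet, repeat until the queue (or the budget) runs out.
        host = urlparse(start_url).netloc
        seen, queue = {start_url}, deque([start_url])
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            resp = requests.get(url, timeout=10)
            fetched += 1
            yield url, resp.text
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)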


So the easiest strategy to hamper them if you know you're serving a page to an AI bot is simply to take all the hyperlinks off the page...?

That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
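
A sketch of the server-side bit, assuming BeautifulSoup and an allow-list like /faq purely as an example:

    from bs4 import BeautifulSoup

    def strip_links(html, keep_prefixes=("/faq",)):
        # Replace each <a> with its visible text, except a small allow-list
        # (e.g. the FAQ link mentioned in the banner).
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            if not any(a["href"].startswith(p) for p in keep_prefixes):
                a.replace_with(a.get_text())
        return str(soup)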


When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser's. Obviously this would fail against more advanced bot detection techniques.

Failing that, I would use Chrome, PhantomJS, or similar to browse the page in a real headless browser.
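
The simple cases amounted to little more than something like this (header values are illustrative, not a recipe):

    import requests

    # Illustrative header set copied from a desktop browser; enough to get past
    # naive UA checks, useless against TLS fingerprinting or JS challenges.
    BROWSER_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
    }

    resp = requests.get("https://example.org/", headers=BROWSER_HEADERS, timeout=10)
    print(resp.status_code)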


I guess my point is that since it's a subtle interference that leaves the explicitly requested code/content fully intact, you could just do it as a blanket measure for all non-authenticated users. The real benefit is that you don't need to hide that you're doing it or why...


You could add a feature kind of like "unlocked article sharing" where you can generate a token that lives in a cache so that if I'm logged in and I want to send you a link to a public page and I want the links to display for you, then I'd send you a sharing link that included a token good for, say, 50 page views with full hyperlink rendering. After that it just degrades to a page without hyperlinks again and you need someone with an account to generate you a new token (or to make an account yourself).

Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
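
Conceptually something like this, with an in-memory dict standing in for the cache and the numbers purely illustrative:

    import secrets
    import time

    _tokens = {}  # stand-in for Redis/memcached in a real deployment

    def issue_share_token(views=50, ttl=7 * 24 * 3600):
        token = secrets.token_urlsafe(16)
        _tokens[token] = {"remaining": views, "expires": time.time() + ttl}
        return token

    def should_render_links(token):
        # True means "render this page with hyperlinks"; each hit burns one view.
        entry = _tokens.get(token)
        if not entry or entry["expires"] < time.time() or entry["remaining"] <= 0:
            return False
        entry["remaining"] -= 1
        return True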


I would build a little stoplight status dot into the page header. Red if you're fully untrusted. Yellow if you're semi-trusted by a token, and it shows you the status of the token, e.g. the number of requests remaining on it. Green if you're logged in or on a trusted subnet or something. The status widget would link to all the relevant docs about the trust system. No attempt would be made to hide the workings of the trust system.


And obviously, you need things fast, so you parallelize a bunch!
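
Typically with the usual asyncio/aiohttp pattern, roughly like this (concurrency cap and timeout are arbitrary examples):

    import asyncio
    import aiohttp

    async def fetch_all(urls, concurrency=20):
        # One shared session, a semaphore to cap concurrency, gather() to run
        # the fetches in parallel.
        sem = asyncio.Semaphore(concurrency)
        timeout = aiohttp.ClientTimeout(total=30)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async def fetch(url):
                async with sem:
                    async with session.get(url) as resp:
                        return url, await resp.text()
            return await asyncio.gather(*(fetch(u) for u in urls))

    # pages = asyncio.run(fetch_all(["https://example.org/a", "https://example.org/b"]))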


I was collecting UK bank account sort code numbers (buying a database at the time cost a huge amount of money). I had spent a bunch of time using asyncio to speed up scraping and wondered why it was going so slow: I had left Fiddler profiling in the background.


« mann » comes from Old English and stands for a human being.

« wïfmann », literally "female human", led to « wife ».

« were » means man and comes from Germanic and I don't think « weremann » has ever been a thing.


Blew my mind when I found out “world” is a direct derivative of the word “were”.


“world” is derived from “were” + “eald” (old), and meant “the age of humans”, which was distinguished from the age of the Gods, when the Æsir and Vanir dominated, and the age of the Jötnar.

I find it interesting how the term shifted from a (mythical) temporal concept to a spatial concept, to now often a social concept (e.g. the Fourth World).


Why add the complexity of having to maintain an Ansible installation and a logging stack, deal with their upgrades, and handle whatever Python issue one might encounter? I have had issues with Ansible's builtin `shell` not doing the right thing (sh vs bash) and being unnecessarily slow when uselessly looking up `cowsay`.

Adding layers and layers of tooling is often overkill, and it is hard to beat the simplicity of 33 lines of shell when the use case is a single person doing the code, deployment and maintenance.


I’m with you on the use case. For a simple server deployment on a VM, a bash script is fine; in fact I recommend it. It’s when you start dealing with 5+ VMs that I would start looking into a tool like Ansible.


@unixispower, you might consider adding the site to the TheOldNet webring ( https://webring.theoldnet.com/submit ). I discovered it from a post yesterday about ucanet (a DNS for retro sites). Your site would be an excellent ring member!


Thanks for the heads up. I'll have to make up some banners later and submit. I have a few sites that would go nicely there.


> inability to moderate which is really important for adult content

Given Elon Musk tweeted about moderation being censorship (twist: it is not), what could go wrong!?!


His position was about moderating legal content. He thinks moderating legal content is censorship. He is in favor of taking down illegal content. If people think currently legal content should be taken down, they should appeal to change the laws; it shouldn't be his platform's place, at least, to judge legal content. That's his position, not my position.


That's what he says his position is. He bans people who post the movements of his private jet. He also has had Twitter file lawsuits against people merely for saying mean things about Twitter.

Of course, it's also possible that he's a complete idiot and has no idea what the First Amendment actually permits in speech, like he stopped paying attention after Schenck v US.


Just because he's a hypocrite, doesn't mean we should censor more.


Yeah, that is hypocritical. He should be consistent in his views, and for any ambiguity in content disfavoring him he should lean towards giving the benefit of the doubt, to prove a lack of bias. Although in this case there didn't seem to be any ambiguity: he should not have banned the account.


Filing lawsuits is entirely consistent with a position that the courts, not private companies, should regulate speech online.

Then there's Jack Sweeney, the guy tracking Elon and his private jet - who, by the way, was accused by Taylor Swift of facilitating stalking. For him, X made a policy against any account "doxxing real-time location info of anyone".

Does anyone here actually want to argue that tracking real-time location info should be allowed?


I'll bite.

Real-time tracking of a person or their ground transport: not ok.

Real-time tracking of a person's plane: ok.

And the reason for such a distinction is that planes can only land on specific ground slots. That also means that real-time tracking of a person's helicopter falls under "not ok". And by extension, the same will hold for flying taxis, once they take off[tm].

We don't (yet?) live in a world where shoulder mounted surface-to-air missiles, outside of war zones, are a realistic threat.


Once a stalker knows a specific place and time to find their victim, they can simply follow them until they have an opportunity to do worse.

"Missiles" are just a straw man.

Hopefully Taylor Swift will follow through, Sweeney will be sued or even prosecuted for stalking, and we'll find out if this really is legal.


He just banned posting the identity of pseudonymous accounts which is definitely not illegal. He also banned posting public information about the movements of his private jet which is also definitely not illegal.

This stuff would be easier to take seriously if he were consistent about it. At this point it's kind of insulting.


This is one where “won't someone think of the children” (underage and non-consensual content) is relevant. It tends to be a lot more… universally agreed upon as a limit to whatever you think free speech is.


> Elon Musk tweeted about moderation being censorship

And instead, X has implemented hellbanning, where nobody outside X knows who is being censored and why. People just slowly figure out that, actually, no one sees their posts. At least with outright <scare-quote> "moderation" you would know that you had been cancelled.


I work for Wikimedia, and I was one of the few weirdos assisting in its creation back in the early 2000s. The first time I got exposed to wikis predates Wikipedia, and my stance was:

- a site editable by anyone on the internet? That is never going to fly

A few months later I followed a link to Wikipedia and it clicked: we could definitely build an encyclopedia online, and the hope (I was in my early 20s) or bet was that more people were willing to write articles than willing to deface them. I guess we won that bet by a large margin :-]

Even though I was young, I was not clueless. I was well aware some people would deface it, and it did happen. I have also been involved in two very long and tedious fights with editors having a political agenda, which diverted me from actually writing articles. I think that is the real danger: shifting the focus of people from writing articles toward pointless long discussions.


That burning up of resources of the good guys is exactly why a lot of stuff eventually derails. Just like the fight against the online crime rings that hold people's data ransom and that deface and destroy: the defenders have to succeed all the time, the attackers only have to succeed part of the time to be successful. So over a long enough run the attackers have an edge. Wikipedia is an exception, so far. Enough people cherish it that they are willing to put in the effort. But the day enough of them blink at the same time the assholes will take over. I hope that day will never come, but I'm not sure it will not. On a human scale Wikipedia is still very young.


As a non-US person, I have a couple of questions:

* What is P&A tech?

* How does one retire at 25 when working at Google, which is way past its IPO and the 100x return on stock options that is only possible at the earliest stage?


>How does one retire at 25 when working at Google, which is way past its IPO and the 100x return on stock options that is only possible at the earliest stage?

Parent is not talking about Google employees cashing in on Google stock.

They're talking about acquired-company founders getting the acquisition money from Google and retiring (or having the money to do so) through their "exit", while leaving Google with shitty startup code, created in "startup mode" with no regard for the future, just to ship, patch it to get enough traction, and exit quickly.


Oops, should've been M&A (mergers and acquisitions)


>* How does one retire at 25 when working at Google, which is way past its IPO and the 100x return on stock options that is only possible at the earliest stage?

levels.fyi reports an L4 averages $270k/yr at Google. You can sock away a whole lot of that pretty fast.


$270k in the Bay Area is not retirement at 25 money unless you’re that Googler living in a van in the parking lot, or your retirement plan is living simply in a poor country. It’s a fine living, to be clear, but in a high cost of living area you’re paying high rent until you can buy an expensive house, etc. and the American healthcare system alone means you need to have millions saved as a buffer against illness over that kind of timeframe (kinda hard to re-enter the workforce at 40 with cancer when you realize your cost projections were optimistic).


>It’s a fine living, to be clear, but in a high cost of living area you’re paying high rent until you can buy an expensive house, etc.

You don't need to live in a high cost of living area, except if you're competing American Psycho style.

You could live where the other 95% of people working in the Bay Area live.

I'm pretty sure that the baristas serving those Googlers in cafes outside the Googleplex don't make $270K a year, and still get to work in the same area.

If that means more commuting, that's always an option. People commute 1-2 hours each way to make $50K; I'm pretty sure a 20-something making $270K can handle it.

Heck, even a daily two-way Uber would be totally doable, and with other expenses added they'd still get to save over $150K per year.


Because despite Wikipedia and its sister projects being one of the largest web properties, it runs on a thin budget and has starved engineering resources. As far as I know, the transcode code is maintained by a single employee (possibly as a side gig / on top of everything else) with the assistance of a volunteer.

For Motion JPEG a recent config change ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config... ) indicates:

> Recent versions of iOS can play back suitably packaged VP9 video and Opus or MP3 audio, with a Motion-JPEG low-res fallback for older devices.

So I guess it is there for backward compatibility :)


The problem is not resources. It is an ideological choice. Wikimedia Commons only supports non-proprietary file formats. That means either open formats or formats whose patents have expired. (MPEG-4 Part 2 patents only expired in the US a few weeks ago.)


Why though? They spend $160M a year [1] and grew their cash reserves by 50% year on year in 2023, so they are not exactly running at an operating deficit.

Transcoding is expensive, but not that expensive. My company doesn't make 1/20th of what Wikipedia does, and we can afford to do thousands of hours of transcoding a day; surely they can too.

[1] https://wikimediafoundation.org/wp-content/uploads/2023/11/W...


It's not about the cost of transcoding; it's an ideological stance about open / royalty-free formats.

e.g. there was a pretty strong consensus about not supporting MP4 back when the WMF asked whether it should be allowed, mostly on "it's not free" grounds: https://commons.wikimedia.org/wiki/Commons:Requests_for_comm...


The decision to not supply H.264 is ideological for sure, and I can understand that from the patent perspective, but then they have MP3 (patent-expired in 2017) but not MPEG-2/H.262 (patent-expired in 2018).

Also note that VP8/VP9 is still patented, but just licensed freely. IMHO that's less free than patent-expired (public domain).


My understanding was that H264 is kind of licensed freely too, after Cisco made their agreement usable for everyone?

Firefox can support MP4 with H264 despite its clearly FOSS-aligned goals. I am surprised that Wikipedia, whose goals align more with open information than with open source directly, has challenges here.


Because in the end every organization is vulnerable to being eaten from the inside and worn as a skinsuit by parasites. Especially charities.

Why would they spend money on improvements to the site when they could spend money on other things instead?


It is the easiest line item to spend on.

Wikipedia has one of the best SRE teams, and they are pretty transparent too; a lot of the communication was on IRC channels you could see, at least that was the case a few years back.

Running a top-5 website in the world is no joke, especially as a non-profit, and they do it well. They haven't had any downtime or major incident in the last decade, which is pretty impressive.

I would think their SRE team is not just good but also very motivated by the mission; otherwise they would leave for much higher-paying jobs. Infra jobs are very lucrative if you have prior experience at scale, and there is not much more scale than Wikipedia.


I agree that their SRE team is good, well motivated, and transparent. That does not mean that they are the first priority for resources, or that it's the easiest line item to spend on.


MPEG-2 or even MPEG-4 ASP seems like a better choice for back-compat.


We have had that filed in our bug tracker since July 2022, and when I looked at the issue ( https://phabricator.wikimedia.org/T313114#8093706 ) almost all of the traffic came from a handful of IP addresses. Thus it is most probably some kind of probe, maybe to check whether the internet is reachable.

As a result of the bug above, the entry is filtered out during post-processing, but the page view dataset used for https://pagevews.wmcloud.org/ relies on the raw aggregated data.


Thank you! Cause I have expanded it to Elastic Common Schema! https://www.elastic.co/guide/en/ecs/current/ecs-reference.ht...

