You lose atomic deployment and have a distributed system the moment you ship JavaScript to a browser.
Hell, you lose "atomic" assets the moment you serve HTML that has URLs in it.
Consider switching from <img src=kitty.jpg> to <img src=puppy.jpg>. If you, for example, delete kitty.jpg from the server, upload puppy.jpg, and then change the HTML, a client can still be holding a URL to kitty.jpg after it's gone. Generally, anything you've published needs to stay alive long enough to "flush out the stragglers".
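One common mitigation (a sketch, not something the parent prescribes): content-hash asset filenames at publish time, so old HTML keeps resolving old assets that you simply leave on the server until the stragglers are gone. The file names and the dist/ layout here are illustrative:

```typescript
// Sketch: content-addressed asset names so stale HTML keeps working.
// Paths and the "dist/" layout are illustrative.
import { createHash } from "crypto";
import { readFileSync, copyFileSync } from "fs";
import { basename, extname } from "path";

function publishAsset(srcPath: string, distDir: string): string {
  const hash = createHash("sha256")
    .update(readFileSync(srcPath))
    .digest("hex")
    .slice(0, 8);
  const ext = extname(srcPath);                      // ".jpg"
  const name = basename(srcPath, ext);               // "puppy"
  const hashedName = `${name}.${hash}${ext}`;        // e.g. "puppy.3fa1c2de.jpg"
  copyFileSync(srcPath, `${distDir}/${hashedName}`); // old kitty.<hash>.jpg stays put
  return hashedName; // reference this name in the newly published HTML
}

// New HTML points at puppy.<hash>.jpg; clients holding old HTML still resolve
// kitty.<hash>.jpg until you garbage-collect it once the stragglers are flushed.
console.log(publishAsset("puppy.jpg", "dist"));
```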
They just refresh the page; it's not a big deal. It'll happen on form submission or any navigation anyway. Some people might be caught in a weird invalid state for, like, a couple of minutes at the absolute maximum.
Right, there are levels of solutions. You can't sit here and say that a few seconds of invalid state on the front end, for mayyyyybe 0.01% of your users, is enough to justify a sprawling distributed system because "well, deployments aren't atomic anyway!1!".
IMO, monorepos are much easier to handle. Monoliths are also easier to handle. A monorepo monolith is pretty much as good as it gets for a web application. Doing anything else will only make your life harder, for benefits that are so small and so rare that nobody cares.
Monorepo vs. not isn't the relevant criterion. The difference is simply whether you plan your rollout to have no (or minimal) downtime, or not. Consider an SQL schema migration that adds a non-NULL column on a system that does continuous inserts.
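For that non-NULL-column case specifically, the usual zero-downtime pattern is expand/backfill/contract spread across several deploys rather than one atomic step. A minimal sketch, assuming Postgres via node-postgres; the table and column names ("orders", "region") are made up:

```typescript
// Sketch of a zero-downtime "expand / backfill / contract" migration for
// adding a NOT NULL column while inserts keep flowing.
import { Client } from "pg";

async function migrate(): Promise<void> {
  const db = new Client(); // connection settings come from the environment
  await db.connect();

  // 1. Expand: add the column as nullable. Old app versions keep inserting
  //    rows without it; nothing breaks.
  await db.query(`ALTER TABLE orders ADD COLUMN region text`);

  // (Separately, deploy app code that writes `region` on every new insert.
  //  That deploy is a distinct step in time, which is exactly why the whole
  //  change can't be atomic.)

  // 2. Backfill existing rows in small batches to avoid long-held locks.
  let updated = 1;
  while (updated > 0) {
    const res = await db.query(
      `UPDATE orders SET region = 'unknown'
         WHERE id IN (SELECT id FROM orders WHERE region IS NULL LIMIT 1000)`
    );
    updated = res.rowCount ?? 0;
  }

  // 3. Contract: once no running version writes NULLs, enforce the constraint
  //    (typically in a later deploy).
  await db.query(`ALTER TABLE orders ALTER COLUMN region SET NOT NULL`);

  await db.end();
}

migrate().catch(console.error);
```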
Again, that's trivial if you use up and down servers. No downtime, and to your users, instant deployment across the entire application.
If you have a bajillion services and they're all doing their own thing with their own DB, and you have to reconcile versions across all of them, and you don't have active/passive deployments, then yes, that will be a huge pain in the ass.
So just don't do that. There, problem solved. People need to stop doing microservices, or even medium-sized services. Make it one big ole monolith, maybe two monoliths for long-running tasks.
Magical thinking about monorepos isn't going to make SQL migrations with backfill instantaneous, nor make them happen within the downtime window you have while switching software versions. You're just not familiar with the topic, I guess. That's okay. Please just don't claim the problem doesn't exist.
And yes, it's often okay to ignore the problem for small sites that can tolerate the downtime.
Not sure what GP had in mind, but I have a few reasons:
Cherry-picks are useful for fixing releases or adding changes without having to cut an entirely new release. This is especially true for large monorepos, which may have all sorts of unrelated changes in between. They're a much safer way to "patch" a release, especially if the release process itself is long and you want a limited-scope "emergency" one.
Atomic changes - assuming this is about releases as well: the release processes for the various systems might not be in sync. If a frontend change that uses a new backend feature is released alongside the backend feature itself, you can get version-drift issues unless everything happens in lock-step and you have strong regional isolation. Cherry-picks are a way to work around this, but it's better not to make these changes "atomic" in the first place.
Do you take down all of your projects and then bring them back up at the new version? If not, then you have times at which the change is only partially complete.
I'd accept a somewhat more liberal use of "atomic": if the repo state reflects the totality of what I need both to get to the new version AND to return to the current one, then I have everything I need from a reproducibility perspective. Human actions could be allowed in this, as long as they're fully documented. I'm not a purist, obviously.
Blue/green might allow you to do (approximately) atomic deploys for one service, but it doesn't allow you to do an atomic deploy of the clients of that service as well.
Why is that? In a very simple case, all services of a monorepo run on a single VM: spin up a new VM, deploy the new code, verify, switch routing over. Obviously this doesn't work for humongous systems, but the idea can be expanded upon: make sure components only communicate with compatible versions of other components, and don't break the database schema in a backward-incompatible way.
So yes, in theory you can always deploy sets of compatible services, but it's not really workable in practice: you either need to deploy the world on every change, or you need complicated logic to determine which services are compatible with which deployment sets of other services.
There's a bigger problem though: in practice there's almost always a client that you don't control, and can't switch along with your services, e.g. an old frontend loaded by a user's browser.
The notion of external clients is a smell. If that's the case, you need a compat layer between that client and your entrypoints, otherwise you'll have a very hard time evolving anything. In practice, this can include providing frontend assets under previously cached endpoints; a version endpoint that triggers cache busting; a load balancer routing to a legacy version for a grace period… sadly, there's no free lunch here.
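As a rough sketch of the "version endpoint that triggers cache busting" idea: the client bakes in its build id and forces a reload when the server reports a newer one. The /version path and BUILD_ID constant are assumptions for illustration, not anything from the thread:

```typescript
// Browser-side sketch of a version check that busts stale frontends.
// The /version endpoint and BUILD_ID constant are illustrative.
const BUILD_ID = "2024-06-01T12:00:00Z"; // injected at build time

async function checkForNewDeploy(): Promise<void> {
  try {
    const res = await fetch("/version", { cache: "no-store" });
    const { buildId } = (await res.json()) as { buildId: string };
    if (buildId !== BUILD_ID) {
      // A newer deploy is live: reload to pick up fresh assets.
      window.location.reload();
    }
  } catch {
    // Network hiccup: try again on the next interval rather than failing hard.
  }
}

// Poll occasionally; a real setup might instead check on navigation, or on
// specific API error codes that signal "client too old".
setInterval(checkForNewDeploy, 60_000);
```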
The only way I could read their answer as being close to correct is if the clients they're referring to are not managed by the deployment.
But (in my mind) even a front end is going to be told it's out of date/unusable and needs to be upgraded the next time it interacts with the service. That means it will have to upgrade, which isn't "atomic" in the strictest sense of the word, but it's as close as you're going to get.
If your monorepo compiles to one binary on one host then fine, but what do you do when one webserver runs vN, another runs v(N-1), and half the DB cluster is stuck on v(N-17)?
A monorepo only allows you to reason about the entire product as it should be. The details of how to migrate a live service atomically have little to do with how the codebase migrates atomically.
That's why I mention having real stable APIs for cross-service interaction, as you can't guarantee that all teams deploy the exact same commit everywhere at once. It is possible but I'd argue that's beyond what a monorepo provides. You can't exactly atomically update your postgres schema and JavaScript backend in one step, regardless of your repo arrangement.
Adding new APIs is always easy. Removing them not so much since other teams may not want to do a new release just to update to your new API schema.
But isn't that a self-inflicted wound then? I mean is there some reason your devs decided not to fix the DB cluster? Or did management tell you "Eh, we have other things we want to prioritize this month/quarter/year?"
This seems like simply not following the rules of having a monorepo, because the DB cluster is not running the version in the repo.
Maybe the database upgrade from v(N-17) to v(N-16) simply takes a while, and hasn't completed yet? Or the responsible team is looking at it, but it doesn't warrant the whole company to stop shipping?
Being 17 versions behind is an extreme example, but always having everything run the latest version in the repo is impossible, if only because deployments across nodes aren't perfectly synchronised.
This is why you have active/passive setup and you don't run half-deployed code in production. Using API contracts is a weak solution, because eventually you will write a bug. It's simpler to just say "everything is running the same version" and make that happen.
Each deployment is a separate "atomic change". So if a one-file commit downstream affects 2 databases, 3 websites, and 4 APIs (made-up numbers), then that is actually 9 different independent atomic changes.
Spent the last three months building a competitor/lookalike ML model + API. Started using plain embedding similarity and quickly realized you end up with results as noisy as ocean.io's. Ended up using similarity learning, which works quite well with little data. Launched this as an API and a small web app. The hardest part right now is fending off scrapers, honestly.
I don't understand why so many VCs fall for "not invented here". Vibe-coded or not, this is just another in-house solution, inferior and more expensive than most out-of-the-box products already out there.
(Author here) Maybe you missed the point of what I wrote. I thought the disclaimer made it clear this is just a tiny project for 3 users only and not something meant to scale :) Is my product inferior to Notion, Slack, etc.? OF COURSE. Do I use Notion extensively? Fuck no. I'm more of a Bear (now Craft!) user, but I needed Notion for a handful of tiny features that Tiptap now gives me. So should I pay $60 per seat for the little I need, and miss out on the fun of building my own tool? I think not. But hey, that's just me :)
It's in Hongdae ("Hongdae T Stay" on Booking). I paid ₩39,000 a night (~23 euros) for a 25-day stay.
I don't think people stay long there, it was mostly foreigners. The bed is like a plank, there is no window. But it was cool, I enjoyed it :)
Why GPT-based then? There are libraries that do this: you give examples, they generate the rules for you, and they give you a scraper object that takes any HTML and returns the scraped data.
Great projects, thank you for the links.
On a brief scan, neither covers paging/loops, or JS frameworks where one would need to use a headless browser and wait for content to load - which is where a low-code/lazy solution might provide the most added value.
Has it? Can you give me an example of a site that is hard to scrape by a motivated attacker?
I'm curious, because I've seen stuff like the above, but of course it only fools a few off-the-shelf tools; it does nothing if the attacker is willing to write a few lines of node.js.
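For instance, a sketch along these lines (the URL and usage are placeholders) renders the page in headless Chromium via Puppeteer and reads the visible text, which sidesteps most markup-level obfuscation:

```typescript
// "A few lines of node.js": render the page in a headless browser and read
// the visible text, which ignores most CSS/markup obfuscation tricks.
import puppeteer from "puppeteer";

async function scrapeVisibleText(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" }); // wait for JS-rendered content
  const text = await page.evaluate(() => document.body.innerText); // visible text only
  await browser.close();
  return text;
}

scrapeVisibleText("https://example.com/listing").then(console.log);
```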
I guess the lazy way to prevent this in a foolproof manner is to add an OCR step somewhere in the pipeline and use actual images rendered from the websites. Although maybe then you'll get #010101 text on a #000000 background.
Personally, this feels like the direction scraping should move into. From defining how to extract, to defining what to extract. But we're nowhere near that (yet).
A few other thoughts from someone who did his best to implement something similar:
1) I'm afraid this is not even close to cost-effective yet. One CSS rule vs. a whole LLM. A first step could be moving the LLM to the client side, reducing costs and latency.
2) As with every other LLM-based approach so far, this will just hallucinate results if it's not able to scrape the desired information.
3) I feel that providing the model with a few examples could be highly beneficial, e.g. /person1.html -> name: Peter, /person2.html -> name: Janet. When doing this, I tried my best at defining meaningful interfaces.
4) Scraping has more edge-cases than one can imagine. One example being nested lists or dicts or mixes thereof. See the test cases in my repo. This is where many libraries/services already fail.
If anyone wants to check out my (statistical) attempt to automatically build a scraper by defining just the desired results:
https://github.com/lorey/mlscraper
I was most worried about #2 but surprised how much temperature seems to have gotten that under control in my cases. The author added a HallucinationChecker for this but said on Mastodon he hasn't found many real-world cases to test it with yet.
Regarding 3 & 4:
Definitely take a look at the existing examples in the docs, I was particularly surprised at how well it handled nested dicts/etc. (not to say that there aren't tons of cases it won't handle, GPT-4 is just astonishingly good at this task)
Your project looks very cool too btw! I'll have to give it a shot.
This seems like part of the problem we're always complaining about where hardware is getting better and better but software is getting more and more bloated so the performance actually goes down.
Yeah, seems like it would make way more sense to have an LLM output the CSS rules. Or maybe output something slightly more powerful, but still cheap to compute.
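A hedged sketch of that split: ask the model once for plain CSS selectors from an example page, then apply them cheaply with an ordinary HTML parser on every later page. How the selectors come back from the model is left abstract here; the field/selector pairs are invented, and cheerio handles the cheap part:

```typescript
// Sketch: pay for the LLM once (to derive selectors from one example page),
// then scrape every later page with just an HTML parse + selector lookups.
import * as cheerio from "cheerio";

type SelectorMap = Record<string, string>; // field name -> CSS selector

// Build a cheap scraper from selectors the model emitted once.
function buildCheapScraper(selectors: SelectorMap) {
  return (html: string): Record<string, string> => {
    const $ = cheerio.load(html);
    const out: Record<string, string> = {};
    for (const [field, selector] of Object.entries(selectors)) {
      out[field] = $(selector).first().text().trim();
    }
    return out;
  };
}

// Hypothetical selectors an LLM might return for a product page.
const scrape = buildCheapScraper({
  name: "h1.product-title",
  price: "span.price",
});
console.log(
  scrape("<h1 class='product-title'>Kitty Mug</h1><span class='price'>$9</span>")
);
```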
I organised many events when I was in school. When you're doing things for fun and with a sense of community, it takes a couple of stings like this before you start taking contracts more seriously.
I'm using a monorepo for my company across 3+ products and so far we're deploying from stable release to stable release without any issues.