The underlying problem is that most container images are not cache efficient. Compressed tarballs aren't, and that's what most container images are. And Bazel relies heavily on caching to stay fast.
Most of the hyperscalers actually do not store container images as tarballs at scale. They usually flatten the layers and either cache the entire file system merkle tree, or break it down into even smaller blocks to cache them efficiently. See Alibaba's Nydus, AWS Firecracker, etc… There are also various forms of snapshotters that can lazily materialize the layers, like estargz, soci, nix, etc… but none of them are widely adopted.
Without indicating my personal feelings on monorepo vs polyrepo, or expressing any thoughts about the experience shared here, I would like to point out that open-source projects have different and sometimes conflicting needs compared to proprietary closed-source projects. The best solution for one is sometimes the extreme opposite for the other.
In particular, many build pipelines involving private sources or artifacts become drastically more complicated than those of their publicly available counterparts.
The thing that nobody tells junior programmers, but which you really have to pick up from experience, is this:
"The right final architecture" is never achieved by just immediately going out and building that architecture, i.e. by hooking up all the tools required to support that architecture. That's cargo-culting the architecture.
Facebook's use of Cassandra and CI lint-checks and blue-green deployments is just like military cargo planes' use of radio towers — they didn't build those first; they scaled the thing they were doing to the point that these things became necessary support structures, and then they built them.
The "right way" — the right process for engineering a solution — has very little to do with up-front architectural design. The "right way" — the way that'll be most likely to get you to that "right final architecture" eventually — is really the tenable way: the iterative approach that allows you to build up your solution while keeping only one change or consideration in your head at a time. Which means that engineering "the right way" involves not doing all those cargo-cult practices unless/until they become necessary, and even then, only adopting them one at a time. Just like you wouldn't try to make ten different refactorings in a codebase in one patch-set.
Or, to put that another way: YAGNI applies to processes and tools just as much as it does to code. Some projects never exceed 1000 lines. Do those projects need CI cyclomatic-complexity checkers? No.
Only introduce support structures to a project as the pain of not having them starts to outweigh the pain of adding them.
This is simple enough if you're okay with intermediary files littering the entire codebase. Targets and their intermediary files will be written right next to their sources and the whole tree will be a huge mess.
The makefile quickly gets complicated if one wants an organized source tree such as:
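For example, something like this (an illustrative layout; the directory names are arbitrary):

    project/
      src/        sources only, never written to by the build
        app/
        lib/
      build/      all generated objects and binaries land here
        obj/
        bin/

Out-of-tree outputs like this mean every rule has to map sources like src/foo.c to outputs like build/foo.o (or the equivalent for your language) and create the output directories, which is where the pattern rules, vpath tweaks, and mkdir -p recipes start to pile up.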
I posted this hours ago and then stepped away. The story captures so much about the Lee I knew so well. I'll add one piece of praise for Lee's early architecture of Cloudflare.
Everything was controlled by a single Postgres database that made very heavy use of stored procedures, that called other procedures, that called others. It was one giant program inside the database. It took me a while to comprehend what he'd done but it was really great. The database ran everything and all those functions made sure that audit logs were kept, that the calls were allowed for the user ID being passed in, and some of these procedures made external calls to APIs including getting things like SSL certificates.
It was a magnificent monolith inside a database.
I worked on the periphery of the database (it was truly Lee's domain) and he'd tell me what output to expect or API to create, and I'd code to his spec and we'd just hook it up.
If any single artefact represents what he did at Cloudflare, it's that database. And he used to code it on a laptop we called "The Beast" because it was so crazily heavy and overloaded with memory etc. that he effectively carried a mini test Cloudflare around wherever he went.
>The conclusion I draw from this is that you can only really use Git if you understand how Git works.
I say this as someone who uses git regularly, and who prefers it to all other version control systems I have tried:
A tool that breaks the principle of encapsulation by forcing you to grok its internals if you are to have any hope of understanding its arcane and inconsistent usage syntax is frankly not a very good tool.
By contrast, I don't understand how vim works beyond the base conceptual level (keypress goes in, character shows up on screen or command is executed) and yet I don't have any trouble using it. I don't need to know vim's internals to use it effectively. Vim is a good tool.
Whatever code we write as developers is only the very tip of the iceberg of software that comprises any substantial application. Controlling what goes into that iceberg, and how it's assembled, is an essential part of the engineering of software. The details and quality of your build system determine the composition and construction of that iceberg, not to mention the reliability and velocity of your development process.
Even 'basic' local build systems like CMake, maven/gradle/ivy, sbt, lein, cargo, go, ... bridge dependency management and task execution. They decide what goes into the software artifact you ultimately distribute (or deploy), and how that's assembled.
At the scale of buck, bazel, ... tools of that shape are necessary to make forward progress in a codebase that's composed of internal dependencies that are managed by different teams, written in different languages, targeting a variety of environments, that are so numerous they require distribution to complete in reasonable timeframes, and require absolute reproducibility.
I'm not a VS/C# user, but MSBuild is definitely a build system, and both it and the developer definitely have to care about these complexities, even if they come under the heading of "IDE" instead of "Build System".
Also:
As the joke in my first comment implied, if you can't identify the build system, you're probably the build system.
It seems the pro-Jira comments here are basically, "It's fine if your organization isn't a mess and you don't abuse it."
That's the problem with any tool, framework, language, etc. where the defenders' main argument is, "It's great as long as you don't abuse it," or, to put it more bluntly, "It's not the tool, it's your organization." Over and over and over again, if an organization is allowed to abuse and misuse something, it eventually will.
Good tooling has opinions and gives you guardrails so you don't hang yourself.
(for context - I'm not interested in first class node support)
This seems pretty cool. I particularly like how 'gradual' it seems to be relative to things like Bazel, i.e. you can take some shell scripts and migrate things over. I did have a play and hit an initial problem around project caching I think, which I raised at [0].
One comment, from the paranoid point of view of someone who has built distributed caching build systems before is that your caching is very pessimistic! I understand why you hash outputs by default (as well as inputs), but I think that will massively reduce hit rate a lot of the time when it may not be necessary? I raised [1].
Edit: for any future readers, I spotted an additional issue around the cache not being pessimistic enough [3]
As an aside, I do wish build systems moved beyond the 'file-based' approach to inputs/outputs to something more abstract/extensible. For example, when creating docker images I'd prefer to define an extension that informs the build system of the docker image hash, rather than create marker files on disk (the same is true of initiating rebuilds on environment variable change, which I see moon has some limited support for). It just feels like language agnostic build systems saw the file-based nature of Make and said 'good enough for us' (honorable mention to Shake, which is an exception [2]).
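To illustrate the kind of abstraction I mean (a purely hypothetical sketch, not the API of moon, Shake, or any other tool), an input could be anything that can report a stable digest, with files being just one implementation:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "os"

        "github.com/google/go-containerregistry/pkg/crane"
    )

    // Input is anything that can report a stable digest. The build system
    // would only care about the digest, not where it came from.
    type Input interface {
        Key() string
        Digest() (string, error)
    }

    // FileInput is the classic file-on-disk input.
    type FileInput struct{ Path string }

    func (f FileInput) Key() string { return "file:" + f.Path }
    func (f FileInput) Digest() (string, error) {
        b, err := os.ReadFile(f.Path)
        if err != nil {
            return "", err
        }
        sum := sha256.Sum256(b)
        return hex.EncodeToString(sum[:]), nil
    }

    // ImageInput resolves a tag to its manifest digest in the registry, so a
    // target depending on it is re-run when the image changes, with no marker
    // file on disk required.
    type ImageInput struct{ Ref string }

    func (i ImageInput) Key() string { return "image:" + i.Ref }
    func (i ImageInput) Digest() (string, error) {
        return crane.Digest(i.Ref)
    }

    func main() {
        inputs := []Input{FileInput{Path: "Dockerfile"}, ImageInput{Ref: "alpine:3.19"}}
        for _, in := range inputs {
            d, err := in.Digest()
            fmt.Println(in.Key(), d, err)
        }
    }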
This is one of my absolute favorite topics. Pardon me while I rant and self-promote :D
Dockerfiles are great for flexibility, and have been a critical contributor to the adoption of Docker containers. It's very easy to take a base image, add a thing to it, and publish your version.
Unfortunately Dockerfiles are also full of gotchas and opaque cargo-culted best practices to avoid them. Being an open-ended execution environment, it's basically impossible to tell even during the build what's being added to the image, which has downstream implications for anybody trying to get an SBOM from the image for example.
Instead, I contribute to a number of tools to build and manage images without Dockerfiles. Each of them is less featureful than Dockerfiles, but because they're more constrained in what they can do, you get a lot more visibility into what they're doing, since they're not able to do "whatever the user wants".
1. https://github.com/google/go-containerregistry is a Go module for interacting with images in the registry, in tarballs and layouts, and in the local docker daemon. You can append layers, squash layers, modify metadata, etc. This library is used by all kinds of stuff, including buildpacks, Bazel's rules_docker, and all of the below, to build images without Docker (a small sketch of using it appears after this list).
2. crane is a CLI that uses the above (in the same repo) to make many of the same modifications from the commandline. `crane append` for instance adds a layer containing some contents to an image, entirely in the registry, without even pulling the base image.
3. ko (https://ko.build) is a tool to build Go applications into images without Dockerfiles or Docker at all. It runs `go build`, appends that binary on top of a base image, and pushes it directly to the registry. It generates an SBOM declaring what Go modules went into the app it put into the image, since that's all it can do.
4. apko (https://apko.dev) is a tool to assemble an image from pre-built apks, without Docker. It's capable of producing "distroless" images easily with config in YAML. It generates an SBOM declaring exactly what apks it put in the image, since that's all it can do.
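Back to (1): here's a minimal, untested sketch of what using the library looks like, appending a locally built tarball as a new layer and pushing the result without a Docker daemon (the image references and the app-layer.tar path are just placeholders):

    package main

    import (
        "log"

        "github.com/google/go-containerregistry/pkg/crane"
        "github.com/google/go-containerregistry/pkg/v1/mutate"
        "github.com/google/go-containerregistry/pkg/v1/tarball"
    )

    func main() {
        // Pull the base image from the registry.
        base, err := crane.Pull("gcr.io/distroless/static:nonroot")
        if err != nil {
            log.Fatal(err)
        }
        // Wrap a locally built tarball as an image layer.
        layer, err := tarball.LayerFromFile("app-layer.tar")
        if err != nil {
            log.Fatal(err)
        }
        // Append the layer on top of the base image.
        img, err := mutate.AppendLayers(base, layer)
        if err != nil {
            log.Fatal(err)
        }
        // Push the result straight to a registry; no daemon involved.
        if err := crane.Push(img, "registry.example.com/demo/app:latest"); err != nil {
            log.Fatal(err)
        }
    }

This is essentially what `crane append` does for you from the command line.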
Bazel's rules_docker is another contender in the space, and GCP's distroless images use it to place Debian .debs into an image. Apko is its spiritual successor, and uses YAML instead of Bazel's own config language, which makes it a lot easier to adopt and use (IMO), with all of the same benefits.
I'm excited to see more folks realizing that Dockerfiles aren't always necessary, and can sometimes make your life harder. I'm extra excited to see more tools and tutorials digging into the details of how container images work, and preaching the gospel that they can be built and modified using existing tooling and relatively simple libraries. Excellent article!
There are many technical solutions to this problem, as others have pointed out. What I would add is that data at the edge should be considered immutable.
If records are allowed to change, then you end up in situations where changes don't converge. But if you instead collect a history of unchanging events, then you can untangle these scenarios.
Event Sourcing is the most popular implementation of a history of immutable events. But I have found that a different model works better for data at the edge. An event store tends to be centrally localized within your architecture. That is necessary because the event store determines the one true order of events. But if you relax that constraint and allow events to be partially ordered, then you can have a history at the edge. If you follow a few simple rules, then those histories are guaranteed to converge.
Rule number 1: A record is immutable. It cannot be modified or deleted.
Rule number 2: A record refers to its predecessors. If the order between events matters, then it is made explicit with this predecessor relationship. If there is no predecessor relationship, then the order doesn't matter. No timestamps.
Rule number 3: A record is identified only by its type, contents, and set of predecessors. If two records have the same stuff in them, then they are the same record. No surrogate keys.
Following these rules, analyze your problem domain and build up a model. The immutable records in that model form a directed acyclic graph, with arrows pointing toward the predecessors. Send those records to the edge nodes and let them make those millisecond decisions based only on the records that they have on hand. Record their decisions as new records in this graph, and send those records back.
No matter how you store it, treat data at the edge as if you could not update or delete records. Instead, accrue new records over time. Make decisions at the edge with autonomy, knowing that they will be honored within the growing partially-ordered history.
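To make that concrete, here's a minimal Go sketch (my own illustration, not any particular framework's API) of rules 1-3: records are plain immutable values, ordering is expressed only through predecessor references, and identity is just a hash of type, contents, and predecessor set:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "sort"
    )

    // Record is an immutable fact (rule 1): no update, no delete, no timestamps.
    type Record struct {
        Type         string
        Contents     string
        Predecessors []string // hashes of predecessor records (rule 2)
    }

    // Hash derives the record's identity from its type, contents, and
    // (order-independent) set of predecessors (rule 3). Two records with
    // the same fields are the same record; there is no surrogate key.
    func (r Record) Hash() string {
        preds := append([]string(nil), r.Predecessors...)
        sort.Strings(preds) // it's a set of predecessors, so order must not matter
        h := sha256.New()
        fmt.Fprintf(h, "%s|%s|%v", r.Type, r.Contents, preds)
        return hex.EncodeToString(h.Sum(nil))
    }

    func main() {
        order := Record{Type: "OrderPlaced", Contents: `{"sku":"A-42","qty":1}`}
        shipped := Record{
            Type:         "OrderShipped",
            Contents:     `{"carrier":"ups"}`,
            Predecessors: []string{order.Hash()}, // explicit ordering, only where it matters
        }
        fmt.Println(order.Hash())
        fmt.Println(shipped.Hash())
    }

Because identity is derived entirely from the contents and the predecessor set, two edge nodes that independently record the same fact end up with the same record, and their partially-ordered histories (the DAG) merge without coordination.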
> And if there's any fault in the language chosen for AP computer science, the blame should go to the designer of the curriculum.
Java was recommended for AP CS 22 years ago. At this point I lay less fault at their feet and am more interested in why it's still there after all this time.
> If by blockly you mean the visual interface, what you said is going to be true for any language.
My research isn't published yet, but I've found this is not necessarily true. This is a product of taking a thing built not for kids but for professional programmers, and then forcing kids to use it. The problem with Java is that it's the kind of system where you need to know everything about it to use it at all. Going from 0 to "hello world" requires having over a dozen distinct Java concepts explained.
    public class MyFirstJavaProgram {
        public static void main(String []args) {
            System.out.println("Hello World");
        }
    }
We've got access modifiers, classes, scopes, lifetimes, variable types, void return types, methods and in particular the main method, String types, arrays of strings, input arguments to the main method, the System class, output streams, function calls, string literals, and to top it all off a good old-fashioned statement terminator as the cherry on top of that feature sundae. The great irony is that the AP Computer Science Ad Hoc Committee calls this behemoth "simple".
That's lesson one, and all it gets you is "Hello World". Most instructors handle this as "Ignore all the things you're writing down, we'll talk about them over the course of the next 3 months." Because that's how long it really takes.
This leaves students in a perpetual state of confusion, where they feel like they never really understand what's going on. Everything is a mysterious incantation and nothing really makes sense. Do the wrong thing and it tells you it's wrong, but you won't understand why.
Students spend a lot of time memorizing "static void pub main()"... or was it public void static main? Or static public void class? I dunno, let's consult the spellbook. Because on day one their instructor said "ignore all of that, it's too complicated for you." Well students do ignore all of that, and it teaches exactly the wrong lesson -- they don't learn Java or programming, they learn that programming is hard and not for them yet. It's beyond them, but they're going to do it, and maybe one day they will understand, but not today. Today just write the code and run it.
At this point, most students give up. And I will say that yes, this is Java's fault. Not all programming languages require you to learn a dozen very deep and nuanced concepts to get to "Hello World".
Here's the ideal Hello World program for students and honestly for me too:
"Hello World"
That's it. That's the ideal. You get that in languages like Matlab, and I've had great success teaching robotics using Matlab. With Matlab, students are writing real robot programs within a few hours, whereas with Java students are still raising their hand asking "Wait, so what's the difference between an object and a class? What's a method versus a function? What's static mean? What does void mean? What does []args mean?"
It's my contention that the design of the Java language frequently turns away all but the most technically minded students at precisely the time when they are the most impressionable when it comes to their perceptions of computing, and we're all worse off for it.
Well yes, when you have a project that does 99% of what you need, you add the other 1% to the project instead of starting a whole new project. This is just how all software evolves.
Yes. There are lots of specific technical requirements from the FCC on this. First, even if the phone is locked, calls to 911 have to work. If there's no SIM card, calls to 911 have to work. For 911 calls, the phone's transmitter goes to full power and the receive side will attempt to connect even if the signal is too weak. If you're subscribed to one carrier and they're down, the phone has to try other carriers in range. If no talk channel is available, the cell site has to free one up, kicking off a non-emergency call if necessary. If the billing system is down in the cellular system, the call has to go through anyway. For newer technologies, VOIP has to support 911, with location info.
"Oh, we decided to divert all calls to Teams first" is just not going to fly.
Name field is a blob that can contain a vector graphic with animation capability, or an audio file. Each individual can have 0 to n names. For each name, allocate a canvas for the image and scale the name image into it, or present audio controls. A person with no name may be identified by pronouns.
If someone manages to change their name to a taste, smell, or tactile sensation, you may need to revise your system again.
A deaf person may consider their name to be a series of hand movements. A mariachi may identify by their signature grito. A luchador may be known mainly by the pattern on their mask. A corporation has its logo, trademarks, and audio marks ("by Mennen", Meow Mix jingle, etc.). A tiny purple musician could switch to an unpronounceable symbol, to protest something in their recording contract. Clowns are identified by their egg.
If your name is a million Unicode characters long, you might need zooming and panning capability for your name canvas.
You are absolutely right about the main motivation of using a monorepo: allowing upstream library maintainers to see downstream usage of their code and make the required downstream changes themselves at the same time they change their libraries.
Also, like you say, the easiest way to get those advantages is to just check out the monorepo locally, so if there are no other reasons preventing you from doing just that, go for it.
However there are a few reasons why this is not always sufficient:
Size:
The repo might be so large that cloning it all makes local tools (git CLI, GUIs, ...) slow to use, or in the most extreme case requires too much disk space for your machine. To address this there are some git-native tools like partial clone and sparse checkout, so size alone is not really the main issue for us.
History "pollution":
Having a lot of somewhat loosely related projects in one tree means a history that shows all the changes. Yes, git can filter them, but that might again be a performance concern; still, it's not really the biggest motivation to create a new approach/tool.
Permissions:
In some organisations (like the one I work for) it is not possible to give all developers access to all the code, and thus the advantages of a monorepo get lost just by trying to comply with data protection standards. The only solution with native git is to split the repo at legal (not necessarily technical) boundaries and try to coordinate the changes across those, losing most of the benefits described.
Josh does not have a full blown permissions system yet, but the concept certainly allows for it and implementation is work in progress.
Sharing with others (aka, distributed VCS):
This is the biggest motivation for using something like Josh. The partial repos are repos in their own right, and all the distributed features of git can be used with them. In a monorepo setup as you describe, the distributed workflow is sacrificed for monorepo advantages. Only developers in the same monorepo see the same sha1s and can easily exchange changes.
In Josh the same library can be part of different monorepos at different organisations and while the monorepos have different history and therefore sha1s, the “projected” or “partial” library subrepos will have compatible history with identical sha1s.
In this way Josh can serve as a bridge between organisations using different repo structures.
The disconnect is that you have an opinion, and I'm trying to inform you that the effects of nationwide economic policy are a legitimate discipline of study.
This isn't about personal gumption and rugged individualism or some self-help personal bootstrap story. It's an economic policy question.
Here's an example.
Let's say there's a formula for success rate; here we'll define success as generating more revenue than cost for the state over the lifetime of an individual.
A policy proposal, say your rugged individualism, has a given success rate, return on aggregate investment, confounding variables that can affect its outcome, etc. You can typify the classes of failures and bucket them into categories that can be characterized by cost, which is up for definition. It could include a loss of potential economic output for instance.
That's what we're talking about here. It's not story time, it's math, data collection and stats time.
The minimum wage question is "should it be legal to compensate a full-time worker at a rate where their wages cannot cover some definition of sustenance? And if so, how will the remaining slack be accounted for?", not whether someone can visualize their success and follow some 12 rules for life or whatever. That isn't relevant.
Quantitative economics used to be big in the US, kinda made it a world power. If you read newspapers from the 1930s it becomes obvious fast. How it got replaced with policy based on quite literally narrative fiction is for another conversation. It wasn't a good move, folklore is a terrible substitute for science.
Many IT workers are smug. Programmers, network admins, DBAs, sysadmins, architects. They all think they know something that other people don't (which is sometimes true!) and then transmute that feeling into a sense of superiority, which funnily enough ends up limiting them.
All programming languages are garbage to me now. Even my favorite languages. Most systems, networks, databases, are garbage to me. Frameworks, architectures, patterns, paradigms, protocols, standards, conventions..... all garbage.
I now see the whole system as like a municipal waste treatment plant. If we work really hard, the highest thing we can aspire to is to prevent waves of unprocessed shit from exploding out a release valve into a nearby stream. Our work should be unglamorous and practical, because ultimately someone's going to have to drink our water, and making sure it doesn't have shit in it should be our highest priority.
Let me propose a maturity scale and suggest where things are...
Level 1) Are you recording metrics?
That's the first, basic level of maturity: without recording data you have nothing with which to support any decisions. Without data you are blind. The assumption here is that you are able to record everything that you need to... recording 3 metrics isn't enough; you need to be able to get to all your data and record it over time.
Level 2) Are you able to visualise your raw data?
This is the basic dashboard. Whether you build something pretty or use an existing software package it's just the ability to see the data that you are collecting. This relies on people to figure out the inter-connectedness of data and to make inferences from it. Basic graphs and trend analysis fit in here.
Level 3) Are you able to visualise insights in your data?
This is about changing your dashboard or reports such that you can start to consider metrics that are abstractions from the raw data but important to you. Think of cohort charts, revenue relationships to user behaviour, UX funnels, engagement charts, etc.
Level 4) Are you able to record exceptions in your reporting?
As in, can you create thresholds against a metric such that if the metric goes above or below the threshold an event is triggered to record it (a minimal sketch of this appears at the end of this comment). Exception tracking coupled with the insights in level 3 should give you the ability to do things like release a new feature and see the resulting effect on certain metrics and be alerted when those effects are outside of some normal range.
Level 5) Are you able to visualise your exceptional events?
And here we are at the newsfeed of data, the events which stand-out are highlighted and all of the other noise drops away. This is great for producing a list of things you might want to know or react to... but you can't get here without going through (and still having) levels 1-4 and it's important to understand that different people in your organisation still need all those prior levels (try telling your sysadmin that he can't have log files or munin and that instead he's got a news feed).
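Coming back to level 4, here's a minimal sketch of the threshold-to-event idea (illustrative only; the metric name and values are made up):

    package main

    import (
        "fmt"
        "time"
    )

    // Threshold is an acceptable band for one metric (level 4): readings
    // outside [Low, High] become exception events.
    type Threshold struct {
        Metric    string
        Low, High float64
    }

    // Event is the recorded exception that feeds the level-5 "newsfeed".
    type Event struct {
        Metric string
        Value  float64
        At     time.Time
    }

    // Check returns an event when the reading falls outside the band.
    func (t Threshold) Check(value float64, at time.Time) (Event, bool) {
        if value < t.Low || value > t.High {
            return Event{Metric: t.Metric, Value: value, At: at}, true
        }
        return Event{}, false
    }

    func main() {
        t := Threshold{Metric: "signup_conversion", Low: 0.02, High: 0.10}
        // A reading taken after a feature release; it's outside the band,
        // so it is recorded rather than lost in the dashboard noise.
        if ev, ok := t.Check(0.015, time.Now()); ok {
            fmt.Printf("exception: %s=%.3f at %s\n", ev.Metric, ev.Value, ev.At.Format(time.RFC3339))
        }
    }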
Indeed, do not let perfect be the enemy of better. The best way to improve transportation in SF is to tell everyone in a car to fuck right off, and run a bus both ways down every street every 3 minutes. No rails required.
"I love the idea of moving entirely to just one graphics API and keeping the server simple"
I'm all for that, if and only if Wayland doesn't force me to give up what I consider to be important features of X.
Apart from the aforementioned remote access, a post titled "Why I'm not going to switch to Wayland yet"[1] goes into some requirements that I also find important:
- Third party screen shot/capture/share (shutter, OBS, ffmpeg, import, peek, scrot, VNC, etc.)
- Color picker (gpick, gcolor3, kcolorchooser)
- xdotool
That post is a couple of years old now, and I've been told in other HN threads on Wayland that some of this stuff is being worked on now, but until it's all there and it is actually mature and full-featured, I would not willingly switch to Wayland.
1) the insistence that every regulation scale perfectly and apply in all cases is the root of the problem
2) I did point out that each regulation, taken in isolation, looks like a good idea: you're illustrating that with your objection
3) There will always be a gap between reasonable accommodation and perfect accommodation, and what's reasonable is always subject to some debate: the effect of this is to push the reasonable further out into the realm of the unreasonable, in pursuit of the perfect
After all, no one wants to strand a wheelchair user with nowhere to pee. Certainly I don't.
But no one, wheelchair or no, can enjoy a coffeeshop which never existed because the starting costs were too high.
> Background: Despite veterans' preference hiring policies by law enforcement agencies, no studies have examined the nature or effects of military service or deployments on health outcomes. This study will examine the effect of military veteran status and deployment history on law enforcement officer (LEO)-involved shootings.
> Methods: Ten years of data were extracted from Dallas Police Department records. LEOs who were involved in a shooting in the past 10 years were frequency matched on sex to LEOs never involved in a shooting. Military discharge records were examined to quantify veteran status and deployment(s). Multivariable logistic regression was used to estimate the effect of veteran status and deployment history on officer-involved shooting involvement.
> Results: Records were abstracted for 516 officers. In the adjusted models, veteran LEOs who were not deployed were significantly more likely to be involved in a shooting than non-veteran officers. Veterans with a deployment history were 2.9 times more likely to be in a shooting than non-veteran officers.
> Conclusions: Military veteran status, regardless of deployment history, is associated with increased odds of shootings among LEOs. Future studies should identify mechanisms that explain this relationship, and whether officers who experienced firsthand combat exposure experience greater odds of shooting involvement.
One notable example I can think of is accessibility services.
In the US, public transit must accommodate the disabled, and for some types of trips or some types of disabilities there is a totally parallel transit system that involves specialized vehicles, operators, dispatchers to efficiently route vehicles, etc. It's also a massive PITA from the rider's POV, since you have to dial a call center to schedule a day in advance and you get a time window in which the driver will show up. This system dates from the '80s, before the Internet and before taxis were mandated to be accessible.
New York City tried a pilot program in which this system was replaced by subsidizing rideshare rides, since in the 21st century all taxis are required to have accommodations for the disabled anyways and you can leverage a well-tested system of ordering rides instantly and a large fleet of vehicles. While this did reduce per trip costs from $69 to $39, the increased convenience caused ridership to also skyrocket, so it ended up being a net drain on finances. [1] http://archive.is/N3DjJ
I don't think it's true that the US is unprepared. Most major grids have documented procedures for doing a blackstart.
Which would be painful as hell, but hardly "90% of the population dies" level painful. The author seems to be assuming more widespread damage, I guess. In which case a handful of HV transformers aren't likely to be the biggest headache. I'd imagine there is a hell of a lot more sensitive equipment out there than HV transformers: Internet, stock market, comms, etc.
> If it wasn't useful, people wouldn't be paying for the services it provides, they're all optional.
Honestly, in my opinion this could not be a more misinformed take on the reality of banking. Today there exist no alternatives to commercial money (private bank debt), because our governments demand taxes in it.
Finance and banking are also different, yet you seem to use the words interchangeably.
Banking = creation of credit.
Finance (today) is about creating overly complex financial products to take part in the game of high volume automated trading, such as with the use of BlackRock's Aladdin - where the same financial products are sold and resold hundreds of times in an hour. Pure speculation/extraction/Rentierism.
As the comment you replied to wrote:
> The first step to fixing this is to give citizens the ability to opt out of private banks and bank directly with the central bank. Private banks should not be the only ones with this privilege.
"The problem is largely in the system of exchange we call “money,” and in the banks that store and distribute it. Rather than allowing the free exchange of labor and materials for production, our system of banking and credit has acted as a tourniquet on production and a parasite draining resources away.
Genuine economic freedom requires that credit flow freely for productive use. But today, a handful of giant banks diverts that flow into an exponentially-growing self-feeding pool of digital profits for themselves. In the wake of the 2008 financial crisis, much of the global economy has been battling economic downturn, with rampant unemployment, government funding problems, and harsh austerity measures imposed on the people. Meanwhile, the banks that caused this devastation have been bailed out at government expense and continue to thrive at the public trough. All this has caused irate citizens to rise up against the banks, particularly the large international banks. But for better or worse, we cannot do without the functions they perform; and one of these is the creation of “money” in the form of credit when banks make loans.
This advance of bank credit has taken the form of “fractional reserve” lending, which has been heavily criticized. Yet historically, it is this sort of credit created out of nothing on the books of banks that has allowed the wheels of industry to turn. Employers need credit at each stage of production before they have finished products that can be sold on the market, and banks need to be able to create credit as needed to respond to this demand. Without the advance of credit, there will be no products or services to sell; and without products to sell, workers and suppliers cannot get paid.
If banks have an unfair edge in this game, it is because they have managed to get private control of the credit spigots. They use this control not to serve business, industry, and society’s needs but for their private advantage. They can turn credit on and off at will, direct it to their cronies, or use it for their own speculative ventures; and they collect the interest as middlemen. This is not just a modest service fee. Interest has been calculated to compose a third of everything we buy."[1]
The strongest alternative I am seeing emerge at this point is a new distributed, peer-to-peer, cryptographically secured accounting framework/pattern called Holochain. It allows us to rapidly prototype, and start using, new types of mutually sovereign, asset-backed Mutual Credit 'currencies' [2] (wealth-acknowledgement systems), based on productive capacity and measuring this wealth in new ways that aren't possible to integrate with today's money system. This includes the use of reputation currencies (think FairTrade labels, Organic veggie labels, etc.). Building on this are projects like http://valueflo.ws.