I've done a handful of interviews recently where the 'scaling' problem involves something that comfortably fits on one machine. The funniest one was ingesting something like 1gb of json per day. I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.
The wildest part is they’ll take those massive machines, shard them into tiny Kubernetes pods, and then engineer something that “scales horizontally” with the number of pods.
Maybe you are right about kubernetes, I don't have enough experience to have an opinion. I disagree about containers though, especially the wider docker toolchain.
It is not that difficult to understand a Dockerfile and use containers. Containers, from a developer pov, solve the problem of reliably reproducing development, test and production environments and workloads, and distributing those changes to a wider environment. It is not perfect, its not 100% foolproof, and its not without its quirks or learning curve.
However, there is a reason docker has become as popular as it is today (not only containers, but also dockerfiles and docker compose), and that is because it has a good tradeoff between various concerns that make it a highly productive solution.
I can maybe make a case for running in containers if you need some specific security properties but .. mostly I think the proliferation of 'fucked up piles of shit' is the problem.
Containers are just processes plus some namespacing, nothing really stops you from running very huge tasks on Kubernetes nodes. I think the argument for containers and Kubernetes is pretty good owing to their operational advantages (OCI images for distributing software, distributed cron jobs in Kubernetes, observability tools like Falco, and so forth).
So I totally understand why people preemptively choose Kubernetes before they are scaling to the point where having a distributed scheduler is strictly necessary. Hadoop, on the other hand, you're definitely paying a large upfront cost for scalability you very much might not need.
Time to market and operational costs are much higher on kubernetes and containers from many years of actual experience. This is both in production and in development. It’s usually a bad engineering decision. If you’re doing a lift and shift, it’s definitely bad. If you’re starting greenfield it makes sense to pick technology stacks that don’t incur this crap.
It only makes sense if you’re managing large amounts of large siloed bits of kit. I’ve not seen this other than at unnamed big tech companies.
99.9% of people are just burning money for a fashion show where everyone is wearing clown suits because someone said clown suits are good.
Writing software that works containerized isn't that bad. A lot of the time, ensuring cross platform support for Linux is enough. And docker is pretty easy to use. Images can be spun up easily, and the orchestration of compose is simple but quite powerful. I'd argue that in some cases, it can speed up development by offering a standardized environment that can be brought up with a few commands.
Kubernetes, on the other hand, seems to bog everything down. It's quite capable and works well once it's going, but getting there is an endeavor, and any problem is buried under mountains of templatized YAML.
Imagine working an a project for the first time, having a Dockerfile that works or compose file, that just downloads and spins up all dependencies and builds the project succesfully. Usually that just works and you get up and running within 30 minutes or so.
On the other hand, how it used to be: having to install the right versions of, for example redis, postgres, nginx, and whatever unholy mess of build tools is required for this particular hairball, hoping it works on you particular (version) of linux. Have fun with that.
Working on multiple projects, over a longer period of time, with different people, is so much easier when setup is just 'docker compose up -d' versus spending hours or days debugging the idiosyncrasies of a particular cocktail that you need to get going.
Thanks. You’ve reassured me that I’m not going mad when I look at our project repo and seriously consider binning the Dockerfile and deploying direct to Ubuntu.
The project is a Ruby on Rails app that talks to PostreSQL and a handful of third party services. It just seems unnecessary to include the complexity of containers.
I have a lot of years of actual experience. Maybe not as much as you, but a good 12 years in the industry (including 3 at Google, and Google doesn't use Docker, it probably wouldn't be effective enough) and a lot more as a hobbyist.
I just don't agree. I don't find Docker too complicated to get started with at all. A lot of my projects have very simple Dockerfiles. For example, here is a Dockerfile I have for a project that has a Node.JS frontend and a Go backend:
FROM node:alpine AS npmbuild
WORKDIR /app
COPY package.json package-lock.json .
RUN npm ci
COPY . .
RUN npm run build
FROM golang:1.25-alpine AS gobuilder
WORKDIR /app
COPY go.mod go.sum .
RUN go mod download
COPY . .
COPY --from=npmbuild /app/dist /app/dist
RUN go build -o /server ./cmd/server
FROM scratch
COPY --from=gobuilder /server /server
ENTRYPOINT ["/server"]
It is a glorified shell script that produces an OCI image with just a single binary. There's a bit of boilerplate but it's nothing out of the ordinary in my opinion. It gives you something you can push to an OCI registry and deploy basically anywhere that can run Docker or Podman, whether it's a Kubernetes cluster in GCP, a bare metal machine with systemd and podman, a NAS running Synology DSM or TrueNAS or similar, or even a Raspberry Pi if you build for aarch64. All of the configuration can be passed via environment variables or if you want, additional command line arguments, since starting a container very much is just like starting a process (because it is.)
But of course, for development you want to be able to iterate rapidly, and don't want to be dealing with a bunch of Docker build BS for that. I agree with this. However, the utility of Docker doesn't really stop at building for production either. Thanks to the utility of OCI images, it's also pretty good for setting up dev environment boilerplate. Here's a docker-compose file for the same project:
And if your application is built from the ground up to handle these environments well, which doesn't take a whole lot (basically, just needs to be able to handle configuration from the environment, and to make things a little neater it can have defaults that work well for development), this provides a one-command, auto-reloading development environment whose only dependency is having Docker or Podman installed. `docker compose up` gives you a full local development environment.
I'm omitting a bit of more advanced topics but these are lightly modified real Docker manifests mainly just reformatted to fewer lines for HN.
I adopted Kubernetes pretty early on. I felt like it was a much better abstraction to use for scheduling compute resources than cloud VMs, and it was how I introduced infrastructure-as-code to one of the first places I ever worked.
I'm less than thrilled about how complex Kubernetes can be, once you start digging into stuff like Helm and ArgoCD and even more, but in general it's an incredible asset that can take a lot of grunt work out of deployment while providing quite a bit of utility on top.
Is there a book like Docker: The Good Parts that would build a thorough understanding of the basics before throwing dozens of ecosystem brand words at you? How does virtualisation not incur an overhead? How do CPU- and GPU-bound tasks work?
I think the key thing here is the difference between OS virtualization and hardware virtualization. When you run a virtual machine, you are doing hardware virtualization, as in the hypervisor is creating a fake devices like a fake SSD which your virtual machine's kernel then speaks to the fake SSD with the NVMe protocol like it was a real physical SSD. Then those NVMe instructions are translated by the hypervisor into changes to a file on your real filesystem, so your real/host kernel then speaks NVMe again to your real SSD. That is where the virtualization overhead comes in (along with having to run that 2nd kernel). This is somewhat helped by using virtio devices or PCIe pass-through but it is still significant overhead compared to OS virtualization.
When you run docker/kubernetes/FreeBSD jails/solaris zones/systemd nspawn/lxc you are doing OS virtualization. In that situation, your containerized programs talk to your real kernel and access your real hardware the same way any other program would. The only difference is your process has a flag that identifies which "container" it is in, and that flag instructs the kernel to only show/allow certain things. For example "when listing network devices, only show this tap device" and "when reading the filesystem, only read from this chroot". You're not running a 2nd kernel. You don't have to allocate spare ram to that kernel. You aren't creating fake hardware, and therefore you don't have to speak to the fake hardware with the protocols it expects. It's just a completely normal process like any other program running on your computer, but with a flag.
Docker is just Linux processes running directly on the host as all other processes do. There is no virtualization at all.
The major difference is that a typical process running under Docker or Podman:
- Is unshared from the mount, net, PID, etc. namespaces, so they have their own mount points, network interfaces, and PID numbers (i.e. they have their own PID 1.)
- Has a different root mount point.
- May have resource limits set with cgroups.
(And of course, those are all things you can also just do manually, like with `bwrap`.)
There is a bit more, but well, not much. A Docker process is just a Linux process.
So how does accessing the GPU work? Well sometimes there are some more advanced abstractions for the benefit of I presume stronger isolation, but generally you can just mount in the necessary device nodes and use the GPU directly, because it's a normal Linux process. This is generally what I do.
About 25 years here and 10 years embedded / EE before that.
The problem is that containers are made of images and those and kubernetes are incredibly stateful. They need to be stored. They need to be reachable. They need maintenance. And the control responsibility is inverted. You end up with a few problems which I think are not tenable.
Firstly, the state. Neither docker itself or etcd behind Kubernetes are particularly good at maintaining state consistently. Anyone who runs a large kubernetes cluster will know that once it's full of state, rebuilding it consistently in a DR scenario is HORRIBLE. It is not just a case of rolling in all your services. There's a lot of state like storage classes, roles, secrets etc which nothing works if you don't have in there. Unless you have a second cluster you can tear down and rebuild regularly, you have no idea if you can survive a control plane failure (we have had one of those as well).
Secondly, reachability. The container engine and kubernetes require the ability to reach out and get images. This is such a fucking awful idea from a security and reliability perspective it's unreal. I don't know how people even accept this. Typically your kubernetes cluster or container engine has the ability to just pull any old shit off docker hub. That also couples to you that service being up, available and not subject to the whims of whatever vendor figures they don't want to do their job any more (broadcom for example). To get around this you end up having to cache images which means more infrastructure to maintain. There is of course a whole secondary market for that...
Thirdly, maintainance. We have about 220 separate services. When there's a CVE, you have to rebuild, test and deploy ALL those containers. We can't just update an OS package and bounce services or push a new service binary out and roll it. It's a nightmare. It can take a month to get through this and believe me we have all the funky CD stuff.
And as mentioned above, control is inverted. I think it's utterly stupid on this basis that your container engine or cluster pulls containers in. When you deploy, the relationship should be a push because you can control that and mandate all of the above at once.
In the attempt to solve problems, we created worse ones. And no one is really happy.
I get your points but I'm not sure I agree. Kubernetes is a different kind of difficulty but I don't think its so different from handling VM fleets.
You can have 220 vms instead and need to update all of them too.
They also are full of state and you will need some kind of automatic deployment (like ansible) to make it bearable, just like your k8s cluster.
If you don't configure the network egress firewall, they can also both pull whatever images/binaries from docker hub/internet.
> To get around this you end up having to cache images which means more infrastructure to maintain
If you're not doing this for your VMs packages and your code packages, you have the same problem anyway.
> When there's a CVE
If there is a CVE in your code, you have to build all you binaries anyway. If it's in the system packages, you have to update all your VMs. Arguably, updating a single container and making a rolling deployment is faster than updating x VMs. In my experience updating VMs was harder and more error prone than updating a service description to bump a container version (you don't just update a few packages, sometimes you need to go from Centos 5 to Centos 7/8 or something and it also takes weeks to test and validate).
I mostly agree with you, with the exception that VMs are fully isolated from one another (modulo sharing a hypervisor), which is both good and bad.
If your K8s cluster (or etcd) shits the bed, everything dies. The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
As to updating VMs, again IME, typically you’d generate machine images with something like Packer + Ansible, and then roll them out with some other automation. Once that infrastructure is built, it’s quite easy, but there are far more examples today of doing this with K8s, so it’s likely easier to do that if you’re just starting out.
> If your K8s cluster (or etcd) shits the bed, everything dies.
When etcd and/or kubelet shits the bed, it shouldn't do anything other than halt scheduling tasks. The actual runtime might vary between setups, but typically containerd is the one actually handling the individual pod processes.
Of course, you can also run Kubernetes pods in a VM if you want to, there have always been a few different options for this. I think right now the leading option is Kata Containers.
Does using Kata Containers improve isolation? Very likely: you have an entire guest kernel for each domain. Of course, the entire isolation domain is subject to hardware bugs, but I think people do generally regard hardware security boundaries somewhat higher than Linux kernel security boundaries.
But, does using Kata Containers improve reliability? I'd bet not, no. In theory it would help mitigate reliability issues caused by kernel bugs, but frankly that's a bit contrived as most of us never or extremely infrequently experience the type of bug that mitigates. In practice, what happens is that the point of failure switches from being a container runtime like containerd to a VMM like qemu or Firecracker.
> The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
The way I see it, mature code is less likely to have surprise showstopper bugs. However, if we're talking qemu + KVM, that's a code base that is also rather old, old enough that it comes from a very different time and place for security practices. I'm not saying qemu is bad, obviously it isn't, but I do believe that many working in high-assurance environments have decided that qemu's age and attack surface is large enough to have become a liability, hence why Firecracker and Cloud Hypervisor exist.
I think the main advantage of a VMM remains the isolation of having an entire separate guest kernel. Though, you don't need an entire Linux VM with complete PC emulation to get that; micro VMs with minimal PC emulation (mostly paravirtualization) will suffice, or possibly even something entirely different, like the way gVisor is a VMM but the "guest kernel" is entirely userland and entirely memory safe.
I think his point is that instead of hundreds of containers, you can just have a small handful of massive servers and let the multitasking OS deal with it
Containers are too low-level. What we need is a high-level batch job DSL, where you specify the inputs and the computation graph to perform on those inputs, as well as some upper limits on the resources to use, and a scheduler will evaluate the data size and decide how to scale it. In many cases that means it will run everything on a single node, but in any case data devs shouldn't be tasked with making things run in parallel because the vast majority aren't capable and they end up with very bad choices.
And by the way, what I just described is a framework that Google has internally, named Flume. 10+ years ago they had already noticed that devs aren't capable of using Map/Reduce effectively because tuning the parallelism was beyond most people's abilities, so they came up with something much more high-level. Hadoop is still a Map/Reduce clone, thus destined to fail at useability.
Are you saying that running your application in a pile of containers somehow helps that problem ..? It's the same problem as CPU scheduling, we just don't have good schedulers yet.. Lots of people are working on it though
Not really? At the moment it's done by some user-land job scheduler. That could be something container based like k8s, something in-process like ray, or a workload manager like slurm.
This is especially aggravating when the os inside the container and the language runtimes are much heavier than the process itself.
I've seen arguments for nano services (I wouldn't even call them micros services), that completely ignored that part. Split a small service in n tiny services, such that you have 10(os, runtime, 0.5) rather than 2(os, runtime, x).
There is no os inside the container. That's a big part of the reason containerization is so popular as a replacement for heavier alternatives like full virtualization. I get that it's a bit confusing with base image names like "ubuntu" and "fedora", but that doesn't mean that there is a nested copy of ubuntu/fedora running for every container.
To be fair each of those pods can have dedicated, separate external storage volumes which may actually help and it’s def easier than maintaining 200 iscsi or more whatever targets yourself
I recently had to parse 500MB to 2GB daily log files into analytical information for sales. Quick and dirty, the application would of needed 64GB RAM and work laptop only has 48GB RAM. After taking time cleaning it up, it was using under 1GB of RAM and worked faster by only retaining records in RAM if need be between each day.
It is not about what you are doing, it is always about how you do it.
This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would of took over 24 hours of processing time, after moving to semaphores with parallelization it took less than two hours to process all the information.
In interviews just give them what they are looking for. Don't overthink it. Interviews have gotten so stupidly standardized as the industry at large copied the same Big Tech DSA/System Design/Behavioral process. And therefore interview processes have long been decoupled from the business reality most companies face. Just shard the database and don't forget the API Gateway
100%. Interviews should be a two-way filter. I’m sympathetic to unemployed-and-just-need-something, but also: boy are there a lot of companies hiring data engineers.
Meh .. I've played that game; it doesn't work out well for anyone involved.
I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years.
I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company who's values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration.
The company's values may align to yours, even though they reject you. It's because the interview process doesn't need to have anything to do with their real-world process. Their engineers probe you for the same "best practices" that they themselves were constantly probed for in their own interviews. Interviewing is its very own skill that doesn't necessarily translate into real-life performance.
I agree with your observation. My issue is (from experience) it's really hard to tell from the outside if a teams' values align with mine. Many teams talk the talk, but don't walk the walk, as the saying goes. It's just easier to not participate than it is to guess, and be wrong.
I also believe that running a broken interview process actively selects for qualities you actually don't want, so it's much more likely that teams conducting those interviews aren't teams I want to work on.
Edit: As credence for my claims, the best team I've ever worked on was a team I did 90%+ of the hiring for, and we didn't do any of the 'typical' interview bullshit most companies do.
What we did instead was sit people down and have deep technical conversations about systems they'd worked on in the past. The candidate would explain, in as much detail as they could muster, a system they'd worked on in the past, down to the lowest level details. Usually, they would talk to us for at least 20-30 minutes, then, we (the interviewers) would pose questions, usually starting with the form 'if we changed X, what effect would it have'. Doing interviews in this style make a few things immediately obvious:
1. Did the candidate have a deep, systemic understanding of the system they worked on?
2. Does the candidate have a good mental model for evaluating change in the system?
That's how I conduct interviews, and unsurprisingly, when I get interviewed like that, my success rate is 100%. I don't think I've ever done an interview like that which did not result in an offer.
Anyways, there's some rambling and unsolicited opinions for you :)
The interview process determines who gets hired, which determines their real-world process. Even if most of their people were hired under a better system, future hires will come in under this one.
This. Most interviewers don't want to do interviews, they have more important job to do (at least, that's what they claim). So they learn questions and approaches from the same materials and guides that are used by candidates. Well, I'm guilty of doing exactly this a few times.
You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you.
I've done that too and, in my experience, people that ask a scaling question that fits on a single machine don't have the capacity to have that nuanced conversation. I usually try to help the interviewer adjust the scale to something that actually requires many machines, but they usually don't get it.
Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on.
Yes, but then how are these people going to justify the money they're spending on cloud systems?... They need to find only reasons to maintain their "investment", otherwise they could be held as incompetent when their solution is proven to be ineffective. So, they have to show that it was a unanimous technical decision to do whatever they wanted in the first place.
I've actually worked on distributed systems that were so broken, I created a script to connect to prod and just create the report from my laptop. My manager offered to buy me a second laptop for running the report since it was easier than getting approval from the architects to get rid of the distributed report system (it only created that one report).
Yeah I had this problem at a couple of times in startup interviews where the interviewer asked a question I happened to have expertise in and then disagreed with my answer and clearly they didn't know all that much about it. It's ok, they did me a favor.
It may or may not be related that the places that this happened were always very ethnically monotone with narrow age ranges (nothing against any particular ethnic group, they were all different ethnic monotones)
> I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
Probably a better outcome than being hired onto a team where everyone know you're technically correct but they ignore your suggestions for some mysterious (to you) reason.
Though I do not know the situation AT the firm you were interviewing in, if there is some unexpected increase in data volume OR say a job fails on certain days or you need to do some sort of historical data load (>= 6 months of 1 gig of data per day), the solution for running it on a single VM might not scale. BUT again, interviews are partially about problem solving, partially about checking compliance at least for IC roles (IN my anecdotal experience).
That being said yeah I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need spark, so the team decision was to fit the smaller square peg into a larger square peg and call it a da.In fact, I had spent time refactoring one particular pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow but it was rejected, so I stopped caring about engineering hygiene.
You can parse JSON at several GB/s: https://github.com/simdjson/simdjson
And you could scale that by one or two orders of magnitude with thread-based parallelism on recent AMD Epyc or Intel Xeon CPUs. So parsing alone should not pose a problem (maybe even sub-second for 6 months of data). We would need a more precise problem statement to judge whether horizontal scaling is needed.
As other commentors pointed out, 1gb/day isn't a problem for storage and retroactive processing until you get to like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + nvme drive.
Regardless, storage and retroactive processing wasn't part of the problem. The problem was explicitly "parse json records as they come in, in a big batch, and increment some integers in a database".
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1gb/day. You can do a lot with a 10TB of memory and 256 cores.
If we are talking about cloud VMs: sure, their cpu performance is atrocious and io can be horrible. This won't scale to infinity
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work
I have a funny story I need to tell some day about how I could get a 4GB JSON loaded purely in the browser at some insane speed, by reading the bytes, identifying the "\n" then making a lookup table. It started low stakes but ended up becoming a multi-million internal project (in man-hours) that virtually everyone on the company used. It's the kind of project that if started "big" from the beginning, I'd bet anything it wouldn't have gotten so far.
Edit: I did try JSON.parse() first, which I expected to fail and it did fail BUT it's important that you try anyway.
Yes, but I didn't read the full file, I kept the File reference and read the bytes in pages of 10MB IIRC to find all of the line break offsets. Then used those to slice and only read the relevant parts.
“there’s no wrong answer, we just want to see how you think” gaslighting in tech needs to be studied by the EEOC, Department of Labor, FTC, SEC, and Delaware Chancery Court to name a few
let’s see how they think and turn this into a paid interview
I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.