I personally prefer the "area of shapes in a plane" model. The sample space is a rectangle with area 1. Events are shapes inside the sample space, with probability equal to the % of the area of the sample space covered. The probability of intersection of two events is the area of the intersection of their corresponding shapes, and likewise for unions.
Conditioning on B is like treating B as the whole sample space. So what's the conditional probability of A given B? It's just the % of B that is covered by A. How much of B is covered by A? That's just the area of the intersection of A and B. Then we divide by the area of B to rescale that intersection relative to B rather than to the original sample space.
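As a rough sketch of that picture (the disc and half-plane below are arbitrary example events I made up; any shapes inside the unit square work the same way), a Monte Carlo estimate makes the "fraction of B covered by A" reading concrete:

```python
import random

# Arbitrary example events inside the unit square (not from the article):
# A is a disc, B is a half-plane slice of the square.
def in_A(x, y):
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.3 ** 2

def in_B(x, y):
    return x > 0.4

n = 1_000_000
hits_B = hits_AB = 0
for _ in range(n):
    x, y = random.random(), random.random()   # uniform point in the square
    if in_B(x, y):
        hits_B += 1
        if in_A(x, y):
            hits_AB += 1

# P(A | B) = area(A ∩ B) / area(B): the fraction of B that A covers.
print(hits_AB / hits_B)
```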
This "group by" model says something similar of course, but the "area of shapes" model I think covers a more general case while also being a literal mental picture that just about anyone can hold in their head.
To be precise, the shapes live inside the measurable space of the random variable, and the rectangle becomes a cube or hypercube depending on the number of variables.
Hmm, I think when the author gets to imagining an "infinite dataframe" to deal with conditioning on continuous variables, it's kind of revealed that this isn't the best way to build intuition. Worse, sometimes we condition on _unobserved_ variables, which wouldn't be in your dataframe. This happens when we describe a model with latent/hidden variables that influence the stuff we do observe.
I think the conventional way of describing this, which appears in intro textbooks, is better -- we think of cutting a slice through a multivariate distribution and then normalizing. If you have some distribution P(X, Y) where X and Y are both real, then its density is like a landscape (i.e. each (x, y) is associated with some height). P(X | Y=y) is like a slice through that landscape at Y=y, which leaves a 1-D plot of the density along X. If we rescale it so that the area under it is 1, it's a well-formed PDF.
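A small NumPy sketch of that slice-and-rescale step, using an arbitrary correlated Gaussian as the landscape (the density, the grid, and the value y=1 are my own choices for illustration):

```python
import numpy as np

# An arbitrary joint density p(x, y) tabulated on a grid: a correlated
# 2-D Gaussian, chosen only so the slice has some shape to it.
xs = np.linspace(-4, 4, 801)
ys = np.linspace(-4, 4, 801)
X, Y = np.meshgrid(xs, ys, indexing="ij")
rho = 0.7
joint = np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))
joint /= np.trapz(np.trapz(joint, ys, axis=1), xs)   # normalize the landscape

# Condition on Y = 1.0: take the slice of the landscape at that y...
y0 = 1.0
slice_at_y0 = joint[:, np.argmin(np.abs(ys - y0))]

# ...and rescale it so the area under it is 1, giving the conditional pdf p(x | y=1).
conditional = slice_at_y0 / np.trapz(slice_at_y0, xs)
print(np.trapz(conditional, xs))   # ~1.0
```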
> we think of cutting a slice through a multivariate distribution and then normalizing. If you have some distribution P(X, Y) where X and Y are both real, then its density is like a landscape (i.e. each (x, y) is associated with some height). P(X | Y=y) is like a slice through that landscape at Y=y, which leaves a 1-D plot of the density along X. If we rescale it so that the area under it is 1, it's a well-formed PDF.
This is a case where a picture is worth a thousand words. The slicing intuition is quite useful and, I imagine, quite common among those who have taken calculus. It is the very same mental picture you would use for computing an area with two nested integrals. This video [1] has some nice graphics along these lines.
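For instance, here is how the nested-integral picture looks in code, using a standard bivariate normal and the unit disc as an arbitrary example region (both are my choices, not anything from the comment or the video):

```python
from math import exp, pi, sqrt
from scipy.integrate import dblquad

# Standard bivariate normal density; dblquad passes the inner variable (y) first.
def density(y, x):
    return exp(-(x**2 + y**2) / 2) / (2 * pi)

# P((X, Y) lands in the unit disc) as two nested integrals:
# outer over x in [-1, 1], inner over y between the disc boundaries at that x.
prob, _ = dblquad(
    density,
    -1, 1,
    lambda x: -sqrt(1 - x**2),
    lambda x: sqrt(1 - x**2),
)
print(prob)   # ≈ 1 - exp(-1/2) ≈ 0.393
```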
You'll also probably like Think Stats by Allen Downey, which approaches teaching stats in a very similar framework (one you can code yourself): https://greenteapress.com/thinkstats/
> Grouping can be regarded as a non-deterministic computation.
Can you elaborate on that? I don't see how grouping can be a non-deterministic computation. When you compute the groups of a set you always get the same ones, or am I missing something?
Yes, when you aggregate a group's values you can get non-deterministic behavior. An operation like "give me any even number from the group" can be non-deterministic, but operations like "give me the group average" (which is what the post proposes) are deterministic.
Anecdote time: some weeks ago I had a bug for exactly this reason. I was doing a group-by and then selecting the first element of each group, and since the first element is not always the same, I was getting some weird results. I solved it by sorting the group's elements before selecting the first one.
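A minimal pandas sketch of that kind of fix (the column names and data here are made up):

```python
import pandas as pd

# Toy data standing in for the real table.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [3, 1, 4, 2],
})

# "First element of each group" depends on the incoming row order, which may
# not be stable upstream (e.g. after a parallel read or an unordered query).
first_by_order = df.groupby("group")["value"].first()

# Sorting on a well-defined key before taking the first element makes the
# result deterministic regardless of how the rows arrived.
first_sorted = (
    df.sort_values(["group", "value"])
      .groupby("group")["value"]
      .first()
)
print(first_sorted)   # a -> 1, b -> 2
```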