I personally prefer the "area of shapes in a plane" model. The sample space is a rectangle with area 1. Events are shapes inside the sample space, with probability equal to the % of the area of the sample space covered. The probability of intersection of two events is the area of the intersection of their corresponding shapes, and likewise for unions.
Conditioning on B is like treating B as the whole sample space. So what's the conditional probability of A given B? It's just the % of B that is covered by A. How much of B is covered by A? That's just the area of the intersection of A and B. Then we divide by the area of B to rescale that intersection relative to B rather than to the original sample space.
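As a rough sketch of that picture (the disc and half-plane below are arbitrary example events I made up; any shapes inside the unit square work the same way), a Monte Carlo estimate makes the "fraction of B covered by A" reading concrete:

```python
import random

# Arbitrary example events inside the unit square (not from the article):
# A is a disc, B is a half-plane slice of the square.
def in_A(x, y):
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.3 ** 2

def in_B(x, y):
    return x > 0.4

n = 1_000_000
hits_B = hits_AB = 0
for _ in range(n):
    x, y = random.random(), random.random()   # uniform point in the square
    if in_B(x, y):
        hits_B += 1
        if in_A(x, y):
            hits_AB += 1

# P(A | B) = area(A ∩ B) / area(B): the fraction of B that A covers.
print(hits_AB / hits_B)
```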
This "group by" model says something similar of course, but the "area of shapes" model I think covers a more general case while also being a literal mental picture that just about anyone can hold in their head.
To be precise, the shapes live inside the measurable space of the random variable, and the rectangle becomes a cube or hypercube depending on the number of variables.
Hmm, I think when the author gets to imagining an "infinite dataframe" to deal with conditioning on continuous variables, it's kind of revealed that this isn't the best way to build intuition. Worse, sometimes we condition on _unobserved_ variables, which wouldn't be in your dataframe. This happens when we describe a model with latent/hidden variables that influence the stuff we do observe.
I think the conventional way of describing this, which appears in intro textbooks, is better -- we think of cutting a slice through a multivariate distribution and then normalizing. If you have some distribution P(X, Y) where X and Y are both real, then its density is like a landscape (i.e. each (x, y) is associated with some height). P(X | Y=y) is like a slice through that landscape at Y=y, which leaves a 1-D plot of the density along X. If we rescale it so that the area under it is 1, it's a well-formed PDF.
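A small NumPy sketch of that slice-and-rescale step, using an arbitrary correlated Gaussian as the landscape (the density, the grid, and the value y=1 are my own choices for illustration):

```python
import numpy as np

# An arbitrary joint density p(x, y) tabulated on a grid: a correlated
# 2-D Gaussian, chosen only so the slice has some shape to it.
xs = np.linspace(-4, 4, 801)
ys = np.linspace(-4, 4, 801)
X, Y = np.meshgrid(xs, ys, indexing="ij")
rho = 0.7
joint = np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))
joint /= np.trapz(np.trapz(joint, ys, axis=1), xs)   # normalize the landscape

# Condition on Y = 1.0: take the slice of the landscape at that y...
y0 = 1.0
slice_at_y0 = joint[:, np.argmin(np.abs(ys - y0))]

# ...and rescale it so the area under it is 1, giving the conditional pdf p(x | y=1).
conditional = slice_at_y0 / np.trapz(slice_at_y0, xs)
print(np.trapz(conditional, xs))   # ~1.0
```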
> we think of cutting a slice through a multivariate distribution and then normalizing. If you have some distribution P(X, Y) where X and Y are both real, then its density is like a landscape (i.e. each (x, y) is associated with some height). P(X | Y=y) is like a slice through that landscape at Y=y, which leaves a 1-D plot of the density along X. If we rescale it so that the area under it is 1, it's a well-formed PDF.
This is a case where a picture is worth a thousand words. The slicing intuition is quite useful and, I imagine, quite common among those who have taken calculus. It is the very same mental picture you would use for computing an area with two nested integrals. This video [1] has some nice graphics along these lines.
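For instance, here is how the nested-integral picture looks in code, using a standard bivariate normal and the unit disc as an arbitrary example region (both are my choices, not anything from the comment or the video):

```python
from math import exp, pi, sqrt
from scipy.integrate import dblquad

# Standard bivariate normal density; dblquad passes the inner variable (y) first.
def density(y, x):
    return exp(-(x**2 + y**2) / 2) / (2 * pi)

# P((X, Y) lands in the unit disc) as two nested integrals:
# outer over x in [-1, 1], inner over y between the disc boundaries at that x.
prob, _ = dblquad(
    density,
    -1, 1,
    lambda x: -sqrt(1 - x**2),
    lambda x: sqrt(1 - x**2),
)
print(prob)   # ≈ 1 - exp(-1/2) ≈ 0.393
```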
You'll also probably like Think Stats by Allen Downey, which approaches teaching stats in a very similar framework (one you can code yourself): https://greenteapress.com/thinkstats/
> Grouping can be regarded as a non-deterministic computation.
Can you elaborate on that? I don't see how grouping can be a non-deterministic computation. When you compute the groups of a set you always get the same ones, or am I missing something?
Yes, when you aggregate a group's values you can get non-deterministic behavior. An operation like "give me any even number from the group" can be non-deterministic, but operations like "give me the group average" (which is what the post proposes) are deterministic.
Anecdote time: some weeks ago I had a bug for exactly this reason. I was doing a group-by and then selecting the first element of each group, and since the first element is not always the same, I was getting some weird results. I solved it by sorting the group's elements before selecting the first one.
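A minimal pandas sketch of that kind of fix (the column names and data here are made up):

```python
import pandas as pd

# Toy data standing in for the real table.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [3, 1, 4, 2],
})

# "First element of each group" depends on the incoming row order, which may
# not be stable upstream (e.g. after a parallel read or an unordered query).
first_by_order = df.groupby("group")["value"].first()

# Sorting on a well-defined key before taking the first element makes the
# result deterministic regardless of how the rows arrived.
first_sorted = (
    df.sort_values(["group", "value"])
      .groupby("group")["value"]
      .first()
)
print(first_sorted)   # a -> 1, b -> 2
```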