The weirdest paradox in statistics (and machine learning)

Published 2022-08-30
🌏 AD: Get Exclusive NordVPN deal here ➼ nordvpn.com/mathemaniac. It's risk-free with Nord's 30-day money-back guarantee! ✌

Second channel video: Why James-Stein estimator dominates o...

Stein's paradox is of fundamental importance in modern statistics. It introduced the idea of shrinkage as a way to further reduce mean squared error, which matters especially in high-dimensional statistics, particularly relevant nowadays in machine learning, for example. Yet it is usually glossed over, because it is mostly seen as a toy problem. But it is precisely because the problem is so simple that it exposes the shortcomings of maximum likelihood estimation! This paradox is the subject of many blog posts (linked below), but not really covered here on YouTube, except in some lecture recordings, so I wanted to bring it to YouTube.

This is not to say that the maximum likelihood estimator is not useful. In most situations, especially in lower-dimensional statistics, it is still a good choice. But to hold it in as high a regard as statisticians did before 1961? That is not a healthy attitude towards the theory.

One thing I did not say, but which perhaps a lot of people will want me to mention, is that the James-Stein estimator is an empirical Bayes estimator; again, more links below.
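
As a rough illustration of the shrinkage idea above (my own minimal Python sketch, not code from the video, with arbitrary placeholder values for the true means), here is a Monte Carlo comparison of the ordinary maximum likelihood estimate with the James-Stein estimator for three independent normal means:

import numpy as np

rng = np.random.default_rng(0)
p = 3                                   # number of independent means; the paradox needs p >= 3
theta = np.array([3.0, 9.0, 4.0])       # hypothetical true means, unknown to the estimator
n_trials = 100_000

x = rng.normal(theta, 1.0, size=(n_trials, p))      # one observation per mean, unit variance

mle = x                                              # maximum likelihood estimate: just the observations
shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x                                      # James-Stein: shrink every coordinate towards 0

mse_mle = np.mean(np.sum((mle - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print("MLE total MSE:        ", round(mse_mle, 3))   # close to p = 3
print("James-Stein total MSE:", round(mse_js, 3))    # strictly smaller, whatever the true means are

The James-Stein total MSE comes out smaller for any choice of true means, which is exactly the dominance claim in the video.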

Video chapters:
00:00 Introduction
04:38 Chapter 1: The "best" estimator
09:48 Chapter 2: Why shrinkage works
15:51 Chapter 3: Bias-variance tradeoff
18:45 Chapter 4: Applications

Further reading:
The “baseball paper”: efron.ckirby.su.domains//other/Article1977.pdf
Wikipedia: en.wikipedia.org/wiki/Stein%27s_example
Dominating the (positive-part) James-Stein estimator: projecteuclid.org/journals/annals-of-statistics/vo…
Wikipedia (Empirical Bayes): en.wikipedia.org/wiki/Empirical_Bayes_method

Other writeups:
www.ime.unicamp.br/~veronica/MI677/steinparadox.pd…
joe-antognini.github.io/machine-learning/steins-pa…
www.jchau.org/2021/01/29/demystifying-stein-s-para…
www.naftaliharris.com/blog/steinviz/
austinrochford.com/posts/2013-11-30-steins-paradox…
duphan.wordpress.com/2016/07/10/steins-paradox-or-…
www.statslab.cam.ac.uk/~rjs57/SteinParadox.pdf

(Philosophical implications) philsci-archive.pitt.edu/13303/1/Philosophical%20s…

Other than commenting on the video, you are very welcome to fill in the Google form linked below, which helps me make better videos by catering to your level of math:
forms.gle/QJ29hocF9uQAyZyH6

If you want to know more interesting Mathematics, stay tuned for the next video!

SUBSCRIBE and see you in the next video!

If you are wondering how I make all these videos: even though the style is similar to 3Blue1Brown, I don't use his animation engine Manim. I will probably reveal how I make them at a future subscriber milestone, so do subscribe!

Social media:

Facebook: www.facebook.com/mathemaniacyt
Instagram: www.instagram.com/_mathemaniac_/
Twitter: twitter.com/mathemaniacyt
Patreon: www.patreon.com/mathemaniac (support if you want to and can afford to!)
Merch: mathemaniac.myspreadshop.co.uk
Ko-fi: ko-fi.com/mathemaniac [for one-time support]

For my contact email, check my About page on a PC.

See you next time

All Comments (21)
  • @mathemaniac
    Go to nordvpn.com/mathemaniac to get the two year plan with an exclusive deal PLUS 4 months free. It’s risk free with NordVPN’s 30 day money back guarantee! Please sign up because it really helps the channel! [My pinned comment gets removed by YouTube AGAIN!!!]
  • @ludomine7746
    This is insane. The demonstration with the points in 3d and 2d space not only made it clear why it works, but also made it clear why it doesn't work as well in 2d. Going from the paradox being magic to somewhat understandable is beautiful. I loved this video.
  • Lesson: One should not geometrize independent variables. A similar "paradox" occurs in uncertainties for complex numbers, so even in the 2D case. If you Monte Carlo sample the modulus of a random z ∈ ℂ with a 2D Z-dist centred at 0 + 0i, the average |z| is something like sqrt(π)/2 (Rayleigh distribution). But if you sample real and imaginary parts, average them to get the mean z, then take |·| of that, it'll converge to zero as N → ∞. The average |z| ∼ sqrt(π)/2, but the average z = 0 + 0i. [A quick simulation of this check appears after the comments.]
  • @marshallc6215
    For a layman, I think the worry after first seeing this explained (given the very fast hand-waving with the errors at the beginning) is that you might suddenly be able to estimate something better by adding your own random data to the question, which, by definition, makes the three data points not independent. The thing is, and I'm surprised you never clarified this, we aren't talking about a better estimate for any given distribution. We're talking about the best estimator for all three distributions as a collective. We're no longer asking 3 questions about 3 independent data sets, but 1 question about 1 data set containing 3 independent series. There is no paradox here, because it is pure numbers gamesmanship and is no longer the intuitive problem we asked at the beginning. When we went to multiple data sets, the phrasing of the question stayed the same, but the semantic meaning changed.
  • @ChatSceptique
    I'm a PhD in statistics, never heard of that one before. It's really cool, thanks for sharing <3 Cheers from Belgium.
  • @SirGisebert
    The bias-variance decomposition is part of my PhD thesis and I just gotta say your visualizations and explanations are very clean and intuitive. Good job!
  • This is very good. The only notes I have for how it might be improved are: 1. Make it clearer that when we have the 3 data points early in the video, we know from which distribution each of them comes, rather than just having 3 numbers. So, we know that we have, say, 3 generated from X_1, 9 generated from X_2 and 4 generated from X_3, rather than knowing that there's X_1, X_2 and X_3, each generated a number, and the set of numbers that were generated is 3, 9, 4, but we have no idea which comes from which. It can be sort of inferred from them ending up in a vector, but still. 2. "Near end" vs "far end", the near end being finite vs the far end being infinite, is a bit ehh as a point. It invites the thought of "well who cares how big the effect is in the finite area or how small it is in the infinitely large area, there will be more total shift in the latter anyway - it's infinite after all!". What matters is the probability mass for each of those areas (and its distribution and what happens to it), and that's finite either way. Other than that, excellent video. Nice and clear for some relatively high-level concepts.
  • The fact that I'm not particularly interested in statistics and also on my only 3 weeks of holidays from my maths-centric studies, yet I still was really excited to watch this video speaks for its quality. Thank you again for the amazing free content you provide to everyone!!
  • I have to admit that as someone not very familiar with statistics I was starting to get lost until you got to the 2D vs 3D visualization and I immediately grasped what was going on. That was an excellent way to explain it, and reminded me a lot of 3blue1brown's visual maths videos.
  • @scraps7624
    This is a masterclass in how to teach statistics, absolutely incredible work. Scripting, visualization, pacing, everything was on point
  • All of this paradox comes from trying to minimize the squared errors. Squared errors are used mostly because they are easy to compute for most classical statistical laws and they fit pretty well with most minimization algorithms. But in the real world, in many cases, one will be more interested in the average absolute error instead of the squared error. I think that's where the "paradox" is: we are using an arbitrary metric, and we never question it. When I used to be a quantitative analyst I often used the absolute value instead of the square for error minimization, and I found the results way more relevant despite some slight difficulty running some algorithms. [See the second sketch after the comments.]
  • @Anis_Hdd
    I did my PhD on shrinkage estimators of a covariance matrix. This is the best vulgarization of Stein's paradox I have ever seen! Thanks
  • @ej3281
    this was really good, thank you! I used to work in a machine learning/DSP shop and did a lot of reading about estimators but I'm not sure I ever fully understood until I saw this video.
  • @dcterr1
    I'm not all that familiar with advanced statistics, but I was pretty blown away by this paradox when you first presented it! However, once you started explaining how we normally throw out outliers in any case, it began to make a lot more sense. Good video!
  • It is pretty awesome that you are covering one of the most counterintuitive examples in statistics. This example motivates many exciting ideas in modern statistics like empirical Bayes. Keep up the good work.
  • @tanvach
    I think the reason shrinkage isn't widely discussed is that choosing MSE as a metric for goodness of parameter estimation is an arbitrary choice. It makes sense that introducing this metric would couple the individual estimations together, so it's not really a paradox (in hindsight). In some sense, you want to see how well the model works, not how accurate the parameters are, since a model is usually too simplistic. But I do see this used in econometrics. I think I'm seeing more of the L1 norm used in deep learning as the regularizer; I wonder what form of shrinkage factor that will have?
  • @djtwo2
    The relevance of "mean square error" here can be considered in the context of the probability distribution of the error. Here "error" means total squared error as in the video. Applying the shrinkage moves part of the range of errors towards zero, while moving a small part of the range away from zero. The "mean squared error" measure doesn't care enough about the few possibly extremely large errors resulting from being moved away from zero to counterbalance the apparent benefit of moving some errors towards zero. But any other measure of goodness of estimation has the same problem. There are approaches to ranking estimation methods (other ways of defining "dominance") that are based on the whole of the probability distribution of errors, not just a summary statistic. This is similar to the idea of "uniformly most powerful" for significance tests. The practical worry here is that a revised estimation formula can occasionally produce extremely poor estimates, as is illustrated in the slides in this video.
  • @amphicorp4725
    I kept forgetting that the distributions were unrelated and every time I remembered, it blew my mind. Absolutely fantastic video
  • @amaarquadri
    This is one of the most counterintuitive things I've ever seen! Statistics is crazy.
  • @ssvis2
    This is a great explanation of estimators and non-intuitive relations. I like that you highlighted its importance in machine learning. It would be worth doing another video about how the bias/variance relation and subsequent weighting adjustments affect those models, especially in the context of overfitting.
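
A quick Python sketch of the Monte Carlo check described in the complex-number comment above, assuming a standard complex normal (real and imaginary parts each N(0, 1/2), so |z| is Rayleigh-distributed with mean sqrt(π)/2):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
re = rng.normal(0.0, np.sqrt(0.5), n)   # real parts, variance 1/2
im = rng.normal(0.0, np.sqrt(0.5), n)   # imaginary parts, variance 1/2
z = re + 1j * im

print(np.mean(np.abs(z)))   # average of |z|: converges to sqrt(pi)/2 ~ 0.886
print(np.abs(np.mean(z)))   # |average of z|: converges to 0 as n grows
print(np.sqrt(np.pi) / 2)   # reference value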
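
And a small sketch of the point about error metrics raised in the comments, using a hypothetical toy dataset: the minimizer of the summed squared error is the sample mean, while the minimizer of the summed absolute error is the sample median, so the choice of metric genuinely changes the answer.

import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])         # toy data with one large outlier
grid = np.linspace(0.0, 60.0, 6001)                 # candidate estimates, step 0.01

sq_loss = [np.sum((data - c) ** 2) for c in grid]
abs_loss = [np.sum(np.abs(data - c)) for c in grid]

print(grid[np.argmin(sq_loss)], data.mean())        # both ~ 11.7: squared error picks the mean
print(grid[np.argmin(abs_loss)], np.median(data))   # both ~ 2.5: absolute error picks the median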