Naive Bayes Application: Penguins#
References
Reference: The Bayes Rules! book.
It is recommended to read the Naive Bayes concept section prior to this application.
Dependencies#
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objects as go
import torch
from rich import print
import sys
from pathlib import Path
parent_dir = str(Path().resolve().parents[3])
print(parent_dir)
sys.path.append(parent_dir)
from src.utils.plot import plot_scatter, plot_contour, plot_surface, plot_bar
sns.set_style('white')
/home/runner/work/gaohn-probability-stats/gaohn-probability-stats
Problem Statement#
There exist multiple penguin species throughout Antarctica, including the Adelie, Chinstrap, and Gentoo. When encountering one of these penguins on an Antarctic trip, we might classify its species by examining various physical characteristics, such as whether the penguin weighs more than the average penguin, as well as measurements of its bill length and flipper length.
The penguins data (called penguins_bayes in the book), originally made available by Gorman, Williams, and Fraser (2014) and distributed by Horst, Hill, and Gorman (2020), contains the above species and feature information for a sample of 344 Antarctic penguins:
penguins = sns.load_dataset('penguins')
display(penguins.head())
# penguins.head().style.set_table_attributes('style="font-size: 13px"')
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
display(penguins.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
display(penguins.describe())
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.000000 | 342.000000 | 342.000000 | 342.000000 |
| mean | 43.921930 | 17.151170 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.100000 | 13.100000 | 172.000000 | 2700.000000 |
| 25% | 39.225000 | 15.600000 | 190.000000 | 3550.000000 |
| 50% | 44.450000 | 17.300000 | 197.000000 | 4050.000000 |
| 75% | 48.500000 | 18.700000 | 213.000000 | 4750.000000 |
| max | 59.600000 | 21.500000 | 231.000000 | 6300.000000 |
Among these penguins, 152 are Adelies, 68 are Chinstraps, and 124 are Gentoos. We’ll assume throughout that the proportional breakdown of these species in our dataset reflects the species breakdown in the wild. That is, our prior assumption about any new penguin is that it’s most likely an Adelie and least likely a Chinstrap:
display(penguins['species'].value_counts(normalize=False))
display(penguins['species'].value_counts(normalize=True))
Adelie 152
Gentoo 124
Chinstrap 68
Name: species, dtype: int64
Adelie 0.441860
Gentoo 0.360465
Chinstrap 0.197674
Name: species, dtype: float64
def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Function to calculate missing values by column.
    credit: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
    """
    # Total missing values per column
    missing_values_per_column = df.isnull().sum()
    # Percentage of missing values
    missing_values_per_column_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    missing_value_table = pd.concat(
        [missing_values_per_column, missing_values_per_column_percent], axis=1
    )
    # Rename the columns
    missing_value_table_renamed = missing_value_table.rename(
        columns={0: "Missing Values", 1: "% of Total Values"}
    )
    # Sort the table by percentage of missing descending
    missing_value_table_renamed = (
        missing_value_table_renamed[missing_value_table_renamed.iloc[:, 1] != 0]
        .sort_values("% of Total Values", ascending=False)
        .round(1)
    )
    # Print some summary information
    print(
        f"Your selected dataframe has {str(df.shape[1])} columns.\n"
        f"There are {str(missing_value_table_renamed.shape[0])}"
        " columns that have missing values."
    )
    # Return the dataframe with missing information
    return missing_value_table_renamed
missing_values_table(penguins)
Your selected dataframe has 7 columns. There are 5 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| sex | 11 | 3.2 |
| bill_length_mm | 2 | 0.6 |
| bill_depth_mm | 2 | 0.6 |
| flipper_length_mm | 2 | 0.6 |
| body_mass_g | 2 | 0.6 |
We drop the `sex` column for simplicity, since it has quite a few missing values.
penguins = penguins.drop(columns=["sex"], inplace=False).reset_index(drop=True)
We also drop the remaining rows with NAs for simplicity; there are only 2 of them, namely the rows whose measurement columns are all missing.
penguins = penguins.dropna().reset_index(drop=True) # drop rows with NAs, only 2 rows
penguins
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 337 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 |
| 338 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 |
| 339 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 |
| 340 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 |
| 341 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 |

342 rows × 6 columns
display(penguins['species'].value_counts(normalize=False))
display(penguins['species'].value_counts(normalize=True))
Adelie 151
Gentoo 123
Chinstrap 68
Name: species, dtype: int64
Adelie 0.441520
Gentoo 0.359649
Chinstrap 0.198830
Name: species, dtype: float64
mean_body_mass_g = penguins['body_mass_g'].mean()
print(f"The mean body mass of penguins is {mean_body_mass_g:.2f} grams.")
The mean body mass of penguins is 4201.75 grams.
We create a new categorical feature `overweight`, which is 1 if `body_mass_g` is over the mean and 0 otherwise. This feature corresponds to the random variable $X_1$ defined earlier.
penguins["overweight"] = (penguins["body_mass_g"] > penguins["body_mass_g"].mean()).astype(int)
display(penguins["overweight"].value_counts())
0 193
1 149
Name: overweight, dtype: int64
The Prior#
The prior distribution for the species $Y$ is estimated by the relative frequencies above, which implies

$$\mathbb{P}(Y = \text{Adelie}) \approx 0.4415, \quad \mathbb{P}(Y = \text{Chinstrap}) \approx 0.1988, \quad \mathbb{P}(Y = \text{Gentoo}) \approx 0.3596.$$

Given a new penguin not in the dataset, the prior assumption says that the probability of it being an Adelie is about 0.44, a Gentoo about 0.36, and a Chinstrap about 0.20.

We then map the species names to integer class labels. Let us create a new column `class_id`, which is 0 for Adelie, 1 for Chinstrap, and 2 for Gentoo.
penguins["class_id"] = penguins["species"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})
display(penguins["class_id"].value_counts())
0 151
2 123
1 68
Name: class_id, dtype: int64
y = penguins["class_id"].values
def get_prior(y: torch.Tensor) -> torch.distributions.Categorical:
    """Estimate the prior distribution of the classes from their relative frequencies."""
    if isinstance(y, np.ndarray):
        y = torch.from_numpy(y)
    probs = torch.unique(y, return_counts=True)[1] / len(y)
    distribution = torch.distributions.Categorical(probs=probs)
    return distribution
prior = get_prior(y)
print(f"The prior distribution is {prior.probs}")
The prior distribution is tensor([0.4415, 0.1988, 0.3596])
This indeed matches the prior probabilities (i.e. the relative frequencies of each species in the dataset).
class_names = ["Adelie", "Chinstrap", "Gentoo"]
class_colours = ['blue', 'orange', 'green']
fig, ax = plt.subplots(figsize=(8, 6))
bar = plot_bar(ax, x=class_names, y=prior.probs.numpy(), color=class_colours)
ax.set_xlabel("Class")
ax.set_ylabel("Prior probability")
ax.set_title("Prior distribution of penguin species")
plt.show()

Classifying one penguin#
Consider a new penguin with the following features:

- `body_mass_g`: below the average of about 4202 g, so `overweight` = 0
- `bill_length_mm`: 50 mm
- `flipper_length_mm`: 195 mm
Then we want to find out the posterior distribution of the species of this penguin, given the features. In other words, what is the probability of this penguin being an Adelie, Chinstrap, or Gentoo, given the features mentioned above.
One Categorical Feature#
Let us assume that we are building our Naive Bayes classifier off one categorical feature, `overweight`. Since this penguin has a `body_mass_g` below the mean, its `overweight` value is 0. Let us call this feature $X_1$.
Let's reproduce Figure 14.1 from the book.
total = penguins.groupby("species").size().reset_index(name="counts")
overweight = (
penguins[penguins["overweight"] == 1]
.groupby("species")["overweight"]
.sum()
.reset_index()
)
overweight["overweight"] = [
i / j * 100 for i, j in zip(overweight["overweight"], total["counts"])
]
total["counts"] = [i / j * 100 for i, j in zip(total["counts"], total["counts"])]
display(total)
display(overweight) # percentage of overweight penguins for each species
| | species | counts |
|---|---|---|
| 0 | Adelie | 100.0 |
| 1 | Chinstrap | 100.0 |
| 2 | Gentoo | 100.0 |

| | species | overweight |
|---|---|---|
| 0 | Adelie | 16.556291 |
| 1 | Chinstrap | 10.294118 |
| 2 | Gentoo | 95.121951 |
# bar chart 1 -> top bars (group of 'overweight=No')
bar1 = sns.barplot(x="species", y="counts", data=total, color="darkblue")
# bar chart 2 -> bottom bars (group of 'overweight=Yes')
bar2 = sns.barplot(x="species", y="overweight", data=overweight, color="lightblue")
# add legend
top_bar = mpatches.Patch(color="darkblue", label="overweight = No")
bottom_bar = mpatches.Patch(color="lightblue", label="overweight = Yes")
plt.legend(handles=[top_bar, bottom_bar])
# show the graph
plt.show()

From the relative frequency table above, the Chinstrap species has the highest proportion of underweight penguins. That is, for each species, we compute the relative frequency of overweight penguins within that species as

$$\mathbb{P}(X_1 = 1 \mid Y = \text{Adelie}) \approx \frac{25}{151} \approx 0.166, \quad \mathbb{P}(X_1 = 1 \mid Y = \text{Chinstrap}) \approx \frac{7}{68} \approx 0.103, \quad \mathbb{P}(X_1 = 1 \mid Y = \text{Gentoo}) \approx \frac{117}{123} \approx 0.951.$$

Note that we are abusing notation slightly, since these relative frequencies are only empirical estimates of the true conditional probabilities. The expressions above are the conditional probability of the feature $X_1$ given the species $Y$, i.e. the likelihood.

So one might say that the Chinstrap species is the least likely to be overweight, and the Gentoo species is the most likely to be overweight. This makes intuitive sense since we are talking about "likelihood" here: of all Chinstraps in the data, only about 10% are overweight, whereas about 95% of Gentoos are.
Yet before we can make any conclusions, we need to take into account the prior probabilities of each species. We should weight the relative frequencies by the prior probabilities of each species. Intuitively, since Chinstrap is also the rarest species, it diminishes the likelihood of the penguin being a Chinstrap.
We need to use both the prior and the likelihood to compute the posterior distribution of the species of the penguin. The posterior distribution is the probability of the species of the penguin given the features.
For example, if the given feature of the test penguin is $X_1 = 0$ (not overweight), we compute

$$\mathbb{P}(Y = \text{Adelie} \mid X_1 = 0), \quad \mathbb{P}(Y = \text{Chinstrap} \mid X_1 = 0), \quad \mathbb{P}(Y = \text{Gentoo} \mid X_1 = 0)$$

and take the argmax of the above three expressions. The argmax is the species with the highest posterior probability. Note this makes sense because the denominator $\mathbb{P}(X_1 = 0)$ is the same for all three species, so only the numerator matters for the comparison. The argmax expression is as follows:

$$\hat{y} = \arg\max_{k \in \{\text{Adelie}, \text{Chinstrap}, \text{Gentoo}\}} \mathbb{P}(Y = k \mid X_1 = 0) = \arg\max_{k} \frac{\mathbb{P}(X_1 = 0 \mid Y = k)\,\mathbb{P}(Y = k)}{\mathbb{P}(X_1 = 0)}.$$
Note that in the Bayes Rules book, the authors use the likelihood notation $L(\cdot \mid \cdot)$ for these conditional terms.

Let's digress slightly: for a continuous feature, the probability of observing any exact value is zero, so Bayes' rule cannot be written directly in terms of point probabilities. We reconcile this by expressing Bayes' rule in terms of the probability density function (PDF), see Bayes' Rule forms. The conditional (posterior) form is

$$\mathbb{P}(Y = k \mid X = x) = \frac{f_{X \mid Y}(x \mid k)\,\mathbb{P}(Y = k)}{f_X(x)},$$

where $f_{X \mid Y}$ and $f_X$ are densities rather than probabilities. We make a bold statement (not proved here, but true) that maximizing this expression over the classes gives the same answer as maximizing the posterior obtained from Bayes' rule for probabilities. If this statement is true, then finding the argmax of the posterior written with PDFs is the same as finding the argmax of the posterior distribution in terms of Bayes' rule.

Let's go back to where we left off.
The table below breaks down the joint distribution of $X_1$ (overweight) and $Y$ (species) in terms of counts:

| Species | overweight = 0 | overweight = 1 | Total |
|---|---|---|---|
| A | 126 | 25 | 151 |
| C | 61 | 7 | 68 |
| G | 6 | 117 | 123 |
| Total | 193 | 149 | 342 |
The table below shows the corresponding joint probabilities of $X_1$ and $Y$ (each count divided by 342), with the marginal distributions in the last row and column:

| Species | overweight = 0 | overweight = 1 | Total |
|---|---|---|---|
| A | 0.365 | 0.072 | 0.437 |
| C | 0.178 | 0.020 | 0.198 |
| G | 0.017 | 0.342 | 0.359 |
| Total | 0.560 | 0.434 | 1.000 |
The image is a sketch of the joint distribution table of $X_1$ and $Y$.

Fig. 24 Joint distribution diagram of $X_1$ (overweight) and $Y$ (species).
We can also use seaborn's `JointGrid` to make a joint distribution plot.
g = sns.JointGrid(data=penguins, x="overweight", y="species")
low, high = 0, 1
# bins = np.arange(0, high + 1.5) - 0.5 # [-0.5, 0.5, 1.5]
g.plot_joint(sns.histplot,discrete=True, cbar=True, color="blue")
g.plot_marginals(sns.histplot, discrete=True, color="green");

A darker tone of color corresponds to a higher impulse (probability mass) of the joint distribution. Indeed, we can see that the combinations Adelie + overweight = 0 and Gentoo + overweight = 1 present a darker tone than the rest, indicating that these two combinations carry the most joint probability. In our table they correspond to 0.365 and 0.342 respectively, which is consistent with the plot. The top and right panels are the marginal distributions of `overweight` and species respectively.
Note that the plots above all use histograms by default, i.e. we used plain histograms to "estimate" the joint distribution. We can also use a kernel density estimate (KDE), which is smoother; either way, everything here is just an estimate. We will use KDEs for the continuous distributions later.
To see the empirical conditional distribution of $Y$ given $X_1$, we can read it off the joint distribution table directly. What is $\mathbb{P}(Y = \text{Adelie} \mid X_1 = 0)$?

In this scenario, the author mentions that we can compute the posterior probability directly from the empirical joint distribution table. For example, if we want the posterior probability of the penguin being an Adelie given that it is not overweight, we compute

$$\mathbb{P}(Y = \text{Adelie} \mid X_1 = 0) = \frac{126}{193} \approx 0.652,$$

or equivalently, using the probability table,

$$\mathbb{P}(Y = \text{Adelie} \mid X_1 = 0) = \frac{\mathbb{P}(Y = \text{Adelie}, X_1 = 0)}{\mathbb{P}(X_1 = 0)} \approx \frac{0.365}{0.560} \approx 0.652.$$

Note that we are still talking about empirical quantities here, so nothing is really proven yet.

In a similar fashion, we have

$$\mathbb{P}(Y = \text{Chinstrap} \mid X_1 = 0) = \frac{61}{193} \approx 0.316, \quad \mathbb{P}(Y = \text{Gentoo} \mid X_1 = 0) = \frac{6}{193} \approx 0.031.$$
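As a quick check, the same empirical joint and conditional tables can be read off with pandas. This is a minimal sketch that simply reuses the `penguins` dataframe and the `overweight` column created above.

# empirical joint counts of species vs overweight, with marginal totals
joint_counts = pd.crosstab(penguins["species"], penguins["overweight"], margins=True)
display(joint_counts)

# empirical posterior P(Y = species | X_1 = overweight): normalize each column
posterior_from_table = pd.crosstab(
    penguins["species"], penguins["overweight"], normalize="columns"
)
display(posterior_from_table)  # column 0 should be roughly (0.65, 0.32, 0.03)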
This should not be surprising, since if we can construct the joint probability table, then both the marginal and conditional probability tables can be constructed from it. We can confirm this by computing the posterior distribution using the formula above, using Bayes’ rule.
Firstly, our prior says that

$$\mathbb{P}(Y = \text{Adelie}) \approx 0.4415, \quad \mathbb{P}(Y = \text{Chinstrap}) \approx 0.1988, \quad \mathbb{P}(Y = \text{Gentoo}) \approx 0.3596.$$

These values will handle the prior term $\mathbb{P}(Y = k)$ in Bayes' rule.

Next, we find the likelihood terms of $X_1 = 0$ given each species:

$$\mathbb{P}(X_1 = 0 \mid Y = \text{Adelie}) \approx \frac{126}{151} \approx 0.834, \quad \mathbb{P}(X_1 = 0 \mid Y = \text{Chinstrap}) \approx \frac{61}{68} \approx 0.897, \quad \mathbb{P}(X_1 = 0 \mid Y = \text{Gentoo}) \approx \frac{6}{123} \approx 0.049.$$

Again, these are empirical probabilities; we are not sure if these are the actual probabilities of the feature given the species, but we can reasonably estimate them using statistics, and in this case we again use relative frequencies. We will see later that this is modelled by the Multinomial (Categorical) distribution with a probability parameter vector. The Categorical distribution is a generalization of the Bernoulli distribution (as the Multinomial is of the Binomial); when there are only two categories, it reduces to the Bernoulli.

Plugging these priors and likelihoods into the formula above, we get the denominator, the normalizing constant,

$$\mathbb{P}(X_1 = 0) = \sum_{k} \mathbb{P}(X_1 = 0 \mid Y = k)\,\mathbb{P}(Y = k) \approx 0.4415 \times 0.834 + 0.1988 \times 0.897 + 0.3596 \times 0.049 \approx 0.564.$$

We pause a while to note that this is simply the marginal probability of observing a not-overweight penguin. Of course, this is also merely the relative frequency of $X_1 = 0$ in the dataset, $193/342 \approx 0.564$.

Finally, by Bayes' rule, we get the posterior probability of the penguin being an Adelie,

$$\mathbb{P}(Y = \text{Adelie} \mid X_1 = 0) = \frac{\mathbb{P}(X_1 = 0 \mid Y = \text{Adelie})\,\mathbb{P}(Y = \text{Adelie})}{\mathbb{P}(X_1 = 0)} \approx \frac{0.834 \times 0.4415}{0.564} \approx 0.652,$$

which is the same as the one we calculated earlier using the empirical joint distribution table. In a similar fashion, we get the posterior probabilities of the other two species,

$$\mathbb{P}(Y = \text{Chinstrap} \mid X_1 = 0) \approx 0.316, \quad \mathbb{P}(Y = \text{Gentoo} \mid X_1 = 0) \approx 0.031.$$

And the argmax of these three posterior probabilities is Adelie, so the prediction based on this one categorical feature is Adelie.
So we can omit the denominator and still get the same argmax; the resulting scores no longer sum to 1, but that does not affect the final prediction.

We also saw here that if our prior is very low, then even though the likelihood is high, the posterior can still be low when the prior is much smaller than the likelihood. This is the reason why we use prior knowledge to help us make better predictions.
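To make the arithmetic above concrete, here is a minimal sketch of the same calculation in code, assuming the `penguins` dataframe and the class ordering (Adelie, Chinstrap, Gentoo) used throughout.

species_order = ["Adelie", "Chinstrap", "Gentoo"]

# priors P(Y = k): relative frequency of each species
priors = penguins["species"].value_counts(normalize=True).reindex(species_order)

# likelihoods P(X_1 = 0 | Y = k): within-class relative frequency of not-overweight
likelihoods = (
    penguins[penguins["overweight"] == 0]["species"].value_counts().reindex(species_order)
    / penguins["species"].value_counts().reindex(species_order)
)

# unnormalized posterior and normalizing constant P(X_1 = 0)
unnormalized = priors * likelihoods
evidence = unnormalized.sum()          # about 0.564, same as 193/342
posterior = unnormalized / evidence    # about (0.653, 0.316, 0.031)
print(posterior)
print(posterior.idxmax())              # argmax -> Adelie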
One Quantitative Predictor#
We now ignore the earlier categorical predictor and use only the quantitative predictor `bill_length_mm`. This penguin has a `bill_length_mm` of 50 mm, so we want to predict the species of the penguin given this information.
We know that as we move to a continuous space, we can no longer use relative frequencies to estimate the probability of a continuous variable taking a particular value, as we will see later.
Let’s do some EDA.
bill_length_mm = penguins[["bill_length_mm", "species", "class_id"]]
display(bill_length_mm)
| | bill_length_mm | species | class_id |
|---|---|---|---|
| 0 | 39.1 | Adelie | 0 |
| 1 | 39.5 | Adelie | 0 |
| 2 | 40.3 | Adelie | 0 |
| 3 | 36.7 | Adelie | 0 |
| 4 | 39.3 | Adelie | 0 |
| ... | ... | ... | ... |
| 337 | 47.2 | Gentoo | 2 |
| 338 | 46.8 | Gentoo | 2 |
| 339 | 50.4 | Gentoo | 2 |
| 340 | 45.2 | Gentoo | 2 |
| 341 | 49.9 | Gentoo | 2 |

342 rows × 3 columns
x2_mean = bill_length_mm["bill_length_mm"].mean()
print(
f"Mean bill length across the whole population "
f"without conditioning on class: {x2_mean:.2f} mm"
)
Mean bill length across the whole population without conditioning on class: 43.92 mm
The kdeplot below is a univariate, empirical density estimate over the whole dataset, not conditioned on any class label.
# plotting both distibutions on the same figure
_ = sns.kdeplot(data=penguins, x="bill_length_mm", fill=True, common_norm=False, alpha=.5, linewidth=0, legend=True)
# plot vertical line
_ = plt.axvline(x=x2_mean, color='red', linestyle='--')
# plt.legend()
plt.show();

To see the conditional distribution, we can use the `hue` parameter in `seaborn.kdeplot` to plot the conditional distribution of `bill_length_mm` given each species.

More concretely, imagine that a given species, say Adelie, has happened; then we zoom into the reduced sample space of the dataset where the species is Adelie and estimate the density of `bill_length_mm` there. I want to emphasize that for one predictor conditioned on another random variable, in this case the species $Y$, each conditional distribution is still a univariate distribution.
x2 = 50 # penguin bill length
# plotting both distibutions on the same figure
_ = sns.kdeplot(
data=penguins,
x="bill_length_mm",
hue="species",
fill=True,
common_norm=False,
alpha=0.5,
linewidth=0,
legend=True,
)
# plot vertical line
_ = plt.axvline(x2, color="r", linestyle="--", label="50 mm")
plt.title("Density plot of bill length")
# plt.legend()
plt.show();

We have seen the conditional distribution plot above; let's also look at the joint distribution of `bill_length_mm` and the species.
g = sns.JointGrid(data=penguins, x="bill_length_mm", y="species")
g.plot_joint(sns.histplot, color = "red", cbar=True)
g.plot_marginals(sns.histplot, color = "green")
sns.jointplot(data=penguins, x="bill_length_mm", y="species")
<seaborn.axisgrid.JointGrid at 0x7f22b469adf0>


We see that for this penguin with a `bill_length_mm` of 50 mm, our gut feeling tells us that it is most likely not an Adelie, since the Adelie distribution of bill lengths is concentrated far below 50 mm.

We can again use Bayes' rule to compute the posterior probability of each species given this feature. For example, since our given feature is a `bill_length_mm` of 50 mm, we want $\mathbb{P}(Y = k \mid X_2 = 50)$ for each species $k$.
Again, we use the argmax to solve the problem:

$$\hat{y} = \arg\max_{k} \mathbb{P}(Y = k \mid X_2 = 50) = \arg\max_{k} \frac{f_{X_2 \mid Y}(50 \mid k)\,\mathbb{P}(Y = k)}{f_{X_2}(50)}.$$

Now we meet our first hurdle: since `bill_length_mm` is continuous, the probability of observing exactly 50 mm is zero, so a relative-frequency (contingency) table of species vs `bill_length_mm` does not work here.

Further, we haven't assumed a model for the class-conditional distribution $f_{X_2 \mid Y}(x_2 \mid k)$ yet. Notice here that it is possible that the three conditional distributions are quite different, each with different shapes and parameters.
From the conditional plot above, we can see that the three species are not well separated in the joint distribution space, so we cannot expect a good performance from the Gaussian Naive Bayes model. For example, 50 mm could very well be either Gentoo or Chinstrap from the plot since they are not well separated.
In our example, the class-conditional distribution of `bill_length_mm` is assumed to be Gaussian,

$$X_2 \mid Y = k \sim \mathcal{N}(\mu_k, \sigma_k^2) \quad \text{for } k \in \{\text{Adelie}, \text{Chinstrap}, \text{Gentoo}\}.$$

The next question is how we estimate the parameters $\mu_k$ and $\sigma_k$. It turns out that the maximum likelihood estimate of $\mu_k$ is the within-class sample mean, and the usual estimate of $\sigma_k$ is the within-class sample standard deviation (the MLE of the variance divides by $n$ rather than $n-1$, a negligible difference here).
The table below summarizes the sample (empirical) mean and sample standard deviation of `bill_length_mm` for each species:

| Species | Sample Mean | Sample Standard Deviation |
|---|---|---|
| A | 38.8 | 2.66 |
| C | 48.8 | 3.34 |
| G | 47.5 | 3.08 |
Remember again that the sample mean and sample variance are unbiased estimators of the true $\mu_k$ and $\sigma_k^2$ (the sample standard deviation is a slightly biased but standard estimator of $\sigma_k$).
bill_length_mean_std = bill_length_mm.groupby('species').agg({'bill_length_mm': ['mean', 'std']})
bill_length_mean_std
| species | bill_length_mm mean | bill_length_mm std |
|---|---|---|
| Adelie | 38.791391 | 2.663405 |
| Chinstrap | 48.833824 | 3.339256 |
| Gentoo | 47.504878 | 3.081857 |
Let's plot the normal distribution with the sample mean and sample standard deviation for each species.
# rv_adelie = stats.norm(bill_length_mean_std.loc['Adelie']['bill_length_mm']['mean'], bill_length_mean_std.loc['Adelie']['bill_length_mm']['std'])
# rv_chinstrap = stats.norm(bill_length_mean_std.loc['Chinstrap']['bill_length_mm']['mean'], bill_length_mean_std.loc['Chinstrap']['bill_length_mm']['std'])
# rv_gentoo = stats.norm(bill_length_mean_std.loc['Gentoo']['bill_length_mm']['mean'], bill_length_mean_std.loc['Gentoo']['bill_length_mm']['std'])
rv_adelie = stats.norm(38.8, 2.66)
rv_chinstrap = stats.norm(48.8, 3.34)
rv_gentoo = stats.norm(47.5, 3.08)
fig, ax = plt.subplots(1, 1)
ax.plot(np.linspace(30, 60, 100), rv_adelie.pdf(np.linspace(30, 60, 100)), 'r-', lw=5, alpha=0.6, label='Adelie')
ax.plot(np.linspace(30, 60, 100), rv_chinstrap.pdf(np.linspace(30, 60, 100)), 'b-', lw=5, alpha=0.6, label='Chinstrap')
ax.plot(np.linspace(30, 60, 100), rv_gentoo.pdf(np.linspace(30, 60, 100)), 'g-', lw=5, alpha=0.6, label='Gentoo')
# plot vertical line
_ = plt.axvline(x2, color="y", linestyle="--", label="50 mm")
_ = plt.axvline(x2+0.5, color="y", linestyle="--", label="50.5 mm")
_ = plt.axvline(x2-0.5, color="y", linestyle="--", label="49.5 mm")
ax.set_xlabel('bill_length_mm')
ax.set_ylabel('Probability density')
ax.legend()
plt.show();

As we can see, this naive assumption of a Gaussian distribution for `bill_length_mm` within each species appears to fit the empirical (KDE) shapes reasonably well.
Now we can finally solve our "hurdle" problem of computing $f_{X_2 \mid Y}(50 \mid k)$. We can use the Gaussian density function to compute the probability density of observing a 50 mm bill length under each class-conditional distribution; in practice, we just evaluate each fitted Gaussian's PDF at 50. As mentioned earlier, we compute and maximize the posterior written with conditional PDFs instead of point probabilities, since the probability of any single point is 0 for continuous variables.

And so,

$$f_{X_2 \mid Y}(50 \mid \text{Adelie}) = \mathcal{N}(50;\, 38.8,\, 2.66^2) \approx 0.0000212.$$

Similarly,

$$f_{X_2 \mid Y}(50 \mid \text{Chinstrap}) \approx 0.112, \quad f_{X_2 \mid Y}(50 \mid \text{Gentoo}) \approx 0.093.$$

Note that these numbers are probability densities, not probabilities.
px2_given_adelie = rv_adelie.pdf(x2)
px2_given_chinstrap = rv_chinstrap.pdf(x2)
px2_given_gentoo = rv_gentoo.pdf(x2)
print(f'P(X_2=50|Adelie) = {px2_given_adelie:.7f}')
print(f'P(X_2=50|Chinstrap) = {px2_given_chinstrap:.3f}')
print(f'P(X_2=50|Gentoo) = {px2_given_gentoo:.5f}')
P(X_2=50|Adelie) = 0.0000212
P(X_2=50|Chinstrap) = 0.112
P(X_2=50|Gentoo) = 0.09317
As an example, the probability density of a 50 mm bill length given a Chinstrap is about 0.112, the largest of the three. So the marginal PDF of observing a penguin with a 50 mm bill length is

$$f_{X_2}(50) = \sum_k f_{X_2 \mid Y}(50 \mid k)\,\mathbb{P}(Y = k) \approx 0.4415 \times 0.0000212 + 0.1988 \times 0.112 + 0.3596 \times 0.093 \approx 0.0558.$$

Once again, this number is not a probability; it is a probability density. It means that the probability of observing a bill length within a small interval of width $\Delta$ mm around 50 mm is approximately $0.0558 \times \Delta$.
Then, the posterior over the species given $X_2 = 50$ is

$$\mathbb{P}(Y = \text{Adelie} \mid X_2 = 50) \approx 0.0002, \quad \mathbb{P}(Y = \text{Chinstrap} \mid X_2 = 50) \approx 0.399, \quad \mathbb{P}(Y = \text{Gentoo} \mid X_2 = 50) \approx 0.601.$$

Note that each of these is a ratio of densities, but since they share the same denominator they are normalized and sum to 1, so they can be read as posterior probabilities over the three species. This time round, the final result is pushed towards Gentoo by the fact that Gentoo is more common in the wild (prior of about 0.36 versus 0.20 for Chinstrap), even though the Chinstrap likelihood at 50 mm is slightly higher.
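Putting these numbers together in code, here is a minimal sketch that reuses the `stats.norm` objects `rv_adelie`, `rv_chinstrap`, and `rv_gentoo` fitted above for bill length (the exact values depend on the rounding of the sample means and standard deviations).

x2 = 50  # observed bill length in mm

priors = np.array([0.4415, 0.1988, 0.3596])  # Adelie, Chinstrap, Gentoo
likelihoods = np.array([rv_adelie.pdf(x2), rv_chinstrap.pdf(x2), rv_gentoo.pdf(x2)])

unnormalized = priors * likelihoods
posterior = unnormalized / unnormalized.sum()  # roughly (0.0002, 0.399, 0.601)
for name, p in zip(["Adelie", "Chinstrap", "Gentoo"], posterior):
    print(f"P({name} | X_2 = 50) = {p:.4f}")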
Two (Continuous) Predictors#
The reality is that the data is usually multi-dimensional, and we need to use multiple predictors to make a prediction.
In the previous two examples, we have seen that for the same penguin with the following features,

- `body_mass_g`: below the average of about 4202 g, so `overweight` = 0
- `bill_length_mm`: 50 mm
- `flipper_length_mm`: 195 mm

the predicted species is Adelie if we only use `overweight`, and Gentoo if we only use `bill_length_mm`. This inconsistency suggests that our model may have room for improvement. Intuitively, we can add multiple predictors to our model to make a more accurate prediction, though this is not always the case if the predictors are correlated, or if there are too many predictors (curse of dimensionality).
flipper_length_mm = penguins[["flipper_length_mm", "species", "class_id"]]
display(flipper_length_mm)
| | flipper_length_mm | species | class_id |
|---|---|---|---|
| 0 | 181.0 | Adelie | 0 |
| 1 | 186.0 | Adelie | 0 |
| 2 | 195.0 | Adelie | 0 |
| 3 | 193.0 | Adelie | 0 |
| 4 | 190.0 | Adelie | 0 |
| ... | ... | ... | ... |
| 337 | 214.0 | Gentoo | 2 |
| 338 | 215.0 | Gentoo | 2 |
| 339 | 222.0 | Gentoo | 2 |
| 340 | 212.0 | Gentoo | 2 |
| 341 | 213.0 | Gentoo | 2 |

342 rows × 3 columns
x3_mean = flipper_length_mm["flipper_length_mm"].mean()
print(
f"Mean flipper length across the whole population "
f"without conditioning on class: {x3_mean:.2f} mm"
)
Mean flipper length across the whole population without conditioning on class: 200.92 mm
# plotting both distibutions on the same figure
_ = sns.kdeplot(
data=penguins,
x="flipper_length_mm",
fill=True,
common_norm=False,
alpha=0.5,
linewidth=0,
legend=True,
label="flipper_length_mm",
color="blue",
)
# plot vertical line
_ = plt.axvline(x=x3_mean, color="red", linestyle="--", label="mean")
plt.legend()
plt.title("Density plot of flipper length.")
plt.show()

x3 = 195 # penguin flipper length
# plotting both distibutions on the same figure
_ = sns.kdeplot(
data=penguins,
x="flipper_length_mm",
hue="species",
fill=True,
common_norm=False,
alpha=0.5,
linewidth=0,
legend=True,
)
# plot vertical line
_ = plt.axvline(x3, color="r", linestyle="--", label="195 mm")
plt.title("Density plot of flipper length")
# plt.legend()
plt.show();

# plotting both distibutions on the same figure
x2 = 50
_ = sns.kdeplot(
data=penguins,
x="bill_length_mm",
hue="species",
fill=True,
common_norm=False,
alpha=0.5,
linewidth=0,
legend=True,
)
# plot vertical line
_ = plt.axvline(x2, color="r", linestyle="--", label="50 mm")
plt.title("Density plot of bill length")
# plt.legend()
plt.show();

The `bill_length_mm` and `flipper_length_mm` densities are plotted above. Looking at either plot alone, it can be hard to distinguish the species: in the `bill_length_mm` plot, Chinstrap and Gentoo are quite close (their density curves overlap), while in the `flipper_length_mm` plot, Chinstrap and Adelie are quite close to each other. But if you combine both `bill_length_mm` and `flipper_length_mm` together, it becomes much easier to distinguish between the three species.
The plots below follow the Seaborn guide. With the default settings we get a joint plot of `bill_length_mm` and `flipper_length_mm` with the marginal histograms on the sides; the joint plot itself is a scatter plot of the two variables. The plot with `kind="hist"` shows the joint distribution as a 2D histogram, where the rectangular bins are colored by the density of points in each bin: the darker the color, the more points in that bin. You can think of these as the impulses from Chan's book. Lastly, the plot with `kind="kde"` shows the joint distribution as a contour plot.
sns.jointplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", color="orange")
sns.jointplot(
data=penguins,
x="bill_length_mm",
y="flipper_length_mm",
kind="hist",
color="orange",
cbar=True,
)
sns.jointplot(
data=penguins,
x="bill_length_mm",
y="flipper_length_mm",
kind="kde",
color="orange",
# cbar=True,
)
<seaborn.axisgrid.JointGrid at 0x7f21f3a63070>



The contour plot is a 2D view of the 3D density surface: each contour line connects points of equal density.
To see the joint distribution conditioned on the class (species), we pass the `hue` argument.

To emphasize, the way to approach conditional distributions is to zoom in on the "reduced" sample space: for example, if I want to look at the conditional distribution of `bill_length_mm` and `flipper_length_mm` given that the species is Adelie, I restrict attention to the rows of the dataset where the species is Adelie.
sns.jointplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species")
sns.jointplot(
data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", kind="hist"
)
sns.jointplot(
data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", kind="kde"
)
<seaborn.axisgrid.JointGrid at 0x7f21f0455d60>



Now we plot two dashed lines marking where `bill_length_mm` = 50 mm and `flipper_length_mm` = 195 mm.

To internalize this concept, we are again looking at the argmax expression below:

$$\hat{y} = \arg\max_{k} \mathbb{P}(Y = k \mid X_2 = 50, X_3 = 195) = \arg\max_{k} \frac{f_{X_2, X_3 \mid Y}(50, 195 \mid k)\,\mathbb{P}(Y = k)}{f_{X_2, X_3}(50, 195)},$$

or equivalently, dropping the common denominator,

$$\hat{y} = \arg\max_{k} f_{X_2, X_3 \mid Y}(50, 195 \mid k)\,\mathbb{P}(Y = k).$$

We want to know, for each species $k$, how much joint density it places on the observed point $(50, 195)$, weighted by the prior of that species.
sns.jointplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species")
_ = plt.axvline(x2, color="black", linestyle="--", label="50 mm")
_ = plt.axhline(x3, color="black", linestyle="--", label="195 mm")
sns.jointplot(
data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", kind="hist"
)
_ = plt.axvline(x2, color="black", linestyle="--", label="50 mm")
_ = plt.axhline(x3, color="black", linestyle="--", label="195 mm")
sns.jointplot(
data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", kind="kde"
)
_ = plt.axvline(x2, color="black", linestyle="--", label="50 mm")
_ = plt.axhline(x3, color="black", linestyle="--", label="195 mm")



Conditional Independence#
Let's use Naive Bayes to make a prediction for the same penguin, with $X_2 = 50$ mm and $X_3 = 195$ mm.

Another hurdle is presented in front of us: we need to compute the conditional joint density $f_{X_2, X_3 \mid Y}(x_2, x_3 \mid k)$, which is two-dimensional. Up to now we were dealing with only one-dimensional distributions; even though there were two variables just now, when we conditioned on the species each predictor was still handled as a one-dimensional distribution on its own.

To reconcile this, Naive Bayes assumes that the predictors are conditionally independent given the class. It states that given $Y = k$, the joint conditional density factorizes as

$$f_{X_2, X_3 \mid Y}(x_2, x_3 \mid k) = f_{X_2 \mid Y}(x_2 \mid k)\, f_{X_3 \mid Y}(x_3 \mid k).$$

The same holds in the discrete case, with the PDF replaced by the PMF.

This assumption means that within each class (species), the predictors are independent. More concretely, if we say that `bill_length_mm` ($X_2$) and `flipper_length_mm` ($X_3$) are conditionally independent given the species, then knowing the bill length of a Gentoo should tell us nothing extra about its flipper length. However, from the plots above we can see that `bill_length_mm` and `flipper_length_mm` exhibit some positive correlation within each species. We will still use this assumption even though it is not perfect.
display(penguins.corr(numeric_only=True))  # correlation matrix, not conditioned on class
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | overweight | class_id | |
---|---|---|---|---|---|---|
bill_length_mm | 1.000000 | -0.235053 | 0.656181 | 0.595110 | 0.448506 | 0.731369 |
bill_depth_mm | -0.235053 | 1.000000 | -0.583851 | -0.471916 | -0.533576 | -0.744076 |
flipper_length_mm | 0.656181 | -0.583851 | 1.000000 | 0.871202 | 0.762945 | 0.854307 |
body_mass_g | 0.595110 | -0.471916 | 0.871202 | 1.000000 | 0.851748 | 0.750491 |
overweight | 0.448506 | -0.533576 | 0.762945 | 0.851748 | 1.000000 | 0.689371 |
class_id | 0.731369 | -0.744076 | 0.854307 | 0.750491 | 0.689371 | 1.000000 |
Adelie = penguins[penguins["species"] == "Adelie"]
Chinstrap = penguins[penguins["species"] == "Chinstrap"]
Gentoo = penguins[penguins["species"] == "Gentoo"]
display(Adelie.corr(numeric_only=True))
display(Chinstrap.corr(numeric_only=True))
display(Gentoo.corr(numeric_only=True))
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | overweight | class_id | |
---|---|---|---|---|---|---|
bill_length_mm | 1.000000 | 0.391492 | 0.325785 | 0.548866 | 0.338386 | NaN |
bill_depth_mm | 0.391492 | 1.000000 | 0.307620 | 0.576138 | 0.319450 | NaN |
flipper_length_mm | 0.325785 | 0.307620 | 1.000000 | 0.468202 | 0.336676 | NaN |
body_mass_g | 0.548866 | 0.576138 | 0.468202 | 1.000000 | 0.715684 | NaN |
overweight | 0.338386 | 0.319450 | 0.336676 | 0.715684 | 1.000000 | NaN |
class_id | NaN | NaN | NaN | NaN | NaN | NaN |
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | overweight | class_id | |
---|---|---|---|---|---|---|
bill_length_mm | 1.000000 | 0.653536 | 0.471607 | 0.513638 | 0.287084 | NaN |
bill_depth_mm | 0.653536 | 1.000000 | 0.580143 | 0.604498 | 0.375972 | NaN |
flipper_length_mm | 0.471607 | 0.580143 | 1.000000 | 0.641559 | 0.398093 | NaN |
body_mass_g | 0.513638 | 0.604498 | 0.641559 | 1.000000 | 0.655613 | NaN |
overweight | 0.287084 | 0.375972 | 0.398093 | 0.655613 | 1.000000 | NaN |
class_id | NaN | NaN | NaN | NaN | NaN | NaN |
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | overweight | class_id | |
---|---|---|---|---|---|---|
bill_length_mm | 1.000000 | 0.643384 | 0.661162 | 0.669166 | 0.236458 | NaN |
bill_depth_mm | 0.643384 | 1.000000 | 0.706563 | 0.719085 | 0.235313 | NaN |
flipper_length_mm | 0.661162 | 0.706563 | 1.000000 | 0.702667 | 0.240309 | NaN |
body_mass_g | 0.669166 | 0.719085 | 0.702667 | 1.000000 | 0.425197 | NaN |
overweight | 0.236458 | 0.235313 | 0.240309 | 0.425197 | 1.000000 | NaN |
class_id | NaN | NaN | NaN | NaN | NaN | NaN |
It is obvious from the empirical correlations above that the assumption of conditional independence is not exactly true. However, we will still use this assumption even though it is not perfect, and in the event that our Naive Bayes does not perform well, we know one place where it may have gone wrong.
Back to our problem. We now have an answer on how to unpack the joint conditional density $f_{X_2, X_3 \mid Y}(x_2, x_3 \mid k)$: we can merely do

$$f_{X_2, X_3 \mid Y}(x_2, x_3 \mid k) = f_{X_2 \mid Y}(x_2 \mid k)\, f_{X_3 \mid Y}(x_3 \mid k),$$

so that our previous expression becomes

$$\hat{y} = \arg\max_{k} f_{X_2 \mid Y}(50 \mid k)\, f_{X_3 \mid Y}(195 \mid k)\,\mathbb{P}(Y = k).$$

We have already obtained the conditional distribution of `bill_length_mm` given each species; we now estimate the conditional distribution of `flipper_length_mm` given each species in the same way.
flipper_length_mean_std = flipper_length_mm.groupby("species").agg({"flipper_length_mm": ["mean", "std"]})
flipper_length_mean_std
| species | flipper_length_mm mean | flipper_length_mm std |
|---|---|---|
| Adelie | 189.953642 | 6.539457 |
| Chinstrap | 195.823529 | 7.131894 |
| Gentoo | 217.186992 | 6.484976 |
x2_x3_mean_std = penguins.groupby("species").agg(
{"bill_length_mm": ["mean", "std"], "flipper_length_mm": ["mean", "std"]}
)
display(x2_x3_mean_std)
| species | bill_length_mm mean | bill_length_mm std | flipper_length_mm mean | flipper_length_mm std |
|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 189.953642 | 6.539457 |
| Chinstrap | 48.833824 | 3.339256 | 195.823529 | 7.131894 |
| Gentoo | 47.504878 | 3.081857 | 217.186992 | 6.484976 |
rv_adelie = stats.norm(190., 6.54)
rv_chinstrap = stats.norm(196., 7.13)
rv_gentoo = stats.norm(217., 6.48)
fig, ax = plt.subplots(1, 1)
ax.plot(np.linspace(170, 240, 100), rv_adelie.pdf(np.linspace(170, 240, 100)), 'r-', lw=5, alpha=0.6, label='Adelie')
ax.plot(np.linspace(170, 240, 100), rv_chinstrap.pdf(np.linspace(170, 240, 100)), 'b-', lw=5, alpha=0.6, label='Chinstrap')
ax.plot(np.linspace(170, 240, 100), rv_gentoo.pdf(np.linspace(170, 240, 100)), 'g-', lw=5, alpha=0.6, label='Gentoo')
# plot vertical line
_ = plt.axvline(x3, color="y", linestyle="--", label="195 mm")
_ = plt.axvline(x3+0.5, color="y", linestyle="--", label="195.5 mm")
_ = plt.axvline(x3-0.5, color="y", linestyle="--", label="194.5 mm")
ax.set_xlabel('flipper_length_mm')
ax.set_ylabel('Probability density')
ax.legend()
plt.show();

px3_given_adelie = rv_adelie.pdf(x3)
px3_given_chinstrap = rv_chinstrap.pdf(x3)
px3_given_gentoo = rv_gentoo.pdf(x3)
print(f'P(X_3=195|Adelie) = {px3_given_adelie:.5f}')
print(f'P(X_3=195|Chinstrap) = {px3_given_chinstrap:.5f}')
print(f'P(X_3=195|Gentoo) = {px3_given_gentoo:.7f}')
P(X_3=195|Adelie) = 0.04554
P(X_3=195|Chinstrap) = 0.05541
P(X_3=195|Gentoo) = 0.0001934
For formality, the code above translates to

$$f_{X_3 \mid Y}(195 \mid \text{Adelie}) = \mathcal{N}(195;\, 190,\, 6.54^2) \approx 0.0455.$$

Similarly, the other two conditional densities are

$$f_{X_3 \mid Y}(195 \mid \text{Chinstrap}) \approx 0.0554, \quad f_{X_3 \mid Y}(195 \mid \text{Gentoo}) \approx 0.00019.$$

Next, we compute the denominator, the marginal density of observing $X_2 = 50$ and $X_3 = 195$ together,

$$f_{X_2, X_3}(50, 195) = \sum_{k} f_{X_2 \mid Y}(50 \mid k)\, f_{X_3 \mid Y}(195 \mid k)\,\mathbb{P}(Y = k) \approx 0.00124.$$

We plug into Bayes' rule to get the posterior probability of the penguin being an Adelie,

$$\mathbb{P}(Y = \text{Adelie} \mid X_2 = 50, X_3 = 195) \approx \frac{0.0000212 \times 0.0455 \times 0.4415}{0.00124} \approx 0.0003.$$

Similarly, we can compute the posterior probabilities of the other two species,

$$\mathbb{P}(Y = \text{Chinstrap} \mid X_2 = 50, X_3 = 195) \approx 0.994, \quad \mathbb{P}(Y = \text{Gentoo} \mid X_2 = 50, X_3 = 195) \approx 0.005.$$

We then take the argmax of the posterior probabilities to get the prediction; in this case it is Chinstrap, even though neither single-feature model predicted Chinstrap.
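The same computation in code, as a minimal sketch. Since the names `rv_adelie`, `rv_chinstrap`, and `rv_gentoo` were re-used above for the flipper-length densities, we re-create the class-conditional Gaussians for both features with the rounded parameters used in this section.

x2, x3 = 50, 195  # bill length, flipper length

priors = np.array([0.4415, 0.1988, 0.3596])  # Adelie, Chinstrap, Gentoo

bill_rvs = [stats.norm(38.8, 2.66), stats.norm(48.8, 3.34), stats.norm(47.5, 3.08)]
flipper_rvs = [stats.norm(190.0, 6.54), stats.norm(196.0, 7.13), stats.norm(217.0, 6.48)]

# naive Bayes: conditional independence => multiply the univariate densities
likelihoods = np.array([b.pdf(x2) * f.pdf(x3) for b, f in zip(bill_rvs, flipper_rvs)])

unnormalized = priors * likelihoods
posterior = unnormalized / unnormalized.sum()  # roughly (0.0003, 0.994, 0.005)
for name, p in zip(["Adelie", "Chinstrap", "Gentoo"], posterior):
    print(f"P({name} | X_2 = 50, X_3 = 195) = {p:.4f}")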
Three Predictors#
Even though it is not covered in the book, we can also use all three predictors to make a prediction: `overweight`, `bill_length_mm`, and `flipper_length_mm`. The tricky part is that `overweight` is a categorical variable, while `bill_length_mm` and `flipper_length_mm` are continuous variables. We can still do it though: dafriedman97 mentions that for a random vector of mixed features we can simply assign each predictor its own class-conditional distribution, so `overweight` is assumed to follow a Bernoulli distribution, while `bill_length_mm` and `flipper_length_mm` are assumed to follow Gaussian distributions. Note again that these are modelling choices, and we can always change them if we want to, since Naive Bayes does not prescribe any particular distribution for the predictors (a potentially confusing point!).
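A minimal sketch of how the three-predictor model could look, assuming `overweight` is modelled as Bernoulli and the two measurements as Gaussians, with all parameters estimated per class from the `penguins` dataframe. The helper name `naive_bayes_three_predictors` is just for illustration and is not part of the original code.

def naive_bayes_three_predictors(overweight: int, bill: float, flipper: float) -> pd.Series:
    """Posterior over species using one Bernoulli and two Gaussian predictors (a sketch)."""
    species_order = ["Adelie", "Chinstrap", "Gentoo"]
    priors = penguins["species"].value_counts(normalize=True).reindex(species_order)

    posterior = {}
    for k in species_order:
        sub = penguins[penguins["species"] == k]
        # Bernoulli likelihood for overweight
        p_overweight = sub["overweight"].mean()
        lik = p_overweight if overweight == 1 else 1 - p_overweight
        # Gaussian likelihoods for the two continuous measurements
        lik *= stats.norm(sub["bill_length_mm"].mean(), sub["bill_length_mm"].std()).pdf(bill)
        lik *= stats.norm(sub["flipper_length_mm"].mean(), sub["flipper_length_mm"].std()).pdf(flipper)
        posterior[k] = priors[k] * lik

    posterior = pd.Series(posterior)
    return posterior / posterior.sum()

print(naive_bayes_three_predictors(overweight=0, bill=50, flipper=195))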
Summary#
Here we summarize a few key points.
- If the prior is low but the likelihood is high, the posterior may still be low when the prior is much smaller than the likelihood.
- The normalizing constant is the same for every class during the argmax comparison, and hence is often omitted. You can always recover it afterwards, since you have the posterior and the numerator.
- The key here is estimation theory: for example, to estimate the priors, all we need to do is find the proportion of each class in the training set. This proportion is an unbiased estimator of the true prior probability.
- Conditional independence is the key assumption for Naive Bayes to work; if your model does not do well, this assumption may be violated. Beyond conditional independence (and the usual i.i.d. assumption on the samples), Naive Bayes makes no further structural assumptions. Assuming the class-conditional distribution is Gaussian is a modelling choice to make the model tractable rather than part of Naive Bayes itself; you can substitute your own choice of conditional distribution.
PyTorch#
Subset the data to only include `bill_length_mm` and `flipper_length_mm`. Further split the data into training and test sets.
X = penguins[["bill_length_mm", "flipper_length_mm"]].values
y = penguins["class_id"].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1, random_state=42)
def estimate_prior_parameters(y: torch.Tensor) -> torch.distributions.Categorical:
    """Calculate the prior distribution of the classes based on the
    relative frequencies of the classes in the dataset."""
    probs = torch.unique(y, return_counts=True)[1] / len(y)
    distribution = torch.distributions.Categorical(probs=probs)
    return distribution
The prior distribution of the whole (not split) is indeed the same as the prior distribution we got in the previous section The Prior.
prior = estimate_prior_parameters(torch.from_numpy(y))
print(f"Prior distribution: {prior.probs}")
Prior distribution: tensor([0.4415, 0.1988, 0.3596])
def estimate_likelihood_parameters(
    X: torch.Tensor, y: torch.Tensor
) -> torch.distributions.MultivariateNormal:
    """Estimate the parameters of X conditional on y, where each X_d | y is
    assumed to be a univariate Gaussian."""
    n_classes = len(torch.unique(y))
    n_features = X.shape[1]
    # mean vectors = loc -> (n_classes, n_features)
    loc = torch.zeros(n_classes, n_features)
    # covariance matrices -> (n_classes, n_features, n_features)
    # each is diagonal since the features are assumed conditionally independent
    covariance_matrix = torch.zeros(n_classes, n_features, n_features)
    for k in range(n_classes):
        X_where_k = X[torch.where(y == k)]
        for d in range(n_features):
            mean = X_where_k[:, d].mean()
            var = X_where_k[:, d].var()
            loc[k, d] = mean
            covariance_matrix[k, d, d] = var
    distribution = torch.distributions.MultivariateNormal(
        loc=loc, covariance_matrix=covariance_matrix
    )
    return distribution
Sanity check that mean and variance are correct when compared to the results from pandas dataframe.
x2_x3_mean_std_var = penguins.groupby("species").agg({"bill_length_mm": ["mean", "std", "var"], "flipper_length_mm": ["mean", "std", "var"]})
display(x2_x3_mean_std_var)
| species | bill_length_mm mean | bill_length_mm std | bill_length_mm var | flipper_length_mm mean | flipper_length_mm std | flipper_length_mm var |
|---|---|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 7.093725 | 189.953642 | 6.539457 | 42.764503 |
| Chinstrap | 48.833824 | 3.339256 | 11.150630 | 195.823529 | 7.131894 | 50.863916 |
| Gentoo | 47.504878 | 3.081857 | 9.497845 | 217.186992 | 6.484976 | 42.054911 |
class_conditionals = estimate_likelihood_parameters(torch.from_numpy(X), torch.from_numpy(y))
print(f"Mean vector:\n{class_conditionals.mean}")
print(f"Covariance matrix:\n{class_conditionals.covariance_matrix}")
Mean vector:
tensor([[ 38.7914, 189.9536],
        [ 48.8338, 195.8235],
        [ 47.5049, 217.1870]])
Covariance matrix:
tensor([[[ 7.0937,  0.0000],
         [ 0.0000, 42.7645]],

        [[11.1506,  0.0000],
         [ 0.0000, 50.8639]],

        [[ 9.4978,  0.0000],
         [ 0.0000, 42.0549]]])
The mean is a $3 \times 2$ matrix, matching the `x2_x3_mean_std_var` table above.

For the mean matrix, the first row holds the means of `bill_length_mm` and `flipper_length_mm` for Adelie, the second row the means for Chinstrap, and the third row the means for Gentoo. For instance, the first row is $(38.79, 189.95)$, the mean `bill_length_mm` and `flipper_length_mm` for Adelie.

For the covariance matrix, each class gets a $2 \times 2$ diagonal matrix. For instance, the first slice is the covariance matrix of `bill_length_mm` and `flipper_length_mm` for Adelie, where the diagonal entries are the variances of `bill_length_mm` and `flipper_length_mm` and the off-diagonal entries are 0.
Now, this may seem confusing at first, as we have established that each conditional distribution $X_d \mid Y = k$ is a univariate Gaussian, yet the code builds a single multivariate Gaussian per class. The catch is in Random Vectors: when the components of a random vector are independent Gaussians, the resulting covariance matrix is diagonal. The diagonals of `class_conditionals.covariance_matrix` are the variances of each univariate Gaussian distribution, and the off-diagonals are 0. This is a compact way to represent the mean and covariance matrix of a multivariate Gaussian distribution whose components are independent.

In my code, `theta` refers to these estimated parameters (the mean vectors and the diagonal covariance matrices); the first row/slice corresponds to Adelie, the second to Chinstrap, and the third to Gentoo.
Below are some helper functions to plot the contours.
def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)


def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    X_input_space = np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1)
    X_input_space = torch.from_numpy(X_input_space)
    Z = prob_fn(X_input_space)
    Z = torch.exp(Z)  # convert log pdf to pdf
    Z = Z.numpy()
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)


def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("bill_length_mm")
    plt.ylabel("flipper_length_mm")
    plt.legend()
class_names = ["Adelie", "Chinstrap", "Gentoo"]
class_colours = ['blue', 'orange', 'green']
We assume the features are conditionally independent given the class label, and that each follows a Gaussian. Note that we need `torch.exp` to map the log-density returned by `log_prob` back to the ordinary density scale.
class_conditionals.log_prob(torch.tensor([50, 195])), torch.exp(class_conditionals.log_prob(torch.tensor([50, 195])))
(tensor([-13.8483, -5.0759, -11.0133]),
tensor([9.6774e-07, 6.2458e-03, 1.6482e-05]))
plt.figure(figsize=(10, 6))
plot_data(X, y, class_names, class_colours)
x0_min, x0_max = X[:, 0].min(), X[:, 0].max()
x1_min, x1_max = X[:, 1].min(), X[:, 1].max()
contour_plot(
(x0_min, x0_max),
(x1_min, x1_max),
class_conditionals.log_prob,
3,
class_colours,
num_points=100,
)
plt.title("Training set with class-conditional density contours")
plt.show()

So the contour plot above shows each class-conditional distribution. For example, the blue dots are the Adelie penguins, with `bill_length_mm` on the x-axis and `flipper_length_mm` on the y-axis, and the blue contours are a 2D visualization of the fitted joint density of `bill_length_mm` and `flipper_length_mm` given Adelie.

Of course, the real data points (the blue scatter) do not follow the contours exactly, since there is no reason to believe that Adelie's `bill_length_mm` and `flipper_length_mm` are exactly Gaussian distributed with this mean and covariance matrix.
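Finally, here is a hedged sketch of how the fitted prior and class-conditionals could be turned into predictions and scored with the `accuracy_score` imported earlier. The `predict` helper below is not part of the original code, and note that the parameters here were estimated on the full dataset rather than only on `X_train`.

def predict(X: torch.Tensor) -> torch.Tensor:
    """Argmax over log prior + log class-conditional density for each sample (a sketch)."""
    log_prior = torch.log(prior.probs)                           # (n_classes,)
    log_likelihood = class_conditionals.log_prob(X[:, None, :])  # (n_samples, n_classes)
    return torch.argmax(log_prior + log_likelihood, dim=-1)

X_test_tensor = torch.from_numpy(X_test).float()
y_pred = predict(X_test_tensor)
print(f"Test accuracy: {accuracy_score(y_test, y_pred.numpy()):.3f}")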
Further Readings#
Johnson, Alicia A. "Chapter 14: Naive Bayes Classification." In Bayes Rules!: An Introduction to Applied Bayesian Modeling. Chapman & Hall/CRC, 2022.
https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris