Introduction to Probability

With the rise of online news and social media platforms that allow users to share articles with minimal oversight, fake, misleading, and biased news has become widespread.
To illustrate this, we will look at a sample of 150 articles that were shared on Facebook during the run-up to the 2016 U.S. presidential election. The data were collected as part of a study called “FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media”.
We’ll consider four variables that describe the articles:
Press Run Code to import the data frame.
An example data table
| id | type |
|---|---|
| 28 | fake |
| 80 | real |
| 101 | fake |
| 111 | fake |
| 137 | real |
| 133 | fake |
| 144 | fake |
| 132 | fake |
| 98 | fake |
| 103 | real |
Number of real articles: 3
Number of fake articles: 7
Proportion that are fake: 0.70
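The summary above can be reproduced in base R. A minimal sketch, assuming the types of the 10 drawn articles are stored in a character vector (the values below come from the example table):

```r
# Types of the 10 drawn articles, in the order shown in the example table
drawn <- c("fake", "real", "fake", "fake", "real",
           "fake", "fake", "fake", "fake", "real")

# Count each type and compute the proportion that are fake
counts <- table(drawn)
prop_fake <- mean(drawn == "fake")

counts     # fake: 7, real: 3
prop_fake  # 0.7
```

Here `mean(drawn == "fake")` works because the comparison produces TRUE/FALSE values, and the mean of a logical vector is the proportion of TRUEs.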
On your worksheet, create a bar chart to represent the number of articles you selected that were real and fake.
Represent the same information in the sample space diagram on your worksheet. A sample space is the set of all possible outcomes of a random experiment (here: Real or Fake). Write the number of your drawn articles that are Fake inside the bubble. Write the number that are not Fake (the complement of Fake, in other words Real) outside the bubble but still inside the rectangle.
Each individual draw is a Bernoulli trial.
Each experiment (10 trials) produces one Binomial outcome (i.e., the number of fake articles that you drew).
The class’s results across many experiments form the empirical distribution of that Binomial random variable.
Now we will work together to build the empirical distribution of counts.
Take turns sharing your result. One at a time, each student will say how many of their 10 articles were fake.
Mark the chart. For each student’s result:
Find that number of fake articles on the x-axis.
Shade in the next empty square above that number.
Continue around the room. As we go, the columns above each x-axis value will grow taller, showing how many students had that result.
By the end, the chart will display the empirical distribution of the number of fake articles in our class’s trials.
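If you want to preview what the finished chart might look like, you can simulate the whole class in R. A sketch under assumed values: a hypothetical class of 30 students and the study's fake rate of 0.40 (introduced formally below):

```r
set.seed(1)  # for reproducibility

n_students <- 30    # assumed class size
n_trials   <- 10    # articles drawn per student
p_fake     <- 0.40  # assumed probability a single article is fake

# Each simulated student reports how many of their 10 articles were fake
results <- rbinom(n_students, size = n_trials, prob = p_fake)

# Tallying the results gives the empirical distribution of counts:
# one column height per value on the x-axis
table(results)
```

Each run of `rbinom()` is one simulated classroom; rerunning without `set.seed()` gives a different empirical distribution, just as a different class would.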
Now that we have filled in the empirical distribution of counts (the left chart), we’ll use it to create the empirical cumulative distribution of counts (the right chart).
Start at 0. Look at how many students had 0 fake articles.
Add as you go. For each next value on the x-axis (1, 2, 3, …, 10):
Add up all the students with that number of fake articles plus all the students with fewer fake articles.
Record this running total in the corresponding column on the cumulative chart.
Example: If 3 students had 1 fake article and 2 students had 0 fake articles, then the cumulative count at 1 equals 3 + 2 = 5.
Continue to the end. Repeat this process until you reach 10 fake articles. The final cumulative total should equal the total number of students in the class.
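The running total described above is exactly what R's `cumsum()` computes. A sketch using hypothetical class results, chosen so that the first two values match the example (2 students with 0 fake articles, 3 students with 1):

```r
# Number of students reporting each count k = 0, 1, 2, ..., 10 of fake articles
# (hypothetical class results)
counts_per_k <- c(2, 3, 6, 8, 5, 4, 2, 0, 0, 0, 0)

# Running total: the empirical cumulative distribution of counts
cumulative <- cumsum(counts_per_k)

cumulative[2]        # cumulative count at k = 1: 3 + 2 = 5
tail(cumulative, 1)  # final total = total number of students (30 here)
```

Note that `cumulative[2]` corresponds to k = 1 because R vectors are indexed from 1 while the counts start at k = 0.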
Now that we’ve built an empirical distribution from our class experiments, let’s compare it to the theoretical probabilities we would expect if each trial were truly a Bernoulli draw with probability (\(p\)) of being Fake.
In the study from which these data were drawn, 40% of the articles were classified as fake. This means the probability that any single article is fake can be modeled as \(p = 0.40\).
A Bernoulli trial is a random experiment with exactly two possible outcomes (here: Fake or Real).
These outcomes follow a Bernoulli distribution with parameter \(p = 0.40\), the probability that a single article is fake.
The Bernoulli PMF gives the probability of each possible outcome (Fake or Real) in a single trial.
Because Fake and Real are the only possible outcomes, their probabilities must sum to 1. \[P(\text{Fake}) + P(\text{Real}) = 0.40 + 0.60 = 1.0.\]
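A Bernoulli trial is simply a Binomial with a single trial, so R's `dbinom()` can evaluate its PMF directly with `size = 1`:

```r
p <- 0.40  # probability that a single article is fake

# Bernoulli PMF, evaluated as a Binomial with size = 1
p_fake <- dbinom(1, size = 1, prob = p)  # P(Fake) = 0.40
p_real <- dbinom(0, size = 1, prob = p)  # P(Real) = 0.60

p_fake + p_real  # the two probabilities sum to 1
```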
When we repeat a Bernoulli trial \(n\) times (say, drawing 10 articles), the random variable is no longer “Fake or Real,” but rather the number of Fake articles out of 10.
These outcomes follow a Binomial distribution with parameters:
- \(n = 10\): the number of trials, and
- \(p = 0.40\): the probability that any single article is fake.
The Binomial PMF gives the probability of observing exactly \(0, 1, 2, \dots, 10\) Fake articles across those 10 trials.
The function dbinom() in R calculates probabilities from the Binomial distribution. It answers the question: What is the probability of getting exactly k successes in n independent Bernoulli trials, each with success probability p?
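For example, with our parameters (\(n = 10\), \(p = 0.40\)):

```r
# P(exactly 4 fake articles in 10 draws), with p = 0.40
dbinom(4, size = 10, prob = 0.40)  # about 0.251

# The full PMF over k = 0, 1, ..., 10 sums to 1
sum(dbinom(0:10, size = 10, prob = 0.40))  # 1
```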
Take a few moments with a partner to change the values of n_trials and p_success — notice how the graph changes.
There are a couple of new functions in this code. First, data.frame(k = 0:n_trials) creates a quick dataset with all possible outcomes (k = 0, 1, 2, …, n_trials) — all the values we could possibly observe. Second, dbinom(k, size = n_trials, prob = p_success) inside aes() computes the probabilities on the fly (no need to create a data frame first, as we did previously). Last, glue() (from the glue package) is used to insert variable values directly into text strings. So if you change n_trials or p_success at the top, the labels automatically update — no need to edit the title by hand.
In this version, the x-axis is treated as continuous (the factor(k) argument is removed). We make this trade-off in this exploration graphic because using a truly discrete scale becomes unreadable with larger numbers of trials, as every single integer value would need to be labeled on the axis.
The center of the distribution shifts:
If p < 0.5, the distribution leans to the left (more weight on smaller counts).
If p > 0.5, it leans to the right (more weight on larger counts).
If p = 0.5, it’s symmetric around \(n/2\).
The spread also changes: values of p closer to 0 or 1 make the distribution more concentrated at the edges, while p near 0.5 gives the widest spread.
The x-axis range expands: with larger n, there are more possible outcomes (0 all the way up to n).
The distribution looks smoother: for small n (like 5 or 10) it’s jagged, but as n grows, the bars form a more bell-shaped curve.
For large n, the distribution begins to resemble a normal distribution — even when the probability of success is far from 0.5.
In the fake news study, a second variable considered by the researchers was whether or not the article had an exclamation point in the title.
Through the study, the researchers determined that:
The probability that an article was fake, \(P(\text{Fake}) = 0.40\).
The probability that an article had an exclamation point in the title, \(P(\text{With !}) = 0.12\).
These two events were not independent: among fake articles, the probability of an exclamation point, \(P(\text{With !} \mid \text{Fake}) = 0.267\), is much higher than the overall \(P(\text{With !}) = 0.12\).
The joint probability — the probability that both events occur (an article is Fake and has an Exclamation point in the title) — for dependent events is given by the multiplication rule:
\[ P(\text{Fake and With !}) = P(\text{Fake}) \times P(\text{With !} \mid \text{Fake}) \]
\[ = 0.40 \times 0.267 \approx 0.107 \]
How can we represent this in the sample space diagram?
16 articles were fake and had an exclamation point, 44 were fake but didn’t have an exclamation point, 2 were real but had an exclamation point. The remainder (88) were real and didn’t have an exclamation point.
Divide every cell (including the margins) by the grand total (N).
Cross tabulation with margins (cell proportions)

| | real | fake | Total |
|---|---|---|---|
| without ! | 0.587 | 0.293 | 0.880 |
| with ! | 0.013 | 0.107 | 0.120 |
| Total | 0.600 | 0.400 | 1.000 |
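The cell proportions can be derived from the raw counts in base R, following exactly the rule above (divide every cell, including the margins, by the grand total). A sketch:

```r
# Raw counts from the study (rows: exclamation, columns: type)
counts <- matrix(c(88, 2, 44, 16), nrow = 2,
                 dimnames = list(exclamation = c("without !", "with !"),
                                 type = c("real", "fake")))

# Add row and column margins, then divide every cell by the grand total
with_margins <- addmargins(counts)
cell_props   <- round(with_margins / sum(counts), 3)

cell_props  # matches the cell-proportion table above
```

`addmargins()` (from the stats package, loaded by default) appends the "Sum" row and column, so a single division converts the whole table, margins included, to proportions.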
Raw counts are frequencies; proportions are relative frequencies; treating proportions as probabilities assumes the data represent the whole population.
Cross table with fake articles highlighted

| exclamation | real | fake | Total |
|---|---|---|---|
| without ! | 88 | 44 | 132 |
| with ! | 2 | 16 | 18 |
| Total | 90 | 60 | 150 |
Cross table with ! articles highlighted

| exclamation | real | fake | Total |
|---|---|---|---|
| without ! | 88 | 44 | 132 |
| with ! | 2 | 16 | 18 |
| Total | 90 | 60 | 150 |
Prior belief: \(P(\text{Fake}) = 0.40\).
Evidence observed: Title has an exclamation point (\(\text{With !}\)).
Bayes’ Rule: \(P(\text{Fake}\mid \text{With !}) \;=\; \dfrac{P(\text{With !}\mid \text{Fake})\,P(\text{Fake})}{P(\text{With !})} \;=\; \dfrac{0.267 \times 0.40}{0.12} \;\approx\; 0.889.\)
Interpretation: seeing “!” updates belief from \(0.40\) to about \(0.889\) that the article is fake.
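The update is easy to verify directly in R, using the three probabilities from the study:

```r
p_fake            <- 0.40   # prior: P(Fake)
p_excl_given_fake <- 0.267  # likelihood: P(With ! | Fake)
p_excl            <- 0.12   # evidence: P(With !)

# Bayes' rule: posterior probability the article is fake, given "!" in the title
posterior <- p_excl_given_fake * p_fake / p_excl
posterior  # about 0.89
```

Equivalently, reading straight off the cross table: of the 18 articles with an exclamation point, 16 were fake, and \(16/18 \approx 0.889\).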