Find the intersection of overlapping histograms in R

Here, I demonstrate how to find the point where two histograms overlap. While this is an approximation, it seems to have a very high level of precision.

Prepare simulated data

I created two data sets, gamma_dist and norm_dist, which are made up of a different number of values sampled randomly from a gamma distribution and normal distribution, respectively. I specicially made the data sets different sizes to make the point that this method is still applicable.

library(tibble)

set.seed(0)

gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

df <- tibble(
  x = c(gamma_dist, norm_dist),
	original_dataset = c(rep("gamma_dist", 1e5), rep("norm_dist", 5e5))
)
df
#> # A tibble: 600,000 x 2
#>        x original_dataset
#>    <dbl> <chr>
#>  1  6.89 gamma_dist
#>  2  2.25 gamma_dist
#>  3  1.30 gamma_dist
#>  4  4.10 gamma_dist
#>  5  7.77 gamma_dist
#>  6  5.08 gamma_dist
#>  7  4.58 gamma_dist
#>  8  2.30 gamma_dist
#>  9  1.36 gamma_dist
#> 10  1.67 gamma_dist
#> # … with 599,990 more rows

I used ‘ggplot2’ to plot the densities of the two data sets. The gamma distribution is in red and the normal distribution is in blue. I broke the creation of the plot into two steps: the essential step to create the density curves, and the styling step to make the plot look nice. Of course, these could be combined into a single long ggplot statement.

library(ggplot2)

p <- ggplot(df) +
  geom_density(aes(x = x, color = original_dataset))

p <- p +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0.05))) +
  scale_color_manual(values = c("tomato", "dodgerblue")) +
  theme_minimal() +
  theme(
      legend.title = element_blank(),
      plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "values",
       title = "Two density curves")

Finding the point of intersection

To find the point of intersection, I first binned the data sets using density. It is essential to use the same from and to values for each data set. The density function creates 512 bins, thus, providing the same starting and ending parameters makes density use the same bins for each data set.

from <- 0
to <- 40
gamma_density <- density(gamma_dist, from = from, to = to)
norm_density <- density(norm_dist, from = from, to = to)

The final step was to find where the density of the gamma distribution was less than the normal distribution. Therefore, I applied this logic to create the boolean vector idx. I also included two other filters to contain the result between 5 to 20 because, from the plot above, I can see that the intersection falls within this range.

idx <- (gamma_density$y < norm_density$y) &
       (gamma_density$x > 5) &
       (gamma_density$x < 20)
poi <- min(gamma_density$x[idx])
poi
#> 10.64579

That’s it, the point of intersection has been approximated to a high precision. A vertical line was added to the plot below at poi.

p <- p + 
  geom_vline(xintercept = poi, linetype = 2, size = 0.3, color = "black") +
  annotate(geom = "text", label = round(poi, 3),
           x = poi - 1, y = 0.1, size = 4, angle = 90)

Avatar
Joshua Cook
Graduate Student

My research interests include cancer genetics and evolution. I also learning about programming and computer science in general.

Related