
HODP x Bluebonnet: Modeling Political District Similarity

Lauren Chen, Alex Cheng, Richard Zhu


This semester, a team of three from the Harvard Open Data Project (HODP), a student-faculty group that aims to increase transparency and solve problems on campus using public Harvard data, partnered with Bluebonnet Data. We helped Bluebonnet build a model that measures demographic, political, and geographic similarity among all 435 US congressional districts, to help identify elections and districts that Bluebonnet could target for progressive campaigns.

For initial political campaign analysis, it is often valuable to learn about a district by looking at previous comparable elections for the same seat. In many cases, however, there are few or no comparable elections to look at (e.g. the incumbent has gone unchallenged for a decade or more). This issue is especially prevalent at the state and local levels, which creates a need for an alternative way to obtain information about specific districts. One alternative is to look at similar districts that do have relevant past elections. We therefore created a model that identifies comparable districts with an election history to draw on.


Several factors can go into determining how similar districts are in terms of political behavior, including how rural they are, their demographics, turnout in various elections, etc. For our model, we decided to focus on demographics, voting in past House and Presidential elections, and geographic location.

Demographic data were acquired from the Census database and included information about race, age, income, and education for each congressional district. Political data from the 2012, 2016, and 2020 presidential elections and the 2014, 2016, and 2018 House elections were provided by Daily Kos and Harvard Dataverse, respectively. Shapefiles of congressional districts were used to determine their centroids.


The model was based on the 2018 Politico district similarity maps, which analyzed political and demographic similarity of districts separately. For our model, we retrained the political model on the 3 most recent House and Presidential election cycles and merged it with the demographic model, which was retrained on the most recent 2018 census data. We then added the geographic centroid of each district as another model parameter to account for locational similarity. In short, we combined demographic information, past political results, and geographic location of each congressional district into one model. Below is a diagram of the pipeline used to develop our model.

We used a weighted Euclidean distance to measure district similarity. Districts with a lower distance (meaning they are “closer” together across our measured variables) are considered more similar, while districts with a larger distance are considered more different. Rather than weighting all variables equally, we weighted each variable by its political importance; for example, a district’s 2018 House election results are likely to be a more important indicator of political behavior than a district’s longitude.
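To make the idea concrete, here is a minimal sketch of a weighted Euclidean distance between two districts. The function name, the three example features, and the weight values are all hypothetical, and the sketch assumes each weight multiplies the squared difference for its variable; the exact form used in our pipeline may differ.

```python
import numpy as np

def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance between two feature vectors.

    Assumes each weight scales the squared difference of its variable.
    """
    a, b, weights = (np.asarray(v, dtype=float) for v in (a, b, weights))
    return float(np.sqrt(np.sum(weights * (a - b) ** 2)))

# Hypothetical standardized features: [House margin, median income, longitude]
district_a = [0.20, 1.0, -0.5]
district_b = [0.10, 0.0, 1.5]

# Illustrative weights only, not the fitted values from our regressions
weights = [0.9, 0.4, 0.1]

d = weighted_euclidean(district_a, district_b, weights)
print(round(d, 3))
```

Because longitude carries a small weight here, the large gap in longitude contributes no more to the distance than the much smaller gap in income.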

As such, we assigned each variable a weight reflecting its political importance, using the coefficients from a multiple linear regression model fit with Python’s scikit-learn package. We grouped our variables into two buckets and ran a separate multiple linear regression for each bucket:

  1. Demographic and geographic variables (race, education, income, density, and centroids of the district)

  2. Political variables (past election results for the House and President)

The dependent variable for both models was the presidential election result margin in each district. Fitting a multiple linear regression for each bucket gave us the following set of coefficients. The magnitude of each variable’s coefficient became its weight in our Euclidean distance; for example, income has a coefficient of -0.428, so its scalar weight is 0.428.
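The weighting step described above can be sketched roughly as follows, using scikit-learn's `LinearRegression`. The feature names and the data are synthetic placeholders, and standardizing the features before fitting is an assumption added here so that coefficient magnitudes are comparable across variables; it is not necessarily what our pipeline did.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def coefficient_weights(X, y, feature_names):
    """Fit a multiple linear regression and return |coefficient| per feature."""
    # Assumption: put every feature on the same scale before fitting
    X_std = StandardScaler().fit_transform(X)
    model = LinearRegression().fit(X_std, y)
    return {name: float(abs(c)) for name, c in zip(feature_names, model.coef_)}

# Synthetic stand-in for one bucket: 435 districts, 3 hypothetical features
rng = np.random.default_rng(0)
X = rng.normal(size=(435, 3))
# Synthetic "presidential margin" driven mostly by the first two features
y = 0.8 * X[:, 0] - 0.4 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(scale=0.1, size=435)

names = ["college_share", "median_income", "longitude"]
weights = coefficient_weights(X, y, names)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Taking the absolute value means a variable with a strongly negative coefficient is treated as just as important as one with a strongly positive coefficient, which matches the income example above.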

We then computed the weighted distance between every pair of districts to obtain a full list of similarities for each district. The calculated similarity values can be found here.
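As an illustration of the pairwise step, the sketch below builds a full distance matrix with NumPy broadcasting. The function name and the toy inputs are hypothetical, and it again assumes each weight scales the squared difference of its variable.

```python
import numpy as np

def pairwise_weighted_distances(X, weights):
    """Return an (n, n) matrix of weighted Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    # Scaling each column by sqrt(weight) makes the plain Euclidean distance
    # of the scaled rows equal to the weighted distance of the original rows
    scaled = X * np.sqrt(np.asarray(weights, dtype=float))
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy example: three "districts" described by two features
X = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
D = pairwise_weighted_distances(X, weights=[1.0, 0.25])
print(D.round(3))
```

For 435 districts the intermediate array is only 435 x 435 x (number of variables), so this direct approach fits comfortably in memory.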


The model outputs the 22 most similar districts for each of the 435 districts. In ad hoc checks against results from previous elections, its matches appear politically sensible.
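Given a distance matrix, extracting the closest matches for each district could look like the sketch below. The function name, the placeholder labels, and the distance values are made up for illustration and are not model output.

```python
import numpy as np

def most_similar(D, labels, n=22):
    """Map each label to its n nearest neighbors, closest first, excluding itself."""
    D = np.asarray(D, dtype=float)
    result = {}
    for i, label in enumerate(labels):
        order = np.argsort(D[i], kind="stable")
        order = order[order != i][:n]  # drop the district itself
        result[label] = [labels[j] for j in order]
    return result

# Placeholder districts and made-up distances
labels = ["A", "B", "C", "D"]
D = np.array([
    [0.0, 0.3, 2.1, 2.4],
    [0.3, 0.0, 2.0, 2.2],
    [2.1, 2.0, 0.0, 0.5],
    [2.4, 2.2, 0.5, 0.0],
])

top = most_similar(D, labels, n=2)
print(top["A"])
print(top["C"])
```

With the real 435 x 435 matrix, leaving `n` at its default of 22 would give a list of the same length as the one our model reports per district.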

Historically Republican strongholds were matched with similar Republican districts. For example, Alabama's 1st district, which elected Representative Jerry Carl (R) and went 63% for Trump in the 2020 election, was identified by the model as most similar to Mississippi's 1st district, which is also a Republican stronghold in a similar area.


Similarly, Democratic districts were matched with historically Democratic strongholds. For example, California's 18th congressional district, which is represented by Anna Eshoo (D) and went 76.4% for Biden, was rated most similar by our model to Jerry Nadler’s (D) NY-10 in Manhattan.


For toss-up districts, results seem more mixed. For example, California’s 21st district, which went 54% for Biden and narrowly elected Republican representative David Valadao, was matched with Texas’s 34th district, which has been held by Democrat Filemon Vela Jr. for the past 8 years (he won 55.4%-41.9% against his Republican opponent in 2020) and went 52% for Biden.



Using demographic, political, and geographic data to measure similarity between political districts produced an effective model. Our model can help political candidates running in districts with less data run more compelling campaigns by helping determine how previous candidates ran in similar districts.

Future steps for the project include applying the model to down-ballot races, where it could help candidates greatly because there is often limited data available. The model could also be improved by tuning the weights of the variables used and by adding other potentially significant variables. In addition, it would be helpful to build a user interface that displays the data in more accessible ways, allowing users to easily navigate similarity ratings, demographic breakdowns, political backgrounds, and other relevant information.


