Jobs Due to Biden’s Infrastructure Plan: What is Being Discussed is Not What You Think

A.  Introduction

Politicians have always been eager to announce that a program they have proposed will “create jobs”.  The Biden administration is no exception.  Indeed, President Biden has titled his $2.2 trillion proposal to rebuild America’s infrastructure the “American Jobs Plan”.  And all this is understandable, given the politics.  You would be forgiven, however, for assuming that what is being discussed on the additional jobs that would follow from Biden’s infrastructure proposals has something to do with jobs such as those depicted in the picture above.  They don’t.  The numbers on “new jobs created” that are being bandied about are on something else entirely.

There has also been some confusion on how many jobs that might be.  In remarks made on April 2, soon after his initial announcement of the proposed $2.2 trillion infrastructure initiative, Biden said:  “Independent analysis shows that if we pass this plan, the economy will create 19 million jobs — good jobs, blue-collar jobs, jobs that pay well.”  The estimate is from an analysis made by Mark Zandi, Chief Economist of Moody’s Analytics (a subsidiary of Moody’s, the bond credit rating agency).  Zandi is a well-respected economist, who was an economic advisor to John McCain during his 2008 campaign for the presidency and who has advised both Democrats and Republicans.

The 19 million jobs figure is an estimate made by Zandi and his team at Moody’s Analytics of how many more jobs there would be in the US (or, more precisely, non-farm employees) in 2030 as compared to the average number in 2020, in a scenario where Biden’s infrastructure plan is approved as proposed and then implemented.  It is important to note, however, that this is an estimate of the total number of jobs that “the economy will create” over the decade if the plan is passed (which is what Biden specifically said), and not an estimate of the extra number of jobs that can be attributed to the American Jobs Plan itself.  It would be easy to miss this distinction.  The Moody’s Analytics estimates are that the number of jobs in the economy would rise between 2020 and 2030 by 19.0 million if the plan is passed as proposed, but by 16.3 million if only the covid-relief plan (Biden’s $1.9 trillion American Rescue Plan) is passed (as it has been), and by 15.7 million in a scenario where neither plan was passed.  Thus in the Moody’s Analytics forecasts, the number of jobs in 2030 would be 2.7 million higher than otherwise if the infrastructure plan is now passed (on top of the extra 0.6 million if only the covid-relief plan were passed).

But it is easy to misstate these distinctions, and some of the administration appointees discussing the proposal with the press at first did so.  In particular, Pete Buttigieg, the Transportation Secretary, and Brian Deese, the head of the National Economic Council in the White House, at first used wording that implied that the full 19 million additional jobs would be due to the infrastructure plan itself.  They later clarified that they had misspoken, and that the Moody’s Analytics estimates were of 2.7 million additional jobs due to the infrastructure plan.  However, this did not keep various news media fact-checkers (including at CNN and at the Washington Post) from taking them to task on it (with the Washington Post awarding Biden “two Pinocchios” in its fact-checking scoring system for being, in its view, misleading).

One can question whether this is quibbling over language that was not fully clear.  But what is of far greater importance is that it misses the fundamental question of what any of these employment forecasts (whether of 19 million, or 2.7 million, or 0.6 million from the $1.9 trillion covid-relief plan) actually mean.  Keep in mind that they are all estimates of how many more people will be employed in 2030 compared to the number employed in 2020, or in a comparison of one scenario for 2030 compared to another.  They are specifically not estimates of the number of jobs of primarily construction workers who would be employed as a direct result of the new infrastructure investments being built.  Yet the wording of Biden, stating that these would be well-paying blue-collar jobs, would appear to indicate that that is what he had in mind when citing the figures.

Furthermore, if the job figures were intended to refer to the blue-collar construction workers who would be hired to build these projects, it does not make much sense to base a comparison on 2030.  By that point the infrastructure plan would be essentially over, with just a small residual amount still to be spent as the program is tailing off (of the $2.2 trillion total, just $81 billion in 2030 and a final $35 billion in 2031 would remain to be spent in the Moody’s estimates).  Few construction workers would still be employed on those projects by that point.  Rather, what may be of interest is not some relatively small change in the overall number of people employed at some end-point, but rather the number of person-years of employment of such workers during the full period of the infrastructure plan.  But the Moody’s estimates are specifically not that.

This then brings up the question of what Moody’s is in fact estimating.  That will be the focus of this blog post.  It is not the number of jobs in construction that will be created as a result of the new work on infrastructure, as these will be down to a fairly minor level by 2030.  As we will see, it is rather an estimate resulting from some secondary aspects of the Moody’s model, and it is not even clear whether the differences were intended to be meaningful.

To start, this post will review how estimates of future employment are traditionally made – for example by the Bureau of Labor Statistics (BLS).  In brief, they are based on population estimates and on forecasts of what share of different population groups will seek to be part of the labor force (the labor force participation rates), with then the assumption that the economy will be at full employment at that future date.  The full employment assumption is made not because the forecaster is confident the economy will in fact be at full employment in that forecast year.  Rather, they do not really know what the short-term conditions will be in that future year, and assuming full employment is just for setting a benchmark.  Unemployment depends on how successful monetary and fiscal policies would have been in that future year to bring the economy to full employment.  Such policies are short-term, depend on the immediate situation, and we have no way of knowing now (in 2021) what shocks or surprises the economy will be facing in 2030.

With this the case, why is Moody’s forecasting any difference at all in the 2030 employment numbers?  The differences are in fact not large when compared to what overall employment will be in that year.  But there are some, and we will discuss why that is.

The post will then look at what one might say on jobs in the intervening years.  While Moody’s has produced year-by-year estimates, its approach for those years (after the next couple of years, as they forecast the economy moves to full employment) is fundamentally similar to what they assume for 2030.  What Moody’s specifically did not do in its analysis was try to estimate the direct number of jobs (or more precisely, person-years of employment) of those employed on the infrastructure projects in Biden’s plan.  Someone will likely do that at some point, but it was not done here.  The question I will then look at is whether this should be seen as “job creation”.  I will argue that it would be more appropriate to look at it as job shifting rather than job creation, as the total number of jobs in the economy (the number employed) will likely not be all that much different.  And there is nothing wrong with that.  The primary objective, after all, is to build and maintain our badly needed infrastructure.  And on the employment that would follow, providing more attractive jobs that workers will seek to shift into is a good thing.  But the total number employed may not change, and if that is the metric one tries to use, one will likely be disappointed.  Many, including politicians, are often confused about this.

None of this should be taken to imply that the infrastructure plan is not warranted.  It desperately is, as will be discussed in the penultimate section of this post.  The US has underinvested in public infrastructure for decades, and what we have is an embarrassment compared to what is seen in Europe or East Asia.  And it has direct implications for productivity.  Truck drivers are not productive when they are sitting in traffic jams due to our poor highways.  But it is wrong to assess the value of an infrastructure investment program by some estimate of the number of jobs created.  Yes, there will be workers employed on the projects, in likely well-paid jobs.  But that should not be the objective – better public infrastructure should be the objective, achieved as efficiently as possible.  A focus on “jobs created” is instead likely to lead to confusion, as it has with the Moody’s numbers.

We will then end with a short summary and conclusions section.

Finally, note that the version of Biden’s infrastructure plan examined by Zandi and his team was estimated to cost $2.2 trillion over ten years.  However, one will see references to Biden’s plan as costing $2.0 trillion, or $2.3 trillion, or some other amount.  The final amount will depend, of course, on whatever Congress approves, but for consistency I will focus here on the plan as assessed by Zandi, at an estimated cost of $2.2 trillion.

B.  Forecasting Future Employment Levels

Yogi Berra purportedly said:  “It’s tough to make predictions, especially about the future”.  Whether he actually said that is not so clear, but it is certainly true.  And this is especially true of predictions of future employment.  But some things are more predictable than others, and the trick is to make use of factors that change only slowly over time.

In particular, population forecasts for periods of a decade or so are relatively reliable.  Those in a particular age bracket now will be ten years older a decade from now, and all one needs then to adjust for are mortality rates (which are known and change only slowly over time) and net migration rates (which are relatively small in magnitude).  Thus the Census Bureau can produce fairly reliable population forecasts for periods of a decade, and can provide these for groups broken down by age bracket as well as sex, race, and ethnicity.

The Bureau of Labor Statistics starts from such Census Bureau forecasts to produce its projections of the labor force and employment.  The BLS does this annually, with the most recent such projections from September 2020 covering the period 2019 to 2029.  The BLS takes the Census Bureau forecasts for the adult population (age 16 and above), with these broken up into age groups (mostly 10-year groups, i.e. aged 25 to 34, 35 to 44, etc.) and by sex, with overriding checks based on race (white, black, other) and ethnic (Hispanic and non-Hispanic) classifications.  For each of these groups, it estimates, based on a statistical analysis of historical trends, what its labor force participation rate can be expected to be in the projection year.  The labor force participation rate is the share of the population within each group who choose to be part of the labor force (i.e. either employed or, if unemployed, seeking a job).  Labor force participation rates change only slowly over time (as was discussed in this earlier post on this blog), so this is a reasonable approach for estimating what the labor force might be in a decade’s time.

Employment will then be the labor force minus the number who are unemployed.  But there is no way to know beyond the next few years what the unemployment rate might then be.  It will depend on what shocks or surprises there might have been to the economy at that time, and these are by definition not predictable.  If they were, they would not be surprises.  While active monetary and fiscal policy would then seek to bring unemployment down to just frictional levels, how long this will take depends on many factors, including political ones.  And the problem is one that can only be addressed in the near term, as it depends on when the shock came.  Thus the Fed’s Federal Open Market Committee (which includes the members of the Board of Governors) meets roughly every six weeks throughout the year to monitor the situation, and to decide based on what they know at the time whether to tweak monetary policy through some instrument (normally short-term interest rates, which they may adjust up or, when they can, down, to affect growth).

There is thus no way to know now, in 2021, what the rate of unemployment will be in 2030.  For this reason, to set a benchmark to which comparisons under different scenarios can be made, the BLS and others following this approach assume the economy will be operating at full employment in that projection year.  That is, the benchmark sets unemployment at some specific, low, rate to reflect just frictional unemployment.  While there has been debate on what that specific rate might be (different analysts generally peg it at between 4 and 5% currently), a specific rate would be chosen for the comparisons.  Employment will then be equal to the labor force in that forecast year minus the number unemployed at this assumed rate of unemployment.

[Minor technical note:  The employment figure arrived at in this way will be employment as measured at the individual level, and will include the self-employed as well as on-farm employment.  It will also count as one person employed even if the individual holds multiple jobs.  The employment figures normally cited (and used by Moody’s) are of non-farm payroll employment, which comes from surveys of establishments, excludes the self-employed and on-farm employment, and counts each job even if one person might hold more than one job (as the establishment will only know who they employ, and will not know if some of their employees might hold second jobs).  But the differences due to these factors are small, and adjustments can be made.]

Thus, for any given set of forecast population figures (by age group, etc.), employment will follow from the labor force participation rate and the assumed rate of frictional unemployment (i.e. unemployment when the economy is assumed to be operating at full employment).  Forecast employment in any future year under different scenarios will therefore only differ if either the labor force participation rate, or the unemployment rate (or both), differ for some reason.
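The accounting described above can be written out in a few lines.  What follows is only a minimal sketch of the identity, not the BLS or Moody’s method:  the function name and the input values (a population of 270 million, a 62% participation rate, a 4.5% unemployment rate) are hypothetical, chosen just to show that employment moves only when one of the two rates moves.

```python
def projected_employment(population, participation_rate, unemployment_rate):
    """Employment = labor force minus the unemployed.

    population is in millions; the two rates are fractions (0.62, not 62).
    """
    labor_force = population * participation_rate
    return labor_force * (1.0 - unemployment_rate)


# Hypothetical inputs, for illustration only (not BLS or Moody's figures).
POPULATION = 270.0  # millions of adults, identical in every scenario

baseline = projected_employment(POPULATION, 0.62, 0.045)
higher_participation = projected_employment(POPULATION, 0.63, 0.045)
lower_unemployment = projected_employment(POPULATION, 0.62, 0.040)

print(f"baseline:             {baseline:.1f} million")
print(f"higher participation: {higher_participation:.1f} million")
print(f"lower unemployment:   {lower_unemployment:.1f} million")
```

With the population held fixed, the only way for the printed figures to differ is through a different participation rate or a different unemployment rate.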

C.  The Moody’s Employment Scenarios for 2030

Moody’s Analytics examined three scenarios for 2030 (and the path to it):  A base case where neither the infrastructure plan of Biden nor the covid-relief plan of Biden existed, a scenario where only the covid-relief plan was in place, and a scenario where both are in place.  In the first (base case) scenario it forecasts that employment in the US would rise to 157.9 million in 2030 from an average of 142.2 million in 2020, or an increase of 15.7 million.  In the scenario with only the covid-relief plan, Moody’s forecasts that employment in 2030 would then total 158.5 million, or 0.6 million more than in the base case.  And in the scenario where the infrastructure plan is also passed and implemented, Moody’s forecasts that employment in 2030 would total 161.2 million, or 2.7 million more than in the scenario with only the covid-relief plan passed and 19.0 million more than average total employment in 2020.
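Since the confusion in the press was over which of these figures is which, it may help to lay the arithmetic out explicitly.  The short sketch below uses only the employment levels quoted above; the dictionary keys and function names are my own labels, not Moody’s.

```python
# Employment levels (millions of non-farm jobs) as quoted from the Moody's scenarios.
EMPLOYMENT_2020 = 142.2
EMPLOYMENT_2030 = {
    "base": 157.9,
    "covid_relief_only": 158.5,
    "covid_relief_plus_infrastructure": 161.2,
}


def increase_since_2020(scenario):
    """Total rise in employment between 2020 and 2030 under one scenario."""
    return round(EMPLOYMENT_2030[scenario] - EMPLOYMENT_2020, 1)


def difference_in_2030(scenario, reference):
    """Extra employment in 2030 in one scenario relative to another."""
    return round(EMPLOYMENT_2030[scenario] - EMPLOYMENT_2030[reference], 1)


for name in EMPLOYMENT_2030:
    print(f"{name}: +{increase_since_2020(name)} million since 2020")

print("covid-relief plan alone:",
      difference_in_2030("covid_relief_only", "base"), "million")
print("infrastructure plan on top:",
      difference_in_2030("covid_relief_plus_infrastructure", "covid_relief_only"),
      "million")
```

The first function gives the kind of number Biden cited (the total rise over the decade), while the second gives the kind of number that can be attributed to a particular plan.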

But why would employment levels in 2030 differ at all between these scenarios?  As discussed above, they can only differ if labor force participation rates differ or the assumed unemployment rates in that forecast year differ.  (The basic population numbers for that year should certainly not differ.)  In the Moody’s numbers they both do, but it is not clear why.

It is in particular difficult to understand why Moody’s allowed the assumed unemployment rates in 2030 to differ across their scenarios.  The scenario with just the covid-relief plan, which will be over by 2023 at the latest, should in particular not have an impact on the unemployment rate in 2030.  But in the Moody’s figures it does, albeit by only a minor amount (with unemployment at 4.5% in 2030 in the base scenario, and 4.4% in the scenario with the covid-relief plan).

The difference is larger in the scenario with both the covid-relief plan and the infrastructure plan.  Moody’s forecasts that unemployment in 2030 would then be just 3.8%, or well less than the 4.5% rate in the base scenario.  Why would that be?  While there would still be a small amount of spending under the infrastructure plan in 2030 (Moody’s uses a figure of $81 billion in its scenario), the impact of such spending in that year would be small (just 0.2% of forecast GDP in that year) and would in any case have been diminishing over time as the infrastructure plan was being phased down.  That is, the reductions in spending under the infrastructure plan in the outer years, relative to what they would have been a few years before, would (if not offset by other actions) be deflationary at that point, not expansionary.  But regardless of whether Biden’s infrastructure plan had been passed in 2021 or not, one would assume that fiscal and monetary policy would have sought in that future year (2030) to bring the economy to full employment, at whatever the assumed rate of (frictional) unemployment that it then is. There is no rationale for assuming the rate of unemployment in 2030 will differ across the scenarios.

The other difference in the Moody’s forecasts for 2030 under the different scenarios is in the labor force participation rates.  One can work out from the numbers Moody’s provided in its document (coupled with the BLS numbers for the adult population) that the labor force participation rate would be 58.5% in the base scenario, 58.7% in the scenario where only the Biden covid-relief package was passed, and 59.3% if the Biden infrastructure plan is also passed.  (More precisely, these are the Moody’s figures for non-farm payroll employment as a share of the population, not the overall labor force, with the small differences noted above between those two concepts).  Compared to the scenario of the covid-relief plan only, two-thirds (66%) of the extra 2.7 million in employment in 2030 is due to the higher labor force participation rates Moody’s forecasts for that year, and one-third (34%) is due to its forecast of a lower unemployment rate in that year.
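One way to carry out such a split is a simple log decomposition, sketched below.  This is an assumption on my part about how to do the attribution, not necessarily the calculation behind the 66% and 34% figures above.  It treats employment as the labor force times one minus the unemployment rate, attributes to unemployment the part of the (log) change explained by the lower rate, and treats the remainder as due to higher participation (the population being the same in both scenarios).  Because it uses the rounded figures quoted in this post, it should be expected to land near, but not exactly on, a two-thirds / one-third split.

```python
import math


def split_employment_gain(emp_before, emp_after, u_before, u_after):
    """Split a change in employment between participation and unemployment.

    Uses E = L * (1 - u), so ln(E) = ln(L) + ln(1 - u).  With the population
    unchanged, the change in ln(L) reflects the participation rate alone.
    Returns (participation_share, unemployment_share), which sum to 1.
    """
    total = math.log(emp_after / emp_before)
    from_unemployment = math.log((1.0 - u_after) / (1.0 - u_before))
    from_participation = total - from_unemployment  # residual
    return from_participation / total, from_unemployment / total


# Rounded figures quoted in this post: covid-relief-only scenario versus the
# scenario that also includes the infrastructure plan, both for 2030.
participation_share, unemployment_share = split_employment_gain(
    emp_before=158.5, emp_after=161.2, u_before=0.044, u_after=0.038
)
print(f"due to higher participation: {participation_share:.0%}")
print(f"due to lower unemployment:   {unemployment_share:.0%}")
```

Any small gap between this result and the shares cited above would be consistent with rounding in the published inputs.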

Why should the labor force participation rate be higher in 2030 if Biden’s infrastructure plan is passed?  One could postulate a connection, but it would be tenuous and it is not clear if this was in fact intended by Moody’s or was just an outcome following from other relationships in its model.  I do not know enough about the structure of its model to say.  But one can speculate that the model may have linked the labor force participation rate in a forecast year to real wages in that year, with a higher real wage leading to a higher labor force participation rate.  Furthermore, the model might link greater infrastructure investment (or greater investment generally) to higher productivity, and higher productivity to higher wages.  In that case, the higher investment might lead, by such a route, to a higher labor force participation rate.  But this would require estimation of the responses in a series of steps, each of which might be tenuous.  It is difficult to forecast how much economy-wide productivity might rise as a result of such investment; difficult to forecast how much real wages would rise if productivity rises (real wages have been flat since around 1980, even though overall productivity rose by almost 80%); and difficult to forecast how much a rise in real wages might then raise the labor force participation rate.

But this is conceivable.  Whether it was an intended relationship in the Moody’s model is not so clear.  Such models are large and complicated, with a focus on particular issues.  Certain results might then follow, but those constructing the model might not have paid much attention to such outcomes when constructing the model, as the focus was on something else.

In any case, one has to be careful in interpreting the results as implying there would be 2.7 million additional jobs “created” in 2030 as a consequence of the Biden infrastructure plan.  There would, in the model, be 2.7 million more people employed, but this would mostly be due to a higher proportion of the population seeking employment in that year (a higher labor force participation rate).  And assuming an economy at full employment in that year, the additional number seeking employment would translate into that additional number being employed.  But it would be a stretch to interpret this as the infrastructure plan “creating” those additional jobs.  Rather, a higher share of the population are looking for work (a higher labor force participation rate), and are assumed to be able to find it.

D.  The Jobs Directly Created by the Infrastructure Plan

The Biden infrastructure plan would certainly create a huge number of jobs while the infrastructure is being built.  There would be jobs such as depicted in the photo at the top of this post, and with $2.2 trillion being spent there would be a large number of them (even with a share of the $2.2 trillion being spent in high priority areas outside of what is traditionally considered “hard” infrastructure, such as for labor training and health infrastructure).

These would, however, be jobs for a fixed period.  Once the particular projects are finished, those jobs would end.  Thus one should think of these as being so many person-years of employment (employment of one person for one year).  These are not permanent jobs being “created”, but rather workers being employed for a period of time to build a project or to complete a specific maintenance or repair task (e.g. repaving a road).

While not permanent jobs, it would still be important to have good estimates of how many there would be.  Moody’s did not do that, nor was it their intention, but one needs to be clear about that.  It will be important, however, that there be a serious effort at some point to work out such estimates, and I would guess that someone in government is working on this now.  They are needed precisely because there will be a large number who will be employed on these infrastructure projects, and workers with the necessary skills for such work are limited, in part because the US has so woefully underinvested in its infrastructure in recent decades (as will be discussed in the next section below).  It will thus be important to pay attention to the phasing of the individual projects, both over time and geographically, to ensure there will be sufficient capacity (both in terms of the workers needed and the firms that manage such projects) to build the projects at a given place and at a particular time.  It does not help much that there might be workers with the requisite skill in New York, say, when the need is for a project in California.

This will therefore need to be worked out, and I suspect it will be.  This will also guide what workforce development and training needs there will need to be, and the BLS routinely provides such estimates (at least at a broad, economy-wide, level).  But while it is correct to term jobs (or more precisely person-years of jobs) as being “created” under such an infrastructure plan, this does not necessarily mean that the total number of jobs in the economy will be higher.  If the economy is at full employment (and the labor force participation rate otherwise unchanged), the total number employed in the economy will be unchanged.  It is just that some share of those employed will be working on these infrastructure projects.  And that means fewer will be working in other jobs.

That is not a bad thing.  While the overall number employed will be the same, there will be jobs in the infrastructure projects which will have been attractive enough (either due to higher wages that they pay or for some other reason) to draw workers to those jobs.  Those who shift to those new jobs will then be better off, which is good.  Furthermore, the workers shifting to those new jobs would then have left positions that others may find attractive enough to move into (due to a higher wage, or whatever).  Thus there would be shifts across the economy.  Some less attractive jobs would cease to be filled, with employers forced to learn how to make do with less, but that is how competition works.

It is thus not correct to assert the total number employed in the economy will be higher as a consequence of the infrastructure investment plan (aside from during an initial few years as the economy moves to full employment – and Moody’s forecasts that this will be complete by 2022 with the covid-recovery and infrastructure plans enacted and even by 2024 without them).  The total number employed in such forecasts will be largely the same with or without the plans.  But that does not mean the plans are without value to workers.  There will be new jobs to be filled, which will need to be attractive enough to draw workers to them.  And that helps workers.

E.  Public Infrastructure Investment in the US

Public infrastructure in the US is an embarrassment.  And it has a direct impact on productivity.  As was noted before, a truck driver sitting in a traffic jam is not terribly productive.  Similarly, exporters of soybeans who have to wait weeks to ship their product due to inadequate capacity at the ports cannot be terribly competitive in global markets (and will have to accept a price cut in order to sell their product).  And so on.

The major reason public infrastructure in the US is so poor is that the US has simply underinvested in it.  Using a broad definition of all government investment excluding that for the military, as a share of GDP, one has (calculated from BEA NIPA statistics):

Government investment peaked in the mid-1960s (as a share of GDP) and has declined ever since.  In gross terms it has been lower in recent years than in any time since the early 1950s.  Net of depreciation, it has been a good deal lower over the last half-decade (to 2019 – the 2020 figure is not yet available) than it has ever been in the last 70 years at least.  (And note that the blip up in the GDP share in 2020 was not because public investment rose.  The rate of growth of gross government investment in 2020 was in fact less than in 2019 and about the same as in 2018.  Rather it was because GDP collapsed in 2020, in the last year of the Trump administration, which pushed the share higher.)

What is of most interest for the state of public infrastructure is such investment net of depreciation.  That is shown as the curve in red in the chart, and it has fallen from a peak of 3.0% of GDP in 1966 to just 0.7% of GDP in recent years (up to 2019), a fall of 77%.  And at such a pace of adding to the net stock of public capital (infrastructure), the stock of such capital as a share of GDP will be falling.  By simple arithmetic, the ratio will be falling if the stock of that capital as a share of GDP is greater than the net investment share of GDP (0.7% here) divided by the rate of growth of nominal GDP.  Taking a nominal growth rate for GDP of, say, 4% (i.e. a real growth rate of 2% and a growth in prices of 2%), then the stock of public capital as a share of GDP will fall if the current stock of that capital is 17.5% of GDP or more (where 17.5% is equal to 0.7% / 4%).  The stock of public capital will certainly be well more than that in any modern economy, including the US.  And that underinvestment is why our highways are becoming increasingly subject to traffic jams, for example.  Our infrastructure is simply not keeping up.
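The simple arithmetic can be checked with a short sketch.  It assumes a bare-bones accounting in which next year’s capital stock is this year’s stock plus net investment, and nominal GDP grows at a constant rate; the function names and the starting ratio of 60% of GDP are hypothetical, used only for illustration.

```python
def next_capital_ratio(k, n, g):
    """Capital stock as a share of GDP one year on.

    k: current stock / GDP, n: net investment / GDP, g: nominal GDP growth.
    (K + N) / (Y * (1 + g)) = (k + n) / (1 + g)
    """
    return (k + n) / (1.0 + g)


def break_even_ratio(n, g):
    """Stock-to-GDP ratio at which net investment n just holds the ratio steady."""
    return n / g


NET_INVESTMENT_SHARE = 0.007  # 0.7% of GDP, as cited above
NOMINAL_GROWTH = 0.04         # 2% real growth plus 2% inflation, as assumed above

threshold = break_even_ratio(NET_INVESTMENT_SHARE, NOMINAL_GROWTH)
print(f"break-even stock: {threshold:.1%} of GDP")

# Hypothetical starting stock of 60% of GDP (illustrative, not a BEA figure).
ratio = 0.60
for year in range(1, 6):
    ratio = next_capital_ratio(ratio, NET_INVESTMENT_SHARE, NOMINAL_GROWTH)
    print(f"year {year}: {ratio:.1%} of GDP")
```

Any starting ratio above the break-even level declines year after year toward it, which is the sense in which infrastructure is not keeping up with the economy.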

Major public investment will be needed to reverse this, and the Biden infrastructure plan will be a start.  To put things in perspective, I have taken what would be spent annually under the Biden Plan (as estimated by Moody’s), as a share of GDP, and added this to a base amount where I simply assume other government investment in gross terms will remain at the average share it was between 2013 and 2019 (when it was quite steady at about 2.65% of GDP).  The figures for real GDP used for these calculations were those forecast by Moody’s under the scenario that the Biden infrastructure plan goes ahead, with these converted to nominal GDP (for the shares) using the forecast GDP deflators of the Congressional Budget Office.  Spending under the Biden Plan alone would start at 0.5% of GDP in 2023, rise to a peak of 1.3% of GDP in 2025, and then fall to 0.2% of GDP in 2030 and 0.1% in 2031.  Adding these figures to a base level of 2.65%, one would have:

A $2.2 trillion infrastructure investment plan is certainly large.  But the chart puts this in perspective.  Even with such an investment program, public investment would still not rise to as high as it was in the mid-1960s, nor would it last nearly as long.  Public investment had been relatively high (compared to later periods) from the mid-1950s to around 1980 – almost a quarter-century.  The $2.2 trillion Biden plan would raise public investment, but only for about eight years.  A question that will need to be addressed later is what happens after that.  Reverting to the recent, low levels of infrastructure investment would eventually lead back to the problems we have now.

F.  Summary and Conclusions

Politicians will always tout the jobs that will be “created” if their programs are approved.  If they didn’t, they likely would not hold office for long.  President Biden is no exception.  And the administration has cited independent estimates made by Mark Zandi’s team at Moody’s Analytics to say that Biden’s “American Jobs Plan” would indeed create a large number of jobs.  They cite Moody’s estimates that the number of jobs in 2030 would be 19 million higher than in 2020 if the infrastructure plan (as well as the covid-relief plan) are approved, and 2.7 million higher in 2030 if that infrastructure plan is approved as compared to a scenario where it is not.

These are, indeed, the Moody’s numbers.  But one should be careful in the interpretation of what they in fact mean, and Moody’s can be criticized for not being fully clear on this.  These are not jobs, generally in construction, that would follow directly from the infrastructure investment program (which should be counted as person-years of employment in any case, as such jobs are not permanent).  Rather, what Moody’s has done has been to use its model of the US economy to examine what overall employment levels would be in 2030 under the various scenarios.  It found that the number employed would be 2.7 million higher in 2030 (1.7% of forecast employment in that year) in the scenario with the infrastructure plan as compared to a scenario without it.  One can calculate that roughly two-thirds of this would be due to a higher labor force participation rate, and one-third due to a lower unemployment rate in that year.

It is not clear, however, why forecasts of either of those two variables – participation rates and the unemployment rate – should differ at all across the scenarios.  I would not be surprised if these were simply unintended consequences in a complex model.  In any case the differences in employment in that forecast year of 2030 are small, as one would expect.  Furthermore, by 2030 the infrastructure plan would be winding down, with only small residual amounts remaining to be spent.

During the course of the 2020s, however, a very significant number of people will be employed on these infrastructure investments.  They will be employed for limited periods until the projects are completed (and hence should be counted in person-years of employment), but this would still be significant.  It will be important to estimate not just how many will be employed and for what periods, but also what skills will be required and where and when they will be required.  This is probably now being done somewhere in government.  But Moody’s did not attempt to do that.

And while such jobs, mostly in construction, can be correctly termed as “created” under the infrastructure investment plan, this does not necessarily mean the overall number of people employed in the economy will be higher.  Unless labor force participation rates would then be higher for some reason (and it is difficult to see why that would be the case) or the unemployment rate is lower (which it cannot be if the economy is already at full employment), the overall number employed in the economy will be unchanged.  What would happen, rather, would be shifts in the job structure, not in the number of jobs overall.  Some workers would shift into the construction jobs needed to build the infrastructure, and others would shift into the jobs these workers had occupied before.  That is all good – the new jobs will need to be more attractive in terms of pay and/or for other reasons for workers to shift to them – but the total number employed (the total number of “jobs”) would largely be the same.

The public infrastructure is certainly needed.  The US has been underinvesting in its public infrastructure for decades, and when account is taken for depreciation it is clear that the net stock of public capital has not kept up with the overall growth of the economy.  That is why roads, for example, are now so often jammed.  The Biden Plan would bring public investment up to levels not seen for decades, although still not matching (even at $2.2 trillion) the public investment levels of the 1960s as a share of GDP.  It is also a time-limited program, which would phase down in the second half of the 2020s.  At some point, this will need to be addressed.  Bringing public investment levels back down to the far from adequate levels of recent decades will lead to the same problems again.  But that will likely be an issue that will not be seriously considered until the next presidential term.

Lower Life Expectancy in a State is Correlated with a Higher Share Voting for Trump

A lower life expectancy in a state is associated with a higher share in the state voting for Trump.  The chart above shows the simple correlation, using state-wide averages, between the life expectancy in a state and Trump’s share of the vote in that state in the 2020 presidential election.  States where life expectancy is relatively low saw, on average, a higher share of their population voting for Trump.  Life expectancy was especially low in a set of mostly Southern states that also had a high share voting for Trump (the bottom right corner of the chart).

The figures on life expectancy come from a recently issued set of estimates produced by the CDC.  The CDC estimates are geographically highly detailed, providing estimates down to the census tract level, but I have only used here the overall state-wide averages.  Due to their fine level of geographic detail, the CDC estimates are averaged over several years (2010 to 2015) to smooth out year-to-year statistical noise.  But life expectancy figures generally change only slowly over time (2020 was an exception, due to Covid-19), so figures for 2010-15 will provide a good estimate of what should be considered normal for life expectancy currently (i.e. with the exception of the Covid-19 impact).  The presidential election results are from Wikipedia, where the Trump share is his share in the overall vote in each state (including third party and other minor candidates).

The correlation is a strong one.  The regression equation (shown in the chart) for the relationship has an R-squared of 0.45.  This means that if one simply knew the life expectancy in a state, one could predict 45% of the variation in the share across the states that would vote for Trump.  This is high for such a simple cross-section relationship.  The negative slope of the equation (-0.11) means that every percentage point increase in the share of the vote for Trump is associated with a 0.11 year lower life expectancy.  Or put another way, a state with a life expectancy that is one year less than in another is associated with an expected 9 percentage point higher share of those voting for Trump (where 9 is roughly equal to 1 / 0.11).
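A short arithmetic check may help in reading the slope both ways.  The sketch below uses only the two figures quoted above (the slope of -0.11 and the R-squared of 0.45).  The second calculation is not in the original analysis:  it relies on the standard identity that, in a regression with a single explanatory variable, the slope of y on x multiplied by the slope of x on y equals the R-squared.  Simply inverting the fitted line gives about 9 percentage points per year, while a regression run in the other direction (Trump share on life expectancy) would be expected to give a smaller figure.

```python
# Figures quoted in the text: slope of life expectancy (years) on Trump
# vote share (percentage points), and the R-squared of that regression.
slope_years_per_point = -0.11
r_squared = 0.45

# Reading 1: invert the fitted line (what "1 / 0.11" does).
points_per_year_inverted = 1 / slope_years_per_point

# Reading 2: the slope implied for a regression of vote share on life
# expectancy.  With one explanatory variable the two slopes multiply to
# the R-squared, so this slope is R-squared divided by the first slope.
points_per_year_reverse = r_squared / slope_years_per_point

print(round(points_per_year_inverted, 1))  # -9.1
print(round(points_per_year_reverse, 1))   # -4.1
```

The two readings coincide only when the fit is perfect (an R-squared of 1), so the 9 point figure is best read as describing the fitted line itself rather than as a prediction of vote share from life expectancy.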

Why this correlation?  Note that it is not saying that a high or low life expectancy in itself would necessarily be driving a tendency to vote for Trump or not.  Rather, a number of factors that enter into the determination of life expectancy are quite possibly also factors in common with the views of Trump supporters.  Life expectancy depends on personal factors and decisions (smoking, diet and exercise, obesity, vaccinations, whether to wear a mask to protect oneself and others to reduce the spread of a deadly disease), as well as on decisions made by state and local governments chosen by that electorate (such as on access to health care, e.g. whether Medicaid should be available for the poor).  Life expectancy also depends on income levels and for any given average income level on income inequality.

And it will depend on the social norms of the region, such as car driving habits (speeding) and access to guns.  Of the factors reducing life expectancy in the US between 2014 and 2017 (mostly offsetting factors that would have, by themselves, led to a higher life expectancy) unintentional injuries accounted for just over half (50.6%) while suicides and homicides accounted for a further 15% (suicide 7.8% and homicide 7.5%).  That is, these non-medical factors accounted for two-thirds of the factors that had a negative impact on life expectancy in this period.

Few would question that better health is better than poorer health.  The high correlation seen here between life expectancy and the degree of Trump support suggests that there are significant commonalities in the various states between behaviors (both personal and social) that lead to poorer health outcomes and support for Trump.

Was Sturgis a Covid-19 Superspreader Event?: Evidence Suggests That It May Well Have Been

A.  Introduction

The Sturgis Motorcycle Rally is an annual 10-day event for motorcycle enthusiasts (in particular of Harley-Davidsons), held in Sturgis, a normally small town in far western South Dakota.  It was held again this year, from August 7 to August 16, despite the Covid-19 pandemic, and drew an estimated 460,000 participants.  Motorcyclists gather from around the country for lots of riding, lots of music, and lots of beer and partying.  And then they go home.  Cell phone data indicate that fully 61% of all the counties in the US were visited by someone who attended Sturgis this year.

Due to the pandemic, the town debated whether to host the event this year.  But after some discussion, it was decided to go ahead.  And it is not clear that town officials could have stopped it even if they wanted.  Riders would likely have shown up anyway.

Despite the on-going covid pandemic, masks were rarely seen.  Indeed, many of those attending were proud in their defiance of the standard health guidelines that masks should be worn and social distancing respected, and especially so in such crowded events.  T-shirts were sold, for example, declaring “Screw Covid-19, I Went to Sturgis”.

Did Sturgis lead to a surge in Covid-19 cases?  Unfortunately, we do not have direct data on this because the identification of the possible sources of someone’s Covid-19 infection is incredibly poor in the US.  There is little investigation of where someone might have picked up the virus, and far from adequate contact tracing.  And indeed, even those who attended the rally and later came down with Covid-19 found that their state health officials were often not terribly interested in whether they had been at Sturgis.  The systems were simply not set up to incorporate this.  And those attending who were later sick with the disease were also not always open on where they had been, given the stigma.

One is therefore left only with anecdotal cases and indirect evidence.  Recent articles in the Washington Post and the New York Times were good reports, but could only cover a number of specific, anecdotal, cases, as well as describe the party environment at Sturgis.  One can, however, examine indirect evidence.  It is reasonable to assume that those motorcycle enthusiasts who had a shorter distance to get to Sturgis from their homes would be more likely to go.  Hence near-by states would account for a higher share (adjusted for population) of those attending Sturgis and then returning home than would be the case for states farther away.  If so, then if Covid-19 was indeed spread among those attending Sturgis, one would see a greater degree of seeding of the virus that causes Covid-19 in the near-by states than would be the case among states that are farther away.  And those near-by states would then have more of a subsequent rise in Covid-19 cases as the infectious disease spread from person to person than one would see in states further away.

This post will examine this, starting with the chart at the top of this post.  As is clear in that chart, by early November states geographically closer to Sturgis had far higher cases of Covid-19 (as a share of their population) than those further away.  And the incidence fell steadily with geographic distance, in a relationship that is astonishingly tight.  Simply knowing the distance of the state from Sturgis would allow for a very good prediction (relative to the national average) of the number of daily new confirmed cases of Covid-19 (per 100,000 of population) in the 7-day period ending November 6.

A first question to ask is whether this pattern developed only after Sturgis.  If it had been there all along, including before the rally was held, then one cannot attribute it to the rally.  But we will see below that there was no such relationship in early August, before the rally, and that it then developed progressively in the months following.  This is what one would expect if the virus had been seeded by those returning from Sturgis, who then may have given this infectious disease to their friends and loved ones, to their co-workers, to the clerks at the supermarkets, and so on, and then each of these similarly spreading it on to others in an exponentially increasing number of cases.

To keep things simple in the charts, we will present them in a standard linear form.  But one may have noticed in the chart above that the line in black (the linear regression line) that provides the best fit (in a statistical sense) for a straight line to the scatter of points, does not work that well at the two extremes.  The points at the extremes (for very short distances and very long ones) are generally above the curve, while the points are often below in the middle range.  This is the pattern one would expect when what matters to the decision to ride to the rally is not some increment for a given distance (of an extra 100 miles, say), but rather for a given percentage increase (an extra 10%, say).  In such cases, a logarithmic curve rather than a straight (linear) line will fit the data better, and we will see below that indeed it does here.  And this will be useful in some statistical regression analysis that will examine possible explanations for the pattern.

It should be kept in mind, however, that what is being examined here are correlations, and with correlations one cannot say with certainty that the cause was necessarily the Sturgis rally.  And we obviously cannot run this experiment over repeatedly in a lab, under varying conditions, to see whether the result would always follow.

Might there be some other explanation?  Certainly there could be.  Probably the most obvious alternative is that the surge in Covid-19 cases in the upper mid-west of the US between September and early November might have been due to the onset of cold weather, where the states close to Sturgis are among the first to turn cold as winter approaches in the US.  We will examine this below.  There is, indeed, a correlation, but also a number of counter-examples (with states that also turned colder, such as Maine and Vermont, that did not see such a surge in cases).  The statistical fit is also not nearly as good.

One can also examine what happened across the border in the neighboring provinces of Canada.  The weather there also turned colder in September and October, and indeed by more than in the upper mid-west of the US.  Yet the incidence of Covid-19 cases in those provinces was far less.

What would explain this?  The answer is that it is not cold weather per se that leads to the virus being spread, but rather cold weather in situations where socially responsible behavior is not being followed – most importantly mask-wearing, but also social distancing, avoidance of indoor settings conducive to the spread of the virus, and so on.  As examined in the previous post on this blog, mask-wearing is extremely powerful in limiting the spread of the virus that causes Covid-19.  But if many do not wear masks, for whatever reason, the virus will spread.  And this will be especially so as the weather turns colder and people spend more time indoors with others.

This could lead to the results seen if states that are geographically closer to Sturgis also have populations that are less likely to wear masks when they go out in public.  And we will see that this was likely indeed a factor.  For whatever reason (likely political, as the near-by states are states with high shares of Trump supporters), states geographically close to Sturgis have a generally lower share of their populations regularly wearing masks in this pandemic.  But the combination of low mask-wearing and falling temperatures (what statisticians call an interaction effect) was supplemental to, and not a replacement of, the impact of distance from Sturgis.  The distance factor remained highly significant and strong, including when controlling for October temperatures and mask-wearing, consistent with the view that Sturgis acted as a seeding event.

This post will take up each of these topics in turn.

B.  Distance to Sturgis vs. Daily New Cases of Covid-19 in the Week Ending November 6

The chart at the top of this post plots the average daily number of confirmed new cases of Covid-19 over the 7-day period ending November 6 in a state (per 100,000 of population), against the distance to Sturgis.  The data for the number of new cases each day was obtained from USAFacts, which in turn obtained the data from state health authorities.  The data on distance to Sturgis was obtained from the directions feature on Google Maps, with Sturgis being the destination and the trip origin being each of the 48 states in the mainland US (Hawaii and Alaska were excluded), plus Washington, DC.  Each state was simply entered (rather than a particular address within a state), and Google Maps then defaulted to a central location in each state.  The distance chosen was then for the route recommended by Google, in miles and on the roads recommended.  That is, these are trip miles and not miles “as the crow flies”.

When this is done, with a regular linear scale used for the mileage on the recommended routes, one obtains the chart at the top of this post.  For the week ending November 6, those states closest to Sturgis saw the highest rates of Covid-19 new cases (130 per 100,000 of population in South Dakota itself, where Sturgis is in the far western part of the state, and 200 per 100,000 in North Dakota, where one should note that Sturgis is closer to some of the main population centers of North Dakota than it is to some of the main population centers of South Dakota).  And as one goes further away geographically, the average daily number of new cases falls substantially, to only around one-tenth as much in several of the states on the Atlantic.

The model is a simple one:  The further away a state is from Sturgis, the lower its rate (per 100,000 of population) of Covid-19 new cases in the first week of November.  But it fits extremely well even though it looks at only one possible factor (distance to Sturgis).  The straight black line in the chart is the linear regression line that best fits, statistically, the scatter of points.  A statistical measure of the fit is called the R-squared, which varies between 0% and 100% and measures what share of the variation observed in the variable shown on the vertical axis of the chart (the daily new cases of Covid-19) can be predicted simply by knowing the regression line and the variable shown on the horizontal axis (the miles to Sturgis).
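For readers who want to see how an R-squared is computed, here is a minimal sketch in Python.  The mileage and case figures are made up purely for illustration (they are not the state data used in the chart), and the variable names are my own.

```python
import numpy as np

# Hypothetical (miles to Sturgis, daily new cases per 100,000) pairs,
# made up for illustration only.
miles = np.array([200.0, 500.0, 800.0, 1100.0, 1400.0, 1700.0])
cases = np.array([120.0, 90.0, 70.0, 40.0, 35.0, 20.0])

# Ordinary least squares fit of a straight line: cases = a + b * miles.
b, a = np.polyfit(miles, cases, 1)
predicted = a + b * miles

# R-squared: one minus the share of the variation in cases that the
# fitted line leaves unexplained.
ss_res = np.sum((cases - predicted) ** 2)
ss_tot = np.sum((cases - cases.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {b:.4f} cases per mile, R-squared = {r_squared:.1%}")
```

With a single explanatory variable, this R-squared is the same as the square of the simple correlation coefficient between the two series.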

The R-squared for the regression line calculated for this chart was surprisingly high, at 60%.  This is astonishing.  It says that if all we knew was this regression line, then we could have predicted 60% of the variation in Covid-19 cases across states in the week ending November 6 simply by knowing how far the states are from Sturgis.  States differ in numerous ways that will affect the incidence of Covid-19 cases in their territory.  Yet here, if we know just the distance to Sturgis, we can predict 60% of how Covid-19 incidence will vary across the states.  Regressions such as these are called cross-section regressions (the data here are across states), and such R-squared values are rarely higher than 20%, or at most perhaps 30%.

But as was discussed above in the introduction, trip decisions involving distances often work better (fit the data better) when the scale used is logarithmic.  On a logarithmic scale, what enters into the decision to make the trip or not is not some fixed increment of distance (e.g. an extra 100 miles) but rather some proportional change (e.g. an extra 10%).  A statistical regression can then be estimated using the logarithms of the distances, and when this estimated line is re-calculated back on to the standard linear scale, one will have the curve shown in blue in the chart:

The logarithmic (or log) regression line (in blue) fits the data even better than the simple linear regression line (in black), including at the two extremes (very short and very long distances).  And the R-squared rises to 71% from the already quite high 60% of the linear regression line.  The only significant outlier is North Dakota.  If one excludes North Dakota, the R-squared rises to 77%.  These are remarkably high for a cross-section analysis.
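The comparison between the linear and the logarithmic fit can be sketched as follows.  The data are hypothetical, chosen so that cases fall off with the proportional change in distance, and I assume here that only the distance is put in logarithms while the case rate is left as is.  The actual specification may differ.

```python
import numpy as np

# Hypothetical data for illustration: case rates that fall with the
# proportional (not absolute) change in distance.
miles = np.array([150.0, 300.0, 600.0, 900.0, 1200.0, 1500.0, 1800.0])
cases = np.array([150.0, 105.0, 70.0, 50.0, 38.0, 30.0, 22.0])

def r_squared(y, y_hat):
    """Share of the variation in y accounted for by the fitted values."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Straight-line fit on the ordinary mileage scale.
b_lin, a_lin = np.polyfit(miles, cases, 1)
r2_lin = r_squared(cases, a_lin + b_lin * miles)

# Same fit, but against the logarithm of the mileage.
b_log, a_log = np.polyfit(np.log(miles), cases, 1)
r2_log = r_squared(cases, a_log + b_log * np.log(miles))

# Re-express the log fit on the ordinary mileage scale: this traces out
# a curve rather than a straight line.
grid = np.linspace(miles.min(), miles.max(), 5)
curve = a_log + b_log * np.log(grid)

print(f"linear R-squared = {r2_lin:.3f}, log R-squared = {r2_log:.3f}")
print(np.round(curve, 1))
```

On data shaped like this, the curve from the log fit sits above the straight line at both the short and the long distances and below it in the middle range.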

This simple model therefore fits the data well, indeed extremely well.  But there are still several issues to consider, starting with whether there was a similar pattern across the states before the Sturgis rally.

C.  Distance to Sturgis vs. Daily New Cases of Covid-19 in the Week Ending August 6, and the Progression in Subsequent Months

The Sturgis rally began on August 7.  Was there possibly a similar pattern as that found above in Covid-19 cases before the rally?  The answer is a clear no:

In the week ending August 6, the relationship of Covid-19 cases to distance from Sturgis was about as close to random as one can ever find.  If anything, the incidences of Covid-19 cases in the 10 or so states closest to Sturgis were relatively low.  And for all 48 states of the Continental US (plus Washington, DC), the simple linear regression line is close to flat, with an R-squared of just 0.4%.  This is basically nothing, and is in sharp contrast to the R-squared for the week ending November 6 of 60% (and 71% in logarithmic terms).

One should also note the magnitudes on the vertical scale here.  They range from 0 to 40 cases (per 100,000 of population) per day in the 7-day period.  In the chart for cases in the 7-day period ending on November 6 (as at the top of this post), the scale goes from 0 to 200.  That is, the incidence of Covid-19 cases was relatively low across US states in August (relative to what it was later in parts of the US).  That then changed in the subsequent months.  Furthermore, one can see in the charts above for the week ending November 6 that the states further than around 1,400 miles from Sturgis still had Covid new case rates of 40 per day or less.  That is, the case incidence rates remained in that 0 to 40 range between August and early November for the states far from Sturgis.  The states where the rates rose above this were all closer to Sturgis.

There was also a steady progression in the case rates in the months from August to November, focused on the states closer to Sturgis, as can be seen in the following chart:

Each line is the linear regression line found by regressing the number of Covid-19 cases in each state (per 100,000 of population) for the week ending August 6, the week ending September 6, the week ending October 6, and the week ending November 6, against the geographic distance to Sturgis.  The regression lines for the week ending August 6 and the week ending November 6 are the same as discussed already in the respective charts above.  The September and October ones are new.

As noted before, the August 6 line is essentially flat.  That is, the distance to Sturgis made no difference to the number of cases, and they are also all relatively low.  But then the line starts to twist upwards, with the right end (for the states furthest from Sturgis) more or less fixed and staying low, while the left end rotated upwards.  The rotation is relatively modest for the week ending September 6, is more substantial in the month later for the week ending October 6, and then the largest in the month after that for the week ending November 6.  This is precisely the path one would expect to find with an exponential spread of an infectious disease that has been seeded but then not brought under effective control.
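A toy calculation shows why seeding followed by exponential growth would produce this twisting pattern.  Everything in the sketch is an assumption made for illustration:  a flat background rate that is the same in every state, a seeded amount that is larger for states nearer Sturgis, and seeded cases that double each month.  None of these numbers are estimates from the data.

```python
import numpy as np

# All numbers below are assumptions chosen for illustration only.
miles = np.array([100.0, 400.0, 700.0, 1000.0, 1300.0, 1600.0, 1900.0])
background = 10.0       # flat daily case rate per 100,000 in every state
seed = 2000.0 / miles   # extra cases seeded, larger for nearer states
monthly_growth = 2.0    # seeded cases double each month

slopes = []
for month in range(4):  # 0 = week before the rally, 1 to 3 = later months
    # No seeding before the rally; afterwards the seeded cases compound.
    scale = 0.0 if month == 0 else monthly_growth ** (month - 1)
    cases = background + seed * scale
    slope, intercept = np.polyfit(miles, cases, 1)
    slopes.append(slope)
    print(f"month {month}: slope = {slope:.4f} cases per mile")
```

The fitted line is flat in the first period and then becomes steeper each month, with the far end staying close to the background rate.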

D.  Might Falling Temperatures Account for the Pattern?

The charts above are consistent with Sturgis acting as a seeding event that later then led to increases in Covid-19 cases that were especially high in near-by states.  But one needs to recognize that these are just correlations, and by themselves cannot prove that Sturgis was the cause.  There might be some alternative explanation.

One obvious alternative would be that the sharp increase in cases in the upper mid-west of the US in this period was due to falling temperatures, as the northern hemisphere winter approached.  These areas generally grow colder earlier than in other parts of the US.  And if one plots the state-wide average temperatures in October (as reported by NOAA) against the average number of Covid-19 cases per day in the week ending November 6 one indeed finds:

There is a clear downward trend:  States with lower average temperatures in October had more cases (per 100,000 of population) in the week ending November 6.  The relationship is not nearly as tight as that found for the one based on geographic distance from Sturgis (the R-squared is 35% here, versus 60% for the linear relationship based on distance), but 35% is still respectable for a cross-state regression such as this.

However, there are some counterexamples.  The average October temperatures in Maine and Vermont were colder than all but 7 or 10 states (for Maine and Vermont, respectively), yet their Covid-19 case rates were the two lowest in the country.

More telling, one can compare the rates in North and South Dakota (with the two highest Covid-19 rates in the country in the week ending November 6) plus Montana (adjacent and also high) with the rates seen in the Canadian provinces immediately to their north:

The rates are not even close.  The Canadian rates were all far below those in the US states to their south.  The rate in North Dakota was fully 30 times higher than the rate in Saskatchewan, the Canadian province just to its north.  There is clearly something more than just temperature involved.

E.  The Impact of Wearing Masks, and Its Interaction With Temperature

That something is the actions followed by the state or provincial populations to limit the spread of the virus.  The most important is the wearing of masks, which has proven to be highly effective in limiting the spread of this infectious disease, in particular when complemented with other socially responsible behaviors such as social distancing, avoiding large crowds (especially where many do not wear masks), washing hands, and so on.  Canadians have been far more serious in following such practices than many Americans.  The result has been far fewer cases of Covid-19 (as a share of the population) in Canada than in the US, and far fewer deaths.

Mask wearing matters, and could be an alternative explanation for why states closer to Sturgis saw higher rates of Covid-19 cases.  If a relatively low share of the populations in the states closer to Sturgis wear masks, then this may account for the higher incidence of Covid-19 cases in those near-by states.  That is, perhaps the states that are geographically closer to Sturgis just happen also to be states where a relatively low share of their populations wear masks, with this then possibly accounting for the higher incidence of cases in those states.

However, mask-wearing (or the lack of it), by itself, would be unlikely to fully account for the pattern seen here.  Two things should be noted.  First, while states that are geographically closer to Sturgis do indeed see a lower share of their population generally wearing masks when out in public, the relationship to this geography is not as strong as the other relationships we have examined:

The data in the chart for the share who wear masks by state come from the COVIDCast project at Carnegie Mellon University, and were discussed in the previous post on this blog.  The relationship found is indeed a positive one (states geographically further from Sturgis generally have a higher share of their populations wearing masks), but there is a good deal of dispersion in the figures and the R-squared is only 27.5%.  This, by itself, is unlikely to explain the Covid-19 rates across states in early November.

Second, and more importantly:  While the states closer to Sturgis generally have a lower share of mask-wearing, this would not explain why one did not see similarly higher rates of Covid-19 incidence in those states in August.  Mask-wearing was likely similar.  The question is why did Covid-19 incidence rise in those states between August (following the Sturgis rally) and November, and not simply why they were high in those states in November.

However, mask-wearing may well have been a factor.  But rather than accounting for the pattern all by itself, it may have had an indirect effect.  With the onset of colder weather, more time would be spent with others indoors, and wearing a mask when in public is particularly important in such settings.  That is, it is the combination of both a low share of the population wearing masks and the onset of colder weather which is important, not just one or the other.

These are called interaction effects, and investigating them requires more than can be depicted in simple charts.  Multiple regression analysis (regression analysis with several variables – not just one as in the charts above) can allow for this.  Since it is a bit technical, I have relegated a more detailed discussion of these results to a Technical Annex at the conclusion of this post for those who are interested.

Briefly, a regression was estimated that includes miles from Sturgis, average October temperatures, the share who wear masks when out in public, plus an interaction effect between the share wearing masks and October temperatures, all as independent variables affecting the observed Covid-19 case rates of the week ending November 6.  And this regression works quite well.  The R-squared is 75.4%, and each of the variables (including the interaction term) is either highly significant (miles from Sturgis) or marginally so (a significance level of between 6 and 8% for the remaining variables, which is slightly worse than the 5% significance level commonly used, but not by much).
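The structure of such a regression can be sketched with made-up data.  The state values, the coefficients used to generate them, and the variable names below are all hypothetical.  Only the form of the equation (three variables plus the mask-temperature interaction) follows the description above, and I assume here that each variable enters in simple linear form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 49  # 48 contiguous states plus Washington, DC

# Made-up regressors; none of these are the actual state figures.
miles = rng.uniform(100, 2000, n)
temp = rng.uniform(40, 70, n)        # average October temperature, deg F
mask = rng.uniform(0.60, 0.95, n)    # share wearing masks in public

# Made-up "true" relationship, used only to generate example data.
cases = (300 - 0.04 * miles - 2.0 * temp - 200 * mask
         + 2.0 * mask * temp + rng.normal(0, 5, n))

# Design matrix: constant, the three variables, and the interaction term
# (the product of the mask share and the temperature).
X = np.column_stack([np.ones(n), miles, temp, mask, mask * temp])
coef, *_ = np.linalg.lstsq(X, cases, rcond=None)

fitted = X @ coef
r_squared = 1 - np.sum((cases - fitted) ** 2) / np.sum((cases - cases.mean()) ** 2)

names = ["constant", "miles", "temp", "mask", "mask x temp"]
for name, value in zip(names, coef):
    print(f"{name:12s} {value:10.4f}")
print(f"R-squared = {r_squared:.3f}")
```

A positive coefficient on the interaction term means that the effect of a fall in temperature on case rates is smaller in states where mask-wearing is high.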

Note in particular that the interaction term matters, and matters even while each of the other variables (miles to Sturgis, October temperatures, and mask-wearing) is taken into account individually as well.  In the interaction term, it is not simply the October temperatures or the share wearing masks that matter, but the two acting together.  That is, relatively low temperatures in October will matter more in those states where mask-wearing is low than they would in states where mask-wearing is high.  If people generally wore masks when out in public (and followed also the other socially responsible behaviors that go along with it), the falling temperatures would not matter as much.  But when they don’t, the falling temperatures matter more.

From this overall regression equation, one can also use the coefficients found to estimate what the impact would be of small changes in each of the variables.  These are called elasticities, and based on the estimated equation (and computing the changes around the sample means for each of the variables):  a 1% reduction in the number of miles from Sturgis would lead to a 1.0% rise in the incidence of Covid-19 cases; a 1% reduction (not a 1 percentage point increase, but rather a 1% reduction from the sample mean) in the share of the population wearing masks when out in public would lead to a 1.7% rise in the incidence of Covid-19 cases; and a 1% reduction in the average October temperature across the different states would lead to a 1.2% rise in the incidence of Covid-19 cases.  All of these elasticity estimates look quite plausible.
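The way elasticities are computed at the sample means can be sketched as follows.  The coefficients and means are hypothetical placeholders (not the estimates from the Technical Annex), and I assume a regression that is linear in each variable plus the mask-temperature interaction.  With an interaction term, the effect of a change in mask-wearing depends on the temperature, and vice versa, so each is evaluated at the mean of the other.

```python
# Hypothetical coefficients and sample means, for illustration only.
b_miles, b_temp, b_mask, b_interaction = -0.04, -2.0, -200.0, 2.0
mean_miles, mean_temp, mean_mask, mean_cases = 1000.0, 55.0, 0.8, 50.0

# Marginal effects.  The interaction term adds a piece that depends on
# the mean of the other variable in the product.
d_cases_d_miles = b_miles
d_cases_d_temp = b_temp + b_interaction * mean_mask
d_cases_d_mask = b_mask + b_interaction * mean_temp

# Elasticity: percent change in cases for a 1% change in the variable,
# evaluated at the sample means.
elasticity_miles = d_cases_d_miles * mean_miles / mean_cases
elasticity_temp = d_cases_d_temp * mean_temp / mean_cases
elasticity_mask = d_cases_d_mask * mean_mask / mean_cases

print(f"miles: {elasticity_miles:.2f}")
print(f"temp:  {elasticity_temp:.2f}")
print(f"mask:  {elasticity_mask:.2f}")
```

A negative elasticity here means that a 1% reduction in the variable goes with a rise in the case rate of that percentage.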

These results are consistent with an explanation where the Sturgis rally acted as a significant superspreader event that led to increased seeding of the virus in the locales, in near-by states especially.  This then led to significant increases in the incidence of Covid-19 cases in the different states as this infectious disease spread to friends and family and others in the subsequent months, and again especially in the states closest to Sturgis.  Those increases were highest in the states that grew colder earlier than others and where the share of the population regularly wearing masks was relatively low.  That is, the interaction of the two mattered.  But even with this effect controlled for, along with controlling also for the impact of colder temperatures and for the impact of mask-wearing, the impact of miles to Sturgis remained and was highly significant statistically.

F.  Conclusion

As noted above, the analysis here cannot and does not prove that the Sturgis rally acted as a superspreader event.  There was only one Sturgis rally this year, one cannot run repeated experiments of such a rally under various alternative conditions, and the evidence we have consists simply of correlations of various kinds.  It is possible that there may be some alternative explanation for why Covid-19 cases started to rise sharply in the weeks after the rally in the states closest to Sturgis.  It is also possible it is all just a coincidence.

But the evidence is consistent with what researchers have already found on how the virus that causes Covid-19 is spread.  Studies have found that as few as 10% of those infected may account for 80% of those subsequently infected with the virus.  And it is not just the biology of the disease and how a person reacts to it, but also whether the individual is then in situations with the right conditions to spread it on to others.  These might be as small as family gatherings, or as large as big rallies.  When large numbers of participants are involved, such events have been labeled superspreader events.

Among the most important of the conditions that matter is whether most or all of those attending are wearing masks.  It also matters how close people are to each other, whether they are cheering, shouting, or singing, and whether the event is indoors or outdoors.  And the likelihood that at least one infectious attendee is present rises steadily towards certainty as the number of attendees grows, so the size of the gathering very much matters.
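How the size of a gathering affects the chance that at least one infectious person is present can be sketched with a simple calculation.  The 1% prevalence figure is an assumption chosen only for illustration, as is the assumption that each attendee is infectious independently of the others.  The function name is my own.

```python
def prob_at_least_one_infectious(attendees: int, prevalence: float) -> float:
    """Chance that at least one attendee is infectious, assuming each
    attendee is infectious independently with the given probability."""
    return 1 - (1 - prevalence) ** attendees

prevalence = 0.01  # assumed share of the population currently infectious

for attendees in (10, 150, 200, 5000):
    p = prob_at_least_one_infectious(attendees, prevalence)
    print(f"{attendees:5d} attendees: {p:.1%}")
```

Under these assumptions, the chance that no one present is infectious shrinks exponentially as the crowd grows, so that for large gatherings the presence of at least one infectious attendee becomes close to certain.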

A number of recent White House events matched these conditions, and a significant number of attendees soon after tested positive for Covid-19.  In particular, about 150 attended the celebration on September 26 announcing that Amy Coney Barrett would be nominated to the Supreme Court to take the seat of the recently deceased Ruth Bader Ginsburg.  Few wore masks, and at least 18 attendees later tested positive for the virus.  And about 200 attended an election night gathering at the White House.  At least 6 of those attending later tested positive.  While one can never say for sure where someone may have contracted the virus, such clusters among those attending such events are very unlikely unless the event was where they got the virus.  It is also likely that these figures are undercounts, as White House staff have been told not to let it become publicly known if they come down with the virus.  Finally, as of November 13 at least 30 uniformed Secret Service officers, responsible for security at the White House, have tested positive for the coronavirus in the preceding few weeks.

There is also increasing evidence that the Trump campaign rallies of recent months led to subsequent increases in Covid-19 cases in the local areas where they were held.  These ranged from studies of individual rallies (such as 23 specific cases traced to three Trump rallies in Minnesota in September), to a relatively simple analysis that looked at the correlation between where Trump campaign rallies were held and subsequent increases in Covid-19 cases in that locale, to a rigorous academic study that examined the impact of 18 Trump campaign rallies on the local spread of Covid-19.  This academic study was prepared by four members of the Department of Economics at Stanford (including the current department chair, Professor B. Douglas Bernheim).  They concluded that the 18 Trump rallies led to an estimated extra 30,000 Covid-19 cases in the US, and 700 additional deaths.

One should expect that the Sturgis rally would act as even more of a superspreader event than those campaign rallies.  An estimated 460,000 motorcyclists attended the Sturgis rally, while the campaign rallies involved at most a few thousand at each.  Those at the Sturgis rally could also attend for up to ten days; the campaign rallies lasted only a few hours.  Finally, there would be a good deal of mixing of attendees at the multiple parties and other events at Sturgis.  At a campaign rally, in contrast, people would sit or stand at one location only, and hence only be exposed to those in their immediate vicinity.

The results are also consistent with a rigorous academic study of the more immediate impact of the Sturgis rally on the spread of Covid-19, by Professor Joseph Sabia of San Diego State University and three co-authors.  Using anonymous cell phone tracking data, they found that counties across the US that received the highest inflows of returning participants from the Sturgis rally saw, in the immediate weeks following the rally (up to September 2), an increase of 7.0 to 12.5% in the number of Covid-19 cases relative to the counties that did not contribute inflows.  But their study (issued as a working paper in September) looked only at the impact in the immediate few weeks following Sturgis.  They did not consider what such seeding might then have led to.  The results examined in the analysis here, which is longer-term (up to November 6), are consistent with their findings.

It is therefore fully plausible that the Sturgis rally acted as a superspreader event.  And the evidence examined in this post supports such a conclusion.  While one cannot prove this in a scientific sense, as noted above, the likelihood looks high.

Finally, as I finish writing this, the number of deaths in the US from this terrible virus has just surpassed 250,000.  The number of confirmed cases has reached 11.6 million, with this figure rising by 1 million in just the past week.  A tremendous surge is underway, far surpassing the initial wave in March and April (when the country was slow to discover how serious the spread was, due in part to the botched development in the US of testing for the virus), and far surpassing also the second, and larger, wave in June and July (when a number of states, in particular in the South and Southwest, re-opened too early and without adequate measures, such as mask mandates, to keep the disease under control).  Daily new Covid-19 cases are now close to 2 1/2 times what they were at their peak in July.

This map, published by the New York Times (and updated several times a day) shows how bad this has become.  It is also revealing that the worst parts of the country (the states with the highest number of cases per 100,000 of population) are precisely the states geographically closest to Sturgis.  There is certainly more behind this than just the Sturgis rally.  But it is highly likely the Sturgis rally was a significant contributor.  And it is extremely important if more cases are to be averted to understand and recognize the possible role of events such as the rally at Sturgis.

Average Daily Cases of Covid-19 per 100,000 Population

7-Day Average for Week Ending November 18, 2020

Source:  The New York Times, “Covid in the US:  Latest Map and Case Count”.  Image from November 19, with data as of 8:14 am.

 


Technical Annex:  Regression Results

As discussed in the text, a series of regressions were estimated to explore the relationship between the Sturgis rally and the incidence of Covid-19 cases (the 7-day average of confirmed new cases in the week ending November 6) across the states of the mainland US plus Washington, DC.  Five will be reported here, with regressions on the incidence of Covid-19 cases (as the dependent variable) as a function of various combinations of three independent variables: miles from Sturgis (in terms of their natural logarithms), the average state-wide temperature in October (also in terms of their natural logarithms), and the share of the population in the respective states who reported they always or most of the time wore masks when out in public.  Three of the five regressions are on each of the three independent variables individually, one on the three together, and one on the three together along with an interaction effect measured by multiplying the October temperature variable (in logs) with the share wearing masks.  The sources for each variable were discussed above in the main text.
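Written out, the fullest of the five specifications (the one with the interaction term) would take roughly the following form.  The symbols are shorthand introduced only for this sketch and are not notation used elsewhere in the post: Cases is the 7-day average of confirmed new cases in state i, Miles is the distance to Sturgis, Temp is the average October temperature, and Masks is the share reporting that they wear masks always or most of the time.

```latex
\text{Cases}_i = \beta_0
  + \beta_1 \ln(\text{Miles}_i)
  + \beta_2 \ln(\text{Temp}_i)
  + \beta_3\, \text{Masks}_i
  + \beta_4 \ln(\text{Temp}_i) \times \text{Masks}_i
  + \varepsilon_i
```

The other four regressions are special cases of this one, with the terms for the excluded variables dropped.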

The basic results, with each regression by column, are summarized in the following table:

Regressions on State Covid-19 Cases – November 6

     Miles to Sturgis and Temperatures are in natural logs

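| Variable | Statistic | Miles only | Temp only | Masks only | Miles, Temp, & Masks | All with Interaction |
|---|---|---|---|---|---|---|
| Miles to Sturgis | Slope | -54.9 | | | -41.9 | -36.6 |
| | t-statistic | -10.7 | | | -5.2 | -4.3 |
| Avg Temperature | Slope | | -133.3 | | -45.5 | -516.8 |
| | t-statistic | | -5.5 | | -2.0 | -1.9 |
| Share Wear Masks | Slope | | | -3.1 | -0.8 | -22.4 |
| | t-statistic | | | -3.9 | -1.3 | -1.8 |
| Interaction Temp & Masks | Slope | | | | | 5.44 |
| | t-statistic | | | | | 1.8 |
| Intercept | | 425.5 | 572.5 | 309.4 | 582.5 | 2,422.5 |
| | t-statistic | 11.9 | 6.0 | 4.5 | 7.1 | 2.3 |
| R-squared | | 71.0% | 39.4% | 24.2% | 73.7% | 75.4% |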

In the regressions with each independent variable taken individually, all the coefficients (slopes) found are highly significant.  The general rule of thumb is that a significance level of 5% is adequate to call the relationship statistically “significant” (i.e. that an estimated coefficient this far from zero would be unlikely to arise just from random variation in the data).  A t-statistic of 2.0 or higher in absolute value, in a large sample, would signal significance at least at the 5% level (that is, if the true coefficient were zero, an estimate this far from zero would arise by chance less than 5% of the time), and the t-statistics are each well in excess of 2.0 in absolute value in each of the single-variable regressions.  The R-squared is quite high, at 71.0%, for the regression on miles from Sturgis, but more modest in the other two (39.4% and 24.2% for October temperature and mask-wearing, respectively).
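The rule of thumb linking a t-statistic of about 2.0 to the 5% level can be checked with a short calculation.  The sketch below uses the standard normal distribution as a large-sample stand-in for the t distribution, which is an approximation: with only 49 observations (the 48 mainland states plus Washington, DC), exact p-values from the t distribution would be somewhat larger.  The function name is introduced here only for illustration.

```python
import math


def two_sided_p_normal(t_stat: float) -> float:
    """Two-sided p-value for a t-statistic, using the standard normal
    distribution as a large-sample approximation to the t distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# t-statistics from the three single-variable regressions, plus the
# rule-of-thumb threshold of 2.0 for comparison.
for label, t in [("Miles only", -10.7), ("Temp only", -5.5),
                 ("Masks only", -3.9), ("Threshold", 2.0)]:
    print(f"{label:<11} t = {t:>6}: p is about {two_sided_p_normal(t):.2g}")
```

Under this approximation a t-statistic of 2.0 corresponds to a p-value a little below 5%, while the three single-variable t-statistics all correspond to p-values far below that threshold.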

The estimated coefficients (slopes) are also all negative.  That is, the incidence of Covid-19 goes down with additional miles from Sturgis, with higher October temperatures, and with higher mask-wearing.  The actual coefficients themselves should not be compared to each other for their relative magnitudes.  Their size will depend on the units used for the individual measures (e.g. miles for distance, rather than feet or kilometers; or temperature measured on the Fahrenheit scale rather than Centigrade; or shares expressed as, say, 80 for 80% instead of 0.80).  The choice of units does not change the substance of the results.  Rather, what is of interest is how the predicted incidence of Covid-19 changes when there is, say, a 1% change in any of the independent variables.  These are elasticities and will be discussed below.

In the fourth regression equation (the fourth column), where the three independent variables are all included, the statistical significance of the mask-wearing variable drops, with a t-statistic of just 1.3 in absolute value.  The t-statistic of the temperature variable also falls, to 2.0 in absolute value, which is at the borderline for the general rule of thumb of a 5% significance level.  The miles from Sturgis variable remains highly significant (its t-statistic also fell in absolute value, but remains very high).  If one stopped here, it would appear that what matters is distance from Sturgis (consistent with Sturgis acting as a seeding event), coupled with October temperatures falling (so that the thus seeded virus spread fastest where temperatures had fallen the most).

But as was discussed above in the main text, there is good reason to view the temperature variable as acting not solely by itself, but in an interaction with whether masks are generally worn or not.  This is tested in the fifth regression, where the three individual variables are included along with an interaction term between temperatures and mask-wearing.  The temperature, mask-wearing, and interaction variables now all have a similar level of significance, although each falls just short of the 5% threshold (with p-values of 6% to 8% for each).  While not quite 5%, keep in mind that the 5% is just a rule of thumb.  Note also that the positive sign on the interaction term (the 5.44) is an indication of curvature.  The positive sign, coupled with the negative signs for the temperature and mask-wearing variables taken alone, indicates that the curves are concave upward (the effects of temperature and mask-wearing diminish at the margin at higher values for the variables).  Finally, the miles to Sturgis variable remains highly significant.
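The way the interaction term works can be seen by computing the marginal effect of each variable at different values of the other.  The sketch below uses the rounded coefficients from the fifth column of the table.  It assumes the mask-wearing share is measured in percentage points (e.g. 80 for 80%) and temperature in degrees Fahrenheit; both are assumptions inferred from the size of the coefficients, and the function names are introduced here only for illustration.

```python
import math

# Rounded coefficients from the fifth regression (all with interaction).
B_TEMP = -516.8        # slope on ln(October temperature)
B_MASKS = -22.4        # slope on share wearing masks
B_INTERACTION = 5.44   # slope on ln(temperature) * share wearing masks


def temp_effect(mask_share_pct: float) -> float:
    """Change in cases per unit change in ln(temperature), at a given
    mask-wearing share (assumed to be in percentage points)."""
    return B_TEMP + B_INTERACTION * mask_share_pct


def mask_effect(temp_f: float) -> float:
    """Change in cases per percentage-point change in mask-wearing, at a
    given October temperature (assumed to be in degrees Fahrenheit)."""
    return B_MASKS + B_INTERACTION * math.log(temp_f)


for share in (75, 80, 85, 90):
    print(f"mask share {share}%: temperature effect {temp_effect(share):.1f}")
for temp in (40, 50, 60):
    print(f"October temp {temp}F: mask effect {mask_effect(temp):.2f}")
```

With these coefficients, the estimated effect of temperature shrinks toward zero as the mask-wearing share rises, and the estimated effect of mask-wearing is largest where October temperatures were lowest.  Because the coefficients are rounded, the figures should be read as indicative only.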

Based on this fifth regression equation, with the interaction term allowed for, what would be the estimated response of Covid-19 cases to changes in any of the independent variables (miles to Sturgis, October temperatures, and mask-wearing)?  These are normally presented as elasticities, that is, the predicted percentage change in Covid-19 cases for a small (1%) change in any one of the independent variables.  In a mixed equation such as this, where some terms are linear and some logarithmic (plus an interaction term), the resulting percentage change can vary depending on the starting point chosen.  The conventional starting point is the sample means, and that will be done here.
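A minimal sketch of how such elasticities can be computed from the fifth regression follows.  It uses the rounded coefficients from the table and assumes the mask-wearing share is in percentage points and temperature in degrees Fahrenheit (assumptions inferred from the size of the coefficients).  The sample means are not listed in this annex, so the evaluation point in the usage example is hypothetical and the output will not reproduce the published elasticities.

```python
import math

# Rounded coefficients from the fifth regression (all with interaction).
INTERCEPT = 2422.5
B_MILES = -36.6        # slope on ln(miles to Sturgis)
B_TEMP = -516.8        # slope on ln(October temperature)
B_MASKS = -22.4        # slope on share wearing masks
B_INTERACTION = 5.44   # slope on ln(temperature) * share wearing masks


def predicted_cases(miles: float, temp_f: float, mask_pct: float) -> float:
    """Predicted Covid-19 cases from the fifth regression equation."""
    ln_temp = math.log(temp_f)
    return (INTERCEPT
            + B_MILES * math.log(miles)
            + B_TEMP * ln_temp
            + B_MASKS * mask_pct
            + B_INTERACTION * ln_temp * mask_pct)


def elasticities(miles: float, temp_f: float, mask_pct: float) -> dict:
    """Percent increase in predicted cases from a 1% decrease in each
    variable, evaluated at the given starting point."""
    cases = predicted_cases(miles, temp_f, mask_pct)
    ln_temp = math.log(temp_f)
    return {
        "miles": -B_MILES / cases,
        "temperature": -(B_TEMP + B_INTERACTION * mask_pct) / cases,
        "masks": -(B_MASKS + B_INTERACTION * ln_temp) * mask_pct / cases,
    }


# Hypothetical starting point, NOT the actual sample means.
for name, value in elasticities(miles=1000, temp_f=54, mask_pct=87).items():
    print(f"{name}: {value:.2f}")
```

Several large terms in the equation nearly cancel, so the results are sensitive both to the starting point and to the rounding of the coefficients.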

Also, I have expressed the elasticities here in terms of a 1% decrease in each of the independent variables (since our interest is in what might lead to higher rates of Covid-19 incidence):

Elasticities from Full Equation with Interaction Term

      Percent Increase in Number of Covid-19 Cases from a 1% Decrease Around Sample Means

| Variable | Elasticity |
|---|---|
| Miles to Sturgis | 1.02% |
| October Temperature | 1.16% |
| Share Wearing Masks | 1.69% |

All these estimated elasticities are quite plausible.  If one is 1% closer in geographic distance to Sturgis (starting at the sample mean, and with the other two variables of October temperature and mask-wearing also at their respective sample means), the incidence of Covid-19 cases (per 100,000 of population) as of the week ending November 6 would increase by an estimated 1.02%.  A 1% lower October temperature (from the sample mean) would lead to an estimated 1.16% increase in Covid-19 cases.  And the impact of mask-wearing is stronger still:  a 1% reduction in the share wearing masks would lead to an estimated 1.69% increase in cases, with all the other factors here taken into account and controlled for.

These results are consistent with a conclusion that the Sturgis rally led to a significant seeding of cases, especially in near-by states, with the number of infections then growing over time as the disease spread.  The cases grew faster in those states where mask-wearing was relatively low, and in states with lower temperatures in October (leading people to spend more time indoors).  When the falling temperatures were coupled with a lower share (than elsewhere) of the population wearing masks, the rate of Covid-19 cases rose especially fast.