Manipulation in the RD Design
Introduction
This post gives a formal argument showing precisely why manipulation of the running variable invalidates identification of the treatment effect in a Regression Discontinuity (RD) design.
The Ideal Regression Discontinuity Setup
Let’s define our terms within the potential outcomes framework (Rubin Causal Model).
- Running Variable, $X_i$: A continuous variable for an individual $i$.
- Cutoff, $c$: A pre-determined threshold.
- Treatment, $T_i$: A binary variable indicating treatment status. In a sharp RD design: $T_i = \mathbf{1}[X_i \ge c]$, where $\mathbf{1}[\cdot]$ is the indicator function.
- Potential Outcomes:
  - $Y_i(1)$: The potential outcome for individual $i$ if they receive the treatment.
  - $Y_i(0)$: The potential outcome for individual $i$ if they do not receive the treatment.
- Observed Outcome, $Y_i$: The outcome we actually see in the data. $Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)$.
The Estimand of Interest
In RD, we aim to identify the Local Average Treatment Effect (LATE) at the cutoff $c$: \[ \tau_{RD} = E[Y_i(1) - Y_i(0) | X_i = c] \] This is the average treatment effect for the specific subpopulation of individuals whose running variable is exactly at the threshold.
The Key Identification Assumption
To identify $\tau_{RD}$, we rely on the continuity of conditional expectation functions of potential outcomes at the cutoff. Formally: \[ \lim_{x \to c^+} E[Y_i(0) | X_i = x] = \lim_{x \to c^-} E[Y_i(0) | X_i = x] \quad \text{(Assumption 1a)} \] \[ \lim_{x \to c^+} E[Y_i(1) | X_i = x] = \lim_{x \to c^-} E[Y_i(1) | X_i = x] \quad \text{(Assumption 1b)} \]
Intuition: This assumption states that individuals just above and just below the cutoff are, on average, identical in terms of their potential outcomes. The only systematic difference between them is that one group received the treatment and the other did not. This “as-if-random” assignment in an infinitesimally small neighborhood around the cutoff is what allows for causal inference.
Under this assumption, we can identify the treatment effect: \[ \tau_{RD} = \lim_{x \to c^+} E[Y_i | X_i = x] - \lim_{x \to c^-} E[Y_i | X_i = x] \]
Why? In a sharp design, every individual just above the cutoff is treated and every individual just below it is untreated, so: \[ \lim_{x \to c^+} E[Y_i | X_i = x] = \lim_{x \to c^+} E[Y_i(1) | X_i = x] \] \[ \lim_{x \to c^-} E[Y_i | X_i = x] = \lim_{x \to c^-} E[Y_i(0) | X_i = x] \]
Substituting these into the formula for $\tau_{RD}$ and using Assumptions 1a and 1b, which let each one-sided limit stand in for the conditional expectation at $X_i = c$, gives us the desired estimand.
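The identification result can be checked numerically. The following is a minimal simulation sketch, not part of the original argument: the data-generating process, the constant effect `tau = 2.0`, the bandwidth `h = 0.2`, and the helper `rd_jump` are all illustrative assumptions. It builds potential outcomes that are smooth in the running variable, so Assumptions 1a and 1b hold, and compares the estimated jump at the cutoff with the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
c = 0.0      # cutoff (assumed value)
tau = 2.0    # true, constant treatment effect (assumed value)

# Running variable with no manipulation.
x = rng.uniform(-1.0, 1.0, size=n)

# Potential outcomes: Y(0) is smooth in x, so Assumptions 1a and 1b hold.
y0 = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
y1 = y0 + tau

# Sharp RD assignment and observed outcome.
t = (x >= c).astype(float)
y = t * y1 + (1.0 - t) * y0


def rd_jump(x, y, c, h):
    """Difference of one-sided local linear intercepts at c (hypothetical helper)."""
    right = (x >= c) & (x < c + h)
    left = (x < c) & (x >= c - h)
    # np.polyfit returns [slope, intercept] for a degree-1 fit.
    _, a_right = np.polyfit(x[right] - c, y[right], 1)
    _, a_left = np.polyfit(x[left] - c, y[left], 1)
    return a_right - a_left


tau_hat = rd_jump(x, y, c, h=0.2)
print(f"true tau = {tau:.2f}, estimated jump = {tau_hat:.2f}")
```

With continuity satisfied, the estimated jump should land close to `tau`, up to sampling noise. The uniform-kernel local linear fit is chosen only for brevity; applied work typically uses a data-driven bandwidth and a triangular kernel.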
Formalizing Manipulation
Manipulation occurs when individuals can precisely control or influence their score on the running variable, $X_i$, to gain or avoid treatment.
Let’s define two states for the running variable:
- $X_i^0$: The latent, pre-manipulation, or “true” score an individual would have without any strategic effort.
- $X_i$: The final, observed score after any manipulation has occurred.
Manipulation implies that for some individuals, $X_i \neq X_i^0$. Specifically, individuals with a pre-manipulation score just below the cutoff ($X_i^0 < c$) have an incentive to alter their score to be just above the cutoff ($X_i \ge c$).
Let’s define a characteristic, $M_i$, which represents an individual’s ability or motivation to manipulate their score. For instance, $M_i$ could be a latent variable for “resourcefulness” or “ambition.” It is highly plausible that this characteristic is correlated with potential outcomes. For example, more resourceful students ($M_i$ is high) may also have higher potential earnings ($Y_i(0)$) even without a scholarship (the treatment).
Let’s assume for simplicity that more resourceful individuals would have better outcomes regardless of treatment:
\[ E[Y_i(0) | M_i = m] \text{ is an increasing function of } m. \]
Manipulation Violates the Identification Assumption
Now we will prove that manipulation causes the key continuity assumption (1a) to fail.
Consider the population of individuals whose observed score $X_i$ is in an infinitesimally small neighborhood just below the cutoff, $c$. Let this neighborhood be $[c-\epsilon, c)$.
Any individual whose pre-manipulation score $X_i^0$ fell in this range and who had the ability and incentive to manipulate would have already done so, “jumping” over the cutoff to $X_i \ge c$.
Therefore, the population we observe with scores in $[c-\epsilon, c)$ consists disproportionately of individuals with lower levels of the manipulation characteristic, $M_i$. They are the “non-manipulators.”
Now consider the population whose observed score $X_i$ is in an infinitesimally small neighborhood just above the cutoff, $[c, c+\epsilon)$.
This population is a mixture of two groups:
- Group A: Individuals whose “true” score was already above the cutoff ($X_i^0 \ge c$) and had no need to manipulate.
- Group B: Individuals whose “true” score was just below the cutoff ($X_i^0 < c$) but who successfully manipulated their score to be $X_i \ge c$. These are the “manipulators.”
By construction, the manipulators (Group B) tend to have higher levels of the characteristic $M_i$ than the non-manipulators left behind just below the cutoff.
Let’s re-examine the limits of the conditional expectation of the potential outcome $Y_i(0)$:
The Left-Hand Limit: As we approach $c$ from below, the population is systematically purged of individuals with high $M_i$. \[ \lim_{x \to c^-} E[Y_i(0) | X_i = x] = E[Y_i(0) | X_i \approx c, \text{is a non-manipulator}] \] This reflects the average potential outcome for individuals with a “true” score near $c$ and a low propensity to manipulate.
The Right-Hand Limit: As we approach $c$ from above, the population is a mixture, containing individuals with both high and low $M_i$. Crucially, it includes the high-$M_i$ individuals who jumped the cutoff. \[ \lim_{x \to c^+} E[Y_i(0) | X_i = x] = E[Y_i(0) | X_i \approx c, \text{is a mix of manipulators and non-manipulators}] \]
Since we assumed that $E[Y_i(0) | M_i = m]$ is increasing in $m$, the presence of the high-$M_i$ manipulators in the group just above the cutoff will mechanically inflate the average potential outcome for that group, compared to the group just below it.
This leads directly to the violation of the identification assumption: \[ \lim_{x \to c^+} E[Y_i(0) | X_i = x] > \lim_{x \to c^-} E[Y_i(0) | X_i = x] \]
Therefore, Assumption 1a is violated.
When manipulation is possible, the jump we observe in the outcome variable at the cutoff is no longer just the treatment effect. It is a composite of the true treatment effect and the systematic difference in potential outcomes between the manipulators (who sorted into the treatment group) and the non-manipulators (who remained in the control group).
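Adding and subtracting $\lim_{x \to c^+} E[Y_i(0) | X_i=x]$ splits the observed jump into two parts:
$$\text{Observed Jump} = \underbrace{\left( \lim_{x \to c^+} E[Y_i(1) | X_i=x] - \lim_{x \to c^-} E[Y_i(0) | X_i=x] \right)}_{\text{What we measure}}$$ $$= \underbrace{\lim_{x \to c^+} E[Y_i(1) - Y_i(0) | X_i=x]}_{\tau_{RD}} + \underbrace{\left( \lim_{x \to c^+} E[Y_i(0)|X_i=x] - \lim_{x \to c^-} E[Y_i(0)|X_i=x] \right)}_{\text{Selection Bias}}$$
Here $\tau_{RD}$ is read as the treatment effect for individuals just above the cutoff.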
Since manipulation makes the selection bias term non-zero, the observed jump no longer identifies $\tau_{RD}$, and any estimator of that jump is a biased and inconsistent estimator of the true Local Average Treatment Effect. The identification strategy has failed.
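The decomposition can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a general result: the manipulation rule (individuals with $M_i > 0.5$ and a latent score within 0.1 below the cutoff move to just above it), the coefficient on `m`, the constant effect `tau`, and the helper `rd_jump` are all hypothetical choices. Because a simulation exposes both potential outcomes, the jump in $Y_i(0)$ at the cutoff can be computed directly and compared with the jump in the observed outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
c, tau = 0.0, 2.0                      # cutoff and true effect (assumed values)

x0 = rng.uniform(-1.0, 1.0, size=n)    # latent score X_i^0
m = rng.uniform(0.0, 1.0, size=n)      # manipulation characteristic M_i

# Y(0) is increasing in M_i, matching the assumption in the text.
y0 = 1.0 + 0.5 * x0 + 1.5 * m + rng.normal(0.0, 1.0, size=n)
y1 = y0 + tau

# Assumed manipulation rule: high-M individuals just below c move just above it.
manip = (x0 >= c - 0.1) & (x0 < c) & (m > 0.5)
x = x0.copy()
x[manip] = c + rng.uniform(0.0, 0.05, size=manip.sum())

# Treatment follows the observed (manipulated) score.
t = (x >= c).astype(float)
y = t * y1 + (1.0 - t) * y0


def rd_jump(x, y, c, h):
    """Difference of one-sided local linear intercepts at c (hypothetical helper)."""
    right = (x >= c) & (x < c + h)
    left = (x < c) & (x >= c - h)
    # np.polyfit returns [slope, intercept] for a degree-1 fit.
    _, a_right = np.polyfit(x[right] - c, y[right], 1)
    _, a_left = np.polyfit(x[left] - c, y[left], 1)
    return a_right - a_left


observed_jump = rd_jump(x, y, c, h=0.2)
selection_bias = rd_jump(x, y0, c, h=0.2)   # jump in Y(0); zero under Assumption 1a

print(f"true tau       = {tau:.3f}")
print(f"observed jump  = {observed_jump:.3f}")
print(f"selection bias = {selection_bias:.3f}")
print(f"tau + bias     = {tau + selection_bias:.3f}")
```

In this setup the jump in $Y_i(0)$ is positive, and the observed jump equals `tau + selection_bias` up to floating-point error, because the observed outcome is $Y_i(0) + \tau$ on the right of the cutoff and $Y_i(0)$ on the left.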
McCrary Test
This formal proof has a clear empirical prediction. If individuals just below the cutoff are “jumping” to be just above it, we should see a drop in the population density of the running variable just before $c$ and a spike in the density just after $c$. This creates a discontinuity in the probability density function of $X_i$ at the cutoff. The McCrary (2008) test [1] is a formal statistical test for precisely this discontinuity, serving as a primary diagnostic for manipulation in RD studies.
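The density prediction can be illustrated with a deliberately crude check. The sketch below is not McCrary's estimator, which smooths a finely binned histogram with local linear regression on each side of the cutoff. It only compares raw counts in two equal-width windows around $c$, using a normal approximation to a binomial with probability 0.5, and it ignores any slope in the density. The function name `density_jump_z`, the window width, and the manipulation rule are assumptions made for illustration.

```python
import math

import numpy as np


def density_jump_z(x, c, h):
    """Crude count comparison around c (hypothetical helper, not McCrary's test).

    If the density is continuous and roughly flat near c, an observation in
    [c - h, c + h) falls on either side with probability about 0.5, so
    z = (n_right - n_left) / sqrt(n_left + n_right) is approximately N(0, 1).
    """
    n_left = int(np.sum((x >= c - h) & (x < c)))
    n_right = int(np.sum((x >= c) & (x < c + h)))
    z = (n_right - n_left) / math.sqrt(n_left + n_right)
    return n_left, n_right, z


rng = np.random.default_rng(2)
n = 200_000
c = 0.0                                # cutoff (assumed value)

x0 = rng.uniform(-1.0, 1.0, size=n)    # latent score, no manipulation
m = rng.uniform(0.0, 1.0, size=n)      # manipulation characteristic M_i

# Assumed manipulation rule: high-M individuals just below c move just above it.
manip = (x0 >= c - 0.1) & (x0 < c) & (m > 0.5)
x = x0.copy()
x[manip] = c + rng.uniform(0.0, 0.05, size=manip.sum())

left_clean, right_clean, z_clean = density_jump_z(x0, c, h=0.05)
left_manip, right_manip, z_manip = density_jump_z(x, c, h=0.05)

print(f"no manipulation: left={left_clean}, right={right_clean}, z={z_clean:.2f}")
print(f"manipulation:    left={left_manip}, right={right_manip}, z={z_manip:.2f}")
```

Without manipulation the two counts should be similar and `z` small in absolute value. With manipulation, the window just below the cutoff is depleted and the window just above it is inflated, so `z` becomes large and positive.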
Footnotes
1. McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698-714.