
Statistical Proof of Employment Discrimination
Statistical testing has been an element of proof in employment discrimination litigation since the appearance of the binomial model in Castaneda v. Partida, 430 U.S. 482 (1977) (concerning the race of jurors), and Hazelwood School District v. United States, 433 U.S. 299 (1977) (concerning the race of newly hired teachers). Since that time, statistical testing has been used extensively to compare the expected number of members of some protected group with the actual number of members of that group who have been hired (Hazelwood), fired (Sheehan, infra), or otherwise involved in a significant employment action.
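The comparison of expected and actual counts under a binomial model can be sketched as follows. This is a minimal illustration, not a reconstruction of any court's calculation: the function name `binomial_disparity` and the figures (400 selections, a 25 percent protected-group share of the relevant pool, 70 protected-group members actually selected) are hypothetical and chosen only to show the arithmetic.

```python
import math

def binomial_disparity(n_selected, protected_share, observed):
    """Return (expected count, standard deviation, disparity in SD units).

    Assumes each selection is an independent draw from a pool in which
    protected_share is the proportion of protected-group members.
    """
    expected = n_selected * protected_share
    sd = math.sqrt(n_selected * protected_share * (1 - protected_share))
    return expected, sd, (observed - expected) / sd

# Hypothetical figures, not drawn from any case.
expected, sd, z = binomial_disparity(n_selected=400, protected_share=0.25, observed=70)
print(f"expected={expected:.1f}, sd={sd:.2f}, disparity={z:.2f} standard deviations")
```

With these assumed inputs the observed count falls roughly three and a half standard deviations below the expected count. The independence assumption built into the binomial model is itself something an opposing expert may contest.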
The persuasiveness of statistical proof in employment discrimination matters seems well established. See International Brotherhood of Teamsters v. United States, 431 U.S. 324, 339-40 (1977) (stating that "our cases make it unmistakably clear that 'statistical analyses have served and will continue to serve an important role' in cases in which the existence of discrimination is a disputed issue . . . . We have repeatedly approved the use of statistical proof, where it reached proportions comparable to those in this case, to establish a prima facie case of racial discrimination in jury selection cases . . . . Statistics are equally competent in proving employment discrimination."). See also, generally, Bazemore. The notion of using regression analysis in employment discrimination litigation dates at least to 1975 and the publication of a student note that advocated the idea. See Note, Beyond the Prima Facie Case in Employment Discrimination Law: Statistical Proof and Rebuttal, 89 Harv. L. Rev. 387 (1975). The applicability of regression to the analysis of discrimination has been extensively discussed and documented since then. See Finkelstein, The Judicial Reception of Multiple Regression Studies in Race and Sex Discrimination Cases, 80 Colum. L. Rev. 737 (1980); Rubinfeld, Econometrics in the Courtroom, 85 Colum. L. Rev. 1048 (1985); Lempert, Statistics in the Courtroom: Building on Rubinfeld, 85 Colum. L. Rev. 1098 (1985); Note, Title VII, Multiple Linear Regression Models, and the Courts: An Analysis, J. Law & Contemp. Probs., Fall 1983, p. 284; Fienberg, The Increasing Sophistication of Statistical Assessments as Evidence in Discrimination Litigation, 77 Am. Stat. A. J. 784 (1982).
The regression issues raised in the Econometrics Section inform the use of regression analysis in employment discrimination cases much as they do in antitrust and securities litigation. Regression models must be properly specified and must meet the basic regression assumptions in employment discrimination cases just as they must in the other types of litigation. With respect to model specification (what economists call including all relevant explanatory variables), it is appropriate to repeat here the conflict between Daubert, which says that regression must meet the standards that economists would apply to their nonlitigation research, and Bazemore, which seems to suggest that a regression analysis is not fatally flawed just because it leaves out a relevant variable. A Seventh Circuit opinion highlights why an omitted variable is fatal to the ability of statistical analysis to inform the finder of fact, and it illustrates what the scholarly literature cited throughout this chapter describes as the "undesirable results" that occur when regression models are incorrectly specified.
In Sheehan v. Daily Racing Form, Inc., 104 F.3d 940 (7th Cir. 1997), plaintiff Sheehan was a well-regarded older employee of a racing newspaper company that used manual layout procedures to generate its papers. Defendant purchased a similar company that used computerized layout procedures, and converted its operations to the computerized techniques.
In subsequent layoffs, Sheehan and most of the other older employees, age 48 and above, were terminated, while most of the younger workers, age 42 and below, were retained. Sheehan brought suit for age discrimination, and his expert proffered a statistical study that showed a strong correlation between age and the pattern of dismissal. The court excluded the expert’s testimony, noting that the expert had failed to consider computer skill as an explanatory variable in his analysis of terminations and that the omitted variable, computer skill, was correlated with age. As a result, if Daily Racing Form had terminated employees who lacked computer skills, and the older workers tended to lack computer skills, then a study that omitted computer skills as an explanatory variable would find a correlation between dismissal and age whether or not age was a criterion for dismissal. While the opinion does not identify the type of statistical analysis employed, this failure is an example of the class of misspecification problems discussed in multiple contexts throughout this site and in the econometrics literature. When a regression model omits explanatory variables that are correlated with included explanatory variables, the regression coefficients and their tests and error rate calculations lose the desirable properties that make the law deem them reliable. This is a prime example of why a regression that omits an important variable must be excluded by the gatekeeper, rather than being admitted and going to weight. When important explanatory variables are omitted, the statistical analysis is unreliable. It appears to be saying things that it is not saying. It not only misleads, it lacks the capacity to inform, so it cannot be shown to have probative value. In Rule 403 terms, it has no probative value, but it surely has the capacity to misinform the jury, so the danger of misleading the jury necessarily and substantially outweighs the nonexistent probative value.
Analogous statements hold for nonregression statistical models. To be precise, the point applies to the omission of important explanatory variables that are correlated with variables included in the model.
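The mechanism described above can be illustrated without any regression at all. The sketch below uses an entirely hypothetical workforce (the headcounts, the `workforce` table, and the `termination_rate` helper are assumptions made for illustration and are not taken from the Sheehan record). Terminations in this made-up data depend only on computer skill, yet a comparison that ignores skill shows a large age gap.

```python
# Hypothetical workforce, constructed for illustration only:
# (age_group, has_computer_skill) -> (headcount, number terminated)
workforce = {
    ("older", False): (30, 24),
    ("older", True): (10, 1),
    ("younger", False): (10, 8),
    ("younger", True): (30, 3),
}

def termination_rate(age_group, skill=None):
    """Share terminated in an age group, optionally within one skill level."""
    headcount = 0
    terminated = 0
    for (age, has_skill), (n, fired) in workforce.items():
        if age == age_group and (skill is None or has_skill == skill):
            headcount += n
            terminated += fired
    return terminated / headcount

for age in ("older", "younger"):
    print(
        f"{age}: overall={termination_rate(age):.3f}, "
        f"without skill={termination_rate(age, False):.3f}, "
        f"with skill={termination_rate(age, True):.3f}"
    )
```

In this constructed example the overall termination rate is far higher for older workers, but within each skill level the two age groups are terminated at identical rates. An analysis that omits skill would attribute to age a disparity that the data, by construction, do not support.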
Sobel v. Yeshiva University, 839 F.2d 18 (2nd Cir., 1988)
The experts in Sobel provide a contrast. In Sobel, plaintiffs offered a multiple regression model of salary determination that showed a sex dummy variable to be a statistically significant determinant of salary. Defendants countered that in their model, which used slightly different independent variables, the sex variable was not statistically significant. Sobel, 839 F.2d 18. There is little need to reiterate the statistical discussion of model specification here, but it is appropriate to note that Sobel addresses precisely the model specification issue discussed previously in this chapter.
The analysis of the Sobel trial court provides a quick review of some relevant cases and a prototype investigation:
"As noted earlier, the plaintiffs relied extensively upon statistics in their attempt to prove their case. Such a heavy reliance upon statistical evidence in employment discrimination cases has, of course, been widely accepted by the courts and even received the imprimatur of the Supreme Court. See Hazelwood School District v. United States, 433 U.S. 299, 307-08 (1977). In such a case, the court's task is to determine whether the plaintiff's statistics make out a prima facie case of a practice or pattern of discrimination, and, if so, whether that case is fatally undercut by a showing that the plaintiff's "proof is either inaccurate or insignificant." Teamsters v. United States, supra, 431 U.S. at 360. Of course, in determining whether the plaintiff has established a prima facie case, the court should not ignore the defendant's relevant evidence of nondisparate treatment. 'Prima facie evidence means the 'net of the evidence;' that is, the court must consider the evidence presented by both parties.' Thus, this Court has carefully weighed both the plaintiffs' and the defendant's statistical proof before deciding whether a prima facie case of discrimination has been presented.
Here, the plaintiffs' experts designed a multiple regression model to estimate the effects that various independent variables had upon the single, dependent variable, salary level. When properly used in a Title VII case, this "methodology provides the ability to determine how much influence factors such as sex, experience, and education each have had on determining the value of a variable such as salary level."
Significantly, the courts have generally accepted the idea that the plaintiff, to establish a prima facie case by statistical evidence, is required to demonstrate a difference of more than two or three standard deviations between the expected incidence of a particular type of event (or, as in this case, the expected level of salary for the minority in question) and the actual incidence of such events (or actual level of the minority's salary). See, e.g., Hazelwood, supra, 433 U.S. at 308-09 n.14; Castaneda v. Partida, 430 U.S. 482, 496-97 n.17 (1977); Board of Education of the City of New York v. Califano, 584 F.2d 576, 584 n.29 (2d Cir. 1978), aff'd, 444 U.S. 130 (1979). Although some have suggested that such an absolute requirement should not be imposed in all cases, few would disagree that where, as here, the fungibility of the faculty of a medical college is considerably in question and the number of independent variables so great, one can have little confidence in drawing inferences from statistics that reveal a disparity that is significant at a level of less than two standard deviations." (some citations omitted).
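The "two or three standard deviations" benchmark in the passage quoted above can be related to the significance levels statisticians usually report. The sketch below assumes a normal approximation and a two-tailed test; the court's discussion does not specify either, and the helper name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(num_sd):
    """Two-tailed normal probability of a deviation at least num_sd
    standard deviations from the expected value, in either direction."""
    return math.erfc(abs(num_sd) / math.sqrt(2))

for k in (2, 3):
    print(f"{k} standard deviations -> p = {two_sided_p_value(k):.4f}")
```

Under these assumptions, two standard deviations corresponds to a p-value of roughly 0.05 and three standard deviations to roughly 0.003, which is why the two-standard-deviation line is often treated as comparable to the conventional five percent significance level.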
Statistical testing is used by plaintiff to establish prima facie evidence of disparate impact and then by defendant to show flaws in plaintiff’s analysis, perhaps by showing that there are legitimate job-related characteristics of the affected personnel that, when accounted for, undermine plaintiff’s statistical analysis.