Research and Validity at PI
Learn about fairness, reliability, and validity research at PI.
Our Science Team monitors, improves, and builds the defensibility of the PI assessments by conducting ongoing research and validity studies. Assessments are constantly monitored to ensure strong psychometrics and fairness. The PI Behavioral Assessment has been certified by third-party reviewers.
Our Scientific Principles
The Predictive Index has an extensive scientific background going back over 60 years. We have conducted hundreds of client validation studies demonstrating relationships between assessment results and job performance across jobs, industries, and countries. Our assessments are built following the industry best practices defined by The Standards for Educational and Psychological Testing. This entails a broad range of research, development, and maintenance tasks including (but not limited to):
- Domain mapping and content validity studies
- Item analysis
- Convergent and discriminant construct validity studies
- Reliability and dimensionality analysis
- Test-retest studies of stability
- Fairness studies, including differential item functioning analysis
- Criterion-related validity studies
- Norm studies
According to The Standards for Educational and Psychological Testing, the foundations for any assessment include validity, reliability, and fairness, which we discuss in further depth below.*
What are some examples of what we look for in terms of fairness, reliability, and validity?
Fairness: Does the assessment measure members of the population the same way? What are the risks of adverse impact when using this assessment? Is the assessment format appropriate for people from different populations, both culturally and geographically?
Reliability: Is the assessment of adequate precision for the intended use? Are scores consistent if someone takes the assessment more than once?
Validity: Does the assessment measure what it is intended to measure? Does it predict what it is supposed to predict?
*For detailed technical documentation regarding our assessments and ongoing research, please reach out to your PI consultant.
Item Analysis and Fairness Research
When developing and implementing assessments for hiring, the assessments and the processes surrounding them need to be fair and generally free of bias. We use a method called differential item functioning (DIF) analysis when developing and maintaining our tests to see whether individuals from different populations who score similarly overall respond meaningfully differently to particular questions. If such a difference is present, it is evidence of DIF and suggests measurement bias. For example, imagine a cognitive ability test on which males and females typically receive similar overall scores, but DIF is present on certain questions, with males more likely to respond correctly. This suggests that measurement bias is present and those questions should be removed. These analyses are conducted both during assessment development and on an ongoing basis. We also examine mean differences based on overall mean scores; for example, do test-takers under the age of 40 score significantly differently on the overall assessment than those 40 and older?
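The matching logic behind a DIF check can be illustrated with a small sketch. This is a hypothetical, simplified Mantel-Haenszel computation, not PI's actual implementation: test-takers are stratified by total score, and the odds of answering one item correctly are compared between a reference group and a focal group within each stratum. The function name and data format are invented for illustration.

```python
# Hypothetical Mantel-Haenszel DIF sketch (not PI's actual procedure).
from collections import defaultdict

def mantel_haenszel_dif(records):
    """records: list of (total_score, group, correct) tuples for one item,
    where group is 'reference' or 'focal' and correct is 0 or 1.
    Returns the Mantel-Haenszel common odds ratio; values near 1.0
    indicate no evidence of DIF on this item."""
    # Stratify test-takers by matching variable (here, total score).
    strata = defaultdict(list)
    for score, group, correct in records:
        strata[score].append((group, correct))

    num, den = 0.0, 0.0
    for rows in strata.values():
        # 2x2 table for this stratum: group x correct/incorrect.
        a = sum(1 for g, c in rows if g == "reference" and c == 1)
        b = sum(1 for g, c in rows if g == "reference" and c == 0)
        c = sum(1 for g, c in rows if g == "focal" and c == 1)
        d = sum(1 for g, c in rows if g == "focal" and c == 0)
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den else float("nan")
```

When both groups answer the item correctly at the same rate within every score stratum, the ratio is 1.0; a ratio far from 1.0 flags the item for review.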
PI assessments are built to support workplace decision-making: they measure what they are intended to measure (construct validity), relate to job performance (criterion validity), and are stable and precise enough for their intended purposes (reliability). The assessments' development, design, and documentation follow industry best practices used by all reputable assessment providers.
The PI Behavioral Assessment has been reviewed by other researchers periodically over the course of the company’s history. In late 2018, Form V of the PI Behavioral Assessment underwent successful certification review under the guidelines published by the European Federation of Psychologists’ Associations (EFPA). This included an in-depth look at both the assessment and its properties, with auditors reviewing the development process, fairness studies, assessment report design, norm structures, and more. The EFPA certification is PI’s largest and most advanced peer review to date. The PI Cognitive Assessment achieved EFPA certification in 2020.
Reliability refers to the precision or consistency of measurement. There are many different types of reliability, as well as ways to measure reliability. One common way to estimate reliability is by computing internal consistency reliability, most often with coefficient alpha which reflects the extent to which items on a measure are intercorrelated. In other words, coefficient alpha captures whether individual questions are consistent with the other questions on the scale or assessment. When developing and maintaining our assessments, we measure coefficient alpha for each factor of the PI Behavioral Assessment, as well as the components of the PI Cognitive Assessment. We continue to conduct reliability analyses, which include region-specific samples.
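As a concrete illustration, coefficient alpha can be computed from a matrix of respondents' item scores. The following is a minimal sketch of the standard formula (the data and function name are invented; this is not PI's internal tooling):

```python
# Minimal coefficient (Cronbach's) alpha sketch for illustration.
def cronbach_alpha(scores):
    """scores: list of respondents, each a list of k item scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(scores[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

When items move together across respondents, alpha approaches 1.0; when items are unrelated, alpha falls toward 0. Values of 0.70 or higher are commonly treated as acceptable when an assessment is one input among several.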
Reliability can also refer to stability over time. For example, if the PI Behavioral Assessment is used to inform a hiring decision, will the information gleaned at the hiring stage still hold true a year later? What about five years later? The most straightforward approach to estimating this kind of reliability is via repeated measurements of the same person, or test-retest reliability. The PI assessments have demonstrated adequate test-retest reliability, indicating that scores on both the PI Behavioral Assessment and the PI Cognitive Assessment should remain relatively stable over time.
Numerous studies have examined the internal consistency and test-retest reliability of the PI Behavioral and Cognitive Assessments, all yielding results indicating at least acceptable levels of reliability for the assessment’s intended use case. Given that the PI Assessments are intended to be used as single pieces of evidence in recruitment and talent management decisions (combined with other information such as applicant details), a moderately high level of internal consistency reliability of 0.70 or higher is considered acceptable.
The validity of an assessment refers to how well the assessment captures the construct it is intended to capture. Construct validity studies, specifically, use data to ensure that the assessments measure what they are supposed to measure. There are many types of validity, including construct, criterion, and content validity, and many ways to measure each.
| Type of Validity | Definition | Example |
| --- | --- | --- |
| Criterion validity | Scores on the measure are linked to relevant outcome variables. | The PI Behavioral Assessment is positively related to job performance. |
| Construct validity | The assessment captures the construct it is intended to capture. | The PI Cognitive Assessment measures cognitive ability. |
| Content validity | The content of the measure adequately captures the universe of content that defines the construct. | The PI Behavioral Assessment covers the content domain of adult work-related personality. |
| Face validity | The content of the measure appears to reflect the construct being measured. | The questions on the PI Cognitive Assessment appear to require cognitive ability. |
Clients are typically interested in criterion-related validity. Such studies show how assessment results relate to outcomes of interest, such as job performance or work behaviors. The PI Behavioral Assessment has undergone over 400 criterion-related validity studies since 1992, demonstrating relationships with a variety of performance measures and workplace behaviors across roles, job levels, and industries.
Our Science Team engages in continuous improvement for our validation, just as we do for our software. Almost all of the Science Team’s core science initiatives relate to aspects of validity. When a new assessment form is developed or an existing form is updated, we conduct all types of validation research, as well as studies of reliability, dimensionality, and fairness. Some of this work is repeated every few years for assessment maintenance or for analysis of specific regions/subpopulations. For example, for the European Federation of Psychologists’ Associations (EFPA) certification of the PI Behavioral Assessment, we reanalyzed the validation, reliability, dimensionality, and fairness work. To maintain EFPA certification, this work is redone every three years.
Validation is a continuous process—it is not established once and deemed to be sufficient. We are always working on reinforcing our validity research, especially as use cases and inferences about the assessments grow. For example, in the past several years, we have conducted research on the interpretation of Match Scores when the assessments are used for hiring and the development of extended time forms for the PI Cognitive Assessment. We have also completed other studies of internal consistency and test-retest reliability and fairness—all of which relate to and underscore the concept of validity.