This study explores the implications of early between-school tracking within educational systems – a practice that involves sorting students into different educational pathways based on their achievement levels. We examine two potential effects of this process: (i) the promotion of homogeneous learning environments through tracking, and (ii) the potential for tracking to exacerbate social segregation among schools. To scrutinize these effects, we analyze data from the assessment studies PISA, TIMSS, and PIRLS (1995–2019). Additionally, we investigate whether school selectivity influences the tracking effects. Using difference-in-differences models combined with multiverse analyses, our findings demonstrate that early between-school tracking indeed contributes to the homogeneity of learning environments and can lead to increased social school segregation. However, our results do not indicate a moderating role of school selectivity.
Summary: In this study, we examine the early sorting of students with different achievement levels into schools with different educational tracks ("tracking"). We consider two possible effects of this sorting: (i) the emergence of homogeneous learning environments and (ii) the reinforcement of social segregation between schools. To analyze these effects, we use data from all PISA, TIMSS, and PIRLS studies between 1995 and 2019. In addition, we examine whether the selectivity of schools influences the observed tracking effects. The results of our difference-in-differences models and multiverse analyses show that early tracking indeed contributes to the homogeneity of learning environments and can lead to increased social segregation. However, our results do not point to a moderating role of selectivity.
Keywords: Early between-school Tracking; Difference-in-Differences; Multiverse Analysis; Segregation; Gliederung; Bildungssystem; Differences-in-Differences; Multiverse-Analyse
Article note: Maximilian Brinkmann and Nora Huth-Stöckle share first authorship.
Over the past decades, scholars have debated the merits of sorting students into secondary schools of different educational tracks based on students' scholastic achievement at the end of primary school, referred to as early between-school tracking (hereinafter "tracking" or "early-tracking"). In particular, the question of whether tracking increases overall achievement and reinforces social inequalities in achievement has led to a large body of empirical research ([
Consequently, the effects of educational tracking on achievement and inequality of achievement continue to be a topic of intense debate ([
Considering the controversial nature of early between-school tracking, it is surprising that these ambivalent effects of tracking – homogenous learning environments and social segregation – remain largely untested (however, see Strello et al. 2021; [
To operationalize these questions, we pooled data from twenty large-scale assessment studies (i.e., PIRLS, TIMSS, PISA) conducted between 1995 and 2019. This allows us to estimate how similar students are in a given school with regard to their achievement and social status – both in primary and secondary school and across 63 unique countries and regions. We then constructed synthetic cohorts and estimated Difference-in-Differences (DiD) models. We thereby exploit the fact that no country tracks students in primary school (Hanushek & Wößmann 2006) and observe tracked and non-/late-tracked education systems before and after a potential tracking policy was administered. Lastly, we apply multiverse analyses ([
This study makes two main contributions to the literature. First, we enhance our understanding of the effects of early between-school tracking by empirically evaluating two core mechanisms. We thus place the two central vantage points from which the effects of tracking are understood (i.e., efficiency via homogenous learning environments vs. stratification via social segregation) on an empirical footing. Second, we motivate the use of multiverse analysis in educational research. Using multiverse analysis means that we systematically vary our data-analytic decisions, make them transparent, and show their impact on results ([
Across 5,760 model specifications, we found consistent evidence that early between-school tracking increases the homogeneity of learning environments. Furthermore, our analyses across 3,840 model specifications suggest that tracking students at an early age can increase social segregation across schools. However, we do not find evidence that the selectivity of education systems moderates the effects of tracking on the homogeneity of learning environments or social segregation.
There are various but often similar ways of conceptualizing tracking. Most of the definitions identify tracking as some sort of ability grouping: "the practice of assigning students to instructional groups on the basis of ability" ([
However, the main part of the controversy that surrounds tracking concerns early between-school tracking, where a high level of tracking and a low age of tracking coincide (Hanushek & Wößmann 2006; Strello et al. 2021). In prototypical early between-school tracking countries such as Germany, Austria, or the Netherlands, students are actively sorted into different types of secondary schools after four or six years of primary school. Typically, these secondary schools prepare for either vocational or academic training, though intermediate or mixed types also exist. Early between-school tracking therefore has substantive consequences for students' educational careers, since different school types (tracks) imply different curricula and educational credentials.
Our definition of tracking therefore refers to early between-school tracking, which implies an active sorting of students after primary school. We define early tracking countries as those countries in which students are already tracked in different types of schools in grade 8. Because the concept of "early" is ambiguous, we will also consider countries that track in grade 9 as an alternative operationalization of early tracking.
A large body of empirical literature has investigated the effects of early between-school tracking on students' achievement and (in-)equalities in achievement ([
The empirical literature suggests that educational tracking does not have positive effects on achievement but increases inequality (Terrin & Triventi 2022). However, a small but vocal group of scholars remains skeptical, most prominently Hartmut Esser. This is also evidenced in a recent controversy, in which the results of a study concerning tracking and selectivity (Esser & Seuring 2020) were challenged by other researchers (Heisig & Matthewes 2022; see Esser & Seuring 2023 for a reply). In this context, we argue that the debate would benefit from a more comprehensive understanding of tracking effects. For example, a naive explanation for the observed empirical patterns (e.g., Terrin & Triventi 2022) is that tracked systems increase social segregation without promoting homogeneity in achievement. A more nuanced explanation could involve the interplay of homogenous learning environments, social segregation, and resource stratification (unequal allocation of resources across tracks), resulting in a zero-sum game for achievement but heightened inequalities (Betts 2011; Terrin & Triventi 2022). Although our study does not aim to delve into these explanatory models, it is crucial to highlight that both explanations are plausible given the limited empirical evidence available. Does tracking lead to more homogenous learning environments? Does it amplify social segregation? In the following sections, we demonstrate that both perspectives can be reasonably argued and are not mutually exclusive.
A common rationale for the implementation of tracking is efficiency (Brunello & Checchi 2007; Hanushek & Wößmann 2006; Matthewes 2021). This notion emphasizes that sorting is based on ability, or at least on ability proxies such as scholastic achievement at the end of primary school. As a result, students in a given track tend to be more similar in terms of their achievement, creating homogeneous learning environments. Such environments are posited as prerequisites for the higher efficiency of tracked systems, as they allow teachers and schools to tailor their entire mode of instruction to learning groups characterized by only minor variances in achievement. Consequently, learning could become more efficient for all students within these systems, and the early sorting of students ensures that these positive effects are sustained over a prolonged period (Brunello & Checchi 2007).
In contrast, critics of early between-school tracking contend that the sorting process primarily perpetuates the existing societal stratification. Differences in track placement may often reflect disparities in school preparedness, in familiarity with the school environment, and in parents' ability to intervene on behalf of their children in an educational setting ([
Although these different perspectives on tracking may strongly disagree on its merits, both are viable. Tracking could lead to both homogenous learning environments and socially segregated schools. This phenomenon can be understood within Boudon's theoretical framework (1974), which explains differences in educational achievement through primary and secondary effects. Primary effects suggest an association between achievement and a student's social status, indicating that students from a higher social status, on average, achieve higher levels of academic success at the end of primary school. Hence, even if the sorting process is primarily based on observed achievement, track placement may still align with a student's social status, resulting in homogenous learning environments and social segregation. Consequently, our study aims to investigate two hypotheses.
H1: Compared to non-/late-tracking systems, students in early between-school tracking systems show a higher similarity in secondary schools than in primary schools with respect to their achievement.
H2: Compared to non-/late-tracking systems, students in early between-school tracking systems show a higher similarity in secondary schools than in primary schools with respect to their social status.
While tracked systems aim to select students based on their ability (i.e., via observed scholastic achievement), many education systems do not strictly select on achievement alone but allow parents and teachers to intervene in the selection process. This enables so-called secondary effects to play out ([
For example, in German-speaking countries (Austria, Germany, Switzerland) and the Flemish part of Belgium, students' achievement at the end of primary school is the basis for track recommendations, but the selection process may also involve school or parental influence (Boone & Van Houtte 2013; Dumont et al. 2019; Neuenschwander & Grunder 2010). However, sorting students into different tracks may be more rigid in other countries. In Singapore, students must take a nationwide standardized test that essentially determines their track choice (Singapore Examinations and Assessment Board 2023; [
Given this heterogeneity, tracked systems vary in their degree of selectivity, which refers to the extent to which the sorting process is based on achievement. Within the framework of secondary effects (differential decision-making based on social status), the selectivity of a system can shape the extent to which social status overrides the observed achievement. In less selective systems, families with higher social status may "game the system" (Dumont et al. 2019), enabling their children to move to higher tracks despite insufficient levels of achievement. By contrast, higher selectivity may prevent such behavior, as students are primarily sorted based on their achievement. Consequently, higher selectivity should lead to more homogenous learning environments and less social segregation in schools. However, it is important to note that this does not imply the absence of social segregation, since primary effects (the association between achievement and social status) still cause differences in achievement at the end of primary school based on social status. Previous research (focusing on single countries) has indicated either heterogeneous (e.g., Esser & Hoenig 2018; [
Given this background, we examine selectivity from a cross-country perspective and inquire whether selectivity moderates the effects of tracking on the homogeneity of learning environments and the degree of social segregation:
H3a: In early between-school tracking systems, the greater the emphasis on achievement (rather than SES) in the selection process, the more similar students are within secondary schools compared to primary schools with respect to their achievement.
H3b: In early between-school tracking systems, the greater the emphasis on achievement (rather than SES) in the selection process, the less similar students are within secondary schools compared to primary schools with respect to their social status.
Rather than estimating a single set of models to examine the effect of early between-school tracking, we conduct multiverse analyses (Steegen et al. 2016). Multiverse analyses help to investigate the robustness of the results to various data analytic decisions made during the research process. Researchers face numerous and sometimes arbitrary data analytic decisions, creating a "garden of forking paths" (Gelman & Loken 2013; Young & Holsteen 2017), where a single analysis represents only one possibility from a larger set of alternatives. Multiverse analyses address two key challenges in scientific research: the enhancement of transparency and the uncertainty in modeling (Young & Holsteen 2017). In our study, we estimated various plausible models which systematically vary in their characteristics. We then visualized the different estimated coefficients for the tracking effect and evaluated their robustness and consistency across the different specifications using specification curves (Simonsohn et al. 2020) and influence regressions (Young & Holsteen 2017).
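The idea of systematically varying analytic decisions can be sketched as a full factorial grid. The following is a minimal illustration only: the dimension names and options are a hypothetical subset chosen for readability and do not reproduce the complete set of choices or the specification counts reported in this study.

```python
from itertools import product

# Hypothetical subset of analytic dimensions; names and options are illustrative
# assumptions and do not cover all choices made in the study.
dimensions = {
    "achievement_domain": ["math", "reading", "science"],
    "index": ["icc", "dissimilarity"],
    "early_tracking_cutoff": ["before_grade_8", "before_grade_9"],
    "gdp_per_capita": ["included", "excluded"],
}


def build_multiverse(dims):
    """Return one dict per specification (full factorial of all options)."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*dims.values())]


specifications = build_multiverse(dimensions)
print(len(specifications))  # 3 * 2 * 2 * 2 = 24
```

Each dictionary in `specifications` would then drive one model run, and the resulting coefficients could be sorted and plotted as a specification curve.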
In this section, we first describe our analytical decisions concerning data exclusion, handling of missing data, choice of statistical model, operationalization, and inclusion of variables, as well as reasonable alternatives to these decisions. Table 1 summarizes our choices, and Table A1 in the Appendix presents the different operationalizations of our measures. The specifications presented in italics represent our initial or preferred choices, i.e. the model we would have estimated if we had not conducted a multiverse analysis. It is important to note that "preferred" does not necessarily mean that these choices are superior in all cases; we will demonstrate that the different specifications are sometimes equally plausible. In the analyses section, we discuss the results of our initial specification and subsequently explore the robustness of these results using multiverse analyses.
To test our hypotheses, we combined data from multiple school assessment studies, including the Progress in International Reading Literacy Study (PIRLS: 2001, 2006, 2011, 2016), the Trends in International Mathematics and Science Study (TIMSS) 4th grade (1995, 2003, 2007, 2011, 2015) and 8th grade (1999, 2007, 2011, 2015, 2019), and the Programme for International Student Assessment (PISA: 2000, 2006, 2009, 2012, 2015, 2018). These datasets were chosen for their suitability to our research question based on three key reasons. First, they provide information on students' educational achievements and family backgrounds, allowing us to measure homogeneity of learning environments and social segregation in schools. Second, the datasets encompass a range of countries exhibiting both tracked and non-/late-tracked education systems, enabling us to compare students from both types of systems. Third, the datasets cover both pre-tracking (4th grade in TIMSS and PIRLS) and post-tracking periods (8th grade in TIMSS and 15-year-old students in PISA), facilitating a difference-in-differences approach to comparing students before and after tracking (further details below).
As we are interested in differences between tracked and non-/late-tracked countries, we aggregated the student-level data on the country level for each cohort, using sample weights. Thus, country-level aggregates constitute the units of analysis in all models. Our final dataset includes 63 countries (with 876 country-cohort observations), with nine of them already implementing tracking in the 8th grade or earlier (see Table A2 in the Appendix for a detailed overview). While the data are generally well suited for addressing our research question, it is important to note four key points. First, the studies employ different achievement measures: PIRLS and TIMSS assess skills and knowledge taught in schools, whereas PISA evaluates students' ability to apply these skills and knowledge. However, the differences in assessments do not conflict with our research design, since we are primarily interested in the distribution of test scores across schools and country aggregates, focusing on the similarity of students within schools.
Second, the studies differ in their sampling strategies. TIMSS and PIRLS select students from specific grades (e.g., 4th or 8th grade), whereas PISA tests students who are approximately 15 years old, regardless of their grade level. This poses a challenge for our analysis because students in the same school may have varying levels of achievement because they are in different grades. Moreover, some country samples include students from grades before and after they are tracked into different educational paths. Ignoring this would lead to inaccurate estimates of the potential effects of tracking because our outcomes measure the similarity of students within schools (see the measures section below). To address this challenge, we generated the country-level aggregates based only on students from one grade and excluded students who have already been tracked in countries that are not classified as "early-trackers." We selected the grade with the highest number of student-level observations (referred to as the "modal grade") and set a minimum requirement of 1,000 student-level observations. Where a country tracks students after grade 9 but grade 10 is the modal grade, we selected grade 9 instead, provided that grade 9 represents at least 25 percent of the sample and more than 1,000 student-level observations are available for that grade.
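The grade-selection rule for PISA samples can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name and arguments are hypothetical, and returning `None` (dropping the country sample) when the conditions are not met is an assumption, since the text does not spell out that case.

```python
from collections import Counter


def select_grade(grades, tracks_after_grade_9, min_obs=1000, min_share=0.25):
    """Pick the grade used for country-level aggregation of one PISA sample.

    grades: one grade level per sampled student.
    tracks_after_grade_9: True if the country first tracks after grade 9.
    Returns the selected grade, or None if no grade qualifies (assumption).
    """
    counts = Counter(grades)
    total = sum(counts.values())
    modal_grade, modal_n = counts.most_common(1)[0]

    # Special case: grade 10 is most common, but those students are already tracked.
    if tracks_after_grade_9 and modal_grade == 10:
        n_grade_9 = counts.get(9, 0)
        if n_grade_9 > min_obs and n_grade_9 / total >= min_share:
            return 9
        return None  # assumption: no usable pre-tracking grade

    # Default: modal grade, subject to the minimum number of observations.
    return modal_grade if modal_n >= min_obs else None
```

For example, a sample with 1,500 grade 9 and 3,000 grade 10 students from a country that tracks after grade 9 would yield grade 9, because grade 9 makes up one third of the sample and exceeds 1,000 observations.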
A third challenge arising from the data is the heterogeneity within the sample of countries (refer to Table A3 in the Appendix). To ensure that any observed tracking effects do not stem from specific characteristics of the sample composition, we employed various samples for the multiverse analyses. We re-ran our models to include only economically comparable countries, according to the World Bank's classification by gross national income (World Bank 2021a). In particular, we estimated one alternative specification that omitted low-income economies and a further specification that excluded both low-income and lower-middle-income economies. Moreover, in further model specifications, we excluded countries with selective student populations in secondary school, that is, countries where only a restricted segment of students proceeds to the 8th or 9th grade. Specifically, we removed countries with a secondary school enrollment rate below 80 percent.
Finally, the surveys differ with regard to the achievement test content. PIRLS focuses exclusively on reading achievement, whereas TIMSS surveys math and science achievement. PISA covers all three domains with a focus on one specific domain per survey year. Consequently, our sample sizes vary depending on the respective outcome variable. Models estimating homogeneity in achievement are based on fewer observations compared to models that assess social segregation since information on students' social backgrounds has been collected in all three surveys.
We estimate different models to test our hypotheses. Whether early between-school tracking leads to homogeneous learning environments (Hypothesis 1) is investigated in model M1a. In model M1b, we investigate whether early between-school tracking increases social segregation (Hypothesis 2). The question of whether selectivity moderates the effects of tracking is investigated in models M2a (outcome: homogeneous learning environments) and M2b (outcome: social segregation).
In the following section, we give a detailed description of the operationalization of our measures and their different variants. An overview of the various operationalizations is provided in Table A1 in the Appendix.
We assess two dependent variables: (
As an alternative specification, we used the dissimilarity index. The dissimilarity index compares the distribution of two groups and thus requires a binary variable. We varied the cut-off points to transform the outcome variables into binary variables (see Table A1 in the Appendix). The ICC and dissimilarity index have been calculated based on weighted data, incorporating sampling weights (total student weights) to ensure that the dependent variables represent nationally representative estimates. Table A4 in the Appendix provides information on the range and distribution of the dependent and independent variables.
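To make the alternative index concrete, the following sketch computes a weighted dissimilarity index across schools using its standard definition, D = 0.5 · Σ |a_s/A − b_s/B|, where a_s and b_s are the weighted numbers of students from each of the two groups in school s. The function and argument names are hypothetical, and the binary group indicator is assumed to have been created beforehand from one of the cut-off points.

```python
from collections import defaultdict


def dissimilarity_index(school_ids, in_group, weights):
    """Weighted dissimilarity index between two groups across schools.

    school_ids: school identifier per student.
    in_group:   True/False per student (binary version of the outcome).
    weights:    sampling weight per student.
    Returns a value between 0 (even distribution) and 1 (full separation).
    """
    group_a = defaultdict(float)  # weighted count of in-group students per school
    group_b = defaultdict(float)  # weighted count of all other students per school
    for school, flag, weight in zip(school_ids, in_group, weights):
        if flag:
            group_a[school] += weight
        else:
            group_b[school] += weight

    total_a = sum(group_a.values())
    total_b = sum(group_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("Both groups must be present to compute the index.")

    schools = set(group_a) | set(group_b)
    return 0.5 * sum(
        abs(group_a.get(s, 0.0) / total_a - group_b.get(s, 0.0) / total_b)
        for s in schools
    )
```

In this reading, a value of 0 means both groups are distributed identically across schools, and a value of 1 means no school contains students from both groups.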
For our treatment variable, we distinguished between early-tracking and non-tracking/late-tracking education systems. We operationalized the tracking indicator based on the grade at which tracking was first implemented. Education systems in which tracking occurs before the 8th grade are categorized as early-tracking systems, implying that students are already tracked in grade 8 or earlier. In an alternative model specification, we defined early-tracking systems as countries that track before the 9th grade. Integrated systems, i.e. non-tracking or late-tracking countries, are those with no tracking, or where tracking commences after the 8th or 9th grade. To operationalize the tracking grade, we referred to previous research (Strello et al. 2021) and additional public resources, such as the TIMSS Wiki and Eurydice (EU). An overview of the countries' timing of tracking is provided in Table A3 in the Appendix.
To measure the selectivity of an education system, we utilized data from the PISA school questionnaires. We used the questions asking the school principals to indicate whether admission to their school was based on a student's academic record or the recommendation of feeder schools, since this recommendation, in turn, is often based on students' academic performance. We calculated the percentage of secondary school students in a country and year attending a school that considered either one or both of these factors during the admission process. To ensure the national representativeness of the moderator variable, we calculated the student shares using sampling weights (total student weights). A value of 0 (or 100) indicates that no (or all) students attend selective secondary schools. For the model variants using TIMSS 8 survey data, we matched the school selectivity information obtained from the PISA data to the closest TIMSS country-year observation. Therefore, the moderation models are based on a smaller number of country observations as they include only those countries that participated in the PISA study. In the main analysis, we calculated the share of students attending a secondary school that considers a student's academic record in the admission decision. In alternative specifications, we calculated the share of students attending schools that consider both criteria or rely on the feeder school's recommendation. In all model variants, we mean-centered the school selectivity variable.
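The construction of the moderator can be illustrated with a short sketch: a weighted share of students in selective schools per country-year, followed by mean-centering across country-year observations. Function names and the toy values in the usage note are hypothetical and serve only to show the arithmetic.

```python
def weighted_share(selective_flags, weights):
    """Percentage of students (weighted) attending a school flagged as selective.

    selective_flags: True if the student's school uses the admission criterion.
    weights:         sampling weight per student.
    """
    total = sum(weights)
    selected = sum(w for flag, w in zip(selective_flags, weights) if flag)
    return 100.0 * selected / total


def mean_center(values):
    """Subtract the mean so that 0 corresponds to average selectivity."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

With three students weighted 1, 2, and 1, of whom the first and third attend selective schools, `weighted_share` returns 50.0. After mean-centering, the coefficient of the tracking effect refers to a system with average selectivity rather than to one with a share of zero.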
The countries in our sample not only differ in terms of having integrated or tracked education systems, but they also exhibit other characteristics that might influence both the treatment and the outcome. To account for these differences, we estimated models with country and cohort fixed effects and thereby adjusted for all time-constant variations between countries and cohorts. Additionally, we included three time-varying covariates: GDP per capita ([
To identify the effect of tracking on social segregation and homogeneity of learning environments, we employed a Difference-in-Differences (DiD) approach ([
To account for potential cohort effects, we matched primary and secondary school students from approximately the same cohort (e.g., TIMSS 2011 4th-grade students and TIMSS 2015 8th-grade students). By observing the same student cohort during both elementary and secondary school, we aimed to mitigate cohort-specific confounding factors. It is important to note that some surveys can be matched with several others (e.g., TIMSS 2011 4th-grade students can be matched with both PISA 2012 students and TIMSS 2011 8th-grade students), resulting in multiple observations for certain countries in the dataset. To address this issue, we weighted each observation by the inverse of the number of observations for its country (1/n). This ensured that each country had an equal impact on the analyses, regardless of the number of observations. Furthermore, we estimated our models on the pooled dataset and incorporated country and cohort fixed effects, as well as standard errors clustered by country and cohort.
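The logic of the inverse-count weights and the basic difference-in-differences contrast can be sketched as below. This is a deliberately simplified illustration: it compares weighted cell means only and omits the fixed effects, covariates, and clustered standard errors of the actual models. All names and the toy numbers in the usage note are hypothetical.

```python
from collections import Counter


def inverse_count_weights(countries):
    """Weight each observation by 1/n, with n the observations of its country."""
    counts = Counter(countries)
    return [1.0 / counts[c] for c in countries]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def did_estimate(rows):
    """Simple 2x2 DiD of weighted means (no fixed effects or covariates).

    rows: dicts with keys 'country', 'tracked' (bool), 'secondary' (bool),
          and 'outcome' (e.g., an ICC value).
    """
    weights = inverse_count_weights([r["country"] for r in rows])

    def cell_mean(tracked, secondary):
        cell = [
            (r["outcome"], w)
            for r, w in zip(rows, weights)
            if r["tracked"] == tracked and r["secondary"] == secondary
        ]
        values, cell_weights = zip(*cell)
        return weighted_mean(values, cell_weights)

    change_tracked = cell_mean(True, True) - cell_mean(True, False)
    change_integrated = cell_mean(False, True) - cell_mean(False, False)
    return change_tracked - change_integrated
```

As a toy example with invented values, a tracked country whose outcome moves from .20 to .50 and an integrated country observed in two cohorts moving from .20 to .25 each time would give an estimate of about .25, and the integrated country would not count double despite contributing twice as many rows.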
Table 1: Model specifications of the multiverse analysis
| Dimension / Specification | Model |
| --- | --- |
| I Operationalization | |
| SES | M1b-M2b |
| 2 – Parents' highest educational level | M1b-M2b |
| Achievement | M1a-M2a |
| 2 – Reading | M1a-M2a |
| 3 – Science | M1a-M2a |
| Homogeneity/Segregation indices | M1-M2 |
| 2 – Dissimilarity Index | M1-M2 |
| Selectivity | M2 |
| 2 – Share of students in secondary schools considering the feeder school recommendations in their admission decision | M2 |
| 3 – Share of students in schools considering academic performance criteria or feeder school recommendations in their admission decision | M2 |
| Early between-school tracking | M1-M2 |
| 2 – Tracking before 9th grade | M1-M2 |
| II Covariates | |
| GDP per capita | M1-M2 |
| 2 – GDP per capita excluded | |
| Population density | M1-M2 |
| 2 – Population density excluded | |
| Private secondary schools | M1-M2 |
| 2 – Share of students in private schools included | M1-M2 |
| III Fixed effects | |
| Cohort | M1-M2 |
| Country | M1-M2 |
| Cluster robust standard errors | |
| Cohort | M1-M2 |
| 2 – Not adjusted | M1-M2 |
| Country | M1-M2 |
| IV Subsample | |
| Country subsample: secondary school enrollment | M1-M2 |
| 2 – Secondary school enrollment > 80 percent | M1-M2 |
| Country subsample: economy's income group (World Bank 2021b) | M1-M2 |
| 2 – Low- and low-middle-income economies excluded | M1-M2 |
| 3 – Low-, low-middle-, and upper-middle-income economies excluded | M1-M2 |
| PISA: grade selection | M1-M2 |
| 2 – Modal grade | M1-M2 |
Note: The initial model specifications of the main models M1 and M2 are listed first (indicated with 1) and italicized, the alternative specifications of the multiverse analysis are listed subsequently.
Figure 1: Homogeneity of learning environments in primary and secondary school for tracked and non-/late-tracking education systems
Note: Homogeneity of learning environments was measured using the ICC of math achievement (see Table A1 in the Appendix for a detailed variable description). The figure is based on 522 observations (74 observations for early-tracking countries; 448 observations for late/no tracking countries).
Before we turn to the DiD models, we initially examine the data by presenting the changes in the homogeneity of learning environments (Figure 1) and social segregation (Figure 2). These figures illustrate the mean scores for countries without (represented by solid lines) and with an early between-school tracking system (represented by dashed lines) before and after the implementation of tracking (primary vs. secondary school). Figure 1 reveals three patterns. First, the level of homogeneity of learning environments is similar between the two education systems during primary school (i.e., pre-treatment). Second, there is a minimal change in the homogeneity of learning environments between primary and secondary school in integrated education systems, i.e., non-tracking/late-tracking systems (solid line). Lastly, in contrast to integrated systems, early-tracking education systems demonstrate increased homogeneity of their learning environments during secondary school compared to primary school (dashed line).
Figure 2 demonstrates that both integrated and tracked education systems exhibit a comparable mean level of social segregation during primary school (i.e., pre-treatment). Integrated systems show little change in their average social segregation between primary and secondary school (i.e., post-treatment), while early-tracking education systems demonstrate increased social segregation during secondary education.
Taken together, these observations indicate systematic differences in the trends of both achievement homogeneity and social segregation between the two education systems. Tracked education systems experience an increase in both outcomes, whereas the more integrated education systems maintain stability.
We begin by discussing our initial specifications before turning to the multiverse analyses. Our first research question aims to investigate whether tracking contributes to a more homogeneous learning environment (Hypothesis 1) and higher social segregation of schools (Hypothesis 2).
Figure 2: Social segregation in primary and secondary school for tracked and non-/late-tracking countries
Note: Social segregation was measured using the ICC of books at home (books 1, see Table A1 in the Appendix for a detailed variable description). The figure is based on 794 observations (148 observations for early-tracking countries; 646 observations for late/no tracking countries).
Model 1 (Table 2) displays the DiD-estimate of early tracking (time x early tracking) on both the homogeneity of learning environments (M1a) and the social segregation (M1b). This estimate indicates whether the tracking of students leads to an increase in homogeneity of learning environments (or alternatively, in social segregation) when compared to integrated education systems.
Table 2: Model 1 – Tracking Effect
M1a: Homogeneity of learning environments (ICC of math achievement); M1b: Social segregation (ICC of books at home)

| | M1a Coef. | M1a Cluster-robust Std. Err. | M1b Coef. | M1b Cluster-robust Std. Err. |
| --- | --- | --- | --- | --- |
| Time: secondary school (Ref.: primary school) | –.015 | (.021) | –.026 | (.015) |
| Tracking (Ref.: late-tracking countries) | | | | |
| Time x Tracking | .255*** | (.051) | .038 | (.031) |
| Population Density | .622** | (.182) | .021 | (.072) |
| Time x Population Density | –.477*** | (.060) | –.221*** | (.039) |
| GDP per capita | –.763** | (.219) | –.034 | (.044) |
| Time x GDP per capita | .621*** | (.095) | .302*** | (.067) |
| Intercept | .320*** | (.031) | .145*** | (.010) |
| Country fixed effects | ✓ | | ✓ | |
| Cohort fixed effects | ✓ | | ✓ | |
| Observations | 522 | | 794 | |
| Countries | 58 | | 60 | |
| Cohorts | 12 | | 18 | |
| R-squared | .78 | | .70 | |
| Within R-squared | .36 | | .096 | |
Note: The outcome variable of model M1a is the ICC of math achievement (see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 58 country and 12 cohort clusters. The outcome variable of model M1b is the ICC of books at home (books 1, see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 60 country and 18 cohort clusters. The varying numbers of observations between the models result from the fact that the respective outcome variables were surveyed with different frequencies (see data section). * p < .05; ** p < .01; *** p < .001
Our findings regarding the homogeneity of learning environments suggest that tracking indeed contributes to an increase in the similarity of student achievement within schools (M1a: b = .255, rob. SE = .051). Specifically, the ICC of math achievement increases by about .25 units more between primary and secondary school in early-tracking education systems than in late-tracking systems. This effect is substantial considering that our dependent variable has a standard deviation of .152 and ranges from .04 to a maximum of .79 (Table A4 in the Appendix). Thus, our results support Hypothesis 1, indicating that tracking leads to a more homogeneous learning environment.
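For readers less familiar with the outcome, the sketch below illustrates the idea behind an ICC as the share of total variance that lies between schools. It is a simple descriptive decomposition on hypothetical data; the estimator actually used by the authors is described in Table A1 and may differ (for example, a model-based ICC). The function name `between_school_share` is an assumption made for illustration.

```python
import numpy as np

def between_school_share(scores, school):
    """Share of the total sum of squares that lies between school means."""
    scores = np.asarray(scores, dtype=float)
    school = np.asarray(school)
    grand_mean = scores.mean()
    between = 0.0
    for s in np.unique(school):
        group = scores[school == s]
        # Squared distance of the school mean from the grand mean,
        # weighted by the number of students in the school.
        between += group.size * (group.mean() - grand_mean) ** 2
    total = ((scores - grand_mean) ** 2).sum()
    return between / total

# Hypothetical example 1: students are perfectly sorted by achievement.
sorted_share = between_school_share([1, 1, 3, 3], [0, 0, 1, 1])

# Hypothetical example 2: both schools have the same mix of students.
mixed_share = between_school_share([1, 3, 1, 3], [0, 0, 1, 1])
```

In the first example all variation lies between schools, so the share is 1; in the second, the school means are identical and the share is 0. A higher value therefore corresponds to more homogeneous schools (or, for a social background variable, more segregated schools).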
Graph: Figure 3: Predictive margins plot of the DiD-tracking effect on homogenous learning environment (model M1a)
Note: The outcome variable of model M1a is the ICC of math achievement (see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 58 country and 12 cohort clusters.
Figure 3 visualizes the predicted DiD estimate of tracking. It demonstrates that in countries with an integrated education system, there is no substantial increase in the homogeneity of learning environments between primary and secondary school. Conversely, in tracked countries, the homogeneity of learning environments tends to increase as students progress to secondary school. Next, we examine the tracking effects on social segregation in model M1b. We find a positive effect of early tracking on social segregation between schools (b = .038, rob. SE = .031), although this estimate is subject to a high degree of statistical uncertainty. The predicted impact is nevertheless not small, given that the predicted change constitutes about 57 percent of the standard deviation in social segregation (.066). Figure 4 illustrates that social segregation tends to increase in early-tracking countries, although this increase does not reach statistical significance. Thus, model M1b does not confirm Hypothesis 2.
The incremental change of the within R² before (see Table A6 in the Appendix) and after including the DiD tracking effect (see Table 2) suggests that tracking has more explanatory power for the homogeneity in achievement than for the social segregation of schools. The within R
Our second research question examines whether a stronger school selectivity based on prior achievement moderates the effect of tracking on the two outcome variables (Table 3). Contrary to our hypothesis, the results from Model M2a do not indicate that school selectivity moderates the effect of tracking on the homogeneity of learning environments (b = –.002, rob. SE = .002). Similarly, Model M2b does not suggest that school selectivity mitigates the positive effect of tracking on social segregation (b = –.000, rob. SE = .001). These findings collectively indicate that a higher degree of selectivity of secondary schools does not lead to increased homogeneity of learning environments within tracked systems, nor does it alleviate the potential increase in social segregation. Thus, our findings do not provide support for Hypotheses 3a and 3b.
Graph: Figure 4: Predictive margins plot of the DiD-tracking effect on social segregation (model M1b)
Note: The outcome variable of model M1b is the ICC of books at home (books 1, see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 60 country and 18 cohort clusters.
Table 3: Model 2 – Moderating effect of tracking x school selectivity
M2a: Homogeneity of learning environments (ICC of math achievement); M2b: Social segregation (ICC of books at home)

| | M2a Coef. | M2a Cluster-robust Std. Err. | M2b Coef. | M2b Cluster-robust Std. Err. |
|---|---|---|---|---|
| Time: secondary school (Ref.: primary school) | .008 | (.038) | –.034 | (.021) |
| Tracking (Ref.: late-tracking countries) | | | | |
| Time x Tracking | .255*** | (.029) | .049 | (.026) |
| School Selectivity | –.001 | (.001) | –.000 | (.000) |
| Time x School Selectivity | .002 | (.001) | .000 | (.001) |
| School Selectivity x Tracking | .002* | (.001) | .001 | (.001) |
| Time x School Selectivity x Tracking | –.002 | (.002) | –.000 | (.001) |
| Private Secondary School | –.084 | (.152) | –.167* | (.078) |
| Time x Private Secondary School | –.109 | (.162) | –.019 | (.097) |
| Population Density | .292** | (.073) | –.294 | (.157) |
| Time x Population Density | –.551** | (.134) | –.243* | (.086) |
| GDP per capita | –.437** | (.113) | .284 | (.162) |
| Time x GDP per capita | .677** | (.179) | .326* | (.114) |
| Intercept | .323*** | (.031) | .176*** | (.014) |
| Country fixed effects | ✓ | | ✓ | |
| Cohort fixed effects | ✓ | | ✓ | |
| Observations | 360 | | 618 | |
| Cluster: Country | 34 | | 36 | |
| Cohorts | 12 | | 18 | |

Note: The outcome variable of model M2a is the ICC of math achievement (see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 34 country and 12 cohort clusters. The outcome variable of model M2b is the ICC of books at home (books 1, see Table A1 in the Appendix for a detailed variable description). The model's standard errors are adjusted for 36 country and 18 cohort clusters. The varying numbers of observations between the models result from the fact that the respective outcome variables were surveyed with different frequencies (see data section). * p < .05; ** p < .01; *** p < .001
However, these results represent only one specification out of a large number of plausible specifications (Simonsohn et al. 2020; Steegen et al. 2016). Therefore, we employed multiverse analysis to assess the robustness of our findings across alternative specifications. Figure 5 displays the specification curves of the tracking effect in Model M1a, which examines the effect of early tracking on the homogeneity of learning environments. To generate the specification curves, we ran 5,760 models. These models constitute all possible combinations of model choices that can be derived from the different specifications described in Table 1. To simplify the visual presentation, we randomly selected 100 models from the total of 5,760 models (Simonsohn et al. 2020). The specification curves based on all specifications are presented in the Appendix (Figure A3 and A4). Each point in the upper part of the figure represents the estimated tracking effect for a particular specification together with the corresponding 95 percent confidence interval. The gray dots indicate statistically insignificant coefficients and black dots statistically significant coefficients. The black circle indicates the estimate of our initial model specification that we discussed in the previous section. The lower part of the figure presents the specifications of each model, such as the segregation indicator and covariates used.
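The construction of such a multiverse can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the decision dimensions and their options in `choices` are hypothetical placeholders (the actual ones are listed in Table 1), and the seed is arbitrary. The sketch only shows how all combinations are enumerated and how a random subset of 100 is drawn for display.

```python
import itertools
import random

def build_multiverse(choices):
    """Return one dict per combination of analytic choices."""
    keys = list(choices)
    combos = itertools.product(*(choices[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]

# Hypothetical decision dimensions (placeholders, not those of Table 1).
choices = {
    "outcome": ["icc", "dissimilarity_a", "dissimilarity_b", "other_index"],
    "sample": ["all_countries", "subsample_a", "subsample_b"],
    "covariates": ["none", "set_a", "set_b", "set_a_and_b"],
    "tracking_definition": ["definition_a", "definition_b", "definition_c"],
}

specifications = build_multiverse(choices)  # 4 * 3 * 4 * 3 combinations

# Draw a reproducible random subset for plotting (seed is arbitrary).
rng = random.Random(2024)
displayed = rng.sample(specifications, k=min(100, len(specifications)))
```

Each element of `specifications` would then be passed to the estimation routine, so that the number of models equals the product of the number of options per dimension.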
The results of the multiverse analysis demonstrate the high robustness of our findings in terms of direction and statistical significance: The tracking effect on the homogeneity of learning environments was consistently positive across all model specifications and statistically significant in 99.8 percent of the models (p-value <.05 in 5,749 out of 5,760 models) (Figure 5). This indicates that the positive relationship between early tracking and the homogeneity of learning environments is a robust finding. Despite the overall robustness of the results, we observe some variation in the effect size. The operationalization of the outcome variable appears to have a notable influence on the magnitude of the tracking effect. Specifically, the dissimilarity index based on the 10 percent quantile cutoff seems to yield systematically lower tracking effects, whereas using the ICC as outcome is associated with systematically stronger tracking effects.
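The summary statistics reported for a specification curve can be computed along the following lines. The function `summarize_specifications` and the toy inputs are assumptions for illustration; only the counts 5,749 and 5,760 in the final check are taken from the text.

```python
import numpy as np

def summarize_specifications(estimates, p_values, alpha=0.05):
    """Percent of positive and of statistically significant estimates."""
    estimates = np.asarray(estimates, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    return {
        "n_models": int(estimates.size),
        "pct_positive": round(100 * float(np.mean(estimates > 0)), 1),
        "pct_significant": round(100 * float(np.mean(p_values < alpha)), 1),
    }

# Hypothetical toy multiverse with four specifications.
toy_summary = summarize_specifications(
    estimates=[0.10, 0.20, -0.05, 0.30],
    p_values=[0.01, 0.20, 0.60, 0.04],
)

# Arithmetic check of the share reported for model M1a in the text.
pct_significant_m1a = round(100 * 5749 / 5760, 1)
```

The last line reproduces the 99.8 percent reported above from the stated counts.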
Graph: Figure 5: Specification curve of the tracking effect on the homogeneity of learning environments (model M1a)
Note: The specification curve is based on a random sample of n = 100 out of 5,760 model specifications. The tracking effect of the initial model specification (M1a) discussed above is b = .255.
Graph: Figure 6: Specification curve of the tracking effect on social segregation (model M1b)
Note: The specification curve is based on a random sample of n = 100 out of 3,840 model specifications. The tracking effect of the initial model specification (M1b) discussed above is b = .038.
Figure 6 illustrates the specification curve for the tracking effect on social segregation (model M1b). The possible combinations of different specifications led to a total of 3,840 models. Again, we drew a random sample of n = 100 for presentation. The statistical significance of the estimates is less consistent, with only 26 percent of the estimated coefficients being statistically significant (p-value <.05 in 999 out of 3,840 models). However, the results are consistent in terms of sign stability: 87.7 percent of the estimated coefficients are positive (b > 0 in 3,366 out of 3,840 models). Similar to the main analysis, we do not find clear support for the notion that tracking increases social segregation between schools. However, based on the robust positive tracking effect, we cannot easily reject Hypothesis 2. Similar to the specification curve described above, Figure 6 suggests that the operationalization of the outcome variable influences the effect size. Specifically, outcomes based on parental education appear to be associated with smaller tracking effects compared to outcomes based on the books at home variable. To further investigate the impact of different model variants on the tracking effects, we conducted an influence regression using Young and Holsteen's (2017) approach. The influence regression provides insight into how each specification, on average, affects the tracking effect on homogenous learning environment (coefficient: Figure A5 in the Appendix; p-value: Figure A6 in the Appendix) and social segregation (coefficient: Figure 7; p-value: Figure 8). Given the high robustness of the tracking effect on homogenous learning environments and large variance observed for social segregation, we focus on the influence regression of the latter outcome variable.
The magnitude of the estimated tracking effects on social segregation (Figure 7) and their statistical significance (Figure 8) are primarily influenced by the operationalization of the outcome variable. Specifically, models based on parents' education and the dissimilarity index tend to yield lower tracking coefficients and higher p-values compared to our initial model specification. Similarly, using a social segregation variable based on more than 200 books at home (books 6) decreases the estimated effect by .03. However, we find evidence of tracking having a segregation-enhancing effect when examining the segregation of students from low socioeconomic status (SES) families, as indicated by the operationalizations based on books 3 and books 4. This suggests that tracking might indeed increase social segregation, but not in a uniform way: It does not appear to increase segregation among children coming from the highest social strata (referred to as "elite-segregation") but to increase social segregation among children from disadvantaged social backgrounds. This would be consistent with a situation in which children from high- and middle-class backgrounds predominantly attend the higher track schools, whereas children from disadvantaged, lower-class backgrounds attend the lower track schools. Moreover, models based on a country sample that is more similar in terms of income levels lead to higher tracking estimates and lower p-values. For instance, when utilizing subsamples consisting of high-income economies, the tracking effect increases by .014 (Figure 7) and the p-value decreases by .12 (Figure 8).
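The idea of an influence regression can be sketched as an OLS regression of the estimated tracking effects on indicators for the specification choices. The example below is a deliberately small toy under strong assumptions: only two binary choices, a purely additive structure, and no noise. The values .038, .03, and .014 are borrowed from the text for illustration only; the variable names are hypothetical, and the actual influence regression covers many more model ingredients.

```python
import numpy as np

# Toy multiverse: every combination of two binary specification choices.
specs = [(books6, high_income)
         for books6 in (0, 1)
         for high_income in (0, 1)]

# Design matrix: intercept (reference specification) plus one dummy per choice.
X = np.array([[1.0, b, h] for b, h in specs])

# Toy "estimated tracking effects", built additively without noise.
estimates = np.array([0.038 - 0.030 * b + 0.014 * h for b, h in specs])

# OLS of the estimates on the specification dummies.
coef, *_ = np.linalg.lstsq(X, estimates, rcond=None)
influence = dict(zip(["reference", "books6", "high_income"], coef))
```

Each coefficient in `influence` then indicates how much, on average, switching one specification choice shifts the estimated tracking effect relative to the reference specification. The same regression with p-values as the dependent variable gives the corresponding influence on statistical uncertainty.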
Graph: Figure 7: Influence regression for regression coefficients of the tracking effect on social segregation (model M1b)
Note: The influence regression is based on 3,840 model specifications. The tracking effect of the initial model specification (M1b) is b = .038.
Next, we assessed the robustness of the moderating effect of school selectivity on the homogeneity of learning environments. Figure 9 presents the specification curve for this analysis (model M2a; refer to Figure A7 in the Appendix for the specification curve encompassing all specifications). The possible combinations of specifications yielded a total of 17,280 models, from which we again drew a random sample of n = 100 models for presentation. The specification curves indicate that school selectivity does not moderate the tracking effect. Only approximately 43.8 percent of the model specifications (7,576 out of 17,280 models) exhibit a positive moderation effect of school selectivity on tracking, and only 11.9 percent of the specifications (2,056 out of 17,280 models) reach statistical significance.
Graph: Figure 8: Influence regression for p-values of the tracking effect on social segregation (model M1b)
Note: The influence regression is based on 3,840 model specifications. The p-value of the initial model specification (M1b) is p = .239.
In the multiverse analysis in Figure 10 (see Figure A8 in the Appendix for the specification curve based on all specifications), we examined whether school selectivity moderates the tracking effect on social segregation. The different specifications yielded 11,520 models; again, we used a randomly selected sample of n = 100 models for visualization. The specification curves suggest that school selectivity does not moderate the tracking effect on social segregation. The moderation effects are negative in about 40.7 percent of model specifications (4,688 out of 11,520 models), with statistical significance found in only about two percent of the model specifications (239 out of 11,520 models). Given the high robustness of these results, we do not delve further into the influence of the individual specifications. Detailed results of the influence regressions for the homogenous learning environment models (Figures A9 and A10) and the social segregation models (Figures A11 and A12) are provided in the Appendix.
In summary, the findings of the multiverse analysis do not support our hypotheses that school selectivity moderates the tracking effects. The results suggest that school selectivity neither reinforces the positive tracking effect on homogenous learning environments nor mitigates the tracking effect on social segregation.
Graph: Figure 9: Specification curve of the moderation effect (selectivity x tracking) on the homogeneity of learning environments (model M2a)
Note: The specification curve is based on a random sample of n = 100 out of 17,280 model specifications. The tracking x selectivity effect of the initial model specification (M2a) discussed above is b = –.002.
Graph: Figure 10: Specification curve of the moderation effect (selectivity x tracking) on social segregation (model M2b)
Note: The specification curve is based on a random sample of n = 100 out of 11,520 model specifications. The tracking x selectivity effect of the initial model specification (M2b) discussed above is b = –.000.
In this study, we aimed to contribute to the ongoing debate on early between-school tracking. We took a step back from the debate on the consequences of tracking for (inequality of) achievement and investigated its initial effects. Specifically, we asked whether tracking leads to more homogenous learning environments and increased social segregation. We also explored whether these tracking effects are moderated by the selectivity of an education system. Despite the importance of these questions for understanding the empirical evidence on the effects of tracking on achievement inequality, they have received surprisingly little attention (but see Strello et al. 2021; Engzell & Raabe 2023). To address these questions, we utilized a Difference-in-Differences approach by pooling data from PISA, PIRLS, and TIMSS covering a period of 24 years. Moreover, to ensure robustness, we conducted a multiverse analysis consisting of a wide range of plausible specifications.
While our study provides robust evidence that tracking leads to increased similarity in student achievement, the results regarding similarity in social status are more varied. Regardless of the sample composition, variable operationalization, and control variables, we consistently find a positive effect of tracking on the homogenization of learning environments across all 5,760 specifications, and these effects are virtually always statistically significant (99.8 percent). In contrast, the results for the effects of tracking on social segregation are mixed. A common criticism is that students from privileged (disadvantaged) backgrounds tend to be overrepresented in higher (lower) tracks. Although the vast majority (87.7 percent) of our model specifications result in positive estimates (indicating increased social segregation due to tracking), only about a quarter of them reach statistical significance. Our multiverse analysis revealed that the effect size and level of statistical uncertainty strongly depend on the operationalization of the outcome.
Interestingly, the influence regression predicts that tracking has the strongest effect on social segregation among students from lower social backgrounds while having a less pronounced effect on students from middle or high social backgrounds. This suggests that tracking might work in a non-uniform way: It does not appear to increase segregation among children coming from the highest social strata (i.e., elite segregation) but rather at the other end of the social distribution, i.e., among children from disadvantaged social backgrounds.
With regard to school selectivity (i.e., the extent to which the sorting process is based on prior achievement), our results contradict the hypotheses. The vast majority of specifications do not indicate that selectivity influences the relationship between tracking and similarity in student achievement or between tracking and student social background. This is in line with the heterogeneous findings in the literature on selectivity based on within-country designs (cf. Esser & Hoenig 2018; Jähnen & Helbig 2015; Lorenz et al. 2023).
How can we reconcile these findings with existing literature? The essential intention of between-school tracking – increased similarity among students – appears to be fulfilled. However, the majority of empirical evidence suggests that while tracking does not lead to an increase in (average) achievement, it does contribute to social inequality (Terrin & Triventi 2022). Three key factors should be considered in this context.
First, tracking may coincide with stratification (Betts 2011; Terrin & Triventi 2022), which implies that higher tracks generally offer better learning conditions, such as improved student-to-teacher ratios (Brunello & Checchi 2007: 795), or a self-selection of highly motivated teachers into more prestigious (and/or better paying) higher tracks (Betts 2011). This might also give rise to a stigmatization of students in the lower tracks ([
Second, it is important to recognize that homogeneity of learning environments is not exclusive to between-school tracking. In integrated (i.e. non-/late-tracked) systems, within-school tracking or streaming may exist (Betts 2011; [
Third, the debate on tracking often overlooks the fact that social segregation in schools is also prevalent in integrated systems ([
We acknowledge five main limitations of our study. First, our large-scale comparative approach investigates aggregated effects at the country level, which limits our ability to capture variation within countries. While this approach allows us to observe a large sample of education systems over time, we may miss important nuances and heterogeneity within each country.
Second, our study combined data from different datasets (PISA, TIMSS, and PIRLS), which vary in sampling schemes and measures ([
Third, when interpreting the theoretically unexpected (non-)effects of selectivity, two things should be considered. First, our selectivity measure is necessarily noisy. It is aggregated from countries which are potentially heterogeneous at the level of states or districts. Second, while our measure captures whether student achievement was considered, it provides only limited variation in the degree to which achievement was considered. Though there likely are larger differences in the selectivity of early between-school tracking countries, our data indicate an overall high level of school selectivity in tracked education systems. To better understand the effects of selectivity on tracking, future research could further explore within-country differences (cf. Esser & Hoenig 2018; Jähnen & Helbig 2015).
Fourth, our use of a Difference-in-Differences design, while appropriate for our analysis, relies on the parallel-trends assumption. While this assumption can never be explicitly tested (because it refers to a counterfactual outcome), approaches exist to make this assumption more plausible. However, these approaches rely on the existence of rich data (e.g., to gauge prior trends), or specific conditions (e.g., triple DiD) which are not available in our case. We argue, however, that there are reasons to believe that our approach is appropriate. Note the absence of (substantial) differences in the levels of homogenous learning environments or social segregation of schools when both groups are observed in primary school, but an impressive divergence when these groups are observed after a potential tracking policy was applied. Moreover, this result is robust when a number of potential confounders are directly (e.g., students in private schools) or indirectly controlled (e.g., country and cohort fixed effects).
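In generic notation (the symbols are introduced here for illustration and do not appear in the original text), the two-group, two-period DiD contrast and the parallel-trends assumption it relies on can be written as follows, where "early" and "late" denote early-tracking and late-/non-tracking countries, "prim" and "sec" denote primary and secondary school, and Y(0) is the outcome that would be observed without early tracking:

```latex
% DiD contrast of group means
\hat{\delta}_{\mathrm{DiD}}
  = \bigl(\bar{Y}_{\mathrm{early},\,\mathrm{sec}} - \bar{Y}_{\mathrm{early},\,\mathrm{prim}}\bigr)
  - \bigl(\bar{Y}_{\mathrm{late},\,\mathrm{sec}} - \bar{Y}_{\mathrm{late},\,\mathrm{prim}}\bigr)

% Parallel-trends assumption on the untreated potential outcome
E\bigl[\,Y_{\mathrm{sec}}(0) - Y_{\mathrm{prim}}(0) \mid \mathrm{early}\,\bigr]
  = E\bigl[\,Y_{\mathrm{sec}}(0) - Y_{\mathrm{prim}}(0) \mid \mathrm{late}\,\bigr]
```

The second expression involves the counterfactual outcome of early-tracking countries in secondary school, which is never observed; this is why the assumption can be made more plausible but not tested directly.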
Fifth, our measure of homogenous learning environments could be influenced by differential learning rates across tracks, since our data do not allow us to control for this effect. However, our influence regression shows that there are only small differences in effect sizes between our different tracking definitions (which indirectly measure the time students spend together in secondary education). Thus, differences in learning rates should not drive our results.
Finally, multiverse analysis is a great approach to making data-analytic decisions transparent and to assessing the robustness of the results (Simonsohn et al. 2020; Steegen et al. 2016; Young & Holsteen 2017). However, it offers no remedy if the underlying design is flawed. In other words, if the models are misspecified, it does not matter if we estimate one or 100,000 models. Thus, one should still be cautious when interpreting the results. Nevertheless, we believe that a multiverse analysis can increase the reliability and the credibility of empirical investigations.
This study expands our knowledge of educational tracking by empirically examining two potential effects arising from tracking: (a) the promotion of homogeneous learning environments, and (b) the potential to exacerbate social segregation among schools. Furthermore, we investigated the moderating role of school selectivity to gain a better understanding of the nuanced effects of tracking. Our findings support the notion that tracking does, indeed, increase the homogeneity of learning environments. It can contribute to the social segregation of schools, apparently more at the lower end of the social strata. However, our analysis did not find evidence suggesting that higher selectivity in tracking systems moderates these associations.
In addition to these empirical contributions, we underscored the importance of systematically testing the robustness of findings across various model specifications. By using large-scale assessment data, we demonstrated how multiverse analyses can enhance transparency in data-analytical decision making and shed light on the potential implications for empirical results.
We used survey data from the PISA, PIRLS, and TIMSS studies. The PISA data between 2000 and 2018 are available at https://
The macro data on the tracking grade, GDP, and population density are available from the article's supplementary materials. GDP and population density were derived from the World Bank: https://databank.worldbank.org/source/world-development-indicators.
The R and Stata scripts for the data preparation and analyses are available at the following link: https://osf.io/cuq75/?view_only=dd09143528e043b6ba2147dcdcb71ba9.
We would like to express our gratitude to the three anonymous reviewers whose insights and suggestions substantially contributed to improving this manuscript. Our thanks also extend to our student assistants, Lisanne Strasser, Nakia El-Sayed, Marc Pelzer and Luisa Zecher, for their support. This study was conducted as part of a project financed by the Deutsche Forschungsgemeinschaft (DFG) under grant number 430266278.
By Maximilian Brinkmann; Nora Huth-Stöckle; Reinhard Schunck and Janna Teltemann
Maximilian Brinkmann, born 1990 in Remscheid. Studied economics and sociology in Düsseldorf, Wuppertal, and Groningen. Since 2020 research associate at the University of Hildesheim in the DFG research project "BiMiBi – Bildungssysteme und migrationsspezifische Bildungsungleichheit". Research interests: sociology of education, education systems, quantitative methods, and causal analysis.
Nora Huth-Stöckle, born 1989 in Aachen. Studied social science with the subjects sociology, political science, and economics at the University of Cologne. Studied sociology at the University of Duisburg-Essen. From 2017 to 2020 research associate at the GESIS Institut der Sozialwissenschaften in Cologne in the BMBF research project "Solikris – Veränderung durch Krisen? Solidarität und Entsolidarisierung in Deutschland und Europa". Since 2020 research associate at the University of Wuppertal in the DFG research project "BiMiBi – Bildungssysteme und migrationsspezifische Bildungsungleichheit". Research interests: intergroup relations, prejudice, educational inequality. Most important publications: Explaining immigrants' social distance towards natives: A multilevel mediation approach across immigrant groups in Germany. Social Science Research, 114, 2023: 102907 (with E. Schlüter); Economic conditions and perceptions of immigrants as an economic threat in Europe: Temporal dynamics and mediating processes. International Journal of Comparative Sociology, 62(
Reinhard Schunck, born 1979 in Bonn. Studied social sciences in Mannheim, Utrecht (Netherlands), and Bloomington, Indiana (USA). Doctorate in 2011 at the Bremen International Graduate School of Social Sciences, University of Bremen. From 2010 to 2016 at Bielefeld University. From 2016 to 2019 at the GESIS-Leibniz-Institut für Sozialwissenschaften. Since 2019 Professor of Sociology at the Bergische Universität Wuppertal. Research foci: social inequality, migration, family, and quantitative methods. Most important publications: Within- and between-cluster effects in generalized linear mixed models: A discussion of approaches and the xthybrid command. The Stata Journal, 17(
Janna Teltemann, born 1980 in Uelzen. Studied sociology at the University of Bremen; from 2007 to 2016 research associate at the Institut für empirische und angewandte Soziologie and at the Collaborative Research Center 597 "Staatlichkeit im Wandel" at the University of Bremen. Doctorate in 2012 at the University of Bremen. From 2016 to 2019 junior professor and from 2019 to 2023 W2 professor at the University of Hildesheim. Since 2023 W3 professor of sociology of education at the University of Hildesheim. Research foci: educational inequality, education policy, migration-related educational inequality, international comparative social research. Most important publications: Standardized Testing, Use of Assessment Data and Low Reading Performance of Immigrant and Non-Immigrant Students in OECD Countries. Frontiers 5, 2020 (with R. Schunck); Education systems, school segregation, and second-generation immigrants' educational success: Evidence from a country-fixed effects approach using three waves of PISA. International Journal of Comparative Sociology 57, 2016: 401–424 (with R. Schunck); Räumliche Segregation von Familien mit Migrationshintergrund in deutschen Großstädten: Wie stark wirkt die sozioökonomische Restriktion? Kölner Zeitschrift für Soziologie und Sozialpsychologie, 1, 2015: 83–103 (with S. Dabrowski & M. Windzio).