PhD - Applications of Multilevel Modelling
Exploring the assumption of no correlation of explanatory variables with random effects.
Project aims and activities
The overarching aim of my PhD project is to improve understanding about the implications of using multilevel models with social data where the random effects are correlated with explanatory variables.
In more detail
Random effects models are used in the social sciences to handle data where cases are clustered in some way, either because of the way the data have been sampled, or because conceptually we believe that cases are grouped, so that important differences between these groups might shape the processes affecting them.
For example, if we are interested in the relationship between parental income and school attainment, and our data come from pupils from a range of schools, we might suppose that average outcomes will vary between schools for various reasons we can't measure (or just haven't). Perhaps one school has a sporty ethos, and another is very focused on diversity and tolerance, while a third has suffered a lot of disruption because of a building which is poorly maintained.
These differences could mean that the children within any one of these schools have outcomes which differ from the overall average in similar ways. Some sort of multilevel model is needed to cope with that structure in the data.
Imagine a model which tries to describe how exam scores vary with parental income. One common approach would be to use a random intercepts model, which would allow a different 'baseline' exam score for each school, and estimate the effect of wealth on exam scores relative to that baseline.
This improves our analysis in two ways:
Firstly, we get a better estimate of the individual effect of parental income on exam performance;
Secondly, we can explore the balance of different forces within the model, e.g. by measuring how much difference in exam score is explained by individual differences in wealth, and how much is related to membership of a different school.
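As a sketch, one common way of writing a random intercepts model for pupil i in school j is shown below. The notation is an assumption made for illustration and is not taken from the project itself: y is the exam score, x is parental income, u is the school-specific shift in the baseline, and e is the pupil-level residual. The second expression, often called the variance partition coefficient, is one way of summarising how much of the unexplained variation lies between schools.

```latex
\[
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2)
\]
\[
\mathrm{VPC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
\]
```

Here the slope \beta_1 corresponds to the first benefit listed above, and the variance components \sigma_u^2 and \sigma_e^2 correspond to the second.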
We can extend this specification to explore differences between schools in the relationship between wealth and attainment using a random slopes model. The graphs above show simulated (fake!) data containing relationships we could plausibly find in such a sample. In the second graph, colour coding the same data points by cluster reveals that the relationship between wealth and attainment is weaker than it first seemed.
However, these models rely on an oft-violated underlying assumption of no correlation of explanatory variables with random effects (henceforth the 'NCRX assumption'). If this assumption is not met, the resulting estimates can be inaccurate.
To continue the example above, we can easily imagine that parental income might tend to be higher in schools where exam scores are above average, perhaps because wealthier parents have the means and motivation to move to an area where their child can attend an apparently high-achieving school.
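In symbols, and assuming for illustration a model of the form y_ij = β0 + β1 x_ij + u_j + e_ij (with u_j the school effect and x_ij parental income), the NCRX assumption and one hypothetical way it could fail might be sketched as follows. The coefficient γ and the school mean income are illustrative devices, not quantities defined by the project.

```latex
\[
\operatorname{Cov}(x_{ij}, u_j) = 0
\qquad \text{(NCRX assumption)}
\]
\[
u_j = \gamma \, \bar{x}_j + v_j, \quad \gamma \neq 0
\qquad \text{(one illustrative violation: the school effect tracks mean income } \bar{x}_j \text{)}
\]
```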
If this confounding factor confuses the model, we could draw the wrong conclusions about the way in which family wealth is involved with school attainment. In the graphs above, (fake) individuals with high income scores are clustered together in schools with high average attainment. The grey dashed line reveals a high correlation between wealth and exam scores, but the flatter, coloured lines show that school-specific relationships are weaker. What would we conclude about the effectiveness of individual schools? What about the school system as a whole?
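The pattern described here can be reproduced with a small simulation. The sketch below is not the code behind the graphs; it is a minimal illustration using only the Python standard library, and every number in it (50 schools, 40 pupils each, a true within-school slope of 0.3, a school effect that rises with mean income) is an assumption chosen for the example. It compares a single pooled regression slope, analogous to the grey dashed line, with a slope estimated from school-mean-centred data, analogous to the flatter school-specific lines.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def simulate_schools(n_schools=50, n_pupils=40, beta=0.3, gamma=1.0, seed=1):
    """Fake pupils in schools; all parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    income, score, school = [], [], []
    for j in range(n_schools):
        mean_income = rng.gauss(0, 1)
        # School effect deliberately correlated with the school's mean income,
        # which is what violates the NCRX assumption.
        u_j = gamma * mean_income + rng.gauss(0, 0.5)
        for _ in range(n_pupils):
            x = mean_income + rng.gauss(0, 1)
            income.append(x)
            score.append(beta * x + u_j + rng.gauss(0, 1))
            school.append(j)
    return income, score, school


def centre_within_groups(values, groups):
    """Subtract each group's mean from the values of its members."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


income, score, school = simulate_schools()
pooled = ols_slope(income, score)
within = ols_slope(centre_within_groups(income, school),
                   centre_within_groups(score, school))
print(f"pooled slope, ignoring schools: {pooled:.2f}")
print(f"within-school slope:            {within:.2f}")
```

Under these assumptions the pooled slope should come out clearly steeper than the within-school slope, because it absorbs the between-school association between income and attainment.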
Such a violation does not always change the results, and there are corrections we can apply to address it, but we need to understand more (on an applied level) about when and how these models perform well or otherwise, and support researchers in applying them appropriately. My PhD project will address that need through three strands of activity:
A survey of researcher practice across the social sciences. Researchers respond to potential violation of the NCRX assumption in different ways - from ignoring it completely to ruling out the method automatically. Are these variations in practice down to disciplinary habit, workflow preferences, varying understanding, or something else? I will collect data by surveying researchers and research papers. Analysis of the resulting data will contribute evidence towards understanding what drives the difference in response.
A simulation study to explore how results are affected by correlation of random effects with explanatory variables under different conditions. I will generate fake data which contain known relationships and hierarchical/nested/clustered structures, and test how estimates from different model specifications change depending on the size and shape of the data and the correlations within it.
Two case studies, exemplifying use cases in occupational social mobility and geographical health inequalities where differences of model specification and response to NCRX violation might be consequential. Using secondary datasets from large social surveys, I will recreate two existing analyses and explore how the conclusions drawn might vary or persist depending on how we model the data.
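To give a flavour of the simulation strand described above, the sketch below runs a very small Monte Carlo experiment. It is not the project's design: the data-generating process, the parameter values, and the function names are all assumptions made for illustration. For simplicity it uses the textbook random-effects (GLS) slope computed by quasi-demeaning with the variance components treated as known and equal to 1, whereas multilevel software would normally estimate them. The parameter rho is the assumed correlation between each school's effect and its mean income.

```python
import math
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def re_slope_one_sample(rng, rho, n_schools, n_pupils, beta):
    """Random-effects slope for one fake dataset with balanced clusters."""
    # Quasi-demeaning weight; both variance components are assumed known (= 1).
    theta = 1 - math.sqrt(1 / (1 + n_pupils))
    xs, ys = [], []
    for _ in range(n_schools):
        m_j = rng.gauss(0, 1)  # school mean income
        # School effect with variance 1 and correlation rho with m_j.
        u_j = rho * m_j + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        x = [m_j + rng.gauss(0, 1) for _ in range(n_pupils)]
        y = [beta * xi + u_j + rng.gauss(0, 1) for xi in x]
        xbar, ybar = statistics.fmean(x), statistics.fmean(y)
        xs.extend(xi - theta * xbar for xi in x)
        ys.extend(yi - theta * ybar for yi in y)
    return ols_slope(xs, ys)


def mean_bias(rho, n_pupils, n_schools=50, reps=200, beta=0.3, seed=0):
    """Average (estimate - true slope) over repeated fake datasets."""
    rng = random.Random(seed)
    estimates = [re_slope_one_sample(rng, rho, n_schools, n_pupils, beta)
                 for _ in range(reps)]
    return statistics.fmean(estimates) - beta


if __name__ == "__main__":
    for rho in (0.0, 0.5):
        for n_pupils in (5, 20):
            bias = mean_bias(rho, n_pupils)
            print(f"rho={rho:.1f}  pupils per school={n_pupils:2d}  "
                  f"mean bias={bias:+.3f}")
```

In this simplified setup the bias should be close to zero when rho is 0, and for a fixed non-zero rho it should shrink as the number of pupils per school grows. A fuller study would vary many more conditions and use estimated variance components.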
Contact
Student:
Kate O'Hara, University of Stirling, k.a.ohara@stir.ac.uk, @Kate_OHara_
Supervisors:
Paul Lambert, University of Stirling, paul.lambert@stir.ac.uk
Kevin Ralston, University of Edinburgh, kev.ralston@stir.ac.uk
Resources
'Three-minute Thesis' slide, as presented at University of Stirling Festival of Research, May 2023. Image credits and references for this slide are listed here.