This episode is brought to you by PAR.
Psychologists need assessment tools for a more diverse population these days. PAR is helping by making many of their Spanish print forms available online through PARiConnect. Learn more at parinc.com/spanish-language-products.
Hey everyone, welcome back. I am thrilled to have a return guest on the podcast today. Dr. Ben Lovett is here to talk with me all about psychometrics.
The title of the episode is very deliberately chosen. We are talking about all the aspects of psychometrics that you maybe [00:01:00] want to know but you’re afraid to ask. So I am asking some super naïve questions.
We get back to basics. We talk about things like, what does it really mean for a test to be biased? We talk about which psychometric properties we should care about the most. We talk about where to find good information on psychometric properties of a test. We talk about how much to share in the report regarding psychometrics among many other things.
This is an episode chock-full of great information. If you’ve heard Dr. Lovett on the podcast before or read some of his work, you know that he is super knowledgeable and goes really deep with his research into any of the areas that he talks about. So there’s a lot to take away from this episode.
I will tell you a little bit about Ben and then we will transition to the conversation.
Dr. Ben Lovett is a professor in the school [00:02:00] psychology program at Teachers College, Columbia University. He teaches courses on psychoeducational assessment and legal and ethical issues for school psychologists. He has over 100 publications on these topics, including books on testing accommodations, and his new book, Practical Psychometrics: A Guide for Test Users that will be linked in the show notes. He consults widely with schools and testing agencies on assessment and disability issues.
As I noted, all of Ben’s books are great, and very research-informed. I highly recommend you check out this new book if you are interested in, like it says, a practical guide on psychometrics.
If you’re a practice owner and you would like some support in launching or growing your practice, I would love to help you out. We have coaching groups at every stage of practice development and I do have occasional openings for individual consulting as well. You can get [00:03:00] information and schedule a pre-consulting call at thetestingpsychologist.com/consulting.
All right, welcome back for my 3rd podcast conversation with Dr. Ben Lovett.
Hey, Ben, welcome back.
Dr. Lovett: Very glad to be back, Jeremy. How are you doing?
Dr. Sharp: I am good today aside from this rain that we cannot seem to shake here in Colorado this summer, but otherwise I’m doing well and having a great summer.
Dr. Lovett: No, it’s good to hear. It’s very hot in New York City at the moment but it is sunny at least.
Dr. Sharp: Well, there you go. It’s just another reason to come visit. I haven’t been there in probably 30 years.
Dr. Lovett: Absolutely. Well, definitely, then.
Dr. Sharp: I’ll put that on my list. [00:04:00] I’m glad to have you back. It’s been, gosh, speaking of long amounts of time, I feel like it’s been forever since you were here the last time.
Dr. Lovett: Probably 6 or 7 years at least, I’m guessing. The pandemic happened and everything, but you’ve been great about keeping up with so many different podcasts on different topics.
Dr. Sharp: I’m doing my best. It’s super fun and largely thanks to folks like you who are willing to come and chat with me. I’m going to keep doing it.
Dr. Lovett: Thanks.
Dr. Sharp: I’m glad to have you. We’re back talking about an entirely different topic than we’ve talked about in the past, talking about psychometrics. The title is everything you want to know but are afraid to ask. So I’m really going to take it upon myself to ask a bunch of dumb questions and try to…
Dr. Lovett: No, please. I may or may not be able to answer all of them but a very important topic to talk about. I thought it was a great title because every psychologist has coursework in psychometrics but it’s not something that you’re [00:05:00] necessarily thinking about on a daily basis as a practitioner. Even as a researcher, it’s not something that you’re always thinking about when you should be. I’m a bit experienced.
Dr. Sharp: That’s a good way to put it. I totally agree. I think we talked about that in our pre-podcast chat that psychometrics, it’s one of those core foundational things that we get taught and I think a lot of us lose track over the years or maybe lose touch with some of those core principles.
Dr. Lovett: Definitely, you might have a course in your 1st or 2nd year in graduate school. And then when you’re taking courses that are more applied, a lot of it is nuts and bolts of administration and the interpretation is separate from remembering those psychometric basics that can turn out to be really important and cases can turn on them.
Dr. Sharp: Sure. I think this will play well too. I just did an episode two months ago on measurement error and pediatric intelligence testing. There’s some overlap in there.
Dr. Lovett: [00:06:00] Definitely.
Dr. Sharp: Well, before we get too far into it, I’d love to ask the question that I tend to lead with these days. Interesting for you. You have a lot of areas of interest, and so the question is why, maybe it’s why is this important now to you given you spent your career on many other topics. So why this? Why now?
Dr. Lovett: Sure. It’s a great question. Pretty much all of my work relates to assessment in some way. In that sense, psychometrics is connected. As you know, I do a lot of work on the diagnosis of learning disabilities and ADHD. I also do a lot of work on testing accommodations for students with disabilities.
So psychometrics is in the background with regard to a lot of that. What really pushed me into studying it and writing about it more explicitly is training graduate students in testing. I found that folks really needed an accessible understanding of psychometric concepts. People could memorize formulas but wouldn’t necessarily know how to apply them in individual cases.
I also saw [00:07:00] for years that on assessment listservs and psychological listservs, I would see people posing the same questions again and again that were at times really important psychometric issues. I was glad that people were asking the questions but was thinking, we have to find a way to make that knowledge more accessible. So that’ll definitely be another reason.
And then finally, maybe more on point with your question about why now, I see tests at times criticized fairly and then other times I see testing criticized unfairly. At times as psychologists, we’re not always in a great position to defend our work if we don’t understand psychometrics deeply.
Dr. Sharp: That’s a good way to put it. I think a lot of us do get wrapped up in those discussions/arguments and quickly get out of our depth on either side. Having that solid foundation is pretty key, not just in arguments with other professionals but [00:08:00] just to have a solid foundation for the work we do.
Dr. Lovett: Absolutely. But as you say, in an adversarial situation, it’s even more important. There are times when I testify as an expert witness in court as a psychologist, and there are times when really important legal situations may turn on an understanding of test scores. Probably the most famous one, I wasn’t involved with personally, but in a Supreme Court case, they actually had a footnote citing a psychologist who had done work on measurement error, actually, and reliability in testing, and it involved death penalty cases.
So definitely important things can turn on psychometrics and when I’m faced with another psychologist or at times even a judge who’s asking questions about test scores, important to be able to explain those things.
Dr. Sharp: Yeah, absolutely. I don’t suppose is the case you’re talking about. Was that Joel Schneider when he was…?
Dr. Lovett: Yeah.
Dr. Sharp: He’s been a guest on the podcast. We talked about that time of his life a little bit.
Dr. Lovett: That’s something [00:09:00] most of us don’t expect, to ever be cited by the Supreme Court, but very impressive.
Dr. Sharp: Absolutely. Well, I have so many questions, some purposefully naïve, some not. Nobody’s going to know because…
Dr. Lovett: But there are no bad questions and I might or might not have good answers.
Dr. Sharp: Well, let’s dive into it. Maybe we start with a general question, but I would love to get your take on this. So you said a lot of this originated from your work teaching graduate students. I’m curious what you find to be the most difficult or confusing psychometric topics to teach.
Dr. Lovett: It’s a great question. One thing that I find is that students can understand reliability concepts when I’m giving a test on them but then will interpret scores from tests as though they’re perfectly reliable and as though there’s no error around them. So that’s just one example of something.
There are times when I’ll be at a meeting with a special education team, and people are also [00:10:00] interpreting things that way. So that would be one thing. The general perspective that I think understanding psychometrics gives us is we should be impressed by tests but we should also be humble about them. So it’s amazing that we can give these instruments that are relatively brief and predict performance in other settings at times years later. That’s amazing that psychologists came up with ways of doing that 100 years ago.
On the other hand, if we overinterpret them, if we’re not sufficiently humble about our evidence, then at times we’re going to be making inferences and claims that aren’t really supported. Reliability is just one way that that can happen. No test is perfectly reliable and so there are times when people act as though we really have a precision that we don’t when we’re making judgments about a test score.
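As an illustration of the point about precision (not part of the conversation): under classical test theory, the standard error of measurement is SD times the square root of (1 - reliability). The sketch below assumes a standard-score scale with mean 100 and SD 15 and a reliability of .90. Those numbers and the function names are hypothetical choices for illustration, and many test manuals build their intervals around an estimated true score rather than the observed score used here.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band centered on the observed score (a simplification)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical example: observed score of 100, SD 15, reliability .90.
sem = standard_error_of_measurement(15, 0.90)
low, high = confidence_band(100, 15, 0.90)
print(round(sem, 2), round(low, 1), round(high, 1))  # 4.74 90.7 109.3
```

Even with a reliability of .90, the band spans roughly 19 points, which is one way to see why a single score should not be read as exact.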
So one of the clinical implications of that is, of course, to use multiple measures and multiple sources of evidence but it’s something that I see at times folks forgetting, students in particular, who are just learning this stuff. So, reliability would be one thing.
[00:11:00] Another topic that I think is at times difficult to teach is validity because it is actually much more abstract and much more multifaceted than reliability. Validity, I always say, is about specific inferences and uses of tests. There’s no test that’s valid for everything, although there might be some tests that are invalid for everything. We don’t really find any measure that’s valid for all purposes. And so if we try to apply a test to a purpose that it wasn’t designed for, it may or may not be valid.
If we’re looking in the manual or we’re looking at research to support a particular use, we want to see a particular type of evidence. So a test might be good for, for instance, classifying folks as meeting criteria for a disorder but that doesn’t mean that it can be used to select treatments. Just to take a simple example, there are times when decisions or recommendations wouldn’t really flow from a test score.
An example that I see as a school psychologist is, for [00:12:00] instance, recommending that a student be retained in a grade and not promoted to the next grade. It’s a controversial practice and there’s a lot of work suggesting that that’s usually not a good idea regardless of what the test scores are. And so pointing to a test score and saying we should do that, it’s an interesting thing. The test wasn’t necessarily designed for that. Most tests weren’t. So validity would be another one.
Something that I see students concerned with very much today, especially and they should be, is the potential that tests can be biased against different groups. And so really important to address that concern. I think we have a lot of research on that topic but it’s something where there are often misunderstandings of what test bias means and how we would measure that.
And that’s another case where if folks aren’t at all concerned about it, then that’s being too impressed and not humble enough about measures. On the other hand, if you’re overly critical, if you’re not really looking, does the research support these criticisms of these claims, you might not be appropriately respectful of what test scores can do for [00:13:00] us.
Dr. Sharp: I’m glad that you are bringing this up. I think bias in particular is a huge topic that is coming up a lot, rightfully so, over the last several years. I know we’re going to dig into that pretty deeply here during our conversation which I’m excited for.
Dr. Lovett: Happy to.
Dr. Sharp: Any other concepts that we just need to be aware of that folks seem to struggle with in this world of psychometrics?
Dr. Lovett: At times, there are issues of norms and norm-referenced scores. I would say if I had to pick one thing, the most common psychometrically based, assessment-related question that I see on a listserv or a Facebook group or some other forum is why a composite score from a bunch of subtests is more extreme than any of the individual subtest scores, and people often think they’ve made an error.
So let’s say you have a bunch of subtest scores that are each about 1 standard deviation below [00:14:00] the mean and the composite based on all of them is even farther, and it’s routine that people think, I made some sort of mistake. There’s no way that can be. At times people are really worried about a decision.
Let’s say you’re classifying a child as having an intellectual disability and none of the index scores on the WISC are below 70, but the composite is. People will feel like not only is this wrong but it’s dangerous what I’m doing. That’s a basic psychometric thing that people often are not even taught in graduate school in a careful way. You may then have to explain this to parents or teachers. You’re then having to be the psychometric expert because people will otherwise be very concerned that what you’re doing is wrong.
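A short sketch of why this happens (an illustration added here, not from the conversation): when subtests are positively but imperfectly correlated, the sum of k standardized scores has a standard deviation of sqrt(k + k(k-1)r), where r is the average intercorrelation. That is less than k, so a consistently low profile is rarer than any single low score. The code assumes five scores and an average intercorrelation of .50; those values and the function names are hypothetical, and real tests derive composites from norm tables rather than this formula.

```python
import math


def composite_z(z_scores, mean_intercorrelation):
    """Standardize the sum of z-scores given their average intercorrelation."""
    k = len(z_scores)
    sd_of_sum = math.sqrt(k + k * (k - 1) * mean_intercorrelation)
    return sum(z_scores) / sd_of_sum


def to_standard_score(z, mean=100.0, sd=15.0):
    """Convert a z-score to a mean-100, SD-15 standard score."""
    return mean + sd * z


# Five scores, each exactly 1 SD below the mean (standard score 85).
one_sd_low = composite_z([-1.0] * 5, 0.5)
print(round(to_standard_score(one_sd_low), 1))  # 80.6

# Five index scores of 72: none is below 70, but the composite is.
index_z = [(72 - 100) / 15] * 5
composite = to_standard_score(composite_z(index_z, 0.5))
print(composite < 70)  # True
```

The lower the intercorrelation among the parts, the larger this effect; if the parts correlated perfectly, the composite would simply equal the subtest average.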
Dr. Sharp: Right. Yes. So we have a great list of things to jump back into and elaborate on a bit. I did want to highlight, though, before we go any further, [00:15:00] this point that you made about the practice that we will often take this leap from test scores to recommendations; that’s a topic that I don’t know that we talk about a whole lot but probably deserves a little bit of a spotlight because I think folks listening, I’m guessing are saying, well, then what do we do? Then why are we testing if we can’t use scores to guide treatment recommendations? I’m curious what you might say to that.
Dr. Lovett: Yes. There are times when we actually have research supporting the use of a particular score for a treatment recommendation. Often we’re both validating a treatment and a test at the same time in a research study.
So let’s say, for instance, that we’re looking at scores on a depression scale in response to the use of an antidepressant or psychotherapy or something like that, when we’re seeing a change or responsiveness on the scale to an effective treatment, we’re both showing the treatment is [00:16:00] effective. We’re reinforcing that. We’re also showing the scale is good at detecting it. And so that would be a piece of evidence that would make me feel more comfortable. This is something where I know this treatment will reduce scores on this measure of symptoms.
There are other times when we don’t really have research supporting a particular use, or more often we have to combine test scores with other sorts of evidence.
The way I like to introduce testing on the first day when I teach graduate students is that it’s a way of learning about someone. And so I start by asking them how they would learn about someone else so they have to make important decisions about, like a potential supervisor at a job or choosing between sections of a course if they were looking into different professors or even a romantic partner. How would you make decisions about that person or that choice?
And so I’ll ask students and students will often come up with how they would actually do things that map on well to a lot of different assessment tools. So a lot of what they’ll describe will be like an interview or an observation, a behavioral observation or even record [00:17:00] review. When you Google someone, all of those map on pretty well. The one thing that we don’t usually do in everyday life is administer standardized measures. That’s just another piece of information that we have as psychologists that we don’t get to use in everyday life too much.
Dr. Sharp: Yeah. That’s fair.
Dr. Lovett: It’s another way of learning about someone. So if you’re making an important decision, the test scores can be helpful but you’re also going to be relying on interview data, record review data, and probably some observation.
Dr. Sharp: Certainly. I love the idea of assessing a romantic partner. How are you going to […]
Dr. Lovett: I don’t recommend psychological tests necessarily.
Dr. Sharp: Just to be clear.
Dr. Lovett: I’m not sure we have any validated ones but it’s a great example.
Dr. Sharp: Sure. So that makes sense to me with behavioral measures or emotional measures, rating scales, and things like that. I know that a lot of us use cognitive measures for that purpose as well. [00:18:00] Of course, the bridge is theoretically, cognitive data leads to a diagnosis which then guides treatment. Sometimes I think we may make recommendations based on cognitive data that is a little looser, maybe not quite as explicitly tied to a diagnosis.
Dr. Lovett: I think part of the issue is that we feel pressed to make recommendations whether or not we really have strong evidence for them. They’re a way of justifying what we’re doing in a more clear and concrete way for certain audiences. I think at times we’re prone to use boilerplate recommendations or one-to-one correspondence recommendations that may not be supported in individual cases. So we should always be careful.
When we’re interpreting cognitive or academic measures for the purposes of recommendations or most other things, I like to treat them as though they are behavioral measures. What I mean by that is, I like to think of [00:19:00] it as you can respond in a certain way to these stimuli that are on the test.
And so I would say the most conservative, careful, humble way of interpreting test score data from cognitive or academic measures is to interpret them behaviorally. That when the individual is faced with particular stimuli, like let’s say you’re giving the Gray Oral Reading Test which is a common achievement measure.
If I get a certain score when I’m administering that to someone, I know that when they’re faced with texts that they have to read aloud and then the text is removed so they can no longer see it, they have a certain level of skill at answering questions about what they read aloud. And so I can make that inference really carefully without lots of other assumptions because I’m interpreting the measure behaviorally. I know how you can respond to particular stimuli.
The farther away that we get from that behavioral interpretation, the less certain we can be. And so there are times when you have to make decisions and we don’t have any [00:20:00] evidence that will lead us to anything like certainty, but we still have to be, I think, more and more humble the farther away we go from that.
One controversial assessment area, of course, is projective personality measures. One of the reasons that they’re controversial is that we’re often making inferences that don’t map one-to-one to the behaviors that you exhibit on the test. So the fact that you report seeing particular things in ink blots on the Rorschach or that you tend to use the whole blot or parts of blots or things like that, that doesn’t really map on behaviorally in a one-to-one clear way to particular behaviors in everyday life.
We might have statistical evidence for that but it’s not as though there’s any obvious logical relationship there whereas if we’re interpreting a reading test, I think you can more easily make that one-to-one behavioral interpretation. You can respond a certain way to certain stimuli. I can’t really tell from a projective what that’s going to give me unless I have [00:21:00] research evidence showing that we know that people who tend to respond this way on the Rorschach or the TAT or something like that also respond this way in everyday life. And so then I’m really appealing to that research evidence because there’s nothing obvious that’s overlapping.
There are times when cognitive measures have that as well. Even though the WAIS does not look very much like what people do in most of their jobs, the WAIS is a good predictor of occupational performance for a lot of different jobs. So there are times when we have the statistical or research evidence and there are times when it’s in the manual or in other sources that are on those traits like intelligence. So it’s important, certainly for folks who are trained to know that literature but there are other times when there’s this obvious behavioral correspondence. And to me, that’s when it’s easiest to make recommendations.
Dr. Sharp: Yeah, that makes sense. I think something that’s coming to my mind is that a lot of the measures that we use [00:22:00] carry with them the assumption that they are reliable, valid, and not biased. We like to have faith in our measures but as you’ve said, that’s not always the case. I’m curious how you either personally or how you teach others to evaluate these things for the measures that we use. Is it reading the manuals? Is it reading the literature? Is it something else?
Dr. Lovett: It’s a great question. I think the most important thing to start with is the manual. So as you say, we often just assume that because something is published, it must have good psychometric qualities.
For years, I did work on a measure that is used especially by audiologists and speech pathologists but it’s sometimes used by psychologists. It’s called the SCAN. It’s a measure of auditory processing. It has some psychometric problems, I’ll say; I’ve published on that. When I would present this at conferences, people would be shocked and they would say, well, how did it get published?
[00:23:00] Gary Canivez at Eastern Illinois University likes to talk about measures that have cash validity. They’re good for making money for the test publisher, less construct validity but they have cash validity. I don’t want to be too critical of test publishers in the sense that I respect how much work goes into making a test. They have a lot of psychometricians, psychologists, clinicians on staff, and consultants who are involved but the tests do vary in terms of their psychometric qualities.
So it is true that on average, the most popular measures do tend to have better psychometrics. There are exceptions to that rule. Even for the measures with better psychometrics, a particular score that they generate, like from a particular subtest or a particular type of interpretation may not have good psychometric characteristics. That’s why the manual, I would say, is the most important place to start.
When I’m training students or giving workshops to practitioners, the most important thing I hope people leave with is [00:24:00] feeling empowered to read and understand the manual. To know where to look for things and to not skip over something, even material that might be a little bit more complicated, like a factor analysis study in the validity chapter of the manual. Understanding what you’d be looking for when you look at those data in the manual.
So definitely the manual is a place to start. There are some things that are less likely to be in the manual. Let’s say you’re interested in the long-term stability of a test score. So we have some research showing really impressive long-term stability of WISC Full-Scale IQ scores over the course of years. That’s not going to be in a manual usually because they can’t wait those years to do the study before publishing it.
So there are times when I recommend that folks just do a Google Scholar search or some other database search to look to see if they’re going to make certain assumptions or if they want to use a score for particular reasons. In that case, it would be to not have to give another measure for a while and be assured that things would be stable [00:25:00] over time.
Bias would be another example where you can do certain types of analyses of bias as a measure is being developed. You can see whether or not individual items exhibit what psychometricians call differential item functioning to see whether or not folks who have the same overall score on a measure, if they’re from different groups, have a different likelihood of getting an answer on an item right.
So there might be a particular test item where folks who have otherwise really similar overall scores on the measure, let’s say that folks who are white clients and black clients actually have very different scores on that item, and that would be differential item functioning. That would suggest that there’s potential bias going on, that there’s something unfair about that item that you’d want to look at.
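To make the matching idea concrete, here is a deliberately crude sketch (added for illustration, not from the conversation): group examinees by total score, then compare pass rates on one item between two groups within each matched stratum. The group labels "A" and "B", the record layout, and the function name are all hypothetical, and this is not a formal procedure such as Mantel-Haenszel; it only shows what "same overall score, different likelihood of getting the item right" means.

```python
from collections import defaultdict


def matched_pass_rate_gaps(records):
    """Per total-score stratum, return pass rate of group A minus group B.

    records: iterable of (group, total_score, item_correct) tuples,
    where group is "A" or "B" and item_correct is 0 or 1.
    """
    # counts[total][group] = [number correct, number of examinees]
    counts = defaultdict(lambda: {"A": [0, 0], "B": [0, 0]})
    for group, total, correct in records:
        cell = counts[total][group]
        cell[0] += correct
        cell[1] += 1

    gaps = {}
    for total, cell in sorted(counts.items()):
        (a_right, a_n), (b_right, b_n) = cell["A"], cell["B"]
        if a_n and b_n:  # only compare strata where both groups appear
            gaps[total] = a_right / a_n - b_right / b_n
    return gaps


# Made-up data: the groups are matched on total score, yet differ on this item.
records = [
    ("A", 10, 1), ("A", 10, 1), ("B", 10, 1), ("B", 10, 0),
    ("A", 12, 1), ("A", 12, 1), ("B", 12, 0), ("B", 12, 1),
]
print(matched_pass_rate_gaps(records))  # {10: 0.5, 12: 0.5}
```

A consistent gap across matched strata is the pattern that would flag an item for closer review; gaps near zero would not.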
So something like that, the test developers can look at as they’re making a measure, but other sorts of analysis like trying to predict some future score and seeing whether the prediction is the same for clients from different groups is something that’s going to take time again. [00:26:00] That’s a really important type of bias research but it’s not the sort of thing that will be in a test manual typically.
There are times when you can look up the prior edition of a test that just came out and see whether or not that’s showing bias in the literature. We do have studies regularly published in the literature on bias. Even just looking up an abstract and being able to see the conclusion, it’s at least something.
I understand folks may not have the time or ability to read lots of studies on a particular score or particular measure but looking into things, especially if there’s a concern in an individual case or someone else who is involved, like a family member, someone else on the treatment team is concerned that the score is not accurate or that the measure is not good, then that’s enough reason definitely to look into things.
Dr. Sharp: Right. Before we transition into this bias discussion because that keeps coming up, quick thoughts on the Buros texts? They are [00:27:00] relatively well known.
Dr. Lovett: Yeah, The Buros Mental Measurements Yearbook and now Tests in Print, and many people have access to those databases. I think they’re great to look at. I think it’s better to also look at the manual afterwards but I definitely have respect for those. Typically, Buros gets two different people to write independent reviews of a new measure.
They’re basing their reviews off of what’s in the manual. It’s relatively rare that you’ll see citations to other research. And so you’re getting, I’d like to think of it as a summary of the most important things in the manual to the reviewer. Those may not be the most important things to you in an individual case, but I think they’re very helpful as an overview.
I’m so glad you brought that up. That’s definitely a great resource to look at. There are times when you’ll see two different reviewers have different opinions of a measure, though. To me, that’s another sign of why it’s important to look at the manual yourself and make judgments about it, but those can definitely be helpful.
[00:28:00] Often I think practitioners will say, what are most people using? And rule of thumb, as I say, on average, that’s going to lead to better psychometrics but will it lead to a good measure for your particular client, or for a particular score that you’re going to use? Maybe not. Again, there are exceptions to that where there are some measures that are used frequently that may not have great psychometrics. Those can definitely be helpful reviews to look at.
Dr. Sharp: Great. Well, really quickly just to close the loop on this, you mentioned the manuals several times and knowing where to look and what to look for. What’s the hot list of what we’re looking for in these manuals?
Dr. Lovett: The psychometric list that we normally think of is norms, reliability, validity, and bias. I would say those are the things that we’re typically looking for. So if you’re choosing between measures, if you’re deciding which measure to purchase or which measure to look at when you’re inspecting scores that someone else has generated, you have other [00:29:00] data sources, or you’ve administered more than one measure and you’re trying to see what should you weight more, the first thing that I look at is norms.
Recent is better than older. On average, larger norm samples, especially for the norm block that you’re comparing that client to, even if the test was normed on 2,000 or 3,000 people, there might only be 50 or 100 in the block of people that age or whatever that you’re comparing someone to. Those sorts of things are representative. Recent large norm samples are better than the opposite.
For reliability, we’re looking for higher coefficients, of course, but it’s important to know what types of reliability to pay attention to. I find that if you ask many psychologists what reliability is or they’re explaining it to a family or to another professional, usually the definition that they give is about test-retest reliability. So they’ll normally say, is a score stable over time, is a score something that’s going to be dependable that you can rely on to still be the [00:30:00] case if you gave the measure again.
And so that’s certainly not a wrong definition but it applies to one type of reliability, and that’s actually not the type of reliability that you’re likely to see cited the most in a manual. So if you’re giving this definition for test-retest reliability but the most common measure that we’ll see is one for internal consistency, it could be Cronbach’s alpha or split-half reliability or something like that. And so, internal consistency doesn’t tell you anything about whether a score is stable over time.
If you’re just giving that definition conceptually, you have to be careful then in applying it to a particular measure. Why is internal consistency important? Being able to explain that. We’re usually not interpreting test data at the item level. We’re interpreting it at the level of a score that sums across many items.
Internal consistency is based on how strongly those items correlate with each other. So it gives you confidence that the overall score is representative of the person’s functioning [00:31:00] on those different items. If the items aren’t correlating much with each other, the meaningfulness of that overall score is going to be less. So it’s helpful to have an understanding of the most common type of reliability. And for certain measures, other types are important. Even for something like the vocabulary subtest on the WAIS or the WISC, it’s really important to have inter-rater reliability. We have those data to show that two people who are scoring the test code answers the same way.
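For readers who want to see internal consistency computed, here is a minimal sketch (added for illustration, not from the conversation) of Cronbach’s alpha: k/(k-1) times (1 minus the sum of the item variances divided by the variance of the total score). The response matrix is made up, the function name is hypothetical, and sample variances from Python’s standard library are used.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose to per-item columns
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up data: 5 respondents by 3 items that rise and fall together.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 3],
]
print(round(cronbach_alpha(responses), 2))  # 0.94
```

Nothing in this calculation involves a second testing occasion, which is why a high alpha by itself says nothing about stability over time.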
So looking for a variety of different types of reliability that are going to be important for different types of measures, we don’t usually report inter-rater reliability for simple quantitatively scored items so you won’t see those data, but looking for all of those. So norms, reliability, validity, and again, knowing the different types of validity evidence and what sort of things you’re looking for.
So in a measure for behavioral or emotional functioning, it might be content validity, showing that there’s a mapping of [00:32:00] the different items on the test to the symptoms of a disorder that are in the official criteria. That might be really important. If you’re looking at a particular math measure, you might want to note whether or not it has content validity in terms of measuring the things that this student has been taught or has been part of the curriculum or does it even include the concepts that the student is having trouble with? So that type of content validity might be important.
If you’re looking at a measure that predicts something like whether or not a cognitive measure predicts some sort of achievement, then that may be the relevant data to look at. Often, people are forgetting that the most important type of validity data in a lot of diagnostic assessment is classification validity. Is the measure good at classifying people? Is it sensitive and specific to different particular classifications like a disorder?
And so, people often are not paying attention, I find, to that evidence in the manual. Usually, the way that they do those studies for the manual is they look at a bunch of people who already have an independent [00:33:00] diagnosis of a disorder and a bunch of folks who don’t. They compare how they perform on that measure. They see, for a certain cut score, whether, say, 90% of those people who have a disorder independently diagnosed will also fall above that cutoff on the scale.
And that will tell you, in that case, the sensitivity of the scale. So be aware of those data, the sensitivity and specificity data. And if you can get them, there are some other classification validity statistics, positive and negative predictive values. Those are even more helpful in some cases. I’m happy to talk more about that.
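To make the classification statistics above concrete, here is a minimal Python sketch. The counts are hypothetical, chosen only so that sensitivity matches the 90% figure in the example above; the 80% specificity and the 10% base rate are likewise assumptions, and the function names are illustrative rather than taken from any manual or library.

```python
def classification_stats(true_pos, false_neg, false_pos, true_neg):
    """Basic classification statistics from a 2x2 table of counts."""
    return {
        # Of people who have the diagnosis, what share score above the cutoff?
        "sensitivity": true_pos / (true_pos + false_neg),
        # Of people without the diagnosis, what share score below the cutoff?
        "specificity": true_neg / (true_neg + false_pos),
        # Of people above the cutoff, what share have the diagnosis?
        "ppv": true_pos / (true_pos + false_pos),
        # Of people below the cutoff, what share do not have the diagnosis?
        "npv": true_neg / (true_neg + false_neg),
    }


def ppv_at_base_rate(sensitivity, specificity, base_rate):
    """Positive predictive value when the disorder has a given base rate."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


# Hypothetical validation sample: 100 people with an independent diagnosis
# and 100 without. 90 of the diagnosed group score above the cutoff.
stats = classification_stats(true_pos=90, false_neg=10, false_pos=20, true_neg=80)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")

# The same scale applied where only 10% of examinees have the disorder.
print(f"PPV at a 10% base rate: {ppv_at_base_rate(0.90, 0.80, 0.10):.2f}")
```

Under these assumed numbers, the same cutoff that looks strong in a half-and-half validation sample yields a much lower positive predictive value when the disorder is uncommon among the people being tested, which is one reason predictive values can be more informative than sensitivity and specificity alone.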
And then finally, bias-related data. The data you can get from the manual about making sure that they were screening for differential item functioning or that they conducted sensitivity reviews to make sure the material was not causing offense. That’s the sort of stuff you may be able to find in a manual. Depending on the type of test, you’re looking for different things that might be relevant.
That’s also something where you may have to go to the [00:34:00] research if there are concerns about bias. Because if you’re looking at whether there’s differential prediction of some future outcome, that may be important. You typically won’t find that in the manual.
Dr. Sharp: Yeah, well, I think it’s time that we really dig into the bias discussion because it has come up so many times. So let’s do that. I’m going to start with a very general question and that question is, what does that actually mean when we say the test is biased?
Dr. Lovett: It’s a great question. There are different operational definitions. There are different ways of measuring bias. It is true that different people in the field will give different conceptual or working definitions. So I’ll speak for myself. The way that I think about bias and the way that I teach about it, is that when you get a particular score on a test, it has a different meaning depending on what group you’re from.
So you can’t interpret the score the same way because it actually has a different [00:35:00] meaning for people from different groups. We often describe tests as being biased against a particular group. And so what that would mean is that you might have higher ability or higher skill than your test score suggests. That would be one type of bias; basically, we can’t interpret the score the same way.
If you and I are from different demographic groups, and so we both get the same score, our scores have different meaning. Even though we both have the same score, mine might suggest a lack of competence and yours suggests a level of competence that we care about. The general way I define bias is when you have a test that has different meaning for folks from different groups. You can’t interpret the score the same way. There would be bias there.
So it’s not the same thing as the fact that there are group differences on a measure. For instance, if you’re looking at combined norms for ADHD symptoms, you’ll find that boys have higher levels of symptoms than girls do. They’ll get a higher score if you’re looking at the combined [00:36:00] norms for the BASC or something like that. I like ADHD examples. That would be a very common one. That doesn’t mean that the measure is biased against boys. It may be correctly detecting that boys do have higher levels of those symptoms on average. So it’s not the same thing.
Differences in group scores don’t mean that a test or a score is biased. They can be a reason for concern, a reason to look into that, but they wouldn’t be evidence of bias per se. We have evidence of bias when we know that a score has different meaning for people from different groups.
Usually, the way we figure that out is we compare the score data to some other independent type of data. So we look, for instance, at not just self-reports of depression symptoms, but observable features by trained clinicians. Or we look not just at someone’s score on a measure of reading that we have, a diagnostic measure of reading, but at how they’re doing in school in their language [00:37:00] arts grades.
We see whether or not there’s a differential relationship between our test score and some sort of external criterion for performance or some other trait. If that relationship, again, is different for people from different groups, then that raises the potential for bias. We want to see what’s going on.
There are times when the external criterion is actually what’s biased. So there are times when teacher ratings, for instance, of students may be more biased than the test is. And so we have to be careful about interpreting that. Differential relationships are a really common and important way to assess whether a test could potentially be biased or whether particular scores can be biased.
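One simple way to picture the differential relationship described above is to fit a separate prediction line for each group and compare them. The sketch below uses idealized, noise-free, made-up numbers and hypothetical group labels; real differential prediction studies rely on large samples and formal statistical tests, which are not shown here.

```python
def fit_line(x, y):
    """Ordinary least-squares line: returns (slope, intercept) for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, score):
    """Predicted criterion value for a given test score."""
    return slope * score + intercept


# Hypothetical data: the same test scores in two groups, paired with an
# external criterion (for example, a course grade on a 0-100 scale).
test_scores = [80, 90, 100, 110, 120]
criterion_group_a = [50, 55, 60, 65, 70]
criterion_group_b = [55, 60, 65, 70, 75]

slope_a, intercept_a = fit_line(test_scores, criterion_group_a)
slope_b, intercept_b = fit_line(test_scores, criterion_group_b)

# Same test score, different predicted outcome depending on group.
gap_at_90 = predict(slope_b, intercept_b, 90) - predict(slope_a, intercept_a, 90)
print(f"group A: slope={slope_a:.2f}, intercept={intercept_a:.1f}")
print(f"group B: slope={slope_b:.2f}, intercept={intercept_b:.1f}")
print(f"criterion gap at a test score of 90: {gap_at_90:.1f}")
```

In this invented data the two lines have the same slope but different intercepts, so an identical test score corresponds to a different criterion level in each group. If the lines coincided, the study would give no sign of differential prediction. Either way, the result is only as trustworthy as the criterion, which may itself be biased.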
I think it is really important that you have that general definition, that a score has different meaning for people from different groups. That’s a really big problem if it exists because we can’t interpret the score the same way, we have to ask, well, first, what group is the student or the client or the examinee from before we make any judgments about that score [00:38:00] and what it suggests.
It’s right to be concerned about bias. We do have a lot of research on bias and testing. Thankfully, most research on cognitive and academic measures fails to show bias of that sort. So we have a lot of research looking at those differential relationships. We also have less research, but still some, on, for instance, teacher ratings of ADHD symptoms (I’ll use ADHD again), comparing those to having independent observers actually go into classrooms and quantitatively score the number of times that kids are showing different behaviors.
So we have a lot of data on bias, and generally, it’s reassuring about tests. So that’s typically what we find. I wouldn’t make that claim about every single measure all the time, but for diagnostic testing, we have a good deal of evidence that’s generally reassuring. But it is always something to look into. It’s always a valid concern that a measure might be biased. [00:39:00] We’re most concerned about that if we have someone who scores very low on an ability measure and we’re seeing other types of evidence that suggest that’s not accurate. To me it gets back to that general point that we want to be respectful and at times impressed with what tests can do, but we also want to be humble and note that tests are imperfect, like all of our assessment tools are.
Dr. Sharp: That comment about bias not being the same as group differences, I think is important. I wonder if we could dig in just a little bit more to that. Is there anything more you could say about the difference between those two ideas?
Dr. Lovett: Sure. Yeah, absolutely. Group differences can definitely be a reason to look into bias or a reason to think that a measure could potentially be biased, but it wouldn’t be evidence of bias per se, because if the measure is valid for both groups in the same way, then there wouldn’t be any bias in the measure. The score could be interpreted in the same way.
Again, to go back to that ADHD example with boys and girls, if you were to find that boys and girls who have that particular level of ADHD symptoms on the measure, who have scores above a particular cutoff, are equally impaired in school, that would be an example of the evidence that would show or suggest that this score is valid across different genders, two genders in [00:41:00] this case. There wouldn’t be any concern about bias from that particular study, at least.
So group differences can suggest a concern but not necessarily bias. To take an example from cognitive or academic measures, at times, tests are not biased but they’re detecting inequality and lack of opportunity that someone’s had in their life that may be related to demographics or racism or other problems. And so the test score may be in part indexing those social problems.
You’re testing a child who has had a real lack of opportunity. Let’s say they have been in an environment where they haven’t had a lot of cognitive stimulation. In part, that’s because of policies that may be due to racism and other social problems. And so we have this relationship definitely between social ills and a test score, but it doesn’t make the test score biased or inaccurate or invalid necessarily.
The test score may be picking up a lack of academic [00:42:00] skills accurately and may be the key to that student receiving remediation or help that they need. If the test score were biased, that would suggest it’s not accurate.
Dr. Sharp: I see, that feels like a much finer hair to split when we get into, how do we separate bias from lack of opportunity or social problems like you mentioned with marginalized groups.
Dr. Lovett: The bias-related studies really help because generally, if it’s not due to bias, if it’s in fact due to those other problems, you’ll see a consistent relationship still between the test scores and other external measures. That child not only is doing poorly on the test but they’re also doing poorly in academics in school. They’re not doing well on their state accountability test. They’re not doing well in terms of their grades, all of those sorts of things.
So that additional evidence is always helpful. As you know, I’m very big on looking at things beyond test scores for any sort of important [00:43:00] decision, but they also can help reassure us that the measure in this case is not inaccurate.
Dr. Sharp: Right. That does make sense. I appreciate that distinction. You’ve mentioned that you’ve had to field some questions from families, like many of us have, I think, around the accuracy of tests and are they biased and things like that. I am curious how you respond to those questions because really it’s the balancing of all this information that you’re sharing with me and packaging that in a way that laypeople can understand.
Dr. Lovett: Sure. Absolutely. I think as with any concern that a client or a family member or other audience brings, I’ve never heard a concern that is not valid. I’ve never heard a concern that’s ill-placed. We should always be critical of the potential for there to be error in our work.
And so to me, I think [00:44:00] validating that concern is an important first part of the response. And making clear that as psychologists, we take that very seriously. At times, I have to educate folks a little bit about how we assess the potential for bias, both when we’re developing a measure and then after it’s published and then how I’m doing it in this individual case.
If the test score is the only piece of evidence that suggests something and I have a number of other sources of evidence from observation and interview and ratings and other sorts of measures, then whether or not we’d call it bias, the test score may well be in error. I always try to point out when there’s consistency or converging evidence from different sources. And so that often is, I find, a helpful way of addressing concern about any type of error.
Bias is just one type of error. So I respond the way that I do when there’s any potential concern about error. Often, I’m giving [00:45:00] feedback or interacting with families where there’s not a concern about demographic group bias or marginalized group bias, but someone thinks that a particular reading score or an IQ score is not accurate.
That in fact, someone has reading problems that are not being detected by the measure. And so someone thinks that the reading score is too high because they’re sure that their child’s reading scores are not in the average range and don’t believe when, in fact, they are or vice versa, that someone’s intelligence is much higher than their IQ score will suggest or something like that.
As with any type of concern about error, I always validate it because it deserves validation. It’s always worth looking more closely at that. When our intuitions diverge from test scores, we have to figure out why. The philosopher Alfred North Whitehead said that even though science can contradict common sense, in the end, it has to explain common sense.
You have to explain why people have the intuitions that they do. Part of [00:46:00] that is taking a critical inspection of our scores and trying to understand our tests. So I always start with that. I’ll point to other convergent sources of evidence, and I’ll admit, I’ll say, if the test score were the only thing that were pointing in that direction, I would tend to be skeptical of that score.
In this case, we have converging evidence from this source, this source, and this source that are all pointing in the same direction. And then, as I say, at other times, I’ll actually go into a little bit about how we ensure that the test was not biased based on prior research or development work. That the test scores in this test have been found to predict some sort of external measure, like teacher ratings of whatever in students from different backgrounds. That may be important to note.
And there are times when pointing to genuine lack of opportunities or even problems that are more clearly social problems, I say, like racism or sexism at times and other sorts of things can be helpful to [00:47:00] point out that those can actually lead to changes in traits.
Someone may experience psychiatric distress or psychological distress or even psychological disorder that’s due to minority-related stigma issues or minority stress, as it’s sometimes called, related to racism or other problems. Acknowledging that social problems can actually change someone’s scores on a test because they’re changing the underlying trait that the test is measuring.
In that case, the test is not biased. The test is not inaccurate. I like to cite the National Council on Measurement in Education. Their view is, they say it’s like blaming a thermometer for global warming. When you’re upset about the results, it doesn’t mean the measurement tool is bad. I think it’s a useful analogy. There are times when tests are pointing to genuine problems in society, but that doesn’t suggest that the test is inaccurate. The test is measuring that.
Dr. Sharp: [00:48:00] I’m glad that we’re talking about this in such detail. This is something that a lot of us wrestle with, myself included, like how to make these distinctions and when to actually know that it’s the test’s fault versus…
Dr. Lovett: I think comparing the test to other sorts of evidence is one of the most important ways of doing that. If the test score is the only piece of evidence that’s pointing in a particular direction, and it doesn’t make sense given the referral, and it doesn’t make sense given everything else, then I would be concerned the test score is inaccurate, whether or not it’s due to bias. It could be due to lack of rapport. It could be due to all sorts of things. It could be an unreliable score.
Even in a reliable score, you have flukes that happen. So there are all kinds of reasons for that, bias would just be one of them. I think it’s important to emphasize to clients, other audiences as well, that you take the test scores as one piece of information. [00:49:00] I think anytime someone only is looking at the test scores, it’s not a good assessment.
Dr. Sharp: Well, yeah, that’s a nice message to take away no matter what. I think it is easy to get wrapped up in the test scores and overinterpret or assign more meaning than we should.
Dr. Lovett: They’re quantitative so it’s easy to focus on them but that doesn’t necessarily mean that they’re always accurate.
Dr. Sharp: Right. Well, I’m glad that we talked about that a bit because we do get a fair number of cases in our practice where scores are higher than expected, lower than expected and you have to sort through.
Dr. Lovett: Yeah. I don’t know if it’s something that you’ve ever discussed on the podcast, but there are times when I give advice about preparing for that ahead of time at the time of referral, letting folks know that you may be surprised by the outcomes, letting folks know how you look at test scores but you’re also going to look at other evidence [00:50:00] but the conclusions aren’t things that you can predetermine.
When someone shows up and they already have their conclusion about themselves or their child or something like that, it’s something that we all face from time to time. We want to be open at the start that things may not go in the direction that you expect. I like to use analogies to medical tests. You may have a particular symptom but it may turn out that you get good news, that there’s not a disorder present. And so if you’re still in pain or you’re still having the symptom, we have to figure out why, but it might not be the disorder you think it is.
There are other times when you might not think there’s anything wrong but a blood test showed something. You’re used to this with medical tests. There are times when you get outcomes that you don’t expect. And so I can’t tell you ahead of time what those outcomes will be. I’ll do my best as any professional does to interpret those test scores in the context of everything else because they can have different meaning depending on the [00:51:00] context but not sure what the outcome will be. So you may be surprised by some of the data that we get.
Dr. Sharp: That’s a good way to put it. I know this is happening for a lot of practitioners these days, managing expectations, we’ll call it, for evaluation results. So if you have any other thoughts or tips on how to navigate when expectations don’t match the outcome, people would take it.
Dr. Lovett: The only other thing I would say that you can do after the fact as opposed to preparing people ahead of time is really address what the concern and the perception is. If you don’t have any other rationale or explanation for why someone is feeling the way they are or believing the things that they do, then they’re less likely to be satisfied with you just saying that’s not the case. The test score data and everything else goes against that.
So you want to have some other explanation of it. For instance, I was dealing with a case relatively recently [00:52:00] involving a student who, it was COVID related, had been in and out of school a lot for two years, not consistent attendance. It’s a big problem in New York City now and in many other places, the amount of chronic absenteeism. And so in this case, there was another reason other than a learning disability that seemed pretty clearly tied to academic skill deficits.
In this case, it was really important to explain to the family that it’s not that you’re misperceiving things in the sense that you’re wrong about where your child’s academic skills are; there are definitely deficits here. But when we look at where he was actually in 2019, by that point, there would have been a learning disability showing up. We can actually see there wasn’t any problem, and this is what’s actually happening. I understand there may be good reasons why you weren’t sending him to school for certain periods, you didn’t feel things were safe, but that will take a toll on academics. [00:53:00] And so it helps to have some way of explaining the test scores and also explaining the person’s perception, so that it’s not as though they are deluded about what’s going on. It’s just that the explanation, the cause, may be different, and the test scores help to understand that in the context of everything else.
Dr. Sharp: I like that approach. That seems to match what we talk about a lot here in our practice, at least where it’s not a matter of invalidating the experience or saying, hey, you’re not feeling this or doing this or experiencing that, it’s just not the reason you think. It’s something different.
Dr. Lovett: Like for ADHD, again, I’ll use that example because I do a lot related to that. There are times when a parent will, on a norm-referenced rating scale, be reporting average levels of symptoms, and yet the person’s perception is that they’re above average. They’re well within the average range.
Actually doing an item-level [00:54:00] review and describing how that norm-referenced test is made and how we compared your child to 1,000 other kids at the same age can help. And so, we’re really able to say, is this typical? Is this not typical? I’m trying to make it clear that you’re trusting the parent on all of those individual ratings, that often, Johnny’s exhibiting such and such a symptom, et cetera. So what we’re able to do is say, is this unusual? Because one of the first criteria for diagnosing a disorder is that something is atypical. If it’s typical developmentally, then a diagnosis really wouldn’t be appropriate.
And so that’s why using this particular measure is so helpful. That’s, by the way, one of the reasons why checklists are not nearly as helpful as norm-referenced measures and why physicians overdiagnose at times. So pointing out that norm-referenced element to the test and explaining norm-referencing can be really helpful. There are times when it can be framed in a clinical way that actually is reassuring.
Dr. Sharp: [00:55:00] I think I’ve told this story on the podcast maybe. I can’t remember, but I’ve had multiple times over the years where we’ll be talking through test results in a feedback session, and at some point, the parent will sheepishly raise their hand and say something along the lines of, where did these tests come from? Did you make up this test? Insinuating that I had written out some questions on the spur of the moment. It makes me remember that the layperson doesn’t live in this world and know that these are solid tests most of the time.
Dr. Lovett: Absolutely. It can be helpful, as in any type of professional practice, to mention that you use a very common measure or that it’s the standard for measuring this. If you’re administering an IQ test, you can describe it as the gold [00:56:00] standard in terms of what clinicians are using. You can say that it’s in the criteria for a particular condition like intellectual disability, that you may need to administer it and you’re using one that is very commonly used. It’s been around in different editions for decades. It’s the most commonly used measure, something like that, to explain that, because that’s a great point: why would laypeople know where the tests are coming from?
Dr. Sharp: Sure. How is this different than a BuzzFeed personality quiz?
Dr. Lovett: Yeah, absolutely. It’s very true. That makes sense.
Dr. Sharp: Okay. Speaking of that, how much information do you put in the evaluation report regarding psychometrics, if any?
Dr. Lovett: It’s a good question. I don’t put any reliability, validity information, or anything like that. I don’t put any statistics about the test. I’ll put a brief description of the measure, but I wouldn’t call that psychometric information. I mainly rely on very [00:57:00] brief information in the oral feedback session and then if there’s a concern to discuss that.
The exception, I guess, would be covering the norm-referencing and explanations of what those scores mean, having a key that describes the relationships and explaining often what a percentile score is, or having a clear correspondence between the narrative that describes it in a very clear and easy way, because I find percentiles are one of the things that, of course, lay people often misinterpret. Not so much clinicians but lay people will often misinterpret percentiles as the percent correct on a cognitive or achievement measure.
There’s no way that my son only got 40% right. How’s that average? That’s failing. Explaining percentile scores ahead of time, and definitely putting that norm-referenced data in the text or in the appendix with the scores, is helpful, but that’s about [00:58:00] it.
I think of the oral feedback as a place where you may want to go into more, especially if there’s a concern. I always say, relate your scores to what the concerns were. As much as possible, show that your assessment tools were addressing what the referral concerns were. I like to focus on percentiles and describe them as the top or bottom 5% or 10% if you’re seeing something that’s in a clinical range or a really abnormal range, or saying that the average range is the middle half of the population or something like that.
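The difference between a percentile and a percent correct can be shown with a short Python sketch. It assumes standard scores with a mean of 100 and a standard deviation of 15 that follow a normal distribution; that scale is a common convention but is an assumption here, and the example score of 96 is made up.

```python
from statistics import NormalDist

# Assumed score scale: mean 100, standard deviation 15, normally distributed.
STANDARD_SCORES = NormalDist(mu=100, sigma=15)


def percentile_rank(standard_score):
    """Percent of the norm group expected to score at or below this score."""
    return 100 * STANDARD_SCORES.cdf(standard_score)


# A hypothetical child with a standard score of 96.
rank_for_96 = percentile_rank(96)
print(f"standard score 96 -> about the {rank_for_96:.0f}th percentile")

# Score range that captures the middle half of the population.
middle_half = (STANDARD_SCORES.inv_cdf(0.25), STANDARD_SCORES.inv_cdf(0.75))
print(f"middle 50% of scores: {middle_half[0]:.0f} to {middle_half[1]:.0f}")
```

Under these assumptions, a score near the 40th percentile sits comfortably inside the middle half of the population. It says the child scored as well as or better than roughly 40 of every 100 same-age peers, not that 40% of the items were answered correctly.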
I know some folks like to use bell curve diagrams. I haven’t done that. It can be helpful. If you have one of those, I suggest that you draw or print little dots or people to show that the height represents the number of people who are there because otherwise, I don’t think a curve makes that clear. I know some folks like to do that. Again, just emphasizing.
If you’re focusing on the clinical data that is abnormal or suggestive [00:59:00] of a condition or some sort of problem, saying that someone’s in the top 10% of kids in terms of ADHD symptoms, or in the top 5% or something like that, can be meaningful. Or saying that someone’s reading skills are in the bottom 10%, and so that’s going to really cause some struggle.
Dr. Sharp: That’s good to hear. I like percentiles as well. I feel like they’re the easiest metric to try to communicate and give some sense of performance. I know there are pitfalls with percentiles just like any score.
Gosh, let’s see. We have talked about a lot of different things. Let me circle back though, you briefly mentioned earlier this idea of outlier scores, scores that are discrepant from other scores, and what we do with that. This is the whole “scatter doesn’t [01:00:00] matter, or does it?” discussion. Getting into that territory. I’d love to hear your thoughts on that.
Dr. Lovett: That’s a dangerous question. One of the things that I talked about earlier was composite extremity. That’s kind of separate: the fact that if you have composite scores, that’s really assessing the probability that someone is really high or really low on a bunch of things. And so it’s even rarer to have scores that are extreme on a bunch of different traits. That’s just briefly why those scores tend to be more extreme.
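The composite extremity point can be sketched numerically. The example below assumes each subtest is expressed as a z-score (mean 0, standard deviation 1) and that every pair of subtests shares the same correlation; both are simplifying assumptions, and the specific values are made up for illustration.

```python
from math import sqrt


def composite_z(subtest_z_scores, mean_intercorrelation):
    """z-score of a composite formed by summing standardized subtests.

    Assumes every pair of subtests correlates at mean_intercorrelation.
    """
    k = len(subtest_z_scores)
    # Variance of a sum of k standardized scores: k + k(k-1) * r.
    sd_of_sum = sqrt(k + k * (k - 1) * mean_intercorrelation)
    return sum(subtest_z_scores) / sd_of_sum


# Hypothetical examinee who is 1.5 SD below the mean on each of four subtests.
subtests = [-1.5, -1.5, -1.5, -1.5]

z_moderate = composite_z(subtests, mean_intercorrelation=0.5)
z_redundant = composite_z(subtests, mean_intercorrelation=1.0)

print(f"composite z when subtests correlate 0.5: {z_moderate:.2f}")
print(f"composite z when subtests correlate 1.0: {z_redundant:.2f}")
```

With imperfectly correlated subtests, the composite lands further from the mean than any single subtest, because being consistently low across several partly independent measures is rarer than being low on one. Only when the subtests are perfectly redundant does the composite match the subtest level.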
In terms of scatter and things like that, I think there are a bunch of issues there. One that I sometimes see people say is that you should focus on composite scores rather than subtest scores. I would say, on average, you should give more weight to scores that have higher reliability coefficients. So on average, I would agree, but that doesn’t mean that you can’t interpret subtest scores.
You can always again as I say, interpret things in a conservative behavioral way [01:01:00] that when someone is responding to a particular type of stimuli, then that’s going to be indicative of how they’re going to respond to similar stimuli in the classroom or something like that. So you can definitely interpret things that way. I know there are some experts who would say, well, yes, but the real influence on that score is still overall ability or “g”. I wouldn’t disagree with that. It’s just that you can still interpret the score behaviorally. I hope those folks would agree in that regard.
Interpreting discrepancies is harder because differences between test scores tend not to be especially reliable. The higher the correlation is between two test scores, the lower the reliability is for the difference between them. That’s confusing. Why would it be that a higher correlation between test scores leads to lower reliability of the discrepancy between them?
The way I like to explain it is, if you have two test scores that are strongly correlated, for instance, [01:02:00] a Full Scale IQ score and a reading score from an achievement battery, and those might be correlated at 0.5 or something like that, maybe more in children. The more strongly they’re correlated, the less likely it will be that a difference between them is genuine because we know in the population, those scores are tightly tethered to each other.
So if someone’s showing a gap, it’s less likely that it’s due to a genuine difference in the traits. The more closely the two test scores are related, the higher the correlation, the less likely you’re going to get a reliable difference between them. And so the size of a discrepancy is not something that I take as especially reliable.
That doesn’t mean that you can’t interpret the subtests individually or you can’t interpret the scores individually and someone has a high level of this and a low level of that, but it’s the magnitude of the discrepancy that is not especially reliable because that’s what we’re looking at for any [01:03:00] reliability statistic; the magnitude of the score. In this case, it’s the magnitude of the gap, the different scores. So that’s not going to be particularly reliable. And so that is definitely a concern.
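The relationship between correlation and the reliability of a difference can be sketched with the classical test theory formula for a difference score, which assumes the two scores are on the same scale with equal variances. The reliability values of 0.90 below are hypothetical; 0.5 is the correlation mentioned in the example above.

```python
def difference_score_reliability(reliability_x, reliability_y, correlation_xy):
    """Classical reliability of the difference X - Y (equal variances assumed)."""
    mean_reliability = (reliability_x + reliability_y) / 2
    return (mean_reliability - correlation_xy) / (1 - correlation_xy)


# Two hypothetical scores, each with a reliability of 0.90, at increasing
# levels of correlation between them.
difference_reliabilities = {
    r_xy: difference_score_reliability(0.90, 0.90, r_xy)
    for r_xy in (0.3, 0.5, 0.7, 0.8)
}

for r_xy, r_diff in difference_reliabilities.items():
    print(f"correlation {r_xy:.1f} -> reliability of the difference {r_diff:.2f}")
```

Even with two individually reliable scores, the reliability of the gap between them falls steadily as the scores become more strongly correlated, which is why the size of a discrepancy deserves more caution than either score alone.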
I don’t think anyone is endorsing simple discrepancies anymore as a way of diagnosing common disability conditions. I know there are still states that use IQ-achievement discrepancies for learning disabilities, but I don’t know any experts that endorse them. I feel like that’s hopefully not a controversial claim among psychology researchers. I don’t think there’s anyone who suggests using those. I think that’s a good thing.
Dr. Sharp: That’s my understanding at this point as well. Yes. It’s gone by the wayside for the most part.
Dr. Lovett: Yeah, exactly. Again, schools have to follow regulations that may not be based on science but that may be why lawmakers need some basic psychometric training.
Dr. Sharp: There we go. [01:04:00] That’s your job. You’re the perfect person.
Dr. Lovett: Go to the capital next. It is a very controversial issue. Those profiles and different scores. I think a lot of it comes from, historically we were interpreting test scores in a cognitive neuropsychological framework, and they were used for some things that were valid and other things that they were less valid for. So it wasn’t uncommon to see personality and psychopathology problems diagnosed on the basis of scatter on the WAIS.
You would hear these things that were at times no more than clinical lore about how when the Object Assembly score is higher than Information, that suggests sociopathic problems. You would hear these sorts of things. I love the history of psychology. I teach a course on it. I actually love that stuff but I don’t teach it in the assessment courses.
Dr. Sharp: [01:05:00] That’s reasonable. It’s funny you mentioned that.
Dr. Lovett: But there are these strong emotions because there were ways that scatter and discrepancies were really misused. The IQ-achievement discrepancy for learning disabilities, I would say, is one of them. It came to be misused more and more over time. I actually think the original use of it in the 1960s was not as bad as it is now. It got to be used worse and worse in some ways.
As you know, some of my work is on high-ability individuals and learning disabilities. It came to be used for folks who have a very high IQ and average skills in an academic area. There are a lot of reasons why that can happen other than a disability.
Dr. Sharp: Right. That’s another podcast that we have partially done. There’s so much stuff that we could say with all of this. [01:06:00] I’m going to take this as a sign that we are evolving and not afraid to shed models.
Dr. Lovett: And the DSM has evolved.
Dr. Sharp: It’s true. Well, let’s see, I wonder if I could close. People always appreciate practical advice. Anything that we could distill for folks who are going into practice tomorrow, of all the things we’ve talked about, two or three things that we can do as practitioners to become more psychometrically knowledgeable.
Dr. Lovett: Absolutely. I think taking a look at the manuals, the parts of the manual beyond the administration chapter and the scoring chapter is a really helpful way to start for the measures that you use the most, especially for folks who are closer to graduate school, people who are starting to practice now but they may not have [01:07:00] had to look at those chapters.
And so it’s the type of thing to look at, to read up on to know more about the measures that you’re using and to look more into the research that uses them. So even again, doing a Google Scholar search for that measure will bring up recent articles that used it. Based on the titles, you might find some useful articles on the properties of the scores from that measure.
I would just say, count that as a part of your professional development. In the same way that you do other work, include psychometrics in there. For psychologists where testing is even a part of their practice, it’s really helpful to be more conversant with those things.
For folks who have already been practicing for a bit or even are just coming from an internship, if there are situations that you found coming up frequently when you’re doing evaluations, either things that you weren’t that great at explaining in feedback sessions or things that you were having trouble with in reports, there are times when we say things or we write things and we’re not 100% on them. We’ve been saying them because we’ve seen them said in other reports. [01:08:00] That’s a cue to us to look more into it so that we really have a more solid foundation on those points because the day will come when someone will ask more about that. It may not be that they’re criticizing us. It might be out of curiosity.
Dr. Sharp: Oh, right. Yes. I’ve been in that position.
Dr. Lovett: And so being able to explain that can be really helpful. I just hope that people make it a part of their more general professional development. People are really understandably eager to go to PD sessions on a new test. Focusing on the psychometric features may not be where people are spending their time. I just hope that will be a part of it.
Dr. Sharp: It’s a good point. Yeah, it’s not the glamorous part of testing.
Dr. Lovett: I went to one webinar. I won’t say what test it’s for, but I went to a webinar for a measure, I’ll say. This was two years ago. I don’t know, it was an hour-long webinar. There was one PowerPoint slide on [01:09:00] psychometrics. It had four bullet points. They were reliable, valid, sensitive, and specific. The presenter said that their measure was reliable, valid, sensitive, and specific. In New York City, we say, now we could sell you the Brooklyn Bridge if you believe that. So you need to look at the manual first.
Dr. Sharp: That’s great. Read the manual, folks.
Dr. Lovett: Exactly. That’s what I would say. I hope people think about those things and talk about them. We’re always talking with our colleagues about other assessment-related issues, I hope these will also become things that we’re talking about.
Dr. Sharp: Sure. Well, I know that you have a lot of resources out there. We have somehow gone this whole time without mentioning your most recent book but if folks want to learn more about this stuff, it’s a great resource, right? Can you speak a little bit?
Dr. Lovett: I appreciate it very much. Thanks so much. I [01:10:00] recently published a book called Practical Psychometrics. I originally designed it for students, but I think it also works for practitioners who are interested in reviewing basic psychometric concepts as applied to particular tests. I use lots of examples from common tests in psychology, education, and other areas that come directly from the manuals to really see how we interpret that data.
There’s a whole chapter actually on factor analysis that I work to make very clear and accessible. There are almost no formulas in the book. It’s not a textbook in that sense. It’s a book that’s really about how to understand, in narrative ways, all the psychometric data that we’re seeing. It does have some clinical advice about different topics that come up.
There’s a whole chapter on report writing and oral feedback about test scores focusing on how to give clear, accurate feedback about that information. There’s a whole chapter on bias and fairness issues where I cover things like, for instance, response [01:11:00] validity about individual case scores. I know you’ve had some podcasts that have covered those issues. Like what if someone might not be performing up to their full ability? Or you’re assessing someone who’s fatigued or ill or occasionally maybe exaggerating problems. How do you detect that? Because that’s a validity issue.
So I tried to make things very applied to actual situations that we deal with. There are exercises at the end of the chapters that really force folks to explain things themselves and apply the concepts to cases. I hope it will be helpful for students, certainly, but also for practitioners who are interested.
Dr. Sharp: That’s great.
Dr. Lovett: Appreciate that.
Dr. Sharp: Oh, yeah, 100%. I think that’s what we need. We get the formulas, the math, and the statistics in graduate school. This is what we need when we’ve been out of the game for a little while, at least out of touch, I suppose.
Dr. Lovett: No, thank you. I agree entirely. I find that folks have done well in courses but may not be able to talk about those concepts. [01:12:00] So hopefully that’ll be helpful for that.
Dr. Sharp: Yeah. We’ll link to it in the show notes, of course, so folks can go grab it, and check it out.
Dr. Lovett: I appreciate it.
Dr. Sharp: Well, thanks again for being here. It’s always a pleasure to chat with you.
Dr. Lovett: Likewise. I appreciate the questions and the conversation and I look forward to keeping it up.
Dr. Sharp: Yeah, likewise.
All right, y’all. Thank you so much for tuning into this episode. Always grateful to have you here. I hope that you take away some information that you can implement in your practice and in your life. Any resources that we mentioned during the episode will be listed in the show notes so make sure to check those out.
If you like what you hear on the podcast, I would be so grateful if you left a review on iTunes or Spotify or wherever you listen to your podcasts.
If you’re a practice owner or aspiring practice owner, I’d invite you to check out The Testing Psychologist mastermind groups. I have mastermind groups at every stage of practice development: Beginner, [01:13:00] intermediate, and advanced. We have homework, we have accountability, we have support, we have resources. These groups are amazing. We do a lot of work and a lot of connecting. If that sounds interesting to you, you can check out the details at thetestingpsychologist.com/consulting. You can sign up for a pre-group phone call and we will chat and figure out if a group could be a good fit for you. Thanks so much.
The information contained in this podcast and on The Testing Psychologist website is intended for informational and educational purposes only. Nothing in this podcast or on the website is intended to be a substitute for professional, psychological, [01:14:00] psychiatric or medical advice, diagnosis, or treatment. Please note that no doctor-patient relationship is formed here and similarly, no supervisory or consultative relationship is formed between the host or guests of this podcast and listeners of this podcast. If you need the qualified advice of any mental health practitioner or medical provider, please seek one in your area. Similarly, if you need supervision on clinical matters, please find a supervisor with expertise that fits your needs.