Generative AI for standardised assessments: A conversation.
This section is an interview between Alina von Davier (Duolingo and Edastratech, United States) and the OECD Secretariat. The conversation explores the new possibilities that generative AI (GenAI) offers for developing and implementing standardised, high-stakes assessments. After showing how GenAI can enhance the productivity of item development, the discussion turns to assessment innovations that only GenAI makes possible, taking the assessment of foreign languages as a case in point. Throughout, the processes remain tightly controlled by humans and usually combine different types of artificial intelligence.
OECD: Many AI tools work quite effectively
for assessment, sometimes even better than
generative AI. However, we want to explore what
new possibilities generative AI can offer. We are
interested in two main areas: how generative AI can help perform traditional assessment tasks
more effectively – for instance, item generation
for standardised assessments – and how it can
enable different and better assessments. Perhaps
we can start with the first aspect?
Alina von Davier: For assessment, at Duolingo
we utilise AI end-to-end, but not in isolation: it just
contributes to our processes. For example, we
use generative AI to generate items at scale after
human experts have designed item prototypes. The
content experts collaborate with the AI engineers and
psychometricians during the design phase in order to
ensure that the design of the new item type is viable
in operational settings. Once the initial item design
is complete and we are comfortable with it, the AI engineers and scientists define the desired display
of the items for the online delivery and create scripts
that will generate multiple items on a large scale.
After that, the items are reviewed by human experts
for quality, fairness, and appropriateness for a global
administration. This has become a mainstream process.
We created the item factory, a full system where
humans and machines collaborate. Let me emphasise
that there is a substantial amount of work involved at
the outset. It demands a very high level of expertise and
considerable effort when you set up a GenAI system for
a particular item type for the first time. That's where the
main work lies: in setting things up and determining
what works and what doesn't. However, once that
setup is complete, efficiency increases tenfold
compared to human development. It's incredible in terms
of speed and cost. We remain very conservative and
continue to use humans to review every single item,
but we plan to explore how to make this even more
efficient.
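As a rough illustration of such an item factory, here is a minimal sketch in Python. The llm_generate placeholder and all names are hypothetical stand-ins, not Duolingo's system; the real pipeline is far more elaborate, and every generated item still goes to human review.

```python
# Hypothetical sketch of a template-driven item factory: humans design
# the prototype and review every output; the script only automates the
# generation step in between.

from dataclasses import dataclass

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a GenAI service (assumption)."""
    raise NotImplementedError

@dataclass
class ItemPrototype:
    """Designed by content experts, psychometricians and AI engineers."""
    task_type: str    # e.g. "fill-in-the-blank"
    cefr_level: str   # target difficulty, e.g. "B2"
    instructions: str # constraints agreed on during the design phase

def build_prompt(proto: ItemPrototype, topic: str) -> str:
    # The prompt is assembled by a script, not written ad hoc.
    return (
        f"Write one {proto.task_type} item at CEFR level {proto.cefr_level} "
        f"about '{topic}'.\nConstraints: {proto.instructions}"
    )

def generate_batch(proto: ItemPrototype, topics: list[str]) -> list[dict]:
    drafts = []
    for topic in topics:
        text = llm_generate(build_prompt(proto, topic))
        # Every draft is queued for human review of quality, fairness
        # and appropriateness before it can be used operationally.
        drafts.append({"topic": topic, "text": text, "status": "pending_review"})
    return drafts
```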
OECD: Beyond enhancing the productivity of
conventional assessment design, how can we use GenAI to innovate the ways we assess people’s
knowledge and skills?
Alina von Davier: In our Duolingo English Test, we
have, for example, two other applications of GenAI: one
for a writing task and one for a speaking task. In April
2024, we launched a new writing task, which works as
follows. We provide a prompt, such as "Please write
about topic X – you have y minutes", and partway
through the assignment, the AI intervenes in real time,
analysing the text that has been written so far and
comparing it to a set of themes we created for that
specific item. The AI then acts as a peer or a professor,
suggesting to continue the writing by covering new
sub-topics, for example asking "Can you also write
about this?”. This type of interactive capability was not
possible for such items before GenAI: it allows the task
to resemble a real-life situation more closely, thereby
offering greater authenticity – something most assessments lack.
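A toy sketch of the single mid-task intervention just described: the draft is compared with the themes defined for the item, and the AI asks about one that has not yet been covered. The keyword matching below is a deliberate simplification of whatever analysis a production system would use; the themes and cues are invented.

```python
# Illustrative sketch of the mid-task intervention in the writing item:
# find a theme the draft has not covered yet and ask about it.

ITEM_THEMES = {
    "cost":        {"price", "money", "afford", "expensive", "cheap"},
    "environment": {"pollution", "climate", "emissions", "recycling"},
    "health":      {"exercise", "diet", "stress", "wellbeing"},
}

def uncovered_themes(draft: str) -> list[str]:
    words = set(draft.lower().split())
    return [t for t, cues in ITEM_THEMES.items() if not words & cues]

def follow_up(draft: str) -> str | None:
    missing = uncovered_themes(draft)
    if not missing:
        return None  # every theme is covered; no intervention needed
    # The AI acts like a peer or professor suggesting a new direction.
    return f"Can you also write about the {missing[0]} aspect of this topic?"
```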
More recently, in July 2025, we launched an interactive,
adaptive speaking task, during which a test taker
converses with an AI agent. The generative aspect
primarily involves generating the agent's utterances.
While it's not a live agent and is extremely constrained,
it allows us to create interactivity, as was the case for the
writing item. These two examples use generative AI to
assess differently than before. It would not be possible
without the technology.
The writing task involves only one intervention from
the AI. You receive a prompt, you write, and then the AI
comes in and asks you to write more about a specific
topic. In contrast, the speaking task is a conversation,
involving multiple interactions. Managing these multiple
interactions is what makes it challenging. It's difficult
because the AI needs to be able to “understand”
what the person says. When test-takers are non-native
speakers of the language that is being tested, meaning
they have all types of accents and abilities, a lot of work
goes into ensuring the AI can understand each person's
speech, evaluate it, and then select the appropriate
response to that person. So, it's actually much more
difficult to implement than the writing task. To my
knowledge, this is the first time that a high-stakes
assessment with millions of test takers has included such
an interactive and adaptive speaking task.
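The skeleton below illustrates, with hypothetical function names, the kind of constrained adaptive loop described here: ASR and real-time evaluation are the AI-heavy steps, while the agent's next utterance is selected from a pre-approved set keyed to the current proficiency estimate. It is a sketch under those assumptions, not the actual implementation.

```python
# Hypothetical skeleton of the adaptive speaking loop. The agent's
# next utterance is *selected* from a constrained, reviewed set,
# never free-formed.

APPROVED_TURNS = {
    "low":  ["What do you do every day?", "Do you like your city?"],
    "high": ["How has technology changed the way you work?",
             "What trade-offs did that decision involve?"],
}

def transcribe(audio: bytes) -> str:
    """Placeholder for an ASR system handling many accents (assumption)."""
    raise NotImplementedError

def evaluate(utterance: str) -> float:
    """Placeholder for real-time proficiency evaluation (assumption)."""
    raise NotImplementedError

def next_agent_turn(audio: bytes, turn_index: int) -> str:
    utterance = transcribe(audio)
    score = evaluate(utterance)
    # Adapt: a simpler conversation for lower current estimates.
    band = "high" if score >= 0.6 else "low"
    turns = APPROVED_TURNS[band]
    return turns[turn_index % len(turns)]
```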
OECD: What is the purpose of these two tasks?
What do you want to assess (or to achieve)?
Alina von Davier: As I mentioned before, these two
tasks are examples of something that could not be
achieved previously. We are trying to accomplish two main objectives with this approach: authenticity and
support for test-takers. Before, you could only say,
"Write for 10 minutes"; you could not evaluate what
test-takers wrote in the middle of the assignment and
then encourage them to write more. The key
difference here is that in real life, people often have their
writing reviewed and receive suggestions on how to
proceed further.
Take the writing task. First, we believe this makes the task
more akin to real-life situations. For instance, at college
or university, someone reviews your writing, provides
feedback, and asks you to expand on it. We do the same.
This is the interactivity aspect. But second, we also aim
to assist test-takers. When we provide an initial prompt
for the writing task, it often has multiple potential writing
directions. We want to encourage test-takers to cover
other aspects they haven't yet addressed, helping them
become better writers and giving them an opportunity to
demonstrate their ability to write about different topics.
At the end, what we assess is the quality of their writing
in English as a foreign language.
Regarding the speaking task, there is currently no other
high-stakes test that uses purely technology-enabled
interactive speaking. Previously, tests might have offered
a prompt, you listened and responded, then listened
to something else and responded, but it was neither
interactive nor adaptive. Our speaking task is both
adaptive and interactive. For example, if a test-taker's
English proficiency level is not very high, the AI agent
will adjust and engage in a simpler conversation. This
was simply not possible before. The only other English proficiency test that features a real, back-and-forth
interview is the IELTS English test, but they conduct it
with humans. As a test-taker, you have to schedule an
appointment, travel to a test centre, and speak with a
human. We are trying to maintain that conversation
but remove the logistics that come with travelling to a
centre. It's an extremely expensive and difficult process
for test takers to travel to a centre and take a test
delivered by human interviewers. Our test is continuous
– it can be taken anytime, anywhere – hence a technology-based solution makes more sense for this delivery model.
Furthermore, humans have their own issues, such as the
halo effect, for example: if a test-taker responds well on
one question, the examiner may transfer this positive
impression to evaluating the following questions.
OECD: What is the level of efficacy of these tasks?
I assume you've tested how they perform for
those taking the test: does it work well? How does
it compare to human raters? And finally, how
do you combine traditional AI with generative AI in this highly constrained AI scenario, given
your requirements like adaptivity and potentially
broadening the topics of the conversation? How
does it work?
Alina von Davier: After the setup, the generation of
these tasks at scale is extremely efficient. Moreover, the
quality is outstanding: experts cannot tell them apart
from items written by humans.
To your second question: the work we do is not solely
with generative AI; other types of AI models and
psychometric models are also invoked. GenAI and other AI programmes work together. We write models and
scripts, and we have many scripts that call upon AI and
GenAI for different applications. People have to realise
that GenAI is just one type, and we use many other types
as well.
Let's consider the specific speaking task I described
above. Parts of it can only be accomplished with
generative AI, such as understanding what the person
says and evaluating it quickly in real time. Then, in the
rest of the scripts, having the agent select the correct
answer and deliver the right spoken response to the
test-taker is not solely a GenAI process. It includes
psychometrics as well. So, there is a component that
only generative AI can handle, but other parts that
are done by other types of AI and psychometrics,
hence computational psychometrics. For the writing
task, at the scale at which we operate, the real-time
reading of the text can only be done by generative AI.
However, everything I’ve described is embedded within
other scripts and programmes. GenAI is not used as a
standalone process where we simply give the GenAI a
prompt and say, "Do this". We have a script that designs
the prompt and then feeds it to the GenAI. It's quite an
elaborate process. That's why I mentioned that “setting it
up” requires a great deal of expertise and time.
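A minimal sketch of the orchestration pattern described here, with hypothetical helper names: the GenAI call sits between a script that designs the prompt and scripts that check the output, rather than being used standalone.

```python
# Sketch of the point being made: the GenAI call never stands alone.
# A script designs the prompt from a vetted specification, and other
# scripts gate the output. All names are illustrative assumptions.

def llm_generate(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the GenAI service

def design_prompt(spec: dict) -> str:
    # Pre-processing: the prompt is assembled programmatically.
    return (f"Task: {spec['task']}\n"
            f"Audience: {spec['audience']}\n"
            f"Constraints: {'; '.join(spec['constraints'])}")

def passes_checks(output: str, spec: dict) -> bool:
    # Post-processing: non-GenAI scripts (and, downstream,
    # psychometric checks) decide whether the output is usable.
    return 0 < len(output.split()) <= spec["max_words"]

def run_step(spec: dict) -> str | None:
    output = llm_generate(design_prompt(spec))
    return output if passes_checks(output, spec) else None
```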
OECD: So, what is the next step after the writing
and speaking tasks? Is it leading to another task
or a score? Do you score your test-takers with AI?
Alina von Davier: Yes, scoring is done with AI and
psychometrics, but using machine learning, not
GenAI. For scoring the speaking task, there is one
component that relies on generative AI: the evaluation
of pronunciation. We employ multiple forms of AI, not
just large language models (LLMs). For instance, we
use an automatic speech recognition (ASR) system for
sound processing, text-to-speech, and speech-to-text.
So, yes, we already use AI for scoring, and also even
for proctoring – that's what I meant by end-to-end.
However, we also use psychometric models to obtain the
final score and evaluate the reliability and validity of the
scores.
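As an illustration of how AI sub-scores and psychometrics can be combined, here is a minimal sketch: the sub-scores, weights and the split-half reliability step are illustrative assumptions, not the Duolingo English Test's actual scoring model.

```python
# Minimal sketch of combining AI sub-scores psychometrically.

def combine_subscores(subscores: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted composite of sub-scores from different AI components
    (e.g. ASR-based pronunciation, ML grammar grading)."""
    total_weight = sum(weights.values())
    return sum(subscores[k] * w for k, w in weights.items()) / total_weight

def spearman_brown(r_half: float) -> float:
    """Standard psychometric step-up of a split-half reliability."""
    return 2 * r_half / (1 + r_half)

final = combine_subscores(
    {"pronunciation": 0.72, "grammar": 0.81, "vocabulary": 0.77},
    {"pronunciation": 1.0, "grammar": 1.5, "vocabulary": 1.0},
)
print(f"composite: {final:.2f}, reliability: {spearman_brown(0.78):.2f}")
```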
Our next significant task involving GenAI will be the
provision of feedback, starting with writing. This is almost
ready, but it's primarily for the practice part of the test.
Another desired improvement would be to relax some of
the constraints on the AI speaking agent. We need to be
confident that we maintain comparability though, which is extremely important in standardised testing. If you
allow an LLM to operate independently, you risk losing
comparability. One time you get one response, and the
next time a different response. While that's acceptable
in some contexts, it's not suitable for high-stakes
tests. That's why we have these constraints. We can, of
course, adjust or relax them to make the tests more
authentic, but we still need to maintain the accuracy and
comparability required for a quality exam.
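To make the comparability point concrete, here is a minimal sketch of two common levers, under assumed generic parameter names rather than any specific vendor API: deterministic generation settings, and a pre-approved response set that the model can only rank, never rewrite.

```python
# Tiny sketch of how constraints protect comparability.

GENERATION_CONFIG = {
    "temperature": 0.0,  # remove sampling randomness
    "seed": 1234,        # fixed seed where the backend supports one
}  # would be passed to the generation backend (assumption)

ALLOWED_RESPONSES = [
    "Could you tell me more about that?",
    "Why do you think that happened?",
]

def constrained_choice(model_preference: list[float]) -> str:
    # The model only *ranks* pre-approved responses; it never free-forms.
    best = max(range(len(ALLOWED_RESPONSES)), key=lambda i: model_preference[i])
    return ALLOWED_RESPONSES[best]
```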
OECD: Precisely, for high-stakes tests, there
have been intense discussions about the grading
of exams and papers. It seems that "good old-fashioned AI" performs better than generative AI in terms of accuracy and consistency. People
seem to suggest that older machine learning
tends to be more accurate but less flexible. It
requires a lot of time and money to train, and if
you change the task, you often have to restart
and redo it. Whereas with generative AI, you don't
lose as much time because it adapts much better
to new contexts, but you might not achieve the
same level of accuracy. This means that, if it's not
a high-stakes scenario, it's perfectly acceptable,
but if it is high-stakes, you need to think carefully
about your approach. Do you agree with this
assessment? Do you think this is likely to change?
What is your experience?
Alina von Davier: First, let me say that the most
important thing is to know how to use GenAI properly. Simply
using a prompt one dreamed up one morning is not
going to lead to a good-quality assessment. It doesn't work
like that. One has to think carefully and plan. People
hope that if they just say, "Do this for me," it will do it
perfectly, but that's not the case. We need to be very
cautious when people make overly positive or negative
claims: take it with a grain of salt because most people
may not have built sufficient experience with GenAI.
That's my main observation now. Many think it's easy
and conversational, but it isn't that simple for high-quality exams. One may obtain a full range of output
quality from generative AI if one prompts it properly
or if one builds it correctly. When I teach the use of generative AI for item generation, I advise people to
"divide and conquer". By that, I mean, don't try to get
everything done with one prompt. Let's say one needs
an assessment passage followed by questions for 8th
graders. Don't put it all into one prompt; it won't be
very good. I suggest people divide the task: first
generate the passage, review it to ensure its quality,
and when one is satisfied, develop the questions.
That approach works so much better, but it is not yet
widespread. Many people try to put everything into one
prompt, don't experiment further, and then claim it's not
working well. So, I would say: be careful, and question
extreme claims, whether positive or negative.
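A minimal sketch of the "divide and conquer" advice, assuming a placeholder llm_generate call: the passage is generated and reviewed first, and the questions are only requested once the passage is settled.

```python
# "Divide and conquer" prompting: generate the passage first, gate it
# on a review, and only then ask for the questions.

def llm_generate(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a GenAI service (assumption)

def passage_is_acceptable(passage: str) -> bool:
    # In practice this review is done by a human expert; an automated
    # pre-check can sit in front of it.
    return len(passage.split()) > 150

def build_reading_item(topic: str) -> dict | None:
    # Step 1: the passage alone.
    passage = llm_generate(
        f"Write a 250-word reading passage for 8th graders about {topic}."
    )
    if not passage_is_acceptable(passage):
        return None  # fix or regenerate before going any further
    # Step 2: questions, only once the passage is settled.
    questions = llm_generate(
        "Write four comprehension questions for this passage:\n" + passage
    )
    return {"passage": passage, "questions": questions}
```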
OECD: It's an interesting observation because
there is a new research literature on how many
prompts are needed to achieve comparable
quality to older machine learning types of scoring.
Alina von Davier: For scoring, we use our own machine
learning models, sometimes with some generative AI
components. We use our own models, but our biggest
concern – as big as accuracy – is comparability. If
generative AI scores the same essay one way at one time
and differently at another, it affects comparability and
replicability, which are crucial for a reliable assessment. GenAI can sometimes be accurate, but again, it depends
on how one uses it. For instance, if one is exploring using
generative AI for scoring essays, and if one provides it
with a very good rubric (the same rubric one would give
to human raters) and a few more examples than one
usually would, I believe it can do quite a good job. However,
this depends on the task's complexity and the purpose
of the scores. It varies. It's true that all these applications
are still task-specific for both generation and scoring.
While some parts can be reused, generally, the main
model is task-specific, and one needs to test it again to
see if it works for other tasks.
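As a sketch of the rubric-plus-examples pattern just described – an exploratory pattern, not a claim about any live system – the rubric and scored examples below are invented for illustration.

```python
# Sketch of rubric-plus-examples essay scoring: the model sees the
# same rubric a human rater would get, plus a few scored examples.

RUBRIC = """Score 1-5. 5: clear thesis, well-organised, few errors.
3: understandable but uneven organisation, frequent errors.
1: off-topic or very hard to follow."""

FEW_SHOT = [
    ("The internet help people learn many thing...", 2),
    ("Remote work has reshaped urban economies in three ways...", 5),
]

def llm_generate(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a GenAI call (assumption)

def score_essay(essay: str) -> str:
    examples = "\n".join(f"Essay: {t}\nScore: {s}" for t, s in FEW_SHOT)
    prompt = (f"Rubric:\n{RUBRIC}\n\nScored examples:\n{examples}\n\n"
              f"Essay: {essay}\nScore:")
    return llm_generate(prompt)
```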
We also incorporate numerous checks afterwards, with
multiple filters to ensure nothing is released if it's not
good enough. We have automatic tools for monitoring
the quality of our assessments. The first tool we built
is called AQuAA, which stands for Analytics for Quality
Assurance in Assessment. This tool incorporates some
machine learning models and a lot of psychometrics. It
functions as an alert system, continuously analysing data
as it comes in. If anything unusual occurs, we receive an alert. We have another, newer system called AQuAP, which stands
for Analytics for Quality Assurance for the Pools (item
pools). It also operates in the background, monitoring
the items to ensure they don't suddenly become more
difficult or exhibit other unusual behaviours within the
pool. This is also an automatic tool that heavily utilises
machine learning. We are also developing one called
AQuATT, for test-takers, which will focus on the test-taker
level.
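To give a flavour of the kind of background check such a monitor might run – purely illustrative, not AQuAP's actual method – the sketch below flags an item whose recent difficulty drifts from its historical baseline by more than a few standard errors.

```python
# Illustrative drift check: alert when an item's recent
# proportion-correct departs from its historical baseline.

import statistics

def difficulty_alert(history: list[float], recent: list[float],
                     threshold: float = 3.0) -> bool:
    """True if recent proportion-correct drifts from the baseline."""
    mu = statistics.mean(history)
    se = statistics.stdev(history) / len(recent) ** 0.5
    z = (statistics.mean(recent) - mu) / se
    return abs(z) > threshold

# Example: an item that suddenly gets much easier (possible exposure).
if difficulty_alert(history=[0.55, 0.52, 0.57, 0.54, 0.56],
                    recent=[0.80, 0.83, 0.79]):
    print("alert: item behaviour changed; route for review")
```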
OECD: Thank you so much for sharing how AI and
GenAI are involved in current assessments.
Alina von Davier: My pleasure. If you go to the
Duolingo English Test website, there's something called a
practice hub. Just take a look to see what these items are
like. It's all free and open, so anyone interested in knowing
more can just check it out.