Generative AI for standardised assessments: A conversation.

 


This section is an interview between Alina von Davier (Duolingo and EdAstra Tech, United States) and the OECD Secretariat. The conversation explores the new possibilities that generative AI (GenAI) offers for developing and implementing standardised and high-stakes assessments. After showing how GenAI can enhance the productivity of item development, the discussion turns to innovations that GenAI makes possible, taking the assessment of foreign languages as a case in point. Throughout, the processes remain tightly controlled by humans and usually involve several types of artificial intelligence.


OECD: Many AI tools work quite effectively for assessment, sometimes even better than generative AI. However, we want to explore what new possibilities generative AI can offer. We are interested in two main areas: how generative AI can help perform traditional assessment tasks more effectively – for instance, item generation for standardised assessments – and how it can enable different and better assessments. Perhaps we can start with the first aspect?

Alina von Davier: For assessment, at Duolingo we utilise AI end-to-end, but not in isolation: it contributes to our processes. For example, we use generative AI to generate items at scale after human experts have designed item prototypes. The content experts collaborate with the AI engineers and psychometricians during the design phase to ensure that the design of the new item type is viable in operational settings. Once the initial item design is complete and we are comfortable with it, the AI engineers and scientists define the desired display of the items for online delivery and create scripts that will generate multiple items at a large scale. After that, the items are reviewed by human experts for quality, fairness and appropriateness for a global administration. This has become a mainstream process. We created the item factory, a full system where humans and machines collaborate.

Let me emphasise that there is a substantial amount of work involved at the outset. It demands a very high level of expertise and considerable effort when you set up a GenAI system for a particular item type for the first time. That's where the main work lies: in setting things up and determining what works and what doesn't. However, once that setup is complete, efficiency increases tenfold compared to human-only development. It's incredible in terms of speed and cost. We remain very conservative and continue to use humans to review every single item, but we plan to explore how to make this even more efficient.
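To make the shape of such a pipeline concrete, here is a minimal sketch of a template-driven generation step. Everything in it – the call_llm placeholder, the prompt template, the review flag – is hypothetical and stands in for whatever LLM client and infrastructure an assessment programme actually uses; it is not Duolingo's item factory.

```python
# Hypothetical, minimal sketch of a template-driven item-generation step.
# call_llm is a placeholder for a real LLM client; the prompt template
# represents the item prototype fixed by human experts during design.

def call_llm(prompt: str) -> str:
    # Replace with a real LLM API call; here we return a stub string.
    return f"[model output for: {prompt[:60]}...]"

# Template settled on by content experts, AI engineers and psychometricians.
ITEM_TEMPLATE = (
    "Write a {n_words}-word reading passage at CEFR level {level} on the "
    "topic '{topic}', followed by one multiple-choice comprehension "
    "question with four options and an answer key."
)

def generate_item_batch(topics: list[str], level: str = "B2",
                        n_words: int = 150) -> list[dict]:
    """Generate one draft item per topic; every draft still goes to humans."""
    drafts = []
    for topic in topics:
        prompt = ITEM_TEMPLATE.format(n_words=n_words, level=level, topic=topic)
        drafts.append({
            "topic": topic,
            "draft": call_llm(prompt),
            "status": "needs_human_review",  # quality/fairness review step
        })
    return drafts

batch = generate_item_batch(["urban gardening", "public transport"])
```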



OECD: Beyond enhancing the productivity of conventional assessment design, how can we use GenAI to innovate the ways we assess people's knowledge and skills?

Alina von Davier: In our Duolingo English Test, we have, for example, two other applications of GenAI: one for a writing task and one for a speaking task. In April 2024, we launched a new writing task, which works as follows. We provide a prompt, such as "Please write about topic X – you have y minutes", and while test takers complete the assignment, the AI intervenes in real time, analysing the text that has been written so far and comparing it to a set of themes we created for that specific item. The AI then acts as a peer or a professor, suggesting that the writer continue by covering new sub-topics, for example asking "Can you also write about this?". This type of interactive capability was not possible for such items before GenAI: it allows the task to more closely resemble a real-life task, thereby offering greater authenticity – something most assessments lack.

More recently, in July 2025, we launched an interactive, adaptive speaking task, during which a test taker converses with an AI agent. The generative aspect primarily involves generating the agent's utterances. While it's not a live agent and is extremely constrained, it allows us to create interactivity, as was the case for the writing item.

These two examples use generative AI to assess differently than before; it would not be possible without the technology. The writing task involves only one intervention from the AI: you receive a prompt, you write, and then the AI comes in and asks you to write more about a specific topic. In contrast, the speaking task is a conversation, involving multiple interactions. Managing these multiple interactions is what makes it challenging. It's difficult because the AI needs to "understand" what the person says. When test takers are non-native speakers of the language being tested, meaning they have all types of accents and abilities, a lot of work goes into ensuring the AI can understand each person's speech, evaluate it, and then select the appropriate response to that person. So, it's actually much more difficult to implement than the writing task. To my knowledge, this is the first time that a high-stakes assessment with millions of test takers has included such an interactive and adaptive speaking task.
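To illustrate the shape of the mid-task intervention in the writing item, here is a minimal sketch in which themes are keyword lists and coverage is checked by simple matching. In the actual task the comparison is done semantically by GenAI; the THEMES data, function names and follow-up wording below are all invented for illustration.

```python
# Hypothetical sketch of a mid-task writing intervention: check which of the
# item's human-authored themes the essay-so-far has covered, and prompt the
# test taker about one that is missing. A real system would use GenAI for a
# semantic comparison; plain keyword matching keeps this sketch simple.

THEMES = {
    "costs": ["price", "cost", "afford", "budget"],
    "the environment": ["pollution", "emissions", "climate"],
    "daily life": ["commute", "family", "routine"],
}

def uncovered_themes(essay_so_far: str) -> list[str]:
    text = essay_so_far.lower()
    return [theme for theme, keywords in THEMES.items()
            if not any(kw in text for kw in keywords)]

def follow_up_prompt(essay_so_far: str) -> str | None:
    missing = uncovered_themes(essay_so_far)
    if not missing:
        return None  # all planned sub-topics covered; no intervention
    return f"Can you also write about {missing[0]}?"

print(follow_up_prompt("Cars are expensive, and the price keeps rising."))
# -> Can you also write about the environment?
```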

OECD: What is the purpose of these two tasks? What do you want to assess (or to achieve)? 
Alina von Davier: As I mentioned before, these two tasks are examples of something that could not be achieved previously. We are trying to accomplish two main objectives with this approach: authenticity and support for test-takers. Previously, you could only say, "Write for 10 minutes"; you could not evaluate what test takers wrote in the middle of the assignment and then encourage them to write more. The key difference is that in real life, people often have their writing reviewed and receive suggestions on how to proceed.

Take the writing task. First, we believe this makes the task more akin to real-life situations. For instance, at college or university, someone reviews your writing, provides feedback, and asks you to expand on it. We do the same. This is the interactivity aspect. But second, we also aim to assist test-takers. When we provide an initial prompt for the writing task, it often has multiple potential writing directions. We want to encourage test-takers to cover other aspects they haven't yet addressed, helping them become better writers and giving them an opportunity to demonstrate their ability to write about different topics. At the end, what we assess is the quality of their writing in English as a foreign language.

Regarding the speaking task, there is currently no other high-stakes test that uses purely technology-enabled interactive speaking. Previously, tests might have offered a prompt: you listened and responded, then listened to something else and responded, but it was neither interactive nor adaptive. Our speaking task is both adaptive and interactive. For example, if a test-taker's English proficiency level is not very high, the AI agent will adjust and engage in a simpler conversation. This was simply not possible before. The only other English proficiency test that features a real, back-and-forth interview is the IELTS English test, but they conduct it with humans. As a test-taker, you have to schedule an appointment, travel to a test centre, and speak with a human. We are trying to maintain that conversation while replacing the logistics that come with travelling to a centre. It's an extremely expensive and difficult process for test takers to travel to a centre and take a test delivered by human interviewers. Our test is continuous – it can be taken anytime, anywhere – hence a technology-based solution makes more sense for this delivery model. Furthermore, humans have their own issues, such as the halo effect: if a test-taker responds well to one question, the examiner may transfer this positive impression when evaluating the following questions.

OECD: What is the level of efficacy of these tasks? I assume you've tested how they perform for those taking the test: does it work well? How does it compare to human raters? And finally, how do you combine traditional AI with generative AI in this highly constrained AI scenario, given your requirements like adaptivity and potentially broadening the topics of the conversation? How does it work?
Alina von Davier: After the setup, the generation of these tasks at scale is extremely efficient. Moreover, the quality is outstanding: experts cannot tell them apart from those generated by humans. To your second question: the work we do is not solely with generative AI; other types of AI models and psychometric models are also invoked. GenAI and other AI programmes work together. We write models and scripts, and we have many scripts that call upon AI and GenAI for different applications. People have to realise that GenAI is just one type of AI, and we use many other types as well.

Let's consider the specific speaking task I described above. Parts of it can only be accomplished with generative AI, such as understanding what the person says and evaluating it quickly in real time. Then, as part of the rest of the scripts, having the agent select the correct answer and deliver the right spoken response to the test taker is not solely a GenAI process. It includes psychometrics as well. So, there is a component that only generative AI can handle, but other parts are done by other types of AI and psychometrics – hence, computational psychometrics. For the writing task, at the scale at which we operate, the real-time reading of the text can only be done by generative AI. However, everything I've described is embedded within other scripts and programmes. GenAI is not used as a standalone process where we simply give it a prompt and say, "Do this". We have a script that designs the prompt and then feeds it to the GenAI. It's quite an elaborate process. That's why I mentioned that "setting it up" requires a great deal of expertise and time.
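As a toy illustration of how scripted logic and a running proficiency estimate might wrap around the GenAI components, consider this sketch of one turn of a constrained speaking agent. The tiers, thresholds, smoothing weights and the evaluate_turn stub are all invented; the real pipeline is far more elaborate.

```python
# Hypothetical sketch of one turn of a constrained, adaptive speaking agent.
# The GenAI components (ASR plus response evaluation) are collapsed into the
# evaluate_turn stub; the scripted layer keeps a running ability estimate
# and selects the agent's next utterance from a pre-approved bank.

UTTERANCE_BANK = {
    "simple":   "What do you like to do at the weekend?",
    "moderate": "Tell me about a memorable trip you have taken.",
    "advanced": "How do you think remote work will reshape cities?",
}

def evaluate_turn(transcript: str) -> float:
    # Stand-in for GenAI evaluation of the spoken response, scaled to 0..1.
    return min(1.0, len(transcript.split()) / 40)

def next_utterance(transcript: str, ability: float) -> tuple[str, float]:
    """Update the ability estimate and pick a difficulty-matched utterance."""
    ability = 0.7 * ability + 0.3 * evaluate_turn(transcript)  # smoothed
    tier = ("simple" if ability < 0.35
            else "moderate" if ability < 0.7
            else "advanced")
    return UTTERANCE_BANK[tier], ability

utterance, ability = next_utterance("I like reading and I play football.", 0.5)
```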

OECD: So, what is the next step after the writing and speaking tasks? Is it leading to another task or a score? Do you score your test-takers with AI? 
Alina von Davier: Yes, scoring is done with AI and psychometrics, but using machine learning, not GenAI. For scoring the speaking task, there is one component that relies on generative AI: the evaluation of pronunciation. We employ multiple forms of AI, not just large language models (LLMs). For instance, we use an automatic speech recognition (ASR) system for sound processing, text-to-speech, and speech-to-text. So, yes, we already use AI for scoring, and even for proctoring – that's what I meant by end-to-end. However, we also use psychometric models to obtain the final score and evaluate the reliability and validity of the scores.

Our next significant task involving GenAI will be the provision of feedback, starting with writing. This is almost ready, but it's primarily for the practice part of the test. Another desired improvement would be to relax some of the constraints on the AI speaking agent. We need to be confident that we maintain comparability, though, which is extremely important in standardised testing. If you allow an LLM to operate independently, you risk losing comparability: one time you get one response, and the next time a different response. While that's acceptable in some contexts, it's not suitable for high-stakes tests. That's why we have these constraints. We can, of course, adjust or relax them to make the tests more authentic, but we still need to maintain the accuracy and comparability required for a quality exam.
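To give a flavour of the psychometric layer that turns scored responses into a final score, here is a textbook Rasch-model ability estimate computed by Newton-Raphson. This is a generic illustration, not Duolingo's actual scoring model; the difficulties and responses below are made up.

```python
import math

# Textbook Rasch-model ability estimation via Newton-Raphson -- a generic
# example of psychometric scoring, not any operational test's model.

def rasch_ability(responses: list[int], difficulties: list[float],
                  iterations: int = 20) -> float:
    """Maximum-likelihood ability estimate for 0/1 item responses."""
    theta = 0.0
    for _ in range(iterations):
        probs = [1 / (1 + math.exp(-(theta - b))) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1 - p) for p in probs)  # Fisher information
        theta += gradient / information                # Newton-Raphson step
    return theta

# Five items of increasing difficulty; this test taker missed the harder two.
print(round(rasch_ability([1, 1, 1, 0, 0], [-1.0, -0.5, 0.0, 0.5, 1.0]), 2))
```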

OECD: Precisely, for high-stakes tests, there have been intense discussions about the grading of exams and papers. It seems that "good old-fashioned AI" performs better than generative AI in terms of accuracy and consistency. People seem to suggest that older machine learning tends to be more accurate but less flexible: it requires a lot of time and money to train, and if you change the task, you often have to restart and redo it. With generative AI, you don't lose as much time because it adapts much better to new contexts, but you might not achieve the same level of accuracy. This means that if it's not a high-stakes scenario, it's perfectly acceptable, but if it is high-stakes, you need to think carefully about your approach. Do you agree with this assessment? Do you think this is likely to change? What is your experience?
Alina von Davier: First, let me say that the most important thing is to know how to use GenAI properly. Simply using a prompt one dreamed up one morning is not going to lead to a good quality assessment. It doesn't work like that. One has to think carefully and plan. People hope that if they just say, "Do this for me," it will do it perfectly, but that's not the case. We need to be very cautious when people make overly positive or negative claims: take them with a grain of salt, because most people may not have built sufficient experience with GenAI. That's my main observation now. Many think it's easy and conversational, but it isn't that simple for high-quality exams. One may obtain a full range of output quality from generative AI, depending on whether one prompts it properly and builds it correctly.

When I teach the use of generative AI for item generation, I advise people to "divide and conquer". By that I mean: don't try to get everything done with one prompt. Let's say one needs an assessment passage followed by questions for 8th graders. Don't put it all into one prompt; it won't be very good. I suggest people divide the task: first generate the passage, review it to ensure its quality and, when one is satisfied, develop the questions. That approach works much better, but it is not yet widespread. Many people try to put everything into one prompt, don't experiment further, and then claim it's not working well. So, I would say, be careful and ask more questions when faced with extreme claims, positive or negative.
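A minimal sketch of the "divide and conquer" workflow just described: generate the passage first, pause for human review, and only then generate the questions. call_llm stands in for any real LLM client; the prompts are illustrative.

```python
# Hypothetical sketch of "divide and conquer" prompting: generate the
# passage first, pause for human review, then generate questions from the
# approved passage, rather than asking for everything in one prompt.

def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:50]}...]"  # replace with a real call

def generate_passage(topic: str, grade: int) -> str:
    return call_llm(f"Write a 200-word reading passage about {topic} "
                    f"suitable for grade {grade} students.")

def generate_questions(passage: str, n: int = 4) -> str:
    # Run only after a human has reviewed and approved the passage.
    return call_llm(f"Write {n} comprehension questions with an answer key "
                    f"for this passage:\n{passage}")

passage = generate_passage("photosynthesis", grade=8)
# ... human review happens here; continue only if the passage passes ...
questions = generate_questions(passage)
```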

OECD: It's an interesting observation, because there is now a research literature on how many prompts are needed to achieve quality comparable to older machine-learning types of scoring.
Alina von Davier: For scoring, we use our own machine learning models, sometimes with some generative AI components. We use our own models, but our biggest concern – as big as accuracy – is comparability. If generative AI scores the same essay one way at one time and differently at another, it affects comparability and replicability, which are crucial for a reliable assessment. GenAI can sometimes be accurate, but again, it depends on how one uses it. For instance, if one is exploring generative AI for scoring essays, and one provides it with a very good rubric (the same rubric one would give to humans) and a few more examples than one usually would, I believe it can do quite a good job. However, this depends on the task's complexity and the purpose of the scores. It varies. It's true that all these applications are still task-specific, for both generation and scoring. While some parts can be reused, generally the main model is task-specific, and one needs to test it again to see if it works for other tasks.

We also incorporate numerous checks afterwards, with multiple filters to ensure nothing is released if it's not good enough. We have automatic tools for monitoring the quality of our assessments. The first tool we built is called AQuAA, which stands for Analytics for Quality Assurance in Assessment. This tool incorporates some machine learning models and a lot of psychometrics. It functions as an alert system, continuously analysing data as it comes in. If anything unusual occurs, we receive an alert. We have another, newer system called AQuAP, which stands for Analytics for Quality Assurance for the Pools (item pools). It also operates in the background, monitoring the items to ensure they don't suddenly become more difficult or exhibit other unusual behaviours within the pool. This is also an automatic tool that heavily utilises machine learning. We are also developing one called AQuATT, for test takers, which will focus on the test-taker level.
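As an illustration of the kind of background check a pool-monitoring tool such as AQuAP might run, here is a simple drift detector that flags items whose recent proportion-correct departs sharply from their historical baseline. The window size and threshold are arbitrary choices for this sketch; the actual systems use machine learning and much richer psychometrics.

```python
import statistics

# Hypothetical sketch of a pool-monitoring drift check, loosely inspired by
# the AQuAP idea described above: flag items whose recent proportion-correct
# deviates sharply from their historical baseline.

def drift_alerts(history: dict[str, list[int]], window: int = 50,
                 z_crit: float = 3.0) -> list[tuple[str, float, float]]:
    """Return (item_id, baseline_p, recent_p) for items that drifted."""
    alerts = []
    for item_id, outcomes in history.items():
        baseline, recent = outcomes[:-window], outcomes[-window:]
        if len(baseline) < window:
            continue  # not enough history for a stable comparison
        p0, p1 = statistics.mean(baseline), statistics.mean(recent)
        se = (p0 * (1 - p0) / window) ** 0.5 or 1e-9  # avoid divide-by-zero
        if abs(p1 - p0) / se > z_crit:
            alerts.append((item_id, p0, p1))
    return alerts
```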

OECD: Thank you so much for sharing how AI and GenAI are involved in current assessments.

Alina von Davier: My pleasure. If you go to the Duolingo English Test website, there's something called a practice hub. Just take a look to see what these items are like. It's all free and open, so anyone interested in knowing more can just check it out.
