Generative AI tools to support teachers.

A conversation with Dorottya Demszky



This section is an interview between Dorottya (Dora) Demszky, Assistant Professor in Education Data Science at Stanford University (United States), and the OECD Secretariat. The conversation discusses emerging research evidence on the potential of generative AI tools to support teacher tasks: lesson planning, professional development based on their actual teaching, real-time support for tutoring, and the provision of feedback to pupils and students. It concludes with a reflection on the availability of these tools for the teaching profession across the globe.

OECD: What do you think generative AI offers to teachers to support their teaching and the learning of their students, especially when the tools are teacher-facing?
Dora Demszky: My lab, the EduNLP lab, primarily focuses on this question: how AI tools, including GenAI, can support teachers in different ways, and of course there is a broader landscape of tools in this area. There are at least four areas where GenAI can support teachers: lesson planning, professional development based on their actual teaching, real-time support for tutoring, and the provision of feedback to their pupils and students.
OECD: Great. Let’s take those in turn and start with lesson planning and the development of curriculum materials.
Dora Demszky: A main challenge for teachers is the time-consuming and difficult process of designing high-quality lesson plans for students with various needs. Curricula vary greatly in the United States, and this is also true in some other countries. Even where they do not vary, teachers often need to adapt teaching materials to meet students where they are, whether they are below grade level, multilingual newcomers needing language support, or students with special needs requiring visual or other types of tools. Teachers are not necessarily trained for this task. One major area of work, both in industry and research, is addressing the challenge of curriculum adaptation. There are many possibilities, though some approaches are better than others. It's crucial to consider various factors, such as maintaining rigour and preserving core components of carefully designed expert curricula, rather than just simplifying content. Our project, ScaffGen, researches how GenAI can support teachers with curriculum adaptation, considering high-quality instructional materials and teacher-specific contexts, such as students being below the expected proficiency at their grade level. This involves helping teachers adapt and create scaffolds for students that remain aligned with their curriculum. Specific areas include creating more practice tasks and generating visual aids, like different ways to represent the same problem. We focus on multimodal generation, which GenAI excels at, and currently use LaTeX for diagram generation. We have evaluated scaffolds generated by Large Language Models (LLMs) for high-quality instructional materials against expert-created ones. We found that LLM-generated scaffolds are similar to, and sometimes even preferred by teachers over, expert-made ones, which shows promise. There are still gaps, especially in visual aid generation.
Another upcoming paper is a benchmark with a dataset of thousands of diagrams and LaTeX code from the Illustrative Mathematics curriculum, a leading K-12 math curriculum in the United States. We are releasing this dataset and benchmark studies to understand AI's performance in this area. 
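As a concrete illustration of the "diagram as LaTeX code" target described above, a curriculum-style visual representation of a fraction can be expressed as a short TikZ tape diagram. This sketch is hypothetical and is not taken from the Illustrative Mathematics dataset or the ScaffGen project; it only shows the kind of compact, generatable LaTeX such a benchmark would contain.

```latex
% Hypothetical example of a diagram expressed as LaTeX/TikZ code:
% a tape diagram representing the fraction 3/4 as a partly shaded bar.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % Shade three of the four equal parts.
  \foreach \i in {0,1,2} \fill[gray!50] (\i,0) rectangle (\i+1,1);
  % Draw the outlines of all four parts.
  \foreach \i in {0,...,3} \draw (\i,0) rectangle (\i+1,1);
  \node at (2,-0.5) {$\tfrac{3}{4}$};
\end{tikzpicture}
\end{document}
```

Because the diagram is plain text, a model's output can be compiled and compared against reference renderings, which is what makes LaTeX a convenient representation for benchmarking diagram generation.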
OECD: What do we know about the efficacy of AI-generated lesson plans and of your diagram-generation tool?
Dora Demszky: One of my former students built CoTeach.AI, an AI-powered curriculum adaptation tool grounded in the Illustrative Mathematics curriculum. After an initial small pilot of just a week, the tool gained significant traction and now has thousands of regular users. We estimate that about 10% of all teachers who use Illustrative Mathematics now use CoTeach.AI, which is substantial. Regarding efficacy, we are currently studying it and planning a pilot focused on our diagram-generation tool. We will test the quality of lesson plans from teachers using it versus those who don't, specifically focusing on the idea of multiple representations. We want to see if the tool's ability to generate diagrams supports students' understanding of connections between different representations (e.g. visualising abstract fractions). The curriculum provides limited representations, and we believe our tool can significantly support teachers in this. More generally speaking, I haven't seen any efficacy studies for broader lesson planning tools like Magic School or School.ai. Much of the evidence is self-reported usage or perception. Evaluating efficacy is challenging because it requires rigorous metrics for lesson plan quality and, ideally, measuring student outcomes. Gathering student outcome data is slow, expensive, and logistically difficult, and often falls to researchers due to a lack of incentives in the EdTech industry. We are working on it, but it's a slow process.
OECD: I don't know of any studies on the efficacy of lesson plans on student learning either. Some studies evaluate the quality of generated lesson plans through human judgment and the time saved, focusing on productivity rather than on whether the lesson led to better instruction. It seems your ScaffGen is more granular than full lesson plans.
Dora Demszky: CoTeach can generate full lesson plans, but often it generates activities. My lab, as part of the ScaffGen project, focuses on core R&D that many industry providers lack bandwidth for, such as diagram generation, which requires careful engineering, evaluations, benchmarks, and infrastructure. Many existing tools are essentially LLM wrappers (that is, software layers, or interfaces, built around an LLM): they don't have the capacity to build these challenging but necessary features. We are focused on fundamental technologies and evaluation, though the latter is complex and requires partnerships. We are working on a rubric for lesson plan quality for efficacy studies. It's also ethically challenging to withhold such tools from teachers for a control group. We are interested in gathering evidence despite these open questions.
OECD: You mentioned teachers sometimes preferred LLM-generated materials over experts': can you elaborate on that?
Dora Demszky: In a project in 2023, using earlier LLMs than the current models, we evaluated the quality of lesson plans based on predefined dimensions like readiness for classroom use, alignment with lesson objectives, preference, and alignment with student needs. Teachers compared the original curriculum warm-up from Illustrative Mathematics to two different LLM-generated and expert-generated lesson plans. Both the LLM-generated and the expert-generated plans were preferred over the original material by a huge margin, across all criteria. On some dimensions, LLMs even outperformed experts. This is promising but needs careful interpretation.

OECD: A second application you mentioned belongs to the category of “classroom analytics” applications supporting teacher professional development or real-time classroom orchestration. I have always found this use of AI fascinating and promising. What does GenAI bring to these AI tools? 
Dora Demszky: GenAI can support teachers in using pedagogically sound "talk moves" and discourse practices that probe student thinking, instead of just guiding students to a pre-specified solution or drilling. This involves dialogic practices that encourage student expression. GenAI can help analyse classroom discourse and student interactions. This can be done post-session: after a physical or online lesson, a transcript is analysed, and GenAI (or simpler AI models) can provide explicit suggestions on how to improve instructional practice or what talk moves to try next to support active learning. We have conducted more than four randomised controlled trials (RCTs) testing how this automated post-session feedback supports instructional improvement. We have a tool called Empowering Teachers. Teachers teach classes and then receive a report, or feedback, focusing on different talk moves, for example, inviting student thinking. The report includes counts and talk time, as well as ChatGPT suggestions. These talk moves are detected by language model-based classifiers, not GenAI itself. We found that teachers who received this automated feedback from classifiers used the targeted talk move (e.g. focusing questions, building on or eliciting student ideas) up to 20% more after only two feedback sessions, compared to a control group that did not receive such feedback. A limitation is the lack of rigorous assessment of student learning outcomes, but we do have access to student engagement metrics such as talking more, showing up to classes, and completing assignments. We found that students whose teachers received this feedback were more likely to submit assignments and show up to class. There is room for improvement, but that's promising. GenAI is good at summarising conversations but struggles to accurately identify high-leverage teaching practices, as this requires significant context and understanding of classrooms.
Even with careful prompting, it sometimes hallucinates or misclassifies classroom interactions. We see a lot of potential in this area though, especially for novices like volunteer tutors or new teachers who receive limited training: it offers them professional learning. I see less industry activity in talk-move suggestions, perhaps due to lower profit potential. More common are fully automated tutoring systems like Khanmigo, though their effectiveness still needs evidence. Our lab focuses on supporting human tutors and teachers, so we develop and research these types of teacher-facing tools.
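The post-session feedback loop described above, transcript in, talk-move counts and talk time out, can be sketched as follows. This is a hypothetical illustration, not code from the Empowering Teachers tool: the keyword-based tagger below stands in for the language model-based classifiers the lab actually uses, and all names, cues, and data are invented.

```python
# Minimal sketch of a post-session feedback report, assuming a transcript of
# (speaker, seconds_spoken, utterance) tuples. A simple keyword tagger stands
# in for trained talk-move classifiers; the cues below are illustrative only.
from collections import Counter

transcript = [
    ("teacher", 8, "What do you notice about these two fractions?"),
    ("student", 5, "The second one has a bigger denominator."),
    ("teacher", 6, "Can you say more about why that matters?"),
    ("student", 7, "Because the pieces are smaller."),
    ("teacher", 4, "Nice, let's build on Maya's idea."),
]

# Hypothetical cue phrases for two talk moves.
TALK_MOVES = {
    "eliciting": ("what do you notice", "can you say more"),
    "building_on": ("build on",),
}

def tag_moves(utterance):
    """Return the list of talk-move labels whose cues appear in the utterance."""
    text = utterance.lower()
    return [move for move, cues in TALK_MOVES.items() if any(c in text for c in cues)]

def session_report(transcript):
    """Aggregate per-speaker talk time and teacher talk-move counts."""
    move_counts, talk_time = Counter(), Counter()
    for speaker, seconds, utterance in transcript:
        talk_time[speaker] += seconds
        if speaker == "teacher":
            move_counts.update(tag_moves(utterance))
    return {"talk_moves": dict(move_counts), "talk_time_seconds": dict(talk_time)}

report = session_report(transcript)
```

In a real system the tagger would be a classifier run per utterance, but the report shape, counts of targeted moves plus talk time per speaker, matches the kind of feedback described above.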
OECD: How much do teachers like these tools? Adoption is usually one of the issues with them. 
Dora Demszky: One practical challenge is that some teachers find it hard to act on this feedback. It requires reflection and thus time. While it raises awareness, deep change is often better supported by a human coach. We just published a working paper where instructional coaches helped teachers interpret this feedback, which was very helpful. Coaches were supported in pulling out specific evidence, and teachers felt less judged by the coach because they were looking at a third-party piece of evidence together.


OECD: One of your very interesting studies is about providing support to human tutors in real time. Could you tell us about it?
Dora Demszky: Yes, in the real-time suggestions space we have the Tutor CoPilot project, a collaboration with SCALE at Stanford. This project partnered with a tutoring provider supporting low-income students in text-based, in-school tutoring. Tutor CoPilot allows tutors to activate the tool during online tutoring sessions when students make math errors and need remediation. It suggests different response strategies and actual editable responses, giving tutors agency while also serving an educative purpose. A randomised controlled trial showed that tutors who had access to Tutor CoPilot used better instructional practices, and their students mastered lessons faster. This was particularly helpful for tutors with lower initial quality ratings or less experience (see Figure 10.1).





The two panels show the results of using Tutor CoPilot on student learning (upper panel) and on tutor pedagogies (lower panel). The effect on student learning varies with tutors’ initial effectiveness, measured by their quality rating. The results indicate substantial benefits for tutors with lower initial effectiveness: lower-rated tutors experienced a 9-percentage-point increase in students passing their exit ticket (from a 56% student passing rate in the control group to 65% in the treatment group). Similar effects were observed with less experienced tutors.


OECD: What do you think about the appropriateness of choosing generic LLMs versus more educationally focused ones? How can you be sure that the use of GenAI is educationally appropriate?
Dora Demszky: That's a very big question, and it's one of the central questions we ask teachers. How do teachers determine appropriateness? Specific models (GPT, Claude, Gemini) are ever-changing, and we haven't found massive differences. The best model might change next week... The criteria for educational appropriateness vary a lot by context, teacher, and project. It is important to learn about these criteria. We hosted a Practitioner Voices summit for math educators at Stanford this summer, where one of our main goals was to learn about their criteria for evaluating AI tools. Our short report can be found online and we will release a longer paper soon.
OECD: In the case of Tutor CoPilot, the system was trained by providing data based on the observation of and work with expert teachers. Do you think GPT-4 at the time would have given similar results without this dedicated educational element?
Dora Demszky: No, we explicitly needed the expert teachers' input to improve the model. Without that "expert-informed cognitive task analysis" – where we tell the model how an expert teacher would remediate a student's mistake – it performed significantly worse. We are doing something similar with ScaffGen, giving the model these expert-informed processes. This is related to, but slightly different from, the evaluation criteria for determining if a tool is good, though the two can inform each other.
OECD: Could these tools supplement and augment teachers to improve educational quality, especially in countries or contexts with teacher shortages or teachers lacking expertise?
Dora Demszky: I want to problematise the premise that we don't have human teachers available. It's risky to accept that technology should (or could) replace the human teacher role, as this could worsen inequities in access to human teachers, not just in low-income countries but within the United States and OECD countries too. If situations genuinely lack a human teacher, we must think carefully about what roles these tools can fulfil. The relationship-building part cannot be replaced by technology, though other aspects could potentially be, which remains to be tested.
OECD: I was not thinking of replacing teachers, but rather: if you have inexperienced or low-quality teachers, or people with little subject and pedagogical knowledge, could these tools help them improve their performance? In many countries, there would be just too many teachers to train, so being able to enrol the next person to tutor or teach could help. If the humans don't really know yet what they're doing, could tools like Tutor CoPilot help?
Dora Demszky: Our central focus for these technologies isn't time-saving, but rather the educative element – supporting teachers' professional learning. All teachers have room to grow. Different versions of the tools could be tailored to the user's experience level; for example, a novice teacher might be overwhelmed with too many decisions or too much information before they gain more training. We have tested some tools with complete novices. In the “Code in Place” global programming course run by Stanford University, we implemented the teacher feedback tool with thousands of volunteer section leaders, most of whom had zero teaching experience. This feedback tool helped them, so this is a significant user base we are targeting. But we would need to do pre-work to ensure these technologies translate to different languages and local needs if we were to use them in the contexts you mentioned.
OECD: What do you think GenAI can never do as well as a human being, if anything, especially regarding the human dimension in education? 
Dora Demszky: Motivation and relationship building are key elements that GenAI may never handle as well as humans. While more research is needed, experts in education agree, and it intuitively makes sense. An AI won't be seen as a role model. Students might share things with AI they wouldn't with a human because they're less afraid of vulnerability. However, a human is better able to support emotional well-being and create accountability. With AI, there's no accountability: a student might not care what they do because the AI won't get hurt. Social-emotional skills, for example, are learned better with a human teacher and human peers. Learning involves much more than just knowledge or information gathering.


OECD: Let us move to the final area that you mentioned initially: feedback to students. We know this is essential for learning, for teachers as we have already mentioned, but also for students. What could GenAI offer on that?
Dora Demszky: A significant area of research and development for both industry and academia is teachers giving feedback to students on their work. Teachers often lack the time (imagine they have 150 students) but also the training to give high-quality feedback. As we aim to go beyond just productivity and time-saving, focusing on improving the quality of teaching and of feedback is essential. Some tools exist, like Brisk, supporting GenAI-driven feedback, especially in writing. We are working on rigorously validated tools that also support professional learning around feedback provision.

OECD: Are you talking about formative assessment, where feedback is given on students' written assignments? Or is it linked to the applications of real-time feedback and dialogic practice you told us about? How do these two link, and what do we know about the efficacy of the tools being developed?
Dora Demszky: Our work focuses on formative feedback that teachers can give on student assignments, but with a strong emphasis on revision. One goal of feedback is to help students improve and revise their work, and students are less likely to read feedback if they don't have a chance to revise. We focus on areas with room for student improvement and lower stakes. The efficacy of these new feedback tools remains to be seen, as they are very new, but conceptually the design seems sound. Teachers often accept GenAI suggestions without editing. That's a problem. We explicitly design our tools to support teachers in creating feedback, not to replace their feedback, because research indicates students are less likely to act on feedback perceived as coming from AI rather than their teacher. It's crucial for students to feel the feedback is from their teacher. We are developing a benchmark for feedback quality, a set of measures for assessing feedback from teachers or AI tools, which we hope industry will adopt. We have a working paper that compares expert-written feedback to LLM-generated feedback. While LLMs are not bad, they significantly lag behind experts in key areas. For example, LLMs are much less dialogic, tending to give specific rewrite suggestions ("this was not right, here's how to rephrase") rather than engaging with holistic arguments or probing student thinking to encourage revision. Also, LLM comments can be disjointed, unlike a teacher's coherent feedback where comments build on one another. We are actively working on improving and evaluating the use of GenAI tools. The two strands of projects – teacher-facing feedback on talk (which is maths/STEM-focused) and student-facing feedback on writing – are not directly linked currently. However, we envision integrating them.
For example, teachers could receive a post-lesson report summarising student assignments and class discussions, offering feedback suggestions for assignments, and guiding future lesson planning. This could be a complementary system down the line. 


OECD: So, are all the tools you've mentioned usable in real-life, in-person classroom or instruction settings, except Tutor CoPilot, which is for virtual platforms? For example, can you think of uses of Tutor CoPilot in an in-person setting?
Dora Demszky: We need to be careful not to make real-time suggestion tools distracting in virtual or physical face-to-face contexts, or to take away educator agency. One idea we're exploring is surfacing feedback during high-leverage moments in a real-time classroom, like when students are working on problems and there's a pause, rather than giving suggestions constantly. Identifying these non-distracting periods could be very useful. Doing this in a virtual context is straightforward; you can anticipate when these moments might occur and surface feedback then. In a physical classroom, it's harder due to challenges in accurately capturing student voices, surfacing real-time feedback to teachers (e.g. via iPad), and instrumentation. We need to talk to teachers about this. One question for participants at our Practitioner Voices summit was how these tools could support teachers in a physical classroom, whether by analysing group work or teacher discourse. They might help us envision practical implementation. There might be variation, with some teachers preferring post-teaching feedback and others appreciating real-time tools. Our longer report will detail what we learned from teachers.






