Generative AI tools to support teachers
This section is an interview between Dorottya (Dora) Demszky, Assistant Professor in Education Data Science at Stanford University (United States), and the OECD Secretariat. The conversation discusses emerging evidence about the potential of generative AI tools to support four teacher tasks: lesson planning, professional development based on teachers' actual teaching, real-time support for tutoring, and the provision of feedback to pupils and students. It concludes with a reflection on the availability of these tools for the teaching profession across the globe.
OECD: What do you think generative AI offers
to teachers to support their teaching and the
learning of their students, especially when the
tools are teacher-facing?
Dora Demszky: My lab, the EduNLP Lab, primarily focuses on this question: how AI tools, including GenAI, can support teachers in different ways, and of course there is a broader landscape of tools in this area. There are at least four areas where GenAI can support teachers: lesson planning, professional development based on their actual teaching, real-time support for tutoring, and the provision of feedback to their pupils and students.
OECD: Great. Let’s take those in turn and start
with lesson planning and the development of
curriculum materials.
Dora Demszky: A main challenge for teachers is the time-consuming and difficult process of designing high-quality lesson plans for students with various needs. Curricula vary greatly in the United States, and this is also true in some other countries. Even when curricula do not vary, teachers often need to adapt teaching materials to meet students where they are, whether they are below grade level, multilingual newcomers needing language support, or students with special needs requiring visual or other types of tools. Teachers are not necessarily trained for this task.
One major area of work, both in industry and research,
is addressing the challenge of curriculum adaptation.
There are many possibilities, though some approaches
are better than others. It's crucial to consider various factors, such as maintaining rigour and preserving core
components of carefully designed expert curricula,
rather than just simplifying content. Our project,
ScaffGen, researches how GenAI can support teachers
with curriculum adaptation, considering high-quality
instructional materials and teacher-specific contexts like
students being below the expected proficiency at their
grade level. This involves helping teachers adapt and
create scaffolds for students that remain aligned with
their curriculum.
Specific areas include creating more practice tasks and
generating visual aids, like different ways to represent
the same problem. We focus on multimodal generation,
which GenAI excels at, and currently use LaTeX for
diagram generation. We have evaluated scaffolds generated by Large Language Models (LLMs) for high-quality instructional materials against expert-created ones. We found that LLM-generated scaffolds are comparable to, and sometimes even preferred by teachers over, expert-made ones, which shows promise. There are still gaps, especially in visual
aid generation. Another upcoming paper is a benchmark
with a dataset of thousands of diagrams and LaTeX
code from the Illustrative Mathematics curriculum, a
leading K-12 math curriculum in the United States. We
are releasing this dataset and benchmark studies to
understand AI's performance in this area.
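As a rough illustration of the engineering this requires, here is a minimal sketch that compile-checks LaTeX/TikZ diagram code of the kind an LLM might produce, using an area model of the fraction 3/4 as an example of "multiple representations". The template, the compile-check metric, and the example diagram are assumptions for illustration; they are not ScaffGen's actual pipeline.

```python
# Minimal sketch: compile-check TikZ diagram code produced by an LLM.
# Compilability is only a crude first-pass metric; it says nothing about
# pedagogical quality. Requires a TeX distribution with pdflatex installed.
import pathlib
import subprocess
import tempfile

TEMPLATE = r"""\documentclass[tikz]{standalone}
\begin{document}
%s
\end{document}
"""

def compiles(tikz_body: str) -> bool:
    """Return True if the TikZ snippet compiles to a PDF without errors."""
    with tempfile.TemporaryDirectory() as tmp:
        tex = pathlib.Path(tmp) / "diagram.tex"
        tex.write_text(TEMPLATE % tikz_body)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex.name],
            cwd=tmp, capture_output=True, text=True,
        )
        return result.returncode == 0 and (pathlib.Path(tmp) / "diagram.pdf").exists()

# Example input: an area model for 3/4, the kind of alternative
# representation of a fraction discussed above.
area_model = r"""\begin{tikzpicture}
  \fill[blue!40] (0,0) rectangle (3,1);  % three of four parts shaded
  \draw (0,0) grid (4,1);                % four equal parts
  \node at (2,-0.5) {$\frac{3}{4}$};
\end{tikzpicture}"""

print("compiles:", compiles(area_model))
```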
OECD: What do we know about the efficacy of AI-generated lesson plans and of your diagram-generation tool?
Dora Demszky: One of my former students built
CoTeach.AI, an AI-powered curriculum adaptation tool
grounded in the Illustrative Mathematics curriculum.
After rolling it out for just a week in a small pilot, the
tool has gained significant traction with thousands of
regular users. We estimate about 10% of all teachers
who use Illustrative Mathematics now use CoTeach.AI, which is substantial. Regarding efficacy, we are
currently studying it and planning a pilot focused on
our diagram-generation tool. We will test the quality
of lesson plans from teachers using it versus those
who don't, specifically focusing on the idea of multiple
representations. We want to see if the tool's ability to
generate diagrams supports students' understanding
of connections between different representations (e.g.,
visualising abstract fractions). The curriculum provides
limited representations, and we believe our tool can
significantly support teachers in this.
More generally speaking, I haven't seen any efficacy
studies for broader lesson planning tools like Magic
School or School.ai. Much of it is self-reported usage or
perception. Evaluating efficacy is challenging because
it requires rigorous metrics for lesson plan quality and,
ideally, measuring student outcomes. Gathering student
outcome data is slow, expensive, and logistically difficult,
often falling to researchers due to lack of incentives in
the EdTech industry. We are working on it, but it's a slow
process.
OECD: I don't know any studies on the efficacy
of lesson plans on student learning either. Some
studies evaluate the generated lesson plan quality
through human judgment and the time saved,
focusing on productivity rather than whether the
lesson led to better instruction quality. It seems
your ScaffGen is more granular than full lesson
plans.
Dora Demszky: CoTeach can generate full lesson
plans, but often it generates activities. My lab, as part
of the ScaffGen project, focuses on core R&D that
many industry providers lack bandwidth for, such as
diagram generation, which requires careful engineering,
evaluations, benchmarks, and infrastructure. Many
existing tools are essentially LLM wrappers (that is, software layers or interfaces built around an LLM): they
don’t have the capacity to build these challenging but
necessary features. We are focused on fundamental
technologies and evaluation, though the latter is complex
and requires partnerships. We are working on a rubric
for lesson plan quality for efficacy studies. It's also
ethically challenging to withhold such tools from teachers
for a control group. We are interested in gathering
evidence despite these open questions.
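As a purely illustrative sketch, a lesson-plan rubric of the kind mentioned here could be represented as below, reusing the evaluation dimensions Demszky describes later in this interview; the 1-5 scale and the structure are invented, not the lab's actual rubric.

```python
# Illustrative only: a rubric structure for rating lesson plans on the
# dimensions named in the 2023 study discussed below. The 1-5 scale and
# equal weighting are assumptions.
from dataclasses import dataclass

DIMENSIONS = [
    "readiness for classroom use",
    "alignment with lesson objectives",
    "overall preference",
    "alignment with student needs",
]

@dataclass
class LessonPlanRating:
    plan_id: str
    scores: dict[str, int]  # dimension -> rating on an assumed 1-5 scale

    def mean_score(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

rating = LessonPlanRating("warmup-llm-1", {d: 4 for d in DIMENSIONS})
print(rating.mean_score())  # 4.0
```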
OECD: You mentioned teachers sometimes preferred LLM-generated materials over expert-created ones: can you elaborate on that?
Dora Demszky: In a project in 2023, using earlier
LLMs than the current models, we evaluated the quality
of lesson plans based on predefined dimensions like
readiness for classroom use, alignment with lesson
objectives, preference, and alignment with student
needs. Teachers compared the original curriculum
warm-up from Illustrative Mathematics to two different
LLM-generated and expert-generated lesson plans.
Both the LLM-generated and the expert-generated plans were preferred over the original material across all criteria, by a huge margin. On some dimensions, LLMs even outperformed experts. This is promising but needs careful interpretation.
OECD: A second application you mentioned
belongs to the category of “classroom
analytics” applications supporting teacher
professional development or real-time classroom
orchestration. I have always found this use of AI
fascinating and promising. What does GenAI bring
to these AI tools?
Dora Demszky: GenAI can support teachers in using
pedagogically sound "talk moves" and discourse
practices that probe student thinking, instead of just
guiding them to a pre-specified solution or drilling.
This involves dialogic practices that encourage student
expression. GenAI can help analyse classroom discourse
and student interactions. This can be done post-session:
after a physical or online lesson, a transcript is analysed,
and GenAI (or simpler AI models) can provide explicit
suggestions on how to improve instructional practice or
what talk moves to try next to support active learning.
We have conducted more than four randomised controlled trials
(RCTs) testing how this automated post-session feedback
supports instructional improvement. We have a tool called
Empowering Teachers. Teachers teach classes, and then
they receive a report or feedback focusing on different talk
moves, for example, inviting student thinking. The report
includes counts and talk time. ChatGPT suggestions are also included in the report. These talk moves are detected
by language model-based classifiers, not GenAI itself.
We found that teachers who received this automated
feedback from classifiers used the targeted talk move (e.g., focusing questions, building on or eliciting student ideas) up to 20% more after only two feedback sessions,
compared to a control group who did not receive such
feedback. A limitation is the lack of rigorous assessment
of student learning outcomes, but we do have access to
student engagement metrics like talking more, showing
up to classes, and completing assignments. We found
students whose teachers received this feedback were
more likely to submit assignments and show up to class.
There is room for improvement, but that’s promising.
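A minimal sketch of the post-session tagging step described above, assuming a classifier fine-tuned on labelled classroom transcripts is available; the model id and the label names are hypothetical.

```python
# Sketch: tag each teacher utterance with a talk move using a fine-tuned
# classifier (as in the interview, a classifier rather than GenAI does the
# detection). "example-org/talk-move-classifier" is a hypothetical model id.
from transformers import pipeline

classify = pipeline("text-classification", model="example-org/talk-move-classifier")

transcript = [
    ("teacher", "Can you say more about why you divided by four?"),
    ("student", "Because there are four equal groups."),
    ("teacher", "Who can build on what Maya just said?"),
]

# Count moves per label for the feedback report (e.g., counts and talk time).
counts: dict[str, int] = {}
for speaker, utterance in transcript:
    if speaker != "teacher":
        continue
    label = classify(utterance)[0]["label"]  # e.g., "eliciting", "building"
    counts[label] = counts.get(label, 0) + 1

print(counts)
```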
GenAI is good at summarising conversations but struggles
to accurately identify high-leverage teaching practices,
as it requires significant context and understanding of
classrooms. Even with careful prompting, it sometimes
hallucinates or misclassifies classroom interactions. We
see a lot of potential in this area though, especially for
novices like volunteer tutors or new teachers who receive
limited training. This will offer them professional learning.
I see less industry activity in talk move suggestions,
perhaps due to lower profit. More common are fully
automated tutoring systems like Khanmigo, though their
effectiveness still needs evidence. Our lab focuses on
supporting human tutors and teachers, so we develop and
research these types of teacher-facing tools.
OECD: How much do teachers like these tools?
Adoption is usually one of the issues with them.
Dora Demszky: One practical challenge is that some
teachers find it hard to act on this feedback. It requires
reflection and thus time. While it raises awareness, deep
change is often better supported by a human coach.
We just published a working paper where instructional
coaches helped teachers interpret this feedback, which
was very helpful. The reports helped coaches pull out specific evidence, and teachers felt less judged by the coach because they were looking at a third-party piece of evidence together.
OECD: One of your very interesting studies is
about providing support to human tutors in real
time. Could you tell us about it?
Dora Demszky: Yes, in the real-time suggestions
space we have the Tutor Copilot project, a collaboration
with SCALE at Stanford. This project partnered with a
tutoring provider supporting low-income students in
text-based, in-school tutoring. Tutor Copilot allows tutors
to activate the tool during the online tutoring sessions
when students make math errors and need remediation.
It suggests different response strategies and actual
editable responses, giving tutors agency while also
serving an educative purpose. A randomised controlled
trial showed that tutors who had access to Tutor Copilot
used better instructional practices, and their students
mastered lessons faster. This was particularly helpful for
tutors with lower initial quality ratings or less experience
(see Figure 10.1).
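The following is an illustrative sketch of that interaction loop, not the actual Tutor Copilot implementation: when the tutor activates the tool, recent chat context is sent to an LLM, which returns named response strategies plus an editable draft the tutor can revise before sending. The model choice and prompt wording are assumptions.

```python
# Illustrative sketch of a tutor-facing suggestion loop (not Tutor Copilot's
# real code). Requires an OpenAI API key in the environment.
from openai import OpenAI

client = OpenAI()

def suggest_responses(chat_context: str) -> str:
    """Return response strategies and an editable draft for the tutor."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "You support a math tutor. Suggest two response "
                        "strategies (e.g., ask a guiding question, prompt the "
                        "student to explain their reasoning) and one editable "
                        "draft message. Never give away the answer."},
            {"role": "user", "content": chat_context},
        ],
    )
    return response.choices[0].message.content

context = (
    "Problem: 3/4 + 1/8\n"
    "Student: I got 4/12 because I added the tops and the bottoms."
)
print(suggest_responses(context))  # the tutor edits this before sending
```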
OECD: What do you think about the
appropriateness of choosing generic LLMs versus
more educationally-focused ones? How can you
be sure that the use of GenAI is educationally
appropriate?
Dora Demszky: That's a very big question, and it's
one of the central questions we ask teachers. How do
teachers determine appropriateness? Specific models
(GPT, Claude, Gemini) are ever-changing, and we haven't
found massive differences. The best model might change
next week... The criteria for educational appropriateness
vary a lot by context, teacher, and project. It is important
to learn about these criteria. We hosted a Practitioner
Voices summit for math educators at Stanford this
summer, where one of our main goals was to learn about
their criteria for evaluating AI tools. Our short report can
be found online and we will be releasing a longer paper
soon.
OECD: In the case of Tutor Copilot, the system
was trained by providing data based on the
observation of and work with expert teachers.
Do you think GPT-4 at the time would have given similar results without this dedicated educational element?
Dora Demszky: No, we explicitly needed the expert
teachers' input to improve the model. Without that
"expert-informed cognitive task analysis" – where we tell
the model how an expert teacher would remediate a
student's mistake – it performed significantly worse. We
are doing something similar with ScaffGen, giving the
model these expert-informed processes. This is related
to, but slightly different from, the evaluation criteria for
determining if a tool is good, though the two can inform
each other.
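As a sketch of what that can look like in practice, the remediation steps an expert follows can be written directly into the system prompt, for example in place of the generic instruction in the Tutor Copilot sketch above. The steps below are invented for illustration, not the project's actual expert guidance.

```python
# Hypothetical expert-informed remediation guide, injected as (part of) the
# system prompt so the model follows an expert's process instead of improvising.
EXPERT_REMEDIATION_GUIDE = """\
When a student makes a math error, an expert tutor:
1. Names what the student did correctly before addressing the error.
2. Asks a question that surfaces the student's reasoning.
3. Offers a simpler, analogous example rather than the solution.
4. Invites the student to retry the original problem.
Follow these steps when drafting suggestions for the tutor.
"""
```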
OECD: Could these tools supplement and
augment teachers to improve educational quality,
especially in countries or contexts with teacher
shortages or teachers with a lack of expertise?
Dora Demszky: I want to problematise the premise
that we don't have human teachers available. It's risky
to accept that technology should (or could) replace the
human teacher role, as this could worsen inequities
in access to human teachers, not just in low-income
countries but within the United States and OECD
countries too. If situations genuinely lack a human
teacher, we must think carefully about what roles these
tools can fulfil. The relationship-building part cannot
be replaced by technology, though other aspects could
potentially be, which remains to be tested.
OECD: I was not thinking of replacing teachers, but rather: if you have inexperienced or low-quality teachers, or people with little subject and pedagogical knowledge, could these tools
help them improve their performance? In many
countries, there would be just too many teachers
to train, so being able to enrol the next person
to tutor or teach could help. If the humans don't
really know yet what they're doing, could tools
like the Tutor Copilot help?
Dora Demszky: Our central focus for these
technologies isn't time-saving, but rather the educative
element – supporting teachers' professional learning. All
teachers have room to grow. Different versions of the
tools could be tailored to the user's experience level;
for example, a novice teacher might be overwhelmed
with too many decisions or information before they gain
more training. We have tested some tools with complete
novices. In the “Code in Place” global programming
course run by Stanford University, we implemented
the teacher feedback tool with thousands of volunteer
section leaders, most of whom had zero teaching
experience. This feedback tool helped them, so this is
a significant user base we are targeting. But we would
need to do pre-work to ensure these technologies
translate to different languages and local needs if we
were to use them in the contexts you mentioned.
OECD: What do you think GenAI can never do as
well as a human being, if anything, especially
regarding the human dimension in education?
Dora Demszky: Motivation and relationship building
are key elements that GenAI may never do as well as
humans. While more research is needed, experts in
education agree, and it intuitively makes sense. An AI
won't be seen as a role model. Students might share
things with AI they wouldn't with a human because
they're less afraid of vulnerability. However, a human
is better able to support emotional well-being and
create accountability. With AI, there's no accountability.
A student might not care what they do because the AI
won't get hurt. Social-emotional skills, for example, are learned better with a human teacher and human peers. Learning involves much more than just knowledge or information gathering.
OECD: Let us move to the final area that you
mentioned initially: feedback to students. We
know this is essential for learning, for teachers as
we have already mentioned, but also for students.
What could GenAI offer on that?
Dora Demszky: A significant area of research and
development for both industry and academia is teachers
giving feedback to students on their work. Teachers
often lack time (imagine they have 150 students) but
also training to give high-quality feedback. As we aim to
go beyond just productivity and time-saving, focusing
on improving the quality of teaching and of feedback is
essential. Some tools exist, like Brisk, supporting GenAI-driven feedback, especially in writing. We are working on
rigorously validated tools that also support professional
learning around feedback provision.
OECD: Are you talking about formative
assessment, where feedback is given on students'
written assignments? Or is it linked to the
applications of real-time feedback and dialogic
practice you told us about? How do these two link,
and what do we know about the efficacy of tools
being developed?
Dora Demszky: Our work focuses on formative
feedback that teachers can give on student assignments,
but with a strong emphasis on revision. One goal of
feedback is to help students improve and revise their
work, and students are less likely to read feedback if they
don't have a chance to revise. We focus on areas with
room for student improvement and lower stakes. The
efficacy of these new feedback tools remains to be seen,
as they are very new, but conceptually the design seems
sound.
Teachers often accept GenAI suggestions without
editing. That’s a problem. We explicitly design our tools
to support teachers in creating feedback, not to replace
their feedback, because research indicates students are
less likely to act on feedback perceived as coming from
AI rather than their teacher. It's crucial for students to
feel the feedback is from their teacher. We are developing
a benchmark for feedback quality, a set of measures for
assessing feedback from teachers or AI tools, which we
hope industry will adopt.
We have a working paper that compares expert-written
feedback to LLM-generated feedback. While LLMs
are not bad, they significantly lag behind experts in
key areas. For example, LLMs are much less dialogic,
tending to give specific rewrite suggestions ("this was
not right, here's how to rephrase") rather than engaging
with holistic arguments or probing student thinking to
encourage revision. Also, LLM comments can be disjointed,
unlike a teacher's coherent feedback where comments
build on one another. We are actively working on
improving and evaluating the use of GenAI tools.
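As a crude illustration of one such measure, the sketch below proxies how "dialogic" a set of feedback comments is by the share of comments phrased as questions rather than rewrite directives. The lab's actual benchmark is not described in detail here, and a real measure would need trained raters or validated classifiers.

```python
# Crude proxy for "dialogic" feedback: the share of comments that ask the
# student a question rather than dictate a rewrite. Illustrative only.
def dialogic_share(comments: list[str]) -> float:
    questions = sum(1 for c in comments if c.strip().endswith("?"))
    return questions / len(comments) if comments else 0.0

teacher_comments = [
    "What evidence could strengthen this claim?",
    "How does this paragraph connect to your thesis?",
    "Consider a stronger opening sentence.",
]
llm_comments = [
    "This sentence is unclear; rephrase it as: 'The data show...'",
    "Replace 'very good' with a more precise adjective.",
]

print(round(dialogic_share(teacher_comments), 2))  # 0.67
print(round(dialogic_share(llm_comments), 2))      # 0.0
```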
The two strands of projects – teacher-facing feedback on
talk (which is maths/STEM-focused) and student-facing
feedback on writing – are not directly linked currently.
However, we envision integrating them. For example,
teachers could receive a post-lesson report summarising
student assignments and class discussions, offering
feedback suggestions for assignments, and guiding
future lesson planning. This could be a complementary
system down the line.
OECD: So, are all the tools you've mentioned usable in real-life, in-person classroom or
instruction settings, except Tutor Copilot, which is
for virtual platforms? For example, can you think
of uses of Tutor Copilot in an in-person setting?
Dora Demszky: We need to be careful not to make real-time suggestion tools distracting in virtual or physical
face-to-face contexts, or to take away educator agency.
One idea we're exploring is surfacing feedback during
high-leverage moments in a real-time classroom, like
when students are working on problems and there's
a pause, rather than giving suggestions constantly.
Identifying these non-distracting periods could be very
useful.
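A minimal sketch of that idea for a virtual session, assuming timestamped utterances are available: a suggestion is only surfaced once the room has been quiet for longer than a threshold. The 45-second threshold is arbitrary.

```python
# Sketch: find "non-distracting" moments in a session from utterance
# timestamps -- points where a quiet gap long enough for feedback begins.
PAUSE_SECONDS = 45  # arbitrary threshold

def quiet_moments(utterance_times: list[float]) -> list[float]:
    """Return the start times of pauses long enough to surface feedback."""
    return [
        prev
        for prev, cur in zip(utterance_times, utterance_times[1:])
        if cur - prev >= PAUSE_SECONDS
    ]

# Seconds since session start for each utterance:
times = [0, 12, 30, 95, 100, 180]
print(quiet_moments(times))  # [30, 100] -> pauses of 65 s and 80 s
```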
Doing this in a virtual context is straightforward: you can anticipate when these moments might occur and surface feedback. In a physical classroom, it's harder
due to challenges in accurately capturing student voices,
surfacing real-time feedback to teachers (e.g., via iPad),
and instrumentation. We need to talk to teachers about
this. One question for participants at our Practitioner
Voices summit was how these tools could support
teachers in a physical classroom, whether by analysing
group work or teacher discourse. They might help us
envision practical implementation. There might be
variation, with some teachers preferring post-teaching
feedback and others appreciating real-time tools. Our longer report will detail what we learned from teachers.