OpenAI drops ChatGPT, sending waves (or ripples?) in education

This is a podcast episode titled, OpenAI drops ChatGPT, sending waves (or ripples?) in education. The summary for this episode is:

ChatGPT, the generative AI chatbot from OpenAI, recently took the internet by storm. It can write college-level five-paragraph essays on seemingly any topic and in any style. It can explain anaerobic respiration in 10 seconds from your biology homework prompt. It can write code and give coherent explanations to math problems, although not always correct ones. It’s no wonder students and educators are interested in how this particular AI tool changes the classroom.

We, at Merlyn Mind, are leading the push for AI in classrooms by building the first digital voice assistant for teachers. In this episode, we gather our team of AI engineers to share their hot takes on ChatGPT’s impact in our wheelhouse: education and AI. Learn from experts on what makes ChatGPT special, what makes it not special, and what the implications are for teaching and learning.
What is ChatGPT?
01:55 MIN
Large language models have trouble keeping up when society's language and current affairs are constantly changing
01:05 MIN
Size matters: Large language models trained on large amounts of data produce more natural sounding responses
01:13 MIN
LLMs weren't the first technologies to complete sentences and respond to human voice, but they have become the best
01:59 MIN
ChatGPT is not magic, it's an incremental step along a long evolution in LLMs
01:45 MIN
ChatGPT's public release and chat format make it a sensation
02:46 MIN
ChatGPT is different because it can maintain context and follow the conversation over multiple messages
02:14 MIN
What does the RL mean in Reinforcement Learning from Human Feedback (RLHF)?
02:00 MIN
What is the Human Feedback part of RLHF?
02:14 MIN
How did human feedback impact ChatGPT's model?
01:49 MIN
Ethical concerns and ChatGPT's responsibility
03:05 MIN
Impacts of ChatGPT in the classroom and assessment
03:01 MIN
00:55 MIN
How to make ChatGPT correct, not just coherent
01:10 MIN
How does this tool fit into learning?
01:27 MIN
Improve ChatGPT by feeding it higher quality content
01:10 MIN
ChatGPT generates long essays for students. What educators need are short, structured responses
01:09 MIN
AI like ChatGPT still lacks purpose and sophisticated strategy and communication
01:34 MIN
What does ChatGPT mean for edtech companies like Merlyn and for education?
01:44 MIN
AI forces us to rethink "what is learning?"
01:53 MIN
AI can be a tool for teachers, not just students
01:45 MIN
AI will become a tool that frees us to focus on the sophisticated, human parts of education
02:16 MIN

Levi: Welcome to Unsupervised Learning, where we bring members of the Merlyn Mind team together to talk about artificial intelligence, technology, and education. We hope you enjoy these conversations and learn something with us. Let's learn.

Levi: Welcome everyone to this episode of Unsupervised Learning, where we're going to talk to some of our AI experts here on the Merlyn Mind team. Today we're joined by Aditya, Ashish and Paul, three of our leading AI experts who've spent their entire careers building advanced AI solutions, researching them and developing them. We also have Jess here, one of our native and expert teachers who works on our strategy team. I'm excited to have a conversation around a very important trending topic: ChatGPT. Everyone's talking about it. What on earth is it?

Aditya: I can try taking a shot. I'll just take a step back, and maybe it helps to understand this whole trend from maybe five years ago, this whole notion of large language models, LLMs and whatnot, that we've heard of. I'll explain it in the simplest terms, because that really helped me when I first started reading about it. With the increase in the amount of data that we have, there have been new architectures developed, called transformers, which enable us to model language as: what's the next best word that comes after a sequence of words? Let's just think about this. Yeah.

Levi: So is that like when I'm stumbling on a word and a friend is more articulate than me? They said, "This is the word you're looking for."

Aditya: Exactly, in a way. But how we develop language is a totally different question. The way this was trained is that you look at existing sequences of words in this huge database that we have, the internet as an example. And over time, with a good amount of training data and training power, you can actually start creating probabilistic models: what's the next best word given everything you've seen so far? So these are language models in general, and large language models are the ones with these huge billions of parameters and everything. More data, bigger models, more parameters and whatnot. And that's been the biggest trend. And for that, there have been some developments on architectures called transformers that you might have heard of. And most of these solutions that have come up are under this umbrella of large language models.
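The "next best word" idea Aditya describes can be sketched with a tiny counting model. This is only a minimal illustration, not how GPT-style models are built: real large language models use transformer networks with billions of parameters, while this sketch just counts which word follows which. The three-sentence corpus and the function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn raw follow-counts into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

def predict_next(counts, prev):
    """Return the most probable next word, or None if prev was never seen."""
    probs = next_word_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None

# Hypothetical training text, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(next_word_probs(model, "sat"))
```

The model is also a snapshot of its training text: a word it never saw has no prediction at all, which is a small-scale version of the staleness problem raised later in the conversation.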

Levi: And if we just pause real quick and say: Why do large language models matter? What does this type of evolution mean for the world? How do we benefit from this today in technology, even before ChatGPT and whatever else has changed there? Why do large language models matter?

Aditya: I would say it's many things, because they do have their own cons. Large language models, if not trained with the appropriate objective, are blindly trying to work out what's the next best word that is supposed to come in your sequence of words, not necessarily reasoning about what's happening. And these models, most of them at least in the start, were a snapshot of the data at the instant they were trained, whereas world knowledge is ever evolving. Who the current president is, what the current affairs are, keeps changing, whereas your model is trained on a snapshot. The next best word that was predicted at the instant it was trained might not be the most relevant after two years. So that has been something that people have identified and started working on as the next big thing for large language models.

Levi: So-

Ashish: What I-

Levi: Yeah, you go, Ashish.

Ashish: What I would add there is I think the single most surprising thing about large language models has been that as you grow these things larger and as you train them on more and more data, they show almost incredibly surprising emergent properties that I don't think we had necessarily anticipated. So for example, the capability of producing text that looks very similar to what a human would produce in many cases, the ability to produce very natural sounding language, the ability to write code based on a natural language prompt. These kinds of properties are surprising, and they seem to emerge as a function of scale. So a small language model built the same way as a large language model may not necessarily have those traits. And I think that's been the really surprising and interesting thing for me as far as large language models are concerned.

Aditya: Right and this is the whole notion of generative AI, generative models. And that is the biggest difference maker in large language models. You can start generating content.

Levi: So yeah, Paul. I'd love to hear your thoughts on this.

Paul: So I'm not as impressed with scale as I am with the ability, and it's an expected ability, of large language models and a variety of approaches to machine learning related to language. So for example, something as simple as word embeddings, where we, by any number of statistical or deep learning processes, produce a vector which has great utility for representing the meaning of a word, what we perceive as the meaning of a word, and that we can use computationally in a variety of ways. So anybody who's interested in word embeddings can go look that up. But the essence of that is that from data, by whatever technique, we learn a representation of at first meaning and then knowledge. In the case of word embeddings, it's more meaning. In the case of language models, whether or not large, there's some knowledge that I'll talk about. But there's definitely meaning being represented within the language model. So it's not necessarily generation that we're concerned with in language modeling. It's a representation of meaning that's a little bit bigger than a word. So we have sentence embeddings that are very useful for finding similar sentences. And we have paragraph embeddings that are very useful for finding similar paragraphs, all because they have a representation of meaning. And in that meaning, there are relationships between the concepts that we just talked about in word embeddings, and facts, and multi-step relationships potentially. Certainly, facts are latently represented within these models and we can get embeddings at a variety of levels that represent them. So language modeling and generation of language is not limited to next word prediction. BERT, the first of these, didn't predict the next word. It predicted what's the word that fits here? And to do that, it took in, in some latent way, more than just the meaning of the individual words, but knowledge, some approximation of what the meaning of that segment of text was, in order to fill that slot.
So in language modeling, we're getting a representation of increasing complexity as we increase in training data and scale. That's what's interesting about scale, is we can get to more and more sophisticated representations of knowledge that go beyond simple facts, what you would think of as a simple link.
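To make Paul's point about embeddings concrete, here is a minimal sketch of how vectors can be compared to find similar words. The three-dimensional vectors below are made up by hand purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions. The same comparison works for sentence or paragraph embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, invented for illustration (not learned embeddings).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word, embeddings):
    """Return the other word whose vector points in the most similar direction."""
    target = embeddings[word]
    others = {w: v for w, v in embeddings.items() if w != word}
    return max(others, key=lambda w: cosine_similarity(target, others[w]))

print(most_similar("king", embeddings))
```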

Levi: So if I zoom out a bit and as a layman say, "Why does it matter at all?", language models allow me to now interact with technology more naturally. Is that a fair statement? So tools that I use every day. So is it like when I'm using my email and it starts auto completing a sentence for me, that I assume has come from a language model? When I'm interacting with a voice assistant and I give a command and it is able to understand me and respond to me, that is possible because of large language models. Is that why this all matters? Why does it matter that we're making progress with large language models? And what's the eventual impact for humans interacting with technology?

Paul: Well, I'd react to the use of the word "it's possible" because of language models. I would rephrase that to language models are particularly good at that given our current state of technology.

Levi: Okay, great.

Paul: Other technologies, they're particularly good at that. I don't know that they're required to do that.

Aditya: Exactly, because I think they've made things way better. Some of the examples that you were talking about, Levi, did work even before large language models. It's just that things got significantly better with language model embeddings, which Paul was referring to, and which allowed us to interpret voice commands in a much more accurate manner. Take some of the examples that we were referring to: sentence embeddings, trying to identify which sentence is similar to which other sentence. Things started getting much better with these models.

Paul: And as they got bigger, we got to the point that we could look at and people are currently looking at thousands of words and understanding in some sense, representing that, understanding it to whatever degree that representation effectively represents it, and then using that. Pause.

Levi: So ChatGPT. So now let's jump. So we have large language models and now something happened in the last two weeks in the public sphere that feels different to some of us. We're seeing everybody posting questions they asked and answers that were generated by a computer that feel almost like a human wrote that. And I think you start to say, "Is it different? Did something magical just happen? Was there some huge breakthrough or is this the Turing test that's been broken? What happened?"

Paul: Let's take a vote. Who thinks something significant happened?

Levi: So you get the people who don't know. Me and Jess are like, "What's going on? Everybody's talking about this. Something significant happened."

Paul: I don't think anything significant happened. I think it's just an incremental step along a technical evolution. That's my opinion. How about you guys?

Aditya: I'll just say a thing. I think this was a first visible significant thing, but we'll talk about what was the biggest change. As I mentioned, large language models have been there. GPT that we know of is one example of a large language model. ChatGPT, it comes from a family called GPT-3.5. That's what they refer to. That evolution, 3 to 3.5, it's not like a step function that suddenly changed overnight. It's been an evolving thing, as Paul is saying, which is why it's not one significant strong step, but it's become visible now. But it's been an evolving thing with this whole framework called reinforcement learning via human feedback, RLHF. So that notion was a good framework to take into account bringing human feedback. But that didn't happen as a single step as such because-

Paul: It's not unique to OpenAI and ChatGPT.

Ashish: So I would say, of course there are a bunch of words here and whether the answer is yes or no depends upon the meaning of these words. What is significant. I don't know the significances. But of course what Paul and Aditya are saying is spot on. Any of these things, this is a bunch of incremental improvements month over month, year over year. And that's led us to this place. I think the one significant thing to me about ChatGPT is there is something very different about growing something in a lab and testing it in a lab, and opening it out to anyone in the world and letting them play with it.

Paul: Amen.

Ashish: Yep. And ChatGPT, like GPT-3, again, it's not completely unique. Although, I do the same thing with GPT-3. But this idea that ChatGPT is at a level where you can release it into the world and of course people are testing it adversarially, they're trying to break it in-

Paul: Love that, yeah.

Ashish: Which they can. And by the way, it's not broken down in... Yes, there are all sorts of problems. There are all sorts of [inaudible] cases we're identifying and all sorts of risks. But it seems to be holding up pretty well.

Paul: Can I?

Paul: I'm sorry, Ashish.

Ashish: Go for it, please.

Paul: So thank you. I think you just hit the nail on the head. OpenAI deserves a lot of credit for putting this out there and letting people play with it, I think. Certainly, it's a great generator of prose. It's a fantastic generator of prose and that's along this incremental path. But there is a significant advance that's clearly visible. It's not limited to ChatGPT though. And I'd like to, hopefully, touch on InstructGPT. But I think that there's some... So two things: One, putting it out in the open and two, chat. Chat is engaging for humans. So you put a chat out in the open and people play with it. Boy, that reminds me of an essay that I'd like to reference. There's this guy in the Atlantic that talked about playing with it. And that's a great article. It's Ian somebody. I'll get the name, but I'd recommend that article.

Levi: Yeah, send it to me and we'll link it in the show notes.

Paul: Okay, good. So I'd recommend that article and I think the fact that it was chat and that you could play with it, meaning they put it out in the open, those two things, I think that's what's behind this million-users-in-a-week phenomenon.

Levi: Yeah. Let's talk about how important that is actually. So getting to a million users in one week is almost unheard of in any technology, so that's incredible scale. Why were so many people interested? So what is it that people are seeing when they chat and ask a question that is so thrilling to humans? You all have helped us understand that maybe it's not as much of a breakthrough as it seems, that this is a lot of incremental progress and you already had a lot of this technology in place. But for myself and Jess and others of the million people that are playing with ChatGPT, it seems pretty remarkable to ask any question and get an answer that seems coherent and then, importantly, ask a follow-up question and the system seems to comprehend what we already talked about, and respond in line again. That blows our minds. We haven't seen a computer do that before.

Paul: Over to you on RLHF, Aditya.

Aditya: Yeah, I'll just take one second. I wanted to underline something that you said, Levi, where you said when you ask a question and you get a coherent answer. I think the key over there is the coherent answer. Remember that you didn't say an accurate answer.

Levi: Let's talk about that.

Aditya: Yes, because that's fundamentally what the whole notion of RLHF was as well.

Ashish: Oh wait. Aditya, may I just add something to what you said?

Aditya: Yes.

Ashish: You said coherentness versus correctness. I will say though that I have been impressed by its ability to handle context. The fact that it seems to maintain context across not one, two, four, but 20 turns of dialogue, that has been impressive to me.

Paul: Well that, just to interject, I would attribute that to the size, that's a big part, and the span of material that it can attend to and represent.

Levi: Is it fair to say that that is a human-like trait that previously most people didn't see in computing? Maybe in some applications we have context and history, but when I talk to my wife or my friend or my colleague, we have all of this shared history that every conversation is built on, and we're going back to things and drawing things in and it is just this evolving relationship. So I've never had that with computing. Even the fact that I can ask multiple questions like that didn't really ever feel possible before. So is that different?

Ashish: That's right.

Aditya: That I would say, and again, I'll underscore your phrase, human-like maintaining of context. That goes to RLHF, or reinforcement learning via human feedback. To answer your question, yes, I think this is the first time, at least that I personally have seen, a conversation where it can hold a really long context, where over multiple turns it still maintains the context very well. When there is a need to disambiguate though... We all have been trying to break it, have adversarial conversations, trying to confuse it with multiple contexts. There is a subtle difference: humans tend to disambiguate in such cases, where what you're referencing could be multiple things. You say, "Oh, did you mean this or that?" We haven't seen that in every example that we've interacted with the chatbot on. It prefers one over the other. But besides that, it's extremely human-like in the way it maintains its context. You are absolutely right. And that, I think, is partly attributable to the way it's trained on the whole notion of human feedback.

Levi: Can someone explain that? So first define RLHF, the acronym? Okay, what is it exactly?

Aditya: So fundamentally, in this whole framework, at least the RLHF framework used by OpenAI, what was done is they brought human feedback into the loop in the following way: When they train and they're trying to fine-tune the model, they need a scoring function, which is answering the problem: 'Which among the two things is better?' But-

Ashish: May I say something very quickly there?

Aditya: Yes, please.

Ashish: I'm sorry to interrupt. Levi, I think the one critical aspect here is there are certain tasks in which goodness, how well did you do that task is obvious. For example, if you're building a machine that plays chess, you can measure well how often does this machine win? And so the goodness of the machine or the algorithm is clear.

Aditya: It's objective.

Paul: Yeah.

Ashish: However, there are several... Sorry Paul, did you want to say something?

Paul: I do. So right there, I wanted to introduce the concept of RL for the audience, which is we can take exactly your analogy of chess and we can talk about AlphaZero or whatever it was, where we had a system that learned to represent the chess board, just like a language model learns to represent the meaning of a sentence. So it learned how to represent the chess board and then we let it play itself. And the result of playing itself allowed it to reinforce what worked. It learned what worked in playing chess by reinforcing the decision function of how it chose a move given a position. That's-

Ashish: That's brilliant.

Aditya: This is the reward function.

Ashish: Let me just complete that thought now.

Aditya: Yes.

Ashish: Sorry, Aditya. I'll turn that back to you.

Aditya: Go ahead.

Ashish: That was a really nice way that Paul put it. So the idea was if you know what a good outcome is and you have a sequence of decisions to make, you can think of reinforcement learning as a technique where you try out sequences of decisions, you see which sequences lead to good outcomes, which sequences lead to bad outcomes, and you reinforce the decisions that lead to good outcomes. You reinforce. So you more-

Levi: Like parenting.

Ashish: Absolutely. Now.

Jess: Or teaching.

Levi: Or teaching, yeah.

Paul: Oh my goodness.
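The reinforcement idea Paul and Ashish describe above (try decisions, see which lead to good outcomes, and reinforce those) can be sketched in a few lines. This is a heavily simplified illustration and not the chess system they mention: it assumes a single decision per round instead of a sequence of moves, and the three actions with fixed payoffs of 0.2, 0.5 and 1.0 are invented for the example.

```python
import random

def train_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn which action pays best by trying actions and tracking outcomes."""
    rng = random.Random(seed)
    values = [0.0] * len(rewards)   # running estimate of each action's payoff
    counts = [0] * len(rewards)     # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(rewards))
        else:
            # Exploit: pick the action that has worked best so far.
            action = max(range(len(rewards)), key=values.__getitem__)
        reward = rewards[action]    # the outcome of this decision
        counts[action] += 1
        # Move the estimate toward the observed outcome (incremental average).
        values[action] += (reward - values[action]) / counts[action]
    return values

# Hypothetical payoffs for three possible actions.
values = train_bandit([0.2, 0.5, 1.0])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action)
```

The key ingredient is the `reward` line: here the outcome is a known number, which is exactly what becomes hard to define for a chat, as Ashish goes on to explain.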

Ashish: The thing is though that there are a bunch of really useful and interesting tasks where outcomes are harder to measure. So for example, think of a chat. So if you want to have a chat, at the end of it, you have some notion of whether that chat was satisfying and good. But if I ask you to come up with a mathematical function that you could use to apply to a chat and tell me, "Yeah, this chat was five good and this chat was negative three good", that's harder to do. And so this whole RLHF function, and I'm going to turn this back to Aditya, sorry. This whole RLHF approach attacks tasks in which the scoring, how good is a chat, how good is a summary, how good is a response to something, is harder to measure. And OpenAI came up with this really inventive way, I think, of tackling those kinds of problems, in which the metric, the measure, is hard to compute. Aditya, please.

Aditya: Right. No, thanks a lot Ashish, because this is what we are going to refer to as the reward function. Going back to Levi, you were saying parenting, and Jess, you were saying teaching. There's this notion of reward and punishment in general. So this is the reward function, and in general in reinforcement learning, it's more mathematically defined in most of the typical examples. What was changed in RLHF, which stands for reinforcement learning via human feedback, is to make this reward function based on human understanding. Basically their feedback. So what they've done is try to model the reward function as human preference, as in you give two outputs and you say, "Hey, human annotators. What would you prefer among these two?" It's a comparison. You can't give it a number. So they started making it a comparative study. "This output versus this output. What would you prefer?" And a lot of data was collected. Thanks to GPT-3 and other things where we all have thrown up so much data, they had all that data. So they took those outputs and they gave them to other humans and actually said, "Which one sounds better or which one looks better?" And their responses were modeled as a reward function.
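One common way to turn "which of these two do you prefer?" answers into a numeric reward is a pairwise preference model, sketched below. This is an assumption-laden toy, not OpenAI's implementation: it learns one score per fixed answer (a Bradley-Terry style fit), whereas a real reward model is a neural network that scores arbitrary text. The answer labels and comparison data are invented.

```python
import math

def fit_reward_scores(items, comparisons, lr=0.1, epochs=500):
    """Learn one score per item from (preferred, rejected) pairs."""
    scores = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Probability the current scores assign to the human's choice.
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient step on the log-likelihood of that choice:
            # raise the preferred item's score, lower the rejected one's.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores

# Hypothetical annotator judgments: each pair is (preferred, rejected).
comparisons = [
    ("answer_a", "answer_b"),
    ("answer_a", "answer_c"),
    ("answer_b", "answer_c"),
    ("answer_a", "answer_b"),
]
scores = fit_reward_scores(["answer_a", "answer_b", "answer_c"], comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nobody ever assigned a number to an answer here; the scores fall out of the comparisons alone, which is the point Aditya makes about it being a comparative study.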

Paul: They actually did it in two phases though. They actually did do a data gathering where they scored the absolute. But this is an important step in language generation today and it's not OpenAI's alone. Language generation from any of these approaches to date has largely been a beam search using something called Viterbi to basically spit out text in a probabilistic manner with no feedback, no scoring function. So it didn't get the benefit of any feedback whatsoever, let alone the benefit of reinforcement learning. So this use of human feedback with reinforcement learning, RLHF, is recent, but it's not unique to ChatGPT. In fact, they did a nice job with it in the predecessor to ChatGPT called InstructGPT, which is more interesting in my opinion. So back to you, Aditya.
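For readers unfamiliar with the beam search Paul mentions, here is a minimal sketch of the general technique over a hand-made next-word table. The table and its probabilities are invented for illustration, and nothing here is taken from any particular system's decoder. The search keeps only the `beam_width` most probable partial sentences at each step, using probabilities alone and no feedback signal.

```python
import math

# Hypothetical next-word probabilities, invented for illustration.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.9, "dog": 0.1},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def beam_search(next_probs, start, length, beam_width):
    """Return the best partial sequences as (words, log-probability) pairs."""
    beams = [([start], 0.0)]
    for _ in range(length):
        candidates = []
        for seq, logp in beams:
            # Extend each kept sequence by every possible next word.
            for word, p in next_probs.get(seq[-1], {}).items():
                candidates.append((seq + [word], logp + math.log(p)))
        if not candidates:
            break
        # Keep only the most probable sequences for the next round.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(NEXT, "<s>", 2, beam_width=1)[0][0]
wider = beam_search(NEXT, "<s>", 2, beam_width=2)[0][0]
print(greedy)
print(wider)
```

With a width of 1 the search commits to "the" because it is the likeliest first word, while a width of 2 keeps "a" alive long enough to find the more probable two-word sequence.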

Aditya: Yeah, so as I was mentioning, OpenAI has done this series of models that they refer to as the GPT-3.5 family of models, where all of them, including InstructGPT and ChatGPT, follow the RLHF framework, where they were able to model human feedback into the reward function and optimize for that. So now going back to why we are seeing the responses being coherent: It's because humans have actually preferred that when looking at two different outputs. Having said that, if the choice had also been made to ensure that they're accurate, maybe things would've been different, obviously, because it would learn to also be accurate along the way.

Levi: Okay, so let's-

Paul: Two quick comments. So if you go and you look at OpenAI's website, they make two statements related to that. One of them is that in the actual human feedback and in the development of the training cases, the people they used were biased towards longer utterances. So that's why ChatGPT is verbose. The second point is a very interesting technical point that they made, and it's more than technical: what they said is that what GPT-3 knows is different than what the people know. And both of those, not to mention the limits on what ChatGPT knows, lead to many of the errors that you see in the output.

Levi: Actually... Oh yeah, you go ahead, Aditya.

Aditya: One last thing. The benefit of human feedback is that it also addresses the toxicity of large language models, which sometimes generate really bad content.

Levi: Yeah, let's talk about that, actually. So let's talk about as exciting as this is, I think you see a lot of questions out there about is it ethical, is it good? Are we going into a dangerous territory? So I don't know. Jess, I know you've been looking a lot about this. What have you seen out there that people are worried about? And then we'll go specifically in education and talk about why are people either worried or excited about what this could mean for education?

Jess: Yeah, well, I think in a general aspect, I actually was thinking back to that step function that you were talking about before, is that this, for the layman does feel like a big jump. And I was a math teacher before in the classroom. And so us, we just see two different points and we don't see that gradual curve getting there. So I think that on an ethics standpoint, I've definitely seen examples here of the chat bot releasing racist, sexist examples when you code it in a certain way or provide a certain context that gets around some of their blockers. So I think that there's certainly a worry from just general people about what content can be produced in there.

Paul: Yeah, I have a couple of thoughts if you guys don't mind.

Jess: Go ahead.

Levi: Yeah.

Aditya: Yes.

Paul: Okay. First, I don't think we have to worry about that too much with some caveats. So I think Ashish, you might be able to touch on this better than I, but I think we're going to have adversarial AIs that basically filter toxicity and bias out. I don't think that that's mature yet, but I think that's coming. I do think that we have a deep problem with regard to trust of these systems because we don't entrust in the decisions or the opinions expressed by these systems. It's up to somebody behind the curtain to cloak it in adversarial filters or whatever, and it's their policies that are going to constrain, if not dictate, the behavior of the resulting agent. So you can only trust the agent to the extent that you trust and allow time for all of this to mature. But then, even then, to the extent that you trust its creators.

Ashish: Paul, that's a fantastic point. So already even with just search engines, there are always controversies about whether they are prioritizing certain links and making other links harder to find. With something like an AI system that produces text, all of those factors are exacerbated, because now you don't even see pages and links. You just see summarized information that this thing is producing for you. I think that's a fantastic point.

Levi: So let's look at... We have 15 minutes maybe left in our conversation. Let's steer towards education. We at Merlyn Mind have chosen to be, as our CEO Satya Nitta likes to say, translators of the science. We're trying to take what's cutting edge in artificial intelligence and apply it in solutions and products that can help teachers and students to learn, to engage, to better use technology in the classroom. What are some of the exciting kinds of paths that these types of developments open in education, and what are some of the concerning ones? You guys have been thinking about this for sure. What are the good ones, the bad? I've heard things like, "There's no essays. SATs, see you later. [inaudible] essays anymore." Is that true? What's coming now?

Ashish: I would love to hear Jess's viewpoint on that actually. Before we get into any technical stuff, I'd actually love to hear from Jess a lot.

Jess: Yeah, I think that there's definitely... That is a first reaction that has come from the teacher communities. What does this mean for academic honesty and does this mean that it's plagiarism to type your essay prompt into ChatGPT and get a coherent essay? But if you think about it, there have been resources for students to lean on in a way that have existed before this too. And I think that anytime a new technology or resource becomes available, it forces teachers in education to think about, "Okay, what am I really trying to assess from a student and how do I get around this process that a computer or an essay database could just fulfill?" So I think one way that I've even seen is what you guys were talking about: The responses may be coherent but not accurate. So maybe a different way of assessment looks like plugging in these essay prompts and then having students evaluate how great the essay is that has just been spit out and really thinking about, "Is this accurate? What's a better word that this AI could have used? Does this have the correct voice that we want?" And really using concepts that we want to teach and using it in an evaluative way. So I think it's going to push us to try to find new forms of assessment if the AI becomes powerful enough that it truly is replacing the essay writing itself. I also think that there's a way for us to get at asking students to explain their process too. So you could have students present on the topic they wrote about instead of just relying solely on the writing portion of it. If you think that maybe the students didn't produce this piece of writing in that creative process, they're still having to express that creative process through a presentation or some other form of assessment. So, particularly on the essay, those are some thoughts that I've seen.

Aditya: I have to say I love what you said about rethinking assessment and using this available technology to rethink that. I'll just share one thing, and maybe that's connected to what you just said. If I may just share this example that I was playing around with on ChatGPT... Can you guys see my screen?

Ashish: Not just yet. Yes.

Levi: Yes.

Aditya: Okay. Why is seven plus five equal 10? It justified its answer. So this is an assessment thing. So if you actually ask a question, which is a general question-

Levi: Well, explain this more for people who aren't seeing it. So Aditya asked the question to ChatGPT: "Why is seven plus five equal to 10?" And then-

Aditya: Which is obviously wrong and you can see the answer from ChatGPT-3, saying, "The basic principle behind arithmetic is that it allows us to collect and combine numbers in a variety of ways. The symbol plus is used to denote addition, which is the process of combining numbers together to produce a new total. In the case of seven plus five, the two numbers are combined to produce a new total of 10." And that's where things break.

Levi: So why did it break here?

Aditya: That's the thing. This is one example of the different things that I've seen while I played with ChatGPT-3. If you give why questions with a totally wrong statement, it still justifies your thing. But if I ask it generally, we could try it. What is seven plus five? I'm sure it'll be the right answer.

Paul: And if you said to it, "But wait, seven plus five is 12", it would come back with something interesting.

Aditya: We could try, but-

Paul: Well, okay. I'd like to get back if I could to that question about applying this in education, number one. I thought, Jess, your comments were fantastic. I don't fully see how to pull that off because if we use generated text to assess, there's more work for the student to assess. There's still the work of actually evaluating how they assess that text and this tool wouldn't... Or maybe it could help with that. So this is a case, this seven plus five equals 10. But more generally the case that the system really doesn't know much and isn't thinking about much. And then there are many examples with ChatGPT where we see fallacious or misleading responses. I'm very interested in a system that demonstrates the value of curated knowledge. So if we don't feed a system like ChatGPT with internet randomness, but we feed it high quality text like they did in Galactica, I think we'll get a lot better results out of it. But still, we need more beyond what these things are doing to critique an answer, literally to look at the generated text and have some confidence at saying that it's accurate. We don't have that. We need that.

Jess: Yeah, I think that could be an incredible skill for teachers to implement and already is implemented. Students are interacting with media and stories that have a lot of falsehoods in them already, and that's already a skill that we are trying to teach our students of how to sort through what's false and what's true, even when it's really convincing and looks like anything could be true nowadays. So I think that that's a really powerful concept and skill that we've been teaching for a while and that this AI tool could just be a new way to integrate that.

Paul: Cool.

Aditya: And again, that is the reason your thought on rethinking assessment was just really spot on for me. It's a tool. As in one of our previous conversations, Ashish, I think you mentioned, the genie is out of the bottle. It's not like you can push it back in. It's out there, the kids are going to use it. Now, we have to figure out how best to actually rethink it. Maybe it's going to help us rethink the notion of assessments and education and learning. When you say, "Why does seven plus five equal 10", whatnot, the expectation is not for you to come up with a coherent answer. That's not what you're supposed to be learning. You're supposed to be learning how addition works. And that's what we are supposed to test, not your justification of the answer, which is wrong.

Paul: Yeah. Could I take one more step here by referencing Retro? So if we took Galactica with a large body of academically vetted content and we augmented it with a memory where not only does the system generate text, but it guides that generation of text heavily by retrieving information from this curated knowledge base, that's essentially what Retro did, I think we could do a fantastic job of delivering accurate content to students if we stepped in that direction.

Ashish: So to reiterate that point, absolutely. So this notion of grounding the AI in high quality content as opposed to what Paul called internet randomness, I think that is a crucial potential step forward, that would make it incredibly useful to education use cases, to all kinds of other use cases.

Levi: As we think about... So Paul, you want to finish that thought first? Sorry, didn't mean to cut you off.

Paul: Well, there's one really. So there's two more things that I think, if we have time, are very relevant to education.

Levi: Like what?

Paul: One of them is we don't have to generate paragraphs of text. If we can generate small useful sentences like "What's the name of the planet just further from the Sun than Mars?" if we can do that, now all of a sudden we can generate all sorts of tools and utility in education. So we don't have to just chat. We can generate small pieces of text. And some of those small pieces of text might be pithy answers and we might generate questions like multiple choice questions. We might be able to generate multiple choice. We might be able to take the teacher's guidance. "Hey, generate me a multiple choice question about why planets don't fall out of orbit," or something like that. And maybe the teacher finds that useful in teaching whatever lesson. So I just wanted to make the point that we don't have to aim to generate paragraphs. Aim for generating smaller even structured things.

Paul: The second thing is... Yep?

Levi: Okay, go second thing.

Paul: The second thing is I want to get back to what's uninteresting about ChatGPT, which is that it has no purpose. In education, we want to educate. That's our purpose. So how do we bring that purpose into this chat? And I'll touch on a system and then a goal. Recently, there's another system called Cicero, which uses this RLHF framework to play the game of Diplomacy, which requires communication and essentially negotiation and coercion or convincing people. It's like playing Risk. And this system does very well. It actually learns by having strategy. It has the goal of winning the game and it learns how to communicate so as to achieve its goal. Even though that communication takes many steps of dialogue, if we can get the goals to be our learning objectives, perhaps we can have an AI that teaches with a purpose using these kinds of capabilities and takes multiple steps in its chat, trying to lead you to achieve learning objectives, mix in some of what we just said about assess generated assessments or preexisting assessments along the way, and perhaps a goal that I hope to see someday, engage in really good Socratic dialogue.

Levi: Actually, we have just a few minutes left and I'd love everybody to close out with a similar comment to Paul's: what are your hopes and wishes to see one day? It sounds like there's an opportunity to use advancements like this to drive more engaging Socratic discussions in class, more engaging Socratic discussions one-on-one potentially with an AI. As we think now about the role of an institution to shepherd this in correctly... I think we heard that before, Paul. I think it was you saying it's really about the goals behind the institution bringing this. What are you trying to achieve with this? Merlyn Mind was founded with the explicit intent to say there's cutting edge technology that could have more impact in education for the good if we use it correctly and build solutions and applications that are truly focused on the right goals, the right outcomes, and building right systems to get there. We've so far focused on helping teachers more effectively navigate and use the technology they have in their classrooms through voice to make it easier to interact and through multimodal control. So I can use a remote control, I can touch an interface, I can use a voice command, and automations. I can take something that took multiple steps and compress it into one and save myself time as a teacher so I can engage with my student. As you think about the role that Merlyn Mind will play going forward and continue to bring in technologies that are on the cutting edge, like these types of large language models and what is happening with ChatGPT, what does it mean for a company like Merlyn Mind to continue to have more of an impact on teaching and learning? Is there opportunity for us to bring these things into what we're doing to better help teachers and students?

Aditya: You touched on this point. You just said, "So far, Merlyn Mind has done supporting teachers on a technical control, just navigating through the classroom." I think things like ChatGPT in general, these developments in AI, allow us to support teachers in doing what they really eventually care about, which is learning instruction and not through navigating of technology and things like that. There are automations even over there. There are things that can be happening even over there, and that's where we are heading towards. These kind of advancements are also giving us an opportunity to rethink what is the crux of education, what does learning even mean? What is instruction? What is learning? There are so many other things happening in our classroom today, which I think are creating the noisy environment to eventually not let the teachers do what they love doing, which is teach, instruct and learn and whatnot. While AI can help declutter that and really focus on this already, it can also support them, do this in a much better manner. I'll rephrase that. Rather than better, I would say more efficient than more support, which I'm sure is what teachers would want. So this is just one example. As we were discussing about ChatGPT, we have seen a few things. But there are a few ways to go even before we can actually have something usable for teachers because of some of the shortcomings we discussed. But it's an exciting time. It's super exciting time. I'm actually looking forward to what all we can do with these kind of developments.

Levi: Great. Ashish?

Ashish: Yeah, I think what I would say is I'll agree with both Paul and Aditya. So we've focused this conversation on what is the implication of ChatGPT for students? Oh, if a student uses ChatGPT, are they going to comprehend? Are they going to produce inaccurate stuff? Are they going to find misinformation? I think there is the other side of it, which is what are the implications of ChatGPT for teachers? How can we make the tool usable for teachers so that teachers can do what they like doing effectively, more efficiently? I think that is a really crucial side as well.

Aditya: And teachers are also less adversarial than students.

Paul: I have a quick comment and then I'd love to hear what Jess has to say. Okay, so with regard to teacher productivity in particular, so I think the actual teaching, helping people learn, I'm much more interested in that. But our mission is also to help the teacher in the classroom and helping her be more productive in having less distraction doing rote work or working with technology, so that she can focus on students and learning, is core to that mission. So if you take a look at something like WebGPT or InstructGPT and you think about being able to tell, to demonstrate or what's coming out of a deck, if you pursue the notion of being able to demonstrate or tell a machine how to do some of your rote work, perhaps we can enable that teacher to be more productive, to not be so embroiled in dealing with technology and systems and more focused on learning.

Levi: Wonderful. Jess, final thoughts?

Jess: I have so many thoughts, but I'll try to keep them condensed. I think in terms of teacher productivity, I've already seen examples of people plugging into this like, "Create a rubric for this question," and ChatGPT will spit out a rubric for you. And I don't see far off saying, "Create a lesson plan around this topic." There are already structures and that's not new. There's so many resources for teachers to access pre-made units, pre-made lessons, things like that. So this fear of this technology will take over my job as a teacher, there are already resources out there. I think what teachers add is being able to take those base blocks and then being like, "Okay, here's a boilerplate lesson, but how do I make this engaging for my students, their interests and their needs, and how do I differentiate it for my different learners?" And maybe this could be the building block that teachers then build off of, what you guys were mentioning. The other thing that it reminds me of is just I think it gets down to what kinds of things do we want humans to continue to do and what do we prefer to have technology figure out instead? And I think about in my own math classroom as a teacher letting kids use calculators. I think there was... I'm sure when calculators were first entering the classroom, it's like, "Oh, what's lost? Now the student doesn't... We're not going to teach students how to do these crazy multiplication and division problems anymore? They can just plug it into a calculator? Are we going to let them get away with that?" And the answer is yes. Why would we waste our students' time asking them to do this crazy calculation by hand when the technology exists for them to do that? And instead, student-teacher time could be focused on understanding the concept and "What do I actually plug into the calculator?" and thinking about "What am I actually looking for?" and the purpose. So I think those are my thoughts, are really figuring out, what do we want to offload on technology and what do we determine is actually really important conceptually for students and teachers to engage in?

Paul: I am going to enjoy reading the transcript for those wonderful deep thoughts. That was fantastic. In the course of it, just one small aspect of everything you had to say there, which was fantastic, I was reminded, Ashish and Aditya, of our conversation where we wanted to get a chat bot to know enough about teaching that it could answer the question: "How would you teach a fifth grader about gravity?"

Levi: To be continued on our next episode. I so appreciate Aditya, Ashish, Paul, Jess. Thank you for coming and spending time talking about this today. We will continue having these conversations. We are thrilled that the experts in AI have chosen to dedicate their careers to education. I am so happy to be part of this company because I believe what we're doing matters. Let's keep doing it. Let's keep building solutions that help teachers and students and let's translate, like Satya says, and bring this stuff to education. To be continued on the next conversation. Thanks all.

Levi: Thank you for joining us for this episode of Unsupervised Learning. Until next time, keep learning.



Today's Host

Levi Belnap | Chief Strategy Officer, Merlyn Mind

Today's Guests

Ashish Jagmohan | Distinguished Research Scientist, Merlyn Mind
Aditya Vempaty | Engineering Manager, AI Systems, Merlyn Mind
Paul Haley | Distinguished Engineer, Machine Learning, Merlyn Mind
Jessica Williams | Strategy Analyst, Merlyn Mind