Assessing With AI: Insights From the Machine Learning Minds at MetaMetrics
Unlocking the Potential of AI to Score Student Writing
AI has amazed us over the last two years. But we have all seen it make some comic (sometimes tragic) mistakes. Can it be trusted to evaluate students’ written assignments or essays? If so, how does it work? What are its benefits and limitations? Will it always be accurate?
These are the big questions that we’ll be answering in the first ten posts in this blog series.
Alistair Van Moere, Ph.D.
President
MetaMetrics
What are the two kinds of AI for scoring student essays?
Many people throw around “AI” as a general term, but in fact, there are many kinds of AI, for many different uses. For example, did you know that there are two main kinds of AI for scoring student essays?
- Feature-based scoring
- Deep learning scoring
Let’s take a closer look at each of these approaches.
1) Feature-based scoring
This approach to automated essay scoring has been in use for decades. In the early years, it was pretty rudimentary. For example, an algorithm might count the number of words a student wrote, the number of errors, the length of the words used (longer words tend to be more sophisticated), and the proportion of common versus rare words. All of these metrics are known as features: hence, feature-based scoring.
It turns out that these features are really good at predicting whether a teacher would give an essay a high score or a low score. Basically, you can build an algorithm that uses these features, and it will end up assigning scores that are very similar to the scores a teacher would give.
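To make that concrete, here is a toy sketch of the early counting approach. The feature set, the common-word list, and the use of scikit-learn are all illustrative assumptions on our part, not anyone’s production system:

```python
import re

from sklearn.linear_model import LinearRegression

# A few very common English words, used to estimate word rarity (illustrative).
COMMON_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "that"}

def surface_features(essay: str) -> list[float]:
    """Count the kinds of surface features early systems relied on."""
    words = re.findall(r"[a-zA-Z']+", essay.lower())
    n = max(len(words), 1)
    return [
        float(len(words)),                          # essay length
        sum(len(w) for w in words) / n,             # average word length
        sum(w in COMMON_WORDS for w in words) / n,  # share of common words
    ]

def fit_scorer(essays: list[str], teacher_scores: list[float]) -> LinearRegression:
    """Fit a regression mapping features to the scores teachers assigned."""
    features = [surface_features(essay) for essay in essays]
    return LinearRegression().fit(features, teacher_scores)

# Usage: model = fit_scorer(training_essays, training_scores)
#        predicted = model.predict([surface_features(new_essay)])
```

Notice why a system like this is easy to game: every feature goes up if a student simply writes more words, longer words, or rarer words.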
The problem with this kind of essay scoring is that it’s easy for students to “game the system”. They can just write more words, or use big words, or write longer sentences with more commas – and they will get a higher score. They can also write clever-sounding nonsense, or factual inaccuracies, and still get a high score.
So, that was the old way of doing things. Fast-forward to the 2020s, and, thanks to breakthroughs in natural language processing, feature-based scoring has come a very long way indeed. We no longer count words or errors. In fact, at MetaMetrics we have a collection of hundreds of features, and many are really hard for students to trick. They go beyond surface features such as grammar and spelling; instead, we can look for patterns in the writing and analyze the meaning of what students write.
For example, here is a taste of three features from the MetaMetrics feature library:
- Semantic support: How semantically similar is a word to all the words leading up to it?
- Semantic predictability: How predictable is a word, given the previous string of words?
- Semantic similarity: How do the words and word-combinations in an essay compare with the words and word-combinations in high-scoring essays?
By analyzing every word in the essay, and its grammatical and semantic fit with every other word in the essay, we can take feature-based scoring to a whole new level. Students cannot “game” these features or get a high score by writing nonsense.
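To give a flavor of how a feature like semantic predictability might be computed, here is a minimal sketch using an off-the-shelf language model (GPT-2 via the Hugging Face transformers library). This is one plausible realization under our own assumptions; the actual MetaMetrics feature library is not public:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mean_predictability(text: str) -> float:
    """Average log-probability of each token given the tokens before it.

    Higher values mean the writing is more predictable word by word.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict the token at position i + 1,
    # so score ids[:, 1:] against logits[:, :-1].
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
    return token_log_probs.mean().item()

# Usage: mean_predictability("The cat sat on the mat.")
```

A feature like this is hard to game because inflating length or word size does nothing to improve how well each word fits its context.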
2) Deep learning scoring
This is a new and very different approach. It uses the same kind of models that underlie Large Language Models (LLMs) like ChatGPT or Google’s Gemini. LLMs are trained on billions of words and have billions of parameters. (Think of a parameter as a knob that can be adjusted, or as one of many different angles from which the model views the same data.) In these models, words, sentences, and whole texts become abstract representations. Some models are designed to understand texts, some are designed to generate texts, and some do both.
So, you can design or fine-tune the model for different purposes. For example, you can fine-tune it to be a chatbot, or to interpret medical records to look for diseases. Well, you can also fine-tune the model to score essays: you give it instructions for what to do, along with examples of other essays that have already been scored. Just as with feature-based scoring, you end up with an algorithm that scores student essays in a very similar way to how teachers would.
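Here is a minimal sketch of the “instructions plus scored examples” recipe, using few-shot prompting against a hosted LLM. The OpenAI Python client, the model name, and the tiny example set are all assumptions for illustration; true fine-tuning bakes the same kind of scored examples into the model’s weights rather than into the prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Tiny illustrative set of human-scored essays (real systems use far more).
SCORED_EXAMPLES = [
    ("My summer was fun. We went to the lake and it was fun.", 2),
    ("The lake at dusk shimmered like a dropped coin, and I finally "
     "understood why my grandmother called it her church.", 5),
]

def score_essay(essay: str) -> str:
    """Ask the model for a 1-5 score, guided by instructions and examples."""
    examples = "\n\n".join(
        f"Essay:\n{text}\nScore: {score}" for text, score in SCORED_EXAMPLES
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You score student essays from 1 (weak) to 5 "
                        "(strong). Reply with the score only."},
            {"role": "user",
             "content": f"{examples}\n\nEssay:\n{essay}\nScore:"},
        ],
    )
    return response.choices[0].message.content.strip()
```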
Which one is better?
You might be thinking that deep learning scoring, built on the really powerful new AI, must be the best approach to take. Well, sometimes. It really depends on your goals and the context in which students are writing. Here are some pros and cons, and factors to consider:
- Deep learning scoring is more flexible, may be quicker to implement, and in many cases might be good enough for quick feedback to students.
- Feature-based scoring is explainable and meets GDPR standards (more on that in another post).
- In our studies at MetaMetrics, we obtained more accurate results with really well-designed feature-based models than with deep learning models (yes, really!).
- It’s easier to analyze feature-based models for possible bias, which is vital for fairness and equity.
At MetaMetrics we have implemented best practices in both approaches. Which scoring approach to adopt, and when, is an important decision that we would be happy to help with.
Add MetaMetrics® Writing AccuRater to Your Literacy Programs
MetaMetrics has decades of experience analyzing text and is excited to announce MetaMetrics® Writing AccuRater, the state of the art in AI analysis of student writing.
Are you an edtech or assessment company? We’d love to power your learning program. If you are interested in incorporating writing into your learning activities or assessment, then we are ready to listen to your needs. Contact our team to discuss.