TL;DR
An alpha version of Docent, our experimental AI-powered grading system, is now available at https://get-docent.com/. If you're interested in using the system, please contact us for support.
The Challenge of Grading
One thing that I find challenging when teaching is grading, especially in large classes with numerous assignments. The task is typically delegated to teaching assistants with varying levels of expertise and enthusiasm. One particular challenge is getting TAs to provide detailed, constructive feedback on assignments.
Our Experiment with LLMs
With the introduction of LLMs, we began exploring their potential to enhance the grading process. Our primary goal wasn't to replace human graders but to provide students with detailed, personalized feedback, effectively offering an on-demand tutor and addressing "Bloom's two-sigma problem":
"The average student tutored one-to-one using mastery learning techniques performed two standard deviations better than students educated in a classroom environment."
To evaluate the effectiveness of LLMs in grading, we used a dataset of 12,546 student submissions from a Business Analytics course spanning six academic semesters. We used human-assigned grades as our benchmark.
Good Quantitative Results
Our findings revealed a remarkably low discrepancy between LLM-assigned and human grades. We tested various LLMs using different approaches:
- With and without fine-tuning
- Zero-shot and few-shot learning
While fine-tuning and few-shot approaches showed slight improvements, we were amazed to find that GPT-4 with zero-shot learning achieved a median error of just 0.6% compared to human grading. In practical terms, if a human grader assigned 80/100 to an assignment, the LLM's grade fell within roughly the 79.5-80.5 range for half of the submissions, a striking consistency with human grading.
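To make the metric concrete, here is a minimal sketch of how a median error between LLM and human grades could be computed. It assumes the error is measured relative to the human grade, which is one plausible reading of the 0.6% figure; the function name and the sample grades are made up for illustration and are not taken from the dataset.

```python
from statistics import median


def median_relative_error(human_grades, llm_grades):
    """Median of |llm - human| / human, expressed as a percentage."""
    if len(human_grades) != len(llm_grades):
        raise ValueError("grade lists must have the same length")
    errors = [
        abs(llm - human) / human * 100
        for human, llm in zip(human_grades, llm_grades)
        if human > 0  # relative error is undefined for a zero grade
    ]
    return median(errors)


# Made-up grades for illustration only (not from the study's dataset).
human = [80, 90, 70, 100, 60]
llm = [80.5, 90, 68, 99, 60.3]
print(f"median error: {median_relative_error(human, llm):.3f}%")
```

Using the median rather than the mean keeps a handful of badly graded submissions from dominating the summary, which is also why a low median can coexist with a noticeable tail of disagreements.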
Qualitative Feedback: Where AI Shines
LLMs excel at providing qualitative feedback. For example, in this ChatGPT thread, you can see the detailed feedback the LLM provided for an SQL question in a database course. That is much better and more detailed than anything a human grader was ever going to provide.
Real-World Implementation: Docent
Encouraged by these results, we implemented Docent to assist human graders in our Spring and Summer 2024 classes. We also conducted a user study to assess the perceived helpfulness of LLM-generated comments. However, during deployment, we identified several areas for improvement:
- Excessive Feedback: The LLM often provides too much feedback, striving to find issues even in near-perfect assignments.
- Difficulty with Negation: Despite clear grading guidelines, LLMs struggle to ignore specified minor shortcomings. See below :-)
- Multi-Part Assignment Challenges: For assignments with multiple questions, grading each question separately yields better results than assessing the entire assignment at once.
- Inconsistent Performance: While median performance is excellent, about 5-10% of assignments receive grades that differ from what a human grader would assign, leading to student appeals.
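The multi-part point above can be sketched as a loop that grades one question at a time and then sums the results. This is only an illustration, not Docent's implementation: `grade_per_question`, `QuestionResult`, and the grader signature are hypothetical, and `stub_grader` is a placeholder standing in for an actual LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical grader signature: (rubric, answer, max_score) -> (score, feedback)
Grader = Callable[[str, str, float], Tuple[float, str]]


@dataclass
class QuestionResult:
    question_id: str
    score: float
    max_score: float
    feedback: str


def grade_per_question(
    answers: Dict[str, str],
    rubrics: Dict[str, str],
    max_scores: Dict[str, float],
    grader: Grader,
) -> List[QuestionResult]:
    """Grade each question in isolation so each call sees one rubric only."""
    results = []
    for qid, answer in answers.items():
        score, feedback = grader(rubrics[qid], answer, max_scores[qid])
        # Clamp in case the grader returns an out-of-range score.
        score = min(max(score, 0.0), max_scores[qid])
        results.append(QuestionResult(qid, score, max_scores[qid], feedback))
    return results


def stub_grader(rubric: str, answer: str, max_score: float) -> Tuple[float, str]:
    """Placeholder for an LLM call: full marks for any non-empty answer."""
    if answer.strip():
        return max_score, "Answer provided."
    return 0.0, "No answer submitted."


answers = {"q1": "SELECT name FROM students;", "q2": "   "}
rubrics = {"q1": "Returns all student names.", "q2": "Counts enrolled students."}
max_scores = {"q1": 10.0, "q2": 5.0}

results = grade_per_question(answers, rubrics, max_scores, stub_grader)
total = sum(r.score for r in results)
```

Keeping the per-question results separate also makes it easier to show students which part of the assignment each comment refers to.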
Current Status and Recommendations
Based on our experiences, here are our current recommendations for using AI in grading:
- Human-Supervised Use: LLM-based grading works best as a tool for teaching assistants, who should review and adjust the AI-generated grades and feedback before releasing them to students.
- Caution in High-Stakes Scenarios: We advise against using AI for high-stakes grading, such as final exams, until we achieve greater robustness across all submissions.
- Ideal for Low-Stakes Assignments: LLM-based feedback is well-suited for low-stakes assignments and practice questions, where even imperfect feedback improves on the status quo.
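One way to picture the human-supervised workflow is to treat every LLM grade as a draft that stays hidden until a TA signs off on it. The sketch below is a hypothetical illustration of that idea; the `DraftGrade` class, its fields, and the sample records are assumptions, not part of Docent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftGrade:
    student_id: str
    llm_score: float
    llm_feedback: str
    final_score: Optional[float] = None  # set only once a TA has reviewed
    approved: bool = False

    def approve(self, adjusted_score: Optional[float] = None) -> None:
        """Record the TA's decision; keep the LLM score unless adjusted."""
        self.final_score = self.llm_score if adjusted_score is None else adjusted_score
        self.approved = True


def releasable(drafts: List[DraftGrade]) -> List[DraftGrade]:
    """Return only the grades a TA has signed off on."""
    return [d for d in drafts if d.approved]


# Made-up records for illustration only.
drafts = [
    DraftGrade("s1", 92.0, "Clear query; consider adding an index."),
    DraftGrade("s2", 55.0, "JOIN condition is missing."),
    DraftGrade("s3", 78.0, "Correct result, but the subquery is unnecessary."),
]
drafts[0].approve()                     # TA accepts the LLM grade as-is
drafts[1].approve(adjusted_score=65.0)  # TA overrides the score
ready = releasable(drafts)              # s3 stays unreleased until reviewed
```

Storing the LLM score and the final score separately also leaves a record of how often TAs override the model, which is useful when deciding whether a course is ready for less supervision.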
Try Docent
To facilitate experimentation with AI-assisted grading, we've deployed an alpha version of Docent at https://get-docent.com/. If you're interested in using the system, please contact us for support and guidance.