Autograding resources

Most coding problems can be autograded. Problems that involve writing fruitful functions can be graded more easily than others. These types of problems can be graded by writing test functions that compare output values. Instructors should make sure that all edge cases are captured when creating test cases. Problems that require writing void functions are harder to autograde and may involve checking stdout, depending on the nature of the problem.

Here, we provide some best-practices and tips for writing autograder tests. If you have additional wisdom to add, please do open a PR (or even just an issue) on the nbgrader repository!

Tips for writing good test cases

Test each function/feature in isolation. If a problem contains many functions or parts, write cases that test each of these functions individually. Testing one function at a time makes it easier for you to track an error.

Organize test cases consistently. It can be helpful to arrange and group your test cases with comments.

Try to cover all edge cases. If you have a function that can take in a certain range of inputs, test the boundaries of this range. Test cases should also check for different lengths, different cases of strings, integers and floats, or different ranges when applicable.


Problem: Write a function isAnagram() that takes 2 strings, and returns True if the two given strings are anagrams of each other. Your function should ignore cases, spaces, and all punctuation. So your function should identify “HeLLo!” and “hOlle” as anagrams.

Test cases:

from import assert_equal

# standard True cases
assert_equal(isAnagram('hi', 'hi'), True)
assert_equal(isAnagram('pat', 'tap'), True)
assert_equal(isAnagram('left', 'felt'), True)

# ignore punctuation, spaces, and different cases (upper/lower)
assert_equal(isAnagram('hi', 'hi!'), True)
assert_equal(isAnagram('HI', 'hi'), True)
assert_equal(isAnagram('hi', 'HI'), True)
assert_equal(isAnagram('He llo', '?hello'), True)

# False cases
assert_equal(isAnagram('hi', 'h'), False)
assert_equal(isAnagram('apple', 'aple'), False)
assert_equal(isAnagram('aaaaaa', 'aaaa'), False)

Partially autograding, partially manually grading

When test cases are not enough to determine the correctness of a student’s solution, you can autograde them to make sure that there are no errors in the execution or the solution. You still need to manually look at the solutions to determine whether they are correct or not. This might be helpful if you want students to write a function for a problem using a specific implementation approach.


Problem: Write a function sortList() that takes a list of numbers and returns a list sorted in descending order without using the built-in methods.

Test cases (but will still require instructors to check whether any built-in method is used):

from import assert_equal
assert_equal(sortList([2, 3, 1]), [3, 2, 1])
assert_equal(sortList([3, 2, 1]), [3, 2, 1])
assert_equal(sortList([1, 2, 1, 2, 3, 1]), [3, 2, 2, 1, 1, 1])
assert_equal(sortList([-1, 0, 1]), [1, 0, -1])

Checking whether a specific function has been used

Sometimes, you may want to ensure that students are implementing their code in the way they have been asked to. For example, if you have your students write a function called mse` (to compute mean-squared-error) in the first part of the problem, and then want them to plot the MSE, you may ask them to be sure that they use the mse function in their code. How can you test for this?

In Python, you can be a bit clever and test for this by removing the mse function from the global namespace, running their code, and checking whether their code throws the appropriate error. If it doesn’t throw an error, that means they aren’t calling the mse function.

A test case that does this might look something like this:

# save a reference to the original function, then delete it from the
# global namespace
old_mse = mse
del mse

# try running the students' code

# if an NameError is thrown, that means their function calls mse
except NameError:

# if no error is thrown, that means their function does not call mse
    raise AssertionError("plot_mse does not call mse")

# restore the original function
    mse = old_mse
    del old_mse

Checking how functions were called or not called with mocking

In unit testing it is desirable to test your own code but not necessarily to include calls your code makes to external libraries. In particular calls that involve file systems and databases are typically replaced with so called mock objects that can record what arguments were passed in the call but which do not otherwise do any work.

We can use this in autograding in the following situations with the patch function from the mock library (since Python 3 a subpackage of unittest). The argument to patch is a string which represents where a certain function is used (rather than where it is defined). In the context of jupyter notebooks this argument will be ‘__main__.function_name’.


Problem: verify that a function call results in the printing of a certain result

def foo()
    print('right result')

The test code for this can be written as

from unittest.mock import patch
with patch('__main__.print') as mock_print:

This test passes silently if the print is correct, but if the print is wrong:

def foo()
    print('wrong result')

an assertion error is raised with output of the form

AssertionError                            Traceback (most recent call last)

AssertionError: Expected call: print('right result')
Actual call: print('wrong result')


Problem: verify that the students have implemented a min function without using the built-in function. That is, you expect the student to write a solution like

def mymin(x, y):
    if x < y:
        return x
        return y

However, if the student has implemented

def mymin(x, y):
    return min(x, y)

the following test will capture this with an error

with patch('__main__.min') as mock_min:
    mymin(1, 2)


AssertionError                            Traceback (most recent call last)
AssertionError: Expected 'min' to not have been called. Called 1 times.

Grading plots

Programmatically grading plots is a painful experience because there are many ways that students can create the requested plot. In general, we recommend just grading plots by hand. However, it is possible to programmatically grade some simple types of plots (such as a scatter plot or bar plot). One such tool that facilitates grading matplotlib plot specifically is plotchecker.


In many ways, writing autograder tests is similar to writing unit tests. However, there are certain types of errors that students may make—especially if they are new to programming—that are not things you would typically test for when writing tests for your own code. Here are a list of some things we’ve come across that are good to be aware of when writing your own autograder tests.

For loops

For loops in Python are sometimes confusing to students who are new to programming, especially if they are sometimes using them in the context of indexing into lists/arrays and sometimes not. For example, I have seen students sometimes write a for loop like this:

for i in x:
    y[i] = f(x[i])

rather than:

for i in range(len(x)):
    y[i] = f(x[i])

In particular, if x in the above example contains integers, the code may not throw an error, and in certain cases, may even pass the tests if you are not looking out for this type of mistake!

Global variables in the notebook

Although the Jupyter notebook is a really excellent format for assignments, it does have some drawbacks. One drawback is the fact that variables defined earlier in the document—for example, in a piece of sample code—can be accessed later. This can pose problems if students accidentally use those variable names rather than the variable names given in the function definition.

As a toy example, let’s say that earlier in the notebook we have defined a variable called arr. Then, students are asked to write a function that multiplies all the variables in the array by two. You give them a function signature of f(x), and they write the following code:

def f(x):
    return arr * 2

Notice that their code uses arr rather than x. This can be a problem especially if you only test on one input—namely, arr—because that test case will pass! Thus, it is important to test students’ code on a variety of inputs in order to catch edge cases such as these, and that you use different variable names:

# both of these tests will pass no because the student hardcoded the use
# of the arr variable!
arr = np.array([1, 2, 3])
np.assert_array_equal(f(arr), np.array([2, 4, 6]))
arr = np.array([5, 7])
np.assert_array_equal(f(arr), np.array([10, 14]))

# this test will fail because it uses a new input AND a new variable name
arr2 = np.array([3, 2])
np.assert_array_equal(f(arr2), np.array([6, 4]))

One solution is to quietly reset the notebook environment using [%reset -f]( magic function. Note that this wipes out all variables, function definitions, and loaded libraries, but should leave downloaded libraries in local environment. This is useful if the test functions in a single notebook are completely independent of one another.

Alternatively, if you suspect a particular variable name be re-used in between tests, consider deleting these variables before running the tests.