


Avoiding the pitfalls of Inter-rater reliability testing

26 November 2013 - BY Ben Meehan IN coding, comparison, intercoder, inter-rater, NVivo, queries, reliability

You want to make sure your team has a consistent approach to coding but maybe you're grappling with the practical application of inter-rater reliability testing in NVivo.

This article aims to help you identify the common pitfalls before you run your tests. It assumes that you understand the concept of inter-rater reliability testing but are struggling to interpret and report on the results.

Force coders to identify themselves

Some people don't realise that you can have user accounts in NVivo.

Before conducting the test, it is essential that you set up a user account for each coder and change a default setting so that NVivo prompts the user to log in every time it is launched. This must be done on each computer you intend to use during the test, as coders may be remote.

To do this, go to File->Options and select ‘Prompt for user on launch’.

This will force coders to identify themselves to NVivo before they begin coding. Adding new users is as easy as writing in a new user name and initials. If you don’t set up user names, NVivo will by default take the user name and initials of the logged-in Windows user. You can find out more in the help topic Manage Users in a Standalone Project.

You are now ready to allow your coders to commence coding the same transcript.

Set up a coding comparison query

In this example, we'll use the standard sample project preloaded with NVivo so all readers automatically have a copy of this query to play with.

Go to Queries and right-click on the query called “Coding comparison of Wanda to Effie and Henry for Thomas interview”. Select Query Properties and click on the ‘Coding Comparison’ tab in the dialog box. It's easy to set up the query (see Run a Coding Comparison query).

Keep the query simple to start with so you can understand the results more easily. Perhaps just two coders, one transcript and a reasonably small number of nodes.

Now you're ready to run the test and analyse the results.

Understanding and reporting the results

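The query results report displays both ‘Percentage Agreement’ and ‘Kappa Coefficient’. This can be confusing because the two measures do not use the same logic: percentage agreement is the raw share of the source on which the coders agree, while the Kappa coefficient also takes into account the agreement you would expect by chance. You do have the option in the dialog box to choose one and/or the other.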

The query result looks like this:

Kappa scores run up to one for complete agreement, with a score of zero or below indicating agreement no better than chance, while the percentage report shows some more detail.
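To see why the two figures can differ so much, here is a minimal sketch of both calculations for two coders and one node. It is not NVivo's own code: the function name, the toy data, and the choice to treat each position in the list as one unit of the source (a character, say) are all assumptions made for illustration.

```python
def agreement_and_kappa(coder_a, coder_b):
    """Return (percentage agreement, Cohen's kappa) for one node.

    coder_a and coder_b are equal-length lists of booleans: True where
    that coder coded the unit at the node, False where they did not.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must cover the same non-empty source")

    n = len(coder_a)
    both = sum(1 for a, b in zip(coder_a, coder_b) if a and b)
    neither = sum(1 for a, b in zip(coder_a, coder_b) if not a and not b)

    # Observed agreement: units both coded plus units neither coded.
    observed = (both + neither) / n

    # Agreement expected by chance, from each coder's overall coding rate.
    p_a = sum(coder_a) / n
    p_b = sum(coder_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1.0:
        # Both coded everything or nothing. Kappa is undefined here;
        # this sketch reports it as complete agreement (an assumption).
        kappa = 1.0
    else:
        kappa = (observed - expected) / (1 - expected)

    return observed * 100, kappa


# Hypothetical 10-unit source: A codes units 0-4, B codes units 2-6.
coder_a = [True] * 5 + [False] * 5
coder_b = [False] * 2 + [True] * 5 + [False] * 3

pct, kappa = agreement_and_kappa(coder_a, coder_b)
print(f"{pct:.1f}% agreement, kappa = {kappa:.2f}")
# prints "60.0% agreement, kappa = 0.20"
```

The coders agree on 60% of the units, but because half of that would be expected by chance the Kappa score is only 0.20. If the two coders never overlap at all, the same formula returns a negative Kappa.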

For me, the best part of this query result is the fact that you can drill down on any line to get a more visual representation of the specifics behind levels of agreement/disagreement.

If you double click on the first line of the query result this is what you will see:

The illustration above shows where the coders agreed/disagreed in relation to the source content coded. However, by clicking on an individual coder’s stripe, I can see exactly what was coded by each coder as NVivo will highlight the exact text coded by that person.

In addition, if I switch on the coding sub-stripes to compare what nodes they coded this content to, I will get a more holistic view of agreement levels. You can do this by right-clicking on the coder’s stripe and selecting ‘Show Sub-Stripes->More Sub-Stripes’. Using this option, I can filter the coding stripes to see all or some of the nodes each coder coded that segment to.

Export to Excel

Export your query results to Excel and insert an average cell formula to see how agreement levels compare across Thomas’s entire transcript. This exercise gives us an average Kappa score of 0.55 across the entire transcript or 0.69 if the one area of total disagreement is excluded.
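If you would rather script the averaging than do it in Excel, the same calculation might look like the sketch below. The node names and Kappa values are hypothetical and are not taken from the sample project, and treating a Kappa of zero or below as "total disagreement" is an assumption of this sketch.

```python
from statistics import mean


def average_kappa(kappa_by_node, exclude_total_disagreement=False):
    """Average the Kappa scores from an exported query result.

    kappa_by_node maps a node name to its Kappa score. When
    exclude_total_disagreement is True, rows with a Kappa of zero or
    below are left out of the average.
    """
    scores = list(kappa_by_node.values())
    if exclude_total_disagreement:
        scores = [k for k in scores if k > 0]
    if not scores:
        raise ValueError("no Kappa scores left to average")
    return mean(scores)


# Hypothetical rows, one per node, as they might be typed in from the export.
kappa_by_node = {
    "Node A": 0.82,
    "Node B": 0.64,
    "Node C": 0.0,   # an area of total disagreement
    "Node D": 0.71,
}

print(f"All nodes: {average_kappa(kappa_by_node):.2f}")
print(f"Excluding total disagreement: {average_kappa(kappa_by_node, True):.2f}")
```

Reporting both figures, as in the example above, makes it clear how much a single area of disagreement pulls the overall average down.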

Essential “do’s and don’ts”

Learning from your mistakes may not be a great idea for this particular query: your coders may become less cooperative if you have to ask them to repeat the coding because the query was not set up correctly. So I have put together some common errors and omissions that we have encountered in our training and support work.


Do:

  • Change the default setting to force users to log in.
  • Set up a user account for each coder.
  • Conduct the testing as early as possible in the life of the project so as to align thinking, especially if your coders are going to work remotely.
  • Copy the project file to participating coders when using stand-alone projects.
  • Keep the test in manageable proportions.
  • Import project files from remote coders (take backups first in case anything goes wrong).
  • Check for duplicate users after import and merge as necessary (this can happen if coders are inconsistent in how they log in: I might log in as ‘BM’ today and ‘B.M.’ tomorrow).
  • Discuss results and retest if necessary.


Don't:

  • Ask a coder to participate if their version of NVivo is later than yours. They will be able to open your file but, as NVivo is not backward compatible, you will not be able to import their work without upgrading the software.
  • Ask a coder to participate if your own coding processes are quite advanced. You will have changed your thinking several times through coding and re-coding the data, so you will likely get a poor result. Instead, ask the coder to code against your initial codes.
  • Run a coding comparison query across many transcripts, coders and nodes at once. This produces a very big report which might be difficult to make sense of. Breaking the query up to compare subsets will make interpreting and reporting the results much easier.
  • Edit any sources during the coding test, as this will cause duplication of sources on import and the test will not work. Should this happen, you will have to repeat the whole process.

Watch a video to see how it's done:

 Conducting Inter-rater reliability testing using the Coding Comparison query

It would be great to hear about your experiences with inter-rater reliability testing. What works for you?