Data is being collected and created at the fastest rate in human history, and by far the vast majority of it is in digital format. Allied with this, what was previously "offline" information can now be digitised quickly and cheaply (e.g. old manuscripts, maps, etc.). This vast collection of existing and new information creates new opportunities, but also new difficulties. For much of this information to be useful it must be categorised and annotated in some way, both so that sense can be made of the data and so that the correct data can be accessed more easily. It is possible to complete this categorisation by hand with human annotators, but this effort can be expensive in terms of time, money and resources. This is especially true for large data sets, or for data sets that require niche expertise to annotate. With this expense in mind, many have turned to machine learning to annotate data; however, machine learning approaches still require human intervention, both to create training sets for algorithms and to judge the output of those algorithms. Thus it is inevitable that humans are involved at some stage of the categorisation and annotation process. In this project we aim to gain a better understanding of this annotation process, so that we can provide guidelines, approaches and processes for producing the most cost-effective and accurate annotations for data sets.
We propose to work with the three main types of unstructured data encountered in big data: text, images, and video. The first challenge is to better understand the process assessors go through when annotating and judging different types of material. This will be investigated using a mixture of qualitative and quantitative techniques in smaller-scale, lab-based studies. By better understanding the process by which individuals annotate and classify material, we hope to provide insights that can be used to make the annotation process more efficient, and to identify an initial set of factors which affect annotation performance, such as degree of domain expertise and time available. Based on this initial work, the aim is then to investigate which of these factors most affect assessment, using large-scale crowdsourcing-style methods. The final challenge is related to the classification task: how should annotation be approached so that the resulting labels give the best results when used in machine learning? Building on this, the project aims to create a set of guidelines for the creation of annotation and relevance sets.
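To make concrete what "annotation performance" and label quality can mean in practice, the following minimal sketch shows two standard measures the project would be concerned with: aggregating the labels of several annotators by majority vote to form a training set, and measuring pairwise agreement between annotators with Cohen's kappa. The annotator names, labels and data are invented purely for illustration and are not drawn from the project itself.

    from collections import Counter

    def majority_vote(labels):
        """Aggregate one item's labels from several annotators by majority vote.
        Ties are broken arbitrarily by Counter.most_common."""
        return Counter(labels).most_common(1)[0][0]

    def cohen_kappa(a, b):
        """Agreement between two annotators' label lists, corrected for chance."""
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        categories = set(a) | set(b)
        expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
        return (observed - expected) / (1 - expected)

    # Hypothetical example: three annotators label five text snippets as
    # relevant ("rel") or non-relevant ("non").
    annotations = {
        "annotator_1": ["rel", "rel", "non", "rel", "non"],
        "annotator_2": ["rel", "non", "non", "rel", "non"],
        "annotator_3": ["rel", "rel", "non", "non", "non"],
    }

    # Per-item gold labels via majority vote, usable as a machine learning training set.
    items = list(zip(*annotations.values()))
    gold = [majority_vote(item) for item in items]
    print("aggregated labels:", gold)

    # Pairwise agreement between the first two annotators.
    print("kappa(1,2):", cohen_kappa(annotations["annotator_1"],
                                     annotations["annotator_2"]))

Factors such as domain expertise and time pressure would be expected to show up directly in such measures, e.g. as lower inter-annotator agreement or noisier aggregated labels, which is one way the lab-based and crowdsourced studies can be compared.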