Welcome to this lecture. It's about sampling for surveillance. My name is Tine Hald. I'm a professor in epidemiology here at the Technical University of Denmark. This is the outline of my lecture. First, I will discuss a little bit what surveillance actually is, then I will go through different methods of sampling, and then we will discuss the terms accuracy and representativeness, and then also precision and sample size and how they are associated. And finally, I will go through sensitivity and specificity of sampling systems. Just briefly, to explain where in the pathway for a metagenome project or metagenome surveillance we are: we are at the very beginning, where we are actually designing our surveillance, deciding how to take samples and then doing the sampling. The rest of the items on this flowchart you will hear about later in the course.
"Surveillance" has been defined by different international organizations. We have "public health surveillance", which has been defined by WHO as "The continuous and systematic collection, analysis and interpretation of public health-related data that is needed for planning, implementation and evaluation of public health practices." Also the World Organisation for Animal Health, OIE, has a similar definition. And as you can see, it's really similar, and what they emphasize is that it's also about timely dissemination of the surveillance results and information to the relevant persons who can take action on them. So, if we compare, for instance, simple monitoring with surveillance, then surveillance is basically monitoring plus taking action on the results. Finally, we also have the definition from the European Food Safety Authority, for when we are doing surveillance of food items. And it's very similar to the one defined by OIE: "It is a systematic ongoing collection, collation and analysis of information related to food safety and then dissemination of this information to the correct person to take action." Ideally, when we're talking about zoonoses and, for instance, also antimicrobial resistance, we would like to have integrated surveillance, which means we want to analyze and interpret the data from all three sectors, humans, animals and food, in combination. That gives us better opportunities to actually see if the control actions that we take also have an effect.
There are of course many objectives of surveillance, depending on what you're actually surveying, but most often we want to estimate the magnitude of an occurrence, which could be prevalence or incidence in a target population. We want to document the distribution of, for instance, an infection across geographical areas and also over time. We want to detect outbreaks and epidemics, so we can try to intervene and stop them. And we want to test hypotheses about the origin of infectious agents. Surveillance can also be used for evaluating the effect of control efforts, to see that the prevention measures we implement are actually working. And finally, we can identify research needs and facilitate future research.
About evaluating the effect of control strategies, I'm just going to show you a graph of Salmonella surveillance in Denmark from 1998 to 2016. In Denmark, we have surveillance of Salmonella in the major food-producing animals, so broilers, pigs and table eggs, and then also in imported food, and we've done that since the end of the 1980s. And we also have a method that can estimate how many human cases originate from the different animal reservoirs.
And this is what the bars are showing: the green bars show the number of estimated cases from chicken, the orange ones from pork, the red ones from table eggs, and so on. The arrows here indicate when control programs were put in place. And as you can see, every time we put a new control program in place in the different animal sectors, we also saw a decline in human cases. So surveillance is also a very good means to actually show the effect of control.
Now I'll move into sampling methods. The first sampling method I will discuss is probability sampling. Probability sampling means that we can actually calculate the probability that each individual sampling unit in the population has of being selected. The simplest one is simple random sampling. It's simple in the way that it's easy to understand and maybe easy to design, but it might not always be very easy to carry out. It is where every individual in the population has an equal chance of being sampled. Another method is stratified random sampling. Here, we divide our target population into different strata, as we call them. For instance, if you want to sample pig herds, then in order to be sure to get both very large pig herds, of which there might be few, and very small pig herds, of which there also might be few, you might want to divide all your pig herds into strata depending on herd size, and then you select herds randomly within each stratum. Another way is called cluster random sampling. This is also where you divide your target population into clusters that make sense for the survey, and then you do random sampling of the clusters and also random sampling within the clusters. This can also help you reduce the number of samples you're taking.
Then there are non-probability sampling methods. The first one is convenience sampling. Here you very much rely on volunteers.
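
As a small illustration of the three probability sampling methods described above, here is a minimal Python sketch. It is only one possible way to write them down: the herd list, the size classes, the regional clusters and all function names are hypothetical and not taken from the lecture.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_random_sample(units, stratum_of, n_per_stratum, rng):
    """Group units into strata, then sample randomly within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_random_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then sample randomly within the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample

# Hypothetical population: (herd id, herd size); few small and few large herds.
herds = ([(f"S{i}", 50) for i in range(5)]
         + [(f"M{i}", 500) for i in range(90)]
         + [(f"L{i}", 5000) for i in range(5)])

def size_class(herd):
    """Assumed strata by herd size; the cut-offs are arbitrary."""
    _, size = herd
    if size < 100:
        return "small"
    if size < 1000:
        return "medium"
    return "large"

# Hypothetical clusters: four regions of 25 herds each.
regions = {"north": herds[0:25], "south": herds[25:50],
           "east": herds[50:75], "west": herds[75:100]}

rng = random.Random(1)  # fixed seed so the sketch is reproducible
srs = simple_random_sample(herds, 9, rng)
strat = stratified_random_sample(herds, size_class, 3, rng)
clus = cluster_random_sample(regions, 2, 3, rng)
print(len(srs), len(strat), len(clus))
```

With only five small and five large herds among 100, a simple random sample of nine can easily miss both groups, whereas the stratified sample contains three herds from each size class by construction.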
Another example of convenience sampling is a TV reporter walking down the street to interview people, who might pick them out more or less as they come by, because they are easy to access. Then there's something called purposive or judgement sampling. Here the sampling relies very much on the researcher's choice, so the idea is that the researcher knows how best to sample. That could be for specific diseases, in particular for rare diseases, where it would be very hard to find people who are relevant for surveillance if you do not have some prior knowledge about the group. It's of course very important to remember that the results you get are only representative for that group.
Now, we'll talk a little bit about accuracy and representativeness. Since randomisation is our gold standard, we will strive to ensure randomisation as much as possible, because randomisation also gives us representativeness. And representativeness is when the surveillance results accurately reflect the trends in the target population. So, if we look at these targets here, we can see that this one has a high accuracy and a high precision. This means that we are on target: our sample actually reflects the true value in the population, and we are also quite certain about it. So this is very good. Often that is not possible, but it is what we aim at in the end. Here, for instance, we have a low accuracy and a high precision. This is probably the worst case, because we are off target. Our sample estimate is way off the true value, and yet we are very certain that it's up here. So we are really misjudging the sample estimates that we get. Here, it's a bit better. We have a low precision, but our accuracy is high because we are within the target: even though we're not too sure about it because of the low precision, the true value lies within the spread of our sample estimates.
And finally, over here we are both off target and not very certain. This is of course also a kind of worst-case scenario. Precision means how certain we are. So the closer together the bullets are here, the more certain we are. And precision is very much controlled by sample size, which I will talk a little bit about now.
I'm often asked by people who are designing a study, "What is the appropriate sample size?" And they really just want a number, like 47. But of course you cannot just give them a number without considering a lot of things. You really need to consider the surveillance objectives: do we only want to detect a disease, for instance, or do we also want to follow trends? And what is the expected occurrence in the population, for instance the expected prevalence? If it's very low and we want to be sure to detect it, obviously we need to take more samples than if the prevalence is high. It's also important how the occurrence is measured: whether we are measuring cases, counts, gene counts and so on will also influence how many samples we need to take. Variability between and within epidemiological units should also be considered. The more variation we have between individuals or sampling units, the more samples we need to take in order to show that there is actually a difference. And the same goes for the genetic variation between the microorganisms that we want to survey. So sometimes, if we are on fairly new ground, as we often are with metagenomic projects, we really need pilot studies to investigate the variation before we settle on a sample size.
This is just showing the statistics behind sample sizes. Each line shows, for a different true prevalence, how many samples are needed to get a certain probability, over here, of actually detecting a disease if it is present. So for instance, the red one here is a true prevalence of five percent.
How many samples do we need to detect a disease occurring in that population with 95 percent probability? We will need around 60 samples. And if we take the one with a prevalence of one percent, the purple one, and we have 95 percent and go down here, we will need around 300 samples. So, as you can see, the required sample size depends very much on the prevalence, and if you have a very low prevalence you will need much more than a thousand samples to get a high probability of detection.
Also, sometimes you want to show whether there's a difference in prevalence between two populations. So here we have two populations, one where the true value is 0.2 and the other where it is 0.4. But if we only take ten samples, then given our margin of error, our uncertainty about the true value, we really cannot distinguish these two populations. So we cannot say that they are different. On the other hand, if we take 100 samples, then suddenly we get much more precise, our margin of error gets much smaller, and we can clearly see that there is a difference between the two populations.
In microbiological studies and also in animal studies, it can sometimes be an advantage to pool samples, because it saves cost and time. By pooling we mean that if we, for instance, want to say something about a farm, we take samples from the pigs, for instance random samples from different pigs, and then we pool all these samples into a single sample, or maybe a few samples. Either way we pool more samples into fewer samples, and then we analyze the fewer samples because it's more time efficient and cost efficient. However, the disadvantage of this can be that we reduce the methodological sensitivity.
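
Before going further into pooling, the sample-size figures quoted above can be checked with a short calculation. This is a minimal sketch under stated assumptions, not necessarily the method behind the slides: it assumes random sampling from a large population, a test that never misses a positive sample, and a simple normal-approximation (Wald) margin of error. The function names are hypothetical.

```python
import math

def samples_to_detect(prevalence, confidence=0.95):
    """Smallest n such that P(at least one positive) = 1 - (1 - p)^n >= confidence.

    Assumes a large population and a perfect test.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - prevalence))

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a prevalence p estimated from n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

def intervals_overlap(p1, p2, n, z=1.96):
    """True if the approximate intervals around p1 and p2 overlap at sample size n."""
    m1, m2 = margin_of_error(p1, n, z), margin_of_error(p2, n, z)
    return (p1 - m1) <= (p2 + m2) and (p2 - m2) <= (p1 + m1)

# Detection with 95% probability at different true prevalences.
for prevalence in (0.05, 0.01, 0.001):
    print(prevalence, samples_to_detect(prevalence))
# 5% prevalence needs 59 samples and 1% needs 299, i.e. "around 60" and "around 300".

# Two populations with true prevalences 0.2 and 0.4.
for n in (10, 100):
    print(n, intervals_overlap(0.2, 0.4, n))
```

With ten samples the two intervals overlap, so the populations cannot be told apart; with 100 samples they no longer overlap. Comparing intervals in this way is only a rough visual check and not a formal statistical test, and the Wald approximation is crude at a sample size of ten.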
The reason pooling can reduce sensitivity is that if, say, only this one pig has the infection that we're looking for, then by pooling all the samples we are actually diluting them, and our chance of detecting that infection might be lowered. However, if we want to say something at the farm level or at a higher epidemiological unit level, then it makes sense to pool, if the loss of sensitivity is not too big. That is particularly so if we are only interested in detection, that is, in whether the infection is present in the farm or not, and not in how many pigs in the farm are infected.
Very briefly, just to illustrate the effect of pooling: this graph shows the expected prevalence when we have 60 samples and the number of positive pools that we find. The dark red one corresponds to analyzing 60 individual samples, so the pool size is just one. And then we have pool sizes of three, five, twelve and thirty. Basically, what this shows is quite logical: as you increase your pool size, the more samples you pool together, the more uncertain you get about the results. So a pool size of three is obviously more similar to the original with one than a pool size of 30 is. If you take 60 samples and have only analyzed two pools, then you are very uncertain. So this is of course something to consider if you're deciding to do a pooled sampling study.
Finally, about sensitivity and specificity of surveillance. Sensitive surveillance is characterized by the ability to detect, in a timely manner, changes over time and space, but also changes in the genetic patterns of the microbiological population that you're looking at. Over here, we see the true status of the population, whether the infection is present or absent, and these are the surveillance results, so whether the units under surveillance (let's say again it's pig herds) are positive or negative.
If the surveillance correctly identifies the true positives, and does that the vast majority of the time, it has a high sensitivity, because sensitivity is defined as the true positives divided by the true positives plus the false negatives. On the other hand, specificity is defined as the true negatives divided by the false positives plus the true negatives. So it says something about how good the surveillance system is at not flagging those that are really negative. Sensitivity and specificity, particularly sensitivity, depend on occurrence and sample size. Again, if we want to detect something with a low prevalence and we want to find many true positives, then we also need to increase the sample size. But lab methods and also the bioinformatic approaches that you will hear more about are important, because lab methods may produce false negatives and false positives. In bioinformatics, we rely on mapping to reference genes, and there can also be false positive and false negative results, depending on how much of the genome we require to match the reference gene in order to conclude that, for instance, the microorganism is present in the sample or not. But you will hear much more about that later in the course. This is just to introduce the terms sensitivity and specificity.
These are just the references that I've used, also the references for images and figures. And then I just want to say thank you for listening in.
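
The definitions of sensitivity and specificity given above can be written out as a short calculation. This is only a sketch: the herd counts below are invented for illustration and do not come from the lecture, and the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """True positives divided by (true positives + false negatives)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negatives divided by (true negatives + false positives)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical surveillance of 1,000 pig herds:
# 100 herds are truly infected, and surveillance finds 90 of them;
# 900 herds are truly free, and surveillance wrongly flags 18 of them.
true_pos, false_neg = 90, 10
true_neg, false_pos = 882, 18

se = sensitivity(true_pos, false_neg)
sp = specificity(true_neg, false_pos)
print(f"sensitivity = {se:.2f}, specificity = {sp:.2f}")
# prints "sensitivity = 0.90, specificity = 0.98"
```

In this made-up example the surveillance misses one in ten infected herds and wrongly flags two percent of the free herds.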