Hello everyone. Welcome to Big Data and Language. So far, we've learned how to use two different corpora: the first is the BNC, the British National Corpus, and the second is COCA, a corpus of American English. But now maybe you want to use your own data, and then you need to know about tools you can run on your own data. So this time, I will introduce two different tools that you can use for free.

So again, let's go to Google and type "AntConc". Click, and you will see Laurence Anthony's AntConc. He is a professor at Waseda University, and he has developed several programs; one of them is AntConc. Depending on your operating system, feel free to choose the right version. I'll click on Windows. The advantages of using AntConc: number one, you don't need to log in, and number two, it's free, so you can use it anytime, anywhere. Just download it and run it, and it looks like this.

Are you with me so far? If you still need to download AntConc, please pause here, and once you've finished downloading, let's continue. If you are ready, I want you to click "File" and "Open Files" and go to the text files you already have. In my case, I prepared three text files from one of the research articles in engineering. Once you open them, you see the corpus files here. Because I selected only three files, these are the three filenames. Are you with me so far?

So now, what you might want to do is check the word list: what are the most frequently used words? Click "Word List" here, click "Start", and within a second you will see the result. The first one is "the", and then "of, and, to, is". These are the frequencies: the word "the", for example, is used 439 times in these three texts, and the preposition "of" is used 271 times in these three files.

Here you can see the answer to the first task. The question was: how many types and tokens are there, and why are the number of types and the number of tokens different? Go back to AntConc and you'll see it here: word types, 1,748, and word tokens, 7,093. So the number of tokens is about four times the number of word types.

Let me explain tokens and types. The term "tokens" refers to the total number of words in a text or corpus, regardless of how often they are repeated. The term "types" refers to the number of distinct words in a text or corpus; this is the definition given by Chris Turner at Coventry University. So there are 7,093 words in total and 1,748 distinct words in these three files, which means some words are used multiple times. For example, "the" alone is used 439 times, and that is why the number of tokens is much bigger than the number of word types. Are you with me so far?

Now, let's move to the second question: what are the five most frequent words? As I mentioned, they are "the, of, and, to, is", and these are their frequencies. That's very clear. Next: what is the frequency of "the"? I already mentioned it here: 439.

Now let's change the sorting. You can see the list is sorted by frequency, but you can change it to sort by word instead. Select "Sort by Word", click "Start", and you'll notice the list is now in alphabetical order.
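If you want to reproduce this word list outside AntConc, here is a minimal Python sketch. The filenames are placeholders for whatever files you load, and the tokenizer is a rough approximation, so its counts will not match AntConc's exactly.

```python
from collections import Counter
import re

# Placeholder filenames -- substitute the files you loaded into AntConc.
files = ["article1.txt", "article2.txt", "article3.txt"]

tokens = []
for name in files:
    with open(name, encoding="utf-8") as f:
        # Rough word tokenizer: lowercase, then take runs of letters.
        # AntConc's own token definition differs, so counts will vary.
        tokens += re.findall(r"[a-z]+", f.read().lower())

freq = Counter(tokens)
print("tokens:", len(tokens))   # total running words (e.g. 7,093)
print("types: ", len(freq))     # distinct words (e.g. 1,748)
for word, count in freq.most_common(5):
    print(word, count)          # e.g. "the 439", "of 271", ...
```

Sorting the same `freq` dictionary by key instead of by count gives you the alphabetical view, just like switching from "Sort by Frequency" to "Sort by Word".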
So the frequencies are different: "all" is used 127 times, while "ABD" appears only three times. What about "Sort by Word End"? Click "Start", and you see that you can sort by the last letter of the word. So there are many functions; depending on your research question or your research purposes, you can choose a different sorting option.

Let me go on to the next question: what nouns are frequently used after "fuzzy"? What you need to do is go to "Collocates" and type "fuzzy" here, and the important thing is to list only five nouns. Click "Start", and there are many words here, but you want to select only the nouns. So the answer is "decision, rows, logic, toolbox, and input."

How many files include "fuzzy rules"? Let me type "rules" and search. You see all the hits for "rules" in the concordance, but for "fuzzy rules" there is only one file; my filenames are pretty long, but you can see that it is just one file. Within that one file, the collocation "fuzzy rules" appears. So, one file, and the filename is number 3.

Now, let's move on: what are the collocates of "problems"? This time, you need to set the window span to 1L, which means you are looking for the collocate right before the word "problems". So go to "Collocates", type "problems" here, and first sort by frequency. Set "Sort by Frequency", click "Start", and you see the result: the first five are "optimization, these, of, the, such." These are the five collocates of "problems". Good.

Now let's change to "Sort by Stat". Click "Start" again, and you'll see "nonlinear, dimensional, network, expensive, and squares." So what does this function show, and what is the meaning of the stat minus 0.79325 for the collocate "the"? If you look at the collocate "the", you see it at number 19, the word "the", with a stat of minus 0.79325. In the stat column, this one is the highest value and this one is the lowest. Let me explain very briefly; I will say more about this function later. The stat is the value of a statistical measure between the search term and the collocate. It measures how related the search term and the collocate are; in other words, how strongly do the word "problems" and the word "the" co-occur? AntConc supports two statistical measures: one is MI, mutual information, and the other is the T-score. The default statistical measure is MI.

You might be curious about the MI equation, so I will upload it and you will see it here. The MI of two discrete random variables X and Y is defined as a sum over x and y, where p(x, y) is the joint probability function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y. Since MI is designed to measure the dependency of two random variables, it makes no sense to measure the dependency of two words with MI directly. Therefore, AntConc uses a variant of MI, called the I-value, to calculate the relation between the node n and the collocate c. In the I equation, N is the total number of tokens, f(n, c) is the number of co-occurrences of n and c, and f(n) and f(c) are the numbers of occurrences of n and c. If I evaluates to zero, n and c are independent; that is, their occurrences do not influence each other. If I evaluates to a positive value, then n and c are positively correlated: the greater the value of I, the more likely n and c are to come together, and vice versa.
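Since the slide itself is not reproduced in this transcript, here is a reconstruction of the two equations from the definitions just given, following the standard forms used in corpus linguistics; the base-2 logarithm in the I-value is an assumption about AntConc's implementation.

$$
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
$$

$$
I(n,c) \;=\; \log_2\frac{N \cdot f(n,c)}{f(n)\,f(c)}
$$

Reading the second equation: the numerator is the observed co-occurrence count scaled by corpus size, and the denominator is what independence would predict, so I is zero when the two words are independent, positive when they attract each other, and negative when they repel each other.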
So the I-value of "problems" and the word "the", computed by this equation, comes out to minus 0.79325. This means the co-occurrence of "the" and "problems" is negatively correlated in this corpus: the two words appear together less often than chance would predict. So based on this value, you can tell whether two words are correlated, that is, whether they tend to occur together or not.

The last question was a little bit challenging, but it was a good chance for you to understand the statistics behind "Sort by Stat" in AntConc. Thanks for your attention. If you are still not familiar with AntConc, feel free to play with it some more; you will soon get comfortable with this very useful tool. Thank you for your attention.
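To make the calculation concrete, here is a minimal Python sketch of the I-value, not part of AntConc itself. The counts below are illustrative placeholders, not the actual counts from the three files, so the result will not be exactly minus 0.79325.

```python
import math

def i_value(f_nc: int, f_n: int, f_c: int, n_tokens: int) -> float:
    """MI-style I-value: log2 of observed vs. expected co-occurrence."""
    expected = f_n * f_c / n_tokens       # co-occurrences if independent
    return math.log2(f_nc / expected)

# Illustrative counts only -- NOT the real counts from the three files:
# suppose "problems" occurs 40 times, "the" occurs 439 times, and "the"
# appears directly before "problems" twice, in a 7,093-token corpus.
score = i_value(f_nc=2, f_n=40, f_c=439, n_tokens=7093)
print(f"I-value: {score:.5f}")  # negative: they co-occur less than chance
```

Because "the" occurs so often on its own, its expected co-occurrence with "problems" is high, and the observed count falls short of it; that is exactly why such a frequent word ends up with a negative stat when you sort by stat.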