Apart from science and engineering problems, Python can also be applied to problems in the humanities and social sciences. Let's take NLTK as an example. It's a Python toolkit for natural language processing. Let's see how to use functions in Python and NLTK to gather and analyze data from a corpus. Once you have learned them, you can go on to use NLTK or similar toolkits to analyze and make predictions about other language materials, especially big data containing a great deal of important information. Let's take Project Gutenberg and the US presidents' inaugural speech corpus as examples, and look at Project Gutenberg first.

Here, as we mentioned before, we use the function "fileids()" to list the books of Project Gutenberg that have been included in NLTK, like these. Next, we can do some simple calculations. The "words()" function here puts all the tokens of the book Hamlet, punctuation marks included, into the variable "allwords", and then we can count the total number of words and punctuation marks. Well, how many of these tokens are not repeated? We can first turn them into a set and then use the "len()" function to get that answer. Next, let's do some other statistics. For example, how many times does the string "Hamlet" appear in this play? We can use the "count()" function; the result is 99 times. Besides, we can do some slightly more complex statistics, say, finding all the words whose length is over 12. Let's do it with a list comprehension and then sort these words. The output result is like this.

Look at another example. What does it do? Let's see. We first use the "FreqDist()" function, which creates a frequency distribution of the given data. What are the data here? They are all the words, converted to lower case. After obtaining such an object, we use its dedicated methods "B()" and "N()" to count the number of distinct words and the total number of words, respectively. Then we use the "tabulate()" method to display the first 20 items as a table. The effect is like this. As we see, the highest-frequency word appears 993 times: it's the word "the". By now you may already see what this program does: it takes the book Hamlet, finds the top 20 high-frequency words regardless of case, and counts how many times each appears. We might as well use the "plot()" method to plot this result. By the way, when using the "plot()" method we often pass the argument "cumulative", i.e. to accumulate the data, although in this particular case that argument doesn't add much.

Having explored some secrets in stories, let's look at the inaugural speeches of the US presidents. What kind of secrets can be found in this corpus? We know that US presidents like to mention freedom and democracy in their inaugural speeches. Let's check the frequency of the word "freedom" in the inaugural speech texts: gather statistics over all the inaugural texts, and then calculate the frequency with the "freq()" method. As we see, the result is approximately one in a thousand. Then let's look at the word-length habits of US presidents in their inaugural speeches during certain time periods. Here we use another function, "ConditionalFreqDist()", which builds a conditional frequency distribution: when corpus texts are divided into several types, it calculates the frequency distribution of each type separately. Sketches of all of these steps are given in the code below.
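A minimal sketch of the Project Gutenberg steps described above might look like this (assuming the corpus data has been downloaded and that the file id of Hamlet in NLTK is "shakespeare-hamlet.txt"):

```python
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')          # fetch the corpus on first use

# List the books of Project Gutenberg that ship with NLTK.
print(gutenberg.fileids())

# All tokens of Hamlet, punctuation marks included.
allwords = gutenberg.words('shakespeare-hamlet.txt')
print(len(allwords))                # total number of tokens
print(len(set(allwords)))           # number of distinct tokens

# How many times does the string "Hamlet" appear?
print(allwords.count('Hamlet'))

# All words longer than 12 characters, via a list comprehension, sorted.
print(sorted(set(w for w in allwords if len(w) > 12)))
```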
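The frequency-distribution example could be sketched as follows; FreqDist and its B(), N(), tabulate() and plot() methods are the calls mentioned above, and plot() additionally requires matplotlib to be installed:

```python
from nltk import FreqDist
from nltk.corpus import gutenberg

allwords = gutenberg.words('shakespeare-hamlet.txt')

# Frequency distribution over the lower-cased tokens of Hamlet.
fdist = FreqDist(w.lower() for w in allwords)

print(fdist.B())    # number of distinct tokens
print(fdist.N())    # total number of tokens

# The 20 highest-frequency tokens, first as a table, then as a plot.
fdist.tabulate(20)
fdist.plot(20)      # pass cumulative=True for a cumulative curve
```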
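For the inaugural speeches, a sketch of the two statistics above, the frequency of "freedom" and a conditional frequency distribution over word lengths, might look like this (the file id "2009-Obama.txt" is assumed for Obama's 2009 address):

```python
import nltk
from nltk import FreqDist, ConditionalFreqDist
from nltk.corpus import inaugural

nltk.download('inaugural')          # fetch the corpus on first use

# Frequency of "freedom" over all inaugural addresses combined.
fdist = FreqDist(w.lower() for w in inaugural.words())
print(fdist.freq('freedom'))

# Conditional frequency distribution: the condition is the speech,
# the counted event is the length of each word in that speech.
cfd = ConditionalFreqDist(
    (fileid, len(w))
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
)

# Word-length distribution of a single speech.
cfd['2009-Obama.txt'].tabulate()
```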
It can be used to research systematic differences among the types. Here we gather statistics for each of these texts based on what? Based on word length. Let's check the result. As we see, in the inaugural speech made by Barack Obama in 2009 there are 599 words whose length is 3, the most favored word length in his speech. So far, have you noticed any problem? As you may guess, a word such as "the" accounts for a large share of these occurrences. However, as we know, such a word has no significant meaning for our statistics. Would it be better to remove words like that in advance? For this kind of statistics, it really would. That is the concept of the "stop word": we first remove such stop words, i.e. words of no significant meaning, from the raw text. NLTK provides a dedicated stop word corpus for download; you may have a look, as it's available in multiple languages. After removing those stop words, the statistics will be more meaningful. That's the stop word list provided in NLTK; this is the English version. As we see, it contains many function words, such as "me", "are" and "my", quite common words without much practical meaning. It's worth noticing that the stop word list is not universal; it should be chosen according to the research topic and field. For research into personal pronouns, for example, words like "we" and "your" cannot be regarded as stop words. A sketch of counting with stop words removed appears at the end of this section.

Those are the things we have explored from the NLTK corpora, and you may explore more of whatever interests you. NLTK is a widely used natural language processing toolkit. It's easy to get started with and has abundant resources and functions, so I strongly recommend you use it. Of course, learners majoring in the humanities may look for other corpora related to their fields, or scrape data themselves for analysis. Besides, from the introduction so far, have you found that many of these operations are not so difficult? I hope you will be more skilled and fluent in using them after more practice.
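As mentioned above, here is a sketch of repeating the inaugural word counts with NLTK's English stop word list applied (punctuation is filtered out as well, via str.isalpha()):

```python
import nltk
from nltk import FreqDist
from nltk.corpus import inaugural, stopwords

nltk.download('stopwords')          # fetch the stop word lists on first use

# NLTK's English stop word list: function words such as "me", "are", "my".
stop = set(stopwords.words('english'))
print(sorted(stop)[:20])

# Count the inaugural words again with stop words and punctuation removed.
content = [w.lower() for w in inaugural.words()
           if w.isalpha() and w.lower() not in stop]
FreqDist(content).tabulate(20)
```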