Many of you may have tried "Subtotal" in Excel. A similar idea exists in Python, called "grouping": first divide the data into several groups according to some rule, then process each group separately, and finally summarize the results. Let's waste no time looking at the basic and important methods of data analysis and statistics in pandas. pandas provides powerful data structures, Series and DataFrame, along with a large number of functions and methods, which makes it a powerful tool for data analysis in Python. Let's have a look at how many methods a DataFrame object offers. Quite a few, right? For example, the group of methods starting with to_ write a DataFrame object into various types of files, and those methods alone number more than 20. In this part, let's talk about data statistics and analysis based on pandas. Mostly, we'll build on the information on the Dow Jones Industrial Average stocks (the DataFrame djidf) and the stock information of American Express over the past year (the DataFrame quotesdf) that we built before, posing several small questions to discuss the basic methods of data filtering and statistics in pandas as well as the concepts and methods of more advanced grouping and merging. Data in other fields are processed similarly. It's worth mentioning that, to facilitate the statistics, I've slightly modified quotesdf. Look at the program: the "date" column, a timestamp in the raw data, has been converted into a regular date with the map() and strftime() functions, and "date" has also been set as the index of quotesdf. You may explore these statements for addition and modification by yourself. This program generates quotesdf, and this is the result of its execution; this program generates djidf. Let's now briefly recall the meaning of the attributes in the two DataFrames. The left one is djidf, and its attributes include "code" for the company code, "name" for the company name, and "price" for the latest trading price; its index, by default, runs from 0 to 29. The
right one is quotesdf, and its attributes include the daily closing price (close), highest price (high), lowest price (low), and opening price (open) of the stock, as well as the trading volume (volume); its index is the date.

Let's look at simple filtering and statistics first. Filtering is needed before statistics; the filtering methods have been roughly learned and practiced in previous chapters, so let's review them through some questions. The first question is to calculate the mean of the latest trading prices of all the stocks. This should be easy: first select the latest trading prices in djidf, with djidf.price or djidf['price'], and then call the mean() method on that column. The second question is to acquire the names of the companies whose latest trading prices are greater than or equal to 300. Obviously, we first need to find the records of such companies through filtering and then read off the result through the "name" attribute. Recall the filtering form "object name[]", where the brackets contain the condition, here that the latest trading price is greater than or equal to 300. Write down the condition and that's it: write it like this, for example, and then list the company names through the ".name" attribute. Look at the two questions below. The first one is an upgraded version of the previous one: apart from the companies whose latest trading prices are greater than or equal to 300, we also want the companies satisfying another requirement. Do we then only need to list the two conditions? The relation between them is "or", expressed with the symbol "|", so it's written like this, and the result looks like this. The second question is to find the number of trading days of American Express in September. The key point here is September: since we're selecting the data from September, we only need to express its starting and ending dates, greater than or equal to the starting value and less than
or equal to the ending value. The relation between the two conditions is "and", with the symbol "&". As "date" is the index, it's written like this. Next, calculate the number of days with the len() function.

Let's look at two other questions. The first is to count the respective numbers of days on which the stock of American Express rose and dropped over the recent year. We know that a stock price rises when the closing price is higher than the opening price, so we only need to count such days. We may use such a condition-filtering expression together with the len() function, similar to the way we just counted the trading days in September. The number of drop days is then the total number of days minus the number of rise days. The other question is to obtain statistics on the rise or drop of the closing price over two consecutive days during the recent year: comparing each day's price with the previous day's tells us whether the closing price rose or dropped, and we count the days of each. The numpy module has an existing function for comparing adjacent elements for increase or decrease: diff(). We may express it like this, with the closing prices as the argument, and then take the sign with the np.sign() function. The result looks like this: 1 indicates a rise and -1 a drop. We then use Boolean indexing on the numpy array and take the lengths to get the respective numbers of rise and drop days. Here you might have a question: with so many functions available, how can I know whether the function I need exists? We may approach this from two perspectives. First, read more of the official tutorials on the relevant modules and the demonstration programs in their documentation, and learn the capabilities and usage of the functions and methods there; use the dir() and help() functions more as well. This is also a habit we
should develop, one that I often emphasize. The other is to have confidence in Python, in other words, confidence in the standard library and the third-party libraries you use: almost every little trick we can think of, as we eventually discover, has already been realized in some function or method. Based on these two points, as long as we find suitable functions, we will solve problems more efficiently.

Sorting is something we often handle during data processing. Let's look at this question: sort the stocks according to their latest trading prices and list the top three company names. We may use the sort_values() method of DataFrame, which sorts by value; if we were sorting by index, we would use the sort_index() method instead. What is the sort based on in this question? It must be the latest trading price. sort_values() has an essential argument, "by", and we just need to set its value to 'price'. Moreover, as we want the top three companies, shouldn't the sort be in descending order? Use the "ascending" argument and set it to False. Then apply slicing to extract the top three companies, and with the "name" attribute we obtain the desired result. Another question here is to count the number of trading days of American Express in every month of the recent year. It is an upgraded version of the September question we discussed before. Think about it: in quotesdf, only "date", which is currently the index, contains the month. If we can extract the month from "date" and then count the trading days by month, can the question be solved? The answer is yes. pandas does have a grouping function, groupby(), which takes an argument serving as the criterion for grouping. Here we count the days in each month, so the criterion for grouping is the month. Well, first, let's see how we can acquire the month.
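A quick sketch of the sorting question just discussed; the five-company djidf below is an invented stand-in for the real 30-company one, with made-up names and prices:

```python
import pandas as pd

# Invented stand-in for the course's djidf (codes, names and prices are made up).
djidf = pd.DataFrame({
    'code':  ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'name':  ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'],
    'price': [120.0, 480.0, 95.0, 310.0, 205.0],
})

# Sort by the latest trading price, highest first, then slice off the
# top three rows and read their names.
top3 = djidf.sort_values(by='price', ascending=False)[:3].name
print(list(top3))  # ['Beta', 'Delta', 'Epsilon']
```

The same pattern works on the real djidf, only with 30 rows instead of 5.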
Let's do some actual operations. Observe quotesdf first. As we see, the month consists of the 5th and 6th characters of the "date" attribute, so we may select all the months through slicing. Then use the groupby() function to group by month. The grouping result is a DataFrameGroupBy object, and we may view it through the object's "groups" attribute. As we see, the data are divided into 12 groups, one per month. Counting the trading days is easy after grouping: as pandas supports vectorized operations, we may imagine the result as just a single month. To count the trading days of that month, can we first select the "open" column and then count it with the count() method? This is the result of the operation. It's worth mentioning that the "groups" attribute only shows the groups, not the form of the data after grouping. It is possible, though, to view the data after grouping; do it like this. This is the group's serial number, and this is its data. Got it?
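A minimal sketch of the month extraction and grouping, using a handful of invented dates in place of the real year of quotes:

```python
import pandas as pd

# Invented quotes indexed by 'YYYY-MM-DD' strings, as in the lecture.
quotesdf = pd.DataFrame(
    {'open':  [1.0, 1.1, 1.2, 1.3, 1.4],
     'close': [1.1, 1.0, 1.3, 1.2, 1.5]},
    index=['2023-08-01', '2023-08-15', '2023-09-01', '2023-09-12', '2023-10-03'],
)

# The month is the 5th and 6th characters of each date string; slice it out.
months = [d[5:7] for d in quotesdf.index]

# Group by those month strings; the result is a DataFrameGroupBy object.
grouped = quotesdf.groupby(months)
print(grouped.groups)            # which index labels fell into which group

# Trading days per month: select the 'open' column, then count().
print(grouped['open'].count())
```

With a real year of data this produces 12 groups instead of the 3 shown here.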
That's why we can later use the attribute name "open" to extract all the "open" columns. There are mainly three steps from grouping to the final aggregated result, namely splitting, applying, and combining. Splitting divides the data into several parts according to the grouping criterion; in our previous example, we split the data into 12 parts by month. Applying means applying some operation to each group, like counting the opening prices here. Combining aggregates the results of all the groups. For understanding grouping problems, one point is to determine which attribute should be the criterion for grouping, and the other is to understand the state of the data after grouping; for the latter, we may resort to attributes such as "groups" to view it.

Let's move on; it's still the previous question. Apart from first selecting the "open" column and then counting, there's another, more frequently used method in pandas for solving this problem: apply(). The apply() method is quite flexible. Its most basic form is DataFrame.apply(f, axis=0), where f is a function, possibly self-defined, and axis is 0 by default, meaning that apply() applies the function to one Series at a time, that is, it automatically traverses each data column of the DataFrame and processes it according to what the function does; after processing, all the results are aggregated and returned. If axis is set to 1, each data row of the DataFrame is traversed and processed instead. For example, in this question we've used the groupby() method to group by month. As vectorized operations are supported, we only need to consider how to calculate the length of a Series object. As you might already know, the len() function works here, and we only need to pass the function name len as an argument to the apply() method; it's fine to write it like this. The apply() method is often used in conjunction with
groupby(); sure, it may also be used on its own. Let's look at some more examples to understand this method better, demonstrating directly. If we want to calculate the maximum value of each attribute of quotesdf, how do we proceed? Just call the max() method of the DataFrame directly, right? The default axis of 0 makes it act on each column. If we want the maximum value of each record instead, just set axis to 1, although that result is meaningless for this DataFrame. And what if we want to change the "close" and "open" attributes in quotesdf from the float type to the int type? As we mentioned, just use astype(int). The previous three cases can all be solved with the apply() method. To calculate the maximum value of each attribute, we may write it like this, as the argument of the max() function can be a Series; to calculate the maximum value of each row, just add axis=1. And what about changing the "close" and "open" attributes in quotesdf from float to int?
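Before taking up that type question, the first two cases can be sketched as follows, with a tiny invented quotesdf fragment in place of the real one:

```python
import pandas as pd

# A tiny invented quotesdf fragment (two rows, two columns).
quotes = pd.DataFrame({'open': [10.2, 11.5], 'close': [10.9, 11.1]})

# Column-wise maximum: apply() walks each column by default (axis=0),
# so passing the built-in max gives the same result as quotes.max().
col_max = quotes.apply(max)

# Row-wise maximum: axis=1 makes apply() walk each record instead.
row_max = quotes.apply(max, axis=1)
print(col_max['open'], list(row_max))  # 11.5 [10.9, 11.5]
```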
How should we write it? Can we write it like this? As we see, no. Why not? Because the int() function does not support Series objects, so it's wrong to write it this way. What should we do? We often use the type-conversion functions in the numpy module that support vectorized operation, such as np.int32. Let's try writing it; it works. Sure, pandas has another method similar to apply() that acts on each element of a DataFrame object, namely applymap(). Since it acts element by element, we can easily see that the previous int() function would be correct if used with the applymap() method, and indeed it is. Note that in the processing just now, we didn't actually change the types of the two data columns "close" and "open" in the DataFrame; they are still of the float type. If you want to change them directly, just do it through assignment, still using the apply() method. Have a look: the two columns have indeed been converted to the int type.

Finally, let's talk about merging in pandas. Merging is a very common data-processing operation in Python. There are three ways: append, concat, and join. Of them, join is of the SQL type, which, to put it simply, is a join at the level of database operations. Sounds impressive, right? First, look at the example about append: append the stock information of American Express over one period of time to the end of the trading information from another period. The method is simple: just select the two parts of the data records and then append the second part to the first with the append() method to achieve the merge. (Note that DataFrame.append() was removed in pandas 2.0; in current versions, use pd.concat() instead.) Next, look at concat. As we see, its way of working differs from that of the append() method. The first step requires two objects, i.e.
two fragments. Then put them into a list and use the concat() function to join them; the final effect of the merge looks like this. From the two examples above, we've found that append() and concat() differ: one appends (sure, the "append" here is not in the strict sense, since no new data are added to the original data), while concat joins. Then I have another question: given two objects with different logical structures, for instance one object with 10 attributes and another with 2 more, can they be concatenated? Let's think: since Python is so powerful, it should be possible, and it really is. The concat() function unions the columns, filling the missing values with NaN, and its ignore_index argument, when set to True, renumbers the rows of the result. No more details will be discussed due to our time limit. Next, look at the last way to merge: join. The concept of "join" here is identical to that of "join" in databases. If we want to join two tables together, how can we achieve it? They must share a common field (attribute). A common field doesn't mean the names must be identical; the key point is that the data meanings are the same. For example, the fields (attributes) in this table are "code" and "name", while another table has "volume", "code", and "month". Suppose the two "code" fields share the same meaning; we can then join the tables on the shared field "code" to get such a table. In fact, to put it simply, this can be understood as adding to the second table the "name" attribute corresponding to each "code". Look at an example: merge the means of the monthly trading volumes of American Express and Coca-Cola over the past year with the information on the Dow Jones Industrial Average stocks. Suppose we've already generated through a program a DataFrame object AKdf covering both companies, whose attributes include "volume", "code", and "month". Observe the two plots: if we remove the "price" attribute from djidf, would the merged result have an additional "name" attribute? The effect is like this. As there's a shared "code"
attribute, we only need to use the merge() function to achieve it. The value of the argument "on" is the key for the join. Here, as both DataFrame objects contain an attribute named "code" with the same meaning, "on" may even be omitted; if the names differ, we may use arguments such as "left_on" and "right_on" instead. The merge() function has some other arguments for realizing the various kinds of SQL joins; as time is limited, no more examples will be given here, but if you're interested, you may explore further.

That's all for the methods of data statistics and analysis based on pandas introduced in this part. Sure, we've only introduced some basic and important ones; you still need to keep exploring and, when problems arise, keep practicing and trying. I believe pandas will not let you down.
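As a closing sketch, here is the join described above; both frames are invented stand-ins (a two-row version of djidf without its "price" column, and a tiny AKdf with made-up volumes):

```python
import pandas as pd

# Invented stand-in for djidf with the 'price' column removed.
codes = pd.DataFrame({
    'code': ['AXP', 'KO'],
    'name': ['American Express', 'Coca-Cola'],
})

# Invented stand-in for AKdf: mean monthly trading volumes per company.
AKdf = pd.DataFrame({
    'volume': [3.2e6, 2.9e6, 4.1e6, 3.8e6],
    'code':   ['AXP', 'AXP', 'KO', 'KO'],
    'month':  ['01', '02', '01', '02'],
})

# Join on the shared 'code' field; since the column name is the same on
# both sides (and means the same thing), on='code' could even be omitted.
merged = pd.merge(AKdf, codes, on='code')
print(merged)
```

The result carries AKdf's columns plus the matching "name" for each "code", exactly the effect the lecture describes.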