class: center, middle # Pandas for Data Analysis
by Titipat Achakulvisut, reference from Python for Data Analysis by Wes McKinney --- ## Essential Python Libraries Recap from last week, here are few libraries that are essentials in Python. - `numpy` multidimensional array, element-wise computation, linear algebra operations - `sciPy` collection of packages addressing a number of different standard problem domains in scientific computing (including sparse matrix computation) - `pandas` data structures, functions designed to make working with structured data fast, easy, and expressive - `matplotlib` producing plots and other 2D data visualizations Today, we're going to focus on `pandas` library --- ## Getting pandas `pandas` comes with Anaconda distribution already. You can also update using `pip` ```bash pip install pandas --upgrade ``` Main pandas data structure is the _DataFrame_ (it's like spread sheet, table). To create dataframe, you can do something like the following: ```python import pandas as pd # convention df = pd.DataFrame([{'a': 2, 'b': 3}, {'a': 1, 'b': 2}]) # from list of dictionary ``` you can also create dataframe from lists of list. ```python df = pd.DataFrame([[2, 3], [1, 2]], columns=['a', 'b']) ``` --- ## Read and write files See [IO Tools](http://pandas.pydata.org/pandas-docs/stable/io.html) for full functionalities. To read data to dataframe, you can use: - `pd.read_csv` - `pd.read_excel` - `pd.read_json` And writing are something like following - `df.to_csv(..., index=False)` - `df.to_excel(...)` - `df.to_json(..., orient='records')`, you can also select format (`split`, `records`, `index`, `columns`, `values`) Or even write multiple sheets ```python writer = pd.ExcelWriter('example_sheets.xlsx') df1.to_excel(writer, 'sheet1') df2.to_excel(writer, 'sheet2') writer.save() ``` --- ## Basic functinalities Describe something quick ```python df.describe() ``` Cast to array/list ```python df.as_matrix() list(df.as_matrix()) ``` Concatenate dataframe ```python df1 = pd.DataFrame([{'a': 2, 'b': 3}, {'a': 1, 'b': 2}]) df2 = pd.DataFrame([{'a': 4, 'b': 6}, {'a': 3.5, 'b': 10}]) df_concat = pd.concat((df1, df2), axis=0) ``` ```python df1 = pd.DataFrame([{'a': 2, 'b': 3}, {'a': 1, 'b': 2}]) df2 = pd.DataFrame([{'c': 4, 'd': 6}, {'c': 3.5, 'd': 10}]) df_concat = pd.concat((df1, df2), axis=1) ``` --- ## Basic functinalities Fill NaN with empty string ```python df.fillna('', inplace=True) ``` Drop duplicated rows ```python df.drop_duplicates(subset='col_name', inplace=True) ``` --- ## Chicago Food Inspection Let's explore Chicago food inspection data from the City of Chicago. Download data [here](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data). Reading data using `csv`: ```python import csv food = list(csv.DictReader(open('Food_Inspections.csv'))) ``` Reading data using `pandas`: ```python import pandas as pd food_df = pd.read_csv('Food_Inspections.csv').fillna('') print(food_df.columns) ``` We'll explore together in class. --- ## Q/A