Organize pandas notebook with cool hacks

36 minute read


messy-notebook

Does this messy notebook ring a bell? I am sure you have created or encountered a similar one while performing data analysis tasks in pandas.

Pandas is widely used by data scientists and ML engineers around the world for all kinds of data-related tasks: data cleaning and preprocessing, data analysis, data manipulation, data conversion, and so on. However, most of us are not using it well, as seen in the example above, and this hurts our productivity a lot.

You might then wonder what the correct way to use pandas is. Is there a particular way to make notebooks clean and modular, and so increase our productivity?

Luckily, there is a quick hack, or technique if you prefer, that can greatly improve the workflow and make notebooks not only clean and well organized but also highly productive and efficient. The good thing is that you don't need to install any extra packages or libraries. In the end, your notebook will look something like this:

Clean notebook


Untitled12.ipynb

The way to achieve clean and well-organized pandas notebooks was explored in the presentation Untitled12.ipynb by Vincent D. Warmerdam at PyData Eindhoven 2019.

The presentation, Untitled12.ipynb: Prevent Miles of Scrolling, Reduce the Spaghetti Code from the Copy Pasta, has also been uploaded to YouTube. You can watch the video below if you want:


In this article, I will briefly summarize the presentation by Vincent D. Warmerdam and then move on to the code implementation (solution) and a few code examples based on the methods used in his presentation.

The Untitled phenomenon

Untitled12.ipynb

He began his talk by introducing a term he calls the Untitled phenomenon. It refers to the bad practice of not naming notebook files, which eventually creates an unorganized pile of Untitled notebooks. Fittingly, he named the presentation itself Untitled12.ipynb.

Moreover, it is not only our naming habits that need improvement, but also how we organize code inside the notebook. Copying and pasting code multiple times creates spaghetti code, and this is especially true for many data-science Jupyter notebooks. The goal of his talk was to uncover a pattern for pandas that prevents loads of scrolling and makes the code behave like Lego bricks. He also gave some useful tips and tricks on how to prevent miles of scrolling and reduce spaghetti code when creating Jupyter notebooks.

> Skip to coding solution

I have first written a summary of the talk Untitled12.ipynb and explored some common problems with the usual coding style before moving to the solution. If you want to jump straight to the coding solution for creating a clean pandas notebook using a pipeline, click the link above. However, I recommend reading about the common problems before going to the solution.

Contents

I will be talking about the following topics, which more or less revolve around his talk:

  • Importance of Workflow
  • The usual coding style
  • Problems in the usual coding style
  • Coding Solution
  • Advantages

Importance of Workflow

He began the presentation by discussing the following points, which highlight the importance of workflows and the need for Jupyter notebooks and pandas over Excel:

  • We want to separate the data from the analysis: The analysis should not modify the raw data. The raw data should stay safe from such modifications so that it can be reused later. However, this is not possible in Excel.

  • We want to be able to automate our analysis: The main aim of programming and workflows is automation. Our tasks become much easier if we can automate the analysis with a pandas script rather than repeating it by hand in Excel every time.

  • We want our analysis to be reproducible, i.e. we must be able to reproduce the same results on the data at a later time.

  • We should not pay a third party obscene amounts of money for something as basic as arithmetic. That budget is better allocated towards innovation and educating staff.

However, the usual style of coding in pandas and Jupyter notebooks solves only the last point.

The usual coding style

Let’s explore the common practice of writing pandas code and try to point out the major problems in such approaches.

Initially, I will show the general workflow that most of us follow while using pandas. I will perform some analysis on the real COVID-19 dataset of U.S. states obtained from The COVID Tracking Project, which is available under the Creative Commons CC BY-NC-4.0 license. The dataset is updated each day between 4 pm and 5 pm EDT.

After showing the common approach, I will point out the major pitfalls and then move on to the solution.

First, I will download the U.S. COVID-19 dataset using the API provided by The COVID Tracking Project.

!mkdir data
!wget -O data/covid19_us_states_daily.csv https://covidtracking.com/api/v1/states/daily.csv
!wget  -O data/state_info.csv https://covidtracking.com/api/v1/states/info.csv
--2020-06-05 16:34:10--  https://covidtracking.com/api/v1/states/daily.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.63.231, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.63.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/covid19_us_states_daily.csv’

data/covid19_us_sta     [  <=>               ] 987.40K  3.11MB/s    in 0.3s    

2020-06-05 16:34:11 (3.11 MB/s) - ‘data/covid19_us_states_daily.csv’ saved [1011093]

--2020-06-05 16:34:12--  https://covidtracking.com/api/v1/states/info.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.50.87, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.50.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/state_info.csv’

data/state_info.csv     [ <=>                ]  27.67K  --.-KB/s    in 0.02s   

2020-06-05 16:34:13 (1.43 MB/s) - ‘data/state_info.csv’ saved [28329]
import pandas as pd 
# Importing plotly library for plotting interactive graphs
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import chart_studio
import chart_studio.plotly as py

The first step is generally to read or import the data.

df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
df.head()
[Output: the first five rows of df. The index is the raw integer date (e.g. 20200604), and there are 34 columns, including state, positive, negative, pending, hospitalized, recovered, death, dateChecked, fips, totalTestResults, hash, score and grade.]

After taking a glance at the data, I realize that the date is not formatted well, so I format it.

df.index = pd.to_datetime(df.index, format="%Y%m%d")
df.head()
[Output: the same five rows, now indexed by proper datetime values such as 2020-06-04.]

Then, I try to view some additional information about the dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   state                     5113 non-null   object 
 1   positive                  5098 non-null   float64
 2   negative                  4902 non-null   float64
 3   pending                   842 non-null    float64
 4   hospitalizedCurrently     2591 non-null   float64
 5   hospitalizedCumulative    2318 non-null   float64
 6   inIcuCurrently            1362 non-null   float64
 7   inIcuCumulative           576 non-null    float64
 8   onVentilatorCurrently     1157 non-null   float64
 9   onVentilatorCumulative    198 non-null    float64
 10  recovered                 2409 non-null   float64
 11  dataQualityGrade          4012 non-null   object 
 12  lastUpdateEt              4758 non-null   object 
 13  dateModified              4758 non-null   object 
 14  checkTimeEt               4758 non-null   object 
 15  death                     4388 non-null   float64
 16  hospitalized              2318 non-null   float64
 17  dateChecked               4758 non-null   object 
 18  fips                      5113 non-null   int64  
 19  positiveIncrease          5113 non-null   int64  
 20  negativeIncrease          5113 non-null   int64  
 21  total                     5113 non-null   int64  
 22  totalTestResults          5113 non-null   int64  
 23  totalTestResultsIncrease  5113 non-null   int64  
 24  posNeg                    5113 non-null   int64  
 25  deathIncrease             5113 non-null   int64  
 26  hospitalizedIncrease      5113 non-null   int64  
 27  hash                      5113 non-null   object 
 28  commercialScore           5113 non-null   int64  
 29  negativeRegularScore      5113 non-null   int64  
 30  negativeScore             5113 non-null   int64  
 31  positiveScore             5113 non-null   int64  
 32  score                     5113 non-null   int64  
 33  grade                     0 non-null      float64
dtypes: float64(13), int64(14), object(7)
memory usage: 1.4+ MB

You can see that several columns are of no use, so I decide to remove them.

df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'], 
        axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   state                     5113 non-null   object 
 1   positive                  5098 non-null   float64
 2   negative                  4902 non-null   float64
 3   pending                   842 non-null    float64
 4   recovered                 2409 non-null   float64
 5   death                     4388 non-null   float64
 6   hospitalized              2318 non-null   float64
 7   dateChecked               4758 non-null   object 
 8   positiveIncrease          5113 non-null   int64  
 9   negativeIncrease          5113 non-null   int64  
 10  total                     5113 non-null   int64  
 11  totalTestResults          5113 non-null   int64  
 12  totalTestResultsIncrease  5113 non-null   int64  
 13  deathIncrease             5113 non-null   int64  
 14  hospitalizedIncrease      5113 non-null   int64  
 15  hash                      5113 non-null   object 
 16  commercialScore           5113 non-null   int64  
 17  negativeRegularScore      5113 non-null   int64  
 18  negativeScore             5113 non-null   int64  
 19  positiveScore             5113 non-null   int64  
 20  score                     5113 non-null   int64  
 21  grade                     0 non-null      float64
dtypes: float64(7), int64(12), object(3)
memory usage: 918.7+ KB

I also realize that there are a lot of missing (NaN or null) values, so I replace them with 0.

df.fillna(value=0, inplace=True)
df.head()
[Output: the first five rows after filling; no NaN values remain across the 22 remaining columns.]

I also want to add a column corresponding to the state name instead of the abbreviation. So, I merge state_info with the current dataframe.

df2 = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
df3 = (df
      .reset_index()
      .merge(df2, on='state', how='left', left_index=True))
df3.head()
[Output: the first five rows of df3 with a new name column (Alaska, Alabama, Arkansas, American Samoa, Arizona), but with date demoted to an ordinary column.]

I realize that the date index was lost during the merge, so I set date as the index again. It is also better to rename the name column to state_name.

df3.set_index('date', inplace=True)
df3.rename(columns={'name': 'state_name'}, inplace=True)
df3.head()
[Output: the first five rows of df3, indexed by date again, with the last column renamed to state_name.]

Now that the data is ready for some analysis, I decide to plot the death count in each state over time using the interactive plotly library.

fig1 = px.line(df3, x=df3.index, y='death', color='state')
fig1.update_layout(xaxis_title='date', title='Total deaths in each state (Cumulative)')
py.plot(fig1, filename = 'daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/1/'

Note: These plots are interactive, so you can zoom in or out, pinch, hover over the graph, download it, and so on.

Now, I decide to calculate the total deaths in the US across all states and plot it.

df4 = df3.resample('D').sum()
df4.head()
[Output: daily US-wide totals; the earliest rows (2020-01-22 through 2020-01-26) show a single positive case and zeros everywhere else.]

fig2 = px.line(df4, x=df4.index, y='death')
fig2 = px.line(df4, x=df4.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'

I also want to calculate the number of active cases, i.e.

active = positive - death - recovered

df4['active'] = df4['positive'] - df4['death'] - df4['recovered']

Now that I have the active column, I want to plot active cases instead of deaths. So, I go back to the previous cell, replace death with active, and regenerate the plot:

In [25]: df4['death'].plot()

In [25]: df4['active'].plot()

fig3 = px.line(df4, x=df4.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'

Then I decide to calculate the statistics for the month of May alone. Since the data is cumulative, I need to subtract the April figures from the May figures to find the increase during May, after which I plot the results.

df5 = (df3.loc['2020-05']
          .groupby('state_name')
          .agg({'positive': 'first',
                'negative': 'first',
                'pending': 'first',
                'recovered': 'first',
                'death': 'first',
                'hospitalized': 'first', 
                'total': 'first', 
                'totalTestResults': 'first',
                'deathIncrease': 'sum',
                'hospitalizedIncrease': 'sum', 
                'negativeIncrease': 'sum', 
                'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))

df6 = (df3.loc['2020-04']
          .groupby('state_name')
          .agg({'positive': 'first',
                'negative': 'first',
                'pending': 'first',
                'recovered': 'first',
                'death': 'first',
                'hospitalized': 'first', 
                'total': 'first', 
                'totalTestResults': 'first',
                'deathIncrease': 'sum',
                'hospitalizedIncrease': 'sum', 
                'negativeIncrease': 'sum', 
                'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))

df7 = df5.sub(df6)
df7.head()
[Output: the first five rows of df7, the May-minus-April differences per state (Alabama, Alaska, American Samoa, Arizona, Arkansas) across 13 columns.]
fig4 = px.bar(df7, x=df7.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'

Problems in the usual coding style

Now that I have demonstrated the usual approach followed in pandas notebooks, let's discuss its problems.

1. Flow is disrupted:

The flow of the notebook is difficult to follow and causes problems. For example, a variable may be created below the plot that needs it; in the code above, df4['active'] was created below the cell in which it was first needed. This can cause errors when others run the notebook top to bottom. It also forces you to scroll the notebook for miles and miles.

2. No reproducibility:

When the notebook is shared, the other person faces a lot of problems executing or understanding it. For instance, the dataframe names convey no information about what each dataframe contains; they simply run from df to df7, which creates a lot of confusion. What you want instead is a notebook that is easy to iterate on and to share with your colleagues.

3. Difficult to move the code to production:

With this approach, your code is not ready to move into production. You end up having to rewrite the whole notebook before moving it to production, which is not effective.

4. Unable to automate:

In its current condition, the notebook cannot be automated: execution may fail with an error partway through, or the hard-coded file names may not be available.

Although the code may yield an interesting conclusion or the desired output, we cannot even be sure that the conclusion is correct.

Despite the many problems with this approach, it remains the common way of building notebooks: while coding, we get instant gratification whenever a cell works, so we keep coding in the same ad hoc way.

Coding Solution

1. Naming convention

Follow a naming convention for notebooks, as suggested by Cookiecutter Data Science, that shows the owner and the order in which the analysis was done. You can use the format

<step>-<ghuser>-<description>.ipynb

(e.g., 0.1-ayush-visualize-corona-us.ipynb).
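
A notebooks folder following this convention might look like this (the file names here are hypothetical):

0.1-ayush-initial-data-exploration.ipynb
0.2-ayush-clean-covid-us-data.ipynb
1.0-ayush-visualize-corona-us.ipynb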

2. Plan your steps beforehand

Load the data, then think in advance about all the analysis steps or tasks you will be doing in the notebook. You don't need to work out the logic right away; just keep the steps in mind.

df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')

3. Create functions

You know that initially you want to clean the data and make sure the columns and indexes are in a proper, usable format. So, why not create a function for each subtask on the dataframe and name it accordingly?

For example, first you want to make the index a proper datetime object; then you may want to remove duplicates, then add the state name. Just declare these functions without worrying about the logic, and fill in the logic later. This way, you stay on track and never get lost.

We will define these functions right after creating a decorator for them.

4. Create proper decorators

Before adding functions, let's also think about some additional utility that would be helpful. During a pandas analysis, you often check the shape, columns, and other information associated with the dataframe after performing an operation. A decorator can automate this process.

A decorator is simply a function that expects a function and returns a function. Really functional, right? Haha. Don't be put off by the definition; it is not as difficult as it sounds. We will see how it works in the code below.

Also, if you are not familiar with decorators or want to learn more about them, you can visit the article by Geir Arne Hjelle.
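
If you have never used one, here is a toy example (made up purely for illustration, unrelated to our data) of a decorator that adds behavior around a function:

def log_call(f):
    def wrapper(*args, **kwargs):
        print("calling {}".format(f.__name__))  # behavior added by the decorator
        return f(*args, **kwargs)               # the original function still runs
    return wrapper

@log_call
def add(a, b):
    return a + b

add(2, 3)  # prints "calling add" and returns 5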

import datetime as dt
def df_info(f):
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)      # run the wrapped operation
        toc = dt.datetime.now()
        print("\n\n{} took {} time\n".format(f.__name__, toc - tic))
        print("After applying {}\n".format(f.__name__))
        print("Shape of df = {}\n".format(result.shape))
        print("Columns of df are {}\n".format(result.columns))
        print("Index of df is {}\n".format(result.index))
        for i in range(100): print("-", end='')   # visual separator between log blocks
        return result
    return wrapper

We have created a decorator called df_info, which displays information like the time taken by the function, and the shape, columns, and index of the dataframe after applying any function f.

The advantage of using a decorator is that we get logging for free. You can modify the decorator according to the information that you want to log or display after performing an operation on the dataframe.

Now, following our plan, we create the functions and apply the decorator to them using @df_info. Decorating a function f this way is equivalent to replacing it with df_info(f), so every call f(df, *args, **kwargs) goes through the logging wrapper.
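
To make the mechanics concrete, the decorator syntax is just shorthand for wrapping by hand:

# @df_info is syntactic sugar for manual wrapping:
#
#     @df_info
#     def create_dateindex(df): ...
#
# behaves exactly like
#
#     def create_dateindex(df): ...
#     create_dateindex = df_info(create_dateindex)
#
# so calling create_dateindex(df) now runs the logging wrapper around it.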

@df_info
def create_dateindex(df):
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df

@df_info
def remove_columns(df):
    df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'], 
        axis=1, inplace=True)
    return df

@df_info
def fill_missing(df):
    df.fillna(value=0, inplace=True)
    return df

@df_info
def add_state_name(df):
    _df = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
    df = (df
      .reset_index()
      .merge(_df, on='state', how='left', left_index=True))
    df.set_index('date', inplace=True)
    df.rename(columns={'name': 'state_name'}, inplace=True)
    return df

@df_info
def drop_state(df):
    df.drop(columns=['state'], inplace=True)
    return df

@df_info
def sample_daily(df):
    df = df.resample('D').sum()
    return df

@df_info
def add_active_cases(df):
    df['active'] = df['positive'] - df['death'] - df['recovered']
    return df

def aggregate_monthly(df, month):
    df = (df.loc[month]
        .groupby('state_name')
        .agg({'positive': 'first',
            'negative': 'first',
            'pending': 'first',
            'recovered': 'first',
            'death': 'first',
            'hospitalized': 'first', 
            'total': 'first', 
            'totalTestResults': 'first',
            'deathIncrease': 'sum',
            'hospitalizedIncrease': 'sum', 
            'negativeIncrease': 'sum', 
            'positiveIncrease': 'sum',
            'totalTestResultsIncrease': 'sum'}))
    return df

@df_info
def create_month_only(df, month):
    df_current = aggregate_monthly(df, month)
    # Compute the previous month, rolling back to December of the
    # previous year when the given month is January.
    if int(month[-2:]) == 1:
        prev_month = str(int(month[:4]) - 1) + '-12'
    else:
        prev_month = month[:5] + '{:02d}'.format(int(month[-2:]) - 1)

    df_previous = aggregate_monthly(df, prev_month)
    df = df_current.sub(df_previous)
    return df
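
As an aside, the previous-month arithmetic above can be written more robustly with a pandas Period; this is an alternative sketch, not the method used in the talk:

# Period arithmetic handles year boundaries for free:
# pd.Period('2020-01', freq='M') - 1 is Period('2019-12', 'M')
prev_month = str(pd.Period(month, freq='M') - 1)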

    

5. Remove side effects

However, these functions make in-place changes (side effects), i.e. they modify the originally loaded dataframe. To solve this, we add a function called start_pipeline, which returns a copy of the dataframe.

def start_pipeline(df):
    return df.copy()
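
As a quick sanity check (a sketch, assuming the functions above are already defined), you can verify that operations on the pipeline copy leave the raw dataframe untouched:

raw_columns = list(df.columns)
clean = remove_columns(start_pipeline(df))  # drops columns on the copy only
assert list(df.columns) == raw_columns      # the raw data is unchanged
assert clean.shape[1] < df.shape[1]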

6. Constructing pandas pipelines (Main step)

Now, let’s use these functions to achieve the previous tasks using pipe

df_daily = (df.pipe(start_pipeline)
            .pipe(create_dateindex)
            .pipe(remove_columns)
            .pipe(fill_missing)
            .pipe(add_state_name)
            .pipe(sample_daily)
            .pipe(add_active_cases))

create_dateindex took 0:00:00.003388 time

After applying create_dateindex

Shape of df = (5113, 34)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
       'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
       'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
       'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
       'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
       'negativeIncrease', 'total', 'totalTestResults',
       'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
       'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

remove_columns took 0:00:00.002087 time

After applying remove_columns

Shape of df = (5113, 22)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

fill_missing took 0:00:00.006381 time

After applying fill_missing

Shape of df = (5113, 22)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

add_state_name took 0:00:00.015122 time

After applying add_state_name

Shape of df = (5113, 23)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade', 'state_name'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

sample_daily took 0:00:00.017170 time

After applying sample_daily

Shape of df = (135, 19)

Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
       'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
       'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
       'commercialScore', 'negativeRegularScore', 'negativeScore',
       'positiveScore', 'score', 'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
               '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
               '2020-01-30', '2020-01-31',
               ...
               '2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
               '2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
               '2020-06-03', '2020-06-04'],
              dtype='datetime64[ns]', name='date', length=135, freq='D')

----------------------------------------------------------------------------------------------------

add_active_cases took 0:00:00.002020 time

After applying add_active_cases

Shape of df = (135, 20)

Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
       'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
       'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
       'commercialScore', 'negativeRegularScore', 'negativeScore',
       'positiveScore', 'score', 'grade', 'active'],
      dtype='object')

Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
               '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
               '2020-01-30', '2020-01-31',
               ...
               '2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
               '2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
               '2020-06-03', '2020-06-04'],
              dtype='datetime64[ns]', name='date', length=135, freq='D')

Check out all the logs displayed above. We can see in detail how each operation changed the data, without having to print the dataframe after every operation.

fig2 = px.line(df_daily, x=df_daily.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'
fig3 = px.line(df_daily, x=df_daily.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'
df_may = create_month_only(
                df=(df.pipe(start_pipeline)
                    .pipe(create_dateindex)
                    .pipe(remove_columns)
                    .pipe(fill_missing)
                    .pipe(add_state_name)), 
                month='2020-05')
create_dateindex took 0:00:00.002492 time

After applying create_dateindex

Shape of df = (5113, 34)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
       'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
       'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
       'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
       'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
       'negativeIncrease', 'total', 'totalTestResults',
       'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
       'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

remove_columns took 0:00:00.002219 time

After applying remove_columns

Shape of df = (5113, 22)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

fill_missing took 0:00:00.001883 time

After applying fill_missing

Shape of df = (5113, 22)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

add_state_name took 0:00:00.014981 time

After applying add_state_name

Shape of df = (5113, 23)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
       'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
       'total', 'totalTestResults', 'totalTestResultsIncrease',
       'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade', 'state_name'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               '2020-06-04', '2020-06-04',
               ...
               '2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
               '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
               '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)

----------------------------------------------------------------------------------------------------

create_month_only took 0:00:00.031071 time

After applying create_month_only

Shape of df = (56, 13)

Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
       'total', 'totalTestResults', 'deathIncrease', 'hospitalizedIncrease',
       'negativeIncrease', 'positiveIncrease', 'totalTestResultsIncrease'],
      dtype='object')

Index of df is Index(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District Of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
       'US Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='state_name')
fig4 = px.bar(df_may, x=df_may.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'

You can observe how easily the pipe functionality achieves the required tasks in a clean and organized way. Also, the original dataframe is intact, unaffected by the above operations.

df.head()
[Output: the first five rows of the original df, unchanged: raw integer dates and all 34 original columns, NaNs included.]

7. Create a module

Finally, you can create a module (e.g. processing.py) and keep all the above functions in it. Then you can simply import them in the notebook and use them directly, which cleans things up even further.

processing.py
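
As a sketch, processing.py could be laid out like this, collecting the definitions from earlier in the article:

# processing.py -- all pipeline functions in one importable module
import datetime as dt
import pandas as pd

def df_info(f):
    def wrapper(df, *args, **kwargs):
        ...  # logging elided here; use the full wrapper shown earlier
        return f(df, *args, **kwargs)
    return wrapper

def start_pipeline(df):
    return df.copy()

@df_info
def create_dateindex(df):
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df

# ... plus remove_columns, fill_missing, add_state_name, drop_state,
# sample_daily, add_active_cases, aggregate_monthly and create_month_only,
# exactly as defined above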

While loading the modules, load the autoreload extension so that when you change code in the module, the changes are picked up automatically. For more info, see the autoreload documentation.

%load_ext autoreload
%autoreload 2
from processing import *
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Advantages

1. Effective for the long run (Maintainability)

Although the approach may look like an inefficient way of coding at first, it is very effective in the long run, since you will not have to spend hours maintaining the notebook. Provided the functions are well written and well defined, they are ready for production.

The code is also easy to share, and anyone can understand it, unlike in the previous approach. Even for complex analysis tasks, this approach keeps the notebook maintainable.

2. Proper flow and planning

You do not need to think about the logic of the analysis at the beginning. You can just plan your tasks and write down the required functions, which already gives you a mental framework that helps you stay on track. The calm that follows is likely to have a greater impact on innovation.

Then, you can finally define the logic at the end to make it work.

3. Easier to modify

You might have noticed that the pipe functionality lets you modify the tasks or the flow easily. You can do so by commenting out or adding functions in the pipeline.

For example, you don’t want to remove the columns and sample the data daily. Then you can achieve this simply by commenting those lines as shown below:

df_daily = (df.pipe(start_pipeline)
            .pipe(create_dateindex)
            # .pipe(remove_columns)
            .pipe(fill_missing)
            .pipe(add_state_name)
            .pipe(drop_state)
            # .pipe(sample_daily)
            .pipe(add_active_cases))

4. Easier to debug

In this approach, you know what is happening at each step, which makes debugging much easier. Furthermore, since all the operations are plain functions, you can debug the code by unit testing each function individually.
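
For instance, here is a minimal pytest-style unit test for one of the pipeline functions (a sketch, assuming the functions have been moved into processing.py as described above):

import pandas as pd
from processing import create_dateindex  # assumes the functions live in processing.py

def test_create_dateindex_parses_yyyymmdd():
    # A tiny frame mimicking the raw CSV: integer dates like 20200604 as the index
    raw = pd.DataFrame({'state': ['AK']}, index=pd.Index([20200604], name='date'))
    result = create_dateindex(raw)
    assert isinstance(result.index, pd.DatetimeIndex)
    assert result.index[0] == pd.Timestamp('2020-06-04')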

5. Readability

This approach prevents miles of scrolling and is much more readable than the previous one. By looking at the code, you can easily understand which operations are performed on the data, and thanks to the decorator you can see the effect of each operation at every step.

Example:

Let us consider cooking chicken. When we do so, we don’t describe the steps like this:

temperature = 210 celsius
food1 = Chicken
food2 = Season(food1, with Spices)
food3 = Season(food2, with Gravy)
Serve(PutInOven(food3, temperature), on a plate)

But instead, we describe it the following way:

temperature = 210 celsius
Chicken.Season(with Spices)
        .Season(with Gravy)
        .PutInOven(temperature)
        .Serve()

The pipe functionality helps us to write code in the latter way, which is also much more readable.

6. Reusability

For production, we turn the project into a Python package. You can then import your code into any notebook with a single cell, instead of rewriting the same code in multiple notebooks.
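
For instance (a hypothetical sketch: the package name covid_analysis is made up), any notebook could reuse the pipeline in a single cell:

import pandas as pd
# Assumes the project has been packaged (e.g. pip install -e .)
# with processing.py inside a covid_analysis package.
from covid_analysis.processing import (start_pipeline, create_dateindex,
                                       remove_columns, fill_missing)

df_clean = (pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
            .pipe(start_pipeline)
            .pipe(create_dateindex)
            .pipe(remove_columns)
            .pipe(fill_missing))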

7. Separation into analysis and data manipulation

Once your functions have been moved to a separate module, two levels of abstraction are obtained: analysis and data manipulation.

You can fiddle around on a high level and keep the details on a low level. The notebook then becomes the summary and a user interface where you can very quickly make nice little charts instead of manipulating data or performing analytical steps to get a result.

Final notes

Hence, following these practices while coding in pandas, or while performing similar tasks like building scikit-learn or other ML pipelines, can be extremely beneficial for developers. All four problems mentioned at the beginning are solved by this approach. Giving utmost priority to clarity and interoperability, we should remember that it is a lot easier to solve a problem when we understand it well.

Moreover, if you find writing this code difficult, there is an open-source package called scikit-lego, maintained by Vincent and MatthijsB with contributions from all around the world. It does the hard work of creating such pipelines for you, along with additional features like custom logging. Do check it out.

Also, if you have any questions or suggestions, feel free to comment. I am all ears. Thank you.
