Commands | Description |
---|---|
echo $SHELL | Display the name of the active shell (bash, zsh, or others) |
man command-name Eg: man bash | grep -C2 '$@' | Get information about a command. Here, grep returns 2 leading and trailing lines of context around the matching text '$@' |
command-name --help | Show usage information for the command |
pwd | Print the current path |
pwd | pbcopy | Copy the current path to the clipboard (use xclip or xsel on Linux) |
cd - | Go back to the previous location |
take new_dir | Create new_dir and cd into it, i.e. mkdir new_dir; cd new_dir (a zsh/oh-my-zsh function) |
ls -al | List files. a - all, l - long listing format; a leading d means directory, - means regular file |
ls -ls | List files with detailed info (permissions, date, symbolic links) |
ls -1 | wc -l | Count the number of files in a directory |
cat filename | Show the contents of the file filename |
tee Eg: df -h | tee usage.txt | Display the stdout of a command and also write it to a file |
free -h | Show RAM usage - used and free |
df -h | Show disk usage - used and free |
du -sh . | Show the total size occupied by the current directory |
du -sh * | Show the size of each file or folder in the current directory |
du -sh * | tail -1 | Show the size of the last entry listed for the current directory |
ps ax[c] [| less] | List currently running programs. c - easier to read, less - easier to navigate |
pidof process-name | Get the process ID of a running process |
kill process-id | Kill the process |
uname [-[s][a]] | Display the OS/kernel name. a - detailed info |
stat filename | Display file status |
alias alias-name | Show the actual command behind the alias |
date +format E.g. date +%d/%m/%Y | Print the date in the given format |
cal [-3] [[month] year] E.g. cal -3, cal june 1996, cal 1997, or cal | Calendar. -3 shows the previous and the next month as well |
less file.txt | Show file contents (similar to cat but allows moving up and down) |
more file.txt | Show file contents page by page (similar to less, but less featureful) |
rm -ir | Remove. i - prompt for permission before each file, r - delete recursively |
grep [-i] text_to_search /path/to/file | Search for text in a file. i - case-insensitive |
grep -v text_to_search /path/to/file | Show lines not matching the pattern in a file |
command > file.txt | Redirect the output of command to file.txt. Creates the file if it does not exist; if it exists, overwrites its contents |
command >> file.txt | Redirect the output of command to file.txt. Creates the file if it does not exist; if it exists, appends to its contents |
find / -name "file_name" [2>/dev/null] Eg: find / -name "*backup*" 2>/dev/null | Find a file starting from the root directory. 2>/dev/null redirects the error stream (2) to /dev/null, where it is discarded |
find . -not -name "file_name" | Find files not matching the filename |
find . -name "file_name" | xargs -I % rm % | Find and delete files matching the filename |
find . -name "file_name" -exec rm -i {} \; | Find and delete files matching the filename |
find . -name "file_name" -exec grep -i "Hello" {} \; | Find files matching the filename and search them for "Hello" |
find -E . -regex ".*/file_name[0-9].sh" | Find files matching the regular expression (-E is macOS/BSD-only syntax) |
find -E . -not -regex ".*/file_name[0-9].sh" | Find files not matching the regular expression (-E is macOS/BSD-only syntax) |
command | grep text_to_search Eg: find / -name "backup" 2>/dev/null | grep $USER | Use a pipe to combine grep with other commands |
awk | Very powerful command for pattern scanning and processing |
<C-T> | fzf: fuzzy-find files or directories (requires fzf) |
<C-R> | fzf: fuzzy-find commands in history |
<Esc-C> | fzf: fuzzy-find directories from the current path |
top or htop or ytop or gotop | Process info and CPU usage (htop and ytop need to be installed) |
tree [-aldf][-L level][-P pattern][-I pattern][-o filename] | Display a directory's contents as a tree. a - all files, l - follow symbolic links, d - directories only, f - print full paths, L - limit the number of levels, I - exclude files matching pattern, P - list only files matching pattern, o - output to filename (requires tree) |
lsof /dev/nvidia* | awk '{print $2}' | Display the IDs of processes using the CUDA/GPU devices |
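As a quick sanity check of the redirection and tee rows above (`>` overwrites, `>>` appends, `tee` writes and displays), here is a minimal sketch; the filenames are made up for the demo:

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp"

# '>' truncates: only the last write survives.
echo "first"  > out.txt
echo "second" > out.txt
cat out.txt                # prints: second

# '>>' appends: both lines survive.
echo "first"  > out.txt
echo "second" >> out.txt
cat out.txt                # prints: first, then second

# tee shows the output on stdout AND writes it to a file in one step.
echo "disk usage report" | tee usage.txt
cat usage.txt              # prints: disk usage report
```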
To access servers hosted on the remote machine from the local machine
$ ssh -NL port1_server:localhost:port1_local [-NL port2_server:localhost:port2_local]{multiple ports possible} username@remote-ip-address
Example:
$ ssh -NL 8888:localhost:8888 ayush@192.168.100.7
$ scp username@remote-ip:/some/remote/directory/\{file1,file2,file3\} /localpath
$ scp username@remote-ip:'/path1/file1 /path2/file2 /path3/file3' /localPath
Generate ssh key:
Using ed25519 (more secure; recommended)
$ ssh-keygen -t ed25519
Using RSA
$ ssh-keygen -t rsa -b 3072
Save ssh host info
Modify this file: ~/.ssh/config
Host *
AddKeysToAgent yes
UseKeychain yes
IdentityFile ~/.ssh/id_rsa (i.e. path/to/key)
Host targaryen
HostName 192.168.1.10
User daenerys
Port 7654
Host tyrell
HostName 192.168.10.20
Host martell
HostName 192.168.10.50
Host *ell
User oberyn
Host * !martell
LogLevel INFO
Host *
User root
Compression yes
Copy your public key to the server so the password need not be re-entered every time
Run this on the client (not the server):
ssh-copy-id -i path/to/key.pub username@server-ip-address
Example:
ssh-copy-id -i ~/.ssh/id_rsa.pub ayush@192.168.1.107
Open a server in Nautilus / the file explorer on Linux
File explorer: Other locations > Connect to server > sftp://username@ip/
tmux | Create a tmux session with default window name 0 |
tmux new -As name | Create a tmux session with a name, or attach to it if it already exists |
tmux ls | List the active tmux sessions |
tmux a -t name | Attach to an existing tmux session |
tmux kill-session -t name | Kill an existing tmux session |
<prefix> = <c-B> (default), can be changed to <c-A> | |
<prefix> [%"] | Split panes (% - left/right, " - top/bottom) |
<c-D> | Exit (close the current pane) |
<prefix> D | Choose a client to detach |
<prefix> c | Create a new window (appears in the status bar) |
<prefix> 0 | Switch to window 0 |
<prefix> 1 | Switch to window 1 |
<prefix> x | Kill the current window |
<prefix> d | Detach tmux (exit back to the normal terminal) |
<prefix> z | Toggle the active pane between zoomed and unzoomed |
<prefix> space | Switch between split orientations |
<prefix> ! | Break the current pane out into a new window |
<prefix> { or } | Swap the current pane with the previous / next pane |
<prefix> ( or ) | Switch between tmux sessions (previous / next) |
<prefix> <C-o> | Rotate panes within a window |
<prefix> :move-window -t 2 | Renumber the current window to 2, if 2 does not exist |
<prefix> :resize-pane -D n | Resize the current pane down by n cells |
<prefix> :resize-pane -U n | Resize the current pane up by n cells |
<prefix> :resize-pane -L n | Resize the current pane left by n cells |
<prefix> :resize-pane -R n | Resize the current pane right by n cells |
<prefix> :join-pane [-dhv] [-l size | -p percentage] [-s src-pane] [-t dst-pane] Eg: <prefix> :join-pane -v -s 4 -t :1 | Join one pane to another |
<prefix> <C-s> | Save the current state (requires tmux-resurrect) |
<prefix> <C-r> | Reload the saved state |
Verbs (operations) + Nouns (text objects on which the operation is performed)
[count] [operation] [text object / motion]
:[.]!command | . (dot) - replaces the current line with the command's output |
c | change |
d | delete |
C | change everything from where your cursor is to the end of the line |
D | delete everything from where your cursor is to the end of the line |
dd | delete a line |
x | delete a single character |
y | yank text into the copy buffer. |
yy or Y | yank line into the copy buffer. |
v | highlight one character at a time. |
V | highlight one line at a time. |
<c-v> |
highlight by columns. |
p | paste text after the current line. |
P | paste text on the current line. |
> | Shift Right |
< | Shift Left |
= | Indent |
gU | make uppercase |
gu | make lowercase |
~ | swap case |
Must be combined with verbs
iw | inner word (non whitespace) (works from anywhere in a word) |
aw | word with surrounding white space (works from anywhere in a word) aw ~ W. Difference in position. E.g. For dw, cursor must be at beginning, whereas daw works from any position. |
ib | inner block (the contents of parentheses) |
ab | a block (including the parentheses) |
it | inner tag (the contents of an HTML tag) |
at | a tag block |
i" | inner quotes |
a" | a quote (including the quotes) |
ip | inner paragraph |
ap | a paragraph |
is | inner sentence |
as | a sentence |
Combination examples:
gUiw | capitalize a word |
ci( | change inner bracket |
6dW | delete 6 words |
yis | copy inner sentence |
di" | delete inner quotes |
Can be combined with verbs or used independently
[count] w/W | go a (word / word with whitespace) to right |
[count] b/B | go a (word / word with whitespace) to left |
[count] e/E | go to the end of (word / word with whitespace) |
[count] ]m | go to the beginning of next method |
[count] h / j / k / l | left / down / up / right |
[count] f/F [char] [;,]+ | go to the next / previous occurrence of character (; and , repeat) |
[count] t/T [char] [;,]+ | go to just before the next / previous occurrence of character |
% | move to matching parenthesis pair |
[count] + | down to first non blank char of the line. |
[count]$ | moves the cursor to the end of the line. |
0 | moves the cursor to the beginning of the line. |
G | move to the end of the file. |
gg | move to the beginning of the file. |
]m or [m | Move between methods. |
Combination examples:
3ce | change up to the end of the third word |
d]m | delete up to the start of the next method |
ctL | change up to just before the next occurrence of L |
i | Insert to left of cursor |
a | Insert to right of cursor |
A | insert at end of line |
I | insert at beginning of line |
o | open a new line below the cursor and insert |
O | open a new line above the cursor and insert |
u | undo |
<c-r> | redo the last undo |
/text | search for text |
:%s/text/replacement text/g | search through the entire document for text and replace it with replacement text. |
:%s/text/replacement text/gc | search through the entire document and confirm before replacing text. |
* | search forward for word under cursor |
# | search backward for word under cursor |
:vsplit | vertical split windows |
m[a-zA-Z] | set a custom mark; jump to its exact location with `[mark] or to its line with '[mark] |
g; | go to the last cursor position |
'. | move to the last edit |
:marks | show all marks currently in use |
:w | write |
:w file_name | write the changes to a new file |
:q | quit |
:q! or ZQ | force quit |
:wq or ZZ | write and quit |
:w !sudo tee % | Write with sudo permissions if permission not available |
:bd | remove buffer |
[:vert] :sf filename | find file and open in split mode |
<c-v>, select multiple lines, then I or A and type the required text | insert text at the beginning or end of multiple lines |
q<char> commands q, then [count]@<char> | record a macro into register <char>, end recording with q, and replay it with [count]@<char> |
:ab ipho International Physics Olympiad | Set abbreviation for long terms for easy typing Use <C-v> to prevent expansion |
:norm command Eg: vip then :norm Ithis comes to the left | Apply a sequence of keypresses/commands to each selected line. E.g. select a paragraph and add the text to the left of each line in it |
Global commands Eg: :g/^@/m$ | Apply a command to lines matching a particular pattern. E.g. move all lines starting with @ to the end of the document |
Time travel Eg: :earlier 10m, :earlier 5h, :later 2h | Move to the file state at the specified time in the past or future |
jk (Custom: inoremap jk <Esc>) | <Esc> |
kj (Custom: inoremap kj <Esc>) | <Esc> |
nnoremap <C-c> | <Esc> |
nnoremap <C-s> | :w<CR> |
nnoremap <C-Q> | :wq!<CR> |
Better window navigation | |
nnoremap <C-h> | <C-w>h |
nnoremap <C-j> | <C-w>j |
nnoremap <C-k> | <C-w>k |
nnoremap <C-l> | <C-w>l |
Args is the list of files initially opened, so it's a subset of the buffers.
:args | display args files |
:args **/*.yaml | set the arg list to the matching files |
:sall | open all args files in split mode |
:vert sall | open all args files in vertical split mode |
:windo difft | show differences in all args files |
c-x, c-l | autocomplete |
:vim /TODO/ ## | search in all args files (## expands to the arg list) |
:cdo s/TODO/DONE/g | replace in all args files |
<c-u>, <c-d> | Scroll up / down half a screen |
{ } | Jump up / down between paragraphs (blank lines) |
<c-b>, <c-f> | Scroll up / down a full screen |
<c-y>, <c-e> | Scroll up / down one line |
H / M / L | Navigate to the top / middle / bottom of the screen |
zt | Put current cursor position to top |
zz | Put current cursor position to middle |
zb | Put current cursor position to bottom |
Install any vim plugin manager like vim-plug.
To apply latest settings:
:source $MYVIMRC
First, install ranger
Mac
brew install ranger
Linux
sudo apt install ranger
Install ranger plugin for vim
" Ranger in vim
Plug 'francoiscabrol/ranger.vim'
" Dependency for ranger in neovim
Plug 'rbgrouleff/bclose.vim'
When ranger is open in vim or externally
cw | Rename file/dir: change word |
A | Rename file: append at the end, after the extension |
a | Rename file: append just before the extension |
I | Rename file/dir: insert at the front of the filename/directory |
:bulkrename | Rename a list of files/directories |
:mkdir newdir | Create a new directory |
Space | Highlight/select files/directories |
V | Highlight/select files/directories, similar to visual mode |
uv | Undo highlight/select |
yy | Copy/yank file/dir |
dd | Cut file/dir |
pp | Paste file/dir. If the file exists, a new file is created with _ at the end of the name |
po | Paste, overwriting the existing file/dir |
uy | Undo copy/yank |
dD | Delete |
Z (Custom mapping) | Compress using an external script mapped in ranger |
x (Custom mapping) | Extract using an external script mapped in ranger |
Install Plug tpope/vim-surround
ds['"bB{}t] | delete surrounding quotes/brackets/tags |
cs['"bB{}t] ['"bB{}t] | change one surrounding to another |
ysiw['"bB{}t] | add surrounding around the inner word |
v-select, S['"bB{}t] | add surrounding around the visual selection |
Examples:
<p> Hello </p> | cst<h2> | <h2> Hello </h2> |
if *x>3{ | ysW( | if ( x>3 ) { |
*"hello" | ysWf print<cr(Enter)> | print("hello") |
Install these plugins first
" Show differences with style
Plug 'mhinz/vim-signify'
" Main GIT PLugin :Git
Plug 'tpope/vim-fugitive'
" Git Hub plugin, enables :Gbrowse
Plug 'tpope/vim-rhubarb'
" Git commit browser
Plug 'junegunn/gv.vim'
" Git commit history in each line
<c-o> <c-i> | Jump backward / forward through the jump list (e.g. between buffers) |
:Git diff | Show git differences |
:Gdiffsplit | Show differences in split mode |
:GBrowse | Open the repository in github |
:GV | Show git commit history |
Install COC plugin first
" Intellisense
Plug 'neoclide/coc.nvim', {'branch': 'release'}
gd | Goto Definitions of variable under cursor |
gr | Goto References of variable under cursor |
:CocInstall tool_name E.g. :CocInstall coc-python | Installing coc tools |
:CocUninstall tool_name | Uninstalling coc tools |
:CocList extensions (Tab for autocompletion) | Show extensions |
:CocCommand | execute a COC command |
o | expand/collapse in Coc explorer (First run :CocInstall coc-explorer) |
Install coc-python first
:CocInstall coc-python
Shift K | doc hint |
:Format | autopep8 formatting |
<C-w> w | Switch the cursor between sidebar and code |
<C-n> <C-n> <C-n>, then c / I / A | Multiple cursors: change / insert at the start / insert at the end |
Install fzf in system and fzf plugin
macOS
brew install fzf
# To install useful key bindings and fuzzy completion:
$(brew --prefix)/opt/fzf/install
brew install ripgrep
Linux
sudo apt install fzf
sudo apt install ripgrep
FZF Plugin
Plug 'junegunn/fzf', { 'do': { -> fzf#install() } }
Plug 'junegunn/fzf.vim'
Plug 'airblade/vim-rooter'
:Rg | Find word inside file |
:BLines | Find all occurrences of word in a giant file |
:Lines | Same as above but search in all buffers |
:History: | History of commands run in vim |
:Ag | Similar to :Rg, but uses ag (the silver searcher) instead of ripgrep |
:Buffers | Search through buffers |
> | Tab |
gf | Goto file: open file directly from path written in vim |
Install Startify Plugin for Project management
" Start Screen
Plug 'mhinz/vim-startify'
:SSave | Save session |
:SLoad | Load session |
" Better Comments
Plug 'tpope/vim-commentary'
gc[<count>] <Text object> (Examples below) |
Comment out the target of a motion |
gcap | Comment out a paragraph |
gcc | Comment out the current line |
gc2j | Comment out the current line and 2 lines below it |
Easy remapping
nnoremap <leader>/ :Commentary<CR>
vnoremap <leader>/ :Commentary<CR>
git commit --amend | Amend the previous commit |
git commit -m $'Heading commit\n\nCommit description\nLine 2 of description' | Commit with a commit description in one line |
git push origin -f branchname | Forced push |
git rebase master | Replay the current branch's changes on top of master (pull master before rebasing) |
git log | |
git diff | |
git remote -v | Show repo information |
git reset --hard | Reset the index and working tree, discarding all local changes |
git show | |
git config --global user.name | |
git config --global user.email | |
git reset | Remove files from the current index (the "about to be committed" list) without changing anything else |
git checkout filename | Undo local changes to latest commit |
git stash | Stash local changes temporarily |
git stash list | Show stashed branches |
git stash show | Show the latest stashed file changes |
git stash show -p N | Show the Nth (see number in git stash list) stashed file changes |
git stash drop stash@{index} | Remove the given stash |
git stash clear | Remove all stashes |
git stash list | awk -F: '{ print "\n\n\n\n"; print $0; print "\n\n"; system("git stash show -p " $1); }' | Show the changes in each stash in detail |
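The stash round-trip above can be seen end-to-end in a disposable repository; the filenames and the throwaway identity below are made up for the demo:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "v1" > notes.txt
git add notes.txt
git commit -qm "initial"

echo "v2" > notes.txt        # an uncommitted local change
git stash push -q            # working tree is clean again
cat notes.txt                # prints: v1
git stash pop -q             # the change comes back
cat notes.txt                # prints: v2
```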
Re-apply .gitignore to files that are already tracked:
$ git rm -r --cached .
$ git add .
$ git commit -m "Clean up ignored files"
git reset --hard commit_id (reset to the particular commit. It will destroy any local modifications.)
git stash
git reset --hard commit_id
git stash pop
This saves the modifications, then reapplies that patch after resetting. You could get merge conflicts, if you’ve modified things which were changed since the commit you reset to.
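A minimal sketch of that stash-then-reset workflow in a throwaway repo; here the local edit touches a file the reverted-away commit didn't change, so the pop applies without conflicts:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "one"   > file.txt
echo "draft" > notes.txt
git add .
git commit -qm "c1"
first=$(git rev-parse HEAD)

echo "two" > file.txt
git commit -qam "c2"

echo "more notes" >> notes.txt   # uncommitted work we want to keep
git stash push -q
git reset -q --hard "$first"     # back to c1: file.txt is "one" again
git stash pop -q                 # the uncommitted edit is re-applied
```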
This will create three separate revert commits:
git revert a867b4af 25eee4ca 0766c053
It also takes ranges. This will revert the last two commits:
git revert HEAD~2..HEAD
Similarly, you can revert a range of commits using commit hashes:
git revert a867b4af..0766c053
Reverting a merge commit
git revert -m 1 <merge_commit_sha>
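A sketch of reverting a range in a scratch repo; `--no-edit` keeps it non-interactive, and one revert commit is created per reverted commit:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo 1 > f.txt; git add f.txt; git commit -qm "c1"
echo 2 > f.txt; git commit -qam "c2"
echo 3 > f.txt; git commit -qam "c3"

# Revert the last two commits (c2 and c3), newest first,
# creating two new revert commits.
git revert --no-edit HEAD~2..HEAD

cat f.txt                      # prints: 1
git rev-list --count HEAD      # prints: 5  (3 commits + 2 reverts)
```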
To get just one, you could use rebase -i to squash them afterwards. Or, you could do it manually (be sure to do this at the top level of the repo): get your index and work tree into the desired state, without changing HEAD:
git checkout 0d1d7fc32 .
Then commit. Be sure to write a good message describing what you just did:
git commit
git reset knows five "modes": soft, mixed, hard, merge and keep. I will start with the first three, since these are the modes you'll usually encounter. After that you'll find a nice little bonus, so stay tuned.
soft
When using
git reset --soft HEAD~1
you will remove the last commit from the current branch, but the file changes will stay in your working tree. Also the changes will stay on your index, so following with a git commit will create a commit with the exact same changes as the commit you “removed” before.
mixed
This is the default mode and quite similar to soft. When “removing” a commit with
git reset HEAD~1
you will still keep the changes in your working tree but not in the index; so if you want to "redo" the commit, you will have to add the changes (git add) before committing.
hard
When using
git reset --hard HEAD~1
you will lose all uncommitted changes in addition to the changes introduced in the last commit. The changes won't stay in your working tree, so doing a git status command will tell you that you don't have any changes in your repository.
Tread carefully with this one. If you accidentally remove uncommitted changes which were never tracked by git (that is: committed or at least added to the index), you have no way of getting them back using git.
Bonus (keep)
git reset --keep HEAD~1
is an interesting and useful one. It only resets the files which are different between the current HEAD and the given commit. It aborts the reset if any of these files has uncommitted changes. It basically acts as a safer version of hard.
This mode is particularly useful when you have a bunch of changes and want to switch to a different branch without losing these changes - for example when you started to work on the wrong branch.
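The difference between --soft and --hard described above can be checked in a scratch repo (throwaway names for the demo):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "a"  > f.txt; git add f.txt; git commit -qm "c1"
echo "b" >> f.txt; git commit -qam "c2"

git reset --soft HEAD~1    # drop c2, but keep its change staged
git status --porcelain     # prints: M  f.txt   (staged modification)

git commit -qm "c2 again"  # the commit can be recreated as-is
git reset --hard HEAD~1    # drop it again AND discard the change
git status --porcelain     # prints nothing: the tree is clean
cat f.txt                  # prints: a
```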
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path_to_file" HEAD
git push -f origin master
git rm | rm plus git add combined
git rm --cached | file removed from the index (staging it for deletion on the next commit), but keep your copy in the local file system.
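A small sketch of untracking a file with git rm --cached while keeping it on disk; the filename config.env is made up for the demo:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "secret" > config.env
git add config.env
git commit -qm "accidentally track config.env"

git rm -q --cached config.env   # stop tracking, keep the local copy
echo "config.env" > .gitignore  # make sure it stays untracked
git add .gitignore
git commit -qm "untrack config.env"

ls config.env                   # the file is still on disk
git ls-files                    # prints: .gitignore  (the only tracked file)
```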
mod + r | Resize mode, then arrow keys or vim keys |
mod + Shift + e | Exit |
mod + d | dmenu |
mod + Shift + c | Reload |
mod + Shift + r | Restart |
Alt + Shift | Change keyboard language |
Create a Brewfile in the root of your project with:
touch Brewfile
Add your dependencies in your Brewfile:
tap "homebrew/cask"
tap "user/tap-repo", "https://user@bitbucket.org/user/homebrew-tap-repo.git"
cask_args appdir: "/Applications"
brew "imagemagick"
brew "denji/nginx/nginx-full", args: ["with-rmtp-module"]
brew "mysql@5.6", restart_service: true, link: true, conflicts_with: ["mysql"]
cask "firefox", args: { appdir: "~/my-apps/Applications" }
cask "google-chrome"
cask "java" unless system "/usr/libexec/java_home --failfast"
mas "1Password", id: 443987910
whalebrew "whalebrew/wget"
cask and mas entries are automatically skipped on Linux. Other entries can be run only on (or not on) Linux with if OS.mac? or if OS.linux?.
You can then easily install all dependencies with:
brew bundle
Any previously-installed dependencies which have upgrades available will be upgraded.
brew bundle will look for a Brewfile in the current directory. Use --file to specify a path to a different Brewfile, or set the HOMEBREW_BUNDLE_FILE environment variable; --file takes precedence if both are provided.
My .Brewfile is stored in the home directory, and the HOMEBREW_BUNDLE_FILE environment variable is set to ~/.Brewfile.
You can skip the installation of dependencies by adding space-separated values to one or more of the following environment variables:
HOMEBREW_BUNDLE_BREW_SKIP
HOMEBREW_BUNDLE_CASK_SKIP
HOMEBREW_BUNDLE_MAS_SKIP
HOMEBREW_BUNDLE_WHALEBREW_SKIP
HOMEBREW_BUNDLE_TAP_SKIP
brew bundle will output a Brewfile.lock.json in the same directory as the Brewfile if all dependencies are installed successfully. This contains dependency and system status information which can be useful for debugging brew bundle failures and replicating a "last known good build" state. You can opt out of this behaviour by setting the HOMEBREW_BUNDLE_NO_LOCK environment variable or passing the --no-lock option.
You may wish to check this file into the same version control system as your Brewfile (or ensure your version control system ignores it if you'd prefer to rely on debugging information from a local machine).
You can create a Brewfile from all the existing Homebrew packages you have installed with:
brew bundle dump
The --force option will allow an existing Brewfile to be overwritten as well. The --describe option will output a description comment above each line. The --no-restart option will prevent restart_service from being added to brew lines with running services.
You can also use a Brewfile to list the only packages that should be installed, removing any package not present or dependent. This workflow is useful for maintainers or testers who regularly install lots of formulae. To uninstall all Homebrew formulae not listed in the Brewfile:
brew bundle cleanup
Unless the --force option is passed, formulae that would be uninstalled will be listed rather than actually uninstalled.
You can check whether there's anything to install or upgrade in the Brewfile by running:
brew bundle check
This provides a successful exit code if everything is up-to-date, making it useful for scripting.
For a list of dependencies that are missing, pass --verbose. This will also check all dependencies, not exiting on the first missing dependency category.
Outputs a list of all of the entries in the Brewfile.
brew bundle list
Pass one of --casks, --taps, --mas, --whalebrew or --brews to limit output to that type; it defaults to --brews. Pass --all to see everything. Note that the type of the package is not included in this output.
Runs an external command within Homebrew’s superenv build environment.
brew bundle exec -- bundle install
This sanitized build environment ignores unrequested dependencies, which makes sure that things you didn't specify in your Brewfile won't get picked up by commands like bundle install, npm install, etc. It will also add compiler flags which will help find keg-only dependencies like openssl, icu4c, etc.
You can choose whether brew bundle restarts a service every time it's run, or only when the formula is installed or upgraded in your Brewfile:
# Always restart myservice
brew 'myservice', restart_service: true
# Only restart when installing or upgrading myservice
brew 'myservice', restart_service: :changed
Does it ring a bell looking at this messy notebook? I am sure you must have created or encountered a similar kind of notebook while performing data analysis tasks in pandas.
Pandas is widely used by data scientists and ML Engineers all around the world to perform all kinds of data related tasks like data cleaning and preprocessing, data analysis, data manipulation, data conversion, etc. However, most of us are not using it right, as seen in the above example, which has decreased our productivity a lot.
You might wonder then what is the correct way to use pandas. Is there any particular way that we can make the notebook clean and modular so that we can increase our productivity?
Luckily, there is a type of quick hack or technique, whatever you may call it, which can be used to greatly improve the workflow and make notebooks not only clean and well organized but highly productive and efficient. The good thing is that you don’t need to install any extra packages or libraries. In the end, your notebook will look something like this.
The way to achieve clean and well-organized pandas notebooks was explored in the presentation Untitled12.ipynb by Vincent D. Warmerdam at PyData Eindhoven 2019.
The presentation, Untitled12.ipynb: Prevent Miles of Scrolling, Reduce the Spaghetti Code from the Copy Pasta, has also been uploaded to YouTube.
In this article, I will briefly summarize the presentation by Vincent D. Warmerdam and then move on to the code implementation (solution) and a few code examples based on the methods used in his presentation.
The Untitled phenomenon
He began his talk by introducing a term called the Untitled phenomenon. The term simply refers to the bad practice of not naming notebook files, which eventually creates an unorganized bunch of Untitled notebooks. This is also why he named the presentation Untitled12.ipynb.
Moreover, not only the bad practice of naming that we follow but also the bad organization of code inside the notebook needs to be improved. Copying and pasting code multiple times creates spaghetti code. This is especially true for a lot of data science based Jupyter notebooks. The goal of his talk was to uncover a great pattern for pandas that would prevent loads of scrolling such that the code behaves like lego. He also gave some useful tricks and tips on how to prevent miles of scrolling and reduce the spaghetti code when creating Jupyter notebooks.
I have initially written a summary of the talk Untitled12.ipynb and explored some common problems in the usual coding style before moving to the solution. If you want to jump directly to the coding solution for creating a clean pandas notebook using a pipeline, click the link above. However, I recommend reading about the common problems first.
I will be talking about the following topics which will more or less revolve around his talk.
At the beginning of the presentation, he began by discussing the following points that highlight the importance of workflows and the need of jupyter-notebook and pandas over excel:
We want to separate the data from the analysis: the analysis should not modify the raw data. The raw data should be safe from these modifications so that it can be reused later as well. However, this is not possible in Excel.
We want to be able to automate our analysis. The main aim of programming and workflow is automation. Our tasks become a lot easier if we can automate the analysis using a pandas script rather than performing the analysis every time using Excel.
We want our analysis to be reproducible i.e. we must be able to reproduce the same analysis results on the data at a later time in the future.
We should not pay a third party obscene amounts of money for something as basic as arithmetic. This budget is better allocated towards innovation and education of staff.
However, the current style of coding in pandas and jupyter notebook has solved only the last point.
Let’s explore the common practice of writing pandas code and try to point out the major problems in such approaches.
Initially, I will show the general workflow that most of us follow while using pandas. I will be performing some analysis on the real COVID 19 dataset of the U.S. states obtained from The COVID Tracking Project which is available under the Creative Commons CC BY-NC-4.0 license. The dataset is updated each day between 4 pm and 5 pm EDT.
After showing the common approach, I will point out the major pitfalls and then move on to the solution.
First, I will download the U.S. COVID-19 dataset using the API provided by The COVID Tracking Project
!mkdir data
!wget -O data/covid19_us_states_daily.csv https://covidtracking.com/api/v1/states/daily.csv
!wget -O data/state_info.csv https://covidtracking.com/api/v1/states/info.csv
--2020-06-05 16:34:10-- https://covidtracking.com/api/v1/states/daily.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.63.231, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.63.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/covid19_us_states_daily.csv’
data/covid19_us_sta [ <=> ] 987.40K 3.11MB/s in 0.3s
2020-06-05 16:34:11 (3.11 MB/s) - ‘data/covid19_us_states_daily.csv’ saved [1011093]
--2020-06-05 16:34:12-- https://covidtracking.com/api/v1/states/info.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.50.87, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.50.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/state_info.csv’
data/state_info.csv [ <=> ] 27.67K --.-KB/s in 0.02s
2020-06-05 16:34:13 (1.43 MB/s) - ‘data/state_info.csv’ saved [28329]
import pandas as pd
# Importing plotly library for plotting interactive graphs
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import chart_studio
import chart_studio.plotly as py
The first step is generally to read or import the data
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
20200604 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
After taking a glance at the data, I realize that the date index is not parsed as a date, so I convert it to a proper datetime format.
df.index = pd.to_datetime(df.index, format="%Y%m%d")
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
Then, I try to view some additional information about the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 5113 non-null object
1 positive 5098 non-null float64
2 negative 4902 non-null float64
3 pending 842 non-null float64
4 hospitalizedCurrently 2591 non-null float64
5 hospitalizedCumulative 2318 non-null float64
6 inIcuCurrently 1362 non-null float64
7 inIcuCumulative 576 non-null float64
8 onVentilatorCurrently 1157 non-null float64
9 onVentilatorCumulative 198 non-null float64
10 recovered 2409 non-null float64
11 dataQualityGrade 4012 non-null object
12 lastUpdateEt 4758 non-null object
13 dateModified 4758 non-null object
14 checkTimeEt 4758 non-null object
15 death 4388 non-null float64
16 hospitalized 2318 non-null float64
17 dateChecked 4758 non-null object
18 fips 5113 non-null int64
19 positiveIncrease 5113 non-null int64
20 negativeIncrease 5113 non-null int64
21 total 5113 non-null int64
22 totalTestResults 5113 non-null int64
23 totalTestResultsIncrease 5113 non-null int64
24 posNeg 5113 non-null int64
25 deathIncrease 5113 non-null int64
26 hospitalizedIncrease 5113 non-null int64
27 hash 5113 non-null object
28 commercialScore 5113 non-null int64
29 negativeRegularScore 5113 non-null int64
30 negativeScore 5113 non-null int64
31 positiveScore 5113 non-null int64
32 score 5113 non-null int64
33 grade 0 non-null float64
dtypes: float64(13), int64(14), object(7)
memory usage: 1.4+ MB
You can see that several columns are of no use, so I decide to remove them.
df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'],
axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 5113 non-null object
1 positive 5098 non-null float64
2 negative 4902 non-null float64
3 pending 842 non-null float64
4 recovered 2409 non-null float64
5 death 4388 non-null float64
6 hospitalized 2318 non-null float64
7 dateChecked 4758 non-null object
8 positiveIncrease 5113 non-null int64
9 negativeIncrease 5113 non-null int64
10 total 5113 non-null int64
11 totalTestResults 5113 non-null int64
12 totalTestResultsIncrease 5113 non-null int64
13 deathIncrease 5113 non-null int64
14 hospitalizedIncrease 5113 non-null int64
15 hash 5113 non-null object
16 commercialScore 5113 non-null int64
17 negativeRegularScore 5113 non-null int64
18 negativeScore 5113 non-null int64
19 positiveScore 5113 non-null int64
20 score 5113 non-null int64
21 grade 0 non-null float64
dtypes: float64(7), int64(12), object(3)
memory usage: 918.7+ KB
I also realize that there are a lot of missing (NaN or null) values, so I replace them with 0.
df.fillna(value=0, inplace=True)
df.head()
state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 |
I also want to add a column corresponding to the state name instead of the abbreviation. So, I merge state_info with the current dataframe.
df2 = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
df3 = (df
.reset_index()
.merge(df2, on='state', how='left', left_index=True))
df3.head()
date | state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alaska |
1 | 2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alabama |
2 | 2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 | Arkansas |
3 | 2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 | American Samoa |
4 | 2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 | Arizona |
I realize that the date index was lost during the merge, so I set date back as the index. Also, it is better to rename the name column to state_name.
df3.set_index('date', inplace=True)
df3.rename(columns={'name': 'state_name'}, inplace=True)
df3.head()
state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | state_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alaska |
2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alabama |
2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 | Arkansas |
2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 | American Samoa |
2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 | Arizona |
Now that the data is ready for some analysis, I decide to plot the death count in each state, indexed by date, using the interactive plotly library.
fig1 = px.line(df3, x=df3.index, y='death', color='state')
fig1.update_layout(xaxis_title='date', title='Total deaths in each state (Cumulative)')
py.plot(fig1, filename = 'daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/1/'
Note: These plots are interactive, so you can zoom in or out, pinch, hover over the graph, download it, and so on.
Now, I decide to calculate the total deaths in the US across all states and plot it.
df4 = df3.resample('D').sum()
df4.head()
positive | negative | pending | recovered | death | hospitalized | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||
2020-01-22 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-23 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-24 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-25 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-26 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
fig2 = px.line(df4, x=df4.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'
I also want to calculate the number of active cases, i.e.
active = positive - death - recovered
df4['active'] = df4['positive'] - df4['death'] - df4['recovered']
Now, after calculating the active column, I want to plot active cases instead of deaths. So, I go back to the previous cell, replace death with active, and regenerate the plot.
In [25]: df4['death'].plot()
In [25]: df4['active'].plot()
fig3 = px.line(df4, x=df4.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'
Then I decide to calculate the statistics for the month of May only. Since the data is cumulative, I need to subtract the April data from the May data to find the increase in the various statistics during May, after which I plot the results.
df5 = (df3.loc['2020-05']
.groupby('state_name')
.agg({'positive': 'first',
'negative': 'first',
'pending': 'first',
'recovered': 'first',
'death': 'first',
'hospitalized': 'first',
'total': 'first',
'totalTestResults': 'first',
'deathIncrease': 'sum',
'hospitalizedIncrease': 'sum',
'negativeIncrease': 'sum',
'positiveIncrease': 'sum',
'totalTestResultsIncrease': 'sum'}))
df6 = (df3.loc['2020-04']
.groupby('state_name')
.agg({'positive': 'first',
'negative': 'first',
'pending': 'first',
'recovered': 'first',
'death': 'first',
'hospitalized': 'first',
'total': 'first',
'totalTestResults': 'first',
'deathIncrease': 'sum',
'hospitalizedIncrease': 'sum',
'negativeIncrease': 'sum',
'positiveIncrease': 'sum',
'totalTestResultsIncrease': 'sum'}))
df7 = df5.sub(df6)
df7.head()
positive | negative | pending | recovered | death | hospitalized | total | totalTestResults | deathIncrease | hospitalizedIncrease | negativeIncrease | positiveIncrease | totalTestResultsIncrease | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state_name | |||||||||||||
Alabama | 10884.0 | 119473.0 | 0.0 | 9355.0 | 362.0 | 866.0 | 130357 | 130357 | 106 | -112 | 45594 | 4846 | 50440 |
Alaska | 79.0 | 32497.0 | 0.0 | 116.0 | 1.0 | 0.0 | 32576 | 32576 | -5 | 7 | 17327 | -157 | 17170 |
American Samoa | 0.0 | 171.0 | -17.0 | 0.0 | 0.0 | 0.0 | 154 | 171 | 0 | 0 | 171 | 0 | 171 |
Arizona | 12288.0 | 141132.0 | 0.0 | 3262.0 | 586.0 | 1829.0 | 153420 | 153420 | 290 | 660 | 95076 | 5929 | 101005 |
Arkansas | 3998.0 | 77138.0 | 0.0 | 3970.0 | 72.0 | 309.0 | 81136 | 81136 | 19 | -93 | 37973 | 1266 | 39239 |
fig4 = px.bar(df7, x=df7.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'
Now that I have demonstrated the usual approach followed in a pandas notebook, let's discuss its problems.
The flow of the notebook is very difficult to follow and also creates problems. For example, we may define a variable below the plot that needs it. In the code above, we created df4['active'] below the cell in which it was needed, so the notebook may raise errors when run from top to bottom by others. Also, you may have to scroll the notebook for miles and miles.
When the notebook is shared, the other person faces a lot of problems executing or understanding it. For instance, the dataframe names carry no information about their contents; they simply run from df to df7, which creates a lot of confusion. But you want a notebook that is easy to iterate on and that you can share with your colleagues.
With this approach, your code is not ready to move into production: you end up having to rewrite the whole notebook first, which is not efficient.
The notebook in its current condition cannot be automated, since many problems can occur: errors during execution, missing files referenced by hard-coded paths, and so on.
Although the code may produce an interesting conclusion or the desired output, we cannot be sure that the conclusion is even correct.
Despite all these problems, this style of flow remains common when making a notebook: while coding, people enjoy seeing the code work as they check the outputs, and so they keep continuing in the same way.
Follow a naming convention for the notebook, as suggested by Cookiecutter Data Science, that shows the owner and the order in which the analysis was done. You can use the format <step>-<ghuser>-<description>.ipynb (e.g., 0.1-ayush-visualize-corona-us.ipynb).
Load the data and then think in advance about all the analysis steps or tasks you will be doing in the notebook. You don't need to work out the logic right away; just keep the steps in mind.
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
You know that, initially, you want to clean the data and make sure the columns and indexes are in a proper, usable format. So why not create a function for each subtask on the dataframe and name it accordingly?
For example, you first want to make the index a proper datetime object. Then you may want to remove unneeded columns, fill missing values, and add the state name. Just write these functions down without even thinking about the logic; you can add the logic later. This way, you will stay on track and not get lost.
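Such a plan can be sketched as empty stubs first (a sketch of mine; the bodies below are placeholders, and the real logic comes later in this post):

```python
import pandas as pd

# Planned cleaning steps, written down as stubs before any logic exists.
# Each stub just documents intent and returns the dataframe unchanged.

def create_dateindex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the integer index (e.g. 20200604) to a DatetimeIndex."""
    return df  # TODO

def remove_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the columns that will not be used in the analysis."""
    return df  # TODO

def add_state_name(df: pd.DataFrame) -> pd.DataFrame:
    """Merge in the full state name from state_info.csv."""
    return df  # TODO
```

Writing the skeleton first keeps the notebook's intended flow visible even before any step is implemented.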
The functions themselves will be created after we first write a decorator. Before adding them, let's think about some additional utility that would be helpful: during a pandas analysis, you often check the shape, columns, and other information associated with the dataframe after performing an operation. A decorator can automate this process.
A decorator is simply a function that takes a function and returns a function. It's really functional, right? Haha. Don't get confused by the definition; it is not as difficult as it sounds. We will see how it works in the code below.
Also, if you are not familiar with decorators or want to learn more about them, you can read the article by Geir Arne Hjelle.
import datetime as dt
def df_info(f):
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)
        toc = dt.datetime.now()
        print("\n\n{} took {} time\n".format(f.__name__, toc - tic))
        print("After applying {}\n".format(f.__name__))
        print("Shape of df = {}\n".format(result.shape))
        print("Columns of df are {}\n".format(result.columns))
        print("Index of df is {}\n".format(result.index))
        for i in range(100): print("-", end='')
        return result
    return wrapper
We have created a decorator called df_info, which displays information such as the time taken by the function and the shape, columns, and index of the dataframe after applying any function f.
The advantage of using a decorator is that we get logging for free. You can modify the decorator according to the information you want to log or display after each operation on the dataframe.
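For example, a variant that additionally logs the number of missing values might look like this (df_info_verbose and its output fields are my own illustration, not part of the original analysis):

```python
import datetime as dt
import pandas as pd

def df_info_verbose(f):
    """Like df_info, but also logs the total count of missing values."""
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)
        toc = dt.datetime.now()
        print("{} took {}".format(f.__name__, toc - tic))
        print("Shape of df = {}".format(result.shape))
        # Extra logging: how many NaN cells remain after this step
        print("Missing values = {}".format(result.isna().sum().sum()))
        return result
    return wrapper

@df_info_verbose
def fill_missing(df):
    return df.fillna(0)

# Toy usage: two rows, one missing value before filling
clean = fill_missing(pd.DataFrame({'a': [1.0, None]}))
```

The wrapper's body is entirely up to you; anything you would normally type after a cell (shape checks, null counts, dtype summaries) can live here instead.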
Now, we create the functions from our plan and decorate them with @df_info. Calling a decorated function f(df, *args, **kwargs) will then be equivalent to calling df_info(f)(df, *args, **kwargs).
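A tiny, self-contained illustration of that equivalence (df_info_demo and double are toy names of mine, not part of the analysis):

```python
import pandas as pd

def df_info_demo(f):
    """Minimal decorator: log the shape after applying f."""
    def wrapper(df, *args, **kwargs):
        result = f(df, *args, **kwargs)
        print("{}: shape = {}".format(f.__name__, result.shape))
        return result
    return wrapper

# Decorating with the @ syntax ...
@df_info_demo
def double(df):
    return df * 2

# ... is the same as wrapping by hand:
def double_plain(df):
    return df * 2

double_manual = df_info_demo(double_plain)

df = pd.DataFrame({'x': [1, 2]})
assert double(df).equals(double_manual(df))
```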
@df_info
def create_dateindex(df):
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df

@df_info
def remove_columns(df):
    df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'],
            axis=1, inplace=True)
    return df

@df_info
def fill_missing(df):
    df.fillna(value=0, inplace=True)
    return df

@df_info
def add_state_name(df):
    _df = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
    df = (df
          .reset_index()
          .merge(_df, on='state', how='left', left_index=True))
    df.set_index('date', inplace=True)
    df.rename(columns={'name': 'state_name'}, inplace=True)
    return df

@df_info
def drop_state(df):
    df.drop(columns=['state'], inplace=True)
    return df

@df_info
def sample_daily(df):
    df = df.resample('D').sum()
    return df

@df_info
def add_active_cases(df):
    df['active'] = df['positive'] - df['death'] - df['recovered']
    return df
def aggregate_monthly(df, month):
    df = (df.loc[month]
          .groupby('state_name')
          .agg({'positive': 'first',
                'negative': 'first',
                'pending': 'first',
                'recovered': 'first',
                'death': 'first',
                'hospitalized': 'first',
                'total': 'first',
                'totalTestResults': 'first',
                'deathIncrease': 'sum',
                'hospitalizedIncrease': 'sum',
                'negativeIncrease': 'sum',
                'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))
    return df
@df_info
def create_month_only(df, month):
    df_current = aggregate_monthly(df, month)
    # January wraps around to December of the previous year
    if int(month[-2:]) == 1:
        prev_month = str(int(month[:4]) - 1) + '-12'
    else:
        prev_month = month[:5] + '{:02d}'.format(int(month[-2:]) - 1)
    df_previous = aggregate_monthly(df, prev_month)
    df = df_current.sub(df_previous)
    return df
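The previous-month string arithmetic is easy to get wrong (note the January wrap-around case), so it can help to pull it out and sanity-check it in isolation. The helper name previous_month is mine; it uses the same slicing as create_month_only:

```python
def previous_month(month: str) -> str:
    """Return the 'YYYY-MM' string one month before `month`."""
    if int(month[-2:]) == 1:  # January wraps to December of the prior year
        return str(int(month[:4]) - 1) + '-12'
    return month[:5] + '{:02d}'.format(int(month[-2:]) - 1)

# Quick sanity checks on both branches
assert previous_month('2020-05') == '2020-04'
assert previous_month('2020-01') == '2019-12'
```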
However, these functions make changes in place (side effects), i.e. they modify the originally loaded dataframe. To solve this, we add a function called start_pipeline, which returns a copy of the dataframe.
def start_pipeline(df):
    return df.copy()
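To see why the copy matters, here is a small demonstration (with a toy frame of my own) of how an inplace operation propagates back to the loaded data when no copy is made:

```python
import pandas as pd

def start_pipeline(df):
    return df.copy()

# Without the copy, an inplace operation mutates the loaded data:
raw = pd.DataFrame({'positive': [1.0, None]})
raw.fillna(0, inplace=True)
assert raw['positive'].isna().sum() == 0  # the original frame was modified

# With start_pipeline, the original survives intact:
raw2 = pd.DataFrame({'positive': [1.0, None]})
clean = start_pipeline(raw2).fillna(0)
assert raw2['positive'].isna().sum() == 1  # original untouched
```

Starting every chain with start_pipeline means you can re-run any pipeline cell without reloading the CSV.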
Now, let's use these functions to accomplish the earlier tasks using pipe:
df_daily = (df.pipe(start_pipeline)
              .pipe(create_dateindex)
              .pipe(remove_columns)
              .pipe(fill_missing)
              .pipe(add_state_name)
              .pipe(sample_daily)
              .pipe(add_active_cases))
create_dateindex took 0:00:00.003388 time
After applying create_dateindex
Shape of df = (5113, 34)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
remove_columns took 0:00:00.002087 time
After applying remove_columns
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
fill_missing took 0:00:00.006381 time
After applying fill_missing
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
add_state_name took 0:00:00.015122 time
After applying add_state_name
Shape of df = (5113, 23)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade', 'state_name'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
sample_daily took 0:00:00.017170 time
After applying sample_daily
Shape of df = (135, 19)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
'commercialScore', 'negativeRegularScore', 'negativeScore',
'positiveScore', 'score', 'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
'2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
'2020-01-30', '2020-01-31',
...
'2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
'2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
'2020-06-03', '2020-06-04'],
dtype='datetime64[ns]', name='date', length=135, freq='D')
----------------------------------------------------------------------------------------------------
add_active_cases took 0:00:00.002020 time
After applying add_active_cases
Shape of df = (135, 20)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
'commercialScore', 'negativeRegularScore', 'negativeScore',
'positiveScore', 'score', 'grade', 'active'],
dtype='object')
Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
'2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
'2020-01-30', '2020-01-31',
...
'2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
'2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
'2020-06-03', '2020-06-04'],
dtype='datetime64[ns]', name='date', length=135, freq='D')
Check out all the logs displayed above: we can see in detail how each operation changed the data without having to print the dataframe after every step.
fig2 = px.line(df_daily, x=df_daily.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'
fig3 = px.line(df_daily, x=df_daily.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'
df_may = create_month_only(
    df=(df.pipe(start_pipeline)
          .pipe(create_dateindex)
          .pipe(remove_columns)
          .pipe(fill_missing)
          .pipe(add_state_name)),
    month='2020-05')
create_dateindex took 0:00:00.002492 time
After applying create_dateindex
Shape of df = (5113, 34)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
remove_columns took 0:00:00.002219 time
After applying remove_columns
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
fill_missing took 0:00:00.001883 time
After applying fill_missing
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
add_state_name took 0:00:00.014981 time
After applying add_state_name
Shape of df = (5113, 23)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade', 'state_name'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
create_month_only took 0:00:00.031071 time
After applying create_month_only
Shape of df = (56, 13)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'total', 'totalTestResults', 'deathIncrease', 'hospitalizedIncrease',
'negativeIncrease', 'positiveIncrease', 'totalTestResultsIncrease'],
dtype='object')
Index of df is Index(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware',
'District Of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio',
'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
'US Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming'],
dtype='object', name='state_name')
fig4 = px.bar(df_may, x=df_may.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total Deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'
You can observe how easily the pipe functionality has achieved the required task in a clean and organized way. Also, the original dataframe is intact, unaffected by the above operations.
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
20200604 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
Finally, you can create a module (e.g. processing.py) and keep all the above functions in it. You can then simply import them and use them directly, which cleans up the notebook further.
While loading the modules, load the “autoreload” extension so that you can change code in the modules and have the changes picked up automatically. For more info, see the autoreload documentation.
%load_ext autoreload
%autoreload 2
from processing import *
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
Although the approach may look like an inefficient way of coding, it is very effective in the long run, since you will not have to spend hours maintaining the notebook. Given that the functions are well written and well defined, they are ready for production.
The code is also easily shareable, since anyone can understand it, unlike in the previous approach. Likewise, this approach keeps the notebook maintainable even for complex analysis tasks.
You do not need to think about the logic of the analysis at the beginning. You can just plan your tasks and write down the required functions, which already gives you a kind of framework and helps you stay on track. The calm that follows is likely to have a greater impact on innovation. Then, you can finally define the logic at the end to make it all work.
You might have noticed that the pipe functionality gives you the ability to modify the tasks or the flow easily, by commenting out or adding functions in the pipeline. For example, suppose you don’t want to remove the columns or sample the data daily. You can achieve this simply by commenting out those lines as shown below:
df_daily = (df.pipe(start_pipeline)
.pipe(create_dateindex)
# .pipe(remove_columns)
.pipe(fill_missing)
.pipe(add_state_name)
.pipe(drop_state)
# .pipe(sample_daily)
.pipe(add_active_cases))
In this approach, you know what is happening in each step which makes it a lot easier to debug. Furthermore, since all the operations are functions, you can easily debug the code by performing unit tests or using other methods on the functions.
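For instance, a single pipeline step can be unit-tested in isolation. The fill_missing body below is a hypothetical stand-in (the article does not show its actual implementation), used only to illustrate the idea:

```python
import pandas as pd

def fill_missing(df):
    # Hypothetical implementation: replace NaNs with 0.
    return df.fillna(0)

def test_fill_missing():
    df = pd.DataFrame({"positive": [1.0, None], "death": [None, 2.0]})
    out = fill_missing(df)
    # No NaNs remain, and the filled values are zeros.
    assert out.isna().sum().sum() == 0
    assert out.loc[1, "positive"] == 0.0

test_fill_missing()
```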
This approach prevents miles of scrolling and is also more readable than the previous approach. By looking at the code, you can easily understand what operations are being performed on the data, and you can see the effect of those operations at each step using a decorator.
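The step-by-step logs shown earlier (step name, elapsed time, shape, columns, index) can be produced by a small decorator along these lines; this is a sketch, not the author’s exact implementation:

```python
import functools
from datetime import datetime

import pandas as pd

def log_step(func):
    """Log a pipeline step's name, elapsed time, and resulting DataFrame shape."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        start = datetime.now()
        result = func(df, *args, **kwargs)
        print(f"{func.__name__} took {datetime.now() - start} time")
        print(f"After applying {func.__name__}")
        print(f"Shape of df = {result.shape}")
        print("-" * 100)
        return result
    return wrapper

@log_step
def start_pipeline(df):
    # Work on a copy so the original DataFrame stays intact.
    return df.copy()

df = pd.DataFrame({"state": ["AK", "AL"], "positive": [513, 19072]})
clean = df.pipe(start_pipeline)
```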
Example:
Let us consider cooking chicken. When we do so, we don’t describe the steps like this:
temperature = 210 celsius
food1 = Chicken
food2 = Season(food1, with Spices)
food3 = Season(food2, with Gravy)
Serve(PutInOven(food3, temperature), on a plate)
But instead, we describe it the following way:
temperature = 210 celsius
Chicken.Season(with Spices)
.Season(with Gravy)
.PutInOven(temperature)
.Serve()
The pipe functionality helps us to write code in the latter way, which is also much more readable.
In production, we turn the project into a Python package. You can then import your code and use it in notebooks with a single cell, so you do not need to rewrite code for the same task in multiple notebooks.
Once your functions have been moved to a separate module, two levels of abstraction are obtained: analysis and data manipulation.
You can fiddle around on a high level and keep the details on a low level. The notebook then becomes the summary and a user interface where you can very quickly make nice little charts instead of manipulating data or performing analytical steps to get a result.
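As a sketch of that split (the function names come from the pipeline above, but the add_active_cases body is an assumed illustration, not the article’s actual code):

```python
import pandas as pd

# Low level, in processing.py: small, testable data-manipulation steps.
def start_pipeline(df):
    # Copy so the original DataFrame is never mutated.
    return df.copy()

def add_active_cases(df):
    # Assumed definition of "active" for illustration only.
    df["active"] = df["positive"] - df["recovered"] - df["death"]
    return df

# High level, in the notebook: one readable chain, ready for charting.
df = pd.DataFrame({"positive": [100, 200], "recovered": [40, 90], "death": [5, 10]})
result = df.pipe(start_pipeline).pipe(add_active_cases)
```

The notebook cell stays a one-line summary of the analysis, while the details live in the module.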
Hence, following these practices while coding in pandas, or while performing similar tasks like building scikit-learn or other ML pipelines, can be extremely beneficial for developers. All 4 problems mentioned at the beginning have been solved in this approach. Thus, giving utmost priority to clarity and interoperability, we should remember that it’s a lot easier to solve a problem if we understand the problem well.
Moreover, if you find writing such code difficult, there is an open-source package called scikit-lego, maintained by Vincent and MatthijsB with contributions from all around the world. It does the hard work of creating such pipelines for you, along with additional features like custom logging. Do check it out.
Also, if you have any confusion or suggestions, feel free to comment. I am all ears. Thank you.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created and hosted your static website on GitHub pages/custom domain and also learned to automate deployment.
Now, let’s integrate the Disqus comment system and Google Analytics into our site to analyze in-depth details about your website’s visitors.
First, sign up or log in to Disqus and choose the “I want to install Disqus on my site” option. Then fill up the fields like Website Name and Category. In the website name field, you may enter any name for your website.
When asked for your platform, choose the “I don’t see my platform listed” option and click Configure.
Next, go to Edit Settings and click General. There, you can see your Disqus website shortname in the Shortname field. Copy that name.
Finally, open publishconf.py and pelicanconf.py and add the following line, using your own shortname:
DISQUS_SITENAME = 'ayushblog-2'
That’s it. You can check by using the command:
(.venv) $ fab reserve
Then visit localhost:8000. At the bottom, you can see the Disqus comment section. Sometimes it doesn’t appear on localhost, but don’t worry, it will still appear on the live website.
You can push the updated source code to view the changes on your website.
You can configure the appearance and other preferences of the comment system by logging in to this link: Disqus admin panel. You can also choose to moderate comments before making them visible to the public. If you do so, you can moderate the comments in the Moderate section of Disqus, where you can approve or delete each comment.
Now, just push the source code and you are ready to go.
You can approve the comments by logging in to Disqus.
Now, let’s learn to integrate Google Analytics in our website.
Select Web as the platform and click Next. Fill in the required details and click Create.
You will then receive a Tracking ID. Copy the Tracking ID and paste it in the file publishconf.py as shown below.
GOOGLE_ANALYTICS = "UA-166070073-1"
That’s all. Now just push the updated source code to the source branch, and the analytics of your website will be tracked by Google.
To view your detailed analytics, just log in to the Google Analytics website.
You can view detailed stats of your website visitors like the number of total visitors, active visitors, bounce rate, location of visitors. You can also view the real-time data of your visitors. How cool is that?
Congratulations!!
You have completed the entire series of articles on Creating and deploying static websites using Markdown and the Python library
Pelican.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created and hosted your static website on GitHub pages and custom domain as well.
Now, let’s learn to automate the process of pushing to source and deploying to the master branch by using continuous integration tools like Travis-CI so that you don’t need to manually push to two branches every time you update your site.
First, visit Travis-CI and log in using your GitHub account.
Then, add your repository yourusername.github.io
in the Repositories section as shown below.
Now, we need to generate a personal access token in GitHub. Go to Generate new token for GitHub, check the public_repo checkbox, and click Generate Token as shown below.
Go back to the Travis-CI repository and open its settings. Add the following environment variables as shown in the gif: GH_TOKEN (the personal access token you just generated) and TRAVIS_REPO_SLUG (username/username.github.io).
Open fabfile.py, delete the publish function along with the wrapper @hosts(production), and replace it with the following lines:
# @hosts(production) > Removed
def publish(commit_message):
"""Automatic deploy to GitHub Pages"""
env.msg = commit_message
env.GH_TOKEN = os.getenv('GH_TOKEN')
env.TRAVIS_REPO_SLUG = os.getenv('TRAVIS_REPO_SLUG')
clean()
local('pelican -s publishconf.py')
with hide('running', 'stdout', 'stderr'):
local("ghp-import -m '{msg}' -b {github_pages_branch} {deploy_path}".format(**env))
local("git push -fq https://{GH_TOKEN}@github.com/{TRAVIS_REPO_SLUG}.git {github_pages_branch}".format(**env))
Next, create a .travis.yml configuration file in the root directory for automatic deployment.
(.venv) $ touch .travis.yml
Add the following lines in it.
language: python
cache: pip
branches:
only:
- source
python:
- 3.5
install:
- gem install sass
- pip install -r requirements.txt
- git config --global user.email "your-github-email"
- git config --global user.name "your-github-name"
- git clone https://github.com/alexandrevicenzi/Flex.git themes/Flex
- git clone https://github.com/getpelican/pelican-plugins
script:
- fab publish:"Build site"
The above file is responsible for testing every pushed source code and also for automatic deployment of the output folder contents (HTML) to the master branch. Change the theme repository in the above file if you are using a different theme.
You can also add the Travis-CI build status badge to your README.md file by adding the following line:
# Personal Blog [![Build Status](https://travis-ci.org/username/username.github.io.svg?branch=source)](https://travis-ci.org/username/username.github.io)
Note that you must replace username
by your username in the above line. The above line adds the build status (passed or failed) in your repository as shown below.
You can click the build button to view the build status in Travis-CI in detail. If a build fails, you can see exactly why it failed and make the necessary corrections in the source code.
When a build fails, the new contents are not pushed to the master branch, so your website won’t be updated with broken content caused by an error in the source code. This keeps your website running without errors at all times.
Hence, after a successful configuration, every time you update your source code and push to the source branch, automatic testing occurs and the website’s HTML files are pushed to the master branch.
Learn to integrate Disqus comments and Google Analytics in your website in the part 5 of the article.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created your static website locally. You surely want to share it with the public so that they can view your articles. There are several ways of deploying your websites but the best option is by using GitHub pages.
It is completely free of cost. You don’t need to buy any hosting services. Github hosts your website for free.
It is secure and reliable as the website is hosted in a secure GitHub server.
It becomes easy to organize and keep track of your source code.
If you don’t already have a GitHub account, go to GitHub and create one.
Log in to GitHub and create a repository with the name username.github.io (replace username with your GitHub username; for example, mine is ayushkumarshah.github.io), then copy the clone address as shown in the gif below.
Now, go to your project directory, i.e. web_development, and perform the following commands to add the remote repository that you just created to your project. Use the URL that you copied from your repository instead of the URL used in the command below.
(.venv) $ git init
(.venv) $ git remote add origin 'https://github.com/username/username.github.io.git'
Also, add your GitHub email address and username to git. You can find your username by logging into github and finding the name as shown below.
(.venv) $ git config --global user.email "your-github-email"
(.venv) $ git config --global user.name "your-github-name"
We will be using 2 branches in our repository: source and master.
source: stores the source code of our project (i.e. all folders and files except the output folder).
master: stores the contents of the output folder, i.e. all the HTML files generated after building the site. The master branch will be used to host the website on GitHub Pages.
So, let’s switch to the source branch.
# Create and switch to a new branch source
(.venv) $ git checkout -b source
Create a .gitignore file to mark the files which should not be added to the repository.
(.venv) $ touch .gitignore
Copy all the lines from this link: .gitignore and paste it in the newly created .gitignore
file.
You may also create a Readme.md file for your repository. Create it in the main directory web_development:
(.venv) $ touch Readme.md
You can write the contents of the Readme.md file similar to mine. You can copy it from this link: Readme.md and modify it accordingly.
Next, prepare for publishing by opening publishconf.py and modifying/adding the following settings.
SITEURL = 'https://username.github.io'
DOMAIN = SITEURL
FEED_DOMAIN = SITEURL
HTTPS = True
Similarly, open fabfile.py and add the following settings if they are not present already.
# Local path configuration (can be absolute or relative to fabfile)
env.deploy_path = 'output'
DEPLOY_PATH = env.deploy_path
env.msg = 'Update blog' # Commit message
# Github Pages configuration
env.github_pages_branch = "master"
# Port for `serve`
SERVER = '127.0.0.1'
PORT = 8000
Also, make sure the deploy() function is present in fabfile.py.
def deploy():
"""Push to GitHub pages"""
env.msg = "Build site"
clean()
preview()
local("ghp-import -m '{msg}' -b {github_pages_branch} {deploy_path}".format(**env))
local("git push origin {github_pages_branch}".format(**env))
So, your source code is ready. Let’s add it to the repository using the following commands:
(.venv) $ git add -A
(.venv) $ git commit -m "Add source code for the first post"
(.venv) $ git push origin source
(.venv) $ fab deploy
Note: Always work in the source branch during development. The deploy() function pushes the contents of the output folder into the master branch, so you don’t need to worry about it. Every time you add an article, just follow the steps above: first push the source code to the source branch, then run the deploy function.
Congratulations! Your site has been hosted on GitHub Pages publicly. To check your website, open your browser on any device and visit https://your-username.github.io.
That’s it. You have now learned to create and host your static website in GitHub pages.
You might want to host your site to a custom domain of your choice rather than GitHub pages. This can be done completely free of cost if you have a custom domain registered already.
If you don’t have a custom domain, you can buy them at several websites like Namesilo, GoDaddy, etc.
You can make your domain secure and manageable using the Cloudflare service.
First, create a file named CNAME inside the content/extra directory.
(.venv) $ touch content/extra/CNAME
Then, add (copy and paste) the name of your site i.e. www.your-site-name.com
in the file CNAME
.
Change the value of SITEURL
in the publishconf.py
file.
SITEURL = 'https://your-site-name.com'
Now, you need to redirect your site to point to your content hosted in GitHub-pages. For that, you need to use your domain management site which you used to buy the domain or some 3rd party management site like Cloudflare.
Go to the DNS section and add A records one by one to redirect your site to the following 4 IP addresses (GitHub Pages). You can see the image below for reference; I used Cloudflare for DNS management.
If you want to redirect the GitHub-pages site to your custom domain, then go to the repository settings and add your site name in the Custom domain field of the Github Pages section as shown below.
Congratulations!! Your blog has been redirected to your own custom domain. You can browse your site and check if it is working.
This is an optional step. Perform these steps only if you want to modify or tweak the theme (Flex in this case) to give your website a slightly different look. You may modify colors and styles, or even change the design (if you have some knowledge of web development, i.e. HTML and CSS).
Since you cloned the theme’s repository directly, modifying it in place is not a good idea, as you would then have issues updating the theme to a newer version.
Hence, you will create your own version of the theme repository instead i.e. forking the repository. I will demonstrate using the Flex theme but you may follow the same steps for other themes as well. Follow these steps (also shown in the gif below):
(.venv) $ rm -rf themes/Flex
Now, open and fork the Flex repository or the repository of the theme you chose.
Then, copy the https
(not ssh) link of the forked repository.
Now, clone the forked repo in your project.
In the command below, paste the link you copied from your forked repo instead of https://github.com/ayushkumarshah/Flex.git, and use themes/name_of_theme as the 2nd argument.
(.venv) $ git clone 'https://github.com/ayushkumarshah/Flex.git' 'themes/Flex'
Now, you may modify the theme by tweaking the HTML and CSS files inside the themes/Flex/ directory and then committing the changes to the forked repository separately.
In the next part, learn to automate the process of pushing to source and deploying to the master branch by using Continuous Integration tools like Travis-CI in the part 4 of the article.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Now that you have set up your website, the next step is to start writing some content – articles, blogs, about page, contact page, etc. We will use Markdown for writing any content you create. If you have not heard about Markdown, don’t worry as I will guide you with examples.
First, let us create the required directories for articles and pages.
(.venv) $ mkdir content/articles
(.venv) $ mkdir content/pages
Now, let’s create a file for your first article inside the articles directory. Note that the touch
command is being used only to create a file. You can create a file without using any command too. It’s up to you.
(.venv) $ touch content/articles/first_article.md
Also, create files for about, contact, and 404 error page.
(.venv) $ touch content/pages/about.md content/pages/contact.md content/pages/404.md
At this point, your project structure should look like:
web_development
├── content
│   ├── articles
│   │   └── first_article.md
│   └── pages
│       ├── 404.md
│       ├── about.md
│       └── contact.md
├── fabfile.py
├── output
│   └── ... (many html files)
├── themes
│   └── Flex/
├── pelican-plugins
│   └── ... (various plugin directories)
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
Before writing the actual content, we need to define the metadata for the article. Metadata carries important information about
your article. Open the file first_article.md
and add the following metadata lines:
Title: My First Article
Date: 2020-03-17 00:00
Modified: 2020-03-17 00:00
Category: Blog
Slug: first-article
Summary: In this article, I have written my first article using Markdown.
Tags: pelican, markdown
Authors: Ayush Kumar Shah
Status: published
These keywords are pretty much self-explanatory. I will just explain the new ones.
Slug defines the name of the HTML file to be generated.
Status: Choose one option among draft, published, or hidden.
draft: In this mode, the article is not shown on the main page but can be viewed by visiting localhost:8000/drafts/first-article after serving the site (i.e. after running fab reserve). It is useful for showing an article to your friends while you are still writing it, before publishing.
published: In this mode, the article is shown on the main page after serving the site, e.g. at localhost:8000/2020/03/first-article.
hidden: In this mode, the article is just not shown on the website.
Useful tip: Use VSCode as a text editor to manage your project and write content, since it can preview .md files (content files written in Markdown) directly using the Preview functionality. Hence, it becomes easy to see how your content will look in real-time.
Now add the following lines in the file first_article.md just below the metadata defined above.
This is an example from [https://markdown-it.github.io/](https://markdown-it.github.io/)
---
# h1 Heading
## h2 Heading
### h3 Heading
#### h4 Heading
##### h5 Heading
###### h6 Heading
## Horizontal Rules
___
---
***
## Emphasis
**This is bold text**
__This is bold text__
*This is italic text*
_This is italic text_
~~Strikethrough~~
## Blockquotes
> Blockquotes can also be nested...
>> ...by using additional greater-than signs right next to each other...
> > > ...or with spaces between arrows.
## Lists
Unordered
+ Create a list by starting a line with `+`, `-`, or `*`
+ Sub-lists are made by indenting 2 spaces:
- Marker character change forces new list start:
* Ac tristique libero volutpat at
+ Facilisis in pretium nisl aliquet
- Nulla volutpat aliquam velit
+ Very easy!
Ordered
1. Lorem ipsum dolor sit amet
2. Consectetur adipiscing elit
3. Integer molestie lorem at massa
## Code
Inline `code`
Indented code
// Some comments
line 1 of code
line 2 of code
line 3 of code
Block code "fences"
```
Sample text here...
```
Syntax highlighting
``` python
numbers = [9, 8, 4, 1, 5]
for i, number in enumerate(numbers):
    print(i, number)

def hello(message):
    print(message)

message = "Hello World"
hello(message)
```
## Tables
| Option | Description |
| ------ | ----------- |
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |
Right aligned columns
| Option | Description |
| ------:| -----------:|
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |
## Links
[link text](http://dev.nodeca.com)
[link with title](http://nodeca.github.io/pica/demo/ "title text!")
## Images
![Minion](https://octodex.github.com/assets/img/sample/minion.png)
![Stormtroopocat](https://octodex.github.com/assets/img/sample/stormtroopocat.jpg "The Stormtroopocat")
You can view the complete Markdown cheatsheet for reference.
Now, let’s view how your article looks on the website.
Close the previous fab reserve process if it is still running by pressing Ctrl+C (or Cmd+C). Then,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
Congratulations, your first article has been published on your website. It was as simple as that. Compare the article output in the website as shown in the image above and the markdown code to understand how the code works.
Now, let’s create our pages. Pages are more permanent and don’t require detailed metadata like the articles do; an about me page, for example. The links to the pages are added in the navigation bar.
Open about.md and add the following metadata lines as you did before. As you can see, the metadata is not as detailed as before.
Title: About
Date: 2020-03-18 08:00
Modified: 2020-03-18 08:00
Write the content for your about page using Markdown, designing the page however you want. I have provided a simple example for my about page below.
Hello! I’m Ayush Kumar Shah. To talk about myself, I love football (Cristiano Ronaldo is my idol), traveling, and photography. I have a great interest in Artificial Intelligence and am pursuing my career in the same.
I am a Machine Learning Engineer at [Fusemachines](https://www.fusemachines.com) working with global client teams to build state-of-the-art products. I have worked in the domains of Recommendation System, Nepali Handwritten character recognition, and waste classification during my time at Fusemachines.
My inquisitive nature, craving for knowledge, and longing for novelty and innovation strengthen my passion to work and learn daily to increase my knowledge horizon.
I am mostly into tech and so, my blog will be a reflection of whatever new thing I learn about tech.
Thank you for visiting my blog.
You can configure your contact.md file similarly. Have a look at a simple example below and create a similar one.
Title: Contact
Date: 2020-03-18 03:27
Modified: 2020-03-18 03:27
Slug: contact
If you have any questions or want to discuss something, please feel free to contact me at
[ayush.kumar.shah@gmail.com](mailto:ayush.kumar.shah@gmail.com)
[Twitter](https://twitter.com/ayushkumarshah7)
[Linkedin](https://np.linkedin.com/in/ayush7).
Likewise, if you want to inform about any type of error in my blogs, you can open an issue [here](https://github.com/ayushkumarshah/ayushkumarshah.github.io/issues/new).
Finally, let’s define a page for error as well. Open 404.md
and add the following lines
Title: Not Found
Status: hidden
Save_as: 404.html
Sorry, that page doesn't seem to exist. Please double-check the address or
head to the [home page][1].
[1]: {index}
Finally, your site is ready. You may now add more articles by creating more .md files in the content/articles/ directory and following similar steps.
Although your site has been built, it is not publicly available. Learn how to host your site in GitHub pages or a custom domain in part 3 of the article.
If you have any confusion in any article, feel free to comment on your queries. I will be more than happy to help. I am also open to suggestions and feedbacks.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I followed the same steps mentioned in this series to create the blog website you are reading right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
Creating and deploying static websites using Markdown and the Python library Pelican
Pelican is a static site generator, written in Python.
Project Structure: Create any folder for your project. For example web_development
$ mkdir web_development
$ cd web_development
First, install virtualenv via pip and then create a virtual environment for your project.
$ pip install virtualenv
$ virtualenv .venv
Activate the virtual environment
$ source .venv/bin/activate
Now, to install pelican and all packages and dependencies that we will be using later, we need to create a requirements.txt file
(.venv) $ touch requirements.txt
and paste the lines from this link: requirements.txt into the file.
Then just run the following command inside the virtual environment to install all these packages
(.venv) $ pip install -r requirements.txt
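In case the linked file is unavailable, a minimal requirements.txt might look like the sketch below. This is an assumption: the author’s file likely pins additional packages and exact versions. Fabric3 is the Python 3-compatible fork of Fabric 1.x, whose fabfile API matches the fab commands used later in this tutorial.

```shell
# A minimal illustrative requirements.txt (assumption: the author's
# linked file pins more packages and exact versions).
cat > requirements.txt << 'EOF'
pelican
Markdown
Fabric3
EOF
```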
Let’s now run a quickstart configuration script for pelican.
(.venv) $ pelican-quickstart
Pelican asks a series of questions to help you get started by building required configuration files.
Welcome to pelican-quickstart v3.7.1.
This script will help you create a new Pelican-based website. Please answer the following questions so this script can generate the files needed by Pelican.
> Where do you want to create your new web site? [.] .
> What will be the title of this web site? Ayush Kumar Shah
> Who will be the author of this web site? Ayush Kumar Shah
> What will be the default language of this web site? [en] en
> Do you want to specify a URL prefix? e.g., http://example.com (Y/n) n
> Do you want to enable article pagination? (Y/n) Y
> How many articles per page do you want? [10] 5
> What is your time zone? [Europe/Paris] Asia/Kathmandu
> Do you want to generate a Fabfile/Makefile to automate generation and publishing? (Y/n) Y
> Do you want an auto-reload & simpleHTTP script to assist with theme and site development? (Y/n) n
> Do you want to upload your website using FTP? (y/N) N
> Do you want to upload your website using SSH? (y/N) N
> Do you want to upload your website using Dropbox? (y/N) N
> Do you want to upload your website using S3? (y/N) N
> Do you want to upload your website using Rackspace Cloud Files? (y/N) N
> Do you want to upload your website using GitHub Pages? (y/N) y
> Is this your personal page (username.github.io)? (y/N) y
Done. Your new project is available at /Users/ayushkumarshah/Desktop/Blog_writing/web
While answering the questions, please keep these things in mind:
Title and Author: Replace Ayush Kumar Shah with the title and the author’s name that you want.
Default language: You can set any language using its standard ISO 639-1 two-letter code.
Article Pagination: If you do not want to limit the number of articles on a page, enter n.
Time zone: Choose your time zone from Wikipedia’s list of tz database time zones.
You may delete the Makefile as we will not be using it.
(.venv) $ rm Makefile
After successfully running the command, your directory should look like this:
web_development
├── content/
├── fabfile.py
├── output/
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
Here is the purpose of each of these files:
- content/: holds your articles and pages, written in Markdown.
- fabfile.py: defines the fab commands used to build, serve, and publish the site.
- output/: where Pelican writes the generated HTML files.
- pelicanconf.py: the main configuration file, used during development.
- publishconf.py: additional settings applied when publishing the site.
- requirements.txt: the list of Python packages the project depends on.
So far, we have installed and configured Pelican successfully. Let’s generate our first website and preview what it looks like. Make sure the .venv environment is activated.
Open fabfile.py and replace all instances of SocketServer with socketserver (SocketServer is the Python 2 name; the module was renamed to socketserver in Python 3).
# import SocketServer
import socketserver
...
# class AddressReuseTCPServer(SocketServer.TCPServer):
class AddressReuseTCPServer(socketserver.TCPServer):
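If you prefer, the same replacement can be done with a single sed command. The snippet below is a runnable sketch that first creates a tiny stand-in fabfile.py so it is self-contained; in your project, run only the sed line against your real file.

```shell
# Stand-in fabfile.py for demonstration only -- in your project,
# skip this line and run sed on the real fabfile.py.
printf 'import SocketServer\nclass AddressReuseTCPServer(SocketServer.TCPServer):\n    pass\n' > fabfile.py

# Rewrite every SocketServer occurrence in place. The -i.bak form
# works on both GNU and BSD/macOS sed and keeps a fabfile.py.bak backup.
sed -i.bak 's/SocketServer/socketserver/g' fabfile.py
cat fabfile.py
```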
Now, we are ready to generate and view our site.
(.venv) $ fab build
(.venv) $ fab serve
You may also run a single command equivalent to the 2 commands above:
(.venv) $ fab reserve
In case an error occurs, open fabfile.py again and change the import line to
import SocketServer as socketserver
After running the fab command, you will notice HTML files generated inside the output folder. These files are the HTML files of your website.
Your website should already be running on port 8000 of your localhost. To view it, open your browser and visit localhost:8000
Congratulations, you have generated your first website.
Now that we have built our website, let’s make the design more beautiful and responsive. There are numerous Pelican themes to choose from; both live previews of the themes and their repositories are available, so you can check them out and select the one that suits your website. My favorite themes are Flex (live version), Pneumatic (live version), and Bulrush (live version). I am currently using the Bulrush theme, with some custom modifications, for my website.
I will demonstrate using the Flex theme.
First, clone the Flex repository, or the repository of the theme you chose. Make sure you are inside the web_development directory.
(.venv) $ git clone https://github.com/alexandrevicenzi/Flex.git themes/Flex
Here, the 2nd argument is the destination directory of the theme in your project. You can replace themes/Flex with themes/name_of_theme.
Now, specify the path of your theme in the configuration file pelicanconf.py
by adding the following line:
THEME = 'themes/Flex'
Although the Flex theme requires no additional plugins, most themes require various Pelican plugins, so let’s download pelican-plugins into your project. (You may skip this step if you are using the Flex theme.)
(.venv) $ git clone https://github.com/getpelican/pelican-plugins.git
Now, add the path of the plugins in pelicanconf.py
in a similar way as before by adding the following lines:
PLUGIN_PATHS = ['./pelican-plugins']
Also, add a line specifying the list of plugins your theme requires. You can find the required plugin names in the documentation of the theme’s GitHub repository. The three plugins most commonly required are listed below; add the following line in the same pelicanconf.py file.
PLUGINS = ['sitemap', 'post_stats', 'feed_summary']
Some themes may require additional plugins, for which you have to search the documentation. Another way to find the required plugin names is to skip this step for now: when you later try to serve your website, the error message will state the names of the missing plugins, which you can then add to the pelicanconf.py file.
At this stage, your directory structure should look like this:
web_development
├── content/
├── fabfile.py
├── output/
│   └── ... (many HTML files)
├── themes/
│   └── Flex/
├── pelican-plugins/
│   └── ... (various plugin directories)
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
If it doesn’t, then you probably did something wrong.
So, by now we have successfully installed the Flex theme on our website. We can check the new theme by generating and serving the website again. Close the previous process (fab reserve) if it is still running by pressing Ctrl+C. Now,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
You should see your website in a new theme.
However, it is not customized to include your profile. So, let’s customize the site by adding some attributes of the theme.
First, let’s create some folders inside the content directory.
(.venv) $ mkdir content/images
(.venv) $ mkdir content/extra
Let’s replace the default profile photo and favicon with your own. Copy the profile image profile.png and the collection of favicon files, like favicon.ico, favicon-16x16.png, etc., into the images directory you just created.
Note: A favicon is the small pixel icon that appears in the browser tab beside the site name; it serves as branding for your website. You can create one using various online tools like realfavicongenerator or the favicon generator from websiteplanet (thanks to Estefany for mentioning this site, which allows image sizes up to 5 MB).
Different themes have different attributes or configurations.
Check the documentation or the README.md file of the respective theme. For the Flex theme, a sample pelicanconf.py can be found inside the docs folder; check it for reference and compare it with the live version of the theme. You can find more configuration examples in the Flex Wiki.
I will demonstrate using a sample configuration for this theme. For that, add the following lines to your pelicanconf.py file.
### Flex configurations
PLUGINS = ['sitemap', 'post_stats', 'feed_summary']
SITEURL = 'http://localhost:8000'
SITETITLE = 'Ayush Kumar Shah' # Replace with your name
SITESUBTITLE = 'Ideas and Thoughts'
SITELOGO = '/images/profile.png'  # the files you copied into content/images
FAVICON = '/images/favicon.ico'
# Sitemap Settings
SITEMAP = {
'format': 'xml',
'priorities': {
'articles': 0.6,
'indexes': 0.6,
'pages': 0.5,
},
'changefreqs': {
'articles': 'monthly',
'indexes': 'daily',
'pages': 'monthly',
}
}
# Add a link to your social media accounts
SOCIAL = (
('github', 'https://github.com/ayushkumarshah'),
('envelope', 'mailto:ayushkumarshah@gmail.com'),
('linkedin','https://np.linkedin.com/in/ayush7'),
('twitter','https://twitter.com/ayushkumarshah7'),
('facebook','https://www.facebook.com/ayushkumarshah'),
('reddit','https://www.reddit.com/user/ayushkumarshah')
)
STATIC_PATHS = ['images', 'extra']
# Main Menu Items
MAIN_MENU = True
MENUITEMS = (('Archives', '/archives'),('Categories', '/categories'),('Tags', '/tags'))
# Code highlighting theme
PYGMENTS_STYLE = 'friendly'
ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}/'
ARTICLE_SAVE_AS = ARTICLE_URL + 'index.html'
PAGE_URL = '{slug}/'
PAGE_SAVE_AS = PAGE_URL + 'index.html'
ARCHIVES_SAVE_AS = 'archives.html'
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'
MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'
# Feed generation is usually not desired when developing
FEED_DOMAIN = SITEURL
FEED_ALL_ATOM = 'feeds/all.atom.xml'
CATEGORY_FEED_ATOM = 'feeds/%s.atom.xml'
TRANSLATION_FEED_ATOM = None
AUTHOR_FEED_ATOM = None
AUTHOR_FEED_RSS = None
# HOME_HIDE_TAGS = True
FEED_USE_SUMMARY = True
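To see what the URL settings above actually produce, the snippet below simulates the substitution of ARTICLE_URL with Python’s str.format (Pelican performs the equivalent substitution internally; the date and slug are illustrative):

```shell
# Simulate Pelican's ARTICLE_URL expansion for an article dated
# 2020-03-18 with slug 'my-first-post'.
python3 -c "
from datetime import datetime
url = '{date:%Y}/{date:%m}/{slug}/'.format(date=datetime(2020, 3, 18),
                                           slug='my-first-post')
print(url)
"
# → 2020/03/my-first-post/
```

Because ARTICLE_SAVE_AS appends index.html to this path, each article is written to its own directory, which gives you clean URLs without .html extensions.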
You may remove the LINKS variable from the configuration file pelicanconf.py, as you don’t need those links. We can check our new configuration by generating and serving the website again. Close the previous process (fab reserve) if it is still running by pressing Ctrl+C. Now,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
You should see your website with your new configuration. Feel free to modify it as per your liking.
Congratulations, you have completed the basic setup for Pelican.
However, your site has no content. Start writing content in the part 2 of the article.
If you have any confusion about any article, feel free to comment with your queries. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I followed the same steps mentioned in this series to create the blog website you are reading right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
In this article, I will explain the complete steps to build a static website like the one I have built (shahayush) using Pelican, a static site generator written in Python; deploy it on GitHub Pages with continuous integration (CI) using Travis-CI; and link it to your custom domain name, all without requiring knowledge of HTML, CSS, databases, or deployment pipelines. Furthermore, I will explain how to integrate a comment system called Disqus into your site and how to link Google Analytics to it, so that you can analyze in depth the visitors to your website.
The most striking advantage of this approach is that the complete process is free, except for the fee to register your domain name. You can avoid even this fee by hosting the site only on GitHub Pages, where you can host a website like your_username.github.io. The only prerequisites are basic knowledge of Python and of Markdown for writing the articles. You might have used Markdown in a Jupyter notebook or in the Readme.md file of a GitHub repository. Don’t worry if you are completely unfamiliar with them; you can still pick them up through this article, as they are extremely simple.
By part 2 of the article series, you will have your website ready, and it will look something like this:
My current website is also built using the same methods discussed in this article series.
Demo website: medius by Onur Aslan
Demo website: pneumatic by Kevin Yap
Details on how to use these themes will be discussed in Part 1 of this article series. I just wanted to give an overview of how the website will look in the end.
You may wonder why to use Pelican when the same thing can be achieved using WordPress, which has a wider community. I have listed a few advantages of Pelican over WordPress from Vincent Cheng’s article Migrating from Wordpress to Pelican.
Now that you have an overall insight into what this article series is about, along with the benefits of using Pelican, get started by building your own website. For ease, I have divided the article into 6 parts:
Click on the respective links to get started.