Commands | Description |
---|---|
echo $SHELL | Display the name of the active shell (bash, zsh, or others) |
man command-name Eg: man bash | grep -C2 '$@' | Get information about a command. Here, grep returns 2 leading and trailing lines of context around the matching text '$@' |
command-name --help | Show usage information for the command |
pwd | Print the current path |
pwd | pbcopy | Copy the current path to the clipboard (use xclip or xsel on Linux) |
cd - | Go back to the previous location |
take new_dir | Create new_dir and cd into it, i.e. mkdir new_dir; cd new_dir (a zsh/oh-my-zsh function) |
ls -al | List files. a - all, l - long listing format; a leading d means directory, - means regular file |
ls -ls | List files with detailed info (permissions, date, symbolic links) |
ls -1 | wc -l | Count the number of files in a directory |
cat filename | Show the contents of the file filename |
tee Eg: df -h | tee usage.txt | Display the stdout of a command and also write it to a file |
free -h | Show RAM usage - used and free |
df -h | Show disk usage - used and free |
du -sh . | Show the total size occupied by the current directory |
du -sh * | Show the size of each file or folder in the current directory |
du -sh * | tail -1 | Show the size of the last entry listed for the current directory |
ps ax[c] [| less] | List currently running programs. c - easier to read, less - easier to navigate |
pidof process-name | Get the process ID of a running process |
kill process-id | Kill the process |
uname [-[s][a]] | Display the OS/kernel name. a - detailed info |
stat filename | Display file status |
alias alias-name | Show the actual command behind the alias |
date +format E.g. date +%d/%m/%Y | Print the date in the given format |
cal [-3] [[month] year] E.g. cal -3, cal june 1996, cal 1997, or cal | Calendar. -3 shows the previous and the next month as well |
less file.txt | Show file contents (similar to cat but allows moving up and down) |
more file.txt | Show file contents page by page (similar to less, but less featureful) |
rm -ir | Remove. i - prompt for permission before each file, r - delete recursively |
grep [-i] text_to_search /path/to/file | Search for text in a file. i - case-insensitive |
grep -v text_to_search /path/to/file | Show lines not matching the pattern in a file |
command > file.txt | Redirect the output of command to file.txt. Creates the file if it does not exist; if it exists, overwrites its contents |
command >> file.txt | Redirect the output of command to file.txt. Creates the file if it does not exist; if it exists, appends to its contents |
find / -name "file_name" [2>/dev/null] Eg: find / -name "*backup*" 2>/dev/null | Find a file starting from the root directory. 2>/dev/null redirects the error stream (2) to /dev/null, where it is discarded |
find . -not -name "file_name" | Find files not matching the filename |
find . -name "file_name" | xargs -I % rm % | Find and delete files matching the filename |
find . -name "file_name" -exec rm -i {} \; | Find and delete files matching the filename |
find . -name "file_name" -exec grep -i "Hello" {} \; | Find files matching the filename and search them for "Hello" |
find -E . -regex ".*/file_name[0-9].sh" | Find files matching the regular expression (-E is macOS/BSD-only syntax) |
find -E . -not -regex ".*/file_name[0-9].sh" | Find files not matching the regular expression (-E is macOS/BSD-only syntax) |
command | grep text_to_search Eg: find / -name "backup" 2>/dev/null | grep $USER | Use a pipe to combine grep with other commands |
awk | Very powerful command for pattern scanning and processing |
<C-T> | fzf: fuzzy-find files or directories (requires fzf) |
<C-R> | fzf: fuzzy-find commands in history |
<Esc-C> | fzf: fuzzy-find directories from the current path |
top or htop or ytop or gotop | Process info and CPU usage (htop and ytop need to be installed) |
tree [-aldf][-L level][-P pattern][-I pattern][-o filename] | Display a directory's contents as a tree. a - all files, l - follow symbolic links, d - directories only, f - print full paths, L - limit the number of levels, I - exclude files matching pattern, P - list only files matching pattern, o - output to filename (requires tree) |
lsof /dev/nvidia* | awk '{print $2}' | Display the IDs of processes using the CUDA/GPU devices |
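As a quick sanity check of the redirection and tee rows above (`>` overwrites, `>>` appends, `tee` writes and displays), here is a minimal sketch; the filenames are made up for the demo:

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp"

# '>' truncates: only the last write survives.
echo "first"  > out.txt
echo "second" > out.txt
cat out.txt                # prints: second

# '>>' appends: both lines survive.
echo "first"  > out.txt
echo "second" >> out.txt
cat out.txt                # prints: first, then second

# tee shows the output on stdout AND writes it to a file in one step.
echo "disk usage report" | tee usage.txt
cat usage.txt              # prints: disk usage report
```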
To access servers hosted on the remote machine from the local machine
$ ssh -NL port1_server:localhost:port1_local [-NL port2_server:localhost:port2_local]{multiple ports possible} username@remote-ip-address
Example:
$ ssh -NL 8888:localhost:8888 ayush@192.168.100.7
$ scp username@remote-ip:/some/remote/directory/\{file1,file2,file3\} /localpath
$ scp username@remote-ip:'/path1/file1 /path2/file2 /path3/file3' /localPath
Generate ssh key:
Using ed25519 (more secure; recommended)
$ ssh-keygen -t ed25519
Using RSA
$ ssh-keygen -t rsa -b 3072
Save ssh host info
Modify this file: ~/.ssh/config
Host *
AddKeysToAgent yes
UseKeychain yes
IdentityFile ~/.ssh/id_rsa (i.e. path/to/key)
Host targaryen
HostName 192.168.1.10
User daenerys
Port 7654
Host tyrell
HostName 192.168.10.20
Host martell
HostName 192.168.10.50
Host *ell
User oberyn
Host * !martell
LogLevel INFO
Host *
User root
Compression yes
Copy your public key to the server so the password need not be re-entered every time
Run this on the client (not the server):
ssh-copy-id -i path/to/key.pub username@server-ip-address
Example:
ssh-copy-id -i ~/.ssh/id_rsa.pub ayush@192.168.1.107
Open a server in Nautilus / the file explorer on Linux
File explorer: Other locations > Connect to server > sftp://username@ip/
tmux | Create a tmux session with default window name 0 |
tmux new -As name | Create a tmux session with a name, or attach to it if it already exists |
tmux ls | List the active tmux sessions |
tmux a -t name | Attach to an existing tmux session |
tmux kill-session -t name | Kill an existing tmux session |
<prefix> = <c-B> (default), can be changed to <c-A> | |
<prefix> [%"] | Split panes (% - left/right, " - top/bottom) |
<c-D> | Exit (close the current pane) |
<prefix> D | Choose a client to detach |
<prefix> c | Create a new window (appears in the status bar) |
<prefix> 0 | Switch to window 0 |
<prefix> 1 | Switch to window 1 |
<prefix> x | Kill the current window |
<prefix> d | Detach tmux (exit back to the normal terminal) |
<prefix> z | Toggle the active pane between zoomed and unzoomed |
<prefix> space | Switch between split orientations |
<prefix> ! | Break the current pane out into a new window |
<prefix> { or } | Swap the current pane with the previous / next pane |
<prefix> ( or ) | Switch between tmux sessions (previous / next) |
<prefix> <C-o> | Rotate panes within a window |
<prefix> :move-window -t 2 | Renumber the current window to 2, if 2 does not exist |
<prefix> :resize-pane -D n | Resize the current pane down by n cells |
<prefix> :resize-pane -U n | Resize the current pane up by n cells |
<prefix> :resize-pane -L n | Resize the current pane left by n cells |
<prefix> :resize-pane -R n | Resize the current pane right by n cells |
<prefix> :join-pane [-dhv] [-l size | -p percentage] [-s src-pane] [-t dst-pane] Eg: <prefix> :join-pane -v -s 4 -t :1 | Join one pane to another |
<prefix> <C-s> | Save the current state (requires tmux-resurrect) |
<prefix> <C-r> | Reload the saved state |
Verbs (operations) + Nouns (text objects on which the operation is performed)
[count] [operation] [text object / motion]
:[.]!command | . (dot) - replaces the current line with the command's output |
c | change |
d | delete |
C | change everything from where your cursor is to the end of the line |
D | delete everything from where your cursor is to the end of the line |
dd | delete a line |
x | delete a single character |
y | yank text into the copy buffer. |
yy or Y | yank line into the copy buffer. |
v | highlight one character at a time. |
V | highlight one line at a time. |
<c-v> |
highlight by columns. |
p | paste text after the current line. |
P | paste text on the current line. |
> | Shift Right |
< | Shift Left |
= | Indent |
gU | make uppercase |
gu | make lowercase |
~ | swap case |
Must be combined with verbs
iw | inner word (non whitespace) (works from anywhere in a word) |
aw | word with surrounding white space (works from anywhere in a word) aw ~ W. Difference in position. E.g. For dw, cursor must be at beginning, whereas daw works from any position. |
ib | inner block (the contents of parentheses) |
ab | a block (including the parentheses) |
it | inner tag (the contents of an HTML tag) |
at | a tag block |
i" | inner quotes |
a" | a quote (including the quotes) |
ip | inner paragraph |
ap | a paragraph |
is | inner sentence |
as | a sentence |
Combination examples:
gUiw | capitalize a word |
ci( | change inner bracket |
6dW | delete 6 words |
yis | copy inner sentence |
di" | delete inner quotes |
Can be combined with verbs or used independently
[count] w/W | go a (word / word with whitespace) to right |
[count] b/B | go a (word / word with whitespace) to left |
[count] e/E | go to the end of (word / word with whitespace) |
[count] ]m | go to the beginning of next method |
[count] h / j / k / l | left / down / up / right |
[count] f/F [char] [;,]+ | go to the next / previous occurrence of character (; and , repeat) |
[count] t/T [char] [;,]+ | go to just before the next / previous occurrence of character |
% | move to matching parenthesis pair |
[count] + | down to first non blank char of the line. |
[count]$ | moves the cursor to the end of the line. |
0 | moves the cursor to the beginning of the line. |
G | move to the end of the file. |
gg | move to the beginning of the file. |
]m or [m | Move between methods. |
Combination examples:
3ce | change up to the end of the third word |
d]m | delete up to the start of the next method |
ctL | change up to just before the next occurrence of L |
i | Insert to left of cursor |
a | Insert to right of cursor |
A | insert at end of line |
I | insert at beginning of line |
o | open a new line below the cursor and insert |
O | open a new line above the cursor and insert |
u | undo |
<c-r> | redo the last undo |
/text | search for text |
:%s/text/replacement text/g | search through the entire document for text and replace it with replacement text. |
:%s/text/replacement text/gc | search through the entire document and confirm before replacing text. |
* | search forward for word under cursor |
# | search backward for word under cursor |
:vsplit | vertical split windows |
m[a-zA-Z] | set a custom mark; jump to its exact location with `[mark] or to its line with '[mark] |
g; | go to the last cursor position |
'. | move to the last edit |
:marks | show all marks currently in use |
:w | write |
:w file_name | write the changes to a new file |
:q | quit |
:q! or ZQ | force quit |
:wq or ZZ | write and quit |
:w !sudo tee % | Write with sudo permissions if permission not available |
:bd | remove buffer |
[:vert] :sf filename | find file and open in split mode |
<c-v>, select multiple lines, then I or A and type the required text | insert text at the beginning or end of multiple lines |
q<char> commands q, then [count]@<char> | record a macro into register <char>, end recording with q, and replay it with [count]@<char> |
:ab ipho International Physics Olympiad | Set abbreviation for long terms for easy typing Use <C-v> to prevent expansion |
:norm command Eg: vip then :norm Ithis comes to the left | Apply a sequence of keypresses/commands to each selected line. E.g. select a paragraph and add the text to the left of each line in it |
Global commands Eg: :g/^@/m$ | Apply a command to lines matching a particular pattern. E.g. move all lines starting with @ to the end of the document |
Time travel Eg: :earlier 10m, :earlier 5h, :later 2h | Move to the file state at the specified time in the past or future |
jk (Custom: inoremap jk <Esc>) | <Esc> |
kj (Custom: inoremap kj <Esc>) | <Esc> |
nnoremap <C-c> | <Esc> |
nnoremap <C-s> | :w<CR> |
nnoremap <C-Q> | :wq!<CR> |
Better window navigation | |
nnoremap <C-h> | <C-w>h |
nnoremap <C-j> | <C-w>j |
nnoremap <C-k> | <C-w>k |
nnoremap <C-l> | <C-w>l |
Args is the list of files initially opened, so it's a subset of the buffers.
:args | display args files |
:args **/*.yaml | set the arg list to the matching files |
:sall | open all args files in split mode |
:vert sall | open all args files in vertical split mode |
:windo difft | show differences in all args files |
c-x, c-l | autocomplete |
:vim /TODO/ ## | search in all args files (## expands to the arg list) |
:cdo s/TODO/DONE/g | replace in all args files |
<c-u>, <c-d> | Scroll up / down half a screen |
{ } | Jump up / down between paragraphs (blank lines) |
<c-b>, <c-f> | Scroll up / down a full screen |
<c-y>, <c-e> | Scroll up / down one line |
H / M / L | Navigate to the top / middle / bottom of the screen |
zt | Put current cursor position to top |
zz | Put current cursor position to middle |
zb | Put current cursor position to bottom |
Install any vim plugin manager like vim-plug.
To apply latest settings:
:source $MYVIMRC
First, install ranger
Mac
brew install ranger
Linux
sudo apt install ranger
Install ranger plugin for vim
" Ranger in vim
Plug 'francoiscabrol/ranger.vim'
" Dependency for ranger in neovim
Plug 'rbgrouleff/bclose.vim'
When ranger is open in vim or externally
cw | Rename file/dir: change word |
A | Rename file: append at the end, after the extension |
a | Rename file: append just before the extension |
I | Rename file/dir: insert at the front of the filename/directory |
:bulkrename | Rename a list of files/directories |
:mkdir newdir | Create a new directory |
Space | Highlight/select files/directories |
V | Highlight/select files/directories, similar to visual mode |
uv | Undo highlight/select |
yy | Copy/yank file/dir |
dd | Cut file/dir |
pp | Paste file/dir. If the file exists, a new file is created with _ at the end of the name |
po | Paste, overwriting the existing file/dir |
uy | Undo copy/yank |
dD | Delete |
Z (Custom mapping) | Compress using an external script mapped in ranger |
x (Custom mapping) | Extract using an external script mapped in ranger |
Install Plug tpope/vim-surround
ds['"bB{}t] | delete surrounding quotes/brackets/tags |
cs['"bB{}t] ['"bB{}t] | change one surrounding to another |
ysiw['"bB{}t] | add surrounding around the inner word |
v-select, S['"bB{}t] | add surrounding around the visual selection |
Examples:
<p> Hello </p> | cst<h2> | <h2> Hello </h2> |
if *x>3{ | ysW( | if ( x>3 ) { |
*"hello" | ysWf print<cr(Enter)> | print("hello") |
Install these plugins first
" Show differences with style
Plug 'mhinz/vim-signify'
" Main GIT PLugin :Git
Plug 'tpope/vim-fugitive'
" Git Hub plugin, enables :Gbrowse
Plug 'tpope/vim-rhubarb'
" Git commit browser
Plug 'junegunn/gv.vim'
" Git commit history in each line
<c-o> <c-i> | Jump backward / forward through the jump list (e.g. between buffers) |
:Git diff | Show git differences |
:Gdiffsplit | Show differences in split mode |
:GBrowse | Open the repository in github |
:GV | Show git commit history |
Install COC plugin first
" Intellisense
Plug 'neoclide/coc.nvim', {'branch': 'release'}
gd | Goto Definitions of variable under cursor |
gr | Goto References of variable under cursor |
:CocInstall tool_name E.g. :CocInstall coc-python | Installing coc tools |
:CocUninstall tool_name | Uninstalling coc tools |
:CocList extensions (Tab for autocompletion) | Show extensions |
:CocCommand | execute a COC command |
o | expand/collapse in Coc explorer (First run :CocInstall coc-explorer) |
Install coc-python first
:CocInstall coc-python
Shift K | doc hint |
:Format | autopep8 formatting |
<C-w> w | Switch the cursor between sidebar and code |
<C-n> <C-n> <C-n>, then c / I / A | Multiple cursors: change / insert at the start / insert at the end |
Install fzf in system and fzf plugin
macOS
brew install fzf
# To install useful key bindings and fuzzy completion:
$(brew --prefix)/opt/fzf/install
brew install ripgrep
Linux
sudo apt install fzf
sudo apt install ripgrep
FZF Plugin
Plug 'junegunn/fzf', { 'do': { -> fzf#install() } }
Plug 'junegunn/fzf.vim'
Plug 'airblade/vim-rooter'
:Rg | Find word inside file |
:BLines | Find all occurrences of word in a giant file |
:Lines | Same as above but search in all buffers |
:History: | History of commands run in vim |
:Ag | Similar to :Rg, but uses ag (the silver searcher) instead of ripgrep |
:Buffers | Search through buffers |
> | Tab |
gf | Goto file: open file directly from path written in vim |
Install Startify Plugin for Project management
" Start Screen
Plug 'mhinz/vim-startify'
:SSave | Save session |
:SLoad | Load session |
" Better Comments
Plug 'tpope/vim-commentary'
gc[<count>] <Text object> (Examples below) |
Comment out the target of a motion |
gcap | Comment out a paragraph |
gcc | Comment out the current line |
gc2j | Comment out the current line and 2 lines below it |
Easy remapping
nnoremap <leader>/ :Commentary<CR>
vnoremap <leader>/ :Commentary<CR>
git commit --amend | Amend the previous commit |
git commit -m $'Heading commit\n\nCommit description\nLine 2 of description' | Commit with a commit description in one line |
git push origin -f branchname | Forced push |
git rebase master | Replay the current branch's changes on top of master (pull master before rebasing) |
git log | |
git diff | |
git remote -v | Show repo information |
git reset --hard | Reset the index and working tree, discarding all local changes |
git show | |
git config --global user.name | |
git config --global user.email | |
git reset | Remove files from the current index (the "about to be committed" list) without changing anything else |
git checkout filename | Undo local changes to latest commit |
git stash | Stash local changes temporarily |
git stash list | Show stashed branches |
git stash show | Show the latest stashed file changes |
git stash show -p N | Show the Nth (see number in git stash list) stashed file changes |
git stash drop stash@{index} | Remove the given stash |
git stash clear | Remove all stashes |
git stash list | awk -F: '{ print "\n\n\n\n"; print $0; print "\n\n"; system("git stash show -p " $1); }' | Show the changes in each stash in detail |
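The stash round-trip above can be seen end-to-end in a disposable repository; the filenames and the throwaway identity below are made up for the demo:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "v1" > notes.txt
git add notes.txt
git commit -qm "initial"

echo "v2" > notes.txt        # an uncommitted local change
git stash push -q            # working tree is clean again
cat notes.txt                # prints: v1
git stash pop -q             # the change comes back
cat notes.txt                # prints: v2
```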
Re-apply .gitignore to files that are already tracked:
$ git rm -r --cached .
$ git add .
$ git commit -m "Clean up ignored files"
git reset --hard commit_id (reset to the particular commit. It will destroy any local modifications.)
git stash
git reset --hard commit_id
git stash pop
This saves the modifications, then reapplies that patch after resetting. You could get merge conflicts, if you’ve modified things which were changed since the commit you reset to.
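A minimal sketch of that stash-then-reset workflow in a throwaway repo; here the local edit touches a file the reverted-away commit didn't change, so the pop applies without conflicts:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "one"   > file.txt
echo "draft" > notes.txt
git add .
git commit -qm "c1"
first=$(git rev-parse HEAD)

echo "two" > file.txt
git commit -qam "c2"

echo "more notes" >> notes.txt   # uncommitted work we want to keep
git stash push -q
git reset -q --hard "$first"     # back to c1: file.txt is "one" again
git stash pop -q                 # the uncommitted edit is re-applied
```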
This will create three separate revert commits:
git revert a867b4af 25eee4ca 0766c053
It also takes ranges. This will revert the last two commits:
git revert HEAD~2..HEAD
Similarly, you can revert a range of commits using commit hashes:
git revert a867b4af..0766c053
Reverting a merge commit
git revert -m 1 <merge_commit_sha>
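A sketch of reverting a range in a scratch repo; `--no-edit` keeps it non-interactive, and one revert commit is created per reverted commit:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo 1 > f.txt; git add f.txt; git commit -qm "c1"
echo 2 > f.txt; git commit -qam "c2"
echo 3 > f.txt; git commit -qam "c3"

# Revert the last two commits (c2 and c3), newest first,
# creating two new revert commits.
git revert --no-edit HEAD~2..HEAD

cat f.txt                      # prints: 1
git rev-list --count HEAD      # prints: 5  (3 commits + 2 reverts)
```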
To get just one, you could use rebase -i to squash them afterwards. Or, you could do it manually (be sure to do this at the top level of the repo): get your index and work tree into the desired state, without changing HEAD:
git checkout 0d1d7fc32 .
Then commit. Be sure to write a good message describing what you just did:
git commit
git reset knows five "modes": soft, mixed, hard, merge and keep. I will start with the first three, since these are the modes you'll usually encounter. After that you'll find a nice little bonus, so stay tuned.
soft
When using
git reset --soft HEAD~1
you will remove the last commit from the current branch, but the file changes will stay in your working tree. Also the changes will stay on your index, so following with a git commit will create a commit with the exact same changes as the commit you “removed” before.
mixed
This is the default mode and quite similar to soft. When “removing” a commit with
git reset HEAD~1
you will still keep the changes in your working tree but not in the index; so if you want to "redo" the commit, you will have to add the changes (git add) before committing.
hard
When using
git reset --hard HEAD~1
you will lose all uncommitted changes in addition to the changes introduced in the last commit. The changes won't stay in your working tree, so doing a git status command will tell you that you don't have any changes in your repository.
Tread carefully with this one. If you accidentally remove uncommitted changes which were never tracked by git (that is: committed or at least added to the index), you have no way of getting them back using git.
Bonus (keep)
git reset --keep HEAD~1
is an interesting and useful one. It only resets the files which are different between the current HEAD and the given commit. It aborts the reset if any of these files has uncommitted changes. It basically acts as a safer version of hard.
This mode is particularly useful when you have a bunch of changes and want to switch to a different branch without losing these changes - for example when you started to work on the wrong branch.
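The difference between --soft and --hard described above can be checked in a scratch repo (throwaway names for the demo):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "a"  > f.txt; git add f.txt; git commit -qm "c1"
echo "b" >> f.txt; git commit -qam "c2"

git reset --soft HEAD~1    # drop c2, but keep its change staged
git status --porcelain     # prints: M  f.txt   (staged modification)

git commit -qm "c2 again"  # the commit can be recreated as-is
git reset --hard HEAD~1    # drop it again AND discard the change
git status --porcelain     # prints nothing: the tree is clean
cat f.txt                  # prints: a
```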
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path_to_file" HEAD
git push -f origin master
git rm | rm plus git add combined
git rm --cached | file removed from the index (staging it for deletion on the next commit), but keep your copy in the local file system.
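A small sketch of untracking a file with git rm --cached while keeping it on disk; the filename config.env is made up for the demo:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name  "Demo"

echo "secret" > config.env
git add config.env
git commit -qm "accidentally track config.env"

git rm -q --cached config.env   # stop tracking, keep the local copy
echo "config.env" > .gitignore  # make sure it stays untracked
git add .gitignore
git commit -qm "untrack config.env"

ls config.env                   # the file is still on disk
git ls-files                    # prints: .gitignore  (the only tracked file)
```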
mod + r | Resize mode, then arrow keys or vim keys |
mod + Shift + e | Exit |
mod + d | dmenu |
mod + Shift + c | Reload |
mod + Shift + r | Restart |
Alt + Shift | Change keyboard language |
Create a Brewfile in the root of your project with:
touch Brewfile
Add your dependencies in your Brewfile:
tap "homebrew/cask"
tap "user/tap-repo", "https://user@bitbucket.org/user/homebrew-tap-repo.git"
cask_args appdir: "/Applications"
brew "imagemagick"
brew "denji/nginx/nginx-full", args: ["with-rmtp-module"]
brew "mysql@5.6", restart_service: true, link: true, conflicts_with: ["mysql"]
cask "firefox", args: { appdir: "~/my-apps/Applications" }
cask "google-chrome"
cask "java" unless system "/usr/libexec/java_home --failfast"
mas "1Password", id: 443987910
whalebrew "whalebrew/wget"
cask and mas entries are automatically skipped on Linux. Other entries can be run only on (or not on) Linux with if OS.mac? or if OS.linux?.
You can then easily install all dependencies with:
brew bundle
Any previously-installed dependencies which have upgrades available will be upgraded.
brew bundle will look for a Brewfile in the current directory. Use --file to specify a path to a different Brewfile, or set the HOMEBREW_BUNDLE_FILE environment variable; --file takes precedence if both are provided.
My .Brewfile is stored in the home directory, and the HOMEBREW_BUNDLE_FILE environment variable is set to ~/.Brewfile.
You can skip the installation of dependencies by adding space-separated values to one or more of the following environment variables:
HOMEBREW_BUNDLE_BREW_SKIP
HOMEBREW_BUNDLE_CASK_SKIP
HOMEBREW_BUNDLE_MAS_SKIP
HOMEBREW_BUNDLE_WHALEBREW_SKIP
HOMEBREW_BUNDLE_TAP_SKIP
brew bundle will output a Brewfile.lock.json in the same directory as the Brewfile if all dependencies are installed successfully. This contains dependency and system status information which can be useful for debugging brew bundle failures and replicating a "last known good build" state. You can opt out of this behaviour by setting the HOMEBREW_BUNDLE_NO_LOCK environment variable or passing the --no-lock option.
You may wish to check this file into the same version control system as your Brewfile (or ensure your version control system ignores it if you'd prefer to rely on debugging information from a local machine).
You can create a Brewfile from all the existing Homebrew packages you have installed with:
brew bundle dump
The --force option will allow an existing Brewfile to be overwritten as well. The --describe option will output a description comment above each line. The --no-restart option will prevent restart_service from being added to brew lines with running services.
You can also use a Brewfile to list the only packages that should be installed, removing any package not present or dependent. This workflow is useful for maintainers or testers who regularly install lots of formulae. To uninstall all Homebrew formulae not listed in the Brewfile:
brew bundle cleanup
Unless the --force option is passed, formulae that would be uninstalled will be listed rather than actually uninstalled.
You can check whether there's anything to install or upgrade in the Brewfile by running:
brew bundle check
This provides a successful exit code if everything is up-to-date, making it useful for scripting.
For a list of dependencies that are missing, pass --verbose. This will also check all dependencies, not exiting on the first missing dependency category.
Outputs a list of all of the entries in the Brewfile.
brew bundle list
Pass one of --casks, --taps, --mas, --whalebrew or --brews to limit output to that type; it defaults to --brews. Pass --all to see everything. Note that the type of the package is not included in this output.
Runs an external command within Homebrew’s superenv build environment.
brew bundle exec -- bundle install
This sanitized build environment ignores unrequested dependencies, which makes sure that things you didn't specify in your Brewfile won't get picked up by commands like bundle install, npm install, etc. It will also add compiler flags which will help find keg-only dependencies like openssl, icu4c, etc.
You can choose whether brew bundle restarts a service every time it's run, or only when the formula is installed or upgraded in your Brewfile:
# Always restart myservice
brew 'myservice', restart_service: true
# Only restart when installing or upgrading myservice
brew 'myservice', restart_service: :changed
Does it ring a bell looking at this messy notebook? I am sure you must have created or encountered a similar kind of notebook while performing data analysis tasks in pandas.
Pandas is widely used by data scientists and ML Engineers all around the world to perform all kinds of data related tasks like data cleaning and preprocessing, data analysis, data manipulation, data conversion, etc. However, most of us are not using it right, as seen in the above example, which has decreased our productivity a lot.
You might wonder then what is the correct way to use pandas. Is there any particular way that we can make the notebook clean and modular so that we can increase our productivity?
Luckily, there is a type of quick hack or technique, whatever you may call it, which can be used to greatly improve the workflow and make notebooks not only clean and well organized but highly productive and efficient. The good thing is that you don’t need to install any extra packages or libraries. In the end, your notebook will look something like this.
The way to achieve clean and well-organized pandas notebooks was explored in the presentation Untitled12.ipynb by Vincent D. Warmerdam at PyData Eindhoven 2019.
The presentation, Untitled12.ipynb: Prevent Miles of Scrolling, Reduce the Spaghetti Code from the Copy Pasta, has also been uploaded to YouTube.
In this article, I will briefly summarize the presentation by Vincent D. Warmerdam and then move on to the code implementation (solution) and a few code examples based on the methods used in his presentation.
The Untitled phenomenon
He began his talk by introducing a term called the Untitled phenomenon. The term simply refers to the bad practice of not naming notebook files, which eventually creates an unorganized bunch of Untitled notebooks. This is also why he named the presentation Untitled12.ipynb.
Moreover, not only the bad practice of naming that we follow but also the bad organization of code inside the notebook needs to be improved. Copying and pasting code multiple times creates spaghetti code. This is especially true for a lot of data science based Jupyter notebooks. The goal of his talk was to uncover a great pattern for pandas that would prevent loads of scrolling such that the code behaves like lego. He also gave some useful tricks and tips on how to prevent miles of scrolling and reduce the spaghetti code when creating Jupyter notebooks.
I have initially written a summary of the talk Untitled12.ipynb and explored some common problems in the usual coding style before moving to the solution. If you want to jump directly to the coding solution for creating a clean pandas notebook using a pipeline, click the link above. However, I recommend reading about the common problems first.
I will be talking about the following topics which will more or less revolve around his talk.
At the beginning of the presentation, he began by discussing the following points that highlight the importance of workflows and the need of jupyter-notebook and pandas over excel:
We want to separate the data from the analysis: the analysis should not modify the raw data. The raw data should be safe from these modifications so that it can be reused later as well. However, this is not possible in Excel.
We want to be able to automate our analysis. The main aim of programming and workflow is automation. Our tasks become a lot easier if we can automate the analysis using a pandas script rather than performing the analysis every time using Excel.
We want our analysis to be reproducible i.e. we must be able to reproduce the same analysis results on the data at a later time in the future.
We should not pay a third party obscene amounts of money for something as basic as arithmetic. This budget is better allocated towards innovation and education of staff.
However, the current style of coding in pandas and jupyter notebook has solved only the last point.
Let’s explore the common practice of writing pandas code and try to point out the major problems in such approaches.
Initially, I will show the general workflow that most of us follow while using pandas. I will be performing some analysis on the real COVID 19 dataset of the U.S. states obtained from The COVID Tracking Project which is available under the Creative Commons CC BY-NC-4.0 license. The dataset is updated each day between 4 pm and 5 pm EDT.
After showing the common approach, I will point out the major pitfalls and then move on to the solution.
First, I will download the U.S. COVID-19 dataset using the API provided by The COVID Tracking Project
!mkdir data
!wget -O data/covid19_us_states_daily.csv https://covidtracking.com/api/v1/states/daily.csv
!wget -O data/state_info.csv https://covidtracking.com/api/v1/states/info.csv
--2020-06-05 16:34:10-- https://covidtracking.com/api/v1/states/daily.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.63.231, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.63.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/covid19_us_states_daily.csv’
data/covid19_us_sta [ <=> ] 987.40K 3.11MB/s in 0.3s
2020-06-05 16:34:11 (3.11 MB/s) - ‘data/covid19_us_states_daily.csv’ saved [1011093]
--2020-06-05 16:34:12-- https://covidtracking.com/api/v1/states/info.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.50.87, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.50.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/state_info.csv’
data/state_info.csv [ <=> ] 27.67K --.-KB/s in 0.02s
2020-06-05 16:34:13 (1.43 MB/s) - ‘data/state_info.csv’ saved [28329]
import pandas as pd
# Importing plotly library for plotting interactive graphs
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import chart_studio
import chart_studio.plotly as py
The first step is generally to read or import the data
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
20200604 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
After taking a glance at the data, I realize that the date index is not parsed as a date, so I convert it to a proper datetime format.
df.index = pd.to_datetime(df.index, format="%Y%m%d")
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
2020-06-04 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
Then, I try to view some additional information about the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 5113 non-null object
1 positive 5098 non-null float64
2 negative 4902 non-null float64
3 pending 842 non-null float64
4 hospitalizedCurrently 2591 non-null float64
5 hospitalizedCumulative 2318 non-null float64
6 inIcuCurrently 1362 non-null float64
7 inIcuCumulative 576 non-null float64
8 onVentilatorCurrently 1157 non-null float64
9 onVentilatorCumulative 198 non-null float64
10 recovered 2409 non-null float64
11 dataQualityGrade 4012 non-null object
12 lastUpdateEt 4758 non-null object
13 dateModified 4758 non-null object
14 checkTimeEt 4758 non-null object
15 death 4388 non-null float64
16 hospitalized 2318 non-null float64
17 dateChecked 4758 non-null object
18 fips 5113 non-null int64
19 positiveIncrease 5113 non-null int64
20 negativeIncrease 5113 non-null int64
21 total 5113 non-null int64
22 totalTestResults 5113 non-null int64
23 totalTestResultsIncrease 5113 non-null int64
24 posNeg 5113 non-null int64
25 deathIncrease 5113 non-null int64
26 hospitalizedIncrease 5113 non-null int64
27 hash 5113 non-null object
28 commercialScore 5113 non-null int64
29 negativeRegularScore 5113 non-null int64
30 negativeScore 5113 non-null int64
31 positiveScore 5113 non-null int64
32 score 5113 non-null int64
33 grade 0 non-null float64
dtypes: float64(13), int64(14), object(7)
memory usage: 1.4+ MB
You can see that several columns are of no use, so I decide to remove them.
df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'],
axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 5113 non-null object
1 positive 5098 non-null float64
2 negative 4902 non-null float64
3 pending 842 non-null float64
4 recovered 2409 non-null float64
5 death 4388 non-null float64
6 hospitalized 2318 non-null float64
7 dateChecked 4758 non-null object
8 positiveIncrease 5113 non-null int64
9 negativeIncrease 5113 non-null int64
10 total 5113 non-null int64
11 totalTestResults 5113 non-null int64
12 totalTestResultsIncrease 5113 non-null int64
13 deathIncrease 5113 non-null int64
14 hospitalizedIncrease 5113 non-null int64
15 hash 5113 non-null object
16 commercialScore 5113 non-null int64
17 negativeRegularScore 5113 non-null int64
18 negativeScore 5113 non-null int64
19 positiveScore 5113 non-null int64
20 score 5113 non-null int64
21 grade 0 non-null float64
dtypes: float64(7), int64(12), object(3)
memory usage: 918.7+ KB
I also realize that there are a lot of missing (NaN or null) values, so I replace them with 0.
df.fillna(value=0, inplace=True)
df.head()
state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 |
I also want to add a column corresponding to the state name instead of the abbreviation. So, I merge state_info with the current dataframe.
df2 = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
df3 = (df
.reset_index()
.merge(df2, on='state', how='left', left_index=True))
df3.head()
date | state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alaska |
1 | 2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alabama |
2 | 2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 | Arkansas |
3 | 2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 | American Samoa |
4 | 2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 | Arizona |
I realize that the date index was lost during the merge, so I set date back as the index. Also, it is better to rename the name column to state_name.
df3.set_index('date', inplace=True)
df3.rename(columns={'name': 'state_name'}, inplace=True)
df3.head()
state | positive | negative | pending | recovered | death | hospitalized | dateChecked | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | state_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||||||
2020-06-04 | AK | 513.0 | 59584.0 | 0.0 | 376.0 | 10.0 | 0.0 | 2020-06-04T00:00:00Z | 8 | 1907 | 60097 | 60097 | 1915 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alaska |
2020-06-04 | AL | 19072.0 | 216227.0 | 0.0 | 11395.0 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 221 | 3484 | 235299 | 235299 | 3705 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | 0.0 | Alabama |
2020-06-04 | AR | 8067.0 | 134413.0 | 0.0 | 5717.0 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 0 | 0 | 142480 | 142480 | 0 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | 0.0 | Arkansas |
2020-06-04 | AS | 0.0 | 174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2020-06-01T00:00:00Z | 0 | 0 | 174 | 174 | 0 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | 0.0 | American Samoa |
2020-06-04 | AZ | 22753.0 | 227002.0 | 0.0 | 5172.0 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 520 | 4710 | 249755 | 249755 | 5230 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | 0.0 | Arizona |
Now that the data is ready for some analysis, I decide to plot the death count in each state, indexed by date, using the interactive plotly library.
fig1 = px.line(df3, x=df3.index, y='death', color='state')
fig1.update_layout(xaxis_title='date', title='Total deaths in each state (Cumulative)')
py.plot(fig1, filename = 'daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/1/'
Note: These plots are interactive, so you can zoom in or out, pinch, hover over the graph, download it, and so on.
Now, I decide to calculate the total deaths in the US across all states and plot it.
df4 = df3.resample('D').sum()
df4.head()
positive | negative | pending | recovered | death | hospitalized | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | deathIncrease | hospitalizedIncrease | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||
2020-01-22 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-23 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-24 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-25 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
2020-01-26 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
fig2 = px.line(df4, x=df4.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'
I also want to calculate the number of active cases, i.e.
active = positive - death - recovered
df4['active'] = df4['positive'] - df4['death'] - df4['recovered']
Now, after calculating the active column, I want to plot active cases instead of deaths. So, I go back to the previous cell, replace death with active, and regenerate the plot.
In [25]: df4['death'].plot()
In [25]: df4['active'].plot()
fig3 = px.line(df4, x=df4.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'
Then I decide to calculate the statistics for the month of May only. Since the data is cumulative, I need to subtract the April data from the May data to find the increase in the various statistics during May, after which I plot the results.
df5 = (df3.loc['2020-05']
.groupby('state_name')
.agg({'positive': 'first',
'negative': 'first',
'pending': 'first',
'recovered': 'first',
'death': 'first',
'hospitalized': 'first',
'total': 'first',
'totalTestResults': 'first',
'deathIncrease': 'sum',
'hospitalizedIncrease': 'sum',
'negativeIncrease': 'sum',
'positiveIncrease': 'sum',
'totalTestResultsIncrease': 'sum'}))
df6 = (df3.loc['2020-04']
.groupby('state_name')
.agg({'positive': 'first',
'negative': 'first',
'pending': 'first',
'recovered': 'first',
'death': 'first',
'hospitalized': 'first',
'total': 'first',
'totalTestResults': 'first',
'deathIncrease': 'sum',
'hospitalizedIncrease': 'sum',
'negativeIncrease': 'sum',
'positiveIncrease': 'sum',
'totalTestResultsIncrease': 'sum'}))
df7 = df5.sub(df6)
df7.head()
positive | negative | pending | recovered | death | hospitalized | total | totalTestResults | deathIncrease | hospitalizedIncrease | negativeIncrease | positiveIncrease | totalTestResultsIncrease | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state_name | |||||||||||||
Alabama | 10884.0 | 119473.0 | 0.0 | 9355.0 | 362.0 | 866.0 | 130357 | 130357 | 106 | -112 | 45594 | 4846 | 50440 |
Alaska | 79.0 | 32497.0 | 0.0 | 116.0 | 1.0 | 0.0 | 32576 | 32576 | -5 | 7 | 17327 | -157 | 17170 |
American Samoa | 0.0 | 171.0 | -17.0 | 0.0 | 0.0 | 0.0 | 154 | 171 | 0 | 0 | 171 | 0 | 171 |
Arizona | 12288.0 | 141132.0 | 0.0 | 3262.0 | 586.0 | 1829.0 | 153420 | 153420 | 290 | 660 | 95076 | 5929 | 101005 |
Arkansas | 3998.0 | 77138.0 | 0.0 | 3970.0 | 72.0 | 309.0 | 81136 | 81136 | 19 | -93 | 37973 | 1266 | 39239 |
fig4 = px.bar(df7, x=df7.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'
Now that I have demonstrated the usual approach followed in a pandas notebook, let's discuss its problems.
The flow of the notebook is very difficult to follow and also creates problems. For example, we may define a variable below the plot that needs it. In the code above, we created df4['active'] below the cell in which it was needed, so the notebook may raise errors when run from top to bottom by others. Also, you may have to scroll the notebook for miles and miles.
When the notebook is shared, the other person faces a lot of problems executing or understanding it. For instance, the dataframe names carry no information about their contents; they simply run from df to df7, which creates a lot of confusion. But you want a notebook that is easy to iterate on and that you can share with your colleagues.
With this approach, your code is not ready to move into production: you end up having to rewrite the whole notebook first, which is not efficient.
The notebook in its current condition cannot be automated, since many problems can occur: errors during execution, missing files referenced by hard-coded paths, and so on.
Although the code may produce an interesting conclusion or the desired output, we cannot be sure that the conclusion is even correct.
Despite all these problems, this style of flow remains common when making a notebook: while coding, people enjoy seeing the code work as they check the outputs, and so they keep continuing in the same way.
Follow a naming convention for the notebook, as suggested by Cookiecutter Data Science, that shows the owner and the order in which the analysis was done. You can use the format <step>-<ghuser>-<description>.ipynb (e.g., 0.1-ayush-visualize-corona-us.ipynb).
Load the data and then think in advance about all the analysis steps or tasks you will be doing in the notebook. You don't need to work out the logic right away; just keep the steps in mind.
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
You know that, initially, you want to clean the data and make sure the columns and indexes are in a proper, usable format. So why not create a function for each subtask on the dataframe and name it accordingly?
For example, you first want to make the index a proper datetime object. Then you may want to remove unneeded columns, fill missing values, and add the state name. Just write these functions down without even thinking about the logic; you can add the logic later. This way, you will stay on track and not get lost.
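Such a plan can be sketched as empty stubs first (a sketch of mine; the bodies below are placeholders, and the real logic comes later in this post):

```python
import pandas as pd

# Planned cleaning steps, written down as stubs before any logic exists.
# Each stub just documents intent and returns the dataframe unchanged.

def create_dateindex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the integer index (e.g. 20200604) to a DatetimeIndex."""
    return df  # TODO

def remove_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the columns that will not be used in the analysis."""
    return df  # TODO

def add_state_name(df: pd.DataFrame) -> pd.DataFrame:
    """Merge in the full state name from state_info.csv."""
    return df  # TODO
```

Writing the skeleton first keeps the notebook's intended flow visible even before any step is implemented.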
The functions themselves will be created after we first write a decorator. Before adding them, let's think about some additional utility that would be helpful: during a pandas analysis, you often check the shape, columns, and other information associated with the dataframe after performing an operation. A decorator can automate this process.
A decorator is simply a function that takes a function and returns a function. It's really functional, right? Haha. Don't get confused by the definition; it is not as difficult as it sounds. We will see how it works in the code below.
Also, if you are not familiar with decorators or want to learn more about them, you can read the article by Geir Arne Hjelle.
import datetime as dt
def df_info(f):
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)
        toc = dt.datetime.now()
        print("\n\n{} took {} time\n".format(f.__name__, toc - tic))
        print("After applying {}\n".format(f.__name__))
        print("Shape of df = {}\n".format(result.shape))
        print("Columns of df are {}\n".format(result.columns))
        print("Index of df is {}\n".format(result.index))
        for i in range(100): print("-", end='')
        return result
    return wrapper
We have created a decorator called df_info, which displays information such as the time taken by the function and the shape, columns, and index of the dataframe after applying any function f.
The advantage of using a decorator is that we get logging for free. You can modify the decorator according to the information you want to log or display after each operation on the dataframe.
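For example, a variant that additionally logs the number of missing values might look like this (df_info_verbose and its output fields are my own illustration, not part of the original analysis):

```python
import datetime as dt
import pandas as pd

def df_info_verbose(f):
    """Like df_info, but also logs the total count of missing values."""
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)
        toc = dt.datetime.now()
        print("{} took {}".format(f.__name__, toc - tic))
        print("Shape of df = {}".format(result.shape))
        # Extra logging: how many NaN cells remain after this step
        print("Missing values = {}".format(result.isna().sum().sum()))
        return result
    return wrapper

@df_info_verbose
def fill_missing(df):
    return df.fillna(0)

# Toy usage: two rows, one missing value before filling
clean = fill_missing(pd.DataFrame({'a': [1.0, None]}))
```

The wrapper's body is entirely up to you; anything you would normally type after a cell (shape checks, null counts, dtype summaries) can live here instead.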
Now, we create the functions from our plan and decorate them with @df_info. Calling a decorated function f(df, *args, **kwargs) will then be equivalent to calling df_info(f)(df, *args, **kwargs).
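A tiny, self-contained illustration of that equivalence (df_info_demo and double are toy names of mine, not part of the analysis):

```python
import pandas as pd

def df_info_demo(f):
    """Minimal decorator: log the shape after applying f."""
    def wrapper(df, *args, **kwargs):
        result = f(df, *args, **kwargs)
        print("{}: shape = {}".format(f.__name__, result.shape))
        return result
    return wrapper

# Decorating with the @ syntax ...
@df_info_demo
def double(df):
    return df * 2

# ... is the same as wrapping by hand:
def double_plain(df):
    return df * 2

double_manual = df_info_demo(double_plain)

df = pd.DataFrame({'x': [1, 2]})
assert double(df).equals(double_manual(df))
```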
@df_info
def create_dateindex(df):
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df

@df_info
def remove_columns(df):
    df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'],
            axis=1, inplace=True)
    return df

@df_info
def fill_missing(df):
    df.fillna(value=0, inplace=True)
    return df

@df_info
def add_state_name(df):
    _df = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
    df = (df
          .reset_index()
          .merge(_df, on='state', how='left', left_index=True))
    df.set_index('date', inplace=True)
    df.rename(columns={'name': 'state_name'}, inplace=True)
    return df

@df_info
def drop_state(df):
    df.drop(columns=['state'], inplace=True)
    return df

@df_info
def sample_daily(df):
    df = df.resample('D').sum()
    return df

@df_info
def add_active_cases(df):
    df['active'] = df['positive'] - df['death'] - df['recovered']
    return df
def aggregate_monthly(df, month):
    df = (df.loc[month]
          .groupby('state_name')
          .agg({'positive': 'first',
                'negative': 'first',
                'pending': 'first',
                'recovered': 'first',
                'death': 'first',
                'hospitalized': 'first',
                'total': 'first',
                'totalTestResults': 'first',
                'deathIncrease': 'sum',
                'hospitalizedIncrease': 'sum',
                'negativeIncrease': 'sum',
                'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))
    return df
@df_info
def create_month_only(df, month):
    df_current = aggregate_monthly(df, month)
    # January wraps around to December of the previous year
    if int(month[-2:]) == 1:
        prev_month = str(int(month[:4]) - 1) + '-12'
    else:
        prev_month = month[:5] + '{:02d}'.format(int(month[-2:]) - 1)
    df_previous = aggregate_monthly(df, prev_month)
    df = df_current.sub(df_previous)
    return df
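The previous-month string arithmetic is easy to get wrong (note the January wrap-around case), so it can help to pull it out and sanity-check it in isolation. The helper name previous_month is mine; it uses the same slicing as create_month_only:

```python
def previous_month(month: str) -> str:
    """Return the 'YYYY-MM' string one month before `month`."""
    if int(month[-2:]) == 1:  # January wraps to December of the prior year
        return str(int(month[:4]) - 1) + '-12'
    return month[:5] + '{:02d}'.format(int(month[-2:]) - 1)

# Quick sanity checks on both branches
assert previous_month('2020-05') == '2020-04'
assert previous_month('2020-01') == '2019-12'
```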
However, these functions make changes in place (side effects), i.e. they modify the originally loaded dataframe. To solve this, we add a function called start_pipeline, which returns a copy of the dataframe.
def start_pipeline(df):
    return df.copy()
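To see why the copy matters, here is a small demonstration (with a toy frame of my own) of how an inplace operation propagates back to the loaded data when no copy is made:

```python
import pandas as pd

def start_pipeline(df):
    return df.copy()

# Without the copy, an inplace operation mutates the loaded data:
raw = pd.DataFrame({'positive': [1.0, None]})
raw.fillna(0, inplace=True)
assert raw['positive'].isna().sum() == 0  # the original frame was modified

# With start_pipeline, the original survives intact:
raw2 = pd.DataFrame({'positive': [1.0, None]})
clean = start_pipeline(raw2).fillna(0)
assert raw2['positive'].isna().sum() == 1  # original untouched
```

Starting every chain with start_pipeline means you can re-run any pipeline cell without reloading the CSV.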
Now, let's use these functions to accomplish the earlier tasks using pipe:
df_daily = (df.pipe(start_pipeline)
              .pipe(create_dateindex)
              .pipe(remove_columns)
              .pipe(fill_missing)
              .pipe(add_state_name)
              .pipe(sample_daily)
              .pipe(add_active_cases))
create_dateindex took 0:00:00.003388 time
After applying create_dateindex
Shape of df = (5113, 34)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
remove_columns took 0:00:00.002087 time
After applying remove_columns
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
fill_missing took 0:00:00.006381 time
After applying fill_missing
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
add_state_name took 0:00:00.015122 time
After applying add_state_name
Shape of df = (5113, 23)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade', 'state_name'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
sample_daily took 0:00:00.017170 time
After applying sample_daily
Shape of df = (135, 19)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
'commercialScore', 'negativeRegularScore', 'negativeScore',
'positiveScore', 'score', 'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
'2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
'2020-01-30', '2020-01-31',
...
'2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
'2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
'2020-06-03', '2020-06-04'],
dtype='datetime64[ns]', name='date', length=135, freq='D')
----------------------------------------------------------------------------------------------------
add_active_cases took 0:00:00.002020 time
After applying add_active_cases
Shape of df = (135, 20)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'positiveIncrease', 'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'deathIncrease', 'hospitalizedIncrease',
'commercialScore', 'negativeRegularScore', 'negativeScore',
'positiveScore', 'score', 'grade', 'active'],
dtype='object')
Index of df is DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
'2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
'2020-01-30', '2020-01-31',
...
'2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
'2020-05-30', '2020-05-31', '2020-06-01', '2020-06-02',
'2020-06-03', '2020-06-04'],
dtype='datetime64[ns]', name='date', length=135, freq='D')
Check out all the logs displayed above: we can see in detail how each operation changed the data without having to print the dataframe after every step.
fig2 = px.line(df_daily, x=df_daily.index, y='death')
fig2.update_layout(xaxis_title='date', title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename = 'total_daily_deaths', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/4/'
fig3 = px.line(df_daily, x=df_daily.index, y='active')
fig3.update_layout(xaxis_title='date', title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename = 'total_daily_active', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/6/'
df_may = create_month_only(
    df=(df.pipe(start_pipeline)
          .pipe(create_dateindex)
          .pipe(remove_columns)
          .pipe(fill_missing)
          .pipe(add_state_name)),
    month='2020-05')
create_dateindex took 0:00:00.002492 time
After applying create_dateindex
Shape of df = (5113, 34)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
'negativeIncrease', 'total', 'totalTestResults',
'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
remove_columns took 0:00:00.002219 time
After applying remove_columns
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
fill_missing took 0:00:00.001883 time
After applying fill_missing
Shape of df = (5113, 22)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
add_state_name took 0:00:00.014981 time
After applying add_state_name
Shape of df = (5113, 23)
Columns of df are Index(['state', 'positive', 'negative', 'pending', 'recovered', 'death',
'hospitalized', 'dateChecked', 'positiveIncrease', 'negativeIncrease',
'total', 'totalTestResults', 'totalTestResultsIncrease',
'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore',
'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
'grade', 'state_name'],
dtype='object')
Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
'2020-06-04', '2020-06-04',
...
'2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
'2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
'2020-01-23', '2020-01-22'],
dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
create_month_only took 0:00:00.031071 time
After applying create_month_only
Shape of df = (56, 13)
Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
'total', 'totalTestResults', 'deathIncrease', 'hospitalizedIncrease',
'negativeIncrease', 'positiveIncrease', 'totalTestResultsIncrease'],
dtype='object')
Index of df is Index(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware',
'District Of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio',
'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
'US Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming'],
dtype='object', name='state_name')
fig4 = px.bar(df_may, x=df_may.index, y='death')
fig4.update_layout(xaxis_title='state_name', title='Total Deaths in the US in May only')
py.plot(fig4, filename = 'total_deaths_May', auto_open=True)
'https://plotly.com/~ayush.kumar.shah/12/'
You can observe how easily the pipe functionality has achieved the required task in a clean and organized way. Also, the original dataframe is intact, unaffected by the above operations.
df.head()
state | positive | negative | pending | hospitalizedCurrently | hospitalizedCumulative | inIcuCurrently | inIcuCumulative | onVentilatorCurrently | onVentilatorCumulative | recovered | dataQualityGrade | lastUpdateEt | dateModified | checkTimeEt | death | hospitalized | dateChecked | fips | positiveIncrease | negativeIncrease | total | totalTestResults | totalTestResultsIncrease | posNeg | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||||||||||||||||
20200604 | AK | 513.0 | 59584.0 | NaN | 13.0 | NaN | NaN | NaN | 1.0 | NaN | 376.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 10.0 | NaN | 2020-06-04T00:00:00Z | 2 | 8 | 1907 | 60097 | 60097 | 1915 | 60097 | 0 | 0 | c1046011af7271cbe2e6698526714c6cb5b92748 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AL | 19072.0 | 216227.0 | NaN | NaN | 1929.0 | NaN | 601.0 | NaN | 357.0 | 11395.0 | B | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 653.0 | 1929.0 | 2020-06-04T00:00:00Z | 1 | 221 | 3484 | 235299 | 235299 | 3705 | 235299 | 0 | 29 | bcbefdb36212ba2b97b5a354f4e45bf16648ee23 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AR | 8067.0 | 134413.0 | NaN | 138.0 | 757.0 | NaN | NaN | 30.0 | 127.0 | 5717.0 | A | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 142.0 | 757.0 | 2020-06-04T00:00:00Z | 5 | 0 | 0 | 142480 | 142480 | 0 | 142480 | 0 | 26 | acd3a4fbbc3dbb32138725f91e3261d683e7052a | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AS | 0.0 | 174.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 6/1/2020 00:00 | 2020-06-01T00:00:00Z | 05/31 20:00 | 0.0 | NaN | 2020-06-01T00:00:00Z | 60 | 0 | 0 | 174 | 174 | 0 | 174 | 0 | 0 | 8bbc72fa42781e0549e2e4f9f4c3e7cbef14ab32 | 0 | 0 | 0 | 0 | 0 | NaN |
20200604 | AZ | 22753.0 | 227002.0 | NaN | 1079.0 | 3195.0 | 375.0 | NaN | 223.0 | NaN | 5172.0 | A+ | 6/4/2020 00:00 | 2020-06-04T00:00:00Z | 06/03 20:00 | 996.0 | 3195.0 | 2020-06-04T00:00:00Z | 4 | 520 | 4710 | 249755 | 249755 | 5230 | 249755 | 15 | 66 | 1fa237b8204cd23701577aef6338d339daa4452e | 0 | 0 | 0 | 0 | 0 | NaN |
Finally, you can create a module (e.g. processing.py) and keep all the above functions in it. You can then simply import them and use them directly, which cleans up the notebook further.
While loading the modules, load the “autoreload” extension so that you can change code in the modules and have the changes picked up automatically. For more info, see the autoreload documentation.
%load_ext autoreload
%autoreload 2
from processing import *
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
Although the approach may look like an inefficient way of coding, it is very effective in the long run, since you will not have to spend hours maintaining the notebook. Given that the functions are well written and well defined, they are ready for production.
The code is also easily shareable, since anyone can understand it, unlike in the previous approach. Likewise, this approach keeps the notebook maintainable even for complex analysis tasks.
You do not need to think about the logic of the analysis at the beginning. You can just plan your tasks and write down the required functions, which already gives you a kind of framework and helps you stay on track. The calm that follows is likely to have a greater impact on innovation. Then, you can finally define the logic at the end to make it all work.
You might have noticed that the pipe functionality gives you the ability to modify the tasks or the flow easily, by commenting out or adding functions in the pipeline. For example, suppose you don’t want to remove the columns or sample the data daily. You can achieve this simply by commenting out those lines as shown below:
df_daily = (df.pipe(start_pipeline)
.pipe(create_dateindex)
# .pipe(remove_columns)
.pipe(fill_missing)
.pipe(add_state_name)
.pipe(drop_state)
# .pipe(sample_daily)
.pipe(add_active_cases))
In this approach, you know what is happening in each step which makes it a lot easier to debug. Furthermore, since all the operations are functions, you can easily debug the code by performing unit tests or using other methods on the functions.
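For instance, a single pipeline step can be unit-tested in isolation. The fill_missing body below is a hypothetical stand-in (the article does not show its actual implementation), used only to illustrate the idea:

```python
import pandas as pd

def fill_missing(df):
    # Hypothetical implementation: replace NaNs with 0.
    return df.fillna(0)

def test_fill_missing():
    df = pd.DataFrame({"positive": [1.0, None], "death": [None, 2.0]})
    out = fill_missing(df)
    # No NaNs remain, and the filled values are zeros.
    assert out.isna().sum().sum() == 0
    assert out.loc[1, "positive"] == 0.0

test_fill_missing()
```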
This approach prevents miles of scrolling and is also more readable than the previous approach. By looking at the code, you can easily understand what operations are being performed on the data, and you can see the effect of those operations at each step using a decorator.
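The step-by-step logs shown earlier (step name, elapsed time, shape, columns, index) can be produced by a small decorator along these lines; this is a sketch, not the author’s exact implementation:

```python
import functools
from datetime import datetime

import pandas as pd

def log_step(func):
    """Log a pipeline step's name, elapsed time, and resulting DataFrame shape."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        start = datetime.now()
        result = func(df, *args, **kwargs)
        print(f"{func.__name__} took {datetime.now() - start} time")
        print(f"After applying {func.__name__}")
        print(f"Shape of df = {result.shape}")
        print("-" * 100)
        return result
    return wrapper

@log_step
def start_pipeline(df):
    # Work on a copy so the original DataFrame stays intact.
    return df.copy()

df = pd.DataFrame({"state": ["AK", "AL"], "positive": [513, 19072]})
clean = df.pipe(start_pipeline)
```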
Example:
Let us consider cooking chicken. When we do so, we don’t describe the steps like this:
temperature = 210 celsius
food1 = Chicken
food2 = Season(food1, with Spices)
food3 = Season(food2, with Gravy)
Serve(PutInOven(food3, temperature), on a plate)
But instead, we describe it the following way:
temperature = 210 celsius
Chicken.Season(with Spices)
.Season(with Gravy)
.PutInOven(temperature)
.Serve()
The pipe functionality helps us to write code in the latter way, which is also much more readable.
In production, we turn the project into a Python package. You can then import your code and use it in notebooks with a single cell, so you do not need to rewrite code for the same task in multiple notebooks.
Once your functions have been moved to a separate module, two levels of abstraction are obtained: analysis and data manipulation.
You can fiddle around on a high level and keep the details on a low level. The notebook then becomes the summary and a user interface where you can very quickly make nice little charts instead of manipulating data or performing analytical steps to get a result.
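As a sketch of that split (the function names come from the pipeline above, but the add_active_cases body is an assumed illustration, not the article’s actual code):

```python
import pandas as pd

# Low level, in processing.py: small, testable data-manipulation steps.
def start_pipeline(df):
    # Copy so the original DataFrame is never mutated.
    return df.copy()

def add_active_cases(df):
    # Assumed definition of "active" for illustration only.
    df["active"] = df["positive"] - df["recovered"] - df["death"]
    return df

# High level, in the notebook: one readable chain, ready for charting.
df = pd.DataFrame({"positive": [100, 200], "recovered": [40, 90], "death": [5, 10]})
result = df.pipe(start_pipeline).pipe(add_active_cases)
```

The notebook cell stays a one-line summary of the analysis, while the details live in the module.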
Hence, following these practices while coding in pandas, or while performing similar tasks like building scikit-learn or other ML pipelines, can be extremely beneficial for developers. All 4 problems mentioned at the beginning have been solved in this approach. Thus, giving utmost priority to clarity and interoperability, we should remember that it’s a lot easier to solve a problem if we understand the problem well.
Moreover, if you find writing such code difficult, there is an open-source package called scikit-lego, maintained by Vincent and MatthijsB with contributions from all around the world. It does the hard work of creating such pipelines for you, along with additional features like custom logging. Do check it out.
Also, if you have any confusion or suggestions, feel free to comment. I am all ears. Thank you.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created and hosted your static website on GitHub pages/custom domain and also learned to automate deployment.
Now, let’s integrate the Disqus comment system and Google Analytics into our site to analyze in-depth details about your website’s visitors.
First, sign up or log in to Disqus and choose the “I want to install Disqus on my site” option. Then fill up the fields like Website Name and Category. In the website name field, you may enter any name for your website.
When asked for your platform, choose the “I don’t see my platform listed” option and click Configure.
Next, go to Edit Settings and click General. There, you can see your Disqus website shortname in the Shortname field. Copy that name.
Finally, open publishconf.py and pelicanconf.py and add the following line, using your own shortname:
DISQUS_SITENAME = 'ayushblog-2'
That’s it. You can check by using the command:
(.venv) $ fab reserve
Then visit localhost:8000. At the bottom, you can see the Disqus comment section. Sometimes it doesn’t appear on localhost, but don’t worry, it will still appear on the live website.
You can push the updated source code to view the changes on your website.
You can configure the appearance and other preferences of the comment system by logging in to this link: Disqus admin panel. You can also choose to moderate comments before making them visible to the public. If you do so, you can moderate the comments in the Moderate section of Disqus, where you can approve or delete each comment.
Now, just push the source code and you are ready to go.
You can approve the comments by logging in to Disqus.
Now, let’s learn to integrate Google Analytics in our website.
Select Web as the platform and click Next. Fill in the required details and click Create.
You will then receive a Tracking ID. Copy the Tracking ID and paste it in the file publishconf.py as shown below.
GOOGLE_ANALYTICS = "UA-166070073-1"
That’s all. Now just push the updated source code to the source branch, and the analytics of your website will be tracked by Google.
To view your detailed analytics, just log in to the Google Analytics website.
You can view detailed stats of your website visitors like the number of total visitors, active visitors, bounce rate, location of visitors. You can also view the real-time data of your visitors. How cool is that?
Congratulations!!
You have completed the entire series of articles on Creating and deploying static websites using Markdown and the Python library
Pelican.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created and hosted your static website on GitHub pages and custom domain as well.
Now, let’s learn to automate the process of pushing to source and deploying to the master branch by using continuous integration tools like Travis-CI so that you don’t need to manually push to two branches every time you update your site.
First, visit Travis-CI and log in using your GitHub account.
Then, add your repository yourusername.github.io
in the Repositories section as shown below.
Now, we need to generate a personal access token in GitHub. Go to Generate new token for GitHub, check the public_repo checkbox, and click Generate Token as shown below.
Go back to the Travis-CI repository and open its settings. Add the following environment variables as shown in the gif: GH_TOKEN (the personal access token you just generated) and TRAVIS_REPO_SLUG (username/username.github.io).
Open fabfile.py, delete the publish function along with the wrapper @hosts(production), and replace it with the following lines:
# @hosts(production) > Removed
def publish(commit_message):
"""Automatic deploy to GitHub Pages"""
env.msg = commit_message
env.GH_TOKEN = os.getenv('GH_TOKEN')
env.TRAVIS_REPO_SLUG = os.getenv('TRAVIS_REPO_SLUG')
clean()
local('pelican -s publishconf.py')
with hide('running', 'stdout', 'stderr'):
local("ghp-import -m '{msg}' -b {github_pages_branch} {deploy_path}".format(**env))
local("git push -fq https://{GH_TOKEN}@github.com/{TRAVIS_REPO_SLUG}.git {github_pages_branch}".format(**env))
Next, create a .travis.yml configuration file in the root directory for automatic deployment.
(.venv) $ touch .travis.yml
Add the following lines in it.
language: python
cache: pip
branches:
only:
- source
python:
- 3.5
install:
- gem install sass
- pip install -r requirements.txt
- git config --global user.email "your-github-email"
- git config --global user.name "your-github-name"
- git clone https://github.com/alexandrevicenzi/Flex.git themes/Flex
- git clone https://github.com/getpelican/pelican-plugins
script:
- fab publish:"Build site"
The above file is responsible for testing every pushed source code and also for automatic deployment of the output folder contents (HTML) to the master branch. Change the theme repository in the above file if you are using a different theme.
You can also add the Travis-CI build status badge to your README.md file by adding the following line:
# Personal Blog [![Build Status](https://travis-ci.org/username/username.github.io.svg?branch=source)](https://travis-ci.org/username/username.github.io)
Note that you must replace username
by your username in the above line. The above line adds the build status (passed or failed) in your repository as shown below.
You can click the build button to view the build status in Travis-CI in detail. If a build fails, you can see exactly why it failed and make the necessary corrections in the source code.
When a build fails, the new contents are not pushed to the master branch, so your website won’t be updated with broken content caused by an error in the source code. This keeps your website running without errors at all times.
Hence, after a successful configuration, every time you update your source code and push to the source branch, automatic testing occurs and the website’s HTML files are pushed to the master branch.
Learn to integrate Disqus comments and Google Analytics in your website in the part 5 of the article.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Up to this point, you have created your static website locally. You surely want to share it with the public so that they can view your articles. There are several ways of deploying your websites but the best option is by using GitHub pages.
It is completely free of cost. You don’t need to buy any hosting services. Github hosts your website for free.
It is secure and reliable as the website is hosted in a secure GitHub server.
It becomes easy to organize and keep track of your source code.
If you don’t already have a GitHub account, go to GitHub and create one.
Log in to GitHub and create a repository with the name username.github.io (replace username with your GitHub username; for example, mine is ayushkumarshah.github.io), then copy the clone address as shown in the gif below.
Now, go to your project directory, i.e. web_development, and perform the following commands to add the remote repository that you just created to your project. Use the URL that you copied from your repository instead of the URL used in the command below.
(.venv) $ git init
(.venv) $ git remote add origin 'https://github.com/username/username.github.io.git'
Also, add your GitHub email address and username to git. You can find your username by logging into github and finding the name as shown below.
(.venv) $ git config --global user.email "your-github-email"
(.venv) $ git config --global user.name "your-github-name"
We will be using 2 branches in our repository: source and master.
source: stores the source code of our project (i.e. all folders and files except the output folder).
master: stores the contents of the output folder, i.e. all the HTML files generated after building the site. The master branch will be used to host the website on GitHub Pages.
So, let’s switch to the source branch.
# Create and switch to a new branch source
(.venv) $ git checkout -b source
Create a .gitignore file to mark the files which should not be added to the repository.
(.venv) $ touch .gitignore
Copy all the lines from this link: .gitignore and paste it in the newly created .gitignore
file.
You may also create a Readme.md file for your repository. Create it in the main directory web_development:
(.venv) $ touch Readme.md
You can write the contents of the Readme.md file similar to mine. You can copy it from this link: Readme.md and modify it accordingly.
Next, prepare for publishing by opening publishconf.py and modifying/adding the following settings.
SITEURL = 'https://username.github.io'
DOMAIN = SITEURL
FEED_DOMAIN = SITEURL
HTTPS = True
Similarly, open fabfile.py and add the following settings if they are not present already.
# Local path configuration (can be absolute or relative to fabfile)
env.deploy_path = 'output'
DEPLOY_PATH = env.deploy_path
env.msg = 'Update blog' # Commit message
# Github Pages configuration
env.github_pages_branch = "master"
# Port for `serve`
SERVER = '127.0.0.1'
PORT = 8000
Also, make sure the deploy() function is present in fabfile.py.
def deploy():
"""Push to GitHub pages"""
env.msg = "Build site"
clean()
preview()
local("ghp-import -m '{msg}' -b {github_pages_branch} {deploy_path}".format(**env))
local("git push origin {github_pages_branch}".format(**env))
So, your source code is ready. Let’s add it to the repository using the following commands:
(.venv) $ git add -A
(.venv) $ git commit -m "Add source code for the first post"
(.venv) $ git push origin source
(.venv) $ fab deploy
Note: Always work in the source branch during development. The deploy() function pushes the contents of the output folder into the master branch, so you don’t need to worry about it. Every time you add an article, just follow the steps above: first push the source code to the source branch, then run the deploy function.
Congratulations! Your site has been hosted on GitHub Pages publicly. To check your website, open your browser on any device and visit https://your-username.github.io.
That’s it. You have now learned to create and host your static website in GitHub pages.
You might want to host your site to a custom domain of your choice rather than GitHub pages. This can be done completely free of cost if you have a custom domain registered already.
If you don’t have a custom domain, you can buy them at several websites like Namesilo, GoDaddy, etc.
You can make your domain secure and manageable using the Cloudflare service.
First, create a file named CNAME inside the content/extra directory.
(.venv) $ touch content/extra/CNAME
Then, add (copy and paste) the name of your site i.e. www.your-site-name.com
in the file CNAME
.
Change the value of SITEURL
in the publishconf.py
file.
SITEURL = 'https://your-site-name.com'
Now, you need to redirect your site to point to your content hosted in GitHub-pages. For that, you need to use your domain management site which you used to buy the domain or some 3rd party management site like Cloudflare.
Go to the DNS section and add A records one by one to redirect your site to the following 4 IP addresses (GitHub Pages). You can see the image below for reference; I used Cloudflare for DNS management.
If you want to redirect the GitHub-pages site to your custom domain, then go to the repository settings and add your site name in the Custom domain field of the Github Pages section as shown below.
Congratulations!! Your blog has been redirected to your own custom domain. You can browse your site and check if it is working.
This is an optional step. Perform these steps only if you want to modify or tweak the theme (Flex in this case) to give your website a slightly different look. You may modify colors and styles, or even change the design (if you have some knowledge of web development, i.e. HTML and CSS).
Since you cloned the theme’s repository directly, modifying it in place is not a good idea, as you would then have issues updating the theme to a newer version.
Hence, you will create your own version of the theme repository instead i.e. forking the repository. I will demonstrate using the Flex theme but you may follow the same steps for other themes as well. Follow these steps (also shown in the gif below):
(.venv) $ rm -rf themes/Flex
Now, open and fork the Flex repository or the repository of the theme you chose.
Then, copy the https
(not ssh) link of the forked repository.
Now, clone the forked repo in your project.
In the command below, paste the link you copied from your forked repo instead of https://github.com/ayushkumarshah/Flex.git, and use themes/name_of_theme as the 2nd argument.
(.venv) $ git clone 'https://github.com/ayushkumarshah/Flex.git' 'themes/Flex'
Now, you may modify the theme by tweaking the HTML and CSS files inside the themes/Flex/ directory and then committing the changes to the forked repository separately.
In the next part, learn to automate the process of pushing to source and deploying to the master branch by using Continuous Integration tools like Travis-CI in the part 4 of the article.
If you have any confusion about any article, feel free to post your queries in the comments. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I have followed the same steps mentioned in this series to create the blog website that you are seeing right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
]]>Creating and deploying static websites using Markdown and the Python library Pelican
Now that you have set up your website, the next step is to start writing some content – articles, blogs, about page, contact page, etc. We will use Markdown for writing any content you create. If you have not heard about Markdown, don’t worry as I will guide you with examples.
First, let us create the required directories for articles and pages.
(.venv) $ mkdir content/articles
(.venv) $ mkdir content/pages
Now, let’s create a file for your first article inside the articles directory. Note that the touch
command is being used only to create a file. You can create a file without using any command too. It’s up to you.
(.venv) $ touch content/articles/first_article.md
Also, create files for about, contact, and 404 error page.
(.venv) $ touch content/pages/about.md content/pages/contact.md content/pages/404.md
At this point, your project structure should look like:
web_development
├── content
│   ├── articles
│   │   └── first_article.md
│   └── pages
│       ├── 404.md
│       ├── about.md
│       └── contact.md
├── fabfile.py
├── output
│   └── ... (many html files)
├── themes
│   └── Flex/
├── pelican-plugins
│   └── ... (various plugin directories)
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
Before writing the actual content, we need to define the metadata for the article. Metadata carries important information about
your article. Open the file first_article.md
and add the following metadata lines:
Title: My First Article
Date: 2020-03-17 00:00
Modified: 2020-03-17 00:00
Category: Blog
Slug: first-article
Summary: In this article, I have written my first article using Markdown.
Tags: pelican, markdown
Authors: Ayush Kumar Shah
Status: published
These keywords are pretty much self-explanatory. I will just explain the new ones.
Slug defines the name of the HTML file to be generated.
Status: Choose one option among draft, published, or hidden.
draft: In this mode, the article is not shown on the main page but can be viewed by visiting localhost:8000/drafts/first-article after serving the site (i.e. after running fab reserve). It is useful for showing an article to your friends while you are still writing it, before publishing.
published: In this mode, the article is shown on the main page after serving the site, e.g. at localhost:8000/2020/03/first-article.
hidden: In this mode, the article is just not shown on the website.
Useful tip: Use VSCode as a text editor to manage your project and write content, since it can preview .md files (content files written in Markdown) directly using the Preview functionality. Hence, it becomes easy to see how your content will look in real-time.
Now add the following lines in the file first_article.md just below the metadata defined above.
This is an example from [https://markdown-it.github.io/](https://markdown-it.github.io/)
---
# h1 Heading
## h2 Heading
### h3 Heading
#### h4 Heading
##### h5 Heading
###### h6 Heading
## Horizontal Rules
___
---
***
## Emphasis
**This is bold text**
__This is bold text__
*This is italic text*
_This is italic text_
~~Strikethrough~~
## Blockquotes
> Blockquotes can also be nested...
>> ...by using additional greater-than signs right next to each other...
> > > ...or with spaces between arrows.
## Lists
Unordered
+ Create a list by starting a line with `+`, `-`, or `*`
+ Sub-lists are made by indenting 2 spaces:
- Marker character change forces new list start:
* Ac tristique libero volutpat at
+ Facilisis in pretium nisl aliquet
- Nulla volutpat aliquam velit
+ Very easy!
Ordered
1. Lorem ipsum dolor sit amet
2. Consectetur adipiscing elit
3. Integer molestie lorem at massa
## Code
Inline `code`
Indented code
// Some comments
line 1 of code
line 2 of code
line 3 of code
Block code "fences"
```
Sample text here...
```
Syntax highlighting
``` python
numbers = [9, 8, 4, 1, 5]
for i, number in enumerate(numbers):
    print(i, number)

def hello(message):
    print(message)

message = "Hello World"
hello(message)
```
## Tables
| Option | Description |
| ------ | ----------- |
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |
Right aligned columns
| Option | Description |
| ------:| -----------:|
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |
## Links
[link text](http://dev.nodeca.com)
[link with title](http://nodeca.github.io/pica/demo/ "title text!")
## Images
![Minion](https://octodex.github.com/assets/img/sample/minion.png)
![Stormtroopocat](https://octodex.github.com/assets/img/sample/stormtroopocat.jpg "The Stormtroopocat")
You can view the complete Markdown cheatsheet for reference.
Now, let’s view how your article looks on the website.
Close the previous fab reserve process if it is still running by pressing Ctrl+C (or Cmd+C). Then,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
Congratulations, your first article has been published on your website. It was as simple as that. Compare the article output in the website as shown in the image above and the markdown code to understand how the code works.
Now, let’s create our pages. Pages are more permanent and don’t require detailed metadata like the articles do; an about me page, for example. The links to the pages are added in the navigation bar.
Open about.md and add the following metadata lines as you did before. As you can see, the metadata is not as detailed as before.
Title: About
Date: 2020-03-18 08:00
Modified: 2020-03-18 08:00
Write the content for your about page using Markdown, designing the page however you want. I have provided a simple example for my about page below.
Hello! I’m Ayush Kumar Shah. To talk about myself, I love football (Cristiano Ronaldo is my idol), traveling, and photography. I have a great interest in Artificial Intelligence and am pursuing my career in the same.
I am a Machine Learning Engineer at [Fusemachines](https://www.fusemachines.com) working with global client teams to build state-of-the-art products. I have worked in the domains of Recommendation System, Nepali Handwritten character recognition, and waste classification during my time at Fusemachines.
My inquisitive nature, craving for knowledge, and longing for novelty and innovation strengthen my passion to work and learn daily to increase my knowledge horizon.
I am mostly into tech and so, my blog will be a reflection of whatever new thing I learn about tech.
Thank you for visiting my blog.
You can configure your contact.md file similarly. Have a look at a simple example below and create a similar one.
Title: Contact
Date: 2020-03-18 03:27
Modified: 2020-03-18 03:27
Slug: contact
If you have any questions or want to discuss something, please feel free to contact me at
[ayush.kumar.shah@gmail.com](mailto:ayush.kumar.shah@gmail.com)
[Twitter](https://twitter.com/ayushkumarshah7)
[Linkedin](https://np.linkedin.com/in/ayush7).
Likewise, if you want to inform about any type of error in my blogs, you can open an issue [here](https://github.com/ayushkumarshah/ayushkumarshah.github.io/issues/new).
Finally, let’s define a page for error as well. Open 404.md
and add the following lines
Title: Not Found
Status: hidden
Save_as: 404.html
Sorry, that page doesn't seem to exist. Please double-check the address or
head to the [home page][1].
[1]: {index}
Finally, your site is ready. You may now add more articles by creating more .md files in the content/articles/ directory and following similar steps.
Although your site has been built, it is not publicly available. Learn how to host your site in GitHub pages or a custom domain in part 3 of the article.
If you have any confusion in any article, feel free to comment on your queries. I will be more than happy to help. I am also open to suggestions and feedbacks.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I followed the same steps mentioned in this series to create the blog website you are reading right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
Creating and deploying static websites using Markdown and the Python library Pelican
Pelican is a static site generator, written in Python.
Project Structure: Create any folder for your project. For example web_development
$ mkdir web_development
$ cd web_development
First, install virtualenv via pip and then create a virtual environment for your project.
$ pip install virtualenv
$ virtualenv .venv
Activate the virtual environment
$ source .venv/bin/activate
Now, to install pelican and all packages and dependencies that we will be using later, we need to create a requirements.txt file
(.venv) $ touch requirements.txt
and paste the lines from this link: requirements.txt into the file.
Then just run the following command inside the virtual environment to install all these packages
(.venv) $ pip install -r requirements.txt
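In case the linked file is unavailable, a minimal requirements.txt might look like the sketch below. This is an assumption: the author’s file likely pins additional packages and exact versions. Fabric3 is the Python 3-compatible fork of Fabric 1.x, whose fabfile API matches the fab commands used later in this tutorial.

```shell
# A minimal illustrative requirements.txt (assumption: the author's
# linked file pins more packages and exact versions).
cat > requirements.txt << 'EOF'
pelican
Markdown
Fabric3
EOF
```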
Let’s now run a quickstart configuration script for pelican.
(.venv) $ pelican-quickstart
Pelican asks a series of questions to help you get started by building required configuration files.
Welcome to pelican-quickstart v3.7.1.
This script will help you create a new Pelican-based website. Please answer the following questions so this script can generate the files needed by Pelican.
> Where do you want to create your new web site? [.] .
> What will be the title of this web site? Ayush Kumar Shah
> Who will be the author of this web site? Ayush Kumar Shah
> What will be the default language of this web site? [en] en
> Do you want to specify a URL prefix? e.g., http://example.com (Y/n) n
> Do you want to enable article pagination? (Y/n) Y
> How many articles per page do you want? [10] 5
> What is your time zone? [Europe/Paris] Asia/Kathmandu
> Do you want to generate a Fabfile/Makefile to automate generation and publishing? (Y/n) Y
> Do you want an auto-reload & simpleHTTP script to assist with theme and site development? (Y/n) n
> Do you want to upload your website using FTP? (y/N) N
> Do you want to upload your website using SSH? (y/N) N
> Do you want to upload your website using Dropbox? (y/N) N
> Do you want to upload your website using S3? (y/N) N
> Do you want to upload your website using Rackspace Cloud Files? (y/N) N
> Do you want to upload your website using GitHub Pages? (y/N) y
> Is this your personal page (username.github.io)? (y/N) y
Done. Your new project is available at /Users/ayushkumarshah/Desktop/Blog_writing/web
While answering the questions, please keep these things in mind:
Title and Author: Replace Ayush Kumar Shah with the title and the author’s name that you want.
Default language: You can set any language using its standard ISO 639-1 two-letter code.
Article Pagination: If you do not want to limit the number of articles on a page, enter n.
Time zone: Choose your time zone from Wikipedia’s list of tz database time zones.
You may delete the Makefile as we will not be using it.
(.venv) $ rm Makefile
After successfully running the command, your directory should look like this:
web_development
├── content/
├── fabfile.py
├── output/
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
Here is the purpose of each of these files:
- content/: holds your articles and pages, written in Markdown.
- fabfile.py: defines the fab commands used to build, serve, and publish the site.
- output/: where Pelican writes the generated HTML files.
- pelicanconf.py: the main configuration file, used during development.
- publishconf.py: additional settings applied when publishing the site.
- requirements.txt: the list of Python packages the project depends on.
So far, we have installed and configured Pelican successfully. Let’s generate our first website and preview what it looks like. Make sure the .venv environment is activated.
Open fabfile.py and replace all instances of SocketServer with socketserver (SocketServer is the Python 2 name; the module was renamed to socketserver in Python 3).
# import SocketServer
import socketserver
...
# class AddressReuseTCPServer(SocketServer.TCPServer):
class AddressReuseTCPServer(socketserver.TCPServer):
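If you prefer, the same replacement can be done with a single sed command. The snippet below is a runnable sketch that first creates a tiny stand-in fabfile.py so it is self-contained; in your project, run only the sed line against your real file.

```shell
# Stand-in fabfile.py for demonstration only -- in your project,
# skip this line and run sed on the real fabfile.py.
printf 'import SocketServer\nclass AddressReuseTCPServer(SocketServer.TCPServer):\n    pass\n' > fabfile.py

# Rewrite every SocketServer occurrence in place. The -i.bak form
# works on both GNU and BSD/macOS sed and keeps a fabfile.py.bak backup.
sed -i.bak 's/SocketServer/socketserver/g' fabfile.py
cat fabfile.py
```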
Now, we are ready to generate and view our site.
(.venv) $ fab build
(.venv) $ fab serve
You may also run a single command equivalent to the 2 commands above:
(.venv) $ fab reserve
In case an error occurs, open fabfile.py again and change the import line to
import SocketServer as socketserver
After running the fab command, you will notice HTML files generated inside the output folder. These files are the HTML files of your website.
Your website should already be running on port 8000 of your localhost. To view it, open your browser and visit localhost:8000
Congratulations, you have generated your first website.
Now that we have built our website, let’s make the design more beautiful and responsive. There are numerous Pelican themes to choose from; both live previews of the themes and their repositories are available, so you can check them out and select the one that suits your website. My favorite themes are Flex (live version), Pneumatic (live version), and Bulrush (live version). I am currently using the Bulrush theme, with some custom modifications, for my website.
I will demonstrate using the Flex theme.
First, clone the Flex repository, or the repository of the theme you chose. Make sure you are inside the web_development directory.
(.venv) $ git clone https://github.com/alexandrevicenzi/Flex.git themes/Flex
Here, the 2nd argument is the destination directory of the theme in your project. You can replace themes/Flex with themes/name_of_theme.
Now, specify the path of your theme in the configuration file pelicanconf.py
by adding the following line:
THEME = 'themes/Flex'
Although the Flex theme requires no additional plugins, most themes require various Pelican plugins, so let’s download pelican-plugins into your project. (You may skip this step if you are using the Flex theme.)
(.venv) $ git clone https://github.com/getpelican/pelican-plugins.git
Now, add the path of the plugins in pelicanconf.py
in a similar way as before by adding the following lines:
PLUGIN_PATHS = ['./pelican-plugins']
Also, add a line specifying the list of plugins your theme requires. You can find the required plugin names in the documentation of the theme’s GitHub repository. The three plugins most commonly required are listed below; add the following line in the same pelicanconf.py file.
PLUGINS = ['sitemap', 'post_stats', 'feed_summary']
Some themes may require additional plugins, for which you have to search the documentation. Another way to find the required plugin names is to skip this step for now: when you later try to serve your website, the error message will state the names of the missing plugins, which you can then add to the pelicanconf.py file.
At this stage, your directory structure should look like this:
web_development
├── content/
├── fabfile.py
├── output/
│   └── ... (many HTML files)
├── themes/
│   └── Flex/
├── pelican-plugins/
│   └── ... (various plugin directories)
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
If it doesn’t, then you probably did something wrong.
So, by now we have successfully installed the Flex theme on our website. We can check the new theme by generating and serving the website again. Close the previous process (fab reserve) if it is still running by pressing Ctrl+C. Now,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
You should see your website in a new theme.
However, it is not customized to include your profile. So, let’s customize the site by adding some attributes of the theme.
First, let’s create some folders inside the content directory.
(.venv) $ mkdir content/images
(.venv) $ mkdir content/extra
Let’s replace the default profile photo and favicon with your own. Copy the profile image profile.png and the collection of favicon files, like favicon.ico, favicon-16x16.png, etc., into the images directory you just created.
Note: A favicon is the small pixel icon that appears in the browser tab beside the site name; it serves as branding for your website. You can create one using various online tools like realfavicongenerator or the favicon generator from websiteplanet (thanks to Estefany for mentioning this site, which allows image sizes up to 5 MB).
Different themes have different attributes or configurations.
Check the documentation or the README.md file of the respective theme. For the Flex theme, a sample pelicanconf.py can be found inside the docs folder; check it for reference and compare it with the live version of the theme. You can find more configuration examples in the Flex Wiki.
I will demonstrate using a sample configuration for this theme. For that, add the following lines to your pelicanconf.py file.
### Flex configurations
PLUGINS = ['sitemap', 'post_stats', 'feed_summary']
SITEURL = 'http://localhost:8000'
SITETITLE = 'Ayush Kumar Shah' # Replace with your name
SITESUBTITLE = 'Ideas and Thoughts'
SITELOGO = '/images/profile.png'  # the files you copied into content/images
FAVICON = '/images/favicon.ico'
# Sitemap Settings
SITEMAP = {
'format': 'xml',
'priorities': {
'articles': 0.6,
'indexes': 0.6,
'pages': 0.5,
},
'changefreqs': {
'articles': 'monthly',
'indexes': 'daily',
'pages': 'monthly',
}
}
# Add a link to your social media accounts
SOCIAL = (
('github', 'https://github.com/ayushkumarshah'),
('envelope', 'mailto:ayushkumarshah@gmail.com'),
('linkedin','https://np.linkedin.com/in/ayush7'),
('twitter','https://twitter.com/ayushkumarshah7'),
('facebook','https://www.facebook.com/ayushkumarshah'),
('reddit','https://www.reddit.com/user/ayushkumarshah')
)
STATIC_PATHS = ['images', 'extra']
# Main Menu Items
MAIN_MENU = True
MENUITEMS = (('Archives', '/archives'),('Categories', '/categories'),('Tags', '/tags'))
# Code highlighting theme
PYGMENTS_STYLE = 'friendly'
ARTICLE_URL = '{date:%Y}/{date:%m}/{slug}/'
ARTICLE_SAVE_AS = ARTICLE_URL + 'index.html'
PAGE_URL = '{slug}/'
PAGE_SAVE_AS = PAGE_URL + 'index.html'
ARCHIVES_SAVE_AS = 'archives.html'
YEAR_ARCHIVE_SAVE_AS = '{date:%Y}/index.html'
MONTH_ARCHIVE_SAVE_AS = '{date:%Y}/{date:%m}/index.html'
# Feed generation is usually not desired when developing
FEED_DOMAIN = SITEURL
FEED_ALL_ATOM = 'feeds/all.atom.xml'
CATEGORY_FEED_ATOM = 'feeds/%s.atom.xml'
TRANSLATION_FEED_ATOM = None
AUTHOR_FEED_ATOM = None
AUTHOR_FEED_RSS = None
# HOME_HIDE_TAGS = True
FEED_USE_SUMMARY = True
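To see what the URL settings above actually produce, the snippet below simulates the substitution of ARTICLE_URL with Python’s str.format (Pelican performs the equivalent substitution internally; the date and slug are illustrative):

```shell
# Simulate Pelican's ARTICLE_URL expansion for an article dated
# 2020-03-18 with slug 'my-first-post'.
python3 -c "
from datetime import datetime
url = '{date:%Y}/{date:%m}/{slug}/'.format(date=datetime(2020, 3, 18),
                                           slug='my-first-post')
print(url)
"
# → 2020/03/my-first-post/
```

Because ARTICLE_SAVE_AS appends index.html to this path, each article is written to its own directory, which gives you clean URLs without .html extensions.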
You may remove the LINKS variable from the configuration file pelicanconf.py, as you don’t need those links. We can check our new configuration by generating and serving the website again. Close the previous process (fab reserve) if it is still running by pressing Ctrl+C. Now,
(.venv) $ fab reserve
Open your browser and visit localhost:8000
You should see your website with your new configuration. Feel free to modify it as per your liking.
Congratulations, you have completed the basic setup for Pelican.
However, your site has no content. Start writing content in the part 2 of the article.
If you have any confusion about any article, feel free to comment with your queries. I will be more than happy to help. I am also open to suggestions and feedback.
Also, you can use the GitHub repository for my blog, ayushkumarshah.github.io, as a reference at any point in the article. I followed the same steps mentioned in this series to create the blog website you are reading right now.
If you want to visit any specific parts of the article, you can do so from the links below.
Or, go to the home-page of the article.
In this article, I will explain the complete steps to build a static website like the one I have built (shahayush) using Pelican, a static site generator written in Python; deploy it on GitHub Pages with continuous integration (CI) using Travis-CI; and link it to your custom domain name, all without requiring knowledge of HTML, CSS, databases, or deployment pipelines. Furthermore, I will explain how to integrate a comment system called Disqus into your site and how to link Google Analytics to it, so that you can analyze in depth the visitors to your website.
The most striking advantage of this approach is that the complete process is free, except for the fee to register your domain name. You can avoid even this fee by hosting the site only on GitHub Pages, where you can host a website like your_username.github.io. The only prerequisites are basic knowledge of Python and of Markdown for writing the articles. You might have used Markdown in a Jupyter notebook or in the Readme.md file of a GitHub repository. Don’t worry if you are completely unfamiliar with them; you can still pick them up through this article, as they are extremely simple.
By part 2 of the article series, you will have your website ready, and it will look something like this:
My current website is also built using the same methods discussed in this article series.
Demo website: medius by Onur Aslan
Demo website: pneumatic by Kevin Yap
Details on how to use these themes will be discussed in Part 1 of this article series. I just wanted to give an overview of how the website will look in the end.
You may wonder why to use Pelican when the same thing can be achieved using WordPress, which has a wider community. I have listed a few advantages of Pelican over WordPress from Vincent Cheng’s article Migrating from Wordpress to Pelican.
Now that you have an overall insight into what this article series is about, along with the benefits of using Pelican, get started by building your own website. For ease, I have divided the article into 6 parts:
Click on the respective links to get started.